
Statistics 200: Lab 4

Today's tasks:
Manipulation of matrices and arrays. Cross-tabulation of data. Factors.

Problem 1

Look at the object iris (and try ?iris). It is a 3-dimensional array. Notice how Splus displays the array (as stacks of matrices). What do its attributes tell you? Use the apply() function to create a 4 by 3 matrix iris2 whose entries give the means for each species and measurement type. (You will use this array as a test case for Problem 5.)
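As a rough sketch of the apply() step, assuming an R session rather than Splus: in R the three-dimensional version of the data is named iris3 (an assumption to check with ?iris3), while in Splus it is the object iris described above. The margin argument c(2, 3) keeps the measurement and species dimensions and averages over the first one.

```r
# Assumption: R session, where the 50 x 4 x 3 array is called iris3.
# In Splus, replace iris3 with iris.
x <- iris3

dim(x)             # 50 flowers, 4 measurements, 3 species
dimnames(x)[2:3]   # labels for the measurements and the species

# Keep dimensions 2 and 3, and take the mean over dimension 1 (the flowers)
iris2 <- apply(x, c(2, 3), mean)
iris2
```

The result keeps the dimnames of the retained dimensions, so rows are labelled by measurement and columns by species.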

Problem 2

Create a data frame NCI from the data set NCI.data, which gives various population estimates for the state of Connecticut for a five-year period. For documentation see NCI.doc. Notice that the fields of interest are separated by white space. (Whole of CT = code 09000, Hartford County = 09003, New Haven County = 09009...).

Give the columns of the data frame more memorable names, such as year, county, race, and so on. Try to use seq() and paste() to construct names like "0--4", "5--9", for the age groups.
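One possible way to build the age-group names is sketched below. The five-year width and the range 0 to 84 are assumptions made for illustration; the actual age groups are described in NCI.doc.

```r
# Assumption: five-year age groups from 0--4 up to 80--84 (check NCI.doc)
lower <- seq(0, 80, by = 5)          # lower age of each group
age.names <- paste(lower, lower + 4, sep = "--")
age.names[1:3]
```

Because paste() works element by element, the two numeric vectors are matched up position by position.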

The codes 1 through 12 are not very descriptive for the race/ethnicity variable. Create a character vector with entries like "WnHM", "WnHF", and so on as abbreviations. (Hint: With clever use of paste() you can build the vector up from a vector with entries like "WnH", "WH", and a vector c("M","F").) Use factor() to create a factor object, with levels "WnHM", "WnHF", ..., to replace the race column of NCI.
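The hint about paste() might be realized as in the sketch below. The group abbreviations are taken from the column labels of the example table in Problem 3; the assumption that codes 1 through 12 follow that same order must be checked against NCI.doc. The vector race.code is a made-up stand-in for the real column.

```r
groups <- c("WnH", "WH", "B", "Amer", "Asian", "Hisp")
sexes <- c("M", "F")

# Repeat each group twice; paste() recycles sexes to match
race.names <- paste(rep(groups, each = 2), sexes, sep = "")
race.names

# Hypothetical codes standing in for the race column of NCI
race.code <- c(1, 2, 12, 5, 5, 3)

# Assumption: code k corresponds to race.names[k]
race <- factor(race.code, levels = 1:12, labels = race.names)
race
```

Supplying levels = 1:12 explicitly keeps all twelve categories as levels even when some codes are absent from the data.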

Use the information in the list of counties to turn NCI$county into a factor as well, with the county names as labels (or CT for the whole of Connecticut).

Look at the attributes() and codes() of NCI$race. How does Splus represent a factor? Try sort(levels(NCI$race))[codes(NCI$race)]. What do you notice?
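For experimenting outside Splus, here is a small sketch in R on a made-up vector. It uses as.integer() in place of the lab's codes(), which is an assumption about the reader's environment; the two need not behave identically in every case.

```r
f <- factor(c("New Haven", "CT", "Hartford", "CT"))

levels(f)        # the distinct values, sorted
as.integer(f)    # one integer code per element, indexing into levels(f)
attributes(f)    # the levels and the class are stored as attributes

# Indexing the levels by the codes gives back the original strings
levels(f)[as.integer(f)]
```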

Can you explain how factors work?

Problem 3

What good are factors? Try table() to find out for each year how many counties are represented. (Not too exciting.)

Use apply() on a submatrix of NCI to generate a vector pops of population totals, one for each row of NCI. Use tapply() with various factors or lists of factors to create a three-dimensional array called nci.array showing the total population in each of the 12 race/ethnicity categories, cross-classified by county and year. Use the aperm() function, if necessary, to ensure that nci.array prints to the screen as a sequence of matrices (one for each county) with rows labelled by year and columns by race. For example, one matrix should look like:

, , New Haven
     WnHM   WnHF   WHM   WHF    BM    BF AmerM AmerF AsianM AsianF HispM HispF 
90 317902 344393 22754 23320 39310 44544   786   845   5455   5290 25344 25962
91 315996 342301 23413 24057 40003 45282   801   856   5714   5567 26142 26856
92 313944 340333 24012 24621 40184 45473   809   872   5897   5851 26845 27511
93 311333 337291 24737 25275 40728 46116   833   901   6110   6175 27699 28309
94 309159 334720 25128 25728 40966 46327   848   903   6285   6413 28176 28859
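The apply(), tapply(), and aperm() steps of Problem 3 can be tried first on a tiny data frame. Everything in the sketch below (the data frame toy, its columns a0 and a5, and all of the numbers) is made up for illustration and is not NCI data.

```r
# Made-up toy data: 2 years x 2 counties x 2 race/sex groups, 2 age columns
toy <- data.frame(
  year   = factor(rep(c(90, 91), each = 4)),
  county = factor(rep(rep(c("CT", "New Haven"), each = 2), times = 2)),
  race   = factor(rep(c("WnHM", "WnHF"), times = 4)),
  a0     = 1:8,
  a5     = 11:18
)

# Row totals over the age columns
pops <- apply(toy[, c("a0", "a5")], 1, sum)

# Cross-classify the totals by a list of three factors
arr <- tapply(pops, list(toy$county, toy$year, toy$race), sum)
dim(arr)   # county x year x race

# Reorder the dimensions to year x race x county,
# so that printing shows one matrix per county
arr2 <- aperm(arr, c(2, 3, 1))
arr2[, , "New Haven"]
```

The second argument of aperm() lists the old dimensions in their new order, so c(2, 3, 1) puts year first, race second, and county last.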

Problem 4

Print out (and hand in) a table of the total over-20 black populations for each county expressed as percentages (rounded to one decimal place) of the total over-20 populations, for each year. That is, you should create a table like:
            90  91  92  93  94 
        CT 7.1 7.2 7.3 7.4 7.4
 Fairfield 8.2 8.3 8.3 8.5 8.5
  Hartford 8.8 8.9 9.0 9.1 9.2
   Windham 1.0 1.0 1.0 1.0 1.0
Hint: The cross-tabulations generated by tapply() can be assigned to objects and then manipulated as arrays. See also ?round.
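To illustrate the hint: two cross-tabulations with the same shape can be combined element by element. The matrices black and total below are made-up stand-ins for what tapply() would return; their values are not taken from the NCI data.

```r
years <- c("90", "91")
counties <- c("CT", "Windham")

# Made-up county-by-year totals (stand-ins for tapply() output)
black <- matrix(c(71, 1, 72, 1), nrow = 2,
                dimnames = list(counties, years))
total <- matrix(c(1000, 100, 1000, 100), nrow = 2,
                dimnames = list(counties, years))

# Element-by-element arithmetic keeps the dimnames
pct <- round(100 * black / total, 1)
pct
```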

Problem 5

Write a function that takes an array (such as iris2 or nci.array) and creates a data frame with columns:

For example, here is the output from a function called matrix.to.factor that I wrote to solve the two-dimensional version of this problem. Hint: Try filling up an array with the factor names, then using factor().

Make sure you can recreate the original matrix (using tapply and the factors) from the output of your function.
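The round trip can be checked on a small matrix before tackling the general case. The sketch below handles two dimensions only and is not the matrix.to.factor function mentioned above; the names m, long, row.f, and col.f are made up for illustration.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

# row(m) and col(m) hold the row and column index of every entry,
# in the same column-by-column order as as.vector(m)
row.f <- factor(rownames(m)[row(m)], levels = rownames(m))
col.f <- factor(colnames(m)[col(m)], levels = colnames(m))

long <- data.frame(row = row.f, col = col.f, value = as.vector(m))
long

# Rebuild the matrix from the factors and the values
back <- tapply(long$value, list(long$row, long$col), sum)
all(back == m)
```

Setting levels explicitly preserves the original row and column order, which would otherwise be replaced by sorted order.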

Problem 6

(Optional, unless you finish very early. Hard at this stage of the course.)
For New Haven County, draw graphs for the cumulative proportions of white-nonhispanic, black, and hispanic populations. Put age on the horizontal axis and the fraction of each population younger than each age on the vertical axis. Draw the three curves on the same plot, using different line types for each group. Hint: build your function around matplot(). You might find it easier to create several functions to solve the whole problem.
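As a starting point for the matplot() hint, here is a minimal sketch on made-up counts. The matrix counts, its three columns, and the age grid are illustrative assumptions, not values from the NCI data.

```r
# Made-up counts: rows are age groups, columns are three populations
ages <- seq(5, 25, by = 5)    # upper age limit of each group
counts <- cbind(WnH  = c(10, 20, 30, 20, 20),
                B    = c(30, 30, 20, 10, 10),
                Hisp = c(40, 20, 20, 10, 10))

# Cumulative proportion within each column
cum.prop <- apply(counts, 2, function(x) cumsum(x) / sum(x))

# One curve per column, each with its own line type
matplot(ages, cum.prop, type = "l", lty = 1:3, col = 1,
        xlab = "age", ylab = "fraction younger than age")
legend("bottomright", legend = colnames(cum.prop), lty = 1:3)
```

Each column of cum.prop ends at 1, which is a quick check that the proportions were computed within groups rather than across them.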