Stat200 lab2

Statistics 200: Lab 2 (Friday 12 September)

Today's tasks:
Reading data from other sources; data from WWW sites (such as StatLib at Carnegie Mellon University, and the U.S. Census Bureau). Data frames. Evaluation frames. Libraries. Search lists.

Getting Census data (using Netscape or another internet browser)

Click on the link to the U.S. Census Bureau. Navigate your way through to the data for New Haven County:

select 1990 Census Lookup;
select STF3A;
select Connecticut, and check the State-County level;
select New Haven County, and check State-County-County Subdivision;
check 'select/retrieve all ...';
(don't stare at the page for too long);
check table P13;
look at it in HTML;
back; then check 'tab-delimited format' ...
You should now be looking at a mess. Choose 'save-as' from the file menu, then save the mess as a text file called c:\user\NHage.txt

Now you are ready to have some fun with Splus.

Getting data into Splus

Try

> age <- read.table("NHage.txt")   # shouldn't work
> count.fields("NHage.txt")     # here's why
> age <- read.table("NHage.txt",sep="\t")  # almost, but header messed up
> age <- read.table("NHage.txt",sep="\t",header=T)

Look at age after each attempt. What happened? (Hint: ?read.table) See HELP for other ways of getting data into Splus.
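If the census file is not to hand, the same behaviour can be seen on a tiny made-up file. The sketch below is only an illustration, not the lab data: the file contents, the column names and the object names (demo.file, fields.space, fields.tab, age.demo) are all invented. It is written in R-compatible S, so a call such as tempfile() may need adjusting in Splus.

```r
# Made-up, tab-delimited stand-in for NHage.txt (not real census figures).
demo.file <- tempfile()
cat("Town\tUnder1\tAge1to2\n",
    "Ansonia town\t10\t20\n",
    "Beacon Falls town\t5\t7\n",
    file = demo.file, sep = "")

# Splitting on white space: the blanks inside the town names
# give each line a different number of fields.
fields.space <- count.fields(demo.file)

# Splitting on tabs: every line has the same number of fields.
fields.tab <- count.fields(demo.file, sep = "\t")

# Tab separator plus header: the first line supplies the column names.
age.demo <- read.table(demo.file, sep = "\t", header = TRUE)

print(fields.space)
print(fields.tab)
print(age.demo)
```

Unequal field counts are the reason the first read.table() attempt fails, which is why count.fields() is worth running before anything else.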

You now have a data frame. It acts like both a matrix and a list.

Save the first five rows of the data frame as age5. Use age5 as the test case for the first problem. It is a good idea to experiment with small data sets whenever you are trying to get a function to work.

Try dim(age), dimnames(age). Try running attributes() on each column of the data frame. (You can select columns by name or number, as for matrices, or by using the $ notation for lists.) Try

> lapply(age,attributes)       
 # applies the attributes() function
 # to each component of the data frame

What do all those attributes mean?
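To see the "matrix and list" behaviour side by side, here is a small sketch on an invented data frame. The name toy, its columns and its numbers are assumptions made up for the illustration; the code is written in R-compatible S.

```r
# Hypothetical toy data frame, standing in for something like age5.
toy <- data.frame(under1 = c(10, 5, 8), age1to2 = c(20, 7, 11),
                  row.names = c("Ansonia town", "Beacon Falls town",
                                "Bethany town"))

# Matrix-style access: rows and columns by number or by name.
toy.dim   <- dim(toy)                       # number of rows, number of columns
one.cell  <- toy["Bethany town", "age1to2"]
first.two <- toy[1:2, ]                     # first two rows, all columns

# List-style access: each column is a component of the list.
col.a <- toy$under1
col.b <- toy[["age1to2"]]
col.classes <- lapply(toy, class)           # one result per column

print(toy.dim)
print(one.cell)
print(col.classes)
```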

Problem 1

Need HELP with the use of fix() for writing functions?

Build yourself a function that takes a data frame as argument, and carries out the following operations. Assume that you will be feeding in a data frame whose rows have labels like "Ansonia town", ...

Kill the first three columns.
Create vectors under18 (the sum of the first twelve of the remaining columns) and over18 (the sum of the columns after those), giving the populations for each town under and over the age of 18 years. (Hint: ?apply or HELP)
Strip the characters " town" from the end of each row name. (Hint: ?nchar, ?substring)
Return a new data frame, with the stripped town names as row labels, and columns called under18, over18.

Run your function, saving the output as age.split.
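The two hints can be tried out on toy objects before touching the census data. What follows is only a sketch of the building blocks, with made-up names (towns, counts) and made-up numbers, written in R-compatible S. Putting the pieces together into the function is still up to you.

```r
# Hypothetical row labels of the kind described above.
towns  <- c("Ansonia town", "Beacon Falls town")
suffix <- " town"

# Keep everything except the last nchar(suffix) characters of each label.
stripped <- substring(towns, 1, nchar(towns) - nchar(suffix))

# A made-up 2 x 3 matrix of counts; apply(..., 1, sum) sums across each row.
counts <- matrix(1:6, nrow = 2)
row.totals <- apply(counts, 1, sum)

print(stripped)
print(row.totals)
```

Note that this stripping trusts every label to end in " town"; a label without that ending would lose its last five characters anyway.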

Problem 2

Build yourself a function, which takes a data frame like age.split as argument, and which draws a picture with horizontal bars showing the under18 + over18 population for each town, with the town names used as labels for each bar. (Hint: ?barplot, ?as.matrix, ?t)
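As a nudge on how the three hinted functions fit together, here is a sketch on an invented two-town data frame. The object names (toy.split, heights, bar.mids) and the population figures are assumptions, not census values. The code is written for R, where the label argument is spelled names.arg; the Splus barplot() may spell it differently, so check ?barplot.

```r
# Hypothetical data frame shaped like age.split (numbers are invented).
toy.split <- data.frame(under18 = c(4000, 1500), over18 = c(14000, 3500),
                        row.names = c("Ansonia", "Beacon Falls"))

# barplot() draws one bar per column of a matrix, so turn the data frame
# into a matrix and transpose it: towns become columns.
heights <- t(as.matrix(toy.split))

# Stacked horizontal bars: each bar's length is under18 + over18.
bar.mids <- barplot(heights, horiz = TRUE,
                    names.arg = dimnames(heights)[[2]])
```

Wrapping these lines in a function that takes the data frame as its argument is the actual exercise.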

Problem 3

From the Census lookup, get the data for the white population of New Haven town in 1990, broken down by age and sex (an old joke). Write a function to construct a population pyramid (compare with an example that I prepared for a paper) for the town. Make sure you save this function, because next week you will be using it as the starting point for some fancier graphics.

Important HELP concerning saving your work.

Where Splus finds things

Try

> x <- 3 
> foo <- function(y){  x <- 15;  y + x  } 
> foo(4) 
[1] 19 
> foo(x) 
[1] 18
> x 
[1] 3

Explain.

I assume you still have the data frame age.split lying around. Try

> search()
# This function returns the list of directories, 
# in order, where Splus looks for objects.  
# It is called the "search path".
> objects()
> attach(age.split,1)
# puts the data frame age.split in the first position
# on the search list
> total <- under18 + over18
> search()
> objects()
> detach(1,save="junk")
> objects()
> junk

Explain what happened.

Note: If you create an object called foo in your working directory, and if there is another object called foo further down in the search list, Splus will find your foo first. Your foo object masks the other foo object. Be careful that you don't accidentally mask a system object, such as the t() function or the c() function. You should take heed of any warnings about masking. See ?masked.
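Masking is easy to demonstrate, and just as easy to undo. In the sketch below (written in R-compatible S, with made-up object names) a home-made t hides the system transpose until it is removed with rm().

```r
m <- matrix(1:4, 2)

# A home-made t() in the working directory masks the system transpose.
t <- function(x) "my t, not the transpose"
masked.result <- t(m)        # calls the home-made version

# Removing the home-made object uncovers the system function again.
rm(t)
restored.result <- t(m)      # the real transpose once more

print(masked.result)
print(restored.result)
```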

Evaluation frames

Splus can attach the same name to different values in different contexts (as in the case of the foo() function). See Becker et al., section 5.4, or Venables & Ripley, section 4.5. I had originally thought this topic a tad esoteric--until I got tripped up by ignoring it. If you find yourself wanting to define a function within another function (yes, you can do that), or if Splus does not seem to be finding objects that you have created, see HELP on frames.
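The simplest consequence of evaluation frames can be checked directly: an object created inside a function lives only while the function is running. The sketch below uses invented names (scale.up, factor.local) and is written in R-compatible S. It deliberately sticks to this simple case and says nothing about functions defined inside other functions.

```r
scale.up <- function(v) {
    factor.local <- 10      # exists only in this call's evaluation frame
    v * factor.local
}

out <- scale.up(1:3)

# After the call has finished, the local object is gone.
left.behind <- exists("factor.local")

print(out)
print(left.behind)
```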

Libraries

Libraries are directories of Splus objects that can be attached to the search list (see ?library), sort of like adding another _Data directory. Many authors distribute their new statistical software as Splus libraries. Check out the collection at StatLib.

We have created a special library for Statistics 200 at

h:\\classes\\stat200

The following information is subject to change.

The Splus command

library(lib.loc="h:\\classes\\stat200")

gives a list of all the sections in the stat200 library, and also all the sections in the default system library. You will see that one of the available sections is called "nci". Nothing has been loaded yet.

If you want to find out about the nci section of the library, type

library(help=nci,lib.loc="h:\\classes\\stat200")

If you want to attach the nci section of the library to your search path, type:

library(nci,lib.loc="h:\\classes\\stat200")

Then Splus has access to all the data in the nci library. (Try search() after you attach the library.)