Stat200 lab2

Statistics 200: Lab 2 (Friday, 23 January 1998)

Today's tasks:
Reading data from other sources; data from WWW sites (such as StatLib at Carnegie-Mellon University, and the U.S. Census Bureau). Data frames. Evaluation frames. Libraries. Search lists.

We frequently need to read data from external sources into Splus: we might find data on the internet, or have data on a disk that we need to analyze using Splus. Today we learn how to read data from such sources. We also learn about another type of Splus object called a data frame, and more about where Splus stores the variables that you define.

Note that throughout the notes for this lab you will see ?commandname. When you see one of these, we suggest that you read the help file for that command in Splus. We want you to fully understand these commands, and the help files are a great place to start.

Getting Census data (using Netscape or other internet browser)

The U.S. Census Bureau has a lot of data available from their web page. Much of the data recorded in the last census (1990) is available; we shall retrieve some of it as an example of taking data from the web.

Click on the link to the U.S. Census Bureau. This is how you navigate your way through to the data for New Haven County:

select 1990 Census Lookup;
select STF3A;
select Connecticut, and check the State-County level;
select New Haven County, and check State-County-County Subdivision;
check 'select/retrieve all ...';
(don't stare at the page for too long);
check table P13;
look at it in HTML;
back; then check 'tab-delimited format' ...
You should now be looking at a mess. Choose 'save-as' from the file menu, then save the mess as a text file called c:\user\NHage.txt

Now you are ready to have some fun getting the data into Splus.

Getting data into Splus

Try

> age <- read.table("NHage.txt")

That didn't work. The next command might give you a reason why.

> count.fields("NHage.txt")

When you read the data from the Census Bureau they mentioned something about tab-delimited format. That was an important piece of information.

> age<-read.table("NHage.txt",sep="\t")

Almost! Yes, the "\t" denotes a tab. We still have a problem with the first line of the data set, but it is easily rectified.

> age<-read.table("NHage.txt",sep="\t",header=T)

It finally worked on the last attempt; it wasn't that hard, really. What happened? (Hint: ?read.table) See HELP for other ways of getting data into Splus.
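
The three attempts above can be reproduced on a small scale. The following is a minimal sketch, assuming a made-up three-line tab-delimited file (the column names and counts are hypothetical, not Census data). It uses only cat, tempfile, count.fields and read.table; details of the output may differ between versions of Splus.

```r
# Hypothetical miniature of a tab-delimited extract (made-up numbers).
lines <- c("Town\tUnder1\tAge1to2",
           "Ansonia town\t10\t20",
           "New Haven town\t30\t40")
tmp <- tempfile()
cat(paste(lines, "\n", sep = ""), file = tmp, sep = "")

# With the default separator (any white space) the rows split unevenly,
# because labels such as "New Haven town" contain spaces.
fields.white <- count.fields(tmp)

# With sep="\t" every row has the same number of fields.
fields.tab <- count.fields(tmp, sep = "\t")

# header=TRUE treats the first line as column names rather than as data.
small <- read.table(tmp, sep = "\t", header = TRUE)
```

Comparing fields.white with fields.tab shows why the first read.table call fails and the later ones do not: only the tab-separated count is the same on every line.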

Data frames

The read.table command stored the data in a data frame. A data frame acts like both a matrix and a list. Data frames look like matrices and can be indexed in the same way. Data frames have both column and row names, so they're like a list of name-indexed vectors of the same length. Whereas all the columns of a matrix must be of the same type, data frames allow the columns to have different types. To understand how data frames work, try the following:

> attributes(age)
> age[5,]
> age[,4]
> age["New Haven town",]
> age[,"P0130001"]
> age$P0130001

As you can see the data frame behaves like a matrix and like a list. This can be very useful.
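
The same point can be checked on a toy data frame. This is a minimal sketch with hypothetical counts (the column names merely imitate the Census labels); it shows one cell reached by position, by name, and through the $ notation.

```r
# Toy data frame with made-up counts; row labels imitate the census file.
towns <- data.frame(P0130001 = c(12, 30), P0130002 = c(25, 61),
                    row.names = c("Ansonia town", "New Haven town"))

by.position <- towns[2, 1]                          # matrix-style, by number
by.name     <- towns["New Haven town", "P0130001"]  # matrix-style, by name
by.dollar   <- towns$P0130001[2]                    # list-style, then index
```

All three expressions pick out the same value, which is what "behaves like a matrix and like a list" means in practice.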

Save the first five rows of the data frame as age5. Use age5 as the test case for the first problem. It is a good idea to experiment with small data sets whenever you are trying to get a function to work.

Try dim(age), dimnames(age). Try running attributes() on each column of the data frame. (You can select columns by name or number, as for matrices, or by using the $ notation for lists.) Try

> lapply(age,attributes)

Yes, the command found the attributes of each column of the data frame. What do all those attributes mean?
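
A small case may make the output of lapply easier to read. The sketch below assumes a hypothetical two-column data frame, one numeric column and one factor; the exact attributes reported can vary between versions of Splus.

```r
# Hypothetical data frame with columns of two different types.
toy <- data.frame(count = c(12, 30),
                  kind = factor(c("town", "city")))

# One list component per column, each holding that column's attributes.
col.attrs <- lapply(toy, attributes)

# A plain numeric column carries no attributes;
# a factor carries its levels and its class.
kind.levels <- col.attrs$kind$levels
```

Here the two columns have different types, something a matrix would not allow.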

More about functions

Last week we used the up arrow key to recall the functions that we entered, so that we could edit them. An alternative approach is to use the fix() command, and we have a class help file set up for this. It's up to you which method you use: some people love fix, and some people hate it.

Need HELP with the use of fix() for writing functions?

Problem 1

Build yourself a function that takes a data frame as its argument and carries out the following operations. Assume that you will be feeding in a data frame whose rows have labels like "Ansonia town", ... When building the function, we recommend that you start by writing a function that does step 1, then edit it to do steps 1 and 2, etc.

1. Remove the first three columns of the data frame.
2. Create vectors under18 (sum the first twelve columns) and over18 (the remaining columns), giving the populations for each town under and over the age of 18 years. (Hint: ?apply or HELP)
3. Strip the characters " town" from the end of each row name. (Hint: ?nchar, ?substring)
4. Return a new data frame, with the stripped town names as row labels, and columns called under18, over18.

Run your function on the data frame, age, saving the output as age.split.
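
Before writing the whole function, the hints about summing columns and stripping row names can be tried on toy inputs. This is only a sketch of the two hinted techniques, assuming a made-up matrix of counts; it is not a solution to the problem, and the object names are hypothetical.

```r
# Toy counts (made-up values) with town-style row labels.
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("Ansonia town", "Beacon Falls town"), NULL))

# apply(x, 1, sum) adds up the columns within each row.
row.totals <- apply(counts, 1, sum)

# " town" is five characters long, so keep everything except the last five.
labels <- dimnames(counts)[[1]]
stripped <- substring(labels, 1, nchar(labels) - 5)
```

Selecting a subset of columns before calling apply gives a sum over just those columns.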

Problem 2

Build yourself a function, which takes a data frame like age.split as argument, and which draws a picture with horizontal bars showing the under18 + over18 population for each town, with the town names used as labels for each bar. (Hint: ?barplot, ?as.matrix, ?t)
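
The hints fit together roughly as follows. This is a hedged sketch on a hypothetical three-town data frame, not the finished function: the horiz argument and the way bar labels are picked up from the column names are assumptions to check against ?barplot in your version of Splus.

```r
# Hypothetical stand-in for age.split (made-up populations).
toy.split <- data.frame(under18 = c(4, 9, 6), over18 = c(14, 30, 21),
                        row.names = c("Ansonia", "Beacon Falls", "Bethany"))

# barplot draws one bar per column, so turn towns into columns:
# as.matrix drops the data frame structure, t transposes.
bars <- t(as.matrix(toy.split))

# Each bar stacks the under18 and over18 counts for one town.
mids <- barplot(bars, horiz = TRUE)
```

The transpose is the key step: without it there would be one bar per age group rather than one per town.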

Problem 3

From the Census lookup, get the data for the white population of New Haven town in 1990, broken down by age and sex (an old joke). Write a function to construct a population pyramid (compare with an example that David Pollard prepared for a paper) for the town. Make sure you save this function, because next week you will be using it as the starting point for some fancier graphics.

Important HELP concerning saving your work.

Where Splus finds things

A warning in advance: follow this section carefully, and don't hesitate to ask questions. A simple function is written below. What happens when you run it? Does it give you the answers that you expect? Can you explain?

> x <- 3
> foo<-function(y){ x<- 15; y + x }
> foo(4)
[1] 19
> foo(x)
[1] 18
> x
[1] 3

Explain what happens. See the Evaluation frames section for more details.

I assume you still have the data frame age.split lying around. Try

> search()

This function returns the list of directories, in order, where Splus looks for objects. It is called the "search path".

> objects()
> attach(age.split,1)

This puts the data frame age.split in the first position on the search list.

> total <- under18 + over18
> search()
> objects()
> detach(1,save="junk")
> objects()
> junk

Explain what happened.

Note: If you create an object called foo in your working directory, and if there is another object called foo further down in the search list, Splus will find your foo first. Your foo object masks the other foo object. Be careful that you don't accidentally mask a system object, such as the t() function or the c() function. You should take heed of any warnings about masking. See ?masked.
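
For comparison, a data frame can also be attached without naming a position, in which case it is assumed here to land behind the working directory, so your own objects are found first. A minimal sketch with a hypothetical data frame called toy:

```r
# Hypothetical data frame (made-up values).
toy <- data.frame(under18 = c(10, 20), over18 = c(30, 40))

attach(toy)                  # columns are now visible by name
total <- under18 + over18    # total is created in the working directory
detach("toy")                # remove toy from the search list again
```

Because toy sits behind the working directory, total is stored as an ordinary object of yours rather than inside the attached data frame.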

Evaluation frames

Splus can attach the same name to different values in different contexts (as in the case of the foo() function). See Becker et al., section 5.4, or Venables & Ripley, section 4.5. Many people initially think that this topic is a tad esoteric, until they get tripped up by ignoring it. If you find yourself wanting to define a function within another function (yes, you can do that), or if Splus does not seem to be finding objects that you have created, see

HELP on frames.
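
One way to sidestep surprises when defining a function inside another function is to pass everything the inner function needs as arguments, rather than relying on which frames are searched for a name. A minimal sketch, with hypothetical function names:

```r
outer.fn <- function(values, shift) {
  # The inner function receives shift explicitly as the argument s,
  # so it never has to look outside its own frame for it.
  add.shift <- function(v, s) v + s
  add.shift(values, shift)
}

result <- outer.fn(c(1, 2, 3), 10)
```

Written this way, the inner function refers only to its own arguments.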

Libraries

Libraries are directories of Splus objects that can be attached to the search list (?library), sort of like adding another _Data directory. Many authors distribute their new statistical software as Splus libraries. Check out the collection at StatLib.

We have created a special library for Stat200 at

H:\\classes\\stat200

The Splus command

library(lib.loc="h:\\classes\\stat200")

gives a list of all the sections in the stat200 library, and also all the sections in the default system library. You will see that one of the available sections is called "nci". Nothing has been loaded yet.

If you want to find out about the nci section of the library, type

library(help=nci,lib.loc="h:\\classes\\stat200")

If you want to attach the nci section of the library to your search path, type:

library(nci,lib.loc="h:\\classes\\stat200")

Then Splus has access to all the data in the nci library. (Try search() after you attach the library.)