Statistics 200: Lab 2 (Friday 12 September)
Today's tasks:
Reading data from other sources; data from WWW sites (such as
StatLib at Carnegie Mellon University, and the
U.S. Census Bureau). Data frames. Evaluation frames.
Libraries. Search lists.
Getting Census data (using Netscape or other internet browser)
Click on the link to the U.S. Census Bureau.
Navigate your way through to the data for New Haven County:
- select 1990 Census Lookup;
- select STF3A;
- select Connecticut, and check the State-County level;
- select New Haven County, and check State-County-County Subdivision;
- check 'select/retrieve all ...';
- (don't stare at the page for too long);
- check table P13;
- look at it in HTML;
- back; then check 'tab-delimited format' ...
- You should now be looking at a mess.
Choose 'save-as' from the file menu,
then save the mess as a text file called c:\user\NHage.txt
Now you are ready to have some fun with Splus.
Getting data into Splus
Try
> age <- read.table("NHage.txt") # shouldn't work
> count.fields("NHage.txt") # here's why
> age <- read.table("NHage.txt",sep="\t") # almost, but the header is messed up
> age <- read.table("NHage.txt",sep="\t",header=T)
Look at age after each attempt. What happened? (Hint: ?read.table)
See HELP for other ways of getting data into Splus.
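If the difference between the attempts is not obvious, the following sketch may help. It writes a made-up two-town file to a temporary file (the column names and counts are invented, not Census figures), then counts the fields on each line, first splitting on white space and then on tabs. The names tmp, n.white, n.tab and small are hypothetical, and the real NHage.txt has many more columns.

```r
# Made-up tab-delimited file: field values contain spaces, as in NHage.txt.
tmp <- tempfile()
cat("Area Name\tUnder 1\t1 and 2\n",
    "Ansonia town\t10\t20\n",
    "Beacon Falls town\t5\t7\n",
    file = tmp, sep = "")

n.white <- count.fields(tmp)             # split on white space
n.tab   <- count.fields(tmp, sep = "\t") # split on tabs only

n.white   # the counts differ from line to line
n.tab     # the same count on every line

# With consistent field counts and a header line, read.table() can cope.
small <- read.table(tmp, sep = "\t", header = TRUE)
dim(small)

unlink(tmp)   # remove the temporary file
```

Comparing n.white with n.tab shows why the separator matters when the fields themselves contain spaces.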
You now have a data frame. It acts like both a matrix and a list.
Save the first five rows of the data frame as age5. Use age5 as the test case
for the first problem. It is a good idea to experiment with small
data sets whenever you are trying to get a function to work.
Try dim(age), dimnames(age). Try running attributes() on each column
of the data frame. (You can select columns by name or number, as for
matrices, or by using the $ notation for lists.) Try
> lapply(age,attributes)
# applies the attributes() function
# to each component of the data frame
What do all those attributes mean?
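As a rough illustration of "acts like both a matrix and a list", here is a sketch on a tiny invented data frame. The name toy and its contents are hypothetical; try the same selections on age5 once you have it.

```r
# A small made-up data frame with two numeric columns.
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Matrix-style selection: [row, column], by number or by name.
toy[2, "b"]
toy[, 2]

# List-style selection: each column is a component of the list.
toy$b
toy[["b"]]

# Both views refer to the same column.
same.column <- all(toy[, 2] == toy$b)

dim(toy)                  # rows and columns, as for a matrix
lapply(toy, attributes)   # one result per component, as for a list
```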
Problem 1
Need HELP with the use of fix() for writing functions?
Build yourself a function
that takes a data frame as argument, and
carries out the following operations. Assume that you will be feeding
in a data frame whose rows have labels like "Ansonia town", ...
- Kill the first three columns.
- Create vectors under18 (sum the first twelve columns)
and over18 (the remaining columns), giving the populations for each
town under and over the age of 18 years. (Hint: ?apply or HELP)
- Strip the characters " town" from the end of each row name.
(Hint: ?nchar, ?substring)
- Return a new data frame, with the stripped town names as row labels,
and columns called under18, over18.
Run your function, saving the output as age.split.
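If the hinted functions are unfamiliar, the following sketch shows them on a small invented matrix. It is not a solution to the problem: the matrix m, its row labels and its values are made up, and your function still has to deal with the real columns of the Census table.

```r
# A made-up 2 x 3 matrix with row labels ending in " town".
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("Alpha town", "Beta town"), NULL))

# apply() with 1 as second argument applies the function to each row.
row.totals <- apply(m, 1, sum)

# " town" has five characters, so keep everything except the last five.
labels <- dimnames(m)[[1]]
short  <- substring(labels, 1, nchar(labels) - 5)

row.totals
short
```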
Problem 2
Build yourself a function, which takes a data frame like age.split as
argument, and which draws a picture with horizontal bars showing the
under18 + over18 population for each town, with the town names used as
labels for each bar. (Hint: ?barplot, ?as.matrix, ?t)
Problem 3
From the Census lookup, get the data for the white
population of New Haven town in 1990, broken down by age and sex (an
old joke). Write a function to construct a population pyramid
(compare with an example that I prepared
for a paper) for the town. Make sure you save this function, because
next week you will be using it as the starting point for some fancier
graphics.
Important: see HELP concerning saving your work.
Where Splus finds things
Try
> x <- 3
> foo <- function(y){ x <- 15; y + x }
> foo(4)
[1] 19
> foo(x)
[1] 18
> x
[1] 3
Explain.
I assume you still have the data frame age.split lying around. Try
> search()
# This function returns the list of directories,
# in order, where Splus looks for objects.
# It is called the "search path".
> objects()
> attach(age.split,1)
# puts the data frame age.split in the first position
# on the search list
> total <- under18 + over18
> search()
> objects()
> detach(1,save="junk")
> objects()
> junk
Explain what happened.
Note: If you create an object called foo in your working directory,
and if there is another object called foo further down in the search
list, Splus will find your foo first. Your foo object masks
the other foo object. Be careful that you don't accidentally mask
a system object, such as the t() function or the c() function. You
should take heed of any warnings about masking. See ?masked.
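A minimal sketch of masking, assuming you are working in your own working directory: the replacement t() defined here is deliberately useless, and the names masked and restored are hypothetical. Removing your own object makes the system version visible again.

```r
# Define an object called t in the working directory.
# It now sits in front of the system t() on the search list.
t <- function(x) "my t"
masked <- t(matrix(1:6, nrow = 2))   # calls the new t(), not the transpose

# Remove your own t to unmask the system function.
rm(t)
restored <- t(matrix(1:6, nrow = 2)) # the system transpose again

masked
dim(restored)
```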
Evaluation frames
Splus can attach the same name to different values in different
contexts (as in the case of the foo() function). See Becker et al.,
section 5.4, or Venables & Ripley, section 4.5.
I had originally thought this topic a tad esoteric--until I got
tripped up by ignoring it. If you find yourself wanting to define a
function within another function (yes, you can do that),
or if Splus does not seem to be finding objects that you have created,
see HELP on frames.
Libraries
Libraries are directories of Splus objects that can be attached to the search list
(?library), sort of like adding another _Data. Many authors distribute
their new statistical software as Splus libraries. Check out the
collection at StatLib.
We have created a special library for Statistics 200 at
h:\\classes\\stat200
The following information is subject to change.
The Splus command
library(lib.loc="h:\\classes\\stat200")
gives a list of all the sections in the stat200 library,
and also all the sections in the default system library.
You will see that one of the available sections is called "nci".
Nothing has been loaded yet.
If you want to find out about the nci section of the library, type
library(help=nci,lib.loc="h:\\classes\\stat200")
If you want to attach the nci section of the library to your search path, type:
library(nci,lib.loc="h:\\classes\\stat200")
Then Splus has access to all the data in the nci library. (Try
search() after you attach the library.)