[Return to tutorial page]

Statistics 200: Lab 6

Today's tasks:
  • Data structures. Regression and model fitting. Manipulation of lm objects.

    Problem 1

    Here are some data that are discussed in Section 14.5 of the text "Mathematical Statistics and Data Analysis" by Rice. For each of 12 children, the columns denote
    height in inches
    weight in pounds
    distance in cm (to pulmonary artery)
    
    42.8 40.0 37.0
    63.5 93.5 49.5
    37.5 35.5 34.5
    39.5 30.0 36.0
    45.5 52.0 43.0
    38.5 17.0 28.0
    43.0 38.5 37.0
    22.5  8.5 20.0
    37.0 33.0 33.5
    23.5  9.5 30.5
    33.0 21.0 38.5
    58.0 79.0 47.0
    
    EITHER:
    Copy the data to a file, then use read.table() to create a data frame called cath (short for catheter), with the variable names height, weight, dist.
    OR:
    Attach the ricedata section of the library to your searchpath and look for the dataframe cath.

    Use plot() and pairs() to get a rough idea of how the variables are related. What do you see?

    Problem 2

    Try
    lm(dist~height,cath)
    
    Look at what gets written to the screen. You will soon understand what it is saying. Be prepared to explain what is returned to either BM or DP.

    The function lm() fits a linear model by least squares. The second argument, cath, tells lm that the variables dist and height are components of the dataframe cath. The notation "dist~height" means that dist is to be predicted by a linear function of height: that is, the lm() function is finding constants c1 and c2 for which the sum of squared residuals

     
    
    is as small as possible. Note the presence of the "intercept" term c1; Splus includes it, by default. The "model formula"
    dist ~ height -1
    
    would force Splus to omit the intercept term.

    Splus calls the minimizing constants coefficients. The corresponding value c1+c2*height is called the vector of fitted values. The remainder, dist - c1 -c2*height, is called the vector of residuals (or residual values).

    Try

    lm(dist~height,cath)
    reg1 <- lm(dist~height,cath)
    print(reg1)
    
    What does the output tell you about the lm object reg1?

    Look at the attributes of reg1. Try to figure out the meaning of as many of the components of reg1 as you can. In particular, try to figure out how the "residual standard error" is calculated. Hint: ?lm ?lm.object ?print.lm ?summary (If you have studied regression before you might even try to figure out what anova(reg1) does.)

    Problem 3

    Try
    plot(reg1)
    
    then try
    plot(cath$height,cath$dist)
    abline(reg1)
    
    What happens? What is being plotted in each case? Where is Splus finding the necessary information?

    Homework Problem (pictures to be handed in)

    For the catheter data:
    1. Draw a plot of the residuals (on the vertical axis) versus the fitted values. Label the axes appropriately. Add a title.
    2. Try lm.influence(reg1). Save the $hat component as a new variable called hat. The vector of standardized residuals is defined as
      residuals/(residual.standard.error * sqrt(1-hat))
      
      Write a function that will accept an lm object as its argument and then draw a plot of standardized residuals versus fitted values.

    Problem 5

    The model formula
    dist ~ height + weight
    tells Splus to predict dist as a linear combination c1 + c2*height + c3*weight. Fit such a model, saving the output in an object reg2. Does the added predictor variable weight improve the fit much?

    Problem 6

    (If you get bored)
    Create a dataframe called cars by selecting the variables "Weight" "Disp." "Eng.Rev" "Price" "Country" "Mileage" "Type" from the dataframe car.all. See how well Mileage can be predicted by the other variables: try plots (or ?pairs if you are brave); look at summary statistics (?summary). Try fitting a linear model such as
    lm(Mileage ~ Weight)
    
    Figure out how to get rid of the problem with missing values. (?lm) Look at the output that is produced when you finally get lm() to work. What do you learn and see?