## Statistics 200: Lab 6 (Friday 20 February 1998)

Data structures. Regression and model fitting. Manipulation of lm objects.

#### Data entry, a reminder

Here are some data that are discussed in Section 14.5 of the text "Mathematical Statistics and Data Analysis" by Rice. For each of 12 children, the columns denote the height of the child (inches), the weight of the child (pounds) and the distance from the child's heart to their pulmonary artery (centimeters).
```
42.8 40.0 37.0
63.5 93.5 49.5
37.5 35.5 34.5
39.5 30.0 36.0
45.5 52.0 43.0
38.5 17.0 28.0
43.0 38.5 37.0
22.5  8.5 20.0
37.0 33.0 33.5
23.5  9.5 30.5
33.0 21.0 38.5
58.0 79.0 47.0
```
Task: Enter the data into Splus (there are two ways that we have available)

EITHER:

Copy the data to a file, then use read.table() to create a data frame called cath (short for catheter), with the column names height, weight, dist.

OR:

Attach the ricedata section of the library to your searchpath and look for the dataframe cath (if you can't remember how, check. the class FAQ on libraries

Use plot() and pairs() to get a rough idea of how the variables are related. What do you see?

#### Fitting linear models (the lm() command)

Try
```lm(dist~height,cath)
```
Look at what gets written to the screen. Do you understand what it is saying? Be prepared to explain what is returned to either BM or JH.

The function lm() fits a linear model by least squares. The second argument of the function, cath, tells lm that the variables dist and height are components of the dataframe cath.

The notation "dist~height" means that dist is to be predicted by a linear function of height: that is, the lm() function is finding constants c1 and c2 for which the sum of squared residuals

```
```
is as small as possible.

Note the presence of the "intercept" term c1; Splus includes it, by default.

The "model formula"

```dist ~ height -1
```
would force Splus to omit the intercept term.

Splus calls the minimizing constants coefficients. The corresponding value c1+c2*height is called the vector of fitted values. The remainder, dist - c1 -c2*height, is called the vector of residuals (or residual values).

Try

```lm(dist~height,cath)
reg1 <- lm(dist~height,cath)
print(reg1)
```
What does the output tell you about the lm object reg1?

Look at the attributes of reg1. Try to figure out the meaning of as many of the components of reg1 as you can. In particular, try to figure out how the "residual standard error" is calculated. Hint: ?lm ?lm.object ?print.lm ?summary (If you have studied regression before you might even try to figure out what anova(reg1) does.)

#### Splus plots for linear models

Try
```plot(reg1)
```
then try
```plot(cath\$height,cath\$dist)
abline(reg1)
```
What happens? What is being plotted in each case? Where is Splus finding the necessary information?

#### Homework Problem (pictures to be handed in)

For the cath data:
1. Draw a plot of the residuals (on the vertical axis) versus the fitted values. Label the axes appropriately. Add a title.
2. Try lm.influence(reg1). Save the \$hat component as a new variable called hat. The vector of standardized residuals is defined as
3. ```residuals/(residual.standard.error * sqrt(1-hat))
```
Write a function that will accept an lm object as its argument and then draw a plot of standardized residuals versus fitted values.

#### Bigger (and better?) models

The model formula
`dist ~ height + weight`
tells Splus to predict dist as a linear combination c1 + c2*height + c3*weight. Fit such a model, saving the output in an object reg2. Does the added predictor variable weight improve the fit much?

#### A more complicated regression problem (what to do with missing values)

Create a dataframe called cars by selecting the variables "Weight" "Disp." "Eng.Rev" "Price" "Country" "Mileage" "Type" from the dataframe car.all. See how well Mileage can be predicted by the other variables: try plot (or ?pairs if you are brave); look at summary statistics (?summary). Try fitting a linear model such as
```lm(Mileage ~ Weight)
```
Figure out how to get rid of the problem with missing values. (?lm) Look at the output that is produced when you finally get lm() to work. What do you learn and see?