---
title: "Longley"
author: "DP"
date: "September 6, 2016"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 3)
```

To avoid filling the handout with long summaries I'll use a shortened version, which I call look(). The code is hidden in the pdf file, but is viewable in the Rmd file.

```{r echo=F}
look <- function(out){
  ss <- summary(out)
  cat("Residuals:\n")
  print(summary(ss$residuals))
  cat(paste("Rsquared: ", round(ss$r.squared, 3), "\n"))
  print(round(ss$coef, 3))
}
```

### The famous Longley dataset

```{r}
data(longley)   # we now have a 16 by 7 data frame called longley
# Also look at pairs(longley)
```

```{r echo=F}
L0 <- longley   # a copy to save on typing
head(longley, 3)
```

Longley used this small data set to illustrate the potentially nasty effects of round-off errors when working with predictors that are "almost linearly dependent". I'll use it to illustrate the effects that small changes can have on a least squares fit when the matrix of predictors is "ill-conditioned".

```{r}
out0 <- lm(Employed ~ ., data = longley); look(out0)
```

A lot of the work is being done by $\mathbf{1}$ (a column of ones) and the longley\$Year predictor:

```{r}
outYear <- lm(Employed ~ Year, data = longley); look(outYear)
# summary(outYear)
```

The other predictors improved the fit by only a small amount. Compare totalSS $= \sum_i(y_i-\overline y)^2$ with the sum of squared residuals for both fits:

```{r echo=F}
y <- longley$Employed; y.centered <- y - mean(y)   # simplify typing
```

```{r echo=F}
c(totalSS = sum(y.centered^2),
  outYearSS = sum(outYear$res^2),
  res0SS = sum(out0$res^2))
```

### Remove time trend from five predictors

Replace the columns "GNP.deflator", "GNP", "Unemployed", "Armed.Forces", and "Population" by their components orthogonal to $\mathbf{1}$ and "Year".

```{r}
L1 <- longley
for (nn in names(longley)[1:5]){
  L1[[nn]] <- lm(longley[[nn]] ~ longley[["Year"]])$residuals
}
out1 <- lm(Employed ~ ., data = L1)
```

Look at pairs(L1) to see the effect. Then compare residuals and coefficients with out0:

```{r echo=F}
round(rbind(out1 = summary(out1$resid), out0 = summary(out0$resid)), 4)
round(rbind(out1 = out1$coef, out0 = out0$coef), 4)
```

The least squares fitted vector is the same but some of the coefficients have changed. Why?

### Effect of small perturbations

Now I'll make some small changes that, I hope, will have a bad effect on the estimated coefficients $\widehat\beta$. Manufacture a small perturbation as suggested in the SVD.pdf handout:

```{r}
Xsvd <- svd(L1[, 1:5])
U <- Xsvd$u; U5 <- U[, 5]
V <- Xsvd$v; V5 <- V[, 5]
E <- U5 %*% t(V5); dimnames(E) <- dimnames(longley[, 1:5])
yincr <- U5 %*% t(U5) %*% y
```

```{r}
L3 <- L2 <- L1   # make changes on copies
L2[, 1:5] <- L1[, 1:5] - E
L3[, 1:5] <- L1[, 1:5] - 2*E
L4 <- L2; L4$Employed <- y + yincr
L5 <- L3; L5$Employed <- y + 2*yincr
# compare, e.g.: round(L1, 2); round(L2, 2)
out2 <- lm(Employed ~ ., data = L2)
out3 <- lm(Employed ~ ., data = L3)
out4 <- lm(Employed ~ ., data = L4)
out5 <- lm(Employed ~ ., data = L5)
```

Compare the original longley data with the fake data set obtained by applying the L5 perturbations to longley. Do you think any of the differences are particularly noteworthy?

```{r echo=F}
fake <- longley
fake[, 1:5] <- longley[, 1:5] - 2*E
fake$Employed <- y + 2*yincr
outfake <- lm(Employed ~ ., data = fake)
names(fake) <- paste("fake", substr(names(longley), 1, 5), sep = "_")
both <- cbind(longley, fake)
therows <- rbind(1:7, 8:14)
both <- both[, as.vector(therows)]
round(t(both[1:8, -(11:12)]), 2); round(t(both[9:16, -(11:12)]), 2)
```

Then compare their least squares fits. Larger values in the last column of the coefficient display suggest we should not take a coefficient too seriously. (More about that idea soon.) First longley then fake:

```{r echo=F}
look(out0)
look(outfake)
```

The point is that some of the coefficients are very sensitive to changes in the data, almost at the level of round-off error.
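To see why the (U5, V5) direction is the dangerous one, it helps to look at the singular values themselves. Here is a small sketch (not part of the original analysis; it just redoes the detrending step on a fresh copy of longley) that reports the condition number $\sigma_1/\sigma_5$ of the detrended predictor matrix. The larger this ratio, the more a unit-size perturbation of the design along V5 can move the least squares coefficients:

```r
# Sketch: how ill-conditioned is the detrended predictor matrix?
# The ratio sigma_1/sigma_5 of its singular values is the condition
# number; when it is large, subtracting E (which has norm 1) barely
# changes X in absolute terms but moves it toward rank deficiency.
data(longley)
X <- as.matrix(longley[, 1:5])
for (nn in colnames(X)) X[, nn] <- lm(X[, nn] ~ longley$Year)$residuals
s <- svd(X)$d                       # singular values, largest first
cat("singular values:", signif(s, 3), "\n")
cat("condition number sigma_1/sigma_5:", signif(s[1]/s[5], 3), "\n")
```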
Just for comparison, here are the coefficients from the original fit and from the fits with the time trends removed:

```{r echo=F}
compare <- rbind(longley = out0$coef, L1 = out1$coef, L2 = out2$coef,
                 L3 = out3$coef, L4 = out4$coef, L5 = out5$coef)
round(compare, 2)
```

You might also find it interesting to look at the least squares fits in more detail.

```{r}
# look(out0); look(out1); look(out2); look(out3); look(out4); look(out5)
```
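The sensitivity visible in the comparison table can also be summarized with a single number. The sketch below (the name `amplification` and the plain `y ~ X` parameterization are mine, not from the handout) applies a yincr-style perturbation to the response and compares the relative change in the coefficient vector to the relative change in y. Because the perturbation points along U5, the coefficient change is carried entirely by V5 and is divided by the smallest singular value:

```r
# Sketch: quantify the sensitivity of bhat to a small shift of y along
# the weak direction U5 (mirroring the L4 construction above).  The
# relative change in the coefficients, divided by the relative change
# in y, works out to roughly ||y|| / (sigma_5 * ||bhat||).
data(longley)
y <- longley$Employed
X <- as.matrix(longley[, 1:5])
for (nn in colnames(X)) X[, nn] <- lm(X[, nn] ~ longley$Year)$residuals
U5 <- svd(X)$u[, 5]
yincr <- drop(U5 %*% t(U5) %*% y)   # tiny shift along the weak direction
b0 <- coef(lm(y ~ X))
b1 <- coef(lm(y + yincr ~ X))
rel_coef <- sqrt(sum((b1 - b0)^2)) / sqrt(sum(b0^2))
rel_y    <- sqrt(sum(yincr^2)) / sqrt(sum(y^2))
amplification <- rel_coef / rel_y
cat("relative change in bhat is", signif(amplification, 3),
    "times the relative change in y\n")
```

A large value here says the same thing as the table: a change in the data that is negligible on the scale of y produces a change in the coefficients that is anything but.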