---
title: "Longley"
author: "DP"
date: "September 6, 2016"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 3)
```

To avoid filling the handout with long summaries I'll use a shortened version, which I call look(). The code is hidden in the pdf file, but is viewable in the Rmd file.

```{r echo=F}
look <- function(out){
  ss <- summary(out)
  cat("Residuals:\n")
  print(summary(ss$residuals))
  cat(paste("Rsquared: ", round(ss$r.squared, 3), "\n"))
  print(round(ss$coef, 3))
}
```

### The famous Longley dataset

```{r}
data(longley)   # we now have a 16 by 7 data frame called longley
# Also look at pairs(longley)
```

```{r echo=F}
L0 <- longley   # a copy to save on typing
head(longley, 3)
```

Longley used this small data set to illustrate the potentially nasty effects of round-off errors when working with predictors that are "almost linearly dependent". I'll use it to illustrate the effects that small changes can have on a least squares fit when the matrix of predictors is "ill-conditioned".

```{r}
out0 <- lm(Employed ~ ., data = longley); look(out0)
```

A lot of the work is being done by $\mathbf{1}$ (a column of ones) and the longley\$Year predictor:

```{r}
outYear <- lm(Employed ~ Year, data = longley); look(outYear)
# summary(outYear)
```

The other predictors improved the fit by only a small amount. Compare totalSS $= \sum_i(y_i-\overline y)^2$ with the sum of squared residuals for both fits:

```{r echo=F}
y <- longley$Employed; y.centered <- y - mean(y)   # simplify typing
```

```{r echo=F}
c(totalSS = sum(y.centered^2),
  outYearSS = sum(outYear$res^2),
  res0SS = sum(out0$res^2))
```

### Remove time trend from five predictors

Replace the columns "GNP.deflator", "GNP", "Unemployed", "Armed.Forces", and "Population" by their components orthogonal to $\mathbf{1}$ and "Year".

```{r}
L1 <- longley
for (nn in names(longley)[1:5]){
  L1[[nn]] <- lm(longley[[nn]] ~ longley[["Year"]])$residuals
}
out1 <- lm(Employed ~ ., data = L1)
```

Look at pairs(L1) to see the effect. Then compare residuals and coefficients with out0:

```{r echo=F}
round(rbind(out1 = summary(out1$resid), out0 = summary(out0$resid)), 4)
round(rbind(out1 = out1$coef, out0 = out0$coef), 4)
```

The least squares fitted vector is the same but some of the coefficients have changed. Why?

### Effect of small perturbations

Now I'll make some small changes that, I hope, will have a bad effect on the estimated coefficients $\widehat\beta$. Manufacture a small perturbation as suggested in the SVD.pdf handout:

```{r}
Xsvd <- svd(L1[, 1:5])
U <- Xsvd$u; U5 <- U[, 5]
V <- Xsvd$v; V5 <- V[, 5]
E <- U5 %*% t(V5); dimnames(E) <- dimnames(longley[, 1:5])
yincr <- U5 %*% t(U5) %*% y
```

```{r}
L3 <- L2 <- L1   # make changes on copies
L2[, 1:5] <- L1[, 1:5] - E
L3[, 1:5] <- L1[, 1:5] - 2*E
L4 <- L2; L4$Employed <- y + yincr
L5 <- L3; L5$Employed <- y + 2*yincr
# compare, e.g.: round(L1, 2); round(L2, 2)
out2 <- lm(Employed ~ ., data = L2)
out3 <- lm(Employed ~ ., data = L3)
out4 <- lm(Employed ~ ., data = L4)
out5 <- lm(Employed ~ ., data = L5)
```

Compare the original longley data with the fake data set obtained by applying the L5 perturbations to longley. Do you think any of the differences are particularly noteworthy?

```{r echo=F}
fake <- longley
fake[, 1:5] <- longley[, 1:5] - 2*E
fake$Employed <- y + 2*yincr
outfake <- lm(Employed ~ ., data = fake)
names(fake) <- paste("fake", substr(names(longley), 1, 5), sep = "_")
both <- cbind(longley, fake)
therows <- rbind(1:7, 8:14)
both <- both[, as.vector(therows)]
round(t(both[1:8, -(11:12)]), 2); round(t(both[9:16, -(11:12)]), 2)
```

Then compare their least squares fits. Larger values in the last column of the coefficient display suggest we should not take a coefficient too seriously. (More about that idea soon.) First longley then fake:

```{r echo=F}
look(out0)
look(outfake)
```

The point is that some of the coefficients are very sensitive to changes in the data, almost at the level of round-off error.
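To see why the (U5, V5) direction is the dangerous one, it helps to look at the singular values themselves. Here is a small sketch (not part of the original analysis; it just redoes the detrending step on a fresh copy of longley) that reports the condition number $\sigma_1/\sigma_5$ of the detrended predictor matrix. The larger this ratio, the more a unit-size perturbation of the design along V5 can move the least squares coefficients:

```r
# Sketch: how ill-conditioned is the detrended predictor matrix?
# The ratio sigma_1/sigma_5 of its singular values is the condition
# number; when it is large, subtracting E (which has norm 1) barely
# changes X in absolute terms but moves it toward rank deficiency.
data(longley)
X <- as.matrix(longley[, 1:5])
for (nn in colnames(X)) X[, nn] <- lm(X[, nn] ~ longley$Year)$residuals
s <- svd(X)$d                       # singular values, largest first
cat("singular values:", signif(s, 3), "\n")
cat("condition number sigma_1/sigma_5:", signif(s[1]/s[5], 3), "\n")
```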
Just for comparison, here are the coefficients from the original fit and from the fits with the time trends removed:

```{r echo=F}
compare <- rbind(longley = out0$coef, L1 = out1$coef, L2 = out2$coef,
                 L3 = out3$coef, L4 = out4$coef, L5 = out5$coef)
round(compare, 2)
```

You might also find it interesting to look at the least squares fits in more detail.

```{r}
# look(out0); look(out1); look(out2); look(out3); look(out4); look(out5)
```
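The sensitivity visible in the comparison table can also be summarized with a single number. The sketch below (the name `amplification` and the plain `y ~ X` parameterization are mine, not from the handout) applies a yincr-style perturbation to the response and compares the relative change in the coefficient vector to the relative change in y. Because the perturbation points along U5, the coefficient change is carried entirely by V5 and is divided by the smallest singular value:

```r
# Sketch: quantify the sensitivity of bhat to a small shift of y along
# the weak direction U5 (mirroring the L4 construction above).  The
# relative change in the coefficients, divided by the relative change
# in y, works out to roughly ||y|| / (sigma_5 * ||bhat||).
data(longley)
y <- longley$Employed
X <- as.matrix(longley[, 1:5])
for (nn in colnames(X)) X[, nn] <- lm(X[, nn] ~ longley$Year)$residuals
U5 <- svd(X)$u[, 5]
yincr <- drop(U5 %*% t(U5) %*% y)   # tiny shift along the weak direction
b0 <- coef(lm(y ~ X))
b1 <- coef(lm(y + yincr ~ X))
rel_coef <- sqrt(sum((b1 - b0)^2)) / sqrt(sum(b0^2))
rel_y    <- sqrt(sum(yincr^2)) / sqrt(sum(y^2))
amplification <- rel_coef / rel_y
cat("relative change in bhat is", signif(amplification, 3),
    "times the relative change in y\n")
```

A large value here says the same thing as the table: a change in the data that is negligible on the scale of y produces a change in the coefficients that is anything but.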