ANOVA and Linear Models

Challenge 1

In this case lsfit expects to create its own intercept. Since we had already included an intercept column in our model matrix, the matrix becomes collinear when lsfit adds its own term. This can be corrected either by leaving the intercept column out of the model matrix or by setting intercept = FALSE in lsfit.
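
As a minimal sketch with simulated data (the names X, y, fit.a, and fit.b here are ours, not part of the assignment), either fix looks like this:

set.seed(1)
y <- rnorm(10)
X <- cbind(1, rnorm(10))  # a model matrix that already contains an intercept column

# Option 1: drop our hand-made intercept column and let lsfit add its own
fit.a <- lsfit(X[, -1], y)

# Option 2: keep our intercept column and tell lsfit not to add another
fit.b <- lsfit(X, y, intercept = FALSE)

Both calls fit the same model; only the coefficient labels differ.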

Challenge 2: One-Way ANOVA

To build a linear model with lm() using the diver's country as the independent variable, we must first convert the country variable to a factor. It was initially read in as a character variable (because as.is = TRUE was set in the read.csv call).

d <- read.csv("http://www.stat.yale.edu/~jay/625/diving/Diving2000.csv", as.is = TRUE)
d$Country.f <- as.factor(d$Country)
lm.1 <- lm(d$JScore ~ d$Country.f)
summary(lm.1)
## 
## Call:
## lm(formula = d$JScore ~ d$Country.f)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.607 -0.609  0.197  0.841  3.027 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.6143     0.2089   22.09  < 2e-16 ***
## d$Country.fARM   0.6238     0.2829    2.21  0.02747 *  
## d$Country.fAUS   2.6886     0.2139   12.57  < 2e-16 ***
## d$Country.fAUT   1.8314     0.2289    8.00  1.3e-15 ***
## d$Country.fAZE   1.6119     0.2829    5.70  1.2e-08 ***
## d$Country.fBLR   2.0375     0.2394    8.51  < 2e-16 ***
## d$Country.fBRA   1.7772     0.2275    7.81  6.1e-15 ***
## d$Country.fCAN   2.8259     0.2154   13.12  < 2e-16 ***
## d$Country.fCHN   3.5447     0.2131   16.63  < 2e-16 ***
## d$Country.fCOL   1.2891     0.2377    5.42  6.0e-08 ***
## d$Country.fCUB   1.8724     0.2207    8.48  < 2e-16 ***
## d$Country.fCZE   0.8738     0.2829    3.09  0.00201 ** 
## d$Country.fESP   1.6290     0.2226    7.32  2.7e-13 ***
## d$Country.fFIN   1.8440     0.2487    7.42  1.3e-13 ***
## d$Country.fFRA   1.4951     0.2247    6.65  3.0e-11 ***
## d$Country.fGBR   1.7496     0.2169    8.06  8.1e-16 ***
## d$Country.fGEO   1.3857     0.2955    4.69  2.8e-06 ***
## d$Country.fGER   2.5985     0.2143   12.13  < 2e-16 ***
## d$Country.fGRE   0.9307     0.2275    4.09  4.3e-05 ***
## d$Country.fHKG   0.0524     0.2829    0.19  0.85310    
## d$Country.fHUN   1.8965     0.2242    8.46  < 2e-16 ***
## d$Country.fINA  -0.1411     0.2394   -0.59  0.55562    
## d$Country.fITA   2.1969     0.2210    9.94  < 2e-16 ***
## d$Country.fJPN   2.9766     0.2242   13.28  < 2e-16 ***
## d$Country.fKAZ   1.9922     0.2179    9.14  < 2e-16 ***
## d$Country.fKOR   1.2299     0.2315    5.31  1.1e-07 ***
## d$Country.fMAS   1.3959     0.2268    6.15  7.8e-10 ***
## d$Country.fMEX   2.2988     0.2175   10.57  < 2e-16 ***
## d$Country.fPER   1.4036     0.2487    5.64  1.7e-08 ***
## d$Country.fPHI   0.9896     0.2520    3.93  8.6e-05 ***
## d$Country.fPRK   2.0578     0.2173    9.47  < 2e-16 ***
## d$Country.fPUR   1.2173     0.2434    5.00  5.8e-07 ***
## d$Country.fROM   1.0481     0.2315    4.53  6.0e-06 ***
## d$Country.fRUS   3.0096     0.2135   14.10  < 2e-16 ***
## d$Country.fSUI   0.6260     0.2520    2.48  0.01300 *  
## d$Country.fSWE   3.0333     0.2412   12.57  < 2e-16 ***
## d$Country.fTHA   0.4929     0.2487    1.98  0.04751 *  
## d$Country.fTPE   0.5714     0.2289    2.50  0.01255 *  
## d$Country.fUKR   2.2103     0.2165   10.21  < 2e-16 ***
## d$Country.fUSA   2.8629     0.2133   13.42  < 2e-16 ***
## d$Country.fVEN   1.3205     0.2305    5.73  1.0e-08 ***
## d$Country.fZIM   0.9690     0.2829    3.43  0.00062 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 10745 degrees of freedom
## Multiple R-squared:   0.3,   Adjusted R-squared:  0.297 
## F-statistic:  112 on 41 and 10745 DF,  p-value: <2e-16

Notice there is a dummy variable for each country except Argentina, which falls first alphabetically in the country list and is therefore absorbed into the intercept.

We can recreate the same results by building our own model matrix, with dummy columns for every country except Argentina, and then estimating the linear model by calculating \( \Theta = (X^TX)^{-1}X^TY \), where \( \Theta \) is an \( n \times 1 \) vector containing our estimated coefficients \( \theta_1, \theta_2, \ldots, \theta_n \), \( n \) is the number of independent variables, and \( X \) is an \( m \times n \) matrix containing our feature vectors \( x_1, x_2, \ldots, x_n \) for \( m \) observations.

y <- d$JScore
cols <- length(unique(d$Country))
countries <- sort(unique(d$Country))
mm <- matrix(nrow = length(y), ncol = cols)

# Build Matrix
for (i in 1:cols) {
    mm[, i] <- (d$Country == countries[i])
}
mm[, 1] <- 1  # replace the first (ARG) column with an intercept column

lm.2 <- solve(t(mm) %*% mm) %*% t(mm) %*% y  # the normal-equations estimate
lm.2
##           [,1]
##  [1,]  4.61429
##  [2,]  0.62381
##  [3,]  2.68860
##  [4,]  1.83143
##  [5,]  1.61190
##  [6,]  2.03750
##  [7,]  1.77725
##  [8,]  2.82589
##  [9,]  3.54470
## [10,]  1.28908
## [11,]  1.87243
## [12,]  0.87381
## [13,]  1.62896
## [14,]  1.84405
## [15,]  1.49509
## [16,]  1.74955
## [17,]  1.38571
## [18,]  2.59851
## [19,]  0.93069
## [20,]  0.05238
## [21,]  1.89654
## [22,] -0.14107
## [23,]  2.19694
## [24,]  2.97662
## [25,]  1.99223
## [26,]  1.22987
## [27,]  1.39592
## [28,]  2.29881
## [29,]  1.40357
## [30,]  0.98961
## [31,]  2.05785
## [32,]  1.21735
## [33,]  1.04805
## [34,]  3.00961
## [35,]  0.62597
## [36,]  3.03333
## [37,]  0.49286
## [38,]  0.57143
## [39,]  2.21029
## [40,]  2.86291
## [41,]  1.32050
## [42,]  0.96905

We can see that the results of our manual approach exactly match the lm() output. That said, this way of fitting the model is not very computationally efficient, mainly because inverting \( X^TX \) is an expensive operation, on the order of \( O(n^2) \) to \( O(n^3) \). This, combined with the dense matrix multiplication, \( O(n^3) \), makes it very expensive.
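
If we did want to fit by hand, a cheaper route (a sketch reusing the mm and y built above) is to solve the normal equations directly rather than forming the inverse, or to use a QR-based solve, which is essentially what lm() does internally:

# Solve (X'X) b = X'y without explicitly inverting X'X
lm.2b <- solve(crossprod(mm), crossprod(mm, y))

# Or solve the least-squares problem via a QR decomposition
lm.2c <- qr.solve(mm, y)

Both should reproduce the estimates in lm.2 above.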

Challenge 3: Does this model make sense?

By creating the model in this fashion we are essentially testing every country's diving prowess against Argentina's (as the Argentinean level has been absorbed by the intercept). Strictly speaking, each p-value tests the null hypothesis that the coefficient on an independent variable, \( \theta_i \), is equal to 0. Exactly the same test is being carried out here, but its meaning is shifted by the changed interpretation of the intercept.

Note that the mean JScore for Argentina is 4.614286 (the intercept term) and the mean JScore for the United States is 7.477191. The coefficient on the US dummy variable is equal to \( 7.477191 - 4.614286 = 2.862905 \), the 2.8629 reported above. So really, we are testing the null hypothesis that the average judge score for the United States is no different from that for Argentina.
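
These group means can be pulled directly from the data as a quick check (grp.means is just a name we introduce here):

grp.means <- tapply(d$JScore, d$Country, mean)
grp.means["ARG"]                      # 4.614286, the intercept
grp.means["USA"]                      # 7.477191
grp.means["USA"] - grp.means["ARG"]   # 2.862905, the USA coefficient above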

Challenge 4A: ANOVA

anova(lm.1)
## Analysis of Variance Table
## 
## Response: d$JScore
##                Df Sum Sq Mean Sq F value Pr(>F)    
## d$Country.f    41   7025   171.3     112 <2e-16 ***
## Residuals   10745  16416     1.5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we use ANOVA to compare the group means represented in the regression above. The null hypothesis is that the group means are all the same. The fitted means model is passed to the anova() function, which outputs the degrees of freedom for the test, the Sum of Squares (for the country term, the sum of squared differences of each group's mean from the overall mean, weighted by group size), the Mean Square (MS), equal to the Sum of Squares divided by its degrees of freedom, and the F value, \( MS_A/MS_\epsilon \), where \( A \) denotes the treatment (country) groups and \( \epsilon \) the error.

Analyzing the output, we see that the F-statistic is highly significant (p < 2e-16), so yes, the mean JScore differs across the countries.
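
To make the definitions above concrete, the ANOVA table can be reconstructed by hand (a sketch; the object names are ours):

grand.mean  <- mean(d$JScore)
group.means <- tapply(d$JScore, d$Country.f, mean)
group.sizes <- table(d$Country.f)

# Between-group (treatment) and within-group (error) sums of squares
ss.a <- sum(group.sizes * (group.means - grand.mean)^2)
ss.e <- sum((d$JScore - group.means[as.character(d$Country.f)])^2)

# Degrees of freedom: 41 for the countries, n - 42 for the residuals
df.a <- nlevels(d$Country.f) - 1
df.e <- nrow(d) - nlevels(d$Country.f)

# F = MS_A / MS_error; should match the F value of about 112 above
(ss.a / df.a) / (ss.e / df.e)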

Challenge 4B: Sum Coding of Categorical Variables

In this iteration we change the form of our regression equation: we code our independent variables such that the country effects sum to zero. To accomplish this, we drop one country's dummy column (it becomes the intercept column) and fill that country's rows with \( -1 \) in every other country column.

In this instance, let \( \mu_i \) be the mean Judge Score for divers from country \( i \). When we sum code, \( \theta_0 + \theta_i = \mu_i \), and \( \theta_0 \) represents the average of the per-country average JScores. Functionally, the null hypothesis we test is still that a given coefficient \( \theta_i \) is zero; however, given the interpretation of the intercept, we can restate that null hypothesis as: divers from country \( i \) have JScores no different from the average JScore across countries.
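
To see why the intercept is the 'mean of means', sum \( \theta_0 + \theta_i = \mu_i \) over all 42 countries: the sum-to-zero constraint \( \sum_i \theta_i = 0 \) leaves \( 42\,\theta_0 = \sum_i \mu_i \), so \( \theta_0 = \frac{1}{42}\sum_i \mu_i \), the (unweighted) average of the per-country means.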

Consider our model matrix \( X \), which is \( m \times 42 \). Column \( x_{42} \) will be our intercept; for the remaining columns, \( x_{i} = -1 \) if the diver is from Zimbabwe, \( x_{i} = 1 \) if the diver's country is \( C_i \), and \( x_{i} = 0 \) otherwise.

y <- d$JScore
cols <- length(unique(d$Country))
countries <- sort(unique(d$Country))
mm <- matrix(nrow = length(y), ncol = cols)
colnames(mm) <- countries
# Build Matrix
for (i in 1:cols) {
    mm[, i] <- (d$Country == countries[i])
}
mm[d$Country == "ZIM", ] <- -1  # code Zimbabwe's rows as -1 in every column
mm[, 42] <- 1  # the last (ZIM) column becomes the intercept

So we have coded the 42 Zimbabwe rows as \( -1 \) in each country column, set the 42nd column (the intercept) to 1 for every observation, and left the remaining rows coded 1/0 where appropriate. Now we regress Judge Score on our indicator variables using lsfit. Recall that, as we have provided an intercept in our model matrix, we must set intercept = FALSE in lsfit.

lm.3 <- lsfit(mm, y, intercept = FALSE)
ls.print(lm.3)
## Residual Standard Error=1.236
## R-Square=0.9689
## F-statistic (df=42, 10745)=7958
## p-value=0
## 
##     Estimate Std.Err  t-value Pr(>|t|)
## ARG  -1.6267  0.2047  -7.9472   0.0000
## ARM  -1.0028  0.1870  -5.3630   0.0000
## AUS   1.0619  0.0482  22.0364   0.0000
## AUT   0.2048  0.0929   2.2032   0.0276
## AZE  -0.0148  0.1870  -0.0789   0.9371
## BLR   0.4108  0.1154   3.5605   0.0004
## BRA   0.1506  0.0896   1.6813   0.0927
## CAN   1.1992  0.0541  22.1856   0.0000
## CHN   1.9180  0.0447  42.8893   0.0000
## COL  -0.3376  0.1120  -3.0133   0.0026
## CUB   0.2458  0.0718   3.4222   0.0006
## CZE  -0.7528  0.1870  -4.0260   0.0001
## ESP   0.0023  0.0771   0.0299   0.9762
## FIN   0.2174  0.1328   1.6365   0.1018
## FRA  -0.1316  0.0826  -1.5932   0.1111
## GBR   0.1229  0.0598   2.0565   0.0398
## GEO  -0.2409  0.2047  -1.1771   0.2392
## GER   0.9719  0.0499  19.4805   0.0000
## GRE  -0.6960  0.0896  -7.7704   0.0000
## HKG  -1.5743  0.1870  -8.4188   0.0000
## HUN   0.2699  0.0814   3.3164   0.0009
## INA  -1.7677  0.1154 -15.3196   0.0000
## ITA   0.5703  0.0726   7.8537   0.0000
## JPN   1.3500  0.0814  16.5888   0.0000
## KAZ   0.3656  0.0630   5.8018   0.0000
## KOR  -0.3968  0.0989  -4.0139   0.0001
## MAS  -0.2307  0.0880  -2.6215   0.0088
## MEX   0.6722  0.0615  10.9210   0.0000
## PER  -0.2231  0.1328  -1.6794   0.0931
## PHI  -0.6370  0.1386  -4.5951   0.0000
## PRK   0.4312  0.0611   7.0590   0.0000
## PUR  -0.4093  0.1232  -3.3231   0.0009
## ROM  -0.5786  0.0989  -5.8532   0.0000
## RUS   1.3830  0.0465  29.7348   0.0000
## SUI  -1.0007  0.1386  -7.2180   0.0000
## SWE   1.4067  0.1191  11.8125   0.0000
## THA  -1.1338  0.1328  -8.5354   0.0000
## TPE  -1.0552  0.0929 -11.3537   0.0000
## UKR   0.5836  0.0581  10.0382   0.0000
## USA   1.2362  0.0455  27.1694   0.0000
## VEN  -0.3062  0.0968  -3.1644   0.0016
## ZIM   6.2409  0.0180 346.8986   0.0000
which(countries == "FRA")
## [1] 15

Recall that the intercept is the 'mean of means', that is, the average of the per-country average JScores. We are, therefore, correctly comparing France to the average dive score across countries (as opposed to earlier, when we compared the US to Argentina).

The null hypothesis is that the mean JScore for a country is no different from the average JScore across countries. The p-value for France (0.1111) is large, so we fail to reject the null hypothesis and conclude that the French are more or less average in their diving ability: there is no significant difference from the cross-country average.

As a sanity check, the French coefficient is \( -0.1316 \), and \( \theta_0 + \theta_{15} = 6.2409 - 0.1316 = 6.1093 \), which is the mean JScore for French divers.
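
As an aside, R can produce this sum-coded fit directly through lm() with contr.sum contrasts; with the default level ordering the last level (ZIM) is the one whose effect is left implied, which matches the hand-built matrix above. A sketch (lm.sum is our own name):

lm.sum <- lm(JScore ~ Country.f, data = d,
             contrasts = list(Country.f = contr.sum))
coef(lm.sum)[1]  # the intercept, which should again be about 6.2409

The remaining coefficients, Country.f1 through Country.f41, should line up with the lsfit estimates above (Country.f1 with ARG, and so on).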

Challenge 5: Zimbabwe

We can adjust the above construction for divers from Zimbabwe by sorting the countries in decreasing order, so that Argentina's column becomes the intercept and Zimbabwe gets its own coefficient.

y <- d$JScore
cols <- length(unique(d$Country))
countries <- sort(unique(d$Country), decreasing = TRUE)
mm <- matrix(nrow = length(y), ncol = cols)
colnames(mm) <- countries
# Build Matrix
for (i in 1:cols) {
    mm[, i] <- (d$Country == countries[i])
}
mm[d$Country == "ARG", ] <- -1  # now Argentina's rows take the -1 coding
mm[, 42] <- 1  # the last (ARG) column becomes the intercept
lm.4 <- lsfit(mm, y, intercept = FALSE)
ls.print(lm.4)
## Residual Standard Error=1.236
## R-Square=0.9689
## F-statistic (df=42, 10745)=7958
## p-value=0
## 
##     Estimate Std.Err  t-value Pr(>|t|)
## ZIM  -0.6576  0.1870  -3.5167   0.0004
## VEN  -0.3062  0.0968  -3.1644   0.0016
## USA   1.2362  0.0455  27.1694   0.0000
## UKR   0.5836  0.0581  10.0382   0.0000
## TPE  -1.0552  0.0929 -11.3537   0.0000
## THA  -1.1338  0.1328  -8.5354   0.0000
## SWE   1.4067  0.1191  11.8125   0.0000
## SUI  -1.0007  0.1386  -7.2180   0.0000
## RUS   1.3830  0.0465  29.7348   0.0000
## ROM  -0.5786  0.0989  -5.8532   0.0000
## PUR  -0.4093  0.1232  -3.3231   0.0009
## PRK   0.4312  0.0611   7.0590   0.0000
## PHI  -0.6370  0.1386  -4.5951   0.0000
## PER  -0.2231  0.1328  -1.6794   0.0931
## MEX   0.6722  0.0615  10.9210   0.0000
## MAS  -0.2307  0.0880  -2.6215   0.0088
## KOR  -0.3968  0.0989  -4.0139   0.0001
## KAZ   0.3656  0.0630   5.8018   0.0000
## JPN   1.3500  0.0814  16.5888   0.0000
## ITA   0.5703  0.0726   7.8537   0.0000
## INA  -1.7677  0.1154 -15.3196   0.0000
## HUN   0.2699  0.0814   3.3164   0.0009
## HKG  -1.5743  0.1870  -8.4188   0.0000
## GRE  -0.6960  0.0896  -7.7704   0.0000
## GER   0.9719  0.0499  19.4805   0.0000
## GEO  -0.2409  0.2047  -1.1771   0.2392
## GBR   0.1229  0.0598   2.0565   0.0398
## FRA  -0.1316  0.0826  -1.5932   0.1111
## FIN   0.2174  0.1328   1.6365   0.1018
## ESP   0.0023  0.0771   0.0299   0.9762
## CZE  -0.7528  0.1870  -4.0260   0.0001
## CUB   0.2458  0.0718   3.4222   0.0006
## COL  -0.3376  0.1120  -3.0133   0.0026
## CHN   1.9180  0.0447  42.8893   0.0000
## CAN   1.1992  0.0541  22.1856   0.0000
## BRA   0.1506  0.0896   1.6813   0.0927
## BLR   0.4108  0.1154   3.5605   0.0004
## AZE  -0.0148  0.1870  -0.0789   0.9371
## AUT   0.2048  0.0929   2.2032   0.0276
## AUS   1.0619  0.0482  22.0364   0.0000
## ARM  -1.0028  0.1870  -5.3630   0.0000
## ARG   6.2409  0.0180 346.8986   0.0000

In this case, we see that divers from Zimbabwe are somewhat below average in ability, and the p-value (0.0004) is significant. We can reject the null hypothesis and conclude that the coefficient on Zimbabwe is not zero: the mean JScore for Zimbabwean divers differs from the average across countries.

A good check across the two models is to make sure that the intercept estimates match, which they do (both are 6.2409). Also, to save work, we could have used this second model to answer both the France and Zimbabwe questions, since both of those countries have their own coefficients here.
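
A quick way to make that check in code, assuming lsfit kept the column names it was given (the printouts above suggest it did; the intercept column was labeled ZIM in the first fit and ARG in the second):

# The 'mean of means' that both intercepts estimate
mean(tapply(d$JScore, d$Country, mean))  # about 6.2409

# The two intercept estimates themselves
lm.3$coefficients["ZIM"]
lm.4$coefficients["ARG"]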