Challenge 1
In this case lsfit expects to add its own intercept. Since we already included an intercept column in the model matrix, the matrix becomes collinear when lsfit attempts to add its own term. This can be corrected either by leaving the intercept column out of the model matrix or by setting intercept = FALSE in lsfit.
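As a minimal sketch (mm here is a hypothetical model matrix whose first column is the intercept, and y is the response), either of these calls avoids the collinearity:
lm.fix1 <- lsfit(mm, y, intercept = FALSE)  # keep our own intercept column
lm.fix2 <- lsfit(mm[, -1], y)  # or drop it and let lsfit add its own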
Challenge 2: One Way ANOVA
To build a linear model in lm() with diver country as the independent variable, we must first convert the country variable to a factor. It was initially read in as a character variable (because as.is = TRUE was set in read.csv).
d <- read.csv("http://www.stat.yale.edu/~jay/625/diving/Diving2000.csv", as.is = TRUE)
d$Country.f <- as.factor(d$Country)
lm.1 <- lm(d$JScore ~ d$Country.f)
summary(lm.1)
##
## Call:
## lm(formula = d$JScore ~ d$Country.f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.607 -0.609 0.197 0.841 3.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6143 0.2089 22.09 < 2e-16 ***
## d$Country.fARM 0.6238 0.2829 2.21 0.02747 *
## d$Country.fAUS 2.6886 0.2139 12.57 < 2e-16 ***
## d$Country.fAUT 1.8314 0.2289 8.00 1.3e-15 ***
## d$Country.fAZE 1.6119 0.2829 5.70 1.2e-08 ***
## d$Country.fBLR 2.0375 0.2394 8.51 < 2e-16 ***
## d$Country.fBRA 1.7772 0.2275 7.81 6.1e-15 ***
## d$Country.fCAN 2.8259 0.2154 13.12 < 2e-16 ***
## d$Country.fCHN 3.5447 0.2131 16.63 < 2e-16 ***
## d$Country.fCOL 1.2891 0.2377 5.42 6.0e-08 ***
## d$Country.fCUB 1.8724 0.2207 8.48 < 2e-16 ***
## d$Country.fCZE 0.8738 0.2829 3.09 0.00201 **
## d$Country.fESP 1.6290 0.2226 7.32 2.7e-13 ***
## d$Country.fFIN 1.8440 0.2487 7.42 1.3e-13 ***
## d$Country.fFRA 1.4951 0.2247 6.65 3.0e-11 ***
## d$Country.fGBR 1.7496 0.2169 8.06 8.1e-16 ***
## d$Country.fGEO 1.3857 0.2955 4.69 2.8e-06 ***
## d$Country.fGER 2.5985 0.2143 12.13 < 2e-16 ***
## d$Country.fGRE 0.9307 0.2275 4.09 4.3e-05 ***
## d$Country.fHKG 0.0524 0.2829 0.19 0.85310
## d$Country.fHUN 1.8965 0.2242 8.46 < 2e-16 ***
## d$Country.fINA -0.1411 0.2394 -0.59 0.55562
## d$Country.fITA 2.1969 0.2210 9.94 < 2e-16 ***
## d$Country.fJPN 2.9766 0.2242 13.28 < 2e-16 ***
## d$Country.fKAZ 1.9922 0.2179 9.14 < 2e-16 ***
## d$Country.fKOR 1.2299 0.2315 5.31 1.1e-07 ***
## d$Country.fMAS 1.3959 0.2268 6.15 7.8e-10 ***
## d$Country.fMEX 2.2988 0.2175 10.57 < 2e-16 ***
## d$Country.fPER 1.4036 0.2487 5.64 1.7e-08 ***
## d$Country.fPHI 0.9896 0.2520 3.93 8.6e-05 ***
## d$Country.fPRK 2.0578 0.2173 9.47 < 2e-16 ***
## d$Country.fPUR 1.2173 0.2434 5.00 5.8e-07 ***
## d$Country.fROM 1.0481 0.2315 4.53 6.0e-06 ***
## d$Country.fRUS 3.0096 0.2135 14.10 < 2e-16 ***
## d$Country.fSUI 0.6260 0.2520 2.48 0.01300 *
## d$Country.fSWE 3.0333 0.2412 12.57 < 2e-16 ***
## d$Country.fTHA 0.4929 0.2487 1.98 0.04751 *
## d$Country.fTPE 0.5714 0.2289 2.50 0.01255 *
## d$Country.fUKR 2.2103 0.2165 10.21 < 2e-16 ***
## d$Country.fUSA 2.8629 0.2133 13.42 < 2e-16 ***
## d$Country.fVEN 1.3205 0.2305 5.73 1.0e-08 ***
## d$Country.fZIM 0.9690 0.2829 3.43 0.00062 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.24 on 10745 degrees of freedom
## Multiple R-squared: 0.3, Adjusted R-squared: 0.297
## F-statistic: 112 on 41 and 10745 DF, p-value: <2e-16
Notice that there is a dummy variable for each country except Argentina, which falls first alphabetically in the country list and is absorbed into the intercept.
We can recreate the same results by building our own model matrix, with dummy variables for every country except Argentina, and then estimating the linear model by calculating \( \Theta = (X^TX)^{-1}X^TY \), where \( \Theta \) is an \( n \times 1 \) vector containing our estimated coefficients \( \theta_1, \theta_2, ..., \theta_n \), \( n \) is the number of independent variables, and \( X \) is an \( m \times n \) matrix containing our feature vectors \( x_1, x_2, ..., x_n \) for \( m \) observations.
y <- d$JScore
cols <- length(unique(d$Country))
countries <- sort(unique(d$Country))
mm <- matrix(nrow = length(y), ncol = cols)
# Build the model matrix: one indicator column per country
for (i in 1:cols) {
  mm[, i] <- (d$Country == countries[i])
}
# Replace the first (ARG) column with a column of ones for the intercept
mm[, 1] <- 1
lm.2 <- solve(t(mm) %*% mm) %*% t(mm) %*% y
lm.2
## [,1]
## [1,] 4.61429
## [2,] 0.62381
## [3,] 2.68860
## [4,] 1.83143
## [5,] 1.61190
## [6,] 2.03750
## [7,] 1.77725
## [8,] 2.82589
## [9,] 3.54470
## [10,] 1.28908
## [11,] 1.87243
## [12,] 0.87381
## [13,] 1.62896
## [14,] 1.84405
## [15,] 1.49509
## [16,] 1.74955
## [17,] 1.38571
## [18,] 2.59851
## [19,] 0.93069
## [20,] 0.05238
## [21,] 1.89654
## [22,] -0.14107
## [23,] 2.19694
## [24,] 2.97662
## [25,] 1.99223
## [26,] 1.22987
## [27,] 1.39592
## [28,] 2.29881
## [29,] 1.40357
## [30,] 0.98961
## [31,] 2.05785
## [32,] 1.21735
## [33,] 1.04805
## [34,] 3.00961
## [35,] 0.62597
## [36,] 3.03333
## [37,] 0.49286
## [38,] 0.57143
## [39,] 2.21029
## [40,] 2.86291
## [41,] 1.32050
## [42,] 0.96905
We can see that the results of our manual fit exactly match the lm() output. That said, fitting the model this way is not very computationally efficient, mainly because inverting \( X^TX \) is an expensive operation, roughly \( O(n^{2.4}) \) to \( O(n^3) \) depending on the algorithm. This, combined with the dense matrix multiplication, also about \( O(n^3) \), makes it very expensive.
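For reference, a cheaper and more numerically stable route avoids forming the explicit inverse; here is a minimal sketch using the mm and y built above (a QR decomposition is essentially what lm() and lsfit() use internally):
# Solve the normal equations directly rather than inverting t(X) %*% X
theta.ne <- solve(crossprod(mm), crossprod(mm, y))
# Or factor the model matrix once with QR and recover the coefficients
theta.qr <- qr.coef(qr(mm), y)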
Challenge 3: Does this model make sense?
By creating the model in this fashion we are essentially testing every country's diving prowess against the Argentineans (as the Argentinean level has been absorbed into the intercept). Strictly speaking, each p-value tests the null hypothesis that the coefficient on an independent variable, \( \theta_i \), is equal to 0. Exactly the same test is carried out here, but its meaning is altered by the changed interpretation of the intercept.
Note that the mean JScore for Argentina is 4.614286 (the intercept term) and the mean JScore for the United States is 7.477191. The coefficient on the US dummy variable is equal to \( 7.477191 - 4.614286 \). So really, we are testing the null hypothesis that the average judge score for the United States is no different from that for Argentina.
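These group means are easy to verify directly from the data (a quick check using the d data frame loaded above):
country.means <- tapply(d$JScore, d$Country, mean)
country.means["ARG"]  # the intercept: mean JScore for Argentina
country.means["USA"] - country.means["ARG"]  # the coefficient on the USA dummy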
Challenge 4A: ANOVA
anova(lm.1)
## Analysis of Variance Table
##
## Response: d$JScore
## Df Sum Sq Mean Sq F value Pr(>F)
## d$Country.f 41 7025 171.3 112 <2e-16 ***
## Residuals 10745 16416 1.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we use ANOVA to compare the group means represented in the regression above. The null hypothesis is that the group means are all the same. The fitted means model is passed to the anova() function, which outputs the degrees of freedom for the test, the Sum of Squares (the sum of squared differences of the group means from the overall mean), the Mean Square (the Sum of Squares divided by the degrees of freedom), and the F value (\( MS_A/MS_\epsilon \), where \( A \) denotes the treatment and \( \epsilon \) the error).
Analyzing the output, we see that, yes, the mean JScore differs across the countries.
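As a quick arithmetic check of the table above, the F statistic is just the ratio of the two mean squares:
ms.country <- 7025 / 41  # treatment mean square, roughly 171.3
ms.resid <- 16416 / 10745  # residual mean square, roughly 1.5
ms.country / ms.resid  # roughly 112, matching the F value reported above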
Challenge 4B: Sum Coding of Categorical Variables
In this iteration we will change the form of our regression equation, coding the independent variables so that the country effects sum to zero. To accomplish this, we repurpose one country's column as the intercept and code the rows for divers from that country as \( -1 \) in the remaining country columns.
In this instance let \( \mu_i \) be the mean judge score for divers from country \( i \). With sum coding, \( \theta_0 + \theta_i = \mu_i \), and \( \theta_0 \) represents the average of the per-country average JScores. Formally, the null hypothesis for each test is that the coefficient \( \theta_i \) is zero; given the interpretation of the intercept, this amounts to testing that divers from country \( i \) have JScores no different from the average JScore across countries.
Consider our model matrix \( X \), which is \( m \times 42 \). Column 42 will be our intercept; in the remaining columns, \( x_{ij} = -1 \) if diver \( i \) is from Zimbabwe, \( x_{ij} = 1 \) if diver \( i \) is from country \( j \), and \( 0 \) otherwise.
y <- d$JScore
cols <- length(unique(d$Country))
countries <- sort(unique(d$Country))
mm <- matrix(nrow = length(y), ncol = cols)
colnames(mm) <- countries
# Build the model matrix: one indicator column per country
for (i in 1:cols) {
  mm[, i] <- (d$Country == countries[i])
}
# Code the Zimbabwe rows as -1 across the country columns
mm[d$Country == "ZIM", ] <- -1
# Repurpose the last (ZIM) column as the intercept
mm[, 42] <- 1
So we have coded the rows for the 42 Zimbabwe divers as \( -1 \) in the country columns and \( 1 \) in the intercept column, and the other 10563 rows with 1/0 indicators where appropriate. Now we will regress our indicator variables onto the judge scores using lsfit. Recall that, as we have provided an intercept in our model matrix, we must set intercept = FALSE.
lm.3 <- lsfit(mm, y, intercept = FALSE)
ls.print(lm.3)
## Residual Standard Error=1.236
## R-Square=0.9689
## F-statistic (df=42, 10745)=7958
## p-value=0
##
## Estimate Std.Err t-value Pr(>|t|)
## ARG -1.6267 0.2047 -7.9472 0.0000
## ARM -1.0028 0.1870 -5.3630 0.0000
## AUS 1.0619 0.0482 22.0364 0.0000
## AUT 0.2048 0.0929 2.2032 0.0276
## AZE -0.0148 0.1870 -0.0789 0.9371
## BLR 0.4108 0.1154 3.5605 0.0004
## BRA 0.1506 0.0896 1.6813 0.0927
## CAN 1.1992 0.0541 22.1856 0.0000
## CHN 1.9180 0.0447 42.8893 0.0000
## COL -0.3376 0.1120 -3.0133 0.0026
## CUB 0.2458 0.0718 3.4222 0.0006
## CZE -0.7528 0.1870 -4.0260 0.0001
## ESP 0.0023 0.0771 0.0299 0.9762
## FIN 0.2174 0.1328 1.6365 0.1018
## FRA -0.1316 0.0826 -1.5932 0.1111
## GBR 0.1229 0.0598 2.0565 0.0398
## GEO -0.2409 0.2047 -1.1771 0.2392
## GER 0.9719 0.0499 19.4805 0.0000
## GRE -0.6960 0.0896 -7.7704 0.0000
## HKG -1.5743 0.1870 -8.4188 0.0000
## HUN 0.2699 0.0814 3.3164 0.0009
## INA -1.7677 0.1154 -15.3196 0.0000
## ITA 0.5703 0.0726 7.8537 0.0000
## JPN 1.3500 0.0814 16.5888 0.0000
## KAZ 0.3656 0.0630 5.8018 0.0000
## KOR -0.3968 0.0989 -4.0139 0.0001
## MAS -0.2307 0.0880 -2.6215 0.0088
## MEX 0.6722 0.0615 10.9210 0.0000
## PER -0.2231 0.1328 -1.6794 0.0931
## PHI -0.6370 0.1386 -4.5951 0.0000
## PRK 0.4312 0.0611 7.0590 0.0000
## PUR -0.4093 0.1232 -3.3231 0.0009
## ROM -0.5786 0.0989 -5.8532 0.0000
## RUS 1.3830 0.0465 29.7348 0.0000
## SUI -1.0007 0.1386 -7.2180 0.0000
## SWE 1.4067 0.1191 11.8125 0.0000
## THA -1.1338 0.1328 -8.5354 0.0000
## TPE -1.0552 0.0929 -11.3537 0.0000
## UKR 0.5836 0.0581 10.0382 0.0000
## USA 1.2362 0.0455 27.1694 0.0000
## VEN -0.3062 0.0968 -3.1644 0.0016
## ZIM 6.2409 0.0180 346.8986 0.0000
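As an aside, the same sum-coded fit should be obtainable from lm() with sum-to-zero contrasts, which by default drop the last factor level (here ZIM); a sketch, noting that the coefficients come back labelled Country.f1, Country.f2, ... rather than by country code:
lm.sum <- lm(JScore ~ Country.f, data = d, contrasts = list(Country.f = "contr.sum"))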
which(countries == "FRA")
## [1] 15
Recall that the intercept is the 'mean of means', that is, the average of the per-country average JScores. We are therefore comparing France to the average dive scores across countries (as opposed to earlier, when we compared the US to Argentina).
The null hypothesis is that the mean JScore for a country is no different from the average JScore across countries. The p-value on France is large, so we fail to reject the null hypothesis and conclude that the French are more or less average in their diving: their mean score is not significantly different from the across-country average.
As a sanity check, the French coefficient is \( -0.1316 \), and \( \theta_0 + \theta_{15} = 6.2409 - 0.1316 = 6.1093 \), which is the mean JScore for French divers.
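Both quantities can be confirmed directly from the data (a quick sanity check; the first line should come out near 6.11 and the second near 6.24):
mean(d$JScore[d$Country == "FRA"])  # mean JScore for French divers
mean(tapply(d$JScore, d$Country, mean))  # the 'mean of means', i.e. the intercept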
Challenge 5: Zimbabwe
We can adjust the above construction for divers from Zimbabwe.
y <- d$JScore
cols <- length(unique(d$Country))
countries <- sort(unique(d$Country), decreasing = TRUE)
mm <- matrix(nrow = length(y), ncol = cols)
colnames(mm) <- countries
# Build the model matrix; with the reverse sort, ARG is now the last column
for (i in 1:cols) {
  mm[, i] <- (d$Country == countries[i])
}
# Code the Argentina rows as -1 across the country columns
mm[d$Country == "ARG", ] <- -1
# Repurpose the last (ARG) column as the intercept
mm[, 42] <- 1
lm.4 <- lsfit(mm, y, intercept = FALSE)
ls.print(lm.4)
## Residual Standard Error=1.236
## R-Square=0.9689
## F-statistic (df=42, 10745)=7958
## p-value=0
##
## Estimate Std.Err t-value Pr(>|t|)
## ZIM -0.6576 0.1870 -3.5167 0.0004
## VEN -0.3062 0.0968 -3.1644 0.0016
## USA 1.2362 0.0455 27.1694 0.0000
## UKR 0.5836 0.0581 10.0382 0.0000
## TPE -1.0552 0.0929 -11.3537 0.0000
## THA -1.1338 0.1328 -8.5354 0.0000
## SWE 1.4067 0.1191 11.8125 0.0000
## SUI -1.0007 0.1386 -7.2180 0.0000
## RUS 1.3830 0.0465 29.7348 0.0000
## ROM -0.5786 0.0989 -5.8532 0.0000
## PUR -0.4093 0.1232 -3.3231 0.0009
## PRK 0.4312 0.0611 7.0590 0.0000
## PHI -0.6370 0.1386 -4.5951 0.0000
## PER -0.2231 0.1328 -1.6794 0.0931
## MEX 0.6722 0.0615 10.9210 0.0000
## MAS -0.2307 0.0880 -2.6215 0.0088
## KOR -0.3968 0.0989 -4.0139 0.0001
## KAZ 0.3656 0.0630 5.8018 0.0000
## JPN 1.3500 0.0814 16.5888 0.0000
## ITA 0.5703 0.0726 7.8537 0.0000
## INA -1.7677 0.1154 -15.3196 0.0000
## HUN 0.2699 0.0814 3.3164 0.0009
## HKG -1.5743 0.1870 -8.4188 0.0000
## GRE -0.6960 0.0896 -7.7704 0.0000
## GER 0.9719 0.0499 19.4805 0.0000
## GEO -0.2409 0.2047 -1.1771 0.2392
## GBR 0.1229 0.0598 2.0565 0.0398
## FRA -0.1316 0.0826 -1.5932 0.1111
## FIN 0.2174 0.1328 1.6365 0.1018
## ESP 0.0023 0.0771 0.0299 0.9762
## CZE -0.7528 0.1870 -4.0260 0.0001
## CUB 0.2458 0.0718 3.4222 0.0006
## COL -0.3376 0.1120 -3.0133 0.0026
## CHN 1.9180 0.0447 42.8893 0.0000
## CAN 1.1992 0.0541 22.1856 0.0000
## BRA 0.1506 0.0896 1.6813 0.0927
## BLR 0.4108 0.1154 3.5605 0.0004
## AZE -0.0148 0.1870 -0.0789 0.9371
## AUT 0.2048 0.0929 2.2032 0.0276
## AUS 1.0619 0.0482 22.0364 0.0000
## ARM -1.0028 0.1870 -5.3630 0.0000
## ARG 6.2409 0.0180 346.8986 0.0000
In this case, we see that divers from Zimbabwe score noticeably below average, and the p-value is significant. We reject the null hypothesis that the Zimbabwe coefficient is zero and conclude that the mean JScore for divers from Zimbabwe differs from the across-country average.
A good check across the two models is to make sure that the coefficients on the intercepts match, which they do. Also, to save work, we could have used this model for both France and Zimbabwe.
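Alternatively, because the sum-coded effects are constrained to sum to zero, the Zimbabwe effect can be recovered from lm.3 without refitting; a sketch (in lm.3 the column named ZIM actually holds the intercept, so the country effects are the other 41 coefficients):
# Negative sum of the other country effects gives the ZIM effect, roughly -0.66
-sum(lm.3$coefficients[names(lm.3$coefficients) != "ZIM"])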