Since the observed values for y vary about their means μy, the multiple regression model includes a term for this variation. In words, the model is expressed as DATA = FIT + RESIDUAL, where the "FIT" term represents the expression β0 + β1x1 + β2x2 + ... + βpxp. The "RESIDUAL" term represents the deviations of the observed values y from their means μy, which are normally distributed with mean 0 and variance σ². The notation for the model deviations is ε.
Formally, the model for multiple linear regression, given n observations, is

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + εi    for i = 1, 2, ..., n.
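To make the formula concrete, the short Python sketch below generates data from a two-predictor version of this model. It is an illustration only, not part of the cereal analysis: the coefficient values, sample size, and error standard deviation are arbitrary choices made for the example.

import numpy as np

# Simulate n observations from y = b0 + b1*x1 + b2*x2 + error,
# where the errors are independent N(0, sigma^2) deviations.
rng = np.random.default_rng(0)
n = 77                                   # sample size (illustrative)
beta = np.array([60.0, -3.0, -2.0])      # assumed beta0, beta1, beta2
sigma = 9.0                              # assumed error standard deviation

x1 = rng.uniform(0, 10, n)               # hypothetical predictor values
x2 = rng.uniform(0, 15, n)
X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept column
y = X @ beta + rng.normal(0, sigma, n)     # DATA = FIT + RESIDUAL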
In the least-squares model, the best-fitting line for the observed data is calculated by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). Because the deviations are first squared, then summed, there are no cancellations between positive and negative values. The least-squares estimates b0, b1, ... bp are usually computed by statistical software.
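As a rough sketch of this computation, using a small made-up dataset rather than the cereal data, the least-squares estimates can be obtained in Python with a single call that minimizes the sum of squared vertical deviations:

import numpy as np

# Tiny illustrative dataset: two predictors, five observations (made-up numbers).
X = np.array([[1.0,  6.0, 2.0],
              [1.0,  8.0, 0.0],
              [1.0,  5.0, 3.0],
              [1.0, 12.0, 1.0],
              [1.0,  3.0, 4.0]])          # first column of ones carries the intercept b0
y = np.array([34.0, 29.5, 40.1, 18.0, 50.8])

# np.linalg.lstsq minimizes the sum of squared vertical deviations ||y - Xb||^2.
b, sse, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(b)                                  # least-squares estimates b0, b1, b2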
The values fit by the equation b0 + b1xi1 + ... + bpxip are denoted ŷi, and the residuals ei are equal to yi - ŷi, the difference between the observed and fitted values.
The sum of the residuals is equal to zero.
The variance σ² may be estimated by s² = Σei²/(n - p - 1), also known as the mean-squared error (or MSE). The estimate of the standard error s is the square root of the MSE.
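A minimal Python sketch of these definitions, again using made-up numbers rather than the cereal data, computes the fitted values, residuals, MSE, and s directly:

import numpy as np

# Made-up data with p = 2 predictors and n = 6 observations.
X = np.array([[1.0,  6.0, 2.0], [1.0,  8.0, 0.0], [1.0, 5.0, 3.0],
              [1.0, 12.0, 1.0], [1.0,  3.0, 4.0], [1.0, 7.0, 2.5]])
y = np.array([34.0, 29.5, 40.1, 18.0, 50.8, 33.2])
n, p = X.shape[0], X.shape[1] - 1

b = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares estimates
y_hat = X @ b                              # fitted values
e = y - y_hat                              # residuals e_i = y_i - y_hat_i
print(e.sum())                             # approximately 0, up to rounding error
mse = (e ** 2).sum() / (n - p - 1)         # s^2 = sum(e_i^2) / (n - p - 1)
s = np.sqrt(mse)                           # estimate of the standard error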
A simple linear regression model considering "Sugars" as the explanatory variable and "Rating"
as the response variable produced the regression line
Rating = 59.3 - 2.40 Sugars, with the square of the correlation
r² = 0.577 (see Inference in Linear Regression for more
details on this regression).
The "Healthy Breakfast" dataset includes several other variables, including grams of fat per serving and grams of dietary fiber per serving. Is the model significantly improved when these variables are included?
Suppose we are first interested in adding the "Fat" variable. The correlation between "Fat" and "Rating" is equal to -0.409, while the correlation between "Sugars" and "Fat" is equal to 0.271. Since "Fat" and "Sugars" are not highly correlated, the addition of the "Fat" variable may significantly improve the model.
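These correlations are simple to compute in any statistics package. The sketch below assumes the cereal data are available in a CSV file with columns named "Rating", "Sugars", and "Fat"; the file name is hypothetical.

import pandas as pd

cereal = pd.read_csv("cereal.csv")             # hypothetical file and column names
print(cereal["Fat"].corr(cereal["Rating"]))    # correlation of Fat with the response
print(cereal["Sugars"].corr(cereal["Fat"]))    # correlation between the two predictors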
The MINITAB "Regress" command produced the following results:
Regression Analysis

The regression equation is
Rating = 61.1 - 3.07 Fat - 2.21 Sugars

Predictor      Coef    StDev       T      P
Constant     61.089    1.953   31.28  0.000
Fat          -3.066    1.036   -2.96  0.004
Sugars      -2.2128   0.2347   -9.43  0.000

S = 8.755   R-Sq = 62.2%   R-Sq(adj) = 61.2%

After fitting the regression line, it is important to investigate the residuals to determine whether or not they appear to fit the assumption of a normal distribution. A normal quantile plot of the standardized residuals y - ŷ provides a check of this assumption.

The MINITAB output provides a great deal of information. Under the equation for the regression line, the output provides the least-squares estimates for each parameter, listed in the "Coef" column next to the variable to which it corresponds. The calculated standard deviations are provided in the second column.
The test statistic t is equal to bj/sbj, the parameter estimate divided by its standard deviation. This value follows a t(n-p-1) distribution when p variables are included in the model.
In the example above, the parameter estimate for the "Fat" variable is -3.066 with standard deviation 1.036. The test statistic is t = -3.066/1.036 = -2.96, provided in the "T" column of the MINITAB output. For a two-sided test, the probability of interest is 2P(T > |-2.96|) for the t(77-2-1) = t(74) distribution, which is about 0.004. The "P" column of the MINITAB output provides the P-value associated with the two-sided test. Since the P-values for both "Fat" and "Sugars" are highly significant, both variables may be included in the model.
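The arithmetic behind the "Fat" row of the output can be reproduced directly. The sketch below assumes n = 77 cereals and p = 2 predictors, as stated above:

from scipy import stats

b_fat, se_fat = -3.066, 1.036
t_stat = b_fat / se_fat                       # about -2.96
df = 77 - 2 - 1                               # n - p - 1 = 74 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df)     # two-sided P(|T| > |t|), about 0.004
print(t_stat, p_value)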
Continuing with the "Healthy Breakfast" example, suppose we choose to add the "Fiber" variable to our model. The MINITAB results are the following:
Regression Analysis

The regression equation is
Rating = 53.4 - 3.48 Fat + 2.95 Fiber - 1.96 Sugars

Predictor      Coef    StDev       T      P
Constant     53.437    1.342   39.82  0.000
Fat         -3.4802   0.6209   -5.61  0.000
Fiber        2.9503   0.2549   11.57  0.000
Sugars      -1.9640   0.1420  -13.83  0.000

S = 5.235   R-Sq = 86.7%   R-Sq(adj) = 86.1%

The squared multiple correlation R² is now equal to 0.867, and all of the variables are significant by the t tests. Examination of the residuals indicates no unusual patterns. The inclusion of the "Fat," "Fiber," and "Sugars" variables explains 86.7% of the variability of the data, a significant improvement over the smaller models.
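For readers working outside MINITAB, the same three-variable fit could be reproduced with ordinary least squares in Python. The sketch below again assumes the hypothetical cereal.csv file and column names used earlier, and uses the statsmodels OLS routine as a stand-in for the "Regress" command:

import pandas as pd
import statsmodels.api as sm

cereal = pd.read_csv("cereal.csv")          # hypothetical file and column names
X = sm.add_constant(cereal[["Fat", "Fiber", "Sugars"]])   # add the intercept column
model = sm.OLS(cereal["Rating"], X).fit()
print(model.params)        # least-squares estimates b0, b_Fat, b_Fiber, b_Sugars
print(model.rsquared)      # R-Sq, to compare with the 86.7% reported above
print(model.summary())     # coefficients, standard errors, t statistics, and P-values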
For additional tests and a continuation of this example, see ANOVA for Multiple Linear Regression.