\documentclass[12pt]{book}
\usepackage{boxedminipage}
\input 312-fall2016.sty

\newcommand{\bhat}{\widehat b}
\newcommand{\chat}{\widehat c}
\newcommand{\dhat}{\widehat d}
\newcommand{\yhat}{\widehat y}
\newcommand{\that}{\widehat t}
\newcommand{\ybar}{\overline y}
\newcommand{\zbar}{\overline z}
\def\Rlang/{{\bfseries R}}

\begin{document}
\setcounter{chapter}{0}
\chapter{Least squares}
\today
\TOC
\bigskip
\copyright David Pollard 2016

\section{Minimize sums of squares}

The least squares method was first proposed in the early years of the
19th century (or maybe earlier---\citealp[Chapter~1]{Stigler86book}).
Statisticians often think of it as just a way to fit a statistical model,
even though it makes sense without any mention of randomness or a model.

As a problem in computation, least squares starts from
$y=[y_1,\dots,y_n]$, a column vector of numbers (an element of~$\RR^n$),
and some finite set of ``predictors'', column vectors
$x_j=[x_{1j},\dots,x_{nj}]$ for $j=1,\dots, p$. We seek numbers
$b_1,\dots, b_p$ to minimize the sum of squared differences:
\beqN
\SUM_{i\le n}\left( y_i -\SUM_{j\le p}x_{ij}b_j\right)^2 .
\eeqN
If we write $\norm{z}=\Sqrt{z_1^2+\dots+z_n^2}$ for the length of a
vector~$z=[z_1,\dots,z_n]$ in~$\RR^n$, then the task can be re-expressed as:
\beq\label{ls.problem}
\text{minimize $\norm{y-x_1b_1-\dots-x_pb_p}^2$ \wrt/ $b$}.
\eeq
That is, we seek a linear combination of the vectors $x_1,\dots,x_p$ that
best approximates~$y$, in the sense of minimizing the usual Euclidean
distance. More compactly, we can take the $x_j$'s as the columns of an
$n\times p$ matrix~$X$ and think of $b=[b_1,\dots,b_p]$ as a vector
in~$\RR^p$, then minimize $\norm{y-Xb}^2$. We could also think
of~$y_1,\dots,y_n$ as measurements on each of~$n$ individuals and regard
the $i$th row, $w_i'$, of~$X$ as providing auxiliary information about
individual~$i$, then minimize $ \SUM_{i\le n}\left( y_i - w_i'b\right)^2 $.

\Remark
If the last paragraph gave you a headache then you might want to look at
the first two chapters of the \citet{Axler2015LADR} book. It is pretty
painful trying to learn about linear models if one does not know any
linear algebra.
\endRemark

It is not hard to find a set of linear equations for~$b$ whose solution
achieves the minimum. The solution is not unique unless the vectors
$x_1,\dots,x_p$ are linearly independent, that is, if none of them can be
written as a linear combination of the others
(cf. \citealp[Section~2A]{Axler2015LADR}).

In some simple cases it is possible to write out, without recourse to
matrices, a closed form expression for the solution. It used to be a
popular source of pleasure for statistical sadists to make students
memorize that expression, despite the fact that computers are much better
than humans at memorizing and applying formulas.

Using~\Rlang/ it is particularly easy to find least squares solutions.
Here is a fake example:
\bgroup\small
<<>>=
set.seed(0)   # gardening time
xx <- matrix(1:20,ncol=2)   # Create a 10 by 2 matrix
xx[1:3,]   # first three rows of xx
# The columns of xx are known as xx[,1] and xx[,2] to R.
yy <- rnorm(10)   # just noise
@
\egroup
\noindent
Now solve.
\bgroup\small
<<>>=
out1 <- lm(yy ~ -1 + xx)   # solve the least squares problem,
                           # putting lots of info into a new object
names(out1)   # lots of good stuff in here
# For example,
out1$coefficients   # too many decimal places
@
\egroup
\noindent
In this case the minimizing value of~$b$ is
$\bhat=$ [\Sexpr{round(out1$coefficients,2)}].
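
If you would like to see the linear algebra behind that answer, here is a
small sketch (not evaluated here; the name \texttt{bhat2} is introduced
only for this illustration). It solves the ``normal equations''
$X'Xb=X'y$, which reappear later in these notes, and compares the result
with the coefficients reported by \texttt{lm()}.
\bgroup\small
<<eval=FALSE>>=
# A sketch: solve the normal equations t(X) X b = t(X) y directly.
# (bhat2 is a new name, used only for this illustration.)
bhat2 <- solve(t(xx) %*% xx, t(xx) %*% yy)
round(cbind(bhat2, out1$coefficients), 4)   # the two answers should agree
@
\egroup
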
\Remark
The $-1$ in the \texttt{lm()} call stops \Rlang/ from trying to be
helpful by adding a vector of~$1$'s as a third column to~$X$, even though
that is often a good idea. (It adds an intercept term to a regression.)
If you want to see the effect, try:
{\small
<<>>=
# summary(lm(yy ~ -1 + xx))
# summary(lm(yy ~ xx))
@
}
\noindent
without the comment characters (\#) at the start of the lines. You will
see that~\Rlang/ has included an intercept term, by adding a column of
$1$'s to the $X$ matrix, in the second case. Don't worry about the
details. You'll learn in a week or so what it all means.
\endRemark

Usually it is better to keep all the data together in one object, a
``data frame'':
{\small
<<>>=
mydata <- data.frame(y=yy,xx)
mydata[1:3,]
out2 <- lm(y ~ -1 + X1 + X2, data=mydata)
# out2 <- lm(y ~ -1 + . , data=mydata) does the same thing
round(out2$coeff,4)
@
}

\section{Geometry}

The vectors $x_1,\dots,x_p$ span some subspace~$\xx$ of~$\RR^n$. (That
is, the set of all vectors of the form $x_1b_1+\dots +x_pb_p$ is a vector
subspace of~$\RR^n$.) The least squares method minimizes $\norm{y-x}$
as~$x$ runs through all vectors in~$\xx$. The minimizing vector in~$\xx$
is often denoted by~$\yhat$. It is sometimes called the \newdef{fitted
vector}. As you will see, it is easy to find~$\yhat$ if we work in a good
coordinate system. We then have the task of writing~$\yhat$ as a linear
combination~$\bhat_1x_1+\dots+\bhat_px_p$, which also turns out to be
easy if we choose a good coordinate system.

If the $x_j$'s are linearly independent
\cite[Section~2.17]{Axler2015LADR} then the subspace~$\xx$ has
\newdef{dimension}~$p$ (and~$X$ has \newdef{rank}~$p$; the matrix is of
\newdef{full rank}). If at least one of the $x_j$'s is a linear
combination of some of the others then~$\xx$ has a dimension smaller
than~$p$ (and~$X$ is not of full rank).

\Remark
There are many linear models for which it makes sense to choose linearly
dependent~$x_j$'s. Stay tuned for ANOVA.
\endRemark

\citet[Chapter~6]{Axler2015LADR} would come in handy if the next
paragraph intimidates you.

Remember that $\RR^n$ can be equipped with an inner product (also known
as a dot product). For column vectors $u=[u_1,\dots,u_n]$
and~$v=[v_1,\dots,v_n]$,
\beqN
\inner{u}{v} = u\cdot v = u'v=\SUM_{i\le n} u_i v_i.
\eeqN
In particular, $\norm{u}=\Sqrt{\inner u u}$. A vector~$u$ is said to be a
unit vector if~$\norm{u}=1$. Two vectors, $u$ and~$v$, are said to be
orthogonal if $\inner{u}{v} = 0$.

A set of vectors $\{q_1,\dots, q_m\}$ is said to be an
\newdef{orthonormal basis} for a subspace~$\xx$ if:
\begin{enumerate}
\item Each $q_i$ is a unit vector and $\inner{q_i}{q_{j}}=0$ for
all~$i\ne j$.
\item Each vector~$x$ in~$\xx$ can be written as a linear combination of
$q_1,\dots,q_m$.
\end{enumerate}
The second property means that, for each $x$ in~$\xx$, there exist
numbers $t_1,\dots,t_m$ such that
\beqN
x= q_1t_1+\dots +q_mt_m .
\eeqN
The coefficients $t_j$ are determined by
\beqN
\inner{x}{q_j} = \inner{q_1t_1+\dots +q_mt_m}{q_j}
= t_1\inner{q_1}{q_j}+ \dots + t_j\inner{q_j}{q_j} + \dots
+ t_m\inner{q_m}{q_j}.
\eeqN
In the last sum, $\inner{q_j}{q_j}=1$ (because $q_j$ is a unit vector)
and the other inner products are zero (by orthogonality). Thus
$t_j = \inner{x}{q_j}$.

The key fact is:

\begin{boxedminipage}[t]{5in}
For each subspace~$\xx$ of~$\RR^n$ of dimension~$m$ there exists an
orthonormal basis $\{q_j: 1\le j\le n\}$ (not unique) for~$\RR^n$, such
that $q_1,\dots,q_m$ is an orthonormal basis for~$\xx$.
The remaining vectors, $q_{m+1},\dots, q_n$, provide an orthonormal basis
for the subspace
\beqN
\xx^\perp =\{z\in \RR^n: \inner xz=0\text{ for all $x$ in $\xx$}\},
\eeqN
the orthogonal complement of~$\xx$.
\end{boxedminipage}
\bigskip

It is a trivial task to find~$\yhat$ if $y$ and $x$ are expressed in this
orthonormal basis.
\bAlignL\label{new.onb}
y &= s_1q_1+\dots +s_mq_m + \left(s_{m+1}q_{m+1}+\dots +s_nq_n\right) \\
x &= t_1q_1+\dots +t_mq_m \notag\\
\norm{y-x}^2 &= (s_1-t_1)^2+ \dots + (s_m-t_m)^2
  + \left(s_{m+1}^2 +\dots + s_n^2\right)
\notag\eAlignL
Clearly the solution is given by $t_i=s_i$ for $1\le i \le m$. The
\newdef{fitted vector}~$\yhat$ equals~$s_1q_1+\dots +s_mq_m$. The
\newdef{residual vector} is defined as
$r=y-\yhat=s_{m+1}q_{m+1}+\dots +s_nq_n$, an element of~$\xx^\perp$. The
fitted vector and the residual vector are orthogonal. The
decomposition~$y=\yhat+r$ is the unique representation of~$y$ as a sum of
a vector in~$\xx$ and a vector in~$\xx^\perp$. The fitted vector is also
called the \newdef{orthogonal projection} of~$y$ onto~$\xx$ and the
residual is called the projection of~$y$ orthogonal to~$\xx$ (or the
orthogonal projection of~$y$ onto~$\xx^\perp$).

\Example\label{2d}
Most introductory courses discuss the problem of fitting a straight line
by least squares: We have vectors $y$ and $z=[z_1,\dots,z_n]$ in~$\RR^n$;
we seek constants~$c$ and~$d$ to minimize
$\SUM_{i\le n}(y_i-c-d z_i)^2$. The problem fits into the framework
of~(\ref{ls.problem}) with~$x_1=\bone$, a column of~$1$'s,
and~$x_2=z$---that is,~$X=(\bone,z)$---and~$b$ equal to the column
vector~$[c,d]$.

For the $q_i$'s choose $q_1$ as a unit vector in the direction of~$x_1$
and~$q_2$ as a unit vector in the direction
of~$x_2-\inner{x_2}{q_1}q_1$, which is orthogonal to~$q_1$. Note
that~$\inner{x_2}{q_1}=n^{-1/2}\SUM_i z_i= \zbar\Sqrt{n}$, so that
\bAlign
q_1 &= n^{-1/2}\bone \\
q_2 &= \frac{x_2-\inner{x_2}{q_1}q_1}{\norm{x_2-\inner{x_2}{q_1}q_1}}
  = \frac{z-\zbar\bone}{\ell}
  \qt{where }\ell := \Sqrt{\SUM_i(z_i-\zbar)^2}.
\eAlign
We do not need to determine $q_3,\dots,q_n$ explicitly in order to
determine the least squares fit. Indeed, with
\beqN
s_1=\inner{y}{q_1}= n^{1/2}\ybar
\AND/
s_2 = \inner{y}{q_2} = \SUM_i y_i(z_i-\zbar)/\ell
\eeqN
we have
\bAlign
\yhat& = s_1q_1 + s_2q_2 = \chat \bone +\dhat z \\
&\qt{where }\dhat = \frac{\SUM_i y_i(z_i-\zbar)}{\SUM_i(z_i-\zbar)^2}
\AND/ \chat=\ybar - \dhat \zbar.
\eAlign

\Remark
Obviously something goes wrong if $\SUM_i(z_i-\zbar)^2=0$. That happens
only when~$z$ is a constant multiple of~$\bone$, which means that~$z$
and~$\bone$ are linearly dependent and the coefficients~$c$ and~$d$ are
not uniquely determined. Put another way, the rank of the matrix~$X$ and
the dimension of the subspace~$\xx$ both equal~$1$. For uniquely
determined~$c$ and~$d$ we need the rank and the dimension equal to~$2$.
\endRemark

In matrix notation, provided~$z$ is not a constant multiple of~$\bone$,
\bAlignL\label{Q_X}
Q=(q_1,q_2) &= (\bone ,z)W = XW \\
\qt{where } W &= \pMatrix w_{11}& w_{12} \\ 0& w_{22} \endpMatrix
  := \pMatrix n^{-1/2}&-\zbar/\ell \\ 0&1/\ell \endpMatrix .
\notag
\eAlignL
The zero in the south-west corner makes it easy to find the inverse
matrix~$R=W^{-1}$. We want
\beqN
\pMatrix 1& 0 \\ 0& 1 \endpMatrix = W R
= \pMatrix w_{11}r_{11}+w_{12}r_{21}& w_{11}r_{12} + w_{12}r_{22} \\
  w_{22}r_{21}& w_{22}r_{22} \endpMatrix ,
\eeqN
with solution
\bAlign
r_{22}&= 1/w_{22} = \ell \\
r_{21} &= 0/w_{22} =0 \\
r_{11}&= (1-0)/w_{11}= n^{1/2} \\
r_{12} &= -w_{12}r_{22}/w_{11} = n^{1/2}\zbar.
\eAlign
That is,
\beqN
R= \pMatrix n^{1/2}&n^{1/2}\zbar \\ 0& \ell \endpMatrix
\eeqN
and $X= XW W^{-1}=QR$.

\Remark
If I had started with a much larger \newdef{upper triangular matrix}, a
square matrix with zeros everywhere below the diagonal, you would have
seen clearly the pattern that gives the method the name
\newdef{back-substitution} \cite[Section~3.1]{GolubVL2013}. \Rlang/ knows
how to back-substitute; you do not have to go through the calculations.
\endRemark

The least squares calculations can be rewritten as
\beqN
QR\bhat= X\bhat = \yhat = \inner{y}{q_1}q_1 + \inner{y}{q_2}q_2 = QQ'y,
\eeqN
so that $\bhat = R^{-1} Q'y$. (Orthonormality gives $Q'Q=I_2$.) The
$n\times n$ matrix~$H=QQ'$ is often called the \newdef{hat matrix}.
(Where do you think that name came from?) It is the matrix for orthogonal
projection onto the~$\xx$ subspace. Also
\beqN
\bhat =R^{-1}Q'y = (R'R)^{-1}(QR)'y = (X'X)^{-1}X'y.
\eeqN
The last equality gives the traditional textbook solution to the
``normal equations'', $X'(y-X\bhat)= X'r=0$. Unfortunately this solution
makes no sense when~$X$ is not of full rank. (The matrix~$R$ would be
singular, that is, neither~$R$ nor~$X'X$ has an inverse.)
\endExample

\section{QR decompositions and least squares}

The method described in Example~\ref{2d} works more generally. For
$n\ge p$, every $n\times p$ matrix~$X=(x_1,\dots,x_p)$ can be written in
the form $X=QR$, where~$Q=(q_1,\dots,q_p)$ is an $n\times p$ matrix whose
columns are orthogonal unit vectors and~$R$ is a $p\times p$ upper
triangular matrix (the elements below the diagonal are all zero):
\beqN
R = \pMatrix
r_{11}&r_{12}&r_{13}&\dots&r_{1p} \\
0&r_{22}&r_{23}&\dots&r_{2p} \\
0&0&r_{33}&\dots&r_{3p} \\
\vdots \\
0&0&0&\dots&r_{pp}
\endpMatrix
\eeqN
That is,
\bAlignL\LABEL{x.from.q}
x_1&= q_1 r_{11} \notag\\
x_2&= q_1 r_{12} + q_2 r_{22} \notag\\
\vdots \notag\\
x_p&= q_1 r_{1p} + q_2 r_{2p} + \dots + q_p r_{pp}
\eAlignL
Such a representation could be constructed by an analog of the procedure
in Example~\ref{2d} (the Gram-Schmidt procedure) or, as \Rlang/ does, by
an elegant method using Householder reflections
\cite[page~9.13]{LINPACK}. The diagonal elements $r_{jj}$ for
$1\le j\le p$ are all nonzero if and only if the matrix~$X$ has full
rank. In that case, the $q_j$'s can all be expressed (uniquely) as linear
combinations of the~$x_j$'s and $\{q_1,\dots,q_p\}$ is an orthonormal
basis for~$\xx$.

When $X$ is not of full rank, \Rlang/ can still make a least squares fit,
but you need to be careful in interpreting the output. Here is an example
where I ensure complications by deliberately making the third column
of~$X$ a linear combination of two other columns; \Rlang/ then makes
things even worse (so to speak) by adding in a column of ones.
<<>>=
mydata3 <- data.frame(y=yy, x1=1:10, x2=11:20, x3=0.5*(1:10)-3*(11:20))
out3 <- lm(y ~ ., data=mydata3)
summary(out3)
@

What happened? \Rlang/ figured out that (within some level of numerical
accuracy):~$x_2$ and~$x_3$ are linear combinations of~$\bone$ and~$x_1$;
the matrix $X=(\bone, x_1,x_2,x_3)$ has rank~$2$; the subspace~$\xx$ has
dimension~$2$, not~$4$; \Rlang/ only needs two orthogonal unit vectors to
span~$\xx$; but it still gave you four orthogonal unit vectors because it
thought that's what you wanted.
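
Before digging into what \Rlang/ stored for this rank-deficient fit, here
is a sketch of the full-rank recipe $\bhat=R^{-1}Q'y$ applied to the
matrix \texttt{xx} from the first section. The chunk is not evaluated
here and the names (\texttt{qrx}, \texttt{bhat3}, and so on) are
introduced only for this illustration; \texttt{backsolve()} carries out
the back-substitution mentioned above.
{\small
<<eval=FALSE>>=
qrx <- qr(xx)          # Householder QR of the full-rank 10 by 2 matrix
Q <- qr.Q(qrx)         # 10 by 2, columns are orthogonal unit vectors
R <- qr.R(qrx)         # 2 by 2, upper triangular
bhat3 <- backsolve(R, t(Q) %*% yy)    # solve R b = Q'y by back-substitution
round(cbind(bhat3, out1$coefficients), 4)   # should agree with the fit in out1
@
}
\noindent
Now back to the rank-deficient fit stored in \texttt{out3}:
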
{\small
<<>>=
out3$rank
round(qr.R(out3$qr),2)   # notice the zeros on the diagonal
Q1 <- qr.Q(out3$qr)
round(Q1,2)[1:2,]
Q2 <- Q1[,1:2]           # just the first two columns, 10 by 2
HAT <- Q2 %*% t(Q2)      # a 10 by 10 matrix
round(HAT[1:2,],2)
thefit <- HAT %*% mydata3$y   # a 10 by 1 matrix
round( rbind(t(thefit), out3$fit), 3)
@
}

The hat matrix $H = Q_2Q_2'$ is symmetric and $HH=Q_2Q_2'Q_2Q_2' = H$. It
projects vectors in~$\RR^{10}$ onto the two-dimensional subspace~$\xx$
spanned by the columns of~$X$ or by the first two columns of the~$Q$
matrix.

In fact, \Rlang/ would be just as happy to give a complete orthonormal
basis for~$\RR^{10}$:
<<>>=
Q3 <- qr.Q(out3$qr, complete=TRUE)
round(Q1,2)[1:3,]
round(Q3,2)[1:3,]
@

\bibliographystyle{chicago}
\bibliography{/Users/pollard/Dropbox/texmf/bibtex/bib/DBP}
\end{document}