\documentclass[12pt]{book}
\usepackage{boxedminipage}
\input 312-fall2016.sty

\newcommand{\bhat}{\widehat b}
\newcommand{\chat}{\widehat c}
\newcommand{\dhat}{\widehat d}
\newcommand{\yhat}{\widehat y}
\newcommand{\that}{\widehat t}
\newcommand{\ybar}{\overline y}
\newcommand{\zbar}{\overline z}
\def\Rlang/{{\bfseries R}}

\begin{document}
\setcounter{chapter}{0}
\chapter{Least squares}
\today
\TOC
\bigskip
\copyright David Pollard 2016

\section{Minimize sums of squares}

The least squares method was first proposed in the early years of the
19th century (or maybe earlier---\citealp[Chapter~1]{Stigler86book}).
Statisticians often think of it as just a way to fit a statistical model,
even though it makes sense without any mention of randomness or a model.

As a problem in computation, least squares starts from
$y=[y_1,\dots,y_n]$, a column vector of numbers (an element of~$\RR^n$),
and some finite set of ``predictors'', column vectors
$x_j=[x_{1j},\dots,x_{nj}]$ for $j=1,\dots, p$. We seek numbers
$b_1,\dots, b_p$ to minimize the sum of squared differences:
\beqN
\SUM_{i\le n}\left( y_i -\SUM_{j\le p}x_{ij}b_j\right)^2 .
\eeqN
If we write $\norm{z}=\Sqrt{z_1^2+\dots+z_n^2}$ for the length of a
vector~$z=[z_1,\dots,z_n]$ in~$\RR^n$, then the task can be re-expressed as:
\beq\label{ls.problem}
\text{minimize $\norm{y-x_1b_1-\dots-x_pb_p}^2$ \wrt/ $b$}.
\eeq
That is, we seek a linear combination of the vectors $x_1,\dots,x_p$ that
best approximates~$y$, in the sense of minimizing the usual Euclidean
distance. More compactly, we can take the $x_j$'s as the columns of an
$n\times p$ matrix~$X$ and think of $b=[b_1,\dots,b_p]$ as a vector
in~$\RR^p$, then minimize $\norm{y-Xb}^2$. We could also think
of~$y_1,\dots,y_n$ as measurements on each of~$n$ individuals and regard
the $i$th row, $w_i'$, of~$X$ as providing auxiliary information about
individual~$i$, then minimize $ \SUM_{i\le n}\left( y_i - w_i'b\right)^2 $.

\Remark
If the last paragraph gave you a headache then you might want to look at
the first two chapters of the \citet{Axler2015LADR} book. It is pretty
painful trying to learn about linear models if one does not know any
linear algebra.
\endRemark

It is not hard to find a set of linear equations for~$b$ whose solution
achieves the minimum. The solution is not unique unless the vectors
$x_1,\dots,x_p$ are linearly independent, that is, if none of them can be
written as a linear combination of the others
(cf. \citealp[Section~2A]{Axler2015LADR}).

In some simple cases it is possible to write out, without recourse to
matrices, a closed form expression for the solution. It used to be a
popular source of pleasure for statistical sadists to make students
memorize that expression, despite the fact that computers are much better
than humans at memorizing and applying formulas.

Using~\Rlang/ it is particularly easy to find least squares solutions.
Here is a fake example:
\bgroup\small
<<>>=
set.seed(0)   # gardening time
xx <- matrix(1:20,ncol=2)   # Create a 10 by 2 matrix
xx[1:3,]   # first three rows of xx
# The columns of xx are known as xx[,1] and xx[,2] to R.
yy <- rnorm(10)   # just noise
@
\egroup
\noindent
Now solve.
\bgroup\small
<<>>=
out1 <- lm(yy ~ -1 + xx)   # solve the least squares problem,
                           # putting lots of info into a new object
names(out1)   # lots of good stuff in here
# For example,
out1$coefficients   # too many decimal places
@
\egroup
\noindent
In this case the minimizing value of~$b$ is
$\bhat=$ [\Sexpr{round(out1$coefficients,2)}].
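
If you would like to see the linear algebra behind that answer, here is a
small sketch (not evaluated here; the name \texttt{bhat2} is introduced
only for this illustration). It solves the ``normal equations''
$X'Xb=X'y$, which reappear later in these notes, and compares the result
with the coefficients reported by \texttt{lm()}.
\bgroup\small
<<eval=FALSE>>=
# A sketch: solve the normal equations t(X) X b = t(X) y directly.
# (bhat2 is a new name, used only for this illustration.)
bhat2 <- solve(t(xx) %*% xx, t(xx) %*% yy)
round(cbind(bhat2, out1$coefficients), 4)   # the two answers should agree
@
\egroup
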
\Remark
The $-1$ in the \texttt{lm()} call stops \Rlang/ from trying to be
helpful by adding a vector of~$1$'s as a third column to~$X$, even though
that is often a good idea. (It adds an intercept term to a regression.)
If you want to see the effect, try:
{\small
<<>>=
# summary(lm(yy ~ -1 + xx))
# summary(lm(yy ~ xx))
@
}
\noindent
without the comment characters (\#) at the start of the lines. You will
see that~\Rlang/ has included an intercept term, by adding a column of
$1$'s to the $X$ matrix, in the second case. Don't worry about the
details. You'll learn in a week or so what it all means.
\endRemark

Usually it is better to keep all the data together in one object, a
``data frame'':
{\small
<<>>=
mydata <- data.frame(y=yy,xx)
mydata[1:3,]
out2 <- lm(y ~ -1 + X1 + X2, data=mydata)
# out2 <- lm(y ~ -1 + . , data=mydata) does the same thing
round(out2$coeff,4)
@
}

\section{Geometry}

The vectors $x_1,\dots,x_p$ span some subspace~$\xx$ of~$\RR^n$. (That
is, the set of all vectors of the form $x_1b_1+\dots +x_pb_p$ is a vector
subspace of~$\RR^n$.) The least squares method minimizes $\norm{y-x}$
as~$x$ runs through all vectors in~$\xx$. The minimizing vector in~$\xx$
is often denoted by~$\yhat$. It is sometimes called the \newdef{fitted
vector}. As you will see, it is easy to find~$\yhat$ if we work in a good
coordinate system. We then have the task of writing~$\yhat$ as a linear
combination~$\bhat_1x_1+\dots+\bhat_px_p$, which also turns out to be
easy if we choose a good coordinate system.

If the $x_j$'s are linearly independent
\cite[Section~2.17]{Axler2015LADR} then the subspace~$\xx$ has
\newdef{dimension}~$p$ (and~$X$ has \newdef{rank}~$p$; the matrix is of
\newdef{full rank}). If at least one of the $x_j$'s is a linear
combination of some of the others then~$\xx$ has a dimension smaller
than~$p$ (and~$X$ is not of full rank).

\Remark
There are many linear models for which it makes sense to choose linearly
dependent~$x_j$'s. Stay tuned for ANOVA.
\endRemark

\citet[Chapter~6]{Axler2015LADR} would come in handy if the next
paragraph intimidates you.

Remember that $\RR^n$ can be equipped with an inner product (also known
as a dot product). For column vectors $u=[u_1,\dots,u_n]$
and~$v=[v_1,\dots,v_n]$,
\beqN
\inner{u}{v} = u\cdot v = u'v=\SUM_{i\le n} u_i v_i.
\eeqN
In particular, $\norm{u}=\Sqrt{\inner u u}$. A vector~$u$ is said to be a
unit vector if~$\norm{u}=1$. Two vectors, $u$ and~$v$, are said to be
orthogonal if $\inner{u}{v} = 0$.

A set of vectors $\{q_1,\dots, q_m\}$ is said to be an
\newdef{orthonormal basis} for a subspace~$\xx$ if:
\begin{enumerate}
\item Each $q_i$ is a unit vector and $\inner{q_i}{q_{j}}=0$ for
all~$i\ne j$.
\item Each vector~$x$ in~$\xx$ can be written as a linear combination of
$q_1,\dots,q_m$.
\end{enumerate}
The second property means that, for each $x$ in~$\xx$, there exist
numbers $t_1,\dots,t_m$ such that
\beqN
x= q_1t_1+\dots +q_mt_m .
\eeqN
The coefficients $t_j$ are determined by
\beqN
\inner{x}{q_j} = \inner{q_1t_1+\dots +q_mt_m}{q_j}
= t_1\inner{q_1}{q_j}+ \dots + t_j\inner{q_j}{q_j} + \dots
+ t_m\inner{q_m}{q_j}.
\eeqN
In the last sum, $\inner{q_j}{q_j}=1$ (because $q_j$ is a unit vector)
and the other inner products are zero (by orthogonality). Thus
$t_j = \inner{x}{q_j}$.

The key fact is:

\begin{boxedminipage}[t]{5in}
For each subspace~$\xx$ of~$\RR^n$ of dimension~$m$ there exists an
orthonormal basis $\{q_j: 1\le j\le n\}$ (not unique) for~$\RR^n$, such
that $q_1,\dots,q_m$ is an orthonormal basis for~$\xx$.
The remaining vectors, $q_{m+1},\dots, q_n$, provide an orthonormal basis
for the subspace
\beqN
\xx^\perp =\{z\in \RR^n: \inner xz=0\text{ for all $x$ in $\xx$}\},
\eeqN
the orthogonal complement of~$\xx$.
\end{boxedminipage}
\bigskip

It is a trivial task to find~$\yhat$ if $y$ and $x$ are expressed in this
orthonormal basis.
\bAlignL\label{new.onb}
y &= s_1q_1+\dots +s_mq_m + \left(s_{m+1}q_{m+1}+\dots +s_nq_n\right) \\
x &= t_1q_1+\dots +t_mq_m \notag\\
\norm{y-x}^2 &= (s_1-t_1)^2+ \dots + (s_m-t_m)^2
  + \left(s_{m+1}^2 +\dots + s_n^2\right)
\notag\eAlignL
Clearly the solution is given by $t_i=s_i$ for $1\le i \le m$. The
\newdef{fitted vector}~$\yhat$ equals~$s_1q_1+\dots +s_mq_m$. The
\newdef{residual vector} is defined as
$r=y-\yhat=s_{m+1}q_{m+1}+\dots +s_nq_n$, an element of~$\xx^\perp$. The
fitted vector and the residual vector are orthogonal. The
decomposition~$y=\yhat+r$ is the unique representation of~$y$ as a sum of
a vector in~$\xx$ and a vector in~$\xx^\perp$. The fitted vector is also
called the \newdef{orthogonal projection} of~$y$ onto~$\xx$ and the
residual is called the projection of~$y$ orthogonal to~$\xx$ (or the
orthogonal projection of~$y$ onto~$\xx^\perp$).

\Example\label{2d}
Most introductory courses discuss the problem of fitting a straight line
by least squares: We have vectors $y$ and $z=[z_1,\dots,z_n]$ in~$\RR^n$;
we seek constants~$c$ and~$d$ to minimize
$\SUM_{i\le n}(y_i-c-d z_i)^2$. The problem fits into the framework
of~(\ref{ls.problem}) with~$x_1=\bone$, a column of~$1$'s,
and~$x_2=z$---that is,~$X=(\bone,z)$---and~$b$ equal to the column
vector~$[c,d]$.

For the $q_i$'s choose $q_1$ as a unit vector in the direction of~$x_1$
and~$q_2$ as a unit vector in the direction
of~$x_2-\inner{x_2}{q_1}q_1$, which is orthogonal to~$q_1$. Note
that~$\inner{x_2}{q_1}=n^{-1/2}\SUM_i z_i= \zbar\Sqrt{n}$, so that
\bAlign
q_1 &= n^{-1/2}\bone \\
q_2 &= \frac{x_2-\inner{x_2}{q_1}q_1}{\norm{x_2-\inner{x_2}{q_1}q_1}}
  = \frac{z-\zbar\bone}{\ell}
  \qt{where }\ell := \Sqrt{\SUM_i(z_i-\zbar)^2}.
\eAlign
We do not need to determine $q_3,\dots,q_n$ explicitly in order to
determine the least squares fit. Indeed, with
\beqN
s_1=\inner{y}{q_1}= n^{1/2}\ybar
\AND/
s_2 = \inner{y}{q_2} = \SUM_i y_i(z_i-\zbar)/\ell
\eeqN
we have
\bAlign
\yhat& = s_1q_1 + s_2q_2 = \chat \bone +\dhat z \\
&\qt{where }\dhat = \frac{\SUM_i y_i(z_i-\zbar)}{\SUM_i(z_i-\zbar)^2}
\AND/ \chat=\ybar - \dhat \zbar.
\eAlign

\Remark
Obviously something goes wrong if $\SUM_i(z_i-\zbar)^2=0$. That happens
only when~$z$ is a constant multiple of~$\bone$, which means that~$z$
and~$\bone$ are linearly dependent and the coefficients~$c$ and~$d$ are
not uniquely determined. Put another way, the rank of the matrix~$X$ and
the dimension of the subspace~$\xx$ both equal~$1$. For uniquely
determined~$c$ and~$d$ we need the rank and the dimension equal to~$2$.
\endRemark

In matrix notation, provided~$z$ is not a constant multiple of~$\bone$,
\bAlignL\label{Q_X}
Q=(q_1,q_2) &= (\bone ,z)W = XW \\
\qt{where } W &= \pMatrix w_{11}& w_{12} \\ 0& w_{22} \endpMatrix
  := \pMatrix n^{-1/2}&-\zbar/\ell \\ 0&1/\ell \endpMatrix .
\notag
\eAlignL
The zero in the south-west corner makes it easy to find the inverse
matrix~$R=W^{-1}$. We want
\beqN
\pMatrix 1& 0 \\ 0& 1 \endpMatrix = W R
= \pMatrix w_{11}r_{11}+w_{12}r_{21}& w_{11}r_{12} + w_{12}r_{22} \\
  w_{22}r_{21}& w_{22}r_{22} \endpMatrix ,
\eeqN
with solution
\bAlign
r_{22}&= 1/w_{22} = \ell \\
r_{21} &= 0/w_{22} =0 \\
r_{11}&= (1-0)/w_{11}= n^{1/2} \\
r_{12} &= -w_{12}r_{22}/w_{11} = n^{1/2}\zbar.
\eAlign
That is,
\beqN
R= \pMatrix n^{1/2}&n^{1/2}\zbar \\ 0& \ell \endpMatrix
\eeqN
and $X= XW W^{-1}=QR$.

\Remark
If I had started with a much larger \newdef{upper triangular matrix}, a
square matrix with zeros everywhere below the diagonal, you would have
seen clearly the pattern that gives the method the name
\newdef{back-substitution} \cite[Section~3.1]{GolubVL2013}. \Rlang/ knows
how to back-substitute; you do not have to go through the calculations.
\endRemark

The least squares calculations can be rewritten as
\beqN
QR\bhat= X\bhat = \yhat = \inner{y}{q_1}q_1 + \inner{y}{q_2}q_2 = QQ'y,
\eeqN
so that $\bhat = R^{-1} Q'y$. (Orthonormality gives $Q'Q=I_2$.) The
$n\times n$ matrix~$H=QQ'$ is often called the \newdef{hat matrix}.
(Where do you think that name came from?) It is the matrix for orthogonal
projection onto the~$\xx$ subspace. Also
\beqN
\bhat =R^{-1}Q'y = (R'R)^{-1}(QR)'y = (X'X)^{-1}X'y.
\eeqN
The last equality gives the traditional textbook solution to the
``normal equations'', $X'(y-X\bhat)= X'r=0$. Unfortunately this solution
makes no sense when~$X$ is not of full rank. (The matrix~$R$ would be
singular, that is, neither~$R$ nor~$X'X$ has an inverse.)
\endExample

\section{QR decompositions and least squares}

The method described in Example~\ref{2d} works more generally. For
$n\ge p$, every $n\times p$ matrix~$X=(x_1,\dots,x_p)$ can be written in
the form $X=QR$, where~$Q=(q_1,\dots,q_p)$ is an $n\times p$ matrix whose
columns are orthogonal unit vectors and~$R$ is a $p\times p$ upper
triangular matrix (the elements below the diagonal are all zero):
\beqN
R = \pMatrix
r_{11}&r_{12}&r_{13}&\dots&r_{1p} \\
0&r_{22}&r_{23}&\dots&r_{2p} \\
0&0&r_{33}&\dots&r_{3p} \\
\vdots \\
0&0&0&\dots&r_{pp}
\endpMatrix
\eeqN
That is,
\bAlignL\LABEL{x.from.q}
x_1&= q_1 r_{11} \notag\\
x_2&= q_1 r_{12} + q_2 r_{22} \notag\\
\vdots \notag\\
x_p&= q_1 r_{1p} + q_2 r_{2p} + \dots + q_p r_{pp}
\eAlignL
Such a representation could be constructed by an analog of the procedure
in Example~\ref{2d} (the Gram-Schmidt procedure) or, as \Rlang/ does, by
an elegant method using Householder reflections
\cite[page~9.13]{LINPACK}. The diagonal elements $r_{jj}$ for
$1\le j\le p$ are all nonzero if and only if the matrix~$X$ has full
rank. In that case, the $q_j$'s can all be expressed (uniquely) as linear
combinations of the~$x_j$'s and $\{q_1,\dots,q_p\}$ is an orthonormal
basis for~$\xx$.

When $X$ is not of full rank, \Rlang/ can still make a least squares fit,
but you need to be careful in interpreting the output. Here is an example
where I ensure complications by deliberately making the third column
of~$X$ a linear combination of two other columns; \Rlang/ then makes
things even worse (so to speak) by adding in a column of ones.
<<>>=
mydata3 <- data.frame(y=yy, x1=1:10, x2=11:20, x3=0.5*(1:10)-3*(11:20))
out3 <- lm(y ~ ., data=mydata3)
summary(out3)
@

What happened? \Rlang/ figured out that (within some level of numerical
accuracy):~$x_2$ and~$x_3$ are linear combinations of~$\bone$ and~$x_1$;
the matrix $X=(\bone, x_1,x_2,x_3)$ has rank~$2$; the subspace~$\xx$ has
dimension~$2$, not~$4$; \Rlang/ only needs two orthogonal unit vectors to
span~$\xx$; but it still gave you four orthogonal unit vectors because it
thought that's what you wanted.
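
Before digging into what \Rlang/ stored for this rank-deficient fit, here
is a sketch of the full-rank recipe $\bhat=R^{-1}Q'y$ applied to the
matrix \texttt{xx} from the first section. The chunk is not evaluated
here and the names (\texttt{qrx}, \texttt{bhat3}, and so on) are
introduced only for this illustration; \texttt{backsolve()} carries out
the back-substitution mentioned above.
{\small
<<eval=FALSE>>=
qrx <- qr(xx)          # Householder QR of the full-rank 10 by 2 matrix
Q <- qr.Q(qrx)         # 10 by 2, columns are orthogonal unit vectors
R <- qr.R(qrx)         # 2 by 2, upper triangular
bhat3 <- backsolve(R, t(Q) %*% yy)    # solve R b = Q'y by back-substitution
round(cbind(bhat3, out1$coefficients), 4)   # should agree with the fit in out1
@
}
\noindent
Now back to the rank-deficient fit stored in \texttt{out3}:
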
{\small
<<>>=
out3$rank
round(qr.R(out3$qr),2)   # notice the zeros on the diagonal
Q1 <- qr.Q(out3$qr)
round(Q1,2)[1:2,]
Q2 <- Q1[,1:2]           # just the first two columns, 10 by 2
HAT <- Q2 %*% t(Q2)      # a 10 by 10 matrix
round(HAT[1:2,],2)
thefit <- HAT %*% mydata3$y   # a 10 by 1 matrix
round( rbind(t(thefit), out3$fit), 3)
@
}

The hat matrix $H = Q_2Q_2'$ is symmetric and $HH=Q_2Q_2'Q_2Q_2' = H$. It
projects vectors in~$\RR^{10}$ onto the two-dimensional subspace~$\xx$
spanned by the columns of~$X$ or by the first two columns of the~$Q$
matrix.

In fact, \Rlang/ would be just as happy to give a complete orthonormal
basis for~$\RR^{10}$:
<<>>=
Q3 <- qr.Q(out3$qr, complete=TRUE)
round(Q1,2)[1:3,]
round(Q3,2)[1:3,]
@

\bibliographystyle{chicago}
\bibliography{/Users/pollard/Dropbox/texmf/bibtex/bib/DBP}
\end{document}