Stat200 lab4

Statistics 200: Lab 4 (Friday 6 February 98)

Today's tasks:
Matrices, arrays and factors.

We have already met matrices, and seen how Splus handles them, in some of the previous labs. Today we shall learn more about matrices and some built-in Splus functions for manipulating them. We shall also see two new types of object: arrays (a multidimensional analogue of a matrix) and factors (a special type of vector). Finally, we shall return to the tapply command that we have seen in previous labs.

Matrix Revision

We construct a (3 x 3) matrix m1 as follows,

> v1<-c(1,1,2,3,5,8,13,21,35)
> m1<-matrix(v1,3,3)
> m1

To find the inverse of the matrix m1 we use the command solve (?solve).

> m2<-solve(m1)
> m2*m1
> m1*m2

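I thought that m2 was the inverse of m1, so why is neither product the identity matrix? Remember the two types of matrix product: * multiplies element by element, whereas %*% is the true matrix product.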

> m2%*%m1
> m1%*%m2
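
To make the contrast concrete, here is a minimal sketch that computes both products for the lab's matrix and compares them with the identity. It uses only basic S syntax that should also run in R; the names elementwise and matprod are ours, introduced for illustration.

```r
# Rebuild the lab's matrix and its inverse.
v1 <- c(1, 1, 2, 3, 5, 8, 13, 21, 35)
m1 <- matrix(v1, 3, 3)
m2 <- solve(m1)

# Elementwise product: entry [i, j] is m2[i, j] * m1[i, j].
elementwise <- m2 * m1

# Matrix product: rows of m2 combined with columns of m1.
matprod <- m2 %*% m1

# Rounding hides floating-point noise so the identity matrix is easy to see.
round(matprod, 10)
round(elementwise, 10)
```

Only the %*% product should print as the identity matrix; the elementwise product is just a table of pairwise products.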

We can use Splus to easily solve systems of linear equations, for example:

x + 3y + 13z = 1

x + 5y + 21z = 2

2x + 8y + 35z = 3

This can be solved in Splus by multiplying the inverse of the matrix m1 by the vector c(1,2,3), or more directly by giving solve the right-hand side as a second argument.

> solve(m1)%*%c(1,2,3)

> solve(m1,c(1,2,3))
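
As a check on the two commands above, the following sketch solves the system both ways and substitutes the answer back into the equations. The names b, x.inv and x.dir are ours; everything else comes from the lab.

```r
m1 <- matrix(c(1, 1, 2, 3, 5, 8, 13, 21, 35), 3, 3)
b <- c(1, 2, 3)              # right-hand side of the three equations

x.inv <- solve(m1) %*% b     # via the explicit inverse (a 3 x 1 matrix)
x.dir <- solve(m1, b)        # direct solve (a plain vector)

x.dir                        # x = -0.5, y = 0.5, z = 0, up to rounding

# Residual m1 x - b: should be zero up to rounding error.
m1 %*% x.dir - b
```

Both routes give the same answer here. The direct form is usually preferred because it avoids forming the inverse at all.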

We can find many mathematical properties of matrices using Splus. For example, we can extract the diagonal (?diag), eigenvalues and eigenvectors (?eigen) and transpose (?t).

Check out the difference between giving the diag command a matrix and a vector. What output would you expect from diag(diag(m1))?

There is no inbuilt function for finding the determinant or trace of a matrix, but these are easily written.

> tr<-function(M){sum(diag(M))}
> det <- function(M){Re(prod(eigen(M)$values))}

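Yes, you can calculate the determinant as the product of the eigenvalues. The eigenvalues of a real matrix may be complex, but they then occur in conjugate pairs, so the product is real; Re just discards the zero imaginary part.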
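
A quick way to convince yourself is to try both functions on m1, where the answers can be checked by hand. This is only a sketch: the determinant function is named det.eig here (our name) so that it cannot mask a det function that your system may already provide.

```r
m1 <- matrix(c(1, 1, 2, 3, 5, 8, 13, 21, 35), 3, 3)

# Trace: sum of the diagonal entries.
tr <- function(M) sum(diag(M))

# Determinant as the product of the eigenvalues (real part only).
det.eig <- function(M) Re(prod(eigen(M)$values))

tr(m1)        # 1 + 5 + 35 = 41
det.eig(m1)   # cofactor expansion by hand gives 2, up to rounding
```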

Splus has many advanced matrix operations, for example the QR decomposition (?qr) and the singular value decomposition (?svd). If these are not enough for you, there are plenty more in the library of matrix functions that can be attached to Splus. If you are interested, check library(help=Matrix).
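
As a small illustration of one of these decompositions, the sketch below takes the singular value decomposition of m1 and multiplies the pieces back together. It assumes svd returns a list with components d, u and v, as it does in R; check ?svd on your system. The names s and rebuilt are ours.

```r
m1 <- matrix(c(1, 1, 2, 3, 5, 8, 13, 21, 35), 3, 3)

s <- svd(m1)   # s$d: singular values; s$u, s$v: orthogonal matrices

# Reassemble m1 as U D V'.
rebuilt <- s$u %*% diag(s$d) %*% t(s$v)

max(abs(rebuilt - m1))   # should be zero up to rounding error
```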

Problem: Write a function called "identity" that generates an n-dimensional identity matrix when given an integer n as its argument.

Arrays

Arrays are multi-dimensional analogues of matrices. Whereas vectors are 1-dimensional and matrices are 2-dimensional, arrays can be of any finite dimension (within the computer's capabilities).

I like to think of a 3-dimensional array as a stack of equally sized matrices on top of each other. Equivalently, it can be thought of as a row of equally sized matrices beside each other.

There is a 3-dimensional array called iris built into Splus; it holds a very famous dataset in statistics.

> iris
> dim(iris)
> iris[1,2,3]
> iris[,(1:3),(2:4)]
> iris[,-3,2]

Yes, arrays are indexed just like matrices and vectors.
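
If the iris output is too large to follow, the same indexing patterns can be tried on a toy array whose entries are easy to predict. The array arr below is our own example, not part of the lab data.

```r
# A 2 x 3 x 4 array filled with 1, 2, ..., 24.
arr <- array(1:24, c(2, 3, 4))

arr[1, 2, 3]      # a single element
arr[, 1:2, 2:4]   # a 2 x 2 x 3 sub-array
arr[, -3, 2]      # slice 2 with column 3 dropped: a 2 x 2 matrix

dim(arr[, 1:2, 2:4])
dim(arr[, -3, 2])
```

An empty index keeps everything along that dimension, and a negative index drops the named position, exactly as for vectors and matrices.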

Try to rename the dimension labels of the array using dimnames. Replace "Sepal W." with "Sepal Width", "Sepal L." with "Sepal Length", etc.

The aperm command is a generalization of the transpose command to arrays. Its second argument is a vector giving the new ordering of the dimensions of the array. The following examples will make it easier to understand.

> aperm(iris,c(1,2,3))
> aperm(iris,c(2,3,1))
> aperm(iris,c(3,2,1))
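
The effect of the permutation vector is easiest to see on the dimensions of a small array. The sketch below uses a toy array arr of our own and assumes aperm follows the convention used in R, where position m of the result is the old dimension perm[m].

```r
arr <- array(1:24, c(2, 3, 4))

p <- aperm(arr, c(2, 3, 1))
dim(p)                        # 3 4 2: old dimensions 2, 3, 1 in that order

# The same element, addressed before and after the permutation.
arr[1, 2, 3]
p[2, 3, 1]

# For a matrix, swapping the two dimensions is just the transpose.
m <- matrix(1:6, 2, 3)
all(aperm(m, c(2, 1)) == t(m))
```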

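The apply command works on arrays as well as matrices, although it takes a bit of thought to see exactly what is happening. The second argument names the dimensions to keep; the function is applied over all the others.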

> apply(iris,c(2,3),mean)
> apply(iris,2,mean)
> apply(iris,3,mean)
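
A toy array with predictable entries makes it easier to see which dimensions survive. This is a sketch on our own array arr rather than on iris.

```r
arr <- array(1:24, c(2, 3, 4))

# Keep dimensions 2 and 3: a 3 x 4 matrix of means taken over dimension 1.
apply(arr, c(2, 3), mean)

# Keep dimension 3 only: one mean per 2 x 3 slice.
apply(arr, 3, mean)          # 3.5 9.5 15.5 21.5

# Each entry can be reproduced by indexing and averaging by hand.
mean(arr[, 1, 1])
mean(arr[, , 2])
```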

It is sometimes useful to turn an array into a vector or data frame, and also to turn a vector or data frame into an array. This is a simple operation if you can keep track of how Splus constructs arrays and how it deconstructs them into vectors. Try the following example for some ideas.

> a<-array(1:12,c(2,3,4))
> a
> va<-as.vector(a)
> va
> aa<-array(va,c(2,3,4))
> aa

By now you will have turned an array into a vector, and then turned it back into an array. Can you see how Splus constructs arrays?

Problem: How would you construct a data frame with the first column being the observations from an array, the second column recording the first indices of the values in the array, the third column recording the second indices, and the fourth column recording the third indices? A simple example will help you do this.

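If you ever need to check whether an object is of a certain type, just use a command like is.vector (to check if it's a vector), is.array (to see if it's an array), is.matrix, is.data.frame, etc. The corresponding as. commands (as.vector, as.array, as.matrix, as.data.frame) do not check anything; they convert an object to that type.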
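
The difference between testing and converting can be seen in a few lines. In this sketch the is. commands return TRUE or FALSE, while the as. commands return a new object; the names m, v and df are ours.

```r
m <- matrix(1:6, 2, 3)

is.matrix(m)          # test: TRUE

v <- as.vector(m)     # convert: the six entries as a plain vector
is.vector(v)          # TRUE
is.matrix(v)          # FALSE, the dimensions are gone

df <- as.data.frame(m)
is.data.frame(df)     # TRUE
```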

Factors

A factor is a special type of vector used to store categorical variables.

Example: Suppose we have a vector of country data relating to the citizenship of a sample of people. We can store this data in a character vector, but we will later see that a factor is a more useful form for storing this type of data.

> country <- c("USA","Ger","USA","UK","Aus","Fra","Can","Can","USA","Fra","USA")
> countryfac <- factor(country)
> countryfac

In many of its functions Splus will treat a factor like a set of indicator variables; you will notice this when doing regression with categorical data. Did you notice that the country names were printed without quotes?

> codes(countryfac)

How did Splus come up with these codes for storing the factor? Try the following; it should help.

> countryfac2<-factor(country,levels=c("USA","Can","Aus","Ger","Fra","UK","Bel"))
> countryfac2
> codes(countryfac2)

We can order the levels of a factor, just as some categorical data have a natural ordering.

Example: Suppose we just recorded the income class of a group of individuals, using the categories "Low", "Mid", "High". We can use a factor to store this information.

> income<-ordered(c("Mid","Low","High","Low","Mid","Mid","Low","High"))
> income

That is not quite right yet: the levels have not come out in the order Low, Mid, High. Try the following fix for our problem.

> inc<-ordered(c("Mid","Low","High","Low","Mid","Mid","Low","High"),levels=c("Low","Mid","High"))
> inc

You could have saved a bit of typing by just doing:

> ordered(income)<-c("Low","Mid","High")
> income
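
To see what the levels argument changes, it can help to compare the level order of the two ordered factors side by side. The sketch below uses only ordered, levels and table; inc0 is our name for the version built without a levels argument.

```r
x <- c("Mid", "Low", "High", "Low", "Mid", "Mid", "Low", "High")

inc0 <- ordered(x)                                   # default level order
inc  <- ordered(x, levels = c("Low", "Mid", "High")) # order stated explicitly

levels(inc0)   # sorted alphabetically: "High" "Low" "Mid"
levels(inc)    # "Low" "Mid" "High"

# Counts are reported in level order.
table(inc)
```

The data are identical in both objects; only the ordering attached to the categories differs.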

We shall give a quick example of why factors can be useful.

Example: Suppose that we want to find the mean age of the people recorded above, by country. We happen to have the ages recorded in a vector age.

> age <- c(54,44,40,43,37,55,34,60,47,42,61)
> tapply(age,countryfac2,mean)

This is much easier than having an indicator variable for each country and doing a tapply using those. In later labs factors will prove to be very useful when doing analysis of variance or linear regression.
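
For comparison, here is a sketch of the indicator-variable route for a single country next to the tapply call. The names cf, by.country and usa.ind are ours; the data are the country and age vectors from the lab.

```r
country <- c("USA", "Ger", "USA", "UK", "Aus", "Fra", "Can", "Can", "USA", "Fra", "USA")
age <- c(54, 44, 40, 43, 37, 55, 34, 60, 47, 42, 61)
cf <- factor(country)

# One call gives the mean age for every country.
by.country <- tapply(age, cf, mean)
by.country

# The long way: build a logical indicator and average one group at a time.
usa.ind <- country == "USA"
mean(age[usa.ind])           # (54 + 40 + 47 + 61) / 4 = 50.5
```

With indicators, the last two lines would have to be repeated for each of the six countries.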

Problem:

1994 Winter Olympics

                     Judges
                  CZE   USA   CAN
        CZE   T    54    54    55
              A    57    54    56
Teams   USA   T    53    54    56
              A    55    55    57
        CAN   T    56    57    57
              A    57    58    58

Enter the data into an array
Compute the overall average score
Compute the average score for each judge, and each team
Compute the average score for each judge-team combination
Is there evidence that judges favour their own team?