
Statistics 200: Lab 10

Today's task: Can you trust qqnorm()?

The normal Q-Q plot (?qqnorm) is a graphical check on the normality of a sample: for samples from a normal distribution, the plot should be roughly a straight line. For the plot to be a useful diagnostic, one must know what 'roughly straight' means.

Jargon and fact: If xx is a sample of some sort then the values in sort(xx) are called the order statistics of the sample. The middle 50% of the standard normal distribution occupies an interval of length close to 1.35 (the IQR, the interquartile range).
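A quick way to see both facts in R. This is only a sketch: the seed and the sample size of 15 are arbitrary choices, not part of the lab.

```r
set.seed(200)                  # arbitrary seed, for reproducibility only
xx  <- rnorm(15)               # a sample of some sort
ord <- sort(xx)                # the order statistics of the sample

ord[1]  == min(xx)             # smallest order statistic is the minimum
ord[15] == max(xx)             # largest order statistic is the maximum

# Length of the interval holding the middle 50% of the standard normal
iqr.norm <- qnorm(0.75) - qnorm(0.25)
round(iqr.norm, 2)
```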

Problem 1

What are the values returned by the (default version of the) qqnorm() function? How are the x-coordinates of the points calculated? Hint: feed them into pnorm(). Apparently the formula is slightly different for samples of size 10 or smaller.
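One way to follow the hint is sketched below. It asks qqnorm() for its coordinates without drawing (plot.it = FALSE), pushes the x-coordinates through pnorm(), and compares the result with the plotting positions from ppoints(). The sample sizes 20 and 10 and the seed are arbitrary; treat the comparisons as something to check on your own machine rather than as the final word.

```r
set.seed(200)                       # arbitrary seed, for reproducibility only
y  <- rnorm(20)
qq <- qqnorm(y, plot.it = FALSE)    # compute the coordinates without drawing

names(qq)                           # components of the returned list
p <- sort(pnorm(qq$x))              # x-coordinates mapped back to probabilities

# Compare with the plotting positions from ppoints()
all.equal(p, ppoints(20))
all.equal(p, (1:20 - 1/2) / 20)

# Same check for a small sample, where the formula changes
p10 <- sort(pnorm(qqnorm(rnorm(10), plot.it = FALSE)$x))
all.equal(p10, (1:10 - 3/8) / (10 + 1/4))
```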

Problem 2

Use rnorm() to generate 20 samples of size n (= 20 perhaps) from a standard normal distribution. Draw all 20 qqnorm plots on the same page (use mfrow). What do you see? Do the lines look straight? Experiment with different values for n to get some feel for what 'straight' means. Repeat the exercise with observations from rnorm with mean=3 and sd=2. What happens to the line?
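A minimal sketch of the plotting loop. The function name qq.page and the 4 by 5 layout are illustrative choices, not something the lab prescribes.

```r
# Draw nplots normal Q-Q plots of fresh rnorm samples on one page.
# qq.page is a hypothetical helper name; the 4 x 5 grid suits nplots = 20.
qq.page <- function(n = 20, mean = 0, sd = 1, nplots = 20) {
  old <- par(mfrow = c(4, 5), mar = c(2, 2, 1, 1))
  on.exit(par(old))                 # restore the previous layout afterwards
  for (i in seq_len(nplots)) {
    qqnorm(rnorm(n, mean = mean, sd = sd), main = "", xlab = "", ylab = "")
  }
  invisible(NULL)
}
```

Try qq.page(20), then qq.page(200), then qq.page(20, mean = 3, sd = 2), and compare the axes as well as the shapes.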

Problem 3

Write a function

contaminate(n=100,prob.bad=0,mean=0,sd=1,badmean=0,badsd=1)
to generate samples of size n from a 'contaminated normal distribution', as follows. Generate n observations from a normal distribution with mean 'mean' and standard deviation 'sd'. Also generate a value k from the Binomial(n,prob.bad) distribution. For the first k observations, multiply the sample value by 'badsd' then add 'badmean'. (If you want to disguise the bad values, you could return the sample in random order, or in sorted order.)
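The recipe above could be sketched as follows. This is one possible reading of the description, not a model answer; returning the sample in random order is the optional disguise mentioned in the parenthesis.

```r
contaminate <- function(n = 100, prob.bad = 0, mean = 0, sd = 1,
                        badmean = 0, badsd = 1) {
  x <- rnorm(n, mean = mean, sd = sd)          # the clean sample
  k <- rbinom(1, size = n, prob = prob.bad)    # how many values to contaminate
  bad <- seq_len(k)                            # indices of the first k observations
  x[bad] <- x[bad] * badsd + badmean           # rescale, then shift, the bad values
  x[sample.int(n)]                             # return in random order
}
```

With the default arguments nothing is contaminated, so contaminate(100) behaves like rnorm(100) up to ordering.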

Problem 4

Repeat the exercise from Problem 2 for 20 samples of size n (= 100 perhaps) from a contaminated normal with various choices of bad, badmean, and badsd. For each sample, rescale the observations to have zero median and IQR equal to 1.35, to keep the plots on a common scale. (I started with bad=10 and badmean=1.) What do you see? Could you tell that the samples were not from a standard normal?
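The rescaling step might look like the sketch below. The helper name rescale.sample is hypothetical, and the example data are plain rnorm values chosen only to show the effect.

```r
# Shift to zero median, then scale so that the IQR equals 1.35.
# rescale.sample is a hypothetical helper name.
rescale.sample <- function(x) {
  1.35 * (x - median(x)) / IQR(x)
}

set.seed(200)                      # arbitrary seed, for reproducibility only
z <- rescale.sample(rnorm(100, mean = 3, sd = 2))
c(median = median(z), IQR = IQR(z))
```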

Problem 5

Problem 6

Write a function

drawband(n,repl=100,L=6,H=95)
that will draw a pair of curves (let me call them ylow(x) and yhigh(x) for the moment) showing a 'typical range' within which a qqnorm plot for standard normals should lie. For each x, the range from ylow(x) to yhigh(x) should be constructed so that about 90% of the qqnorm(rnorm(n)) plots should lie in the range. (Note: It would be a much stronger requirement to have the qqnorm plot lie completely between the curves ylow() and yhigh() with probability 90%.) Hint: Use matlines() to draw curves, interpolating between values constructed at the x values generated by qqnorm(rnorm(n),plot=F). Use your function from Problem 5 to generate ylow(x) and yhigh(x).
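A rough sketch of one way to build such a band. It does not rely on the Problem 5 function; instead it simulates repl sorted samples directly and, at each x position, takes the L-th and H-th smallest of the simulated values. That substitution, the line type, and the colour are assumptions made for illustration.

```r
drawband <- function(n, repl = 100, L = 6, H = 95) {
  stopifnot(n >= 2, L >= 1, H <= repl, L < H)
  # x positions used by qqnorm for a sample of size n, in increasing order
  x <- sort(qqnorm(rnorm(n), plot.it = FALSE)$x)
  # n x repl matrix: each column is one sorted standard normal sample
  sims <- replicate(repl, sort(rnorm(n)))
  # repl x n matrix: column j holds the sorted simulated values of the
  # j-th order statistic across the repl replications
  sims <- apply(sims, 1, sort)
  band <- cbind(ylow = sims[L, ], yhigh = sims[H, ])
  matlines(x, band, lty = 2, col = "red")      # add to the current plot
  invisible(data.frame(x = x, band))
}
```

Because matlines() adds to an existing plot, draw a plot first, for example qqnorm(rnorm(50)) followed by drawband(50).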

Problem 7

Write a function

QQnorm(x,low=???,high=???,rescale=F)
with suitable default values for low and high, to draw the qqnorm plot for the vector of data x, with 'error bands' added. If rescale is T, standardize x to have zero median and IQR 1.35 before drawing the plot.

Puzzle

Here is a 10 by 100 matrix (in data.dump format) of data. I generated each row by some random mechanism, then sorted the observations within each row. Can you guess which rows were generated as samples from a normal distribution? You are allowed to use any statistical techniques you like to test the data.

If you get desperate, you could cheat by finding out the methods I used.

What is the moral of today's lab session?