# Sampling in Statistical Inference

The use of randomization in sampling allows for the analysis of results using
the methods of *statistical inference*. Statistical inference is
based on the laws of probability, and allows analysts to infer conclusions
about a given population based on results observed through random sampling.
Two of the key terms in statistical inference are *parameter* and
*statistic*:

A *parameter* is a number describing a population, such as a
percentage or proportion.

A *statistic* is a number, such as a sample mean, that can be computed
from the data observed in a random sample without using any unknown
parameters.

__Example__

Suppose an analyst wishes to determine the percentage of defective items
produced by a factory over the course of a week. Since the factory produces
thousands of items per week, the analyst takes a sample of 300 items and observes
that 15 of these are defective. Based on these results, the analyst computes
the *statistic p̂* = 15/300 = 0.05
as an estimate of the *parameter p*, the
true proportion of defective items in the entire population.

Suppose the analyst takes 200 samples, of size 300 each, from the same group
of items, and obtains the following results:

| Number of Samples | Percentage of Defective Items |
|---|---|
| 20 | 3 |
| 30 | 4 |
| 50 | 5 |
| 45 | 6 |
| 35 | 7 |
| 30 | 8 |

The histogram corresponding to these results is shown below:

These results approximate a **sampling distribution** for the
statistic *p̂*, that is, the distribution
of values taken by the statistic in all possible samples of size 300
from the population of factory items. The distribution appears to be
approximately normal, with mean between 0.05 and 0.06. With more repeated
samples, the observed distribution would more closely approximate a
normal distribution, although it would remain discrete because of the
granularity caused by rounding to percentage points.
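
The center and spread of the tabulated results can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original example, and its variable names are illustrative. It uses the counts exactly as tabulated, which sum to 210, so it divides by that tabulated total.

```python
# Tabulated results: percentage of defective items -> number of samples.
counts = {3: 20, 4: 30, 5: 50, 6: 45, 7: 35, 8: 30}

# Total number of samples as tabulated (the listed counts sum to 210).
total_samples = sum(counts.values())

# Weighted mean of the observed percentages.
mean_pct = sum(pct * k for pct, k in counts.items()) / total_samples

# Spread of the observed percentages around that mean.
var_pct = sum(k * (pct - mean_pct) ** 2 for pct, k in counts.items()) / total_samples
sd_pct = var_pct ** 0.5

print(f"samples tabulated: {total_samples}")
print(f"mean of observed statistics: {mean_pct / 100:.4f}")
print(f"standard deviation of observed statistics: {sd_pct / 100:.4f}")
```

The computed mean falls between 0.05 and 0.06, in line with the description of the distribution above.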

### Bias and Variability

When a statistic
systematically deviates from the true parameter *p*, it is
considered to be a biased estimator of the parameter. In the factory
example above, if the true percentage of defective items were known
to be 8%, then the statistic would be biased toward underestimating
the proportion of defective items, since its sampling distribution is
centered between 5% and 6%. An *unbiased
estimator* has a sampling distribution whose mean
is equal to the true value of the parameter.
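
The difference between a biased and an unbiased estimator can be illustrated by simulation. The sketch below is an illustration under stated assumptions rather than the analyst's actual procedure: it supposes a true defect rate of 8%, and it introduces a hypothetical `detect_prob` parameter that models an inspection which misses some defective items. The function name and the seed are likewise illustrative.

```python
import random

def simulate_mean_estimate(true_p, n, num_samples, detect_prob=1.0, seed=0):
    """Return the average sample proportion over repeated samples of size n.

    detect_prob < 1 models a hypothetical inspection that overlooks some
    defective items, which makes the resulting statistic biased.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_samples):
        found = 0
        for _ in range(n):
            is_defective = rng.random() < true_p
            # A defective item is counted only if the inspection detects it.
            if is_defective and rng.random() < detect_prob:
                found += 1
        estimates.append(found / n)
    return sum(estimates) / num_samples

# Assumed true proportion of 0.08, samples of size 300, 2000 repetitions.
unbiased_mean = simulate_mean_estimate(0.08, 300, 2000)
biased_mean = simulate_mean_estimate(0.08, 300, 2000, detect_prob=0.7)

print(f"mean estimate, full detection:    {unbiased_mean:.4f}")
print(f"mean estimate, partial detection: {biased_mean:.4f}")
```

With full detection, the average of the sample proportions should land close to the assumed parameter of 0.08. With the hypothetical 70% detection rate, it should settle near 0.056, below the true value however many samples are taken.
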

The *variability* of a statistic is determined by the
spread of its sampling distribution. In general, statistics computed from
larger samples have smaller variability. This is because, as the sample size increases,
the chance of observing extreme values decreases and the observed values of
the statistic cluster more closely around the mean of the sampling distribution.
Furthermore, if the population is much larger than the
sample, then the size of the population has almost no effect on the
variability of the sampling distribution (e.g., a sample of size 100 from
a population of size 100,000 will have essentially the same variability as a sample
of size 100 from a population of size 1,000,000).
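
Both effects can be checked against the standard formula for the standard deviation of a sample proportion under simple random sampling without replacement, sqrt(p(1 - p)/n) multiplied by the finite population correction sqrt((N - n)/(N - 1)). The following is a minimal sketch that assumes *p* = 0.05 purely for illustration; the function name is hypothetical.

```python
import math

def sd_sample_proportion(p, n, population_size):
    """Standard deviation of a sample proportion for a simple random sample
    of size n drawn without replacement from a finite population."""
    # Finite population correction: close to 1 when the population is
    # much larger than the sample.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

p = 0.05  # assumed true proportion, for illustration only

# Effect of sample size (population held at 100,000).
sd_n100 = sd_sample_proportion(p, 100, 100_000)
sd_n400 = sd_sample_proportion(p, 400, 100_000)

# Effect of population size (sample held at 100).
sd_pop_100k = sd_sample_proportion(p, 100, 100_000)
sd_pop_1m = sd_sample_proportion(p, 100, 1_000_000)

print(f"n = 100: {sd_n100:.5f}   n = 400: {sd_n400:.5f}")
print(f"N = 100,000: {sd_pop_100k:.5f}   N = 1,000,000: {sd_pop_1m:.5f}")
```

Quadrupling the sample size roughly halves the standard deviation, while a tenfold increase in population size changes it only in the fifth decimal place.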