Making inferences using a random sample

Predicting population mean using a sample mean

Does random sampling always give a representative sample?

Let's see what happens when we take 100 random samples, each of size 10, from the population. It isn't practical to show the distributions of all 100 samples, so we'll show six of them. Rest assured that we didn't cheat: all of these samples were actually generated by a computer from the given population.

[six distributions here]

Well, it's pretty clear that random sampling does not always produce a representative sample. There is always a small chance of getting a sample whose distribution is quite different from the population distribution. Also, the samples differ from one another.

So, how confident can we be about an estimate of the population mean that we inferred from a sample mean? And what factors influence the accuracy of our estimate? To answer these questions, we'll need to build a new distribution called a "sampling distribution".

What is a sampling distribution?

The sampling distribution is the distribution of the sample means. In the last example, we randomly generated 100 different samples of size 10. From each sample, we calculate the sample mean. The distribution of these sample means is called a sampling distribution. What's interesting about the sampling distribution is that its mean is very close to the actual population mean.

This is very useful because even if we don't know the population mean, we can take samples of a certain size, use them to build the sampling distribution, and estimate the mean of the population.
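The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population here is hypothetical (1,000 right-skewed values generated by the computer), and the helper name sample_means is our own.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population (an assumption for illustration only):
# 1,000 right-skewed values with a mean of roughly 50.
population = [rng.expovariate(1 / 50) for _ in range(1000)]
population_mean = statistics.mean(population)

def sample_means(population, sample_size, num_samples, rng):
    """Draw num_samples random samples and return the mean of each one."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# 100 random samples, each of size 10, as in the text.
means = sample_means(population, sample_size=10, num_samples=100, rng=rng)
mean_of_means = statistics.mean(means)

print(f"population mean:      {population_mean:.2f}")
print(f"mean of sample means: {mean_of_means:.2f}")
print(f"smallest sample mean: {min(means):.2f}")
print(f"largest sample mean:  {max(means):.2f}")
```

Individual sample means can land well above or below the population mean, while their average should sit close to it.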

Below are the two distributions. The first one is the actual population distribution, whose mean we want to estimate. In this case, we already know the population distribution, so we can compare our estimate with the actual population mean. In real life, however, we often deal with situations where we do not know the population distribution. The second distribution is the sampling distribution, which we will use to estimate the population mean.

[distributions]

So, what can we observe?

The sample mean changes from one sample to another. Every random sample is different, so the mean of the sample will also change. In the sampling distribution, some means are above the population mean, and some means are below the population mean.
Some sample means are quite far from the population mean. These samples probably did not represent the population well. Again, a random sample may not always be a representative sample. However, random selection is still the best method for getting a representative sample.
In the long run, the sample means average out to the population mean.

Some other observations:

The sampling distribution tends to have less variability than the individual values within the population distribution.
The shape of the sampling distribution tends to be more bell-shaped (unimodal and symmetric) than the population distribution.
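
These two observations can be checked numerically. The sketch below assumes a hypothetical right-skewed population and uses the standard deviation as one possible measure of spread. The skewness helper is our own: it returns a value near 0 for symmetric data and a positive value for right-skewed data.

```python
import random
import statistics

rng = random.Random(4)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population (an assumption for illustration only).
population = [rng.expovariate(1 / 50) for _ in range(1000)]

def skewness(values):
    """Third standardized moment: near 0 if symmetric, positive if skewed right."""
    center = statistics.mean(values)
    spread = statistics.pstdev(values)
    return statistics.mean([((v - center) / spread) ** 3 for v in values])

def sample_means(sample_size, num_samples):
    """Draw num_samples random samples and return the mean of each one."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

means = sample_means(sample_size=10, num_samples=100)

print(f"population:   spread {statistics.pstdev(population):.1f}, "
      f"skewness {skewness(population):.2f}")
print(f"sample means: spread {statistics.pstdev(means):.1f}, "
      f"skewness {skewness(means):.2f}")
```

With only 100 samples the skewness of the sample means is a noisy estimate, so the comparison of shapes becomes clearer as more samples are drawn.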

Does the sample size influence our estimation of the population mean?

Let's look at two sampling distributions, each built from 100 samples. The first uses a sample size of 10 and the second uses a sample size of 25.

In the first sampling distribution, a large majority of the sample means lie between x and y. In contrast, a large majority of the sample means lie between x and y in the second sampling distribution. Clearly, if you were to use a single random sample to predict the population mean, you would be more likely to get a good estimate from a sample of the bigger size, 25.

In general, as the sample size gets bigger, the mean of a random sample is more likely to be closer to the mean of the population.
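
One way to see the effect of sample size is to measure how far a sample mean typically lands from the population mean. The following is a minimal sketch, assuming a hypothetical population of 1,000 values spread evenly between 0 and 100. The function name typical_distance is our own.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population (an assumption for illustration only).
population = [rng.uniform(0, 100) for _ in range(1000)]
population_mean = statistics.mean(population)

def typical_distance(sample_size, num_samples=100):
    """Average distance between a sample mean and the population mean."""
    distances = []
    for _ in range(num_samples):
        sample = rng.sample(population, sample_size)
        distances.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(distances)

distance_n10 = typical_distance(10)
distance_n25 = typical_distance(25)

print(f"sample size 10: typical distance {distance_n10:.2f}")
print(f"sample size 25: typical distance {distance_n25:.2f}")
```

The typical distance for samples of size 25 should come out smaller than for samples of size 10, which matches the general statement above.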

Is a random sample better at predicting the population mean when it is drawn from a population with less variability?

For samples of the same size, the one drawn from the population with less variability is more likely to have a mean that is close to the population mean.

Let’s understand this with an example. The first population has a mean of x and a mean absolute deviation (MAD) of y. The second population has a mean of x and a MAD of y+2. Now, let’s draw 100 samples from each population and plot the sampling distribution for each.

[distribution here]

You can see that when the population has high variability, the distribution of sample means is more spread out, so we can be less confident that any single sample mean is close to the population mean. However, in both cases, as we take many samples, the sample means average out to the population mean.
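
The comparison can also be simulated. The sketch below assumes two hypothetical populations with roughly the same mean (about 50) but different spread. The specific numbers are chosen for illustration and are not the x and y from the example above.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

def mean_absolute_deviation(values):
    """Average distance of the values from their mean (MAD)."""
    center = statistics.mean(values)
    return statistics.mean([abs(v - center) for v in values])

# Two hypothetical populations (assumptions for illustration only):
# both are centered near 50, but the second is far more spread out.
low_variability = [rng.gauss(50, 5) for _ in range(1000)]
high_variability = [rng.gauss(50, 15) for _ in range(1000)]

def sample_means(population, sample_size=10, num_samples=100):
    """Draw num_samples random samples and return the mean of each one."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]

for name, population in [("low", low_variability), ("high", high_variability)]:
    means = sample_means(population)
    print(
        f"{name} variability: population MAD {mean_absolute_deviation(population):.1f}, "
        f"MAD of sample means {mean_absolute_deviation(means):.1f}, "
        f"mean of sample means {statistics.mean(means):.1f}"
    )
```

The sample means from the high-variability population should be more spread out, yet in both cases their average should stay near 50.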

The population distribution is skewed right. All words have between 1 and 11 characters, so no two word lengths differ by more than 10 characters. The most frequent word length is 4 characters and occurred in 59 (22%) of the words. Approximately 60% of the words have lengths between 2 and 4 characters, whereas approximately 83% have word lengths between 2 and 6 characters. The mean word length is 4.3 characters.

Although the empirical sampling distribution is slightly skewed to the right, it is closer to bell-shaped than the population distribution. All the samples have means between 2.5 and 6.4 characters, and no two sample means differ by more than 3.9 characters. Twenty-eight percent of the samples have a sample mean between 4.0 and 4.4 characters, 69% have a sample mean between 3.5 and 4.9 characters, and 95% have a sample mean between 3.1 and 5.3 characters. The median of the 100 sample means is 4.25 characters, and the mean of the 100 sample means is approximately 4.3 characters. Note that both the mean and median of the sample means are close to 4.3, which is the mean of the population distribution of all word lengths.

Thus, although the population distribution is skewed right and the empirical sampling distribution is fairly bell-shaped, the centers for both the population distribution and the empirical sampling distribution are similar. Also, there is less variability in the empirical sampling distribution than in the population distribution, an observation that leads to the questions in Reflect 1.22.

In the case of the empirical sampling distribution displayed in figure 1.50, 95% of the selected samples produced a sample mean between 3.1 and 5.3 characters. If a sample mean is between 3.1 and 5.3 characters, then it is at most 1.2 characters from the population mean, μ = 4.3. This is an illustration of a foundational concept in statistical estimation: the notion of a margin of error when estimating the mean of a population with the mean of a sample.
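
The 1.2-character figure is simply the larger of the two distances from the population mean to the ends of the interval. As a small check using only the numbers quoted above (the function name margin_of_error is our own):

```python
def margin_of_error(population_mean, lower, upper):
    """Largest distance from the population mean to either end of the interval."""
    return max(population_mean - lower, upper - population_mean)

# Values quoted in the text: mean 4.3, with 95% of sample means in [3.1, 5.3].
margin = margin_of_error(4.3, 3.1, 5.3)

print(round(4.3 - 3.1, 1))  # distance to the lower end: 1.2
print(round(5.3 - 4.3, 1))  # distance to the upper end: 1.0
print(round(margin, 1))     # 1.2
```

The values are rounded to one decimal place only to hide floating-point noise in the subtraction.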
