Inferential statistics - sample and population
Descriptive vs Inferential Statistics
So far, all the examples we have seen used a small set of data. For instance, we analyzed the heights of dogs on a data set of 30 dogs. We described the relevant characteristics of the distribution such as the shape, center, and spread. We call this field of statistics "descriptive statistics" because it allows us to describe, present and summarize data in a meaningful way such that patterns might emerge from the data.
A key point to note is that all these analyses apply to this particular group only. We might, however, have unknowingly extended our conclusions beyond the given data set to a larger one. For instance, if the mean weight for the group of 30 pugs was 5.5 kg, then we might have unwittingly concluded that the mean weight of all the pugs in the US is also 5.5 kg. But are we right to do so? Can we extend our analysis of a smaller data set to draw conclusions about a larger data set?
It appears we can, and this process of making inferences about populations based on samples is called inferential statistics. In the next few lessons, we will explore when and how we can use a sample to make inferences about the population.
Sample and population
Before we go ahead, we need to define two important terms: sample and population. So, what are they?
A population is a set of people or things that we want to study. A sample is a subset of a population. Here are some examples of populations and samples from the listed populations:
Populations | Samples from the population
All people in the world | The leaders of each country
All seventh graders at a school | The seventh graders who are in band
All apples grown in the U.S. | The apples in the school cafeteria
When we want to know more about a population, ideally we would collect data on the whole population. Doing so, however, would be time-consuming, expensive, and often impossible. Imagine weighing all 90 million dogs in the US! When it is not feasible to collect data from everyone in the population, we collect data from samples instead.
Can we draw conclusions about a population by examining any sample?
Here is the given scenario. You are to estimate the average height of adult men in the US. Which of the following processes do you think makes the most sense?
- You take the height of 5 men in your family including your father, brothers, and cousins.
- You take the height of …..
Do you think all of the above samples represent the population? In other words, can we use any sample from the population to answer our questions about the population? Definitely not. What would happen if you tried to estimate the mean weight of all the dogs in the US by using a sample of 30 chihuahuas? How about estimating the average salary of Americans by taking a sample of software engineers in Silicon Valley? Your estimates would be far off: the first far too low, the second far too high. In the lessons that follow, we will learn more about how to pick a sample that can help answer questions about the entire population.
What is a representative sample?
To answer the question about the entire population, we need to pick a sample that is representative of the population. A representative sample is one that has a distribution that closely resembles the distribution of the population in shape, center, and spread. A representative sample “represents” the population.
For example, consider the distribution of plant heights, in cm, for a population of plants shown in this dot plot. The mean for this population is 4.9 cm, and the MAD (mean absolute deviation) is 2.6 cm. A representative sample of this population should have a larger peak on the left and a smaller one on the right, like this one.
[population and sample]
The mean for this sample is 4.9 cm, and the MAD is 2.3 cm, which is close to the population mean and population MAD. The shape also closely resembles the population. So, this sample represents the population.
[population and sample]
Here is the distribution for another sample from the same population. This sample has a mean of 5.7 cm and a MAD of 1.5 cm. These are both very different from the population, and the distribution has a very different shape, so it is not a representative sample.
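The comparison above boils down to computing the mean and the MAD for the population and for each sample, then seeing how close they are. Here is a minimal sketch of that calculation. The plant heights below are made-up numbers for illustration only (they are not the data behind the dot plots), and the helper name `mean_absolute_deviation` is our own.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average distance of each value from the mean of the values."""
    center = mean(values)
    return mean([abs(v - center) for v in values])

# Hypothetical plant heights in cm (NOT the data shown in the dot plots).
population = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 8, 9, 9, 10]
sample_a = [2, 3, 3, 4, 6, 8, 9, 10]   # spread out like the population
sample_b = [4, 4, 5, 5, 6, 6, 7, 7]    # bunched up in the middle

for name, data in [("population", population),
                   ("sample A", sample_a),
                   ("sample B", sample_b)]:
    print(f"{name}: mean = {mean(data):.2f} cm, "
          f"MAD = {mean_absolute_deviation(data):.2f} cm")
```

A sample whose mean and MAD land close to the population values (and whose dot plot has a similar shape) is a reasonable candidate for a representative sample. A sample whose MAD is much smaller, like the bunched-up one here, is not.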
Simple random sampling to pick a representative sample
In the last lesson, we learned that a representative sample can be used to answer questions about the population. The obvious question that follows is “How do we pick a representative sample?” You might be thinking there must be some crazy sophisticated process to select a representative sample. Surprisingly, one of the most effective ways to pick a representative sample is also one of the simplest, and it’s called random sampling.
What is a random sample?
A sample is selected at random from a population if it has the same chance of being selected as every other sample of the same size. In simple random sampling, each individual in the population also has the same chance (probability) of being selected. In this regard, simple random sampling provides a fair way to select a sample.
It’s quite a mouthful, isn't it? What it’s basically saying is that if you pick a sample without bias for any particular values, it’s likely to be a representative sample. Other methods of selecting a sample from a population are likely to be biased and are less likely to be representative of the overall population. A sampling method is biased if it tends to produce samples that systematically overrepresent or underrepresent various features of the population.
A sample that is selected at random may not always be a representative sample, but it is more likely to be representative than a sample chosen by other methods. We will discuss this further later.
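For a tiny population, we can check the definition of a simple random sample by brute force: list every possible sample of a given size, give each one the same chance, and see what chance each individual ends up with. This is only a sketch; the five names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of five students.
population = ["Ana", "Ben", "Cy", "Dee", "Eli"]
sample_size = 2

# Every possible sample of size 2 (order does not matter).
all_samples = list(combinations(population, sample_size))

# Simple random sampling: each possible sample is equally likely.
chance_per_sample = Fraction(1, len(all_samples))

def chance_of_inclusion(person):
    """Chance that `person` ends up in the selected sample."""
    containing = sum(1 for s in all_samples if person in s)
    return containing * chance_per_sample

print(len(all_samples), "possible samples, each with chance", chance_per_sample)
for person in population:
    print(person, chance_of_inclusion(person))
```

Every student gets the same chance of being included, which is what makes the method fair.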
Here are some examples of random sample selection along with other sampling methods with bias.
Examples of random selection | Examples of biased selection
There are 25 students in a class. We write each student's name on a slip of paper, put the slips in a bag, and draw 5 slips to get a sample of 5 students selected at random from the class. | We select the first 5 students who walk in the door. This does not give us a random sample, because students who typically come late are unlikely to be selected.
We are trying to estimate the average length of a word in Shakespeare's “The Merchant of Venice”. A computer randomly picks 10 words. | A student circles 10 words from 10 random pages. While this may look like random sampling, there is a high chance that the student will pick longer words and skip shorter words such as ‘a’, ‘is’, and ‘of’, and therefore overrepresent the longer words.
A biased sample tends to overrepresent or underrepresent certain values
[An example of how students can overrepresent]
Thus, longer words are overrepresented in the sample selected by this student. A consequence of this overrepresentation is that the sample mean, 6.2 characters, is larger than the population mean of 4.3 characters. According to figure 1.44, 23 of the 25 samples (92%) have a mean larger than 4.3 characters, suggesting that most students selected samples that tended to overrepresent longer words.
The sampling method employed in this example might be called subjective selection since students simply selected samples on the basis of their subjective opinion about how the word lengths varied within the population. This example illustrates that people, left to their own devices, are generally not good at selecting samples representative of the population.
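To get a feel for how much subjective selection can distort the result, here is a small simulation sketch. Everything in it is an assumption made for illustration: the word lengths are made up rather than taken from a real text, and "subjective selection" is modeled, very crudely, as picking each word with a chance proportional to its length (drawn with replacement to keep the code short).

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of word lengths (not from any real text).
word_lengths = ([1] * 10 + [2] * 40 + [3] * 50 + [4] * 50 + [5] * 30 +
                [6] * 20 + [7] * 15 + [8] * 10 + [9] * 5 + [10] * 5)
population_mean = mean(word_lengths)

def random_sample_mean(n=10):
    """Mean word length of a simple random sample of n words."""
    return mean(rng.sample(word_lengths, n))

def eye_catching_sample_mean(n=10):
    """Mean of n words picked with chance proportional to length.

    Assumed model of subjective selection: longer words catch the eye.
    """
    return mean(rng.choices(word_lengths, weights=word_lengths, k=n))

trials = 2000
random_avg = mean([random_sample_mean() for _ in range(trials)])
biased_avg = mean([eye_catching_sample_mean() for _ in range(trials)])

print(f"population mean:               {population_mean:.2f}")
print(f"average of random sample means: {random_avg:.2f}")
print(f"average of biased sample means: {biased_avg:.2f}")
```

Under this assumed model, the biased sample means sit well above the population mean on average, while the random sample means stay close to it.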
It is not always possible to select a sample at random. For example, if we want to know the average length of wild salmon, it is not possible to identify each one individually, select a few at random from the list, and then capture and measure those exact fish. When a sample cannot be selected at random, it is important to try to reduce bias as much as possible when selecting the sample.
Generating a (simple) random sample
One of the ways to select a random sample from a population begins with obtaining a list containing every member of the population and assigning a unique number to each individual in the list. This list is called the sampling frame.
For instance, if we want to pick 10 words at random from the Gettysburg Address, we first create the sampling frame. The first word in the Gettysburg Address, “Four,” is listed and assigned the number 1; the second word, “score,” is listed and assigned the number 2; and so on, through the last word, “earth,” which is assigned the number 268. So, each word in this list is assigned a number from 1 (the first word) to 268 (the last word). To select a random sample of 10 words, a random number generator on a calculator or a computer can be used to select 10 unique integers from 1 to 268, and the words with those numbers form the sample. An alternative is to simply ask a computer: give it the list of all the values and ask it for a random sample of a specified size, let’s say 10.
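Both approaches described above can be sketched in a few lines of Python using the standard `random` module. The number 268 comes from the text. The short word list stands in for the full address (only its opening words are used here), and the seed is arbitrary.

```python
import random

rng = random.Random(2024)  # arbitrary seed so the run is repeatable

# Approach 1: number the words 1..268 and pick 10 unique numbers.
FRAME_SIZE = 268  # number of words in the address, as stated in the text
sampling_frame = range(1, FRAME_SIZE + 1)
chosen_positions = sorted(rng.sample(sampling_frame, 10))
print("word numbers to look up:", chosen_positions)

# Approach 2: hand the computer the list itself and ask for a sample.
# Only the opening words are listed here, as a stand-in for the full text.
opening_words = "Four score and seven years ago our fathers brought forth".split()
chosen_words = rng.sample(opening_words, 3)
print("randomly chosen words:", chosen_words)
```

`random.sample` draws without replacement, so no position in the frame can be picked twice, which matches the requirement of 10 unique integers.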
So, does this method pick a sample that represents the population? Let’s see an example.
Using this method, each of 25 students selected a simple random sample of 10 words, recorded the length of each word in the sample, made a dot plot, and determined the mean of the sample. The means from these 25 samples form an empirical sampling distribution for the sample mean, shown in the dot plot in figure 1.49. The questions in Reflect 1.19 ask you to explore relationships among the sample means and the population mean. In 12 of these 25 samples the mean is below 4.3 characters, and in 12 of these 25 samples the mean is above 4.3 characters. This balance suggests that simple random sampling does not systematically overrepresent or underrepresent any segment of the population. In that sense, simple random sampling is fair and unbiased.
Wait. Let’s stop here for a moment and summarize what we have learned. Random sampling does not systematically overrepresent or underrepresent any segment of the population, and therefore it tends to represent the population better than other ways of sampling.
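We can mimic this classroom activity on a computer. The sketch below draws 25 simple random samples of 10 words each and counts how many sample means fall below and above the population mean. The population of word lengths is made up for illustration (it is not the Gettysburg Address), so the counts will not match the figures quoted above.

```python
import random
from statistics import mean

rng = random.Random(7)  # arbitrary seed so the run is repeatable

# Hypothetical population of word lengths (not from any real text).
word_lengths = ([1] * 10 + [2] * 40 + [3] * 50 + [4] * 50 + [5] * 30 +
                [6] * 20 + [7] * 15 + [8] * 10 + [9] * 5 + [10] * 5)
population_mean = mean(word_lengths)

# 25 "students", each taking a simple random sample of 10 words.
sample_means = [mean(rng.sample(word_lengths, 10)) for _ in range(25)]

below = sum(1 for m in sample_means if m < population_mean)
above = sum(1 for m in sample_means if m > population_mean)
equal = len(sample_means) - below - above

print(f"population mean: {population_mean:.2f}")
print(f"sample means below: {below}, above: {above}, equal: {equal}")
```

Individual sample means still vary from one sample to the next. What matters is that they are not pushed consistently to one side of the population mean.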