Center and spread of a distribution - Median and interquartile range (IQR)

Median as the center of distribution

In the previous lessons, we learned about the mean as the center of a distribution. We interpreted the mean in terms of "equal or fair reallocation". We also saw that the mean corresponds to the balance point of the distribution, i.e. the total distance from the mean of the points on its left is the same as the total distance from the mean of the points on its right.

You might, however, have noticed that the number of points on the left of the mean was not the same as the number on the right. In fact, in the last example, the number of points on the …… of the mean was greater than the number on the …… of the mean. And you might have wondered, “Shouldn’t the center of the distribution have an equal number of points to the left and to the right of it?”

Well, good news! There is another way to think about the center of a distribution, whereby we identify a value with approximately half the data on each side. This quantity is called the median (also called the 50th percentile), and it is the “middle value” when the data have been arranged in order. Half of the values in a data set are less than or equal to the median, and half of the values are greater than or equal to the median.

Calculating median

Great! So, how do we find the median?

To find the median, we order the data values from least to greatest and find the number in the middle. Suppose we have 5 dogs whose weights, in pounds, are shown in the table.

20, 25, 32, 40, 55

The median weight for this group of dogs is 32 pounds because three dogs weigh 32 pounds or less and three dogs weigh 32 pounds or more.
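The steps above can be sketched in a few lines of Python. This is only an illustration using the dog weights from this lesson; the variable names are assumptions made for the example, and the middle-index shortcut only works when the number of values is odd.

```python
# Dog weights in pounds, taken from the lesson.
dog_weights = [20, 25, 32, 40, 55]

# Step 1: order the values from least to greatest.
ordered = sorted(dog_weights)

# Step 2: pick the value in the middle (valid only for an odd count).
middle_index = len(ordered) // 2
median = ordered[middle_index]

# Check the "half on each side" property described above.
at_or_below = sum(1 for w in ordered if w <= median)
at_or_above = sum(1 for w in ordered if w >= median)

print(median, at_or_below, at_or_above)  # 32 3 3
```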

Calculating the median when there is an even number of values

Now suppose we have 6 cats whose weights, in pounds, are as shown in the table. Notice that there are two values in the middle: 7 and 8.

4, 6, 7, 8, 10, 10

The median weight must be between 7 and 8 pounds because half of the cats weigh less than or equal to 7 pounds and half of the cats weigh more than or equal to 8 pounds. In general, when we have an even number of values, we take the number exactly in between the two middle values. In this case, the median cat weight is 7.5 pounds because (7+8)/2 = 7.5.
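Both cases (odd and even counts) can be handled by one small function. The sketch below is a hedged illustration; the name median_of is hypothetical, and Python's standard statistics.median is shown alongside only as a cross-check.

```python
import statistics


def median_of(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the middle.
        return ordered[mid]
    # Even count: take the number exactly between the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


cat_weights = [4, 6, 7, 8, 10, 10]  # from the lesson
dog_weights = [20, 25, 32, 40, 55]  # from the lesson

print(median_of(cat_weights))          # 7.5
print(median_of(dog_weights))          # 32
print(statistics.median(cat_weights))  # 7.5
```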

Median on the dot plot

In the dot plot, you can see that there are x number of points both to the right and to the left of the median.

The median study time of 61 minutes is representative of the study times in that the number of study times that are less than 61 minutes is the same as the number of study times that are greater than 61 minutes. However, the study times below the median are generally more spread out than those above the median. That is, the data in the lower half have more variability than the data in the upper half. So, when the data are ordered, the variability in the lower half can be different from the variability in the upper half.

Mean and median - what do they tell us?

So, both the mean and the median are ways of measuring the center of a distribution. However, they tell us slightly different things.

The dot plot shows the weights of 30 cookies. The mean weight is 21 grams (marked with a triangle). The median weight is 20.5 grams (marked with a diamond). The mean tells us that if the weights of all cookies were distributed so that each one weighed the same, that weight would be 21 grams. We could also think of 21 grams as a balance point for the weights of all of the cookies in the set.

The median tells us that half of the cookies weigh more than 20.5 grams and half weigh less than 20.5 grams. In this case, both the mean and the median could describe a typical cookie weight because they are fairly close to each other and to most of the data points.

Mean and median when the distribution is not symmetric

Here is a different set of 30 cookies. It has the same mean weight as the first set, but the median weight is 23 grams. In this case, the median is closer to where most of the data points are clustered and is therefore a better measure of the center for this distribution. That is, it is a better description of a typical cookie weight. The mean weight is influenced (in this case, pulled down) by a handful of much smaller cookies, so it is farther away from most data points.

In general, when a distribution is symmetrical or approximately symmetrical, the mean and median values are close. But when a distribution is not roughly symmetrical, the two values tend to be farther apart. Also, when there are a handful of extreme or unusual values (either much smaller or much bigger than the rest), the mean is affected much more than the median.
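To see how differently the two measures react to an extreme value, here is a small sketch. The first list is the dog weights from earlier in this lesson; the second list is hypothetical, with the heaviest dog replaced by an invented 155-pound dog purely for illustration.

```python
import statistics

original = [20, 25, 32, 40, 55]       # dog weights from the lesson
with_extreme = [20, 25, 32, 40, 155]  # hypothetical: one invented extreme value

print(statistics.mean(original), statistics.median(original))          # 34.4 32
print(statistics.mean(with_extreme), statistics.median(with_extreme))  # 54.4 32
```

In this made-up case the mean moves by 20 pounds while the median does not move at all.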

Something to think about: When should we choose the mean, and when should we choose the median, to measure the center of a distribution? [Don't think too much! We'll learn about this in the next chapter]

Interquartile range (IQR) to measure the variability of the distribution

Earlier we learned that the mean is a measure of the center of a distribution and the MAD is a measure of the variability (or spread) that goes with the mean. There is also a measure of spread that goes with the median. It is called the interquartile range (IQR).

Finding quartiles

Finding the IQR involves splitting a data set into fourths. Each of the three values that splits the data into fourths is called a quartile. The median (M) is used to divide the data into two halves - the lower half and the upper half. The first quartile (Q1) and third quartile (Q3) are the quantities that divide each half of the data in half. The median is also called the second quartile (Q2).

For example, here is a data set with 11 values.

[Data here]

The median is 33, and it divides the data into two halves. The first quartile is 20. It is the median of the numbers that are less than 33 (represented in red). The third quartile is 40. It is the median of the numbers that are greater than 33 (represented in blue).
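The same procedure can be sketched in Python. The example below reuses the cat and dog weights from earlier in this lesson rather than the 11-value data set. The function name quartiles is hypothetical, and the sketch assumes this lesson's convention of leaving the median out of both halves when the count is odd.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3), leaving the median out of both halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to form two halves")
    half = n // 2
    lower = ordered[:half]      # values below the middle position
    upper = ordered[n - half:]  # values above the middle position
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


print(quartiles([4, 6, 7, 8, 10, 10]))  # (6, 7.5, 10)
print(quartiles([20, 25, 32, 40, 55]))  # (22.5, 32, 47.5)
```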

Range and interquartile range (IQR)

The difference between the maximum and minimum values of a data set is called the range. Here, the range is .. - … = ….

The difference between Q3 and Q1 is called the interquartile range (IQR). Because the distance between Q1 and Q3 includes the middle two-fourths of the distribution, the values between those two quartiles are sometimes called the middle half (or middle 50%) of the data. Here, the IQR is calculated as Q3 - Q1 = 40 - 20 = 20.

The IQR indicates that any two values in the middle 50% of the data differ by at most 20. The IQR provides an additional measure of variability for a distribution and is used when the median is chosen as the measure of center. The bigger the IQR, the more spread out the middle half of the data values are. The smaller the IQR, the closer together the middle half of the data values are. This is why we can use the IQR as a measure of spread.
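As a rough illustration, the range and the IQR can be computed side by side. The sketch below uses the cat and dog weights from earlier in this lesson. The function name range_and_iqr is hypothetical, and the quartiles are found with the median left out of both halves, which is the convention assumed in this lesson.

```python
import statistics


def range_and_iqr(values):
    """Return (range, IQR), with quartiles taken from halves that omit the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])      # median of the lower half
    q3 = statistics.median(ordered[n - half:])  # median of the upper half
    data_range = ordered[-1] - ordered[0]       # maximum minus minimum
    return data_range, q3 - q1


print(range_and_iqr([4, 6, 7, 8, 10, 10]))  # (6, 4)
print(range_and_iqr([20, 25, 32, 40, 55]))  # (35, 25.0)
```

The dog weights have the bigger IQR, which matches the idea that the middle half of the dog weights is more spread out than the middle half of the cat weights.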

There are several different methods for determining quartiles. For example, when n is odd, the ordered data cannot be evenly divided in half, so methods differ on whether the median is counted in both halves or in neither. For this lesson, we exclude the median from the lower and upper “halves” when determining the quartiles.
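To show how much the choice of method can matter, the sketch below compares two common conventions on the dog weights from earlier in this lesson. The function name and the include_median flag are hypothetical, introduced only for this comparison; other methods exist and may give other values.

```python
import statistics


def q1_and_q3(values, include_median=False):
    """Return (Q1, Q3) under one of two conventions for odd-sized data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        # Alternative convention: the median belongs to both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # This lesson's convention: the median belongs to neither half.
        lower, upper = ordered[:half], ordered[n - half:]
    return statistics.median(lower), statistics.median(upper)


dog_weights = [20, 25, 32, 40, 55]
print(q1_and_q3(dog_weights))                       # (22.5, 47.5)
print(q1_and_q3(dog_weights, include_median=True))  # (25, 40)
```

When the number of values is even, there is no single middle value to include or exclude, so both conventions give the same quartiles.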