Mean or Median - which one to use?

Mean vs. median as measures of the center of a distribution

Now that we have learned about the mean, the median, and their corresponding measures of spread, which one should we use?

The decision mainly depends on two factors:

The shape of the distribution
The presence or absence of outliers

How the shape of the distribution affects the choice between mean and median

The shape of the distribution influences which summary measure is most appropriate for describing the center of a distribution for quantitative data.

The dot plots in Figure 1.18 illustrate some of the common shapes that occur in distributions for quantitative data.

[Graphs here]

When do we prefer the mean?

In a unimodal symmetric distribution, the mean and median should be similar, and the mean is the preferred measure because it is calculated from every value in the data.
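As a quick illustration of the symmetric case, here is a minimal Python sketch. The values are made up for this illustration (they are not the study time data from example 1.2) and are arranged symmetrically around 60.

```python
from statistics import mean, median

# Hypothetical values, chosen to be symmetric around 60
# (each value below 60 is mirrored by one the same distance above it).
symmetric = [50, 55, 58, 60, 60, 62, 65, 70]

symmetric_mean = mean(symmetric)
symmetric_median = median(symmetric)

# For symmetric data, the two measures of center agree.
print("mean:", symmetric_mean)
print("median:", symmetric_median)
```

Both measures land on 60 here, so the choice between them does not change the description of the center.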

When do we prefer the median?

The graphical displays that we have examined for the study time data in example 1.2 show several study times tailing off below 50 minutes, with no corresponding values at the higher end of the distribution. A distribution with this feature is referred to as “skewed left,” a phrase that describes the shape of the distribution. The median study time is 61 minutes, and the mean study time is 57 minutes. In this case, the mean has been pulled toward the low values in the tail more than the median has. In general, skewness in a distribution tends to pull the mean in the direction of the skew. For this reason, in a skewed distribution, the median is more commonly used to characterize a representative data value.
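The pull of a left tail on the mean can be seen in a small Python sketch. The list below is hypothetical and only mimics the general shape described above; it is not the actual data from example 1.2.

```python
from statistics import mean, median

# Hypothetical left-skewed values: most are clustered near 60,
# with a few low values forming a tail on the left.
skewed_left = [20, 25, 45, 55, 58, 60, 61, 62, 63, 65, 66]

skewed_mean = mean(skewed_left)
skewed_median = median(skewed_left)

# The low values in the tail drag the mean below the median.
print("mean:", round(skewed_mean, 1))
print("median:", skewed_median)
```

The median stays with the cluster of typical values, while the mean is dragged toward the tail.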

What are outliers?

In the data that we have been examining for example 1.2, the study times of two students, 20 and 25 minutes, stand out as much lower than the study times of the other students. Data values that stand apart from the majority of the data in a set may be classified as outliers. Outliers can affect the values of various numerical summaries and may give a misleading picture of the characteristics of a distribution for quantitative data.

Why do we need to identify outliers?

Identifying outliers is important for two reasons:

Some outliers are due to measuring, recording, or copying errors. If an error can be detected, then it can be corrected.
Other outliers are legitimate values and provide useful information about the distribution for addressing the question at hand. Even so, they may not be representative of the overall data and can make it difficult to interpret the center and spread of the distribution. For example, let’s say a typical student spends 1 hour daily on homework but there is one student who spends 6 hours a day on homework. While this is not an error, it might still make sense to exclude this student from the data because they are not representative of the group and can distort our analysis. Similarly, if there is one dog that is as big as a horse, we could exclude this dog too.
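The homework example can be sketched in Python. The class size is an assumption made for this illustration: nine hypothetical students at 1 hour each, plus the one student at 6 hours.

```python
from statistics import mean, median

# Assumed class for illustration: nine typical students (1 hour each)
# and one student who spends 6 hours a day on homework.
typical_hours = [1] * 9
all_hours = typical_hours + [6]

mean_with = mean(all_hours)
median_with = median(all_hours)
mean_without = mean(typical_hours)
median_without = median(typical_hours)

print("with the 6-hour student:    mean =", mean_with, " median =", median_with)
print("without the 6-hour student: mean =", mean_without, " median =", median_without)
```

Under these assumed numbers, the single legitimate value raises the mean from 1 hour to 1.5 hours, even though nine of the ten students spend exactly 1 hour.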

How do we identify outliers?

There are some general rules you can use to identify outliers, but we do not need to know them at this time. For the purpose of this lesson, we will simply pick out the two values that lie very far from the rest of the values on the dot plot.

[table here]

In the study time example, the study times of 20 and 25 minutes are both outliers, and the smallest value that is not an outlier is 35 minutes. The upper end of the distribution has no outliers. Figure 1.22 shows the values for various numerical summaries for the original 28 study times and for the corresponding 26 values when the 2 outliers are removed from the data.

How do outliers affect the numerical summaries? Should we use the mean or the median?

Let’s see how the numerical summaries change when we remove the outliers.

Notice that deleting the outliers has some impact on almost all numerical summaries. Comparing the changes in the mean (57 to 59.7) and the median (61 to 62), we see that the deletion of the outliers has less impact on the median. This illustrates the fact that the median is more resistant to the effects of outliers than the mean. For this reason, we generally use the median to measure the center of a distribution when there are outliers in the data.
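The change in the mean can be checked with a little arithmetic, using only the numbers quoted in this lesson (28 study times, a mean of 57 minutes, and outliers of 20 and 25 minutes). The Python sketch below assumes the reported mean of 57 is close to the exact value; because that figure is rounded, the result is approximate.

```python
# Figures quoted in the lesson.
n_all = 28              # number of study times
mean_all = 57           # reported (rounded) mean, in minutes
outliers = [20, 25]     # the two low study times

# Approximate total study time, recovered from the rounded mean.
total_all = n_all * mean_all

# Remove the outliers from both the total and the count.
total_trimmed = total_all - sum(outliers)
n_trimmed = n_all - len(outliers)

mean_trimmed = total_trimmed / n_trimmed
print(round(mean_trimmed, 1))  # about 59.7
```

Removing two low values out of 28 shifts the mean by nearly 3 minutes, while the median moves by only 1 minute (61 to 62).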