Center and spread of a distribution - Mean and Mean Absolute Deviation (MAD)
In the last lessons, we represented distributions using dot plots and histograms, and we interpreted those graphs to get an idea of the "typical value" and the spread of a distribution. We also talked about the general shape and features of a distribution, such as peaks, symmetry, and clusters.
We also looked at two distributions, the weight in kg of cats and the weight in kg of dogs, and estimated the typical value, or center, of each distribution. We also made statements about the spread, such as “the distribution of the weight of dogs is more spread out than that of cats”. In this lesson, we want to be more precise when we talk about center and spread. Specifically, we want to quantify each of those measures as a single number so that we can easily describe the center and spread of any distribution.
Mean as the "center of distribution"
Because numerical data in a distribution vary, it is useful to identify a single numerical summary to represent the data. A quantity centrally located in the distribution often provides a representative value of the data. But how do we define the center of a distribution? There are multiple ways of thinking about the center of a distribution of numerical data.
One numerical summary that provides information about the center of a distribution is the mean of the data. A common interpretation of the mean is that it is the arithmetic average of the data: you add all the values in the data set and divide the sum by the number of data points. Although this interpretation tells us how to calculate the mean, it does not describe what the mean represents for the distribution. So, I’d like to think of the mean in two important ways:
- Understanding the mean in terms of “fair share” or “leveling out”
- Understanding the mean as the balance point of a distribution
Let's explore them in detail.
Understanding the mean in terms of “fair share” or “leveling out”
One way to think about the mean is in terms of “fair share” or “leveling out.” That is, a mean can be thought of as a number that each member of a group would have if all the data values were combined and distributed equally among the members.
For example, suppose there are 5 bottles that hold the following amounts of water: 1 liter, 4 liters, 2 liters, 3 liters, and 0 liters. We want to find the typical amount of water in a bottle.
To find the mean, first, we add up all of the values. We can think of this as putting all of the water together: 1+4+2+3+0 = 10. To find the “fair share,” we divide the 10 liters equally into the 5 containers: 10÷5=2. So, if all bottles were to have an equal share of water in them, the amount in each would be 2 liters.
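The same "fair share" arithmetic can be sketched in a few lines of Python. This is only an illustration of the bottle example above; the variable names are made up for this sketch.

```python
# Amounts of water in the five bottles, in liters (values from the example above).
bottles = [1, 4, 2, 3, 0]

# "Pour all the water together": add up every value.
total = sum(bottles)

# "Share it out equally": divide the total by the number of bottles.
fair_share = total / len(bottles)

print(total)       # 10
print(fair_share)  # 2.0
```

The result, 2.0, is the mean: the amount each bottle would hold if the 10 liters were leveled out across all 5 bottles.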
For the distribution of the weight of dogs, the total weight of the dogs is …. kg. We can calculate the mean by dividing the total by the number of dogs, i.e. …
Here, the mean says that if all the dogs had the same weight while the total weight of the dogs stayed the same, each dog would weigh ……. kg. In other words, the weight of a typical dog in the group is … kg.
Understanding the mean as the balance point of a distribution
Another way to think about the mean is as a number that corresponds to the balance point of the distribution. This is an important perspective as it helps us to understand the mean in terms of its graph.
Let's look at a distribution represented by the table below. The mean has been calculated as:
In the table, you can see that the total sum of the distances to the left (marked in red) and to the right (marked in red) of the mean is the same. So, the mean seems to be the point that "balances the distribution."
The balance point on a dot plot
Let's interpret this definition of mean in terms of a dot plot.
From this point of view, the mean is the point for which the total distance to all the points on its left is the same as the total distance to all the points on its right.
[Distribution goes here]
The dot plot is symmetrical. The mean can be calculated as:
On the dot plot, the red vertical line represents the mean. The black horizontal lines represent the distances from the mean to the points located to the left of the mean. The blue horizontal lines represent the distances from the mean to the points located to the right of the mean. If you add together the lengths of all the black lines, the total will be the same as the sum of the lengths of the blue lines. In this sense, the data points balance at 21, so 21 is the center of the distribution.
Now let's look at a distribution that is not symmetrical.
[Distribution goes here]
The dot plot above is not symmetrical: there are many more values on the left part of the dot plot than on the right, with a few values stretching out to the right. (A distribution whose longer tail is on the right is usually described as skewed to the right.) The mean can still be calculated as:
The red vertical line represents the mean. Just like in the first dot plot, the sum of the lengths of all the black lines is equal to the sum of the lengths of the blue lines. In this sense, the data points balance at 21, and 21 can therefore be considered the center of the distribution. We could then say that the distribution of the cookies has a center at 21 because that is its balance point, and that the eight friends, on average, baked 21 cookies.
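The balance-point idea can be checked numerically. The sketch below uses a hypothetical set of eight values chosen so that the mean is 21; these are not the values from the dot plot, which is not reproduced here.

```python
# Hypothetical data: cookies baked by eight friends.
# These numbers are made up for illustration; they are NOT taken from the dot plot.
cookies = [18, 19, 19, 20, 20, 21, 24, 27]

mean = sum(cookies) / len(cookies)

# Total distance from the mean to the points on its left (the "black lines").
left_total = sum(mean - x for x in cookies if x < mean)

# Total distance from the mean to the points on its right (the "blue lines").
right_total = sum(x - mean for x in cookies if x > mean)

print(mean)         # 21.0
print(left_total)   # 9.0
print(right_total)  # 9.0
```

Even though this data set is not symmetrical, the two totals match. Changing any value moves the mean so that the distances on both sides balance again.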
Something to think about: Are there an equal number of points to the left and to the right of the mean?
Does mean alone define the center of distribution?
In the last lesson, we learned to find the typical value, or center, of a distribution. We calculated the mean from the data as the arithmetic average and interpreted it both as a “fair share” and as the “balance point of the distribution”. But is knowing the mean enough to draw a meaningful conclusion about a distribution? Let’s look at two different distributions with the same mean.
[two dot plot goes here]
The first dot plot shows the weights, in grams, of 22 cookies. The mean weight is 21 grams. All the weights are within 3 grams of the mean, and most of them are even closer. These cookies are all fairly close in weight.
The second dot plot shows the weights, in grams, of a different set of 30 cookies. The mean weight for this set of cookies is also 21 grams, but some cookies are half that weight, and others are one-and-a-half times that weight. There is a lot more variability in the weight in this distribution.
So, even though the mean is the same for both distributions, the two distributions are quite different. One has more variability than the other and the mean alone does not give this information. Similarly, the mean alone does not tell you about the total number of cookies in a distribution. Here, both distributions had a mean of 21, but the first distribution had a total of 22 cookies, and the second had 30 cookies.
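To see how much the mean leaves out, here is a small sketch with two hypothetical data sets (made up for illustration, not the cookie weights from the dot plots). Both have a mean of 21, but they differ in size and in how far the values reach from the mean.

```python
def mean(values):
    """Arithmetic average: sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Hypothetical data sets, chosen so that both have a mean of 21.
tight = [19, 20, 21, 21, 22, 23]
wide = [11, 15, 18, 21, 21, 24, 27, 31]

for name, data in [("tight", tight), ("wide", wide)]:
    # Print the mean, the number of values, and the smallest and largest value.
    print(name, mean(data), len(data), min(data), max(data))

# tight 21.0 6 19 23
# wide 21.0 8 11 31
```

The mean is identical in both rows of output, while the count and the smallest and largest values are not.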
Statistics needs context
Statistics differs from other branches of mathematics in that it requires a different kind of thinking: data are not just numbers, they are numbers with a context. A mean of 21 does not tell us much without knowing the variability, size, and shape of the distribution.
Measuring variability using Mean Absolute Deviation (MAD)
As we saw in the last lesson, to give context to the mean, we need to understand the variability, or spread, of the distribution. But how do we define variability? And can we come up with a single number that describes the variability of a distribution?
One way of calculating variability is by measuring how far away or how spread-out the data points generally are from the mean. This measure of variability is called the Mean Absolute Deviation (MAD).
For instance, the point that represents 18 grams is 3 units away from the mean of 21 grams. We can find the distance between each point and the mean of 21 grams and organize the distances into a table, as shown.
[Table goes here]
The values in the first row of the table are the cookie weights for the first set of cookies. Their mean is 21 grams. The values in the second row of the table are the distances between the values in the first row and the mean of 21. The mean of these distances is the MAD of the cookie weights. So, the MAD for the distribution is xxx
So, a MAD of xxx means that, on average, the data points are xxx away from the mean. The bigger the number, the more spread out the data are.
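The two-row table procedure (values, then distances from the mean, then the mean of those distances) can be sketched as follows. The weights below are hypothetical and are not the values from the table; the function names are also just choices made for this sketch.

```python
def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)

def mean_absolute_deviation(values):
    """Average distance between each value and the mean of the values."""
    center = mean(values)
    # "Second row of the table": distance of each value from the mean.
    distances = [abs(x - center) for x in values]
    # The MAD is the mean of those distances.
    return mean(distances)

# Hypothetical cookie weights in grams (NOT the values from the table).
weights = [18, 19, 20, 21, 21, 22, 23, 24]

print(mean(weights))                     # 21.0
print(mean_absolute_deviation(weights))  # 1.5
```

For this made-up data, the distances are 3, 2, 1, 0, 0, 1, 2, 3, which add up to 12. Dividing by the 8 data points gives a MAD of 1.5 grams.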
Something to think about:
Which distribution is more spread out, as calculated in terms of MAD?
A distribution with the mean of 21 and the MAD of xxx
A distribution with the mean of 42 and the MAD of xxx
So, what does the MAD actually mean?
Let’s look at the two dot plots below:
[dot plots go here]
In both dot plots, the mean is 21, i.e. the total length of the horizontal lines to the left of the mean equals the total length of the horizontal lines to the right of it. That is what makes the mean the balance point. However, as you can clearly see, the total length of the horizontal lines in the second dot plot is much larger than in the first. We can already see that the data points in the second dot plot are more spread out. But by how much exactly? To answer that, we can calculate the arithmetic average of the distances.
We first add all the distances and then divide the total by the number of data points. This gives the arithmetic average of the distances of all the points from the mean. It is called absolute because we do not care whether a data point lies to the left or to the right of the mean; we are only concerned with the size (absolute value) of each distance. This average distance from the mean is called the Mean Absolute Deviation, or MAD.
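Written as formulas, the description above looks like this. The symbols are introduced here only for illustration: $n$ is the number of data points, $x_1, \dots, x_n$ are the data values, and $\bar{x}$ is their mean.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
\]
```

The absolute value bars turn every deviation into a plain distance, so points to the left and to the right of the mean both count as positive amounts.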
So, let’s see what MAD means in the context of the two sets of cookies. In the first set of cookies, the distances are all between 0 and 3. The MAD is 1.2 grams, which tells us that the cookie weights are typically within 1.2 grams of 21 grams. We could say that a typical cookie weighs between 19.8 and 22.2 grams.
In the second set of cookies, the distances are all between 0 and 13. The MAD is 5.6 grams, which tells us that the cookie weights are typically within 5.6 grams of 21 grams. We could say a typical cookie weighs between 15.4 and 26.6 grams.
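The "typical cookie" ranges quoted above come from stepping one MAD below and one MAD above the mean. A minimal sketch, using the mean and MAD values stated in the text (the helper name `typical_range` is made up for this example):

```python
def typical_range(mean, mad):
    """Return (mean - MAD, mean + MAD), rounded to one decimal place."""
    return (round(mean - mad, 1), round(mean + mad, 1))

# First set of cookies: mean 21 grams, MAD 1.2 grams.
print(typical_range(21, 1.2))  # (19.8, 22.2)

# Second set of cookies: mean 21 grams, MAD 5.6 grams.
print(typical_range(21, 5.6))  # (15.4, 26.6)
```

The rounding is only there to keep the printed values tidy; it does not change the idea that a larger MAD gives a wider range around the same mean.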
The MAD is a measure of the variability of the distribution. In these examples, it is easy to see that a higher MAD suggests a distribution that is more spread out, showing more variability.
Something to think about:
Is there more variability in the data below or above the mean?
Can two different data sets have the same mean and the same MAD?