Data, distribution and statistics
Why do we need Statistics?
In mathematics, most questions have definite answers. But in real life, even a simple question such as “How much time does it take you to go to school by bus?” may not have a definite answer. Sometimes it takes 7 minutes and sometimes it takes 9 minutes. On other days, it takes 8 minutes. Yesterday, it took 10 minutes. The answers to this question have variability, i.e. they vary from day to day. If there were no variability and the bus ride took exactly 9 minutes every day, then we would not need statistics. But because data varies, we need a separate branch of mathematics called statistics to answer questions about the data.
What is a statistical question?
The question "How much time does it take you to go to school by bus?" has variability in its answers. Such questions are called statistical questions. We'll learn more about them next.
Statistics is a problem-solving process that starts with a statistical question. A statistical question is one where you expect variability in the answers, and answering it requires data. Here are some examples of statistical questions:
How many hours do you sleep every day?
How many minutes do students in your class spend on homework?
What is the favorite food of your class?
In a presidential election, do potential voters support Joe Biden?
How do the annual salaries for men and women in similar occupations compare?
And here are some examples of questions that are not statistical:
Where in town does our math teacher live?
How many minutes of recess do sixth-grade students have each day?
These questions are not statistical because their answers do not vary. The math teacher lives in one particular location, and recess is the same length each day, let’s say 20 minutes.
What is a variable?
To answer a statistical question, we need data. Data consist of observations or measurements on a variable. In statistics, a variable is a characteristic that may be different from one individual to another or from one instance to another.
For example, in the statistical question “How many minutes do students in your class spend on homework?”, the number of minutes students spend on homework is called a variable because its measurement varies from one individual to another. In the case of the statistical question “How many hours do you sleep every day?”, the number of hours is a variable because its measurement will change from one day to another.
Numerical and categorical variables
The answers to both of these questions, “How many minutes do students in your class spend on homework?” and “How many hours do you sleep every day?”, are numerical. The number of minutes spent on homework is a quantity such as 20 minutes, 60 minutes, and so on. Similarly, the number of hours you sleep is also a quantity. How do you know they are quantities? Well, if you add any two quantities, you get a third quantity. If you sleep 8 hours today and 7 hours tomorrow, you would sleep a total of 15 hours in two days. So adding two quantities makes sense.
However, the answer to the question “What is the favorite food of your class?” is not numerical. It could be Pizza, Burger, Sandwich, or any other food. Similarly, the answers to the question “In a presidential election, do potential voters support Joe Biden?” are Yes, No, or Maybe. Variables like these are called categorical variables because, rather than quantities, their values are categories such as ‘Pizza’, ‘Burger’, ‘Yes’, ‘No’, and so on.
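The "does adding make sense?" check can be tried out in a few lines of Python. This is only a sketch: the sleep hours come from the example above, while the list of favorite-food answers and all variable names are made up for illustration.

```python
from collections import Counter

# Numerical variable: hours of sleep on two days (values from the example above).
sleep_hours = [8, 7]

# Categorical variable: hypothetical favorite-food answers, invented for illustration.
favorite_foods = ["Pizza", "Burger", "Pizza"]

# Adding numerical measurements gives a meaningful quantity.
total_sleep = sum(sleep_hours)

# Categories cannot be added in a meaningful way, but they can be counted.
food_counts = Counter(favorite_foods)

print("Total hours of sleep in two days:", total_sleep)
print("Number of students per food:", dict(food_counts))
```

For a numerical variable, the natural operation is arithmetic. For a categorical variable, the natural operation is counting how many answers fall into each category.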
Is zip code a numerical variable?
Now that you understand the difference between numerical and categorical variables, let me ask you a tricky question. You are running a survey and you ask each person for their home zip code.
Is zip code a numerical or a categorical variable? [Hint: Check whether it makes sense to add two measurements.]
Let’s say you conduct a small survey in your class to find out which breeds of dog your classmates own and how much each dog weighs. Here is the survey you asked your classmates to fill out.
[Survey goes here]
And below is the data you collected from the survey.
Student Name | Breed of Dog | Weight of the dog in lbs |
Jason | Golden retriever | 6 |
Katie | Pug | 5.5 |
Hmm, that is quite a lot of data. What do we do with it? There seem to be quite a few Golden Retrievers. What about the weights of the dogs? They vary quite a bit. It can get a bit overwhelming.
What is a distribution?
When we analyze data, we are often interested in the distribution of the data. The distribution of the data is information that shows all the data values and how often they occur. The table above shows us the distribution of the data, but it is difficult to analyze the data in this raw form. So we often organize and summarize a distribution using an appropriate graph, a table, or numerical summaries.
Graphs and tables help us to visualize the distribution and see patterns and special features. Numerical summaries give us some key numbers which can tell us the most important information about the distribution. Together, they give us powerful tools to analyze a distribution.
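The idea of a distribution as "all the data values and how often they occur" can be sketched as a small frequency table. The homework times below are hypothetical values invented for illustration, and `homework_minutes` and `frequency` are names chosen only for this sketch.

```python
from collections import Counter

# Hypothetical homework times in minutes for ten students (invented for illustration).
homework_minutes = [20, 30, 30, 45, 20, 60, 30, 45, 30, 90]

# Count how often each value occurs.
frequency = Counter(homework_minutes)

# Print one row per distinct value, from smallest to largest.
for value in sorted(frequency):
    print(f"{value:>3} minutes: {frequency[value]} student(s)")
```

Each row pairs a data value with the number of times it occurs, which is easier to read than the raw list of ten measurements.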
We are particularly interested in three things in a distribution:
What a typical value of the distribution is. This is also called the “center of the distribution” or a “measure of central tendency”. In the homework example, we would be interested in the typical time students spend on homework.
How much the data varies from one student to another. This is often called the “spread of the distribution” or the “variability in the data”. In the homework example, we would want to know how the time spent on homework varies from one student to another.
The general shape of the distribution: things such as “most students spend more than 20 minutes a day on homework” or “only two students spend more than 60 minutes a day on homework”.
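As a rough sketch of these three ideas, the snippet below computes a few simple summaries with Python's standard `statistics` module. The homework times are hypothetical, and the choice of mean and median for the center and of range for the spread is an assumption made for illustration. They are common options, not the only ones.

```python
import statistics

# Hypothetical homework times in minutes for ten students (invented for illustration).
homework_minutes = [20, 30, 30, 45, 20, 60, 30, 45, 30, 90]

# Center: two common ways to describe a typical value.
mean_minutes = statistics.mean(homework_minutes)
median_minutes = statistics.median(homework_minutes)

# Spread: the range is the gap between the largest and the smallest value.
range_minutes = max(homework_minutes) - min(homework_minutes)

# Shape: simple counts of how many students fall above a given time.
more_than_20 = sum(1 for m in homework_minutes if m > 20)
more_than_60 = sum(1 for m in homework_minutes if m > 60)

print("Mean:", mean_minutes, "Median:", median_minutes)
print("Range:", range_minutes)
print("Students above 20 minutes:", more_than_20)
print("Students above 60 minutes:", more_than_60)
```

With this made-up data, the mean (40) is larger than the median (30) because the single 90-minute value pulls the mean upward, which is itself a hint about the shape of the distribution.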
The visual representation (graph or table) of the distribution and the numerical summaries help us answer these questions. In particular, we will learn about frequency distribution tables, dot plots, and histograms as ways to represent a distribution.
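To give a feel for what a dot plot shows, here is a minimal text-only sketch that prints one mark per student next to each value. It is drawn sideways (values run down the page rather than along a horizontal axis), and the data and names are hypothetical, invented for illustration.

```python
from collections import Counter

# Hypothetical homework times in minutes for ten students (invented for illustration).
homework_minutes = [20, 30, 30, 45, 20, 60, 30, 45, 30, 90]

frequency = Counter(homework_minutes)

# Build one line per distinct value, with one "*" for each student at that value.
dot_plot_lines = [
    f"{value:>3} | " + "*" * frequency[value]
    for value in sorted(frequency)
]

print("\n".join(dot_plot_lines))
```

The longest row marks the most common value, and rows far from the others stand out at a glance.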
If we know about the center, spread, and shape of the distribution, then we know how the data set behaves, even if we do not have complete information about the data. So, a lot of our lessons will center around understanding these three important features. The branch of statistics that focuses on describing features of a data set by generating summaries and graphs is called descriptive statistics.