One of the most important steps of any data-related project is Exploratory Data Analysis (EDA). It is crucial to explore the distribution of the data to understand it and efficiently decide on the next steps. A straightforward way to explore data distribution is to study its central tendencies through measures of centrality. In this article, we’ll explore the three primary measures of centrality: Mean, Median, and Mode. We’ll discuss their strengths and weaknesses, along with practical examples using SQL and Python.
The Mean: The Average Joe
The mean, often referred to as the average, is calculated by summing all the values in a dataset and dividing by the number of values. It’s a straightforward way to find a central value.
Let’s say we have a table named Sales with a column Revenue. The following query will give us the mean revenue per sale:
```sql
SELECT AVG(Revenue) AS MeanRevenue
FROM Sales;
```
When using Python, we can obtain the same result by importing Pandas and running the following code:
```python
import pandas as pd

# load your data into a pandas DataFrame named df
mean_revenue = df['Revenue'].mean()
```
Strengths and Weaknesses
The mean leaves no man behind, meaning every data point is taken into consideration. Although this allows it to give us a holistic view of the data, it makes it highly susceptible to outliers.
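To see this in practice, here’s a minimal sketch using a small made-up list of revenues (not the Sales table above), showing how a single extreme value drags the mean away from the bulk of the data:

```python
import pandas as pd

# four typical sales plus one extreme outlier (hypothetical values)
revenue = pd.Series([100, 110, 105, 95, 10_000])

print(revenue.mean())      # 2082.0 -- pulled far above the typical values by the outlier
print(revenue[:4].mean())  # 102.5  -- the mean without the outlier
```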
The Median: The Middle Ground
The median is the middle value when the data is sorted, meaning there is an equal number of data points before and after it.
Note: If there’s an even number of observations, the median is obtained by averaging the two middle values.
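As a quick illustration with a made-up four-value series, pandas averages the two middle values for us:

```python
import pandas as pd

# four observations: once sorted, the two middle values are 20 and 30
values = pd.Series([10, 20, 30, 40])
print(values.median())  # 25.0 -- the average of 20 and 30
```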
We can obtain the median revenue using the following SQL query:
```sql
-- PERCENTILE_CONT(0.5) computes the median across all rows;
-- DISTINCT collapses the repeated per-row result to a single value
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Revenue) OVER() AS MedianRevenue
FROM Sales;
```
Using Python, however, obtaining the median is much more straightforward:
```python
median_revenue = df['Revenue'].median()
```
Strengths and Weaknesses
Since the median only considers the order of data points, it is not affected by outliers, making it a great indicator of the center when the data is not symmetrically distributed. However, this strength is also its weakness—it fails to capture the whole dataset.
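Here’s a small sketch, again with a made-up list of revenues containing one extreme value, that contrasts the two measures:

```python
import pandas as pd

revenue = pd.Series([100, 110, 105, 95, 10_000])

print(revenue.mean())    # 2082.0 -- dragged upwards by the outlier
print(revenue.median())  # 105.0  -- unchanged by the extreme value
```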
The Mode: The Most Popular
The mode is the value that appears most frequently in a dataset. It’s particularly useful for categorical data where we want to know the most common category.
To find the mode in SQL, we can execute the following query:
```sql
-- LIMIT 1 keeps only the single most frequent value (ties are cut off)
SELECT Revenue, COUNT(*) AS Frequency
FROM Sales
GROUP BY Revenue
ORDER BY Frequency DESC
LIMIT 1;
```
And for the Pythonistas out there, it’s even simpler:
```python
# .mode() returns a Series, since a dataset can have more than one mode
mode_revenue = df['Revenue'].mode()
```
Strengths and Weaknesses
The mode is useful for providing categorical insights, making it ideal for identifying the most common category or categories (bimodal or multimodal) in non-numeric datasets. However, it has limitations: if all values are unique, there may be no mode at all, and it tends to be less relevant for continuous data compared to the mean or median.
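A brief sketch with made-up data shows both behaviours; because .mode() returns a Series, a tie yields more than one value:

```python
import pandas as pd

# categorical example: "Card" is the single most frequent payment method
payments = pd.Series(["Card", "Cash", "Card", "Transfer", "Card"])
print(payments.mode().tolist())  # ['Card']

# bimodal example: 1 and 2 are tied, so both are returned
bimodal = pd.Series([1, 1, 2, 2, 3])
print(bimodal.mode().tolist())   # [1, 2]
```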
Avengers Assemble
The most efficient way to understand a dataset’s central tendencies is by considering all three measures of centrality. Each provides unique insights into the data distribution, allowing us to build a more comprehensive understanding of its shape.
- If mean ≈ median ≈ mode, the data is likely symmetrically distributed (think bell-shaped curve).
- If mean > median, it suggests a right-skewed (positively skewed) distribution, where higher values pull the mean upwards.
- If mean < median, it indicates a left-skewed (negatively skewed) distribution, meaning lower values pull the mean down.
By observing all three measures, we get a clearer understanding of the data. If there’s a large difference between the mean and median, that could be a red flag for outliers or a skewed distribution, which would require further analysis or potential data transformations.
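As a quick sanity check, here’s a minimal sketch with a made-up right-skewed sample, where the mean sits above the median exactly as described above:

```python
import pandas as pd

# right-skewed sample: most values are small, a few are large
skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 10, 25])

print(skewed.mean())           # ~5.9 -- pulled up by the long right tail
print(skewed.median())         # 3.0
print(skewed.mode().tolist())  # [3]
```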
Conclusion: The Bigger Picture
Exploring central tendencies is a critical part of exploratory data analysis. The mean, median, and mode each serve a different purpose, and knowing when and how to use them is essential for interpreting data correctly. Whether you’re working with continuous or categorical data, applying these measures helps uncover patterns and insights that guide the next steps, whether that’s cleaning, transforming, or modeling.