One of the most important steps of any data-related project is Exploratory Data Analysis (EDA). It is crucial to explore the distribution of the data to understand it and efficiently decide on the next steps. A straightforward way to explore data distribution is to study its central tendencies through measures of centrality. In this article, we’ll explore the three primary measures of centrality: Mean, Median, and Mode. We’ll discuss their strengths and weaknesses, along with practical examples using SQL and Python.
The Mean: The Average Joe
The mean, often referred to as the average, is calculated by summing all the values in a dataset and dividing by the number of values. It’s a straightforward way to find a central value.
Let’s say we have a table named Sales with a column Revenue. The following query will give us the mean revenue per sale:
```sql
SELECT AVG(Revenue) AS MeanRevenue
FROM Sales;
```
When using Python, we can obtain the same result by importing Pandas and running the following code:
```python
import pandas as pd

# load your data into a pandas DataFrame named df
mean_revenue = df['Revenue'].mean()
```
Strengths and Weaknesses
The mean leaves no man behind, meaning every data point is taken into consideration. Although this allows it to give us a holistic view of the data, it makes it highly susceptible to outliers.
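To see this in practice, here’s a minimal sketch using a small made-up list of revenues (not the Sales table above), showing how a single extreme value drags the mean away from the bulk of the data:

```python
import pandas as pd

# four typical sales plus one extreme outlier (hypothetical values)
revenue = pd.Series([100, 110, 105, 95, 10_000])

print(revenue.mean())      # 2082.0 -- pulled far above the typical values by the outlier
print(revenue[:4].mean())  # 102.5  -- the mean without the outlier
```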
The Median: The Middle Ground
The median is the middle value when the data is sorted, meaning there is an equal number of data points before and after it.
Note: If there’s an even number of observations, the median is obtained by averaging the two middle values.
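As a quick illustration with a made-up four-value series, pandas averages the two middle values for us:

```python
import pandas as pd

# four observations: once sorted, the two middle values are 20 and 30
values = pd.Series([10, 20, 30, 40])
print(values.median())  # 25.0 -- the average of 20 and 30
```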
We can obtain the median revenue using the following SQL query:
```sql
-- PERCENTILE_CONT(0.5) computes the median across all rows;
-- DISTINCT collapses the repeated per-row result to a single value
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Revenue) OVER() AS MedianRevenue
FROM Sales;
```
Using Python, however, obtaining the median is much more straightforward:
```python
median_revenue = df['Revenue'].median()
```
Strengths and Weaknesses
Since the median only considers the order of data points, it is not affected by outliers, making it a great indicator of the center when the data is not symmetrically distributed. However, this strength is also its weakness—it fails to capture the whole dataset.
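Here’s a small sketch, again with a made-up list of revenues containing one extreme value, that contrasts the two measures:

```python
import pandas as pd

revenue = pd.Series([100, 110, 105, 95, 10_000])

print(revenue.mean())    # 2082.0 -- dragged upwards by the outlier
print(revenue.median())  # 105.0  -- unchanged by the extreme value
```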
The Mode: The Most Popular
The mode is the value that appears most frequently in a dataset. It’s particularly useful for categorical data where we want to know the most common category.
To find the mode in SQL, we can execute the following query:
```sql
-- LIMIT 1 keeps only the single most frequent value (ties are cut off)
SELECT Revenue, COUNT(*) AS Frequency
FROM Sales
GROUP BY Revenue
ORDER BY Frequency DESC
LIMIT 1;
```
And for the Pythonistas out there, it’s even simpler:
```python
# .mode() returns a Series, since a dataset can have more than one mode
mode_revenue = df['Revenue'].mode()
```
Strengths and Weaknesses
The mode is useful for providing categorical insights, making it ideal for identifying the most common category or categories (bimodal or multimodal) in non-numeric datasets. However, it has limitations: if all values are unique, there may be no mode at all, and it tends to be less relevant for continuous data compared to the mean or median.
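A brief sketch with made-up data shows both behaviours; because .mode() returns a Series, a tie yields more than one value:

```python
import pandas as pd

# categorical example: "Card" is the single most frequent payment method
payments = pd.Series(["Card", "Cash", "Card", "Transfer", "Card"])
print(payments.mode().tolist())  # ['Card']

# bimodal example: 1 and 2 are tied, so both are returned
bimodal = pd.Series([1, 1, 2, 2, 3])
print(bimodal.mode().tolist())   # [1, 2]
```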
Avengers Assemble
The most efficient way to understand a dataset’s central tendencies is by considering all three measures of centrality. Each provides unique insights into the data distribution, allowing us to build a more comprehensive understanding of its shape.
- If mean ≈ median ≈ mode, the data is likely symmetrically distributed (think bell-shaped curve).
- If mean > median, it suggests a right-skewed (positively skewed) distribution, where higher values pull the mean upwards.
- If mean < median, it indicates a left-skewed (negatively skewed) distribution, meaning lower values pull the mean down.
By observing all three measures, we get a clearer understanding of the data. If there’s a large difference between the mean and median, that could be a red flag for outliers or a skewed distribution, which would require further analysis or potential data transformations.
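As a quick sanity check, here’s a minimal sketch with a made-up right-skewed sample, where the mean sits above the median exactly as described above:

```python
import pandas as pd

# right-skewed sample: most values are small, a few are large
skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 10, 25])

print(skewed.mean())           # ~5.9 -- pulled up by the long right tail
print(skewed.median())         # 3.0
print(skewed.mode().tolist())  # [3]
```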
Conclusion: The Bigger Picture
Exploring central tendencies is a critical part of exploratory data analysis. The mean, median, and mode each serve a different purpose, and knowing when and how to use them is essential for interpreting data correctly. Whether you’re working with continuous or categorical data, applying these measures helps uncover patterns and insights that guide the next steps, whether that’s cleaning, transforming, or modeling.