import pandas as pd
import numpy as np
19 Grouping and summarizing data
19.1 Introduction
In this lesson, we’ll explore two powerful pandas methods: agg()
and groupby()
. These tools will enable you to extract summary statistics and perform operations on grouped data effortlessly.
A summary statistic is a single value (such as a mean or median) that describes a sequence of values (typically a column in your dataset).
Let’s see how to use these!
19.2 Learning objectives
You can use
pandas.DataFrame.agg()
to extract summary statistics from datasets.You can use
pandas.DataFrame.groupby()
to group data by one or more variables before performing operations on them.You can pass custom functions to
agg()
to compute summary statistics.
19.3 Libraries
Run the following lines to import the necessary libraries:
19.4 The Yaounde COVID-19 dataset
In this lesson, we will again use a subset of data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
You can download the dataset from this link: yaounde_mini.csv
= pd.read_csv("data/yaounde_mini.csv")
yao yao
age | age_category_3 | sex | weight_kg | height_cm | neighborhood | is_smoker | is_pregnant | occupation | treatment_combinations | symptoms | n_days_miss_work | n_bedridden_days | highest_education | igg_result | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45 | Adult | Female | 95 | 169 | Briqueterie | Non-smoker | No | Informal worker | Paracetamol | Muscle pain | 0.0 | 0.0 | Secondary | Negative |
1 | 55 | Adult | Male | 96 | 185 | Briqueterie | Ex-smoker | NaN | Salaried worker | NaN | No symptoms | NaN | NaN | University | Positive |
2 | 23 | Adult | Male | 74 | 180 | Briqueterie | Smoker | NaN | Student | NaN | No symptoms | NaN | NaN | University | Negative |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
968 | 35 | Adult | Male | 77 | 168 | Tsinga Oliga | Smoker | NaN | Informal worker | Paracetamol | Headache | 0.0 | 0.0 | University | Positive |
969 | 31 | Adult | Female | 66 | 169 | Tsinga Oliga | Non-smoker | No | Unemployed | NaN | No symptoms | NaN | NaN | Secondary | Negative |
970 | 17 | Child | Female | 67 | 162 | Tsinga Oliga | Non-smoker | No response | Unemployed | NaN | No symptoms | NaN | NaN | Secondary | Negative |
971 rows × 15 columns
You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0
19.5 Introducing pandas.DataFrame.agg()
To get started, let’s consider how to get simple summary statistics without using agg()
, then we will consider why you should actually use agg()
.
Imagine you were asked to find the mean age of respondents in the yao
data frame. You can do this by calling the mean()
method on the age
column of the yao
data frame:
"age"]].mean() yao[[
age 29.017508
dtype: float64
Now, let’s see how to do this with agg()
.
=('age', 'mean')) yao.agg(mean_age
age | |
---|---|
mean_age | 29.017508 |
The anatomy of this syntax is:
=("COLUMN_TO_SUMMARIZE", "SUMMARY_FUNCTION")) dataframe.agg(summary_name
This part ("COLUMN_TO_SUMMARIZE", "SUMMARY_FUNCTION")
is called a tuple. The first element of the tuple is the name of the column to summarize, and the second element is the summary function to apply to that column.
The syntax is more complex, but as you will see later, much more powerful, as it allows you to compute multiple summary statistics, and to compute statistics per group.
Let’s see how to compute multiple summary statistics in a single agg()
statement. If you wanted both the mean and the median age, you could run:
=("age", "mean"), median_age=("age", "median")) yao.agg(mean_age
age | |
---|---|
mean_age | 29.017508 |
median_age | 26.000000 |
Great, now try your hand at the practice question below.
19.6 Practice Q: Mean and median weight
Use agg()
and the relevant summary functions to obtain the mean and median of respondent weights from the weight_kg
variable of the yao
data frame.
# Your code here
19.7 Grouped summaries with pandas.DataFrame.groupby()
Now let’s see how to use groupby()
to obtain grouped summaries, the primary reason for using agg()
in the first place.
As its name suggests, pandas.DataFrame.groupby()
lets you group a data frame by the values in a variable (e.g. male vs female sex). You can then perform operations that are split according to these groups.
Let’s try to group the yao
data frame by sex and observe the effect:
"sex") yao.groupby(
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107108170>
Hmm. Apparently nothing happened. We just get a GroupBy
object.
But watch what happens when we chain the groupby()
with the agg()
call we used in the previous section:
"sex").agg(mean_age=("age", "mean"), median_age=("age", "median")) yao.groupby(
mean_age | median_age | |
---|---|---|
sex | ||
Female | 29.495446 | 26.0 |
Male | 28.395735 | 25.0 |
Now we get a different statistic for each group! The mean age for female respondents is about 29.5, while for male respondents it’s about 28.4.
As was mentioned earlier, this kind of grouped summary is the primary reason the agg()
function is so useful.
You may notice that there are two header rows. This is because the output has a hierarchical index (called a MultiIndex in pandas). While this can be useful in some cases, it often makes further data manipulation more difficult. We can reset the index to convert the group labels back to a regular column with the reset_index()
method.
"sex").agg(mean_age=("age", "mean"), median_age=("age", "median")).reset_index() yao.groupby(
sex | mean_age | median_age | |
---|---|---|---|
0 | Female | 29.495446 | 26.0 |
1 | Male | 28.395735 | 25.0 |
You may notice that the line of code is getting quite long. We can move each new method call to a new line to make the code more readable, but we need to wrap the entire chain in parentheses.
("sex")
yao.groupby(=("age", "mean"), median_age=("age", "median"))
.agg(mean_age
.reset_index() )
sex | mean_age | median_age | |
---|---|---|---|
0 | Female | 29.495446 | 26.0 |
1 | Male | 28.395735 | 25.0 |
Let’s see one more example.
Suppose you were asked to obtain the maximum and minimum weights for individuals in different neighborhoods, and also present the number of individuals in each neighborhood. We can write:
("neighborhood")
yao.groupby(
.agg(=("weight_kg", "max"),
max_weight=("weight_kg", "min"),
min_weight=("weight_kg", "size"), # the size function counts rows per group
count
)
.reset_index() )
neighborhood | max_weight | min_weight | count | |
---|---|---|---|---|
0 | Briqueterie | 128 | 20 | 106 |
1 | Carriere | 129 | 14 | 236 |
2 | Cité Verte | 118 | 16 | 72 |
... | ... | ... | ... | ... |
6 | Nkomkana | 161 | 15 | 75 |
7 | Tsinga | 105 | 15 | 81 |
8 | Tsinga Oliga | 100 | 17 | 67 |
9 rows × 4 columns
19.8 Practice Q: Min and max height per sex
Use groupby()
, agg()
, and the relevant summary functions to obtain the minimum and maximum heights for each sex in the yao
data frame, as well as the number of individuals in each sex group.
Your output should be a DataFrame that looks like this:
sex | min_height_cm | max_height_cm | count |
---|---|---|---|
Female | |||
Male |
# Your code here
19.9 Grouping by multiple variables (nested grouping)
It is possible to group a data frame by more than one variable. This is sometimes called “nested” grouping.
Suppose you want to know the mean age of men and women in each neighbourhood, you could put both sex
and neighborhood
in the groupby()
statement:
(
yao'sex', 'neighborhood'])
.groupby([=('age', 'mean'))
.agg(mean_age
.reset_index() )
sex | neighborhood | mean_age | |
---|---|---|---|
0 | Female | Briqueterie | 31.622951 |
1 | Female | Carriere | 28.164286 |
2 | Female | Cité Verte | 31.750000 |
... | ... | ... | ... |
15 | Male | Nkomkana | 29.812500 |
16 | Male | Tsinga | 28.820513 |
17 | Male | Tsinga Oliga | 24.297297 |
18 rows × 3 columns
From this output data frame you can tell that, for example, women from Briqueterie have a mean age of 31.6 years.
19.10 Practice Q: Min and max height per age sex group
Use groupby()
, agg()
, and min()
and max()
to get the minimum and maximum heights for each age-sex group in the yao
data frame. The variables needed are age_category_3
and sex
.
Your output should be a DataFrame that looks like this:
age_category_3 | sex | min_height | max_height |
---|---|---|---|
Adult | Female | 78 | 185 |
Adult | Male | 147 | 196 |
Child | Female | 54 | 183 |
Child | Male | 96 | 190 |
Senior | Female | 143 | 174 |
Senior | Male | 160 | 195 |
# Your code here
19.11 NaN values in agg()
When using agg()
to compute grouped summary statistics, pay attention to whether your group of interest contains NaN values.
For example, to get mean weight by smoking status, we can write:
("is_smoker")
yao.groupby(=("weight_kg", "mean"))
.agg(weight_mean
.reset_index() )
is_smoker | weight_mean | |
---|---|---|
0 | Ex-smoker | 76.366197 |
1 | Non-smoker | 63.033760 |
2 | Smoker | 72.410256 |
But this actually excludes some rows with NaN smoking status from the summary table.
We can include these individuals in the summary table by setting dropna=False
with the groupby()
function.
("is_smoker", dropna=False)
yao.groupby(=("weight_kg", "mean"))
.agg(weight_mean
.reset_index() )
is_smoker | weight_mean | |
---|---|---|
0 | Ex-smoker | 76.366197 |
1 | Non-smoker | 63.033760 |
2 | Smoker | 72.410256 |
3 | NaN | 73.000000 |
Also recall that you can see how many individuals are in each smoking status group by using the size()
function. It is often useful to include this information in your summary table, so that you know how many individuals are behind each summary statistic.
("is_smoker", dropna=False)
yao.groupby(=("weight_kg", "mean"),
.agg(weight_mean=("weight_kg", "size"))
count
.reset_index() )
is_smoker | weight_mean | count | |
---|---|---|---|
0 | Ex-smoker | 76.366197 | 71 |
1 | Non-smoker | 63.033760 | 859 |
2 | Smoker | 72.410256 | 39 |
3 | NaN | 73.000000 | 2 |
19.12 Practice Q: Mean weight by pregnancy status
Use groupby()
, agg()
, and the mean()
function to obtain the mean weight (kg) by pregnancy status in the yao
data frame. Include individuals with NaN pregnancy status in the summary table.
The output data frame should look something like this:
is_pregnant | weight_mean |
---|---|
No | |
No response | |
Yes | |
NaN |
# your code here
19.13 Custom summary statistics with lambda functions
Before we dive into custom summary statistics, let’s briefly introduce lambda functions. Lambda functions in Python are small, anonymous functions defined with the lambda
keyword.
For example, consider a function that calculates the range (difference between the maximum and minimum) of a list. You could define it using a regular function like this:
def range_func(x):
return max(x) - min(x)
print(range_func([1, 2, 3, 4])) # Output: 3
3
Alternatively, you can achieve the same result using a lambda function:
= lambda x: max(x) - min(x)
range_func print(range_func([1, 2, 3, 4])) # Output: 3
3
Now, let’s see how we can use lambda functions to apply custom summary statistics in our data analysis.
For example, let’s say we want to calculate the range of weights in each neighborhood. We could do this with our range_func
function:
("neighborhood")
yao.groupby(=("weight_kg", range_func))
.agg(weight_range
.reset_index() )
neighborhood | weight_range | |
---|---|---|
0 | Briqueterie | 108 |
1 | Carriere | 115 |
2 | Cité Verte | 102 |
... | ... | ... |
6 | Nkomkana | 146 |
7 | Tsinga | 90 |
8 | Tsinga Oliga | 83 |
9 rows × 2 columns
Notice that we did not wrap range_func
in quotes. Only built-in functions are wrapped in quotes.
Now, instead of calling range_func
we can use a lambda function directly in the agg()
call:
("neighborhood")
yao.groupby(=("weight_kg", lambda x: max(x) - min(x)))
.agg(weight_range
.reset_index() )
neighborhood | weight_range | |
---|---|---|
0 | Briqueterie | 108 |
1 | Carriere | 115 |
2 | Cité Verte | 102 |
... | ... | ... |
6 | Nkomkana | 146 |
7 | Tsinga | 90 |
8 | Tsinga Oliga | 83 |
9 rows × 2 columns
Notice that we still provide a tuple to the agg()
function, ('weight_kg', lambda x: max(x) - min(x))
, but the second element of the tuple is a lambda function.
This lambda function operates on the column provided in the tuple, weight_kg
.
Let’s see another example: calculating the coefficient of variation (CV) of weight within each neighborhood. The CV is the standard deviation divided by the mean, and is a unitless measure of the relative variability of a distribution.
("neighborhood")
yao.groupby(=("weight_kg", lambda x: (np.std(x) / np.mean(x)) * 100))
.agg(weight_cv
.reset_index() )
neighborhood | weight_cv | |
---|---|---|
0 | Briqueterie | 33.531748 |
1 | Carriere | 32.027533 |
2 | Cité Verte | 33.255829 |
... | ... | ... |
6 | Nkomkana | 33.187496 |
7 | Tsinga | 33.937145 |
8 | Tsinga Oliga | 35.894453 |
9 rows × 2 columns
19.14 Practice Q: IQR of age by neighborhood
Find the interquartile range (IQR) of the age variable for each neighborhood. The IQR is the difference between the 75th and 25th percentiles. Your lambda will look like this: lambda x: x.quantile(0.75) - x.quantile(0.25)
# Your code here
19.15 Wrap up
In this lesson, you’ve learned how to obtain quick summary statistics from your data using agg()
, group your data using groupby()
, and combine groupby()
with agg()
for powerful data summarization.
These skills are essential for both exploratory data analysis and preparing data for presentation or plotting. The combination of groupby()
and agg()
is one of the most common and useful data manipulation techniques in pandas.
In our next lesson, we’ll explore ways to combine groupby()
with other pandas methods.
See you there!