19  Grouping and summarizing data

19.1 Introduction

In this lesson, we’ll explore two powerful pandas methods: agg() and groupby(). These tools will enable you to extract summary statistics and perform operations on grouped data effortlessly.

A summary statistic is a single value (such as a mean or median) that describes a sequence of values (typically a column in your dataset).

Let’s see how to use these!

19.2 Learning objectives

  1. You can use pandas.DataFrame.agg() to extract summary statistics from datasets.

  2. You can use pandas.DataFrame.groupby() to group data by one or more variables before performing operations on them.

  3. You can pass custom functions to agg() to compute summary statistics.


19.3 Libraries

Run the following lines to import the necessary libraries:

import pandas as pd
import numpy as np

19.4 The Yaounde COVID-19 dataset

In this lesson, we will again use a subset of data from the COVID-19 serological survey conducted in Yaounde, Cameroon.

You can download the dataset from this link: yaounde_mini.csv

yao = pd.read_csv("data/yaounde_mini.csv")
yao
age age_category_3 sex weight_kg height_cm neighborhood is_smoker is_pregnant occupation treatment_combinations symptoms n_days_miss_work n_bedridden_days highest_education igg_result
0 45 Adult Female 95 169 Briqueterie Non-smoker No Informal worker Paracetamol Muscle pain 0.0 0.0 Secondary Negative
1 55 Adult Male 96 185 Briqueterie Ex-smoker NaN Salaried worker NaN No symptoms NaN NaN University Positive
2 23 Adult Male 74 180 Briqueterie Smoker NaN Student NaN No symptoms NaN NaN University Negative
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
968 35 Adult Male 77 168 Tsinga Oliga Smoker NaN Informal worker Paracetamol Headache 0.0 0.0 University Positive
969 31 Adult Female 66 169 Tsinga Oliga Non-smoker No Unemployed NaN No symptoms NaN NaN Secondary Negative
970 17 Child Female 67 162 Tsinga Oliga Non-smoker No response Unemployed NaN No symptoms NaN NaN Secondary Negative

971 rows × 15 columns

You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0

19.5 Introducing pandas.DataFrame.agg()

To get started, let’s consider how to get simple summary statistics without using agg(), then we will consider why you should actually use agg().

Imagine you were asked to find the mean age of respondents in the yao data frame. You can do this by calling the mean() method on the age column of the yao data frame:

yao[["age"]].mean()
age    29.017508
dtype: float64

Now, let’s see how to do this with agg().

yao.agg(mean_age=('age', 'mean'))
age
mean_age 29.017508

The anatomy of this syntax is:

dataframe.agg(summary_name=("COLUMN_TO_SUMMARIZE", "SUMMARY_FUNCTION"))

This part ("COLUMN_TO_SUMMARIZE", "SUMMARY_FUNCTION") is called a tuple. The first element of the tuple is the name of the column to summarize, and the second element is the summary function to apply to that column.

The syntax is more complex, but as you will see later, much more powerful, as it allows you to compute multiple summary statistics, and to compute statistics per group.


Let’s see how to compute multiple summary statistics in a single agg() statement. If you wanted both the mean and the median age, you could run:

yao.agg(mean_age=("age", "mean"), median_age=("age", "median"))
age
mean_age 29.017508
median_age 26.000000

Great, now try your hand at the practice question below.

Practice

19.6 Practice Q: Mean and median weight

Use agg() and the relevant summary functions to obtain the mean and median of respondent weights from the weight_kg variable of the yao data frame.

# Your code here

19.7 Grouped summaries with pandas.DataFrame.groupby()

Now let’s see how to use groupby() to obtain grouped summaries, the primary reason for using agg() in the first place.

As its name suggests, pandas.DataFrame.groupby() lets you group a data frame by the values in a variable (e.g. male vs female sex). You can then perform operations that are split according to these groups.

Let’s try to group the yao data frame by sex and observe the effect:

yao.groupby("sex")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107108170>

Hmm. Apparently nothing happened. We just get a GroupBy object.

But watch what happens when we chain the groupby() with the agg() call we used in the previous section:

yao.groupby("sex").agg(mean_age=("age", "mean"), median_age=("age", "median"))
mean_age median_age
sex
Female 29.495446 26.0
Male 28.395735 25.0

Now we get a different statistic for each group! The mean age for female respondents is about 29.5, while for male respondents it’s about 28.4.

As was mentioned earlier, this kind of grouped summary is the primary reason the agg() function is so useful.

You may notice that there are two header rows. This is because the output has a hierarchical index (called a MultiIndex in pandas). While this can be useful in some cases, it often makes further data manipulation more difficult. We can reset the index to convert the group labels back to a regular column with the reset_index() method.

yao.groupby("sex").agg(mean_age=("age", "mean"), median_age=("age", "median")).reset_index()
sex mean_age median_age
0 Female 29.495446 26.0
1 Male 28.395735 25.0

You may notice that the line of code is getting quite long. We can move each new method call to a new line to make the code more readable, but we need to wrap the entire chain in parentheses.

(
    yao.groupby("sex")
    .agg(mean_age=("age", "mean"), median_age=("age", "median"))
    .reset_index()
)
sex mean_age median_age
0 Female 29.495446 26.0
1 Male 28.395735 25.0

Let’s see one more example.

Suppose you were asked to obtain the maximum and minimum weights for individuals in different neighborhoods, and also present the number of individuals in each neighborhood. We can write:

(
    yao.groupby("neighborhood")
    .agg(
        max_weight=("weight_kg", "max"),
        min_weight=("weight_kg", "min"),
        count=("weight_kg", "size"),  # the size function counts rows per group
    )
    .reset_index()
) 
neighborhood max_weight min_weight count
0 Briqueterie 128 20 106
1 Carriere 129 14 236
2 Cité Verte 118 16 72
... ... ... ... ...
6 Nkomkana 161 15 75
7 Tsinga 105 15 81
8 Tsinga Oliga 100 17 67

9 rows × 4 columns

Practice

19.8 Practice Q: Min and max height per sex

Use groupby(), agg(), and the relevant summary functions to obtain the minimum and maximum heights for each sex in the yao data frame, as well as the number of individuals in each sex group.

Your output should be a DataFrame that looks like this:

sex min_height_cm max_height_cm count
Female
Male
# Your code here

19.9 Grouping by multiple variables (nested grouping)

It is possible to group a data frame by more than one variable. This is sometimes called “nested” grouping.

Suppose you want to know the mean age of men and women in each neighbourhood, you could put both sex and neighborhood in the groupby() statement:

(
    yao
    .groupby(['sex', 'neighborhood'])
    .agg(mean_age=('age', 'mean'))
    .reset_index()
)
sex neighborhood mean_age
0 Female Briqueterie 31.622951
1 Female Carriere 28.164286
2 Female Cité Verte 31.750000
... ... ... ...
15 Male Nkomkana 29.812500
16 Male Tsinga 28.820513
17 Male Tsinga Oliga 24.297297

18 rows × 3 columns

From this output data frame you can tell that, for example, women from Briqueterie have a mean age of 31.6 years.

Practice

19.10 Practice Q: Min and max height per age sex group

Use groupby(), agg(), and min() and max() to get the minimum and maximum heights for each age-sex group in the yao data frame. The variables needed are age_category_3 and sex.

Your output should be a DataFrame that looks like this:

age_category_3 sex min_height max_height
Adult Female 78 185
Adult Male 147 196
Child Female 54 183
Child Male 96 190
Senior Female 143 174
Senior Male 160 195
# Your code here

19.11 NaN values in agg()

When using agg() to compute grouped summary statistics, pay attention to whether your group of interest contains NaN values.

For example, to get mean weight by smoking status, we can write:

(
    yao.groupby("is_smoker")
    .agg(weight_mean=("weight_kg", "mean"))
    .reset_index()
)
is_smoker weight_mean
0 Ex-smoker 76.366197
1 Non-smoker 63.033760
2 Smoker 72.410256

But this actually excludes some rows with NaN smoking status from the summary table.

We can include these individuals in the summary table by setting dropna=False with the groupby() function.

(
    yao.groupby("is_smoker", dropna=False)
    .agg(weight_mean=("weight_kg", "mean"))
    .reset_index()
)
is_smoker weight_mean
0 Ex-smoker 76.366197
1 Non-smoker 63.033760
2 Smoker 72.410256
3 NaN 73.000000

Also recall that you can see how many individuals are in each smoking status group by using the size() function. It is often useful to include this information in your summary table, so that you know how many individuals are behind each summary statistic.

(
    yao.groupby("is_smoker", dropna=False)
    .agg(weight_mean=("weight_kg", "mean"), 
         count=("weight_kg", "size"))
    .reset_index()
)
is_smoker weight_mean count
0 Ex-smoker 76.366197 71
1 Non-smoker 63.033760 859
2 Smoker 72.410256 39
3 NaN 73.000000 2
Practice

19.12 Practice Q: Mean weight by pregnancy status

Use groupby(), agg(), and the mean() function to obtain the mean weight (kg) by pregnancy status in the yao data frame. Include individuals with NaN pregnancy status in the summary table.

The output data frame should look something like this:

is_pregnant weight_mean
No
No response
Yes
NaN
# your code here

19.13 Custom summary statistics with lambda functions

Before we dive into custom summary statistics, let’s briefly introduce lambda functions. Lambda functions in Python are small, anonymous functions defined with the lambda keyword.

For example, consider a function that calculates the range (difference between the maximum and minimum) of a list. You could define it using a regular function like this:

def range_func(x):
    return max(x) - min(x)

print(range_func([1, 2, 3, 4]))  # Output: 3
3

Alternatively, you can achieve the same result using a lambda function:

range_func = lambda x: max(x) - min(x)
print(range_func([1, 2, 3, 4]))  # Output: 3
3

Now, let’s see how we can use lambda functions to apply custom summary statistics in our data analysis.

For example, let’s say we want to calculate the range of weights in each neighborhood. We could do this with our range_func function:

(
    yao.groupby("neighborhood")
    .agg(weight_range=("weight_kg", range_func))
    .reset_index()
)
neighborhood weight_range
0 Briqueterie 108
1 Carriere 115
2 Cité Verte 102
... ... ...
6 Nkomkana 146
7 Tsinga 90
8 Tsinga Oliga 83

9 rows × 2 columns

Notice that we did not wrap range_func in quotes. Only built-in functions are wrapped in quotes.

Now, instead of calling range_func we can use a lambda function directly in the agg() call:

(
    yao.groupby("neighborhood")
    .agg(weight_range=("weight_kg", lambda x: max(x) - min(x)))
    .reset_index()
)
neighborhood weight_range
0 Briqueterie 108
1 Carriere 115
2 Cité Verte 102
... ... ...
6 Nkomkana 146
7 Tsinga 90
8 Tsinga Oliga 83

9 rows × 2 columns

Notice that we still provide a tuple to the agg() function, ('weight_kg', lambda x: max(x) - min(x)), but the second element of the tuple is a lambda function.

This lambda function operates on the column provided in the tuple, weight_kg.

Let’s see another example: calculating the coefficient of variation (CV) of weight within each neighborhood. The CV is the standard deviation divided by the mean, and is a unitless measure of the relative variability of a distribution.

(
    yao.groupby("neighborhood")
    .agg(weight_cv=("weight_kg", lambda x: (np.std(x) / np.mean(x)) * 100))
    .reset_index()
)
neighborhood weight_cv
0 Briqueterie 33.531748
1 Carriere 32.027533
2 Cité Verte 33.255829
... ... ...
6 Nkomkana 33.187496
7 Tsinga 33.937145
8 Tsinga Oliga 35.894453

9 rows × 2 columns

Practice

19.14 Practice Q: IQR of age by neighborhood

Find the interquartile range (IQR) of the age variable for each neighborhood. The IQR is the difference between the 75th and 25th percentiles. Your lambda will look like this: lambda x: x.quantile(0.75) - x.quantile(0.25)

# Your code here

19.15 Wrap up

In this lesson, you’ve learned how to obtain quick summary statistics from your data using agg(), group your data using groupby(), and combine groupby() with agg() for powerful data summarization.

These skills are essential for both exploratory data analysis and preparing data for presentation or plotting. The combination of groupby() and agg() is one of the most common and useful data manipulation techniques in pandas.

In our next lesson, we’ll explore ways to combine groupby() with other pandas methods.

See you there!