19 Grouping and summarizing data

19.1 Introduction

In this lesson, we’ll explore two powerful pandas methods: agg() and groupby(). These tools will enable you to extract summary statistics and perform operations on grouped data effortlessly.

A summary statistic is a single value (such as a mean or median) that describes a sequence of values (typically a column in your dataset).

Let’s see how to use these!

19.2 Learning objectives

You can use pandas.DataFrame.agg() to extract summary statistics from datasets.
You can use pandas.DataFrame.groupby() to group data by one or more variables before performing operations on them.
You can pass custom functions to agg() to compute summary statistics.

19.3 Libraries

Run the following lines to import the necessary libraries:

import pandas as pd
import numpy as np

19.4 The Yaounde COVID-19 dataset

In this lesson, we will again use a subset of data from the COVID-19 serological survey conducted in Yaounde, Cameroon.

You can download the dataset from this link: yaounde_mini.csv

yao = pd.read_csv("data/yaounde_mini.csv")
yao

	age	age_category_3	sex	weight_kg	height_cm	neighborhood	is_smoker	is_pregnant	occupation	treatment_combinations	symptoms	n_days_miss_work	n_bedridden_days	highest_education	igg_result
0	45	Adult	Female	95	169	Briqueterie	Non-smoker	No	Informal worker	Paracetamol	Muscle pain	0.0	0.0	Secondary	Negative
1	55	Adult	Male	96	185	Briqueterie	Ex-smoker	NaN	Salaried worker	NaN	No symptoms	NaN	NaN	University	Positive
2	23	Adult	Male	74	180	Briqueterie	Smoker	NaN	Student	NaN	No symptoms	NaN	NaN	University	Negative
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
968	35	Adult	Male	77	168	Tsinga Oliga	Smoker	NaN	Informal worker	Paracetamol	Headache	0.0	0.0	University	Positive
969	31	Adult	Female	66	169	Tsinga Oliga	Non-smoker	No	Unemployed	NaN	No symptoms	NaN	NaN	Secondary	Negative
970	17	Child	Female	67	162	Tsinga Oliga	Non-smoker	No response	Unemployed	NaN	No symptoms	NaN	NaN	Secondary	Negative

971 rows × 15 columns

You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0

19.5 Introducing `pandas.DataFrame.agg()`

To get started, let’s consider how to get simple summary statistics without using agg(), then we will consider why you should actually use agg().

Imagine you were asked to find the mean age of respondents in the yao data frame. You can do this by calling the mean() method on the age column of the yao data frame:

yao[["age"]].mean()

age    29.017508
dtype: float64

Now, let’s see how to do this with agg().

yao.agg(mean_age=('age', 'mean'))

	age
mean_age	29.017508

The anatomy of this syntax is:

dataframe.agg(summary_name=("COLUMN_TO_SUMMARIZE", "SUMMARY_FUNCTION"))

This part ("COLUMN_TO_SUMMARIZE", "SUMMARY_FUNCTION") is called a tuple. The first element of the tuple is the name of the column to summarize, and the second element is the summary function to apply to that column.

The syntax is more complex, but as you will see later, much more powerful, as it allows you to compute multiple summary statistics, and to compute statistics per group.

Let’s see how to compute multiple summary statistics in a single agg() statement. If you wanted both the mean and the median age, you could run:

yao.agg(mean_age=("age", "mean"), median_age=("age", "median"))

	age
mean_age	29.017508
median_age	26.000000

Great, now try your hand at the practice question below.

Practice

19.6 Practice Q: Mean and median weight

Use agg() and the relevant summary functions to obtain the mean and median of respondent weights from the weight_kg variable of the yao data frame.

# Your code here

19.7 Grouped summaries with `pandas.DataFrame.groupby()`

Now let’s see how to use groupby() to obtain grouped summaries, the primary reason for using agg() in the first place.

As its name suggests, pandas.DataFrame.groupby() lets you group a data frame by the values in a variable (e.g. male vs female sex). You can then perform operations that are split according to these groups.

Let’s try to group the yao data frame by sex and observe the effect:

yao.groupby("sex")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107108170>

Hmm. Apparently nothing happened. We just get a GroupBy object.

But watch what happens when we chain the groupby() with the agg() call we used in the previous section:

yao.groupby("sex").agg(mean_age=("age", "mean"), median_age=("age", "median"))

	mean_age	median_age
sex
Female	29.495446	26.0
Male	28.395735	25.0

Now we get a different statistic for each group! The mean age for female respondents is about 29.5, while for male respondents it’s about 28.4.

As was mentioned earlier, this kind of grouped summary is the primary reason the agg() function is so useful.

You may notice that there are two header rows. This is because the output has a hierarchical index (called a MultiIndex in pandas). While this can be useful in some cases, it often makes further data manipulation more difficult. We can reset the index to convert the group labels back to a regular column with the reset_index() method.

yao.groupby("sex").agg(mean_age=("age", "mean"), median_age=("age", "median")).reset_index()

	sex	mean_age	median_age
0	Female	29.495446	26.0
1	Male	28.395735	25.0

You may notice that the line of code is getting quite long. We can move each new method call to a new line to make the code more readable, but we need to wrap the entire chain in parentheses.

(
    yao.groupby("sex")
    .agg(mean_age=("age", "mean"), median_age=("age", "median"))
    .reset_index()
)

	sex	mean_age	median_age
0	Female	29.495446	26.0
1	Male	28.395735	25.0

Let’s see one more example.

Suppose you were asked to obtain the maximum and minimum weights for individuals in different neighborhoods, and also present the number of individuals in each neighborhood. We can write:

(
    yao.groupby("neighborhood")
    .agg(
        max_weight=("weight_kg", "max"),
        min_weight=("weight_kg", "min"),
        count=("weight_kg", "size"),  # the size function counts rows per group
    )
    .reset_index()
)

	neighborhood	max_weight	min_weight	count
0	Briqueterie	128	20	106
1	Carriere	129	14	236
2	Cité Verte	118	16	72
...	...	...	...	...
6	Nkomkana	161	15	75
7	Tsinga	105	15	81
8	Tsinga Oliga	100	17	67

9 rows × 4 columns

Practice

19.8 Practice Q: Min and max height per sex

Use groupby(), agg(), and the relevant summary functions to obtain the minimum and maximum heights for each sex in the yao data frame, as well as the number of individuals in each sex group.

Your output should be a DataFrame that looks like this:

sex	min_height_cm	max_height_cm	count
Female
Male

# Your code here

19.9 Grouping by multiple variables (nested grouping)

It is possible to group a data frame by more than one variable. This is sometimes called “nested” grouping.

Suppose you want to know the mean age of men and women in each neighbourhood, you could put both sex and neighborhood in the groupby() statement:

(
    yao
    .groupby(['sex', 'neighborhood'])
    .agg(mean_age=('age', 'mean'))
    .reset_index()
)

	sex	neighborhood	mean_age
0	Female	Briqueterie	31.622951
1	Female	Carriere	28.164286
2	Female	Cité Verte	31.750000
...	...	...	...
15	Male	Nkomkana	29.812500
16	Male	Tsinga	28.820513
17	Male	Tsinga Oliga	24.297297

18 rows × 3 columns

From this output data frame you can tell that, for example, women from Briqueterie have a mean age of 31.6 years.

Practice

19.10 Practice Q: Min and max height per age sex group

Use groupby(), agg(), and min() and max() to get the minimum and maximum heights for each age-sex group in the yao data frame. The variables needed are age_category_3 and sex.

Your output should be a DataFrame that looks like this:

age_category_3	sex	min_height	max_height
Adult	Female	78	185
Adult	Male	147	196
Child	Female	54	183
Child	Male	96	190
Senior	Female	143	174
Senior	Male	160	195

# Your code here

19.11 NaN values in `agg()`

When using agg() to compute grouped summary statistics, pay attention to whether your group of interest contains NaN values.

For example, to get mean weight by smoking status, we can write:

(
    yao.groupby("is_smoker")
    .agg(weight_mean=("weight_kg", "mean"))
    .reset_index()
)

	is_smoker	weight_mean
0	Ex-smoker	76.366197
1	Non-smoker	63.033760
2	Smoker	72.410256

But this actually excludes some rows with NaN smoking status from the summary table.

We can include these individuals in the summary table by setting dropna=False with the groupby() function.

(
    yao.groupby("is_smoker", dropna=False)
    .agg(weight_mean=("weight_kg", "mean"))
    .reset_index()
)

	is_smoker	weight_mean
0	Ex-smoker	76.366197
1	Non-smoker	63.033760
2	Smoker	72.410256
3	NaN	73.000000

Also recall that you can see how many individuals are in each smoking status group by using the size() function. It is often useful to include this information in your summary table, so that you know how many individuals are behind each summary statistic.

(
    yao.groupby("is_smoker", dropna=False)
    .agg(weight_mean=("weight_kg", "mean"), 
         count=("weight_kg", "size"))
    .reset_index()
)

	is_smoker	weight_mean	count
0	Ex-smoker	76.366197	71
1	Non-smoker	63.033760	859
2	Smoker	72.410256	39
3	NaN	73.000000	2

Practice

19.12 Practice Q: Mean weight by pregnancy status

Use groupby(), agg(), and the mean() function to obtain the mean weight (kg) by pregnancy status in the yao data frame. Include individuals with NaN pregnancy status in the summary table.

The output data frame should look something like this:

is_pregnant	weight_mean
No
No response
Yes
NaN

# your code here

19.13 Custom summary statistics with lambda functions

Before we dive into custom summary statistics, let’s briefly introduce lambda functions. Lambda functions in Python are small, anonymous functions defined with the lambda keyword.

For example, consider a function that calculates the range (difference between the maximum and minimum) of a list. You could define it using a regular function like this:

def range_func(x):
    return max(x) - min(x)

print(range_func([1, 2, 3, 4]))  # Output: 3

Alternatively, you can achieve the same result using a lambda function:

range_func = lambda x: max(x) - min(x)
print(range_func([1, 2, 3, 4]))  # Output: 3

Now, let’s see how we can use lambda functions to apply custom summary statistics in our data analysis.

For example, let’s say we want to calculate the range of weights in each neighborhood. We could do this with our range_func function:

(
    yao.groupby("neighborhood")
    .agg(weight_range=("weight_kg", range_func))
    .reset_index()
)

	neighborhood	weight_range
0	Briqueterie	108
1	Carriere	115
2	Cité Verte	102
...	...	...
6	Nkomkana	146
7	Tsinga	90
8	Tsinga Oliga	83

9 rows × 2 columns

Notice that we did not wrap range_func in quotes. Only built-in functions are wrapped in quotes.

Now, instead of calling range_func we can use a lambda function directly in the agg() call:

(
    yao.groupby("neighborhood")
    .agg(weight_range=("weight_kg", lambda x: max(x) - min(x)))
    .reset_index()
)

	neighborhood	weight_range
0	Briqueterie	108
1	Carriere	115
2	Cité Verte	102
...	...	...
6	Nkomkana	146
7	Tsinga	90
8	Tsinga Oliga	83

9 rows × 2 columns

Notice that we still provide a tuple to the agg() function, ('weight_kg', lambda x: max(x) - min(x)), but the second element of the tuple is a lambda function.

This lambda function operates on the column provided in the tuple, weight_kg.

Let’s see another example: calculating the coefficient of variation (CV) of weight within each neighborhood. The CV is the standard deviation divided by the mean, and is a unitless measure of the relative variability of a distribution.

(
    yao.groupby("neighborhood")
    .agg(weight_cv=("weight_kg", lambda x: (np.std(x) / np.mean(x)) * 100))
    .reset_index()
)

	neighborhood	weight_cv
0	Briqueterie	33.531748
1	Carriere	32.027533
2	Cité Verte	33.255829
...	...	...
6	Nkomkana	33.187496
7	Tsinga	33.937145
8	Tsinga Oliga	35.894453

9 rows × 2 columns

Practice

19.14 Practice Q: IQR of age by neighborhood

Find the interquartile range (IQR) of the age variable for each neighborhood. The IQR is the difference between the 75th and 25th percentiles. Your lambda will look like this: lambda x: x.quantile(0.75) - x.quantile(0.25)

# Your code here

19.15 Wrap up

In this lesson, you’ve learned how to obtain quick summary statistics from your data using agg(), group your data using groupby(), and combine groupby() with agg() for powerful data summarization.

These skills are essential for both exploratory data analysis and preparing data for presentation or plotting. The combination of groupby() and agg() is one of the most common and useful data manipulation techniques in pandas.

In our next lesson, we’ll explore ways to combine groupby() with other pandas methods.

See you there!

19.1 Introduction

19.2 Learning objectives

19.3 Libraries

19.4 The Yaounde COVID-19 dataset

19.5 Introducing pandas.DataFrame.agg()

19.6 Practice Q: Mean and median weight

19.7 Grouped summaries with pandas.DataFrame.groupby()

19.8 Practice Q: Min and max height per sex

19.9 Grouping by multiple variables (nested grouping)

19.10 Practice Q: Min and max height per age sex group

19.11 NaN values in agg()

19.12 Practice Q: Mean weight by pregnancy status

19.13 Custom summary statistics with lambda functions

19.14 Practice Q: IQR of age by neighborhood

19.15 Wrap up

19.5 Introducing `pandas.DataFrame.agg()`

19.7 Grouped summaries with `pandas.DataFrame.groupby()`

19.11 NaN values in `agg()`