20 Other Grouped Operations in Pandas
20.1 Introduction
In our previous lessons, you've learned how to extract summary statistics from groups using groupby() and agg(). Now, we'll take a step further by exploring some additional useful grouped data transformations. Let's get started.
20.2 Learning Objectives
By the end of this lesson, you will be able to:
- Add group-level summary statistics as new columns using transform().
- Count values within groups using value_counts().
- Compute cumulative sums within groups.
20.3 Imports
Run the following cell to import the necessary libraries:
import pandas as pd
import vega_datasets as vd
import plotly.express as px
import warnings
import calendar
20.4 Data
We'll use the weather dataset for our examples.
weather_raw = vd.data.seattle_weather()

# Select just 2012 data using query and add a month column
weather = weather_raw.query("date.dt.year == 2012")
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
weather
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| date | precipitation | temp_max | temp_min | wind | weather | month
---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January |
... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December |
366 rows × 7 columns
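The SettingWithCopyWarning above appears because weather was created as a slice of weather_raw. It is harmless here, but if you would rather avoid it, one option is to take an explicit copy after filtering. A minimal sketch, not required for this lesson:

```python
# Optional: calling .copy() after query() gives an independent DataFrame,
# so adding new columns no longer triggers the SettingWithCopyWarning
weather = weather_raw.query("date.dt.year == 2012").copy()
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
```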
Now let’s set the display options for the rest of the lesson:
pd.options.display.max_rows = 20
And let’s ignore the warnings that come up when working with categorical data with the current version of pandas:
# There is a class of warnings that come up when working with categorical data with the current version of pandas that we can ignore
warnings.filterwarnings("ignore")
20.5 Adding Summary Statistics Using transform()
In the previous lesson, you learned how to calculate summary statistics like mean, median, or standard deviation using agg().

For example, to compute the mean precipitation (rain + snow) for each month, you could use:

weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
month | mean_precip
---|---
January | 5.590323 |
February | 3.182759 |
March | 5.903226 |
April | 2.270000 |
May | 1.683871 |
... | ... |
August | 0.000000 |
September | 0.030000 |
October | 5.493548 |
November | 7.016667 |
December | 5.612903 |
12 rows × 1 columns
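Note that this result is a separate, smaller DataFrame with one row per month rather than one row per day. A quick sketch to confirm (the variable name monthly_precip is just illustrative):

```python
# The agg() output is indexed by month: 12 rows, one per month
monthly_precip = weather.groupby('month').agg(mean_precip=('precipitation', 'mean'))
monthly_precip.shape  # (12, 1)
```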
Sometimes, we want to add these group-level statistics as new columns to our original DataFrame. We can't do this directly with the agg() output, because its rows don't align with the daily rows of weather:

# Does not work
weather['mean_precip'] = weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip
---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN |
366 rows × 8 columns
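The new column is all NaN because pandas aligned the 12-row agg() result with the 366-row DataFrame by index and found no matching row labels. For comparison, one possible workaround (the names monthly_means and mean_precip_merged below are illustrative, not part of the lesson) would be to merge the monthly summary back onto the daily rows:

```python
# For comparison only: merge the per-month means back onto the daily rows
monthly_means = (
    weather.groupby('month')
    .agg(mean_precip_merged=('precipitation', 'mean'))
    .reset_index()
)
weather.merge(monthly_means, on='month', how='left')
```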
But we can do this using transform(). transform() reshapes the output to match the original DataFrame's shape, allowing us to add the group-level statistics as new columns.

weather['mean_precip_month'] = weather.groupby('month')['precipitation'].transform('mean')
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month
---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 |
366 rows × 9 columns
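A common use of a group-level column like this is to compare each row to its group. For example, a minimal sketch (not part of the lesson's DataFrame) of how far each day's precipitation is from its month's mean:

```python
# Hypothetical illustration: deviation of each day's precipitation
# from that month's mean, using the column we just created
precip_anomaly = weather["precipitation"] - weather["mean_precip_month"]
precip_anomaly.head()
```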
You can compute other statistics similarly. For example, to compute the median precipitation for each month, you could use:
weather['prep_median_month'] = weather.groupby('month')['precipitation'].transform('median')
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month
---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 |
366 rows × 10 columns
Or to get the sum of precipitation for each month:
weather['precip_sum_month'] = weather.groupby('month')['precipitation'].transform('sum')
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 |
366 rows × 11 columns
With the sum in hand, we can easily calculate the proportion of that month’s precipitation that fell on each day:
"precip_month_prop"] = weather["precipitation"] / weather["precip_sum_month"]
weather[ weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month | precip_month_prop
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 | 0.000000 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.062897 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.004616 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.117138 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.007501 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.023563 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.008621 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
366 rows × 12 columns
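If you prefer, the same proportion can be computed in a single step by passing a function to transform(); a minimal sketch that returns the same values as precip_month_prop without adding a column:

```python
# One-step alternative: divide each value by its group's sum inside transform()
weather.groupby("month")["precipitation"].transform(lambda x: x / x.sum())
```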
20.6 Practice Q: Daily Tip Proportions
Using the tips dataset, calculate:

1. A new column daily_total_tips containing the total tips for each day
2. A new column tip_proportion showing what proportion of that day's total tips came from each customer

# Your code here:
tips = px.data.tips()
tips
| total_bill | tip | sex | smoker | day | time | size
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
The first few rows of your output data should look something like this:
total_bill tip sex smoker day time size daily_total_tips tip_proportion
16.99 1.01 Female No Sun Dinner 2 247.39 0.004083
10.34 1.66 Male No Sun Dinner 3 247.39 0.006710
21.01 3.50 Male No Sun Dinner 3 247.39 0.014148
23.68 3.31 Male No Sun Dinner 2 247.39 0.013380
Let's reinitialize the weather DataFrame to a smaller set of columns for the rest of the lesson:

weather = weather[['date', 'month', 'precipitation', 'wind', 'weather']]
weather
| date | month | precipitation | wind | weather
---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain |
... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle |
366 rows × 5 columns
20.7 Counting Values Within Groups Using value_counts()
Counting occurrences of categorical variables within groups can reveal interesting patterns, and you often need to do this after using groupby().

First, let's recall how value_counts() works on the entire DataFrame.

# Count of weather types
weather["weather"].value_counts()
weather
rain 191
sun 118
drizzle 31
snow 21
fog 5
Name: count, dtype: int64
We can add normalize=True to get proportions:

weather['weather'].value_counts(normalize=True)
weather
rain 0.521858
sun 0.322404
drizzle 0.084699
snow 0.057377
fog 0.013661
Name: proportion, dtype: float64
Now, to count weather types within each month, we first group by month, then subset the weather column and apply value_counts() to it.

# Counts of weather types per month
weather.groupby('month')['weather'].value_counts()
month weather
January rain 18
snow 7
sun 4
drizzle 2
fog 0
..
December rain 23
snow 5
drizzle 2
sun 1
fog 0
Name: count, Length: 60, dtype: int64
This returns a Series with a MultiIndex, which can be converted to a regular DataFrame with reset_index():

weather.groupby('month')['weather'].value_counts().reset_index()
| month | weather | count
---|---|---|---|
0 | January | rain | 18 |
1 | January | snow | 7 |
2 | January | sun | 4 |
3 | January | drizzle | 2 |
4 | January | fog | 0 |
... | ... | ... | ... |
55 | December | rain | 23 |
56 | December | snow | 5 |
57 | December | drizzle | 2 |
58 | December | sun | 1 |
59 | December | fog | 0 |
60 rows × 3 columns
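As with the ungrouped version, you can pass normalize=True here to get within-month proportions instead of raw counts; a minimal sketch:

```python
# Proportion of each weather type within each month
weather.groupby('month')['weather'].value_counts(normalize=True).reset_index()
```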
20.8 Practice Q: Count Smokers and Non-Smokers by Day
Using the tips dataset, count the number of smokers and non-smokers for each day.

tips = px.data.tips()
tips

# Your code here:
| total_bill | tip | sex | smoker | day | time | size
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
The first few rows of your result should look something like this:
day smoker count
Fri Yes 15
Fri No 4
Sat No 45
20.9 Computing Cumulative Sums Within Groups
Cumulative sums help track running totals within groups, which is often useful. Let's see how to do this with grouped data.

As a reminder, here's how to compute the cumulative sum of precipitation over the entire DataFrame:

# Cumulative sum of precipitation
weather["precip_cumul"] = weather["precipitation"].cumsum()
weather
| date | month | precipitation | wind | weather | precip_cumul
---|---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 |
... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 1224.5 |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 1224.5 |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 1226.0 |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 1226.0 |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 1226.0 |
366 rows × 6 columns
To compute cumulative precipitation within each month, we can use groupby() and cumsum():

# Cumulative precipitation per month
weather["precip_cumul"] = weather.groupby("month")["precipitation"].cumsum()
weather
| date | month | precipitation | wind | weather | precip_cumul
---|---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 |
... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 172.5 |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 172.5 |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 174.0 |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 174.0 |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 174.0 |
366 rows × 6 columns
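You can also combine this with transform() from earlier in the lesson. For example, a minimal sketch (the names monthly_total and precip_cumul_prop are illustrative) of the fraction of each month's total precipitation that has fallen by each day:

```python
# Running share of the month's total precipitation
monthly_total = weather.groupby("month")["precipitation"].transform("sum")
precip_cumul_prop = weather["precip_cumul"] / monthly_total
precip_cumul_prop.tail()
```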
20.10 Practice Q: Cumulative Tip Amount by Day
Using the tips dataset, compute the cumulative sum of total_bill for each day, adding a new column cumul_total_bill_day. Then add another column cumul_tip_day that contains the cumulative sum of tip for each day.

tips = px.data.tips()
tips = tips.sort_values('day')
tips

# Your code here:
| total_bill | tip | sex | smoker | day | time | size
---|---|---|---|---|---|---|---|
96 | 27.28 | 4.00 | Male | Yes | Fri | Dinner | 2 |
101 | 15.38 | 3.00 | Female | Yes | Fri | Dinner | 2 |
98 | 21.01 | 3.00 | Male | Yes | Fri | Dinner | 2 |
97 | 12.03 | 1.50 | Male | Yes | Fri | Dinner | 2 |
95 | 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
132 | 11.17 | 1.50 | Female | No | Thur | Lunch | 2 |
131 | 20.27 | 2.83 | Female | No | Thur | Lunch | 2 |
130 | 19.08 | 1.50 | Male | No | Thur | Lunch | 2 |
128 | 11.38 | 2.00 | Female | No | Thur | Lunch | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
20.11 Wrap-Up
In this lesson, you’ve learned several powerful group-level data transformations in pandas:
- Adding Summary Statistics: Using transform() to add group-level calculations as new columns
- Counting Within Groups: Using value_counts() to count occurrences in groups
- Computing Cumulative Sums: Tracking running totals within groups
These techniques allow you to analyze patterns and statistics within specific subsets of your data. Keep practicing with different datasets to build your data manipulation skills!
See you next time!