20  Other Grouped Operations in Pandas

20.1 Introduction

In our previous lessons, you’ve learned how to extract summary statistics from groups using groupby() and agg(). Now, we’ll take a step further by exploring some additional useful grouped data transformations. Let’s get started.

20.2 Learning Objectives

By the end of this lesson, you will be able to:

  • Add group-level summary statistics as new columns using transform().
  • Count values within groups using value_counts().
  • Compute cumulative sums within groups.

20.3 Imports

Run the following cell to import the necessary libraries:

import pandas as pd
import vega_datasets as vd
import plotly.express as px
import warnings
import calendar

20.4 Data

We’ll use the weather dataset for our examples.

weather_raw = vd.data.seattle_weather()

# Select just 2012 data using query and add a month column
weather = weather_raw.query("date.dt.year == 2012")
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
weather
/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_965/2556095375.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
date precipitation temp_max temp_min wind weather month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January
1 2012-01-02 10.9 10.6 2.8 4.5 rain January
2 2012-01-03 0.8 11.7 7.2 2.3 rain January
3 2012-01-04 20.3 12.2 5.6 4.7 rain January
4 2012-01-05 1.3 8.9 2.8 6.1 rain January
... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December
362 2012-12-28 0.0 8.3 3.9 1.7 rain December
363 2012-12-29 1.5 5.0 3.3 1.7 rain December
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December

366 rows × 7 columns

Now let’s set the display options for the rest of the lesson:

pd.options.display.max_rows = 20

And let’s ignore the warnings that come up when working with categorical data with the current version of pandas:

warnings.filterwarnings(
    "ignore"
)  ## There is a class of warnings that come up when working with categorical data with the current version of pandas that we can ignore

20.5 Adding Summary Statistics Using transform()

In the previous lesson, you learned how to calculate summary statistics like mean, median, or standard deviation using agg().

For example, to compute the mean precipitation (rain + snow) for each month, you could use:

weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
mean_precip
month
January 5.590323
February 3.182759
March 5.903226
April 2.270000
May 1.683871
... ...
August 0.000000
September 0.030000
October 5.493548
November 7.016667
December 5.612903

12 rows × 1 columns

Sometimes, we want to add these group-level statistics as new columns to our original DataFrame. We can’t do this directly with the agg() output:

# Does not work
weather['mean_precip'] = weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
weather
date precipitation temp_max temp_min wind weather month mean_precip
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN
... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN

366 rows × 8 columns

But we can do this using transform(). transform() reshapes the output to match the original DataFrame’s shape, allowing us to add the group-level statistics as new columns.

weather['mean_precip_month'] = weather.groupby('month')['precipitation'].transform('mean')
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323
... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903

366 rows × 9 columns

You can compute other statistics similarly. For example, to compute the median precipitation for each month, you could use:

weather['prep_median_month'] = weather.groupby('month')['precipitation'].transform('median')    
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month prep_median_month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323 3.0
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323 3.0
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323 3.0
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323 3.0
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323 3.0
... ... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903 3.3
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903 3.3
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903 3.3
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903 3.3
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903 3.3

366 rows × 10 columns

Or to get the sum of precipitation for each month:

weather['precip_sum_month'] = weather.groupby('month')['precipitation'].transform('sum')
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month prep_median_month precip_sum_month
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323 3.0 173.3
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323 3.0 173.3
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323 3.0 173.3
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323 3.0 173.3
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323 3.0 173.3
... ... ... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903 3.3 174.0
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903 3.3 174.0
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903 3.3 174.0
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903 3.3 174.0
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903 3.3 174.0

366 rows × 11 columns

With the sum in hand, we can easily calculate the proportion of that month’s precipitation that fell on each day:

weather["precip_month_prop"] = weather["precipitation"] / weather["precip_sum_month"]
weather
date precipitation temp_max temp_min wind weather month mean_precip mean_precip_month prep_median_month precip_sum_month precip_month_prop
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle January NaN 5.590323 3.0 173.3 0.000000
1 2012-01-02 10.9 10.6 2.8 4.5 rain January NaN 5.590323 3.0 173.3 0.062897
2 2012-01-03 0.8 11.7 7.2 2.3 rain January NaN 5.590323 3.0 173.3 0.004616
3 2012-01-04 20.3 12.2 5.6 4.7 rain January NaN 5.590323 3.0 173.3 0.117138
4 2012-01-05 1.3 8.9 2.8 6.1 rain January NaN 5.590323 3.0 173.3 0.007501
... ... ... ... ... ... ... ... ... ... ... ... ...
361 2012-12-27 4.1 7.8 3.3 3.2 rain December NaN 5.612903 3.3 174.0 0.023563
362 2012-12-28 0.0 8.3 3.9 1.7 rain December NaN 5.612903 3.3 174.0 0.000000
363 2012-12-29 1.5 5.0 3.3 1.7 rain December NaN 5.612903 3.3 174.0 0.008621
364 2012-12-30 0.0 4.4 0.0 1.8 drizzle December NaN 5.612903 3.3 174.0 0.000000
365 2012-12-31 0.0 3.3 -1.1 2.0 drizzle December NaN 5.612903 3.3 174.0 0.000000

366 rows × 12 columns

Practice

20.6 Practice Q: Daily Tip Proportions

Using the tips dataset, calculate: 1. A new column daily_total_tips containing the total tips for each day 2. A new column tip_proportion showing what proportion of that day’s total tips came from each customer

# Your code here:
tips = px.data.tips()
tips
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

The first few rows of your output data should look something like this:

total_bill    tip     sex      smoker    day    time      size    daily_total_tips    tip_proportion
16.99         1.01    Female   No        Sun    Dinner    2       247.39              0.004083
10.34         1.66    Male     No        Sun    Dinner    3       247.39              0.006710
21.01         3.50    Male     No        Sun    Dinner    3       247.39              0.014148
23.68         3.31    Male     No        Sun    Dinner    2       247.39              0.013380

Let’s reinitialize the weather DataFrame to a smaller set of columns for the rest of the lesson:

weather = weather[['date', 'month', 'precipitation', 'wind', 'weather']]
weather
date month precipitation wind weather
0 2012-01-01 January 0.0 4.7 drizzle
1 2012-01-02 January 10.9 4.5 rain
2 2012-01-03 January 0.8 2.3 rain
3 2012-01-04 January 20.3 4.7 rain
4 2012-01-05 January 1.3 6.1 rain
... ... ... ... ... ...
361 2012-12-27 December 4.1 3.2 rain
362 2012-12-28 December 0.0 1.7 rain
363 2012-12-29 December 1.5 1.7 rain
364 2012-12-30 December 0.0 1.8 drizzle
365 2012-12-31 December 0.0 2.0 drizzle

366 rows × 5 columns

20.7 Counting Values Within Groups Using value_counts()

Counting occurrences of categorical variables within groups can reveal interesting patterns, and you often need to do this after using groupby().

First, let’s recall how value_counts() works on the entire DataFrame.

# Count of weather types
weather["weather"].value_counts()
weather
rain       191
sun        118
drizzle     31
snow        21
fog          5
Name: count, dtype: int64

We can add normalize=True to get proportions:

weather['weather'].value_counts(normalize=True)
weather
rain       0.521858
sun        0.322404
drizzle    0.084699
snow       0.057377
fog        0.013661
Name: proportion, dtype: float64

Now, to count weather types within each month, we first group by month, then subset the weather column and apply value_counts() to it.

# Counts of weather types per month
weather.groupby('month')['weather'].value_counts()
month     weather
January   rain       18
          snow        7
          sun         4
          drizzle     2
          fog         0
                     ..
December  rain       23
          snow        5
          drizzle     2
          sun         1
          fog         0
Name: count, Length: 60, dtype: int64

This returns a Series with a MultiIndex, which can be converted to a regular DataFrame with reset_index():

weather.groupby('month')['weather'].value_counts().reset_index()
month weather count
0 January rain 18
1 January snow 7
2 January sun 4
3 January drizzle 2
4 January fog 0
... ... ... ...
55 December rain 23
56 December snow 5
57 December drizzle 2
58 December sun 1
59 December fog 0

60 rows × 3 columns

Practice

20.8 Practice Q: Count Smokers and Non-Smokers by Day

Using the tips dataset, count the number of smokers and non-smokers for each day.

tips = px.data.tips()
tips

# Your code here:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

The first few rows of your result should look something like this:

day smoker  count
Fri Yes 15
Fri No  4
Sat No  45

20.9 Computing Cumulative Sums Within Groups

Cumulative sums help track running totals within groups. This is an often-useful operation. Let’s see how we can do this to grouped data.

As a recall, here’s how we can compute the cumulative sum of precipitation for the entire DataFrame:

# Cumulative sum of precipitation
weather["precip_cumul"] = weather["precipitation"].cumsum()
weather
date month precipitation wind weather precip_cumul
0 2012-01-01 January 0.0 4.7 drizzle 0.0
1 2012-01-02 January 10.9 4.5 rain 10.9
2 2012-01-03 January 0.8 2.3 rain 11.7
3 2012-01-04 January 20.3 4.7 rain 32.0
4 2012-01-05 January 1.3 6.1 rain 33.3
... ... ... ... ... ... ...
361 2012-12-27 December 4.1 3.2 rain 1224.5
362 2012-12-28 December 0.0 1.7 rain 1224.5
363 2012-12-29 December 1.5 1.7 rain 1226.0
364 2012-12-30 December 0.0 1.8 drizzle 1226.0
365 2012-12-31 December 0.0 2.0 drizzle 1226.0

366 rows × 6 columns

To compute cumulative precipitation within each month, we can use groupby() and cumsum():

# Cumulative precipitation per month
weather["precip_cumul"] = weather.groupby("month")["precipitation"].cumsum()
weather
date month precipitation wind weather precip_cumul
0 2012-01-01 January 0.0 4.7 drizzle 0.0
1 2012-01-02 January 10.9 4.5 rain 10.9
2 2012-01-03 January 0.8 2.3 rain 11.7
3 2012-01-04 January 20.3 4.7 rain 32.0
4 2012-01-05 January 1.3 6.1 rain 33.3
... ... ... ... ... ... ...
361 2012-12-27 December 4.1 3.2 rain 172.5
362 2012-12-28 December 0.0 1.7 rain 172.5
363 2012-12-29 December 1.5 1.7 rain 174.0
364 2012-12-30 December 0.0 1.8 drizzle 174.0
365 2012-12-31 December 0.0 2.0 drizzle 174.0

366 rows × 6 columns

20.10 Practice Q: Cumulative Tip Amount by Day

Using the tips dataset, compute the cumulative sum of total_bill for each day, adding a new column cumul_total_bill_day. Then add another column cumul_tip_day that contains the cumulative sum of tip for each day.

tips = px.data.tips()
tips = tips.sort_values('day')
tips
# Your code here:
total_bill tip sex smoker day time size
96 27.28 4.00 Male Yes Fri Dinner 2
101 15.38 3.00 Female Yes Fri Dinner 2
98 21.01 3.00 Male Yes Fri Dinner 2
97 12.03 1.50 Male Yes Fri Dinner 2
95 40.17 4.73 Male Yes Fri Dinner 4
... ... ... ... ... ... ... ...
132 11.17 1.50 Female No Thur Lunch 2
131 20.27 2.83 Female No Thur Lunch 2
130 19.08 1.50 Male No Thur Lunch 2
128 11.38 2.00 Female No Thur Lunch 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

20.11 Wrap-Up

In this lesson, you’ve learned several powerful group-level data transformations in pandas:

  • Adding Summary Statistics: Using transform() to add group-level calculations as new columns
  • Counting Within Groups: Using value_counts() to count occurrences in groups
  • Computing Cumulative Sums: Tracking running totals within groups

These techniques allow you to analyze patterns and statistics within specific subsets of your data. Keep practicing with different datasets to build your data manipulation skills!

See you next time!