import pandas as pd
import vega_datasets as vd
import plotly.express as px
import warnings
import calendar20 Other Grouped Operations in Pandas
20.1 Introduction
In our previous lessons, you’ve learned how to extract summary statistics from groups using groupby() and agg(). Now, we’ll take a step further by exploring some additional useful grouped data transformations. Let’s get started.
20.2 Learning Objectives
By the end of this lesson, you will be able to:
- Add group-level summary statistics as new columns using 
transform(). - Count values within groups using 
value_counts(). - Compute cumulative sums within groups.
 
20.3 Imports
Run the following cell to import the necessary libraries:
20.4 Data
We’ll use the weather dataset for our examples.
weather_raw = vd.data.seattle_weather()
# Select just 2012 data using query and add a month column
weather = weather_raw.query("date.dt.year == 2012")
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
weather/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_40785/2556095375.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| date | precipitation | temp_max | temp_min | wind | weather | month | |
|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | 
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | 
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | 
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | 
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | 
| ... | ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | 
| 362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | 
| 363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | 
| 364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | 
| 365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | 
366 rows × 7 columns
Now let’s set the display options for the rest of the lesson:
pd.options.display.max_rows = 20And let’s ignore the warnings that come up when working with categorical data with the current version of pandas:
warnings.filterwarnings(
    "ignore"
)  ## There is a class of warnings that come up when working with categorical data with the current version of pandas that we can ignore20.5 Adding Summary Statistics Using transform()
In the previous lesson, you learned how to calculate summary statistics like mean, median, or standard deviation using agg().
For example, to compute the mean precipitation (rain + snow) for each month, you could use:
weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))| mean_precip | |
|---|---|
| month | |
| January | 5.590323 | 
| February | 3.182759 | 
| March | 5.903226 | 
| April | 2.270000 | 
| May | 1.683871 | 
| ... | ... | 
| August | 0.000000 | 
| September | 0.030000 | 
| October | 5.493548 | 
| November | 7.016667 | 
| December | 5.612903 | 
12 rows × 1 columns
Sometimes, we want to add these group-level statistics as new columns to our original DataFrame. We can’t do this directly with the agg() output:
# Does not work
weather['mean_precip'] = weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
weather| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 
| 362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 
| 363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 
| 364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 
| 365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 
366 rows × 8 columns
But we can do this using transform(). transform() reshapes the output to match the original DataFrame’s shape, allowing us to add the group-level statistics as new columns.
weather['mean_precip_month'] = weather.groupby('month')['precipitation'].transform('mean')
weather| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 
| 362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 
| 363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 
| 364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 
| 365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 
366 rows × 9 columns
You can compute other statistics similarly. For example, to compute the median precipitation for each month, you could use:
weather['prep_median_month'] = weather.groupby('month')['precipitation'].transform('median')    
weather| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 
| 362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 
| 363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 
| 364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 
| 365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 
366 rows × 10 columns
Or to get the sum of precipitation for each month:
weather['precip_sum_month'] = weather.groupby('month')['precipitation'].transform('sum')
weather| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 | 
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 
| 362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 
| 363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 
| 364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 
| 365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 
366 rows × 11 columns
With the sum in hand, we can easily calculate the proportion of that month’s precipitation that fell on each day:
weather["precip_month_prop"] = weather["precipitation"] / weather["precip_sum_month"]
weather| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month | precip_month_prop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 | 0.000000 | 
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.062897 | 
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.004616 | 
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.117138 | 
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.007501 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.023563 | 
| 362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 | 
| 363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.008621 | 
| 364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 | 
| 365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 | 
366 rows × 12 columns
20.6 Practice Q: Daily Tip Proportions
Using the tips dataset, calculate: 1. A new column daily_total_tips containing the total tips for each day 2. A new column tip_proportion showing what proportion of that day’s total tips came from each customer
# Your code here:
tips = px.data.tips()
tips| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 
| ... | ... | ... | ... | ... | ... | ... | ... | 
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 
244 rows × 7 columns
The first few rows of your output data should look something like this:
total_bill    tip     sex      smoker    day    time      size    daily_total_tips    tip_proportion
16.99         1.01    Female   No        Sun    Dinner    2       247.39              0.004083
10.34         1.66    Male     No        Sun    Dinner    3       247.39              0.006710
21.01         3.50    Male     No        Sun    Dinner    3       247.39              0.014148
23.68         3.31    Male     No        Sun    Dinner    2       247.39              0.013380
Let’s reinitialize the weather DataFrame to a smaller set of columns for the rest of the lesson:
weather = weather[['date', 'month', 'precipitation', 'wind', 'weather']]
weather| date | month | precipitation | wind | weather | |
|---|---|---|---|---|---|
| 0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 
| 1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 
| 2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 
| 3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 
| 4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 
| ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 
| 362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 
| 363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 
| 364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 
| 365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 
366 rows × 5 columns
20.7 Counting Values Within Groups Using value_counts()
Counting occurrences of categorical variables within groups can reveal interesting patterns, and you often need to do this after using groupby().
First, let’s recall how value_counts() works on the entire DataFrame.
# Count of weather types
weather["weather"].value_counts()weather
rain       191
sun        118
drizzle     31
snow        21
fog          5
Name: count, dtype: int64
We can add normalize=True to get proportions:
weather['weather'].value_counts(normalize=True)weather
rain       0.521858
sun        0.322404
drizzle    0.084699
snow       0.057377
fog        0.013661
Name: proportion, dtype: float64
Now, to count weather types within each month, we first group by month, then subset the weather column and apply value_counts() to it.
# Counts of weather types per month
weather.groupby('month')['weather'].value_counts()month     weather
January   rain       18
          snow        7
          sun         4
          drizzle     2
          fog         0
                     ..
December  rain       23
          snow        5
          drizzle     2
          sun         1
          fog         0
Name: count, Length: 60, dtype: int64
This returns a Series with a MultiIndex, which can be converted to a regular DataFrame with reset_index():
weather.groupby('month')['weather'].value_counts().reset_index()| month | weather | count | |
|---|---|---|---|
| 0 | January | rain | 18 | 
| 1 | January | snow | 7 | 
| 2 | January | sun | 4 | 
| 3 | January | drizzle | 2 | 
| 4 | January | fog | 0 | 
| ... | ... | ... | ... | 
| 55 | December | rain | 23 | 
| 56 | December | snow | 5 | 
| 57 | December | drizzle | 2 | 
| 58 | December | sun | 1 | 
| 59 | December | fog | 0 | 
60 rows × 3 columns
20.8 Practice Q: Count Smokers and Non-Smokers by Day
Using the tips dataset, count the number of smokers and non-smokers for each day.
tips = px.data.tips()
tips
# Your code here:| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 
| ... | ... | ... | ... | ... | ... | ... | ... | 
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 
244 rows × 7 columns
The first few rows of your result should look something like this:
day smoker  count
Fri Yes 15
Fri No  4
Sat No  45
20.9 Computing Cumulative Sums Within Groups
Cumulative sums help track running totals within groups. This is an often-useful operation. Let’s see how we can do this to grouped data.
As a recall, here’s how we can compute the cumulative sum of precipitation for the entire DataFrame:
# Cumulative sum of precipitation
weather["precip_cumul"] = weather["precipitation"].cumsum()
weather| date | month | precipitation | wind | weather | precip_cumul | |
|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 | 
| 1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 | 
| 2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 | 
| 3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 | 
| 4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 1224.5 | 
| 362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 1224.5 | 
| 363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 1226.0 | 
| 364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 1226.0 | 
| 365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 1226.0 | 
366 rows × 6 columns
To compute cumulative precipitation within each month, we can use groupby() and cumsum():
# Cumulative precipitation per month
weather["precip_cumul"] = weather.groupby("month")["precipitation"].cumsum()
weather| date | month | precipitation | wind | weather | precip_cumul | |
|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 | 
| 1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 | 
| 2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 | 
| 3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 | 
| 4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 | 
| ... | ... | ... | ... | ... | ... | ... | 
| 361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 172.5 | 
| 362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 172.5 | 
| 363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 174.0 | 
| 364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 174.0 | 
| 365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 174.0 | 
366 rows × 6 columns
20.10 Practice Q: Cumulative Tip Amount by Day
Using the tips dataset, compute the cumulative sum of total_bill for each day, adding a new column cumul_total_bill_day. Then add another column cumul_tip_day that contains the cumulative sum of tip for each day.
tips = px.data.tips()
tips = tips.sort_values('day')
tips
# Your code here:| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 96 | 27.28 | 4.00 | Male | Yes | Fri | Dinner | 2 | 
| 101 | 15.38 | 3.00 | Female | Yes | Fri | Dinner | 2 | 
| 98 | 21.01 | 3.00 | Male | Yes | Fri | Dinner | 2 | 
| 97 | 12.03 | 1.50 | Male | Yes | Fri | Dinner | 2 | 
| 95 | 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 | 
| ... | ... | ... | ... | ... | ... | ... | ... | 
| 132 | 11.17 | 1.50 | Female | No | Thur | Lunch | 2 | 
| 131 | 20.27 | 2.83 | Female | No | Thur | Lunch | 2 | 
| 130 | 19.08 | 1.50 | Male | No | Thur | Lunch | 2 | 
| 128 | 11.38 | 2.00 | Female | No | Thur | Lunch | 2 | 
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 
244 rows × 7 columns
20.11 Wrap-Up
In this lesson, you’ve learned several powerful group-level data transformations in pandas:
- Adding Summary Statistics: Using 
transform()to add group-level calculations as new columns - Counting Within Groups: Using 
value_counts()to count occurrences in groups - Computing Cumulative Sums: Tracking running totals within groups
 
These techniques allow you to analyze patterns and statistics within specific subsets of your data. Keep practicing with different datasets to build your data manipulation skills!
See you next time!