20 Other Grouped Operations in Pandas
20.1 Introduction
In our previous lessons, you've learned how to extract summary statistics from groups using groupby() and agg(). Now, we'll take a step further by exploring some additional useful grouped data transformations. Let's get started.
20.2 Learning Objectives
By the end of this lesson, you will be able to:
- Add group-level summary statistics as new columns using transform().
- Count values within groups using value_counts().
- Compute cumulative sums within groups.
20.3 Imports
Run the following cell to import the necessary libraries:
import pandas as pd
import vega_datasets as vd
import plotly.express as px
import warnings
import calendar
20.4 Data
We'll use the weather dataset for our examples.
weather_raw = vd.data.seattle_weather()

# Select just 2012 data using query and add a month column
weather = weather_raw.query("date.dt.year == 2012")
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
weather
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| date | precipitation | temp_max | temp_min | wind | weather | month
---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January |
... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December |
366 rows × 7 columns
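The SettingWithCopyWarning above appears because weather was created as a slice of weather_raw. It is harmless here, but if you would rather avoid it, one option is to take an explicit copy after filtering. A minimal sketch, not required for this lesson:

```python
# Optional: calling .copy() after query() gives an independent DataFrame,
# so adding new columns no longer triggers the SettingWithCopyWarning
weather = weather_raw.query("date.dt.year == 2012").copy()
weather["month"] = pd.Categorical(
    weather["date"].dt.strftime("%B"),
    categories=list(calendar.month_name[1:]),
    ordered=True,
)
```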
Now let’s set the display options for the rest of the lesson:
pd.options.display.max_rows = 20
And let’s ignore the warnings that come up when working with categorical data with the current version of pandas:
# There is a class of warnings that come up when working with categorical data with the current version of pandas that we can ignore
warnings.filterwarnings("ignore")
20.5 Adding Summary Statistics Using transform()
In the previous lesson, you learned how to calculate summary statistics like mean, median, or standard deviation using agg().

For example, to compute the mean precipitation (rain + snow) for each month, you could use:

weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
month | mean_precip
---|---
January | 5.590323 |
February | 3.182759 |
March | 5.903226 |
April | 2.270000 |
May | 1.683871 |
... | ... |
August | 0.000000 |
September | 0.030000 |
October | 5.493548 |
November | 7.016667 |
December | 5.612903 |
12 rows × 1 columns
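Note that this result is a separate, smaller DataFrame with one row per month rather than one row per day. A quick sketch to confirm (the variable name monthly_precip is just illustrative):

```python
# The agg() output is indexed by month: 12 rows, one per month
monthly_precip = weather.groupby('month').agg(mean_precip=('precipitation', 'mean'))
monthly_precip.shape  # (12, 1)
```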
Sometimes, we want to add these group-level statistics as new columns to our original DataFrame. We can't do this directly with the agg() output, because its rows don't align with the daily rows of weather:

# Does not work
weather['mean_precip'] = weather.groupby('month').agg(mean_precip = ('precipitation', 'mean'))
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip
---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN |
366 rows × 8 columns
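The new column is all NaN because pandas aligned the 12-row agg() result with the 366-row DataFrame by index and found no matching row labels. For comparison, one possible workaround (the names monthly_means and mean_precip_merged below are illustrative, not part of the lesson) would be to merge the monthly summary back onto the daily rows:

```python
# For comparison only: merge the per-month means back onto the daily rows
monthly_means = (
    weather.groupby('month')
    .agg(mean_precip_merged=('precipitation', 'mean'))
    .reset_index()
)
weather.merge(monthly_means, on='month', how='left')
```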
But we can do this using transform(). transform() reshapes the output to match the original DataFrame's shape, allowing us to add the group-level statistics as new columns.

weather['mean_precip_month'] = weather.groupby('month')['precipitation'].transform('mean')
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month
---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 |
366 rows × 9 columns
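A common use of a group-level column like this is to compare each row to its group. For example, a minimal sketch (not part of the lesson's DataFrame) of how far each day's precipitation is from its month's mean:

```python
# Hypothetical illustration: deviation of each day's precipitation
# from that month's mean, using the column we just created
precip_anomaly = weather["precipitation"] - weather["mean_precip_month"]
precip_anomaly.head()
```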
You can compute other statistics similarly. For example, to compute the median precipitation for each month, you could use:
weather['prep_median_month'] = weather.groupby('month')['precipitation'].transform('median')
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month
---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 |
366 rows × 10 columns
Or to get the sum of precipitation for each month:
weather['precip_sum_month'] = weather.groupby('month')['precipitation'].transform('sum')
weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 |
366 rows × 11 columns
With the sum in hand, we can easily calculate the proportion of that month’s precipitation that fell on each day:
"precip_month_prop"] = weather["precipitation"] / weather["precip_sum_month"]
weather[ weather
| date | precipitation | temp_max | temp_min | wind | weather | month | mean_precip | mean_precip_month | prep_median_month | precip_sum_month | precip_month_prop
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle | January | NaN | 5.590323 | 3.0 | 173.3 | 0.000000 |
1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.062897 |
2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.004616 |
3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.117138 |
4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain | January | NaN | 5.590323 | 3.0 | 173.3 | 0.007501 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | 4.1 | 7.8 | 3.3 | 3.2 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.023563 |
362 | 2012-12-28 | 0.0 | 8.3 | 3.9 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
363 | 2012-12-29 | 1.5 | 5.0 | 3.3 | 1.7 | rain | December | NaN | 5.612903 | 3.3 | 174.0 | 0.008621 |
364 | 2012-12-30 | 0.0 | 4.4 | 0.0 | 1.8 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
365 | 2012-12-31 | 0.0 | 3.3 | -1.1 | 2.0 | drizzle | December | NaN | 5.612903 | 3.3 | 174.0 | 0.000000 |
366 rows × 12 columns
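If you prefer, the same proportion can be computed in a single step by passing a function to transform(); a minimal sketch that returns the same values as precip_month_prop without adding a column:

```python
# One-step alternative: divide each value by its group's sum inside transform()
weather.groupby("month")["precipitation"].transform(lambda x: x / x.sum())
```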
20.6 Practice Q: Daily Tip Proportions
Using the tips dataset, calculate:

1. A new column daily_total_tips containing the total tips for each day
2. A new column tip_proportion showing what proportion of that day's total tips came from each customer

# Your code here:
tips = px.data.tips()
tips
| total_bill | tip | sex | smoker | day | time | size
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
The first few rows of your output data should look something like this:
total_bill tip sex smoker day time size daily_total_tips tip_proportion
16.99 1.01 Female No Sun Dinner 2 247.39 0.004083
10.34 1.66 Male No Sun Dinner 3 247.39 0.006710
21.01 3.50 Male No Sun Dinner 3 247.39 0.014148
23.68 3.31 Male No Sun Dinner 2 247.39 0.013380
Let's reinitialize the weather DataFrame to a smaller set of columns for the rest of the lesson:

weather = weather[['date', 'month', 'precipitation', 'wind', 'weather']]
weather
| date | month | precipitation | wind | weather
---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain |
... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle |
366 rows × 5 columns
20.7 Counting Values Within Groups Using value_counts()
Counting occurrences of categorical variables within groups can reveal interesting patterns, and you often need to do this after using groupby().

First, let's recall how value_counts() works on the entire DataFrame.

# Count of weather types
weather["weather"].value_counts()
weather
rain 191
sun 118
drizzle 31
snow 21
fog 5
Name: count, dtype: int64
We can add normalize=True to get proportions:

weather['weather'].value_counts(normalize=True)
weather
rain 0.521858
sun 0.322404
drizzle 0.084699
snow 0.057377
fog 0.013661
Name: proportion, dtype: float64
Now, to count weather types within each month, we first group by month, then subset the weather column and apply value_counts() to it.

# Counts of weather types per month
weather.groupby('month')['weather'].value_counts()
month weather
January rain 18
snow 7
sun 4
drizzle 2
fog 0
..
December rain 23
snow 5
drizzle 2
sun 1
fog 0
Name: count, Length: 60, dtype: int64
This returns a Series with a MultiIndex, which can be converted to a regular DataFrame with reset_index():

weather.groupby('month')['weather'].value_counts().reset_index()
| month | weather | count
---|---|---|---|
0 | January | rain | 18 |
1 | January | snow | 7 |
2 | January | sun | 4 |
3 | January | drizzle | 2 |
4 | January | fog | 0 |
... | ... | ... | ... |
55 | December | rain | 23 |
56 | December | snow | 5 |
57 | December | drizzle | 2 |
58 | December | sun | 1 |
59 | December | fog | 0 |
60 rows × 3 columns
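As with the ungrouped version, you can pass normalize=True here to get within-month proportions instead of raw counts; a minimal sketch:

```python
# Proportion of each weather type within each month
weather.groupby('month')['weather'].value_counts(normalize=True).reset_index()
```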
20.8 Practice Q: Count Smokers and Non-Smokers by Day
Using the tips dataset, count the number of smokers and non-smokers for each day.

tips = px.data.tips()
tips

# Your code here:
| total_bill | tip | sex | smoker | day | time | size
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
The first few rows of your result should look something like this:
day smoker count
Fri Yes 15
Fri No 4
Sat No 45
20.9 Computing Cumulative Sums Within Groups
Cumulative sums help track running totals within groups, which is often useful. Let's see how to do this with grouped data.

As a reminder, here's how to compute the cumulative sum of precipitation over the entire DataFrame:

# Cumulative sum of precipitation
weather["precip_cumul"] = weather["precipitation"].cumsum()
weather
| date | month | precipitation | wind | weather | precip_cumul
---|---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 |
... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 1224.5 |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 1224.5 |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 1226.0 |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 1226.0 |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 1226.0 |
366 rows × 6 columns
To compute cumulative precipitation within each month, we can use groupby() and cumsum():

# Cumulative precipitation per month
weather["precip_cumul"] = weather.groupby("month")["precipitation"].cumsum()
weather
| date | month | precipitation | wind | weather | precip_cumul
---|---|---|---|---|---|---|
0 | 2012-01-01 | January | 0.0 | 4.7 | drizzle | 0.0 |
1 | 2012-01-02 | January | 10.9 | 4.5 | rain | 10.9 |
2 | 2012-01-03 | January | 0.8 | 2.3 | rain | 11.7 |
3 | 2012-01-04 | January | 20.3 | 4.7 | rain | 32.0 |
4 | 2012-01-05 | January | 1.3 | 6.1 | rain | 33.3 |
... | ... | ... | ... | ... | ... | ... |
361 | 2012-12-27 | December | 4.1 | 3.2 | rain | 172.5 |
362 | 2012-12-28 | December | 0.0 | 1.7 | rain | 172.5 |
363 | 2012-12-29 | December | 1.5 | 1.7 | rain | 174.0 |
364 | 2012-12-30 | December | 0.0 | 1.8 | drizzle | 174.0 |
365 | 2012-12-31 | December | 0.0 | 2.0 | drizzle | 174.0 |
366 rows × 6 columns
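You can also combine this with transform() from earlier in the lesson. For example, a minimal sketch (the names monthly_total and precip_cumul_prop are illustrative) of the fraction of each month's total precipitation that has fallen by each day:

```python
# Running share of the month's total precipitation
monthly_total = weather.groupby("month")["precipitation"].transform("sum")
precip_cumul_prop = weather["precip_cumul"] / monthly_total
precip_cumul_prop.tail()
```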
20.10 Practice Q: Cumulative Tip Amount by Day
Using the tips dataset, compute the cumulative sum of total_bill for each day, adding a new column cumul_total_bill_day. Then add another column cumul_tip_day that contains the cumulative sum of tip for each day.

tips = px.data.tips()
tips = tips.sort_values('day')
tips

# Your code here:
| total_bill | tip | sex | smoker | day | time | size
---|---|---|---|---|---|---|---|
96 | 27.28 | 4.00 | Male | Yes | Fri | Dinner | 2 |
101 | 15.38 | 3.00 | Female | Yes | Fri | Dinner | 2 |
98 | 21.01 | 3.00 | Male | Yes | Fri | Dinner | 2 |
97 | 12.03 | 1.50 | Male | Yes | Fri | Dinner | 2 |
95 | 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
132 | 11.17 | 1.50 | Female | No | Thur | Lunch | 2 |
131 | 20.27 | 2.83 | Female | No | Thur | Lunch | 2 |
130 | 19.08 | 1.50 | Male | No | Thur | Lunch | 2 |
128 | 11.38 | 2.00 | Female | No | Thur | Lunch | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
20.11 Wrap-Up
In this lesson, you’ve learned several powerful group-level data transformations in pandas:
- Adding Summary Statistics: Using transform() to add group-level calculations as new columns
- Counting Within Groups: Using value_counts() to count occurrences in groups
- Computing Cumulative Sums: Tracking running totals within groups
These techniques allow you to analyze patterns and statistics within specific subsets of your data. Keep practicing with different datasets to build your data manipulation skills!
See you next time!