import pandas as pd
import numpy as np
import vega_datasets as vd
import plotly.express as px
18 Conditional Transformations of Variables
18.1 Introduction
In the previous lesson, you learned the basics of data transformation in pandas.
In this lesson, we will explore how to conditionally transform variables in pandas using methods like replace()
and custom functions.
Conditional transformations are essential when you need to recode variables or create new ones based on specific criteria.
Let’s dive in!
18.2 Learning Objectives
By the end of this lesson, you will:
- Be able to transform or create new variables based on conditions using
replace()
and dictionaries. - Know how to handle
NaN
values inreplace()
transformations. - Be able to define and apply custom functions to recode variables.
18.3 Packages
This lesson will require pandas
, numpy
, plotly.express
, and vega_datasets
:
18.4 Introduction to replace()
One common task in data wrangling is to replace values in a column based on certain conditions. The replace()
method in pandas is a versatile tool for this purpose.
In the tips
dataset, the day
column contains abbreviated day names:
= px.data.tips()
tips 'day'].unique() tips[
array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)
Our goal is to replace these abbreviations with the full day names.
We can create a dictionary that maps the abbreviated names to the full names:
= {
day_mapping "Sun": "Sunday",
"Sat": "Saturday",
"Fri": "Friday",
"Thur": "Thursday"
}
Now, we use the replace()
method with the dictionary:
'day_full'] = tips['day'].replace(day_mapping)
tips[ tips
total_bill | tip | sex | smoker | day | time | size | day_full | |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | Sunday |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | Sunday |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | Sunday |
... | ... | ... | ... | ... | ... | ... | ... | ... |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | Saturday |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | Saturday |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | Thursday |
244 rows × 8 columns
Alternatively, we can perform the replacement directly within the replace()
method without explicitly defining a dictionary:
'day_full'] = tips['day'].replace({
tips["Sun": "Sunday",
"Sat": "Saturday",
"Fri": "Friday",
"Thur": "Thursday"
})'day', 'day_full']].head() tips[[
day | day_full | |
---|---|---|
0 | Sun | Sunday |
1 | Sun | Sunday |
2 | Sun | Sunday |
3 | Sun | Sunday |
4 | Sun | Sunday |
18.5 Practice Q: Abbreviate Sex
Using the tips
dataset, replace the values in the sex
column to abbreviate gender:
- Replace
"Female"
with"F"
. - Replace
"Male"
with"M"
.
Assign the result to a new column called sex_abbr
and display the first few rows.
# Your code here:
18.6 Handling Missing Values with replace()
Sometimes, your dataset may contain missing values (NaN
or None
) that you want to replace with a placeholder or specific value. The replace()
method can handle this.
Let’s examine the Creative_Type
column in the movies
dataset from vega_datasets:
= vd.data.movies()
movies 'Creative_Type'].value_counts(dropna=False) movies[
Creative_Type
Contemporary Fiction 1453
None 446
Historical Fiction 350
...
Factual 49
Super Hero 49
Multiple Creative Types 1
Name: count, Length: 10, dtype: int64
Notice that there are some None
values in the Creative_Type
column.
Let’s replace None
with "Unknown/Unclear"
:
'Creative_Type'] = movies['Creative_Type'].replace({
movies[None: "Unknown/Unclear", # 👈 On this line, None is the key
})
Now, let’s verify the replacement:
'Creative_Type'].value_counts(dropna=False) movies[
Creative_Type
Contemporary Fiction 1453
Unknown/Unclear 446
Historical Fiction 350
...
Factual 49
Super Hero 49
Multiple Creative Types 1
Name: count, Length: 10, dtype: int64
While None
is typically used to represent missing strings, NaN
is used for missing numbers. Consider the US_DVD_Sales
column:
"US_DVD_Sales.isna()").shape # Check the number of missing values movies.query(
(2637, 16)
'US_DVD_Sales'].tail(10) # View the last 10 values. Some are missing. movies[
3191 3273039.0
3192 22025352.0
3193 NaN
...
3198 6679409.0
3199 NaN
3200 NaN
Name: US_DVD_Sales, Length: 10, dtype: float64
We can use replace()
to replace NaN
with 0:
'US_DVD_Sales'] = movies['US_DVD_Sales'].replace({
movies[0 # 👈 `NaN` is represented by `np.nan` in pandas
np.nan: })
Let’s verify the replacement:
'US_DVD_Sales'].tail(10) movies[
3191 3273039.0
3192 22025352.0
3193 0.0
...
3198 6679409.0
3199 0.0
3200 0.0
Name: US_DVD_Sales, Length: 10, dtype: float64
"US_DVD_Sales.isna()").shape movies.query(
(0, 16)
18.7 Practice Q: Standardize MPAA Ratings
In the movies
dataset, the MPAA_Rating
column contains movie ratings. Some entries are None
or "Not Rated"
. Replace both None
and "Not Rated"
with "Unrated"
.
Then, use value_counts()
to see how many movies are unrated. There should be 699 movies in this category.
# Your code here:
18.8 Categorizing Numeric Data with Custom Functions
Recall from our previous lesson that we can use custom functions with conditional logic to transform variables. For example, we can categorize the US_Gross
column into three categories based on the following criteria:
- If the value is less than 10 million, the category is
"Low"
. - If the value is between 10 million and 50 million, the category is
"Medium"
. - If the value is greater than 50 million, the category is
"High"
.
def categ_gross(gross):
if gross < 10000000:
return "Low"
elif gross >= 10000000 and gross <= 50000000:
return "Medium"
elif gross > 50000000:
return "High"
else:
return None
= np.vectorize(categ_gross) categ_gross_vec
The np.vectorize
function in the above case will return None
as a string. To enforce the None
type, you can use the otypes
parameter:
= np.vectorize(categ_gross, otypes=[object]) categ_gross_vec
Now we can apply it to the entire column:
'Gross_Category'] = categ_gross_vec(movies['US_Gross'])
movies['Gross_Category'].value_counts(dropna=False) movies[
Gross_Category
Medium 1241
Low 1046
High 907
None 7
Name: count, dtype: int64
This can also be achieved with pd.cut()
, np.where()
and np.select()
. But the custom function approach is the most flexible. Below we’ll see how to extend this to more complex conditions.
18.9 Complex Transformations with Custom Functions
The flexibility of custom functions can be extended easily to more complex conditional transformations.
For example, suppose we want to flag superhero movies as “US action movie” or “Global action movie” based on their US and worldwide gross earnings.
- For Super Hero movies, if the US gross and worldwide gross are the same (indicating sales were only in the US), the movie is flagged as a US action movie.
- For Super Hero movies, if the worldwide gross is greater than the US gross, the movie is flagged as a global action movie.
- For all other movies, we leave the flag blank
We can define a funcion that takes in three arguments and returns the appropriate flag:
# Define the function to flag movies based on the conditions
def flag_movie(movie_type, us, worldwide):
if movie_type == 'Super Hero' and us == worldwide:
return 'US action movie'
elif movie_type == 'Super Hero' and worldwide > us:
return 'Global action movie'
else:
return None
Let’s test it out with a few sets of values:
print(flag_movie(movie_type='Super Hero', us=100, worldwide=100))
print(flag_movie(movie_type='Super Hero', us=100, worldwide=200))
print(flag_movie(movie_type='Comedy', us=100, worldwide=100))
US action movie
Global action movie
None
Now, let’s vectorize it:
= np.vectorize(flag_movie) flag_movie_vec
We can now apply it to the columns:
'Action_Flag'] = flag_movie_vec(movies['Creative_Type'], movies['US_Gross'], movies['Worldwide_Gross'])
movies[ movies
Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes | Gross_Category | Action_Flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | The Land Girls | 146083.0 | 146083.0 | 0.0 | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | Unknown/Unclear | None | NaN | 6.1 | 1071.0 | Low | None |
1 | First Love, Last Rites | 10876.0 | 10876.0 | 0.0 | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | Unknown/Unclear | None | NaN | 6.9 | 207.0 | Low | None |
2 | I Married a Strange Person | 203134.0 | 203134.0 | 0.0 | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | Unknown/Unclear | None | NaN | 6.8 | 865.0 | Low | None |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3198 | Zoom | 11989328.0 | 12506188.0 | 6679409.0 | 35000000.0 | Aug 11 2006 | PG | NaN | Sony Pictures | Based on Comic/Graphic Novel | Adventure | Super Hero | Peter Hewitt | 3.0 | 3.4 | 7424.0 | Medium | Global action movie |
3199 | The Legend of Zorro | 45575336.0 | 141475336.0 | 0.0 | 80000000.0 | Oct 28 2005 | PG | 129.0 | Sony Pictures | Remake | Adventure | Historical Fiction | Martin Campbell | 26.0 | 5.7 | 21161.0 | Medium | None |
3200 | The Mask of Zorro | 93828745.0 | 233700000.0 | 0.0 | 65000000.0 | Jul 17 1998 | PG-13 | 136.0 | Sony Pictures | Remake | Adventure | Historical Fiction | Martin Campbell | 82.0 | 6.7 | 4789.0 | High | None |
3201 rows × 18 columns
To see the distribution of movie categories based on our flag, we can use value_counts()
:
'Action_Flag'].value_counts(dropna=False) movies[
Action_Flag
None 3152
Global action movie 42
US action movie 7
Name: count, dtype: int64
18.9.1 Practice: Flag Movies Based on Ratings
In the movies
dataset, flag movies as Critic-friendly or Commercial based on their Rotten Tomatoes and IMDB ratings.
- If the Rotten Tomatoes rating is above 70% and the IMDB rating is below 5, the movie is flagged as Critic-friendly.
- If the Rotten Tomatoes rating is below 50% and the IMDB rating is above 7, the movie is flagged as Commercial.
- Otherwise, the movie is categorized as Other.
- Count how many movies are Critic-friendly and Commercial. There should be 13 Critic-friendly movies and 33 Commercial movies. Do you recognize any of them?
# Your code here:
18.10 Wrap-Up
In this lesson, you learned how to conditionally transform variables in pandas using:
- The
replace()
method with dictionaries to map and replace specific values. - Handling missing values (
NaN
orNone
) during replacements. - Defining custom functions and applying them to handle complex conditions.
These techniques are powerful tools for data cleaning and preprocessing, allowing you to reshape your data to meet your analysis needs.
See you next time!