18 Conditional Transformations of Variables

18.1 Introduction

In the previous lesson, you learned the basics of data transformation in pandas.

In this lesson, we will explore how to conditionally transform variables in pandas using methods like replace() and custom functions.

Conditional transformations are essential when you need to recode variables or create new ones based on specific criteria.

Let’s dive in!

18.2 Learning Objectives

By the end of this lesson, you will:

Be able to transform or create new variables based on conditions using replace() and dictionaries.
Know how to handle NaN values in replace() transformations.
Be able to define and apply custom functions to recode variables.

18.3 Packages

This lesson will require pandas, numpy, plotly.express, and vega_datasets:

import pandas as pd
import numpy as np
import vega_datasets as vd
import plotly.express as px

18.4 Introduction to `replace()`

One common task in data wrangling is to replace values in a column based on certain conditions. The replace() method in pandas is a versatile tool for this purpose.

In the tips dataset, the day column contains abbreviated day names:

tips = px.data.tips()
tips['day'].unique()

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

Our goal is to replace these abbreviations with the full day names.

We can create a dictionary that maps the abbreviated names to the full names:

day_mapping = {
    "Sun": "Sunday",
    "Sat": "Saturday",
    "Fri": "Friday",
    "Thur": "Thursday"
}

Now, we use the replace() method with the dictionary:

tips['day_full'] = tips['day'].replace(day_mapping)
tips

	total_bill	tip	sex	smoker	day	time	size	day_full
0	16.99	1.01	Female	No	Sun	Dinner	2	Sunday
1	10.34	1.66	Male	No	Sun	Dinner	3	Sunday
2	21.01	3.50	Male	No	Sun	Dinner	3	Sunday
...	...	...	...	...	...	...	...	...
241	22.67	2.00	Male	Yes	Sat	Dinner	2	Saturday
242	17.82	1.75	Male	No	Sat	Dinner	2	Saturday
243	18.78	3.00	Female	No	Thur	Dinner	2	Thursday

244 rows × 8 columns

Alternatively, we can perform the replacement directly within the replace() method without explicitly defining a dictionary:

tips['day_full'] = tips['day'].replace({
    "Sun": "Sunday",
    "Sat": "Saturday",
    "Fri": "Friday",
    "Thur": "Thursday"
})
tips[['day', 'day_full']].head()

	day	day_full
0	Sun	Sunday
1	Sun	Sunday
2	Sun	Sunday
3	Sun	Sunday
4	Sun	Sunday

Practice

18.5 Practice Q: Abbreviate Sex

Using the tips dataset, replace the values in the sex column to abbreviate gender:

Replace "Female" with "F".
Replace "Male" with "M".

Assign the result to a new column called sex_abbr and display the first few rows.

# Your code here:

18.6 Handling Missing Values with `replace()`

Sometimes, your dataset may contain missing values (NaN or None) that you want to replace with a placeholder or specific value. The replace() method can handle this.

Let’s examine the Creative_Type column in the movies dataset from vega_datasets:

movies = vd.data.movies()
movies['Creative_Type'].value_counts(dropna=False)

Creative_Type
Contemporary Fiction       1453
None                        446
Historical Fiction          350
                           ... 
Factual                      49
Super Hero                   49
Multiple Creative Types       1
Name: count, Length: 10, dtype: int64

Notice that there are some None values in the Creative_Type column.

Let’s replace None with "Unknown/Unclear":

movies['Creative_Type'] = movies['Creative_Type'].replace({
    None: "Unknown/Unclear", # 👈 On this line, None is the key
})

Now, let’s verify the replacement:

movies['Creative_Type'].value_counts(dropna=False)

Creative_Type
Contemporary Fiction       1453
Unknown/Unclear             446
Historical Fiction          350
                           ... 
Factual                      49
Super Hero                   49
Multiple Creative Types       1
Name: count, Length: 10, dtype: int64

While None is typically used to represent missing strings, NaN is used for missing numbers. Consider the US_DVD_Sales column:

movies.query("US_DVD_Sales.isna()").shape # Check the number of missing values

(2637, 16)

movies['US_DVD_Sales'].tail(10) # View the last 10 values. Some are missing.

3191     3273039.0
3192    22025352.0
3193           NaN
           ...    
3198     6679409.0
3199           NaN
3200           NaN
Name: US_DVD_Sales, Length: 10, dtype: float64

We can use replace() to replace NaN with 0:

movies['US_DVD_Sales'] = movies['US_DVD_Sales'].replace({
    np.nan: 0 # 👈 `NaN` is represented by `np.nan` in pandas
})

Let’s verify the replacement:

movies['US_DVD_Sales'].tail(10)

3191     3273039.0
3192    22025352.0
3193           0.0
           ...    
3198     6679409.0
3199           0.0
3200           0.0
Name: US_DVD_Sales, Length: 10, dtype: float64

movies.query("US_DVD_Sales.isna()").shape

(0, 16)

18.7 Practice Q: Standardize MPAA Ratings

In the movies dataset, the MPAA_Rating column contains movie ratings. Some entries are None or "Not Rated". Replace both None and "Not Rated" with "Unrated".

Then, use value_counts() to see how many movies are unrated. There should be 699 movies in this category.

# Your code here:

18.8 Categorizing Numeric Data with Custom Functions

Recall from our previous lesson that we can use custom functions with conditional logic to transform variables. For example, we can categorize the US_Gross column into three categories based on the following criteria:

If the value is less than 10 million, the category is "Low".
If the value is between 10 million and 50 million, the category is "Medium".
If the value is greater than 50 million, the category is "High".

def categ_gross(gross):
    if gross < 10000000:
        return "Low"
    elif gross >= 10000000 and gross <= 50000000:
        return "Medium"
    elif gross > 50000000:
        return "High"
    else:
        return None 


categ_gross_vec = np.vectorize(categ_gross)

Side Note

The np.vectorize function in the above case will return None as a string. To enforce the None type, you can use the otypes parameter:

categ_gross_vec = np.vectorize(categ_gross, otypes=[object])

Now we can apply it to the entire column:

movies['Gross_Category'] = categ_gross_vec(movies['US_Gross'])
movies['Gross_Category'].value_counts(dropna=False)

Gross_Category
Medium    1241
Low       1046
High       907
None         7
Name: count, dtype: int64

This can also be achieved with pd.cut(), np.where() and np.select(). But the custom function approach is the most flexible. Below we’ll see how to extend this to more complex conditions.

18.9 Complex Transformations with Custom Functions

The flexibility of custom functions can be extended easily to more complex conditional transformations.

For example, suppose we want to flag superhero movies as “US action movie” or “Global action movie” based on their US and worldwide gross earnings.

For Super Hero movies, if the US gross and worldwide gross are the same (indicating sales were only in the US), the movie is flagged as a US action movie.
For Super Hero movies, if the worldwide gross is greater than the US gross, the movie is flagged as a global action movie.
For all other movies, we leave the flag blank

We can define a funcion that takes in three arguments and returns the appropriate flag:

# Define the function to flag movies based on the conditions
def flag_movie(movie_type, us, worldwide):
    if movie_type == 'Super Hero' and us == worldwide:
        return 'US action movie'
    elif movie_type == 'Super Hero' and worldwide > us:
        return 'Global action movie'
    else:
        return None

Let’s test it out with a few sets of values:

print(flag_movie(movie_type='Super Hero', us=100, worldwide=100))
print(flag_movie(movie_type='Super Hero', us=100, worldwide=200))
print(flag_movie(movie_type='Comedy', us=100, worldwide=100))

US action movie
Global action movie
None

Now, let’s vectorize it:

flag_movie_vec = np.vectorize(flag_movie)

We can now apply it to the columns:

movies['Action_Flag'] = flag_movie_vec(movies['Creative_Type'], movies['US_Gross'], movies['Worldwide_Gross'])
movies

	Title	US_Gross	Worldwide_Gross	US_DVD_Sales	Production_Budget	Release_Date	MPAA_Rating	Running_Time_min	Distributor	Source	Major_Genre	Creative_Type	Director	Rotten_Tomatoes_Rating	IMDB_Rating	IMDB_Votes	Gross_Category	Action_Flag
0	The Land Girls	146083.0	146083.0	0.0	8000000.0	Jun 12 1998	R	NaN	Gramercy	None	None	Unknown/Unclear	None	NaN	6.1	1071.0	Low	None
1	First Love, Last Rites	10876.0	10876.0	0.0	300000.0	Aug 07 1998	R	NaN	Strand	None	Drama	Unknown/Unclear	None	NaN	6.9	207.0	Low	None
2	I Married a Strange Person	203134.0	203134.0	0.0	250000.0	Aug 28 1998	None	NaN	Lionsgate	None	Comedy	Unknown/Unclear	None	NaN	6.8	865.0	Low	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3198	Zoom	11989328.0	12506188.0	6679409.0	35000000.0	Aug 11 2006	PG	NaN	Sony Pictures	Based on Comic/Graphic Novel	Adventure	Super Hero	Peter Hewitt	3.0	3.4	7424.0	Medium	Global action movie
3199	The Legend of Zorro	45575336.0	141475336.0	0.0	80000000.0	Oct 28 2005	PG	129.0	Sony Pictures	Remake	Adventure	Historical Fiction	Martin Campbell	26.0	5.7	21161.0	Medium	None
3200	The Mask of Zorro	93828745.0	233700000.0	0.0	65000000.0	Jul 17 1998	PG-13	136.0	Sony Pictures	Remake	Adventure	Historical Fiction	Martin Campbell	82.0	6.7	4789.0	High	None

3201 rows × 18 columns

To see the distribution of movie categories based on our flag, we can use value_counts():

movies['Action_Flag'].value_counts(dropna=False)

Action_Flag
None                   3152
Global action movie      42
US action movie           7
Name: count, dtype: int64

18.9.1 Practice: Flag Movies Based on Ratings

In the movies dataset, flag movies as Critic-friendly or Commercial based on their Rotten Tomatoes and IMDB ratings.

If the Rotten Tomatoes rating is above 70% and the IMDB rating is below 5, the movie is flagged as Critic-friendly.
If the Rotten Tomatoes rating is below 50% and the IMDB rating is above 7, the movie is flagged as Commercial.
Otherwise, the movie is categorized as Other.
Count how many movies are Critic-friendly and Commercial. There should be 13 Critic-friendly movies and 33 Commercial movies. Do you recognize any of them?

# Your code here:

18.10 Wrap-Up

In this lesson, you learned how to conditionally transform variables in pandas using:

The replace() method with dictionaries to map and replace specific values.
Handling missing values (NaN or None) during replacements.
Defining custom functions and applying them to handle complex conditions.

These techniques are powerful tools for data cleaning and preprocessing, allowing you to reshape your data to meet your analysis needs.

See you next time!