18  Conditional Transformations of Variables

18.1 Introduction

In the previous lesson, you learned the basics of data transformation in pandas.

In this lesson, we will explore how to conditionally transform variables in pandas using methods like replace() and custom functions.

Conditional transformations are essential when you need to recode variables or create new ones based on specific criteria.

Let’s dive in!

18.2 Learning Objectives

By the end of this lesson, you will:

  • Be able to transform or create new variables based on conditions using replace() and dictionaries.
  • Know how to handle NaN values in replace() transformations.
  • Be able to define and apply custom functions to recode variables.

18.3 Packages

This lesson will require pandas, numpy, plotly.express, and vega_datasets:

import pandas as pd
import numpy as np
import vega_datasets as vd
import plotly.express as px

18.4 Introduction to replace()

One common task in data wrangling is to replace values in a column based on certain conditions. The replace() method in pandas is a versatile tool for this purpose.

In the tips dataset, the day column contains abbreviated day names:

tips = px.data.tips()
tips['day'].unique()
array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

Our goal is to replace these abbreviations with the full day names.

We can create a dictionary that maps the abbreviated names to the full names:

day_mapping = {
    "Sun": "Sunday",
    "Sat": "Saturday",
    "Fri": "Friday",
    "Thur": "Thursday"
}

Now, we use the replace() method with the dictionary:

tips['day_full'] = tips['day'].replace(day_mapping)
tips
total_bill tip sex smoker day time size day_full
0 16.99 1.01 Female No Sun Dinner 2 Sunday
1 10.34 1.66 Male No Sun Dinner 3 Sunday
2 21.01 3.50 Male No Sun Dinner 3 Sunday
... ... ... ... ... ... ... ... ...
241 22.67 2.00 Male Yes Sat Dinner 2 Saturday
242 17.82 1.75 Male No Sat Dinner 2 Saturday
243 18.78 3.00 Female No Thur Dinner 2 Thursday

244 rows × 8 columns

Alternatively, we can perform the replacement directly within the replace() method without explicitly defining a dictionary:

tips['day_full'] = tips['day'].replace({
    "Sun": "Sunday",
    "Sat": "Saturday",
    "Fri": "Friday",
    "Thur": "Thursday"
})
tips[['day', 'day_full']].head()
day day_full
0 Sun Sunday
1 Sun Sunday
2 Sun Sunday
3 Sun Sunday
4 Sun Sunday
Practice

18.5 Practice Q: Abbreviate Sex

Using the tips dataset, replace the values in the sex column to abbreviate gender:

  • Replace "Female" with "F".
  • Replace "Male" with "M".

Assign the result to a new column called sex_abbr and display the first few rows.

# Your code here:

18.6 Handling Missing Values with replace()

Sometimes, your dataset may contain missing values (NaN or None) that you want to replace with a placeholder or specific value. The replace() method can handle this.

Let’s examine the Creative_Type column in the movies dataset from vega_datasets:

movies = vd.data.movies()
movies['Creative_Type'].value_counts(dropna=False)
Creative_Type
Contemporary Fiction       1453
None                        446
Historical Fiction          350
                           ... 
Factual                      49
Super Hero                   49
Multiple Creative Types       1
Name: count, Length: 10, dtype: int64

Notice that there are some None values in the Creative_Type column.

Let’s replace None with "Unknown/Unclear":

movies['Creative_Type'] = movies['Creative_Type'].replace({
    None: "Unknown/Unclear", # 👈 On this line, None is the key
})

Now, let’s verify the replacement:

movies['Creative_Type'].value_counts(dropna=False)
Creative_Type
Contemporary Fiction       1453
Unknown/Unclear             446
Historical Fiction          350
                           ... 
Factual                      49
Super Hero                   49
Multiple Creative Types       1
Name: count, Length: 10, dtype: int64

While None is typically used to represent missing strings, NaN is used for missing numbers. Consider the US_DVD_Sales column:

movies.query("US_DVD_Sales.isna()").shape # Check the number of missing values
(2637, 16)
movies['US_DVD_Sales'].tail(10) # View the last 10 values. Some are missing.
3191     3273039.0
3192    22025352.0
3193           NaN
           ...    
3198     6679409.0
3199           NaN
3200           NaN
Name: US_DVD_Sales, Length: 10, dtype: float64

We can use replace() to replace NaN with 0:

movies['US_DVD_Sales'] = movies['US_DVD_Sales'].replace({
    np.nan: 0 # 👈 `NaN` is represented by `np.nan` in pandas
})

Let’s verify the replacement:

movies['US_DVD_Sales'].tail(10)
3191     3273039.0
3192    22025352.0
3193           0.0
           ...    
3198     6679409.0
3199           0.0
3200           0.0
Name: US_DVD_Sales, Length: 10, dtype: float64
movies.query("US_DVD_Sales.isna()").shape
(0, 16)

18.7 Practice Q: Standardize MPAA Ratings

In the movies dataset, the MPAA_Rating column contains movie ratings. Some entries are None or "Not Rated". Replace both None and "Not Rated" with "Unrated".

Then, use value_counts() to see how many movies are unrated. There should be 699 movies in this category.

# Your code here:

18.8 Categorizing Numeric Data with Custom Functions

Recall from our previous lesson that we can use custom functions with conditional logic to transform variables. For example, we can categorize the US_Gross column into three categories based on the following criteria:

  • If the value is less than 10 million, the category is "Low".
  • If the value is between 10 million and 50 million, the category is "Medium".
  • If the value is greater than 50 million, the category is "High".
def categ_gross(gross):
    if gross < 10000000:
        return "Low"
    elif gross >= 10000000 and gross <= 50000000:
        return "Medium"
    elif gross > 50000000:
        return "High"
    else:
        return None 


categ_gross_vec = np.vectorize(categ_gross)
Side Note

The np.vectorize function in the above case will return None as a string. To enforce the None type, you can use the otypes parameter:

categ_gross_vec = np.vectorize(categ_gross, otypes=[object])

Now we can apply it to the entire column:

movies['Gross_Category'] = categ_gross_vec(movies['US_Gross'])
movies['Gross_Category'].value_counts(dropna=False)
Gross_Category
Medium    1241
Low       1046
High       907
None         7
Name: count, dtype: int64

This can also be achieved with pd.cut(), np.where() and np.select(). But the custom function approach is the most flexible. Below we’ll see how to extend this to more complex conditions.

18.9 Complex Transformations with Custom Functions

The flexibility of custom functions can be extended easily to more complex conditional transformations.

For example, suppose we want to flag superhero movies as “US action movie” or “Global action movie” based on their US and worldwide gross earnings.

  • For Super Hero movies, if the US gross and worldwide gross are the same (indicating sales were only in the US), the movie is flagged as a US action movie.
  • For Super Hero movies, if the worldwide gross is greater than the US gross, the movie is flagged as a global action movie.
  • For all other movies, we leave the flag blank

We can define a funcion that takes in three arguments and returns the appropriate flag:

# Define the function to flag movies based on the conditions
def flag_movie(movie_type, us, worldwide):
    if movie_type == 'Super Hero' and us == worldwide:
        return 'US action movie'
    elif movie_type == 'Super Hero' and worldwide > us:
        return 'Global action movie'
    else:
        return None

Let’s test it out with a few sets of values:

print(flag_movie(movie_type='Super Hero', us=100, worldwide=100))
print(flag_movie(movie_type='Super Hero', us=100, worldwide=200))
print(flag_movie(movie_type='Comedy', us=100, worldwide=100))
US action movie
Global action movie
None

Now, let’s vectorize it:

flag_movie_vec = np.vectorize(flag_movie)

We can now apply it to the columns:

movies['Action_Flag'] = flag_movie_vec(movies['Creative_Type'], movies['US_Gross'], movies['Worldwide_Gross'])
movies
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes Gross_Category Action_Flag
0 The Land Girls 146083.0 146083.0 0.0 8000000.0 Jun 12 1998 R NaN Gramercy None None Unknown/Unclear None NaN 6.1 1071.0 Low None
1 First Love, Last Rites 10876.0 10876.0 0.0 300000.0 Aug 07 1998 R NaN Strand None Drama Unknown/Unclear None NaN 6.9 207.0 Low None
2 I Married a Strange Person 203134.0 203134.0 0.0 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy Unknown/Unclear None NaN 6.8 865.0 Low None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3198 Zoom 11989328.0 12506188.0 6679409.0 35000000.0 Aug 11 2006 PG NaN Sony Pictures Based on Comic/Graphic Novel Adventure Super Hero Peter Hewitt 3.0 3.4 7424.0 Medium Global action movie
3199 The Legend of Zorro 45575336.0 141475336.0 0.0 80000000.0 Oct 28 2005 PG 129.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 26.0 5.7 21161.0 Medium None
3200 The Mask of Zorro 93828745.0 233700000.0 0.0 65000000.0 Jul 17 1998 PG-13 136.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 82.0 6.7 4789.0 High None

3201 rows × 18 columns

To see the distribution of movie categories based on our flag, we can use value_counts():

movies['Action_Flag'].value_counts(dropna=False)
Action_Flag
None                   3152
Global action movie      42
US action movie           7
Name: count, dtype: int64

18.9.1 Practice: Flag Movies Based on Ratings

In the movies dataset, flag movies as Critic-friendly or Commercial based on their Rotten Tomatoes and IMDB ratings.

  • If the Rotten Tomatoes rating is above 70% and the IMDB rating is below 5, the movie is flagged as Critic-friendly.
  • If the Rotten Tomatoes rating is below 50% and the IMDB rating is above 7, the movie is flagged as Commercial.
  • Otherwise, the movie is categorized as Other.
  • Count how many movies are Critic-friendly and Commercial. There should be 13 Critic-friendly movies and 33 Commercial movies. Do you recognize any of them?
# Your code here:

18.10 Wrap-Up

In this lesson, you learned how to conditionally transform variables in pandas using:

  • The replace() method with dictionaries to map and replace specific values.
  • Handling missing values (NaN or None) during replacements.
  • Defining custom functions and applying them to handle complex conditions.

These techniques are powerful tools for data cleaning and preprocessing, allowing you to reshape your data to meet your analysis needs.

See you next time!