17  Transforming Variables in pandas

17.1 Introduction

In data analysis, one of the most common tasks is transforming variables within your dataset. The pandas library provides straightforward and efficient ways to accomplish this.

17.2 Learning Objectives

  • Understand how to create new variables in a DataFrame.
  • Learn how to modify existing variables.
  • Handle potential issues with modifying variables on views.

17.3 Imports

First, let’s import the pandas package:

import pandas as pd

Now we’ll set an important option that will help us avoid some warnings down the line. Later in the lesson, we’ll discuss this in more detail.

pd.options.mode.copy_on_write = True

17.4 Dataset

In this lesson, we’ll use a dataset of United States counties with demographic and economic data. You can download the dataset from this link:https://github.com/the-graph-courses/idap_book/raw/refs/heads/main/data/us_counties_data.zip.

Once you’ve downloaded the file, unzip it and place the us_counties_data.csv file in the data folder for your project.

counties = pd.read_csv("data/us_counties_data.csv")
counties
state county pop_20 area_sq_miles hh_inc_21 econ_type unemp_20 foreign_born_num pop_change_2010_2020 pct_emp_change_2010_2021
0 AL Autauga, AL 58877.0 594.456107 66444.0 Nonspecialized 5.4 1241.0 7.758700 9.0
1 AL Baldwin, AL 233140.0 1589.836014 65658.0 Recreation 6.2 7938.0 27.159356 28.2
... ... ... ... ... ... ... ... ... ... ...
3224 PR Yabucoa, PR 30364.0 55.214614 NaN NaN NaN NaN -19.807069 0.1
3225 PR Yauco, PR 34062.0 67.711484 NaN NaN NaN NaN -18.721309 -5.3

3226 rows × 10 columns

The variables in the dataset are:

  • state: US state
  • county: US county
  • pop_20: Population estimate for 2020
  • area_sq_miles: Area in square miles
  • hh_inc_21: Median household income for 2021
  • econ_type: Economic type of the county
  • pop_change_2010_2020: Population change between 2010 and 2020 (%)
  • unemp_20: Unemployment rate for 2020 (%)
  • pct_emp_change_2010_2021: Percentage change in employment between 2010 and 2021 (%)
  • foreign_born_num: Number of foreign-born residents

Let’s create a small subset of the dataset with just the area and population columns for illustration.

# Small subset for illustration
area_df = counties[["county", "area_sq_miles", "pop_20"]]
area_df
county area_sq_miles pop_20
0 Autauga, AL 594.456107 58877.0
1 Baldwin, AL 1589.836014 233140.0
... ... ... ...
3224 Yabucoa, PR 55.214614 30364.0
3225 Yauco, PR 67.711484 34062.0

3226 rows × 3 columns

17.5 Creating New Variables

Suppose we want to convert the area from square miles to square kilometers. Since 1 square mile is approximately 2.59 square kilometers, we can create a new variable area_sq_km by multiplying the area_sq_miles column by 2.59.

area_df["area_sq_km"] = area_df["area_sq_miles"] * 2.59
area_df
county area_sq_miles pop_20 area_sq_km
0 Autauga, AL 594.456107 58877.0 1539.641317
1 Baldwin, AL 1589.836014 233140.0 4117.675277
... ... ... ... ...
3224 Yabucoa, PR 55.214614 30364.0 143.005851
3225 Yauco, PR 67.711484 34062.0 175.372743

3226 rows × 4 columns

The syntax is very easy to understand, although a bit hard to type.

With area_df["area_sq_km"], we’re indicating that we want to create a new column called area_sq_km, then area_df["area_sq_miles"] * 2.59 is the expression that computes the values for this new column.

Let’s add another variable, this time in hectares. The conversion factor is 1 square mile = 259 hectares.

# Convert area to hectares as well
area_df["area_hectares"] = area_df["area_sq_miles"] * 259
area_df
county area_sq_miles pop_20 area_sq_km area_hectares
0 Autauga, AL 594.456107 58877.0 1539.641317 153964.131747
1 Baldwin, AL 1589.836014 233140.0 4117.675277 411767.527703
... ... ... ... ... ...
3224 Yabucoa, PR 55.214614 30364.0 143.005851 14300.585058
3225 Yauco, PR 67.711484 34062.0 175.372743 17537.274254

3226 rows × 5 columns

Practice

17.5.1 Practice Q: Area in Acres

Using the area_df dataset, create a new column called area_acres by multiplying the area_sq_miles variable by 640. Store the result back into area_df and display the DataFrame.

# Your code here

17.6 Modifying Existing Variables

Suppose we want to round the area_sq_km variable to one decimal place. We can call the round method on the area_sq_km column.

area_df["area_sq_km"] = area_df["area_sq_km"].round(1)
area_df
county area_sq_miles pop_20 area_sq_km area_hectares
0 Autauga, AL 594.456107 58877.0 1539.6 153964.131747
1 Baldwin, AL 1589.836014 233140.0 4117.7 411767.527703
... ... ... ... ... ...
3224 Yabucoa, PR 55.214614 30364.0 143.0 14300.585058
3225 Yauco, PR 67.711484 34062.0 175.4 17537.274254

3226 rows × 5 columns

Practice

17.6.1 Practice Q: Rounding area_acres

Using the area_df dataset, round the area_acres variable to one decimal place. Update the DataFrame area_df and display it.

# Your code here

17.7 Calculations with Multiple Variables

We can create new variables based on multiple existing variables.

For example, let’s calculate the population density per square kilometer.

area_df["pop_per_sq_km"] = area_df["pop_20"] / area_df["area_sq_km"]
area_df
county area_sq_miles pop_20 area_sq_km area_hectares pop_per_sq_km
0 Autauga, AL 594.456107 58877.0 1539.6 153964.131747 38.241751
1 Baldwin, AL 1589.836014 233140.0 4117.7 411767.527703 56.618986
... ... ... ... ... ... ...
3224 Yabucoa, PR 55.214614 30364.0 143.0 14300.585058 212.335664
3225 Yauco, PR 67.711484 34062.0 175.4 17537.274254 194.196123

3226 rows × 6 columns

We could tag on the round method to this output to round the result to one decimal place.

area_df["pop_per_sq_km"] = (area_df["pop_20"] / area_df["area_sq_km"]).round(1)
area_df
county area_sq_miles pop_20 area_sq_km area_hectares pop_per_sq_km
0 Autauga, AL 594.456107 58877.0 1539.6 153964.131747 38.2
1 Baldwin, AL 1589.836014 233140.0 4117.7 411767.527703 56.6
... ... ... ... ... ... ...
3224 Yabucoa, PR 55.214614 30364.0 143.0 14300.585058 212.3
3225 Yauco, PR 67.711484 34062.0 175.4 17537.274254 194.2

3226 rows × 6 columns

Or, if you prefer, you can do this in two steps:

area_df["pop_per_sq_km"] = area_df["pop_20"] / area_df["area_sq_km"]
area_df["pop_per_sq_km"] = area_df["pop_per_sq_km"].round(1)
area_df
county area_sq_miles pop_20 area_sq_km area_hectares pop_per_sq_km
0 Autauga, AL 594.456107 58877.0 1539.6 153964.131747 38.2
1 Baldwin, AL 1589.836014 233140.0 4117.7 411767.527703 56.6
... ... ... ... ... ... ...
3224 Yabucoa, PR 55.214614 30364.0 143.0 14300.585058 212.3
3225 Yauco, PR 67.711484 34062.0 175.4 17537.274254 194.2

3226 rows × 6 columns

After calculating the population density, we might want to sort the DataFrame based on this new variable. Let’s sort in descending order.

# Sort by population density in descending order
area_df = area_df.sort_values("pop_per_sq_km", ascending=False)
area_df
county area_sq_miles pop_20 area_sq_km area_hectares pop_per_sq_km
1863 New York, NY 22.656266 1687834.0 58.7 5867.972888 28753.6
1856 Kings, NY 69.376570 2727393.0 179.7 17968.531752 15177.5
... ... ... ... ... ... ...
98 Wrangell-Petersburg, AK NaN NaN NaN NaN NaN
2921 Bedford, VA NaN NaN NaN NaN NaN

3226 rows × 6 columns

We see that New York County has the highest population density in the dataset.

Practice

17.7.1 Practice Q: Calculate Foreign-Born Percentage

Use the counties dataset to calculate the percentage of foreign-born residents in each county. The variable foreign_born_num shows the number of foreign-born residents and pop_20 shows the total population. Sort the DataFrame in descending order of the percentage of foreign-born residents. Which two counties have the highest percentage of foreign-born residents?

# Your code here

17.8 Creating Boolean Variables

It is sometimes useful to create Boolean variables to categorize or flag data based on conditions. Boolean variables are variables that take on only two values: True or False.

Consider the pop_change_2010_2020 variable in the counties dataset, which shows the percentage change in population between 2010 and 2020.

changes_df = counties[["county", "pop_change_2010_2020", "pct_emp_change_2010_2021"]]
changes_df
county pop_change_2010_2020 pct_emp_change_2010_2021
0 Autauga, AL 7.758700 9.0
1 Baldwin, AL 27.159356 28.2
... ... ... ...
3224 Yabucoa, PR -19.807069 0.1
3225 Yauco, PR -18.721309 -5.3

3226 rows × 3 columns

We might want to create a Boolean variable to flag whether the population increased. For this, let’s set the pop_increase variable to True if the population increased and False otherwise.

Running the expression changes_df["pop_change_2010_2020"] > 0 returns a Series of Boolean values:

changes_df["pop_change_2010_2020"] > 0
0        True
1        True
        ...  
3224    False
3225    False
Name: pop_change_2010_2020, Length: 3226, dtype: bool

We can assign this Series of Boolean values to the pop_increase variable.

changes_df["pop_increase"] = changes_df["pop_change_2010_2020"] > 0
changes_df
county pop_change_2010_2020 pct_emp_change_2010_2021 pop_increase
0 Autauga, AL 7.758700 9.0 True
1 Baldwin, AL 27.159356 28.2 True
... ... ... ... ...
3224 Yabucoa, PR -19.807069 0.1 False
3225 Yauco, PR -18.721309 -5.3 False

3226 rows × 4 columns

Similarly, we can create a Boolean variable emp_increase for employment change.

changes_df["emp_increase"] = changes_df["pct_emp_change_2010_2021"] > 0
changes_df
county pop_change_2010_2020 pct_emp_change_2010_2021 pop_increase emp_increase
0 Autauga, AL 7.758700 9.0 True True
1 Baldwin, AL 27.159356 28.2 True True
... ... ... ... ... ...
3224 Yabucoa, PR -19.807069 0.1 False True
3225 Yauco, PR -18.721309 -5.3 False False

3226 rows × 5 columns

We can now filter the DataFrame to find counties where the population increased but employment decreased.

# Counties where population increased but employment decreased
pop_up_emp_down = changes_df.query("pop_increase == True & emp_increase == False")
pop_up_emp_down
county pop_change_2010_2020 pct_emp_change_2010_2021 pop_increase emp_increase
71 Bethel, AK 9.716099 -0.7 True False
75 Dillingham, AK 0.206313 -16.1 True False
... ... ... ... ... ...
3127 Campbell, WY 1.935708 -14.8 True False
3137 Natrona, WY 5.970842 -0.2 True False

242 rows × 5 columns

You could also write this in shorthand like so:

# Counties where population increased but employment decreased
pop_up_emp_down = changes_df.query("pop_increase & ~(emp_increase)")
pop_up_emp_down
county pop_change_2010_2020 pct_emp_change_2010_2021 pop_increase emp_increase
71 Bethel, AK 9.716099 -0.7 True False
75 Dillingham, AK 0.206313 -16.1 True False
... ... ... ... ... ...
3127 Campbell, WY 1.935708 -14.8 True False
3137 Natrona, WY 5.970842 -0.2 True False

242 rows × 5 columns

There are several such counties, which might be of interest for further analysis.

Practice

17.8.1 Practice Q: Categorize Counties by Foreign-Born Population

In a previous practice question, we calculated the percentage of foreign-born residents in each county. Now, create a Boolean variable foreign_born_pct_gt_30 that is True if the percentage is greater than 30%.

When you’re done, query the DataFrame to show only counties where foreign_born_pct_gt_30 is True. You should get 24 rows.

# Your code here

17.9 The Copy-on-Write Warning

Earlier in this lesson, we enabled “copy-on-write” mode. Let’s see what happens when this feature is disabled.

pd.set_option("mode.copy_on_write", False)

# Create a small subset of our data
subset = counties.query("state == 'AL'")

subset
state county pop_20 area_sq_miles hh_inc_21 econ_type unemp_20 foreign_born_num pop_change_2010_2020 pct_emp_change_2010_2021
0 AL Autauga, AL 58877.0 594.456107 66444.0 Nonspecialized 5.4 1241.0 7.758700 9.0
1 AL Baldwin, AL 233140.0 1589.836014 65658.0 Recreation 6.2 7938.0 27.159356 28.2
... ... ... ... ... ... ... ... ... ... ...
65 AL Wilcox, AL 10518.0 887.857611 30071.0 Manufacturing 16.2 36.0 -9.168809 3.8
66 AL Winston, AL 23491.0 612.998002 47176.0 Manufacturing 5.2 408.0 -3.855579 18.1

67 rows × 10 columns

When we attempt to modify the subset, we receive a warning:

# Modify the subset
subset['unemp_20'] = subset['unemp_20'].round(0)
/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_17183/317403666.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

While we won’t delve deeply into the technical details of this warning (as it involves complex pandas internals), it’s worth noting that the warning includes a link to the pandas documentation. This documentation contains the setting we used at the beginning of our lesson.

If you ever need to reference this setting again, you can simply click the link in the warning message to access the documentation. The documentation page also provides more detailed information about this particular issue.

Also note that from Pandas 3.0 (probably to be released in 2025) this warning will be removed, as the default behavior will be to copy on write.

17.10 Wrap-Up

Transforming data is a fundamental step in any data analysis workflow. pandas makes it straightforward to create and modify variables within your DataFrames using simple and intuitive syntax.

In this lesson, you’ve learned how to:

  • Create new variables by assigning to new columns.
  • Modify existing variables.
  • Perform calculations involving multiple variables.
  • Create Boolean variables based on conditions.

Congratulations on completing this lesson!