import pandas as pd
17 Transforming Variables in pandas
17.1 Introduction
In data analysis, one of the most common tasks is transforming variables within your dataset. The pandas library provides straightforward and efficient ways to accomplish this.
17.2 Learning Objectives
- Understand how to create new variables in a DataFrame.
- Learn how to modify existing variables.
- Handle potential issues with modifying variables on views.
17.3 Imports
First, let’s import the pandas package:
Now we’ll set an important option that will help us avoid some warnings down the line. Later in the lesson, we’ll discuss this in more detail.
= True pd.options.mode.copy_on_write
17.4 Dataset
In this lesson, we’ll use a dataset of United States counties with demographic and economic data. You can download the dataset from this link:https://github.com/the-graph-courses/idap_book/raw/refs/heads/main/data/us_counties_data.zip.
Once you’ve downloaded the file, unzip it and place the us_counties_data.csv
file in the data
folder for your project.
= pd.read_csv("data/us_counties_data.csv")
counties counties
state | county | pop_20 | area_sq_miles | hh_inc_21 | econ_type | unemp_20 | foreign_born_num | pop_change_2010_2020 | pct_emp_change_2010_2021 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | AL | Autauga, AL | 58877.0 | 594.456107 | 66444.0 | Nonspecialized | 5.4 | 1241.0 | 7.758700 | 9.0 |
1 | AL | Baldwin, AL | 233140.0 | 1589.836014 | 65658.0 | Recreation | 6.2 | 7938.0 | 27.159356 | 28.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3224 | PR | Yabucoa, PR | 30364.0 | 55.214614 | NaN | NaN | NaN | NaN | -19.807069 | 0.1 |
3225 | PR | Yauco, PR | 34062.0 | 67.711484 | NaN | NaN | NaN | NaN | -18.721309 | -5.3 |
3226 rows × 10 columns
The variables in the dataset are:
state
: US statecounty
: US countypop_20
: Population estimate for 2020area_sq_miles
: Area in square mileshh_inc_21
: Median household income for 2021econ_type
: Economic type of the countypop_change_2010_2020
: Population change between 2010 and 2020 (%)unemp_20
: Unemployment rate for 2020 (%)pct_emp_change_2010_2021
: Percentage change in employment between 2010 and 2021 (%)foreign_born_num
: Number of foreign-born residents
Let’s create a small subset of the dataset with just the area and population columns for illustration.
# Small subset for illustration
= counties[["county", "area_sq_miles", "pop_20"]]
area_df area_df
county | area_sq_miles | pop_20 | |
---|---|---|---|
0 | Autauga, AL | 594.456107 | 58877.0 |
1 | Baldwin, AL | 1589.836014 | 233140.0 |
... | ... | ... | ... |
3224 | Yabucoa, PR | 55.214614 | 30364.0 |
3225 | Yauco, PR | 67.711484 | 34062.0 |
3226 rows × 3 columns
17.5 Creating New Variables
Suppose we want to convert the area from square miles to square kilometers. Since 1 square mile is approximately 2.59 square kilometers, we can create a new variable area_sq_km
by multiplying the area_sq_miles
column by 2.59.
"area_sq_km"] = area_df["area_sq_miles"] * 2.59
area_df[ area_df
county | area_sq_miles | pop_20 | area_sq_km | |
---|---|---|---|---|
0 | Autauga, AL | 594.456107 | 58877.0 | 1539.641317 |
1 | Baldwin, AL | 1589.836014 | 233140.0 | 4117.675277 |
... | ... | ... | ... | ... |
3224 | Yabucoa, PR | 55.214614 | 30364.0 | 143.005851 |
3225 | Yauco, PR | 67.711484 | 34062.0 | 175.372743 |
3226 rows × 4 columns
The syntax is very easy to understand, although a bit hard to type.
With area_df["area_sq_km"]
, we’re indicating that we want to create a new column called area_sq_km
, then area_df["area_sq_miles"] * 2.59
is the expression that computes the values for this new column.
Let’s add another variable, this time in hectares. The conversion factor is 1 square mile = 259 hectares.
# Convert area to hectares as well
"area_hectares"] = area_df["area_sq_miles"] * 259
area_df[ area_df
county | area_sq_miles | pop_20 | area_sq_km | area_hectares | |
---|---|---|---|---|---|
0 | Autauga, AL | 594.456107 | 58877.0 | 1539.641317 | 153964.131747 |
1 | Baldwin, AL | 1589.836014 | 233140.0 | 4117.675277 | 411767.527703 |
... | ... | ... | ... | ... | ... |
3224 | Yabucoa, PR | 55.214614 | 30364.0 | 143.005851 | 14300.585058 |
3225 | Yauco, PR | 67.711484 | 34062.0 | 175.372743 | 17537.274254 |
3226 rows × 5 columns
17.5.1 Practice Q: Area in Acres
Using the area_df
dataset, create a new column called area_acres
by multiplying the area_sq_miles
variable by 640. Store the result back into area_df
and display the DataFrame.
# Your code here
17.6 Modifying Existing Variables
Suppose we want to round the area_sq_km
variable to one decimal place. We can call the round
method on the area_sq_km
column.
"area_sq_km"] = area_df["area_sq_km"].round(1)
area_df[ area_df
county | area_sq_miles | pop_20 | area_sq_km | area_hectares | |
---|---|---|---|---|---|
0 | Autauga, AL | 594.456107 | 58877.0 | 1539.6 | 153964.131747 |
1 | Baldwin, AL | 1589.836014 | 233140.0 | 4117.7 | 411767.527703 |
... | ... | ... | ... | ... | ... |
3224 | Yabucoa, PR | 55.214614 | 30364.0 | 143.0 | 14300.585058 |
3225 | Yauco, PR | 67.711484 | 34062.0 | 175.4 | 17537.274254 |
3226 rows × 5 columns
17.6.1 Practice Q: Rounding area_acres
Using the area_df
dataset, round the area_acres
variable to one decimal place. Update the DataFrame area_df
and display it.
# Your code here
17.7 Calculations with Multiple Variables
We can create new variables based on multiple existing variables.
For example, let’s calculate the population density per square kilometer.
"pop_per_sq_km"] = area_df["pop_20"] / area_df["area_sq_km"]
area_df[ area_df
county | area_sq_miles | pop_20 | area_sq_km | area_hectares | pop_per_sq_km | |
---|---|---|---|---|---|---|
0 | Autauga, AL | 594.456107 | 58877.0 | 1539.6 | 153964.131747 | 38.241751 |
1 | Baldwin, AL | 1589.836014 | 233140.0 | 4117.7 | 411767.527703 | 56.618986 |
... | ... | ... | ... | ... | ... | ... |
3224 | Yabucoa, PR | 55.214614 | 30364.0 | 143.0 | 14300.585058 | 212.335664 |
3225 | Yauco, PR | 67.711484 | 34062.0 | 175.4 | 17537.274254 | 194.196123 |
3226 rows × 6 columns
We could tag on the round method to this output to round the result to one decimal place.
"pop_per_sq_km"] = (area_df["pop_20"] / area_df["area_sq_km"]).round(1)
area_df[ area_df
county | area_sq_miles | pop_20 | area_sq_km | area_hectares | pop_per_sq_km | |
---|---|---|---|---|---|---|
0 | Autauga, AL | 594.456107 | 58877.0 | 1539.6 | 153964.131747 | 38.2 |
1 | Baldwin, AL | 1589.836014 | 233140.0 | 4117.7 | 411767.527703 | 56.6 |
... | ... | ... | ... | ... | ... | ... |
3224 | Yabucoa, PR | 55.214614 | 30364.0 | 143.0 | 14300.585058 | 212.3 |
3225 | Yauco, PR | 67.711484 | 34062.0 | 175.4 | 17537.274254 | 194.2 |
3226 rows × 6 columns
Or, if you prefer, you can do this in two steps:
"pop_per_sq_km"] = area_df["pop_20"] / area_df["area_sq_km"]
area_df["pop_per_sq_km"] = area_df["pop_per_sq_km"].round(1)
area_df[ area_df
county | area_sq_miles | pop_20 | area_sq_km | area_hectares | pop_per_sq_km | |
---|---|---|---|---|---|---|
0 | Autauga, AL | 594.456107 | 58877.0 | 1539.6 | 153964.131747 | 38.2 |
1 | Baldwin, AL | 1589.836014 | 233140.0 | 4117.7 | 411767.527703 | 56.6 |
... | ... | ... | ... | ... | ... | ... |
3224 | Yabucoa, PR | 55.214614 | 30364.0 | 143.0 | 14300.585058 | 212.3 |
3225 | Yauco, PR | 67.711484 | 34062.0 | 175.4 | 17537.274254 | 194.2 |
3226 rows × 6 columns
After calculating the population density, we might want to sort the DataFrame based on this new variable. Let’s sort in descending order.
# Sort by population density in descending order
= area_df.sort_values("pop_per_sq_km", ascending=False)
area_df area_df
county | area_sq_miles | pop_20 | area_sq_km | area_hectares | pop_per_sq_km | |
---|---|---|---|---|---|---|
1863 | New York, NY | 22.656266 | 1687834.0 | 58.7 | 5867.972888 | 28753.6 |
1856 | Kings, NY | 69.376570 | 2727393.0 | 179.7 | 17968.531752 | 15177.5 |
... | ... | ... | ... | ... | ... | ... |
98 | Wrangell-Petersburg, AK | NaN | NaN | NaN | NaN | NaN |
2921 | Bedford, VA | NaN | NaN | NaN | NaN | NaN |
3226 rows × 6 columns
We see that New York County has the highest population density in the dataset.
17.7.1 Practice Q: Calculate Foreign-Born Percentage
Use the counties
dataset to calculate the percentage of foreign-born residents in each county. The variable foreign_born_num
shows the number of foreign-born residents and pop_20
shows the total population. Sort the DataFrame in descending order of the percentage of foreign-born residents. Which two counties have the highest percentage of foreign-born residents?
# Your code here
17.8 Creating Boolean Variables
It is sometimes useful to create Boolean variables to categorize or flag data based on conditions. Boolean variables are variables that take on only two values: True
or False
.
Consider the pop_change_2010_2020
variable in the counties
dataset, which shows the percentage change in population between 2010 and 2020.
= counties[["county", "pop_change_2010_2020", "pct_emp_change_2010_2021"]]
changes_df changes_df
county | pop_change_2010_2020 | pct_emp_change_2010_2021 | |
---|---|---|---|
0 | Autauga, AL | 7.758700 | 9.0 |
1 | Baldwin, AL | 27.159356 | 28.2 |
... | ... | ... | ... |
3224 | Yabucoa, PR | -19.807069 | 0.1 |
3225 | Yauco, PR | -18.721309 | -5.3 |
3226 rows × 3 columns
We might want to create a Boolean variable to flag whether the population increased. For this, let’s set the pop_increase
variable to True
if the population increased and False
otherwise.
Running the expression changes_df["pop_change_2010_2020"] > 0
returns a Series of Boolean values:
"pop_change_2010_2020"] > 0 changes_df[
0 True
1 True
...
3224 False
3225 False
Name: pop_change_2010_2020, Length: 3226, dtype: bool
We can assign this Series of Boolean values to the pop_increase
variable.
"pop_increase"] = changes_df["pop_change_2010_2020"] > 0
changes_df[ changes_df
county | pop_change_2010_2020 | pct_emp_change_2010_2021 | pop_increase | |
---|---|---|---|---|
0 | Autauga, AL | 7.758700 | 9.0 | True |
1 | Baldwin, AL | 27.159356 | 28.2 | True |
... | ... | ... | ... | ... |
3224 | Yabucoa, PR | -19.807069 | 0.1 | False |
3225 | Yauco, PR | -18.721309 | -5.3 | False |
3226 rows × 4 columns
Similarly, we can create a Boolean variable emp_increase
for employment change.
"emp_increase"] = changes_df["pct_emp_change_2010_2021"] > 0
changes_df[ changes_df
county | pop_change_2010_2020 | pct_emp_change_2010_2021 | pop_increase | emp_increase | |
---|---|---|---|---|---|
0 | Autauga, AL | 7.758700 | 9.0 | True | True |
1 | Baldwin, AL | 27.159356 | 28.2 | True | True |
... | ... | ... | ... | ... | ... |
3224 | Yabucoa, PR | -19.807069 | 0.1 | False | True |
3225 | Yauco, PR | -18.721309 | -5.3 | False | False |
3226 rows × 5 columns
We can now filter the DataFrame to find counties where the population increased but employment decreased.
# Counties where population increased but employment decreased
= changes_df.query("pop_increase == True & emp_increase == False")
pop_up_emp_down pop_up_emp_down
county | pop_change_2010_2020 | pct_emp_change_2010_2021 | pop_increase | emp_increase | |
---|---|---|---|---|---|
71 | Bethel, AK | 9.716099 | -0.7 | True | False |
75 | Dillingham, AK | 0.206313 | -16.1 | True | False |
... | ... | ... | ... | ... | ... |
3127 | Campbell, WY | 1.935708 | -14.8 | True | False |
3137 | Natrona, WY | 5.970842 | -0.2 | True | False |
242 rows × 5 columns
You could also write this in shorthand like so:
# Counties where population increased but employment decreased
= changes_df.query("pop_increase & ~(emp_increase)")
pop_up_emp_down pop_up_emp_down
county | pop_change_2010_2020 | pct_emp_change_2010_2021 | pop_increase | emp_increase | |
---|---|---|---|---|---|
71 | Bethel, AK | 9.716099 | -0.7 | True | False |
75 | Dillingham, AK | 0.206313 | -16.1 | True | False |
... | ... | ... | ... | ... | ... |
3127 | Campbell, WY | 1.935708 | -14.8 | True | False |
3137 | Natrona, WY | 5.970842 | -0.2 | True | False |
242 rows × 5 columns
There are several such counties, which might be of interest for further analysis.
17.8.1 Practice Q: Categorize Counties by Foreign-Born Population
In a previous practice question, we calculated the percentage of foreign-born residents in each county. Now, create a Boolean variable foreign_born_pct_gt_30
that is True
if the percentage is greater than 30%.
When you’re done, query the DataFrame to show only counties where foreign_born_pct_gt_30
is True
. You should get 24 rows.
# Your code here
17.9 The Copy-on-Write Warning
Earlier in this lesson, we enabled “copy-on-write” mode. Let’s see what happens when this feature is disabled.
"mode.copy_on_write", False)
pd.set_option(
# Create a small subset of our data
= counties.query("state == 'AL'")
subset
subset
state | county | pop_20 | area_sq_miles | hh_inc_21 | econ_type | unemp_20 | foreign_born_num | pop_change_2010_2020 | pct_emp_change_2010_2021 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | AL | Autauga, AL | 58877.0 | 594.456107 | 66444.0 | Nonspecialized | 5.4 | 1241.0 | 7.758700 | 9.0 |
1 | AL | Baldwin, AL | 233140.0 | 1589.836014 | 65658.0 | Recreation | 6.2 | 7938.0 | 27.159356 | 28.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
65 | AL | Wilcox, AL | 10518.0 | 887.857611 | 30071.0 | Manufacturing | 16.2 | 36.0 | -9.168809 | 3.8 |
66 | AL | Winston, AL | 23491.0 | 612.998002 | 47176.0 | Manufacturing | 5.2 | 408.0 | -3.855579 | 18.1 |
67 rows × 10 columns
When we attempt to modify the subset, we receive a warning:
# Modify the subset
'unemp_20'] = subset['unemp_20'].round(0) subset[
/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_17183/317403666.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
While we won’t delve deeply into the technical details of this warning (as it involves complex pandas internals), it’s worth noting that the warning includes a link to the pandas documentation. This documentation contains the setting we used at the beginning of our lesson.
If you ever need to reference this setting again, you can simply click the link in the warning message to access the documentation. The documentation page also provides more detailed information about this particular issue.
Also note that from Pandas 3.0 (probably to be released in 2025) this warning will be removed, as the default behavior will be to copy on write.
17.10 Wrap-Up
Transforming data is a fundamental step in any data analysis workflow. pandas makes it straightforward to create and modify variables within your DataFrames using simple and intuitive syntax.
In this lesson, you’ve learned how to:
- Create new variables by assigning to new columns.
- Modify existing variables.
- Perform calculations involving multiple variables.
- Create Boolean variables based on conditions.
Congratulations on completing this lesson!