17 Transforming Variables in pandas

17.1 Introduction

In data analysis, one of the most common tasks is transforming variables within your dataset. The pandas library provides straightforward and efficient ways to accomplish this.

17.2 Learning Objectives

Understand how to create new variables in a DataFrame.
Learn how to modify existing variables.
Handle potential issues with modifying variables on views.

17.3 Imports

First, let’s import the pandas package:

import pandas as pd

Now we’ll set an important option that will help us avoid some warnings down the line. Later in the lesson, we’ll discuss this in more detail.

pd.options.mode.copy_on_write = True

17.4 Dataset

In this lesson, we’ll use a dataset of United States counties with demographic and economic data. You can download the dataset from this link:https://github.com/the-graph-courses/idap_book/raw/refs/heads/main/data/us_counties_data.zip.

Once you’ve downloaded the file, unzip it and place the us_counties_data.csv file in the data folder for your project.

counties = pd.read_csv("data/us_counties_data.csv")
counties

	state	county	pop_20	area_sq_miles	hh_inc_21	econ_type	unemp_20	foreign_born_num	pop_change_2010_2020	pct_emp_change_2010_2021
0	AL	Autauga, AL	58877.0	594.456107	66444.0	Nonspecialized	5.4	1241.0	7.758700	9.0
1	AL	Baldwin, AL	233140.0	1589.836014	65658.0	Recreation	6.2	7938.0	27.159356	28.2
...	...	...	...	...	...	...	...	...	...	...
3224	PR	Yabucoa, PR	30364.0	55.214614	NaN	NaN	NaN	NaN	-19.807069	0.1
3225	PR	Yauco, PR	34062.0	67.711484	NaN	NaN	NaN	NaN	-18.721309	-5.3

3226 rows × 10 columns

The variables in the dataset are:

state: US state
county: US county
pop_20: Population estimate for 2020
area_sq_miles: Area in square miles
hh_inc_21: Median household income for 2021
econ_type: Economic type of the county
pop_change_2010_2020: Population change between 2010 and 2020 (%)
unemp_20: Unemployment rate for 2020 (%)
pct_emp_change_2010_2021: Percentage change in employment between 2010 and 2021 (%)
foreign_born_num: Number of foreign-born residents

Let’s create a small subset of the dataset with just the area and population columns for illustration.

# Small subset for illustration
area_df = counties[["county", "area_sq_miles", "pop_20"]]
area_df

	county	area_sq_miles	pop_20
0	Autauga, AL	594.456107	58877.0
1	Baldwin, AL	1589.836014	233140.0
...	...	...	...
3224	Yabucoa, PR	55.214614	30364.0
3225	Yauco, PR	67.711484	34062.0

3226 rows × 3 columns

17.5 Creating New Variables

Suppose we want to convert the area from square miles to square kilometers. Since 1 square mile is approximately 2.59 square kilometers, we can create a new variable area_sq_km by multiplying the area_sq_miles column by 2.59.

area_df["area_sq_km"] = area_df["area_sq_miles"] * 2.59
area_df

	county	area_sq_miles	pop_20	area_sq_km
0	Autauga, AL	594.456107	58877.0	1539.641317
1	Baldwin, AL	1589.836014	233140.0	4117.675277
...	...	...	...	...
3224	Yabucoa, PR	55.214614	30364.0	143.005851
3225	Yauco, PR	67.711484	34062.0	175.372743

3226 rows × 4 columns

The syntax is very easy to understand, although a bit hard to type.

With area_df["area_sq_km"], we’re indicating that we want to create a new column called area_sq_km, then area_df["area_sq_miles"] * 2.59 is the expression that computes the values for this new column.

Let’s add another variable, this time in hectares. The conversion factor is 1 square mile = 259 hectares.

# Convert area to hectares as well
area_df["area_hectares"] = area_df["area_sq_miles"] * 259
area_df

	county	area_sq_miles	pop_20	area_sq_km	area_hectares
0	Autauga, AL	594.456107	58877.0	1539.641317	153964.131747
1	Baldwin, AL	1589.836014	233140.0	4117.675277	411767.527703
...	...	...	...	...	...
3224	Yabucoa, PR	55.214614	30364.0	143.005851	14300.585058
3225	Yauco, PR	67.711484	34062.0	175.372743	17537.274254

3226 rows × 5 columns

Practice

17.5.1 Practice Q: Area in Acres

Using the area_df dataset, create a new column called area_acres by multiplying the area_sq_miles variable by 640. Store the result back into area_df and display the DataFrame.

# Your code here

17.6 Modifying Existing Variables

Suppose we want to round the area_sq_km variable to one decimal place. We can call the round method on the area_sq_km column.

area_df["area_sq_km"] = area_df["area_sq_km"].round(1)
area_df

	county	area_sq_miles	pop_20	area_sq_km	area_hectares
0	Autauga, AL	594.456107	58877.0	1539.6	153964.131747
1	Baldwin, AL	1589.836014	233140.0	4117.7	411767.527703
...	...	...	...	...	...
3224	Yabucoa, PR	55.214614	30364.0	143.0	14300.585058
3225	Yauco, PR	67.711484	34062.0	175.4	17537.274254

3226 rows × 5 columns

Practice

17.6.1 Practice Q: Rounding `area_acres`

Using the area_df dataset, round the area_acres variable to one decimal place. Update the DataFrame area_df and display it.

# Your code here

17.7 Calculations with Multiple Variables

We can create new variables based on multiple existing variables.

For example, let’s calculate the population density per square kilometer.

area_df["pop_per_sq_km"] = area_df["pop_20"] / area_df["area_sq_km"]
area_df

	county	area_sq_miles	pop_20	area_sq_km	area_hectares	pop_per_sq_km
0	Autauga, AL	594.456107	58877.0	1539.6	153964.131747	38.241751
1	Baldwin, AL	1589.836014	233140.0	4117.7	411767.527703	56.618986
...	...	...	...	...	...	...
3224	Yabucoa, PR	55.214614	30364.0	143.0	14300.585058	212.335664
3225	Yauco, PR	67.711484	34062.0	175.4	17537.274254	194.196123

3226 rows × 6 columns

We could tag on the round method to this output to round the result to one decimal place.

area_df["pop_per_sq_km"] = (area_df["pop_20"] / area_df["area_sq_km"]).round(1)
area_df

	county	area_sq_miles	pop_20	area_sq_km	area_hectares	pop_per_sq_km
0	Autauga, AL	594.456107	58877.0	1539.6	153964.131747	38.2
1	Baldwin, AL	1589.836014	233140.0	4117.7	411767.527703	56.6
...	...	...	...	...	...	...
3224	Yabucoa, PR	55.214614	30364.0	143.0	14300.585058	212.3
3225	Yauco, PR	67.711484	34062.0	175.4	17537.274254	194.2

3226 rows × 6 columns

Or, if you prefer, you can do this in two steps:

area_df["pop_per_sq_km"] = area_df["pop_20"] / area_df["area_sq_km"]
area_df["pop_per_sq_km"] = area_df["pop_per_sq_km"].round(1)
area_df

	county	area_sq_miles	pop_20	area_sq_km	area_hectares	pop_per_sq_km
0	Autauga, AL	594.456107	58877.0	1539.6	153964.131747	38.2
1	Baldwin, AL	1589.836014	233140.0	4117.7	411767.527703	56.6
...	...	...	...	...	...	...
3224	Yabucoa, PR	55.214614	30364.0	143.0	14300.585058	212.3
3225	Yauco, PR	67.711484	34062.0	175.4	17537.274254	194.2

3226 rows × 6 columns

After calculating the population density, we might want to sort the DataFrame based on this new variable. Let’s sort in descending order.

# Sort by population density in descending order
area_df = area_df.sort_values("pop_per_sq_km", ascending=False)
area_df

	county	area_sq_miles	pop_20	area_sq_km	area_hectares	pop_per_sq_km
1863	New York, NY	22.656266	1687834.0	58.7	5867.972888	28753.6
1856	Kings, NY	69.376570	2727393.0	179.7	17968.531752	15177.5
...	...	...	...	...	...	...
98	Wrangell-Petersburg, AK	NaN	NaN	NaN	NaN	NaN
2921	Bedford, VA	NaN	NaN	NaN	NaN	NaN

3226 rows × 6 columns

We see that New York County has the highest population density in the dataset.

Practice

17.7.1 Practice Q: Calculate Foreign-Born Percentage

Use the counties dataset to calculate the percentage of foreign-born residents in each county. The variable foreign_born_num shows the number of foreign-born residents and pop_20 shows the total population. Sort the DataFrame in descending order of the percentage of foreign-born residents. Which two counties have the highest percentage of foreign-born residents?

# Your code here

17.8 Creating Boolean Variables

It is sometimes useful to create Boolean variables to categorize or flag data based on conditions. Boolean variables are variables that take on only two values: True or False.

Consider the pop_change_2010_2020 variable in the counties dataset, which shows the percentage change in population between 2010 and 2020.

changes_df = counties[["county", "pop_change_2010_2020", "pct_emp_change_2010_2021"]]
changes_df

	county	pop_change_2010_2020	pct_emp_change_2010_2021
0	Autauga, AL	7.758700	9.0
1	Baldwin, AL	27.159356	28.2
...	...	...	...
3224	Yabucoa, PR	-19.807069	0.1
3225	Yauco, PR	-18.721309	-5.3

3226 rows × 3 columns

We might want to create a Boolean variable to flag whether the population increased. For this, let’s set the pop_increase variable to True if the population increased and False otherwise.

Running the expression changes_df["pop_change_2010_2020"] > 0 returns a Series of Boolean values:

changes_df["pop_change_2010_2020"] > 0

0        True
1        True
        ...  
3224    False
3225    False
Name: pop_change_2010_2020, Length: 3226, dtype: bool

We can assign this Series of Boolean values to the pop_increase variable.

changes_df["pop_increase"] = changes_df["pop_change_2010_2020"] > 0
changes_df

	county	pop_change_2010_2020	pct_emp_change_2010_2021	pop_increase
0	Autauga, AL	7.758700	9.0	True
1	Baldwin, AL	27.159356	28.2	True
...	...	...	...	...
3224	Yabucoa, PR	-19.807069	0.1	False
3225	Yauco, PR	-18.721309	-5.3	False

3226 rows × 4 columns

Similarly, we can create a Boolean variable emp_increase for employment change.

changes_df["emp_increase"] = changes_df["pct_emp_change_2010_2021"] > 0
changes_df

	county	pop_change_2010_2020	pct_emp_change_2010_2021	pop_increase	emp_increase
0	Autauga, AL	7.758700	9.0	True	True
1	Baldwin, AL	27.159356	28.2	True	True
...	...	...	...	...	...
3224	Yabucoa, PR	-19.807069	0.1	False	True
3225	Yauco, PR	-18.721309	-5.3	False	False

3226 rows × 5 columns

We can now filter the DataFrame to find counties where the population increased but employment decreased.

# Counties where population increased but employment decreased
pop_up_emp_down = changes_df.query("pop_increase == True & emp_increase == False")
pop_up_emp_down

	county	pop_change_2010_2020	pct_emp_change_2010_2021	pop_increase	emp_increase
71	Bethel, AK	9.716099	-0.7	True	False
75	Dillingham, AK	0.206313	-16.1	True	False
...	...	...	...	...	...
3127	Campbell, WY	1.935708	-14.8	True	False
3137	Natrona, WY	5.970842	-0.2	True	False

242 rows × 5 columns

You could also write this in shorthand like so:

# Counties where population increased but employment decreased
pop_up_emp_down = changes_df.query("pop_increase & ~(emp_increase)")
pop_up_emp_down

	county	pop_change_2010_2020	pct_emp_change_2010_2021	pop_increase	emp_increase
71	Bethel, AK	9.716099	-0.7	True	False
75	Dillingham, AK	0.206313	-16.1	True	False
...	...	...	...	...	...
3127	Campbell, WY	1.935708	-14.8	True	False
3137	Natrona, WY	5.970842	-0.2	True	False

242 rows × 5 columns

There are several such counties, which might be of interest for further analysis.

Practice

17.8.1 Practice Q: Categorize Counties by Foreign-Born Population

In a previous practice question, we calculated the percentage of foreign-born residents in each county. Now, create a Boolean variable foreign_born_pct_gt_30 that is True if the percentage is greater than 30%.

When you’re done, query the DataFrame to show only counties where foreign_born_pct_gt_30 is True. You should get 24 rows.

# Your code here

17.9 The Copy-on-Write Warning

Earlier in this lesson, we enabled “copy-on-write” mode. Let’s see what happens when this feature is disabled.

pd.set_option("mode.copy_on_write", False)

# Create a small subset of our data
subset = counties.query("state == 'AL'")

subset

	state	county	pop_20	area_sq_miles	hh_inc_21	econ_type	unemp_20	foreign_born_num	pop_change_2010_2020	pct_emp_change_2010_2021
0	AL	Autauga, AL	58877.0	594.456107	66444.0	Nonspecialized	5.4	1241.0	7.758700	9.0
1	AL	Baldwin, AL	233140.0	1589.836014	65658.0	Recreation	6.2	7938.0	27.159356	28.2
...	...	...	...	...	...	...	...	...	...	...
65	AL	Wilcox, AL	10518.0	887.857611	30071.0	Manufacturing	16.2	36.0	-9.168809	3.8
66	AL	Winston, AL	23491.0	612.998002	47176.0	Manufacturing	5.2	408.0	-3.855579	18.1

67 rows × 10 columns

When we attempt to modify the subset, we receive a warning:

# Modify the subset
subset['unemp_20'] = subset['unemp_20'].round(0)

/var/folders/vr/shb6ffvj2rl61kh7qqczhrgh0000gp/T/ipykernel_40756/317403666.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

While we won’t delve deeply into the technical details of this warning (as it involves complex pandas internals), it’s worth noting that the warning includes a link to the pandas documentation. This documentation contains the setting we used at the beginning of our lesson.

If you ever need to reference this setting again, you can simply click the link in the warning message to access the documentation. The documentation page also provides more detailed information about this particular issue.

Also note that from Pandas 3.0 (probably to be released in 2025) this warning will be removed, as the default behavior will be to copy on write.

17.10 Wrap-Up

Transforming data is a fundamental step in any data analysis workflow. pandas makes it straightforward to create and modify variables within your DataFrames using simple and intuitive syntax.

In this lesson, you’ve learned how to:

Create new variables by assigning to new columns.
Modify existing variables.
Perform calculations involving multiple variables.
Create Boolean variables based on conditions.

Congratulations on completing this lesson!

17.1 Introduction

17.2 Learning Objectives

17.3 Imports

17.4 Dataset

17.5 Creating New Variables

17.5.1 Practice Q: Area in Acres

17.6 Modifying Existing Variables

17.6.1 Practice Q: Rounding area_acres

17.7 Calculations with Multiple Variables

17.7.1 Practice Q: Calculate Foreign-Born Percentage

17.8 Creating Boolean Variables

17.8.1 Practice Q: Categorize Counties by Foreign-Born Population

17.9 The Copy-on-Write Warning

17.10 Wrap-Up

17.6.1 Practice Q: Rounding `area_acres`