15 Subsetting columns – Introduction to Data Science with Python

15.1 Introduction

Today we will begin our exploration of pandas for data manipulation!

Our first focus will be on selecting and renaming columns. Often your dataset comes with many columns that you do not need, and you would like to narrow it down to just a few. Pandas makes this easy. Let’s see how.

15.2 Learning objectives

You can keep or drop columns from a DataFrame using square brackets [], filter(), and drop().
You can select columns based on regex patterns with filter().
You can use rename() to change column names.
You can use regex to clean column names.

15.3 About pandas

Pandas is a popular library for data manipulation and analysis. It is designed to make it easy to work with tabular data in Python.

Install pandas with the following command in your terminal if it is not already installed:

pip install pandas

Then import pandas with the following command in your script:

import pandas as pd

15.4 The Yaounde COVID-19 dataset

In this lesson, we analyse results from a COVID-19 survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for antibodies.

You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0

To download the dataset, visit this link: https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/yaounde_data.zip

Then unzip the file and place the yaounde_data.csv file in the data folder in the same directory as your notebook.

Let’s load and examine the dataset:

yao = pd.read_csv("data/yaounde_data.csv")
yao

	id	date_surveyed	age	age_category	age_category_3	sex	highest_education	occupation	weight_kg	height_cm	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45	45 - 64	Adult	Female	Secondary	Informal worker	95	169	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	55	45 - 64	Adult	Male	University	Salaried worker	96	185	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	31	30 - 44	Adult	Female	Secondary	Unemployed	66	169	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	17	15 - 29	Child	Female	Secondary	Unemployed	67	162	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 53 columns

15.5 Selecting columns with square brackets `[]`

In pandas, the most common way to select a column is simply to use square brackets [] and the column name. For example, to select the age and sex columns, we type:

yao[["age", "sex"]]

	age	sex
0	45	Female
1	55	Male
...	...	...
969	31	Female
970	17	Female

971 rows × 2 columns

Note the double square brackets [[]]. Without it, you will get an error:

yao["age", "sex"]

KeyError: ('age', 'sex')

If you want to select a single column, you may omit the double square brackets, but your output will no longer be a DataFrame. Compare the following:

yao["age"] # does not return a DataFrame

0      45
1      55
       ..
969    31
970    17
Name: age, Length: 971, dtype: int64

yao[["age"]]  # returns a DataFrame

	age
0	45
1	55
...	...
969	31
970	17

971 rows × 1 columns

Key Point

15.6 Storing data subsets

Note that these selections are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset with only three columns:

yao_subset = yao[["age", "sex", "igg_result"]]
yao_subset

	age	sex	igg_result
0	45	Female	Negative
1	55	Male	Positive
...	...	...	...
969	31	Female	Negative
970	17	Female	Negative

971 rows × 3 columns

And if we want to overwrite a DataFrame, we can assign the subset back to the original DataFrame. Let’s overwrite the yao_subset DataFrame to have only the age column:

yao_subset = yao_subset[["age"]]
yao_subset

	age
0	45
1	55
...	...
969	31
970	17

971 rows × 1 columns

The yao_subset DataFrame has gone from having 3 columns to having 1 column.

Practice

15.6.1 Practice Q: Select Columns with `[]`

Use the [] operator to select the “weight_kg” and “height_cm” variables in the yao DataFrame. Assign the result to a new DataFrame called yao_weight_height. Then print this new DataFrame.

# Your code here

Pro tip

There are many ways to select columns in pandas. In your free time, you may choose to explore the .loc[] and .take() methods, which provide additional functionality.

15.7 Excluding columns with `drop()`

Sometimes it is more useful to drop columns you do not need than to explicitly select the ones that you do need.

To drop columns, we can use the drop() method with the columns argument. To drop the age column, we type:

yao.drop(columns=["age"])

	id	date_surveyed	age_category	age_category_3	sex	highest_education	occupation	weight_kg	height_cm	is_smoker	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45 - 64	Adult	Female	Secondary	Informal worker	95	169	Non-smoker	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	45 - 64	Adult	Male	University	Salaried worker	96	185	Ex-smoker	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	30 - 44	Adult	Female	Secondary	Unemployed	66	169	Non-smoker	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	15 - 29	Child	Female	Secondary	Unemployed	67	162	Non-smoker	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 52 columns

To drop several columns:

yao.drop(columns=["age", "sex"])

	id	date_surveyed	age_category	age_category_3	highest_education	occupation	weight_kg	height_cm	is_smoker	is_pregnant	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45 - 64	Adult	Secondary	Informal worker	95	169	Non-smoker	No	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	45 - 64	Adult	University	Salaried worker	96	185	Ex-smoker	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	30 - 44	Adult	Secondary	Unemployed	66	169	Non-smoker	No	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	15 - 29	Child	Secondary	Unemployed	67	162	Non-smoker	No response	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 51 columns

Again, note that this is not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset age and sex dropped:

yao_subset = yao.drop(columns=["age", "sex"])
yao_subset

	id	date_surveyed	age_category	age_category_3	highest_education	occupation	weight_kg	height_cm	is_smoker	is_pregnant	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45 - 64	Adult	Secondary	Informal worker	95	169	Non-smoker	No	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	45 - 64	Adult	University	Salaried worker	96	185	Ex-smoker	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	30 - 44	Adult	Secondary	Unemployed	66	169	Non-smoker	No	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	15 - 29	Child	Secondary	Unemployed	67	162	Non-smoker	No response	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 51 columns

Practice

15.7.1 Practice Q: Drop Columns with `drop()`

From the yao DataFrame, remove the columns highest_education and consultation. Assign the result to a new DataFrame called yao_no_education_consultation. Print this new DataFrame.

# Your code here

15.8 Using `filter()` to select columns by regex

The filter() method and its regex argument offer a powerful way to select columns based on patterns in their names. As an example, to select columns containing the string “ig”, we can write:

yao.filter(regex="ig")

	highest_education	weight_kg	height_cm	neighborhood	igg_result	igm_result	symp_fatigue
0	Secondary	95	169	Briqueterie	Negative	Negative	No
1	University	96	185	Briqueterie	Positive	Negative	No
...	...	...	...	...	...	...	...
969	Secondary	66	169	Tsinga Oliga	Negative	Negative	No
970	Secondary	67	162	Tsinga Oliga	Negative	Negative	No

971 rows × 7 columns

The argument regex specifies the pattern to match. Regex stands for regular expression and refers to a sequence of characters that define a search pattern.

To select columns starting with the string “ig”, we write:

yao.filter(regex="^ig")

	igg_result	igm_result
0	Negative	Negative
1	Positive	Negative
...	...	...
969	Negative	Negative
970	Negative	Negative

971 rows × 2 columns

The symbol ^ is a regex character that matches the beginning of the string.

To select columns ending with the string “result”, we can write:

yao.filter(regex="result$")

	igg_result	igm_result
0	Negative	Negative
1	Positive	Negative
...	...	...
969	Negative	Negative
970	Negative	Negative

971 rows × 2 columns

The character $ is regex that matches the end of the string.

Pro Tip

Regex is notoriously difficult to remember, but LLMs like ChatGPT are very good at generating the right patterns. Simply ask, for example, “What is the regex for strings starting with ‘ig’”

Practice

15.8.1 Practice Q: Select Columns with Regex

Select all columns in the yao DataFrame that start with “is_”. Assign the result to a new DataFrame called yao_is_columns. Then print this new DataFrame.

# Your code here

15.9 Change column names with `rename()`

We can use the rename() method to change column names:

yao.rename(columns={"age": "patient_age", "sex": "patient_sex"})

	id	date_surveyed	patient_age	age_category	age_category_3	patient_sex	highest_education	occupation	weight_kg	height_cm	...	is_drug_antibio	is_drug_hydrocortisone	is_drug_other_anti_inflam	is_drug_antiviral	is_drug_chloro	is_drug_tradn	is_drug_oxygen	is_drug_other	is_drug_no_resp	is_drug_none
0	BRIQUETERIE_000_0001	2020-10-22	45	45 - 64	Adult	Female	Secondary	Informal worker	95	169	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	BRIQUETERIE_000_0002	2020-10-24	55	45 - 64	Adult	Male	University	Salaried worker	96	185	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
969	TSINGAOLIGA_026_0002	2020-11-11	31	30 - 44	Adult	Female	Secondary	Unemployed	66	169	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
970	TSINGAOLIGA_026_0003	2020-11-11	17	15 - 29	Child	Female	Secondary	Unemployed	67	162	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

971 rows × 53 columns

Practice

15.9.1 Practice Q: Rename Columns with `rename()`

Rename the age_category column in the yao DataFrame to age_cat. Assign the result to a new DataFrame called yao_age_cat. Then print this new DataFrame.

# Your code here

15.10 Cleaning messy column names

For cleaning column names, you can use regular expressions with the str.replace() method in pandas.

Here’s how you can do it on a test DataFrame with messy column names. Messy column names are names with spaces, special characters, or other non-alphanumeric characters.

test_df = pd.DataFrame(
    {"good_name": range(3), "bad name": range(3), "bad*@name*2": range(3)}
)
test_df

	good_name	bad name	bad@name2
0	0	0	0
1	1	1	1
2	2	2	2

Such column names are not ideal because, for example, we cannot select them with the dot operator the way we can for clean names:

test_df.good_name  # this works

0    0
1    1
2    2
Name: good_name, dtype: int64

But this does not work:

test_df.bad name

      test_df.bad name
                 ^
SyntaxError: invalid syntax

We can automatically clean such names using the str.replace() method along with regular expressions.

clean_names = test_df.columns.str.replace(r'[^a-zA-Z0-9]', '_', regex=True)

The regular expression r'[^a-zA-Z0-9]' matches any character that is not a letter (either uppercase or lowercase) or a digit. The str.replace() method replaces these characters with an underscore (‘_’) to make the column names more legible and usable in dot notation.

Now we can replace the column names in the DataFrame with the cleaned names:

test_df.columns = clean_names
test_df

	good_name	bad_name	bad__name_2
0	0	0	0
1	1	1	1
2	2	2	2

Practice

15.10.1 Practice Q: Clean Column Names with Regex

Consider the data frame defined below with messy column names. Use the str.replace() method to clean the column names.

cleaning_practice = pd.DataFrame(
    {"Aloha": range(3), "Bell Chart": range(3), "Animals@the zoo": range(3)}
)
cleaning_practice

	Aloha	Bell Chart	Animals@the zoo
0	0	0	0
1	1	1	1
2	2	2	2

15.11 Wrap up

Hopefully this lesson has shown you how intuitive and useful pandas is for data manipulation!

This is the first of a series of basic data wrangling techniques: see you in the next lesson to learn more.

15.1 Introduction

15.2 Learning objectives

15.3 About pandas

15.4 The Yaounde COVID-19 dataset

15.5 Selecting columns with square brackets []

15.6 Storing data subsets

15.6.1 Practice Q: Select Columns with []

15.7 Excluding columns with drop()

15.7.1 Practice Q: Drop Columns with drop()

15.8 Using filter() to select columns by regex

15.8.1 Practice Q: Select Columns with Regex

15.9 Change column names with rename()

15.9.1 Practice Q: Rename Columns with rename()

15.10 Cleaning messy column names

15.10.1 Practice Q: Clean Column Names with Regex

15.11 Wrap up

15.5 Selecting columns with square brackets `[]`

15.6.1 Practice Q: Select Columns with `[]`

15.7 Excluding columns with `drop()`

15.7.1 Practice Q: Drop Columns with `drop()`

15.8 Using `filter()` to select columns by regex

15.9 Change column names with `rename()`

15.9.1 Practice Q: Rename Columns with `rename()`