15  Subsetting columns

15.1 Introduction

Today we will begin our exploration of pandas for data manipulation!

Our first focus will be on selecting and renaming columns. Often your dataset comes with many columns that you do not need, and you would like to narrow it down to just a few. Pandas makes this easy. Let’s see how.

15.2 Learning objectives

  • You can keep or drop columns from a DataFrame using square brackets [], filter(), and drop().
  • You can select columns based on regex patterns with filter().
  • You can use rename() to change column names.
  • You can use regex to clean column names.

15.3 About pandas

Pandas is a popular library for data manipulation and analysis. It is designed to make it easy to work with tabular data in Python.

Install pandas with the following command in your terminal if it is not already installed:

pip install pandas 

Then import pandas with the following command in your script:

import pandas as pd

15.4 The Yaounde COVID-19 dataset

In this lesson, we analyse results from a COVID-19 survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for antibodies.

You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0

To download the dataset, visit this link: https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/yaounde_data.zip

Then unzip the file and place the yaounde_data.csv file in the data folder in the same directory as your notebook.

Let’s load and examine the dataset:

yao = pd.read_csv("data/yaounde_data.csv")
yao
id date_surveyed age age_category age_category_3 sex highest_education occupation weight_kg height_cm ... is_drug_antibio is_drug_hydrocortisone is_drug_other_anti_inflam is_drug_antiviral is_drug_chloro is_drug_tradn is_drug_oxygen is_drug_other is_drug_no_resp is_drug_none
0 BRIQUETERIE_000_0001 2020-10-22 45 45 - 64 Adult Female Secondary Informal worker 95 169 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 BRIQUETERIE_000_0002 2020-10-24 55 45 - 64 Adult Male University Salaried worker 96 185 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
969 TSINGAOLIGA_026_0002 2020-11-11 31 30 - 44 Adult Female Secondary Unemployed 66 169 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
970 TSINGAOLIGA_026_0003 2020-11-11 17 15 - 29 Child Female Secondary Unemployed 67 162 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

971 rows × 53 columns

15.5 Selecting columns with square brackets []

In pandas, the most common way to select a column is simply to use square brackets [] and the column name. For example, to select the age and sex columns, we type:

yao[["age", "sex"]]
age sex
0 45 Female
1 55 Male
... ... ...
969 31 Female
970 17 Female

971 rows × 2 columns

Note the double square brackets [[]]. Without it, you will get an error:

yao["age", "sex"]
KeyError: ('age', 'sex')

If you want to select a single column, you may omit the double square brackets, but your output will no longer be a DataFrame. Compare the following:

yao["age"] # does not return a DataFrame
0      45
1      55
       ..
969    31
970    17
Name: age, Length: 971, dtype: int64
yao[["age"]]  # returns a DataFrame
age
0 45
1 55
... ...
969 31
970 17

971 rows × 1 columns

Key Point

15.6 Storing data subsets

Note that these selections are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset with only three columns:

yao_subset = yao[["age", "sex", "igg_result"]]
yao_subset
age sex igg_result
0 45 Female Negative
1 55 Male Positive
... ... ... ...
969 31 Female Negative
970 17 Female Negative

971 rows × 3 columns

And if we want to overwrite a DataFrame, we can assign the subset back to the original DataFrame. Let’s overwrite the yao_subset DataFrame to have only the age column:

yao_subset = yao_subset[["age"]]
yao_subset
age
0 45
1 55
... ...
969 31
970 17

971 rows × 1 columns

The yao_subset DataFrame has gone from having 3 columns to having 1 column.

Practice

15.6.1 Practice Q: Select Columns with []

  • Use the [] operator to select the “weight_kg” and “height_cm” variables in the yao DataFrame. Assign the result to a new DataFrame called yao_weight_height. Then print this new DataFrame.
# Your code here
Pro tip

There are many ways to select columns in pandas. In your free time, you may choose to explore the .loc[] and .take() methods, which provide additional functionality.

15.7 Excluding columns with drop()

Sometimes it is more useful to drop columns you do not need than to explicitly select the ones that you do need.

To drop columns, we can use the drop() method with the columns argument. To drop the age column, we type:

yao.drop(columns=["age"])
id date_surveyed age_category age_category_3 sex highest_education occupation weight_kg height_cm is_smoker ... is_drug_antibio is_drug_hydrocortisone is_drug_other_anti_inflam is_drug_antiviral is_drug_chloro is_drug_tradn is_drug_oxygen is_drug_other is_drug_no_resp is_drug_none
0 BRIQUETERIE_000_0001 2020-10-22 45 - 64 Adult Female Secondary Informal worker 95 169 Non-smoker ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 BRIQUETERIE_000_0002 2020-10-24 45 - 64 Adult Male University Salaried worker 96 185 Ex-smoker ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
969 TSINGAOLIGA_026_0002 2020-11-11 30 - 44 Adult Female Secondary Unemployed 66 169 Non-smoker ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
970 TSINGAOLIGA_026_0003 2020-11-11 15 - 29 Child Female Secondary Unemployed 67 162 Non-smoker ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

971 rows × 52 columns

To drop several columns:

yao.drop(columns=["age", "sex"])
id date_surveyed age_category age_category_3 highest_education occupation weight_kg height_cm is_smoker is_pregnant ... is_drug_antibio is_drug_hydrocortisone is_drug_other_anti_inflam is_drug_antiviral is_drug_chloro is_drug_tradn is_drug_oxygen is_drug_other is_drug_no_resp is_drug_none
0 BRIQUETERIE_000_0001 2020-10-22 45 - 64 Adult Secondary Informal worker 95 169 Non-smoker No ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 BRIQUETERIE_000_0002 2020-10-24 45 - 64 Adult University Salaried worker 96 185 Ex-smoker NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
969 TSINGAOLIGA_026_0002 2020-11-11 30 - 44 Adult Secondary Unemployed 66 169 Non-smoker No ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
970 TSINGAOLIGA_026_0003 2020-11-11 15 - 29 Child Secondary Unemployed 67 162 Non-smoker No response ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

971 rows × 51 columns

Again, note that this is not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset age and sex dropped:

yao_subset = yao.drop(columns=["age", "sex"])
yao_subset
id date_surveyed age_category age_category_3 highest_education occupation weight_kg height_cm is_smoker is_pregnant ... is_drug_antibio is_drug_hydrocortisone is_drug_other_anti_inflam is_drug_antiviral is_drug_chloro is_drug_tradn is_drug_oxygen is_drug_other is_drug_no_resp is_drug_none
0 BRIQUETERIE_000_0001 2020-10-22 45 - 64 Adult Secondary Informal worker 95 169 Non-smoker No ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 BRIQUETERIE_000_0002 2020-10-24 45 - 64 Adult University Salaried worker 96 185 Ex-smoker NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
969 TSINGAOLIGA_026_0002 2020-11-11 30 - 44 Adult Secondary Unemployed 66 169 Non-smoker No ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
970 TSINGAOLIGA_026_0003 2020-11-11 15 - 29 Child Secondary Unemployed 67 162 Non-smoker No response ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

971 rows × 51 columns

Practice

15.7.1 Practice Q: Drop Columns with drop()

  • From the yao DataFrame, remove the columns highest_education and consultation. Assign the result to a new DataFrame called yao_no_education_consultation. Print this new DataFrame.
# Your code here

15.8 Using filter() to select columns by regex

The filter() method and its regex argument offer a powerful way to select columns based on patterns in their names. As an example, to select columns containing the string “ig”, we can write:

yao.filter(regex="ig")
highest_education weight_kg height_cm neighborhood igg_result igm_result symp_fatigue
0 Secondary 95 169 Briqueterie Negative Negative No
1 University 96 185 Briqueterie Positive Negative No
... ... ... ... ... ... ... ...
969 Secondary 66 169 Tsinga Oliga Negative Negative No
970 Secondary 67 162 Tsinga Oliga Negative Negative No

971 rows × 7 columns

The argument regex specifies the pattern to match. Regex stands for regular expression and refers to a sequence of characters that define a search pattern.

To select columns starting with the string “ig”, we write:

yao.filter(regex="^ig")
igg_result igm_result
0 Negative Negative
1 Positive Negative
... ... ...
969 Negative Negative
970 Negative Negative

971 rows × 2 columns

The symbol ^ is a regex character that matches the beginning of the string.

To select columns ending with the string “result”, we can write:

yao.filter(regex="result$")
igg_result igm_result
0 Negative Negative
1 Positive Negative
... ... ...
969 Negative Negative
970 Negative Negative

971 rows × 2 columns

The character $ is regex that matches the end of the string.

Pro Tip

Regex is notoriously difficult to remember, but LLMs like ChatGPT are very good at generating the right patterns. Simply ask, for example, “What is the regex for strings starting with ‘ig’”

Practice

15.8.1 Practice Q: Select Columns with Regex

  • Select all columns in the yao DataFrame that start with “is_”. Assign the result to a new DataFrame called yao_is_columns. Then print this new DataFrame.
# Your code here

15.9 Change column names with rename()

We can use the rename() method to change column names:

yao.rename(columns={"age": "patient_age", "sex": "patient_sex"})
id date_surveyed patient_age age_category age_category_3 patient_sex highest_education occupation weight_kg height_cm ... is_drug_antibio is_drug_hydrocortisone is_drug_other_anti_inflam is_drug_antiviral is_drug_chloro is_drug_tradn is_drug_oxygen is_drug_other is_drug_no_resp is_drug_none
0 BRIQUETERIE_000_0001 2020-10-22 45 45 - 64 Adult Female Secondary Informal worker 95 169 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 BRIQUETERIE_000_0002 2020-10-24 55 45 - 64 Adult Male University Salaried worker 96 185 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
969 TSINGAOLIGA_026_0002 2020-11-11 31 30 - 44 Adult Female Secondary Unemployed 66 169 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
970 TSINGAOLIGA_026_0003 2020-11-11 17 15 - 29 Child Female Secondary Unemployed 67 162 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

971 rows × 53 columns

Practice

15.9.1 Practice Q: Rename Columns with rename()

  • Rename the age_category column in the yao DataFrame to age_cat. Assign the result to a new DataFrame called yao_age_cat. Then print this new DataFrame.
# Your code here

15.10 Cleaning messy column names

For cleaning column names, you can use regular expressions with the str.replace() method in pandas.

Here’s how you can do it on a test DataFrame with messy column names. Messy column names are names with spaces, special characters, or other non-alphanumeric characters.

test_df = pd.DataFrame(
    {"good_name": range(3), "bad name": range(3), "bad*@name*2": range(3)}
)
test_df
good_name bad name bad*@name*2
0 0 0 0
1 1 1 1
2 2 2 2

Such column names are not ideal because, for example, we cannot select them with the dot operator the way we can for clean names:

test_df.good_name  # this works
0    0
1    1
2    2
Name: good_name, dtype: int64

But this does not work:

test_df.bad name
      test_df.bad name
                 ^
SyntaxError: invalid syntax

We can automatically clean such names using the str.replace() method along with regular expressions.

clean_names = test_df.columns.str.replace(r'[^a-zA-Z0-9]', '_', regex=True)

The regular expression r'[^a-zA-Z0-9]' matches any character that is not a letter (either uppercase or lowercase) or a digit. The str.replace() method replaces these characters with an underscore (‘_’) to make the column names more legible and usable in dot notation.

Now we can replace the column names in the DataFrame with the cleaned names:

test_df.columns = clean_names
test_df
good_name bad_name bad__name_2
0 0 0 0
1 1 1 1
2 2 2 2
Practice

15.10.1 Practice Q: Clean Column Names with Regex

  • Consider the data frame defined below with messy column names. Use the str.replace() method to clean the column names.
cleaning_practice = pd.DataFrame(
    {"Aloha": range(3), "Bell Chart": range(3), "Animals@the zoo": range(3)}
)
cleaning_practice
Aloha Bell Chart Animals@the zoo
0 0 0 0
1 1 1 1
2 2 2 2

15.11 Wrap up

Hopefully this lesson has shown you how intuitive and useful pandas is for data manipulation!

This is the first of a series of basic data wrangling techniques: see you in the next lesson to learn more.