pip install pandas
15 Subsetting columns
15.1 Introduction
Today we will begin our exploration of pandas for data manipulation!
Our first focus will be on selecting and renaming columns. Often your dataset comes with many columns that you do not need, and you would like to narrow it down to just a few. Pandas makes this easy. Let’s see how.
15.2 Learning objectives
- You can keep or drop columns from a DataFrame using square brackets
[]
,filter()
, anddrop()
. - You can select columns based on regex patterns with
filter()
. - You can use
rename()
to change column names. - You can use regex to clean column names.
15.3 About pandas
Pandas is a popular library for data manipulation and analysis. It is designed to make it easy to work with tabular data in Python.
Install pandas with the following command in your terminal if it is not already installed:
Then import pandas with the following command in your script:
import pandas as pd
15.4 The Yaounde COVID-19 dataset
In this lesson, we analyse results from a COVID-19 survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for antibodies.
You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0
To download the dataset, visit this link: https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/yaounde_data.zip
Then unzip the file and place the yaounde_data.csv
file in the data
folder in the same directory as your notebook.
Let’s load and examine the dataset:
= pd.read_csv("data/yaounde_data.csv")
yao yao
id | date_surveyed | age | age_category | age_category_3 | sex | highest_education | occupation | weight_kg | height_cm | ... | is_drug_antibio | is_drug_hydrocortisone | is_drug_other_anti_inflam | is_drug_antiviral | is_drug_chloro | is_drug_tradn | is_drug_oxygen | is_drug_other | is_drug_no_resp | is_drug_none | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BRIQUETERIE_000_0001 | 2020-10-22 | 45 | 45 - 64 | Adult | Female | Secondary | Informal worker | 95 | 169 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | BRIQUETERIE_000_0002 | 2020-10-24 | 55 | 45 - 64 | Adult | Male | University | Salaried worker | 96 | 185 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
969 | TSINGAOLIGA_026_0002 | 2020-11-11 | 31 | 30 - 44 | Adult | Female | Secondary | Unemployed | 66 | 169 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
970 | TSINGAOLIGA_026_0003 | 2020-11-11 | 17 | 15 - 29 | Child | Female | Secondary | Unemployed | 67 | 162 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
971 rows × 53 columns
15.5 Selecting columns with square brackets []
In pandas, the most common way to select a column is simply to use square brackets []
and the column name. For example, to select the age
and sex
columns, we type:
"age", "sex"]] yao[[
age | sex | |
---|---|---|
0 | 45 | Female |
1 | 55 | Male |
... | ... | ... |
969 | 31 | Female |
970 | 17 | Female |
971 rows × 2 columns
Note the double square brackets [[]]
. Without it, you will get an error:
"age", "sex"] yao[
KeyError: ('age', 'sex')
If you want to select a single column, you may omit the double square brackets, but your output will no longer be a DataFrame. Compare the following:
"age"] # does not return a DataFrame yao[
0 45
1 55
..
969 31
970 17
Name: age, Length: 971, dtype: int64
"age"]] # returns a DataFrame yao[[
age | |
---|---|
0 | 45 |
1 | 55 |
... | ... |
969 | 31 |
970 | 17 |
971 rows × 1 columns
15.6 Storing data subsets
Note that these selections are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset with only three columns:
= yao[["age", "sex", "igg_result"]]
yao_subset yao_subset
age | sex | igg_result | |
---|---|---|---|
0 | 45 | Female | Negative |
1 | 55 | Male | Positive |
... | ... | ... | ... |
969 | 31 | Female | Negative |
970 | 17 | Female | Negative |
971 rows × 3 columns
And if we want to overwrite a DataFrame, we can assign the subset back to the original DataFrame. Let’s overwrite the yao_subset
DataFrame to have only the age
column:
= yao_subset[["age"]]
yao_subset yao_subset
age | |
---|---|
0 | 45 |
1 | 55 |
... | ... |
969 | 31 |
970 | 17 |
971 rows × 1 columns
The yao_subset
DataFrame has gone from having 3 columns to having 1 column.
15.6.1 Practice Q: Select Columns with []
- Use the
[]
operator to select the “weight_kg” and “height_cm” variables in theyao
DataFrame. Assign the result to a new DataFrame calledyao_weight_height
. Then print this new DataFrame.
# Your code here
There are many ways to select columns in pandas. In your free time, you may choose to explore the .loc[]
and .take()
methods, which provide additional functionality.
15.7 Excluding columns with drop()
Sometimes it is more useful to drop columns you do not need than to explicitly select the ones that you do need.
To drop columns, we can use the drop()
method with the columns
argument. To drop the age column, we type:
=["age"]) yao.drop(columns
id | date_surveyed | age_category | age_category_3 | sex | highest_education | occupation | weight_kg | height_cm | is_smoker | ... | is_drug_antibio | is_drug_hydrocortisone | is_drug_other_anti_inflam | is_drug_antiviral | is_drug_chloro | is_drug_tradn | is_drug_oxygen | is_drug_other | is_drug_no_resp | is_drug_none | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BRIQUETERIE_000_0001 | 2020-10-22 | 45 - 64 | Adult | Female | Secondary | Informal worker | 95 | 169 | Non-smoker | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | BRIQUETERIE_000_0002 | 2020-10-24 | 45 - 64 | Adult | Male | University | Salaried worker | 96 | 185 | Ex-smoker | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
969 | TSINGAOLIGA_026_0002 | 2020-11-11 | 30 - 44 | Adult | Female | Secondary | Unemployed | 66 | 169 | Non-smoker | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
970 | TSINGAOLIGA_026_0003 | 2020-11-11 | 15 - 29 | Child | Female | Secondary | Unemployed | 67 | 162 | Non-smoker | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
971 rows × 52 columns
To drop several columns:
=["age", "sex"]) yao.drop(columns
id | date_surveyed | age_category | age_category_3 | highest_education | occupation | weight_kg | height_cm | is_smoker | is_pregnant | ... | is_drug_antibio | is_drug_hydrocortisone | is_drug_other_anti_inflam | is_drug_antiviral | is_drug_chloro | is_drug_tradn | is_drug_oxygen | is_drug_other | is_drug_no_resp | is_drug_none | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BRIQUETERIE_000_0001 | 2020-10-22 | 45 - 64 | Adult | Secondary | Informal worker | 95 | 169 | Non-smoker | No | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | BRIQUETERIE_000_0002 | 2020-10-24 | 45 - 64 | Adult | University | Salaried worker | 96 | 185 | Ex-smoker | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
969 | TSINGAOLIGA_026_0002 | 2020-11-11 | 30 - 44 | Adult | Secondary | Unemployed | 66 | 169 | Non-smoker | No | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
970 | TSINGAOLIGA_026_0003 | 2020-11-11 | 15 - 29 | Child | Secondary | Unemployed | 67 | 162 | Non-smoker | No response | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
971 rows × 51 columns
Again, note that this is not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset age and sex dropped:
= yao.drop(columns=["age", "sex"])
yao_subset yao_subset
id | date_surveyed | age_category | age_category_3 | highest_education | occupation | weight_kg | height_cm | is_smoker | is_pregnant | ... | is_drug_antibio | is_drug_hydrocortisone | is_drug_other_anti_inflam | is_drug_antiviral | is_drug_chloro | is_drug_tradn | is_drug_oxygen | is_drug_other | is_drug_no_resp | is_drug_none | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BRIQUETERIE_000_0001 | 2020-10-22 | 45 - 64 | Adult | Secondary | Informal worker | 95 | 169 | Non-smoker | No | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | BRIQUETERIE_000_0002 | 2020-10-24 | 45 - 64 | Adult | University | Salaried worker | 96 | 185 | Ex-smoker | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
969 | TSINGAOLIGA_026_0002 | 2020-11-11 | 30 - 44 | Adult | Secondary | Unemployed | 66 | 169 | Non-smoker | No | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
970 | TSINGAOLIGA_026_0003 | 2020-11-11 | 15 - 29 | Child | Secondary | Unemployed | 67 | 162 | Non-smoker | No response | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
971 rows × 51 columns
15.7.1 Practice Q: Drop Columns with drop()
- From the
yao
DataFrame, remove the columnshighest_education
andconsultation
. Assign the result to a new DataFrame calledyao_no_education_consultation
. Print this new DataFrame.
# Your code here
15.8 Using filter()
to select columns by regex
The filter()
method and its regex
argument offer a powerful way to select columns based on patterns in their names. As an example, to select columns containing the string “ig”, we can write:
filter(regex="ig") yao.
highest_education | weight_kg | height_cm | neighborhood | igg_result | igm_result | symp_fatigue | |
---|---|---|---|---|---|---|---|
0 | Secondary | 95 | 169 | Briqueterie | Negative | Negative | No |
1 | University | 96 | 185 | Briqueterie | Positive | Negative | No |
... | ... | ... | ... | ... | ... | ... | ... |
969 | Secondary | 66 | 169 | Tsinga Oliga | Negative | Negative | No |
970 | Secondary | 67 | 162 | Tsinga Oliga | Negative | Negative | No |
971 rows × 7 columns
The argument regex
specifies the pattern to match. Regex stands for regular expression and refers to a sequence of characters that define a search pattern.
To select columns starting with the string “ig”, we write:
filter(regex="^ig") yao.
igg_result | igm_result | |
---|---|---|
0 | Negative | Negative |
1 | Positive | Negative |
... | ... | ... |
969 | Negative | Negative |
970 | Negative | Negative |
971 rows × 2 columns
The symbol ^
is a regex character that matches the beginning of the string.
To select columns ending with the string “result”, we can write:
filter(regex="result$") yao.
igg_result | igm_result | |
---|---|---|
0 | Negative | Negative |
1 | Positive | Negative |
... | ... | ... |
969 | Negative | Negative |
970 | Negative | Negative |
971 rows × 2 columns
The character $
is regex that matches the end of the string.
Regex is notoriously difficult to remember, but LLMs like ChatGPT are very good at generating the right patterns. Simply ask, for example, “What is the regex for strings starting with ‘ig’”
15.8.1 Practice Q: Select Columns with Regex
- Select all columns in the
yao
DataFrame that start with “is_”. Assign the result to a new DataFrame calledyao_is_columns
. Then print this new DataFrame.
# Your code here
15.9 Change column names with rename()
We can use the rename()
method to change column names:
={"age": "patient_age", "sex": "patient_sex"}) yao.rename(columns
id | date_surveyed | patient_age | age_category | age_category_3 | patient_sex | highest_education | occupation | weight_kg | height_cm | ... | is_drug_antibio | is_drug_hydrocortisone | is_drug_other_anti_inflam | is_drug_antiviral | is_drug_chloro | is_drug_tradn | is_drug_oxygen | is_drug_other | is_drug_no_resp | is_drug_none | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BRIQUETERIE_000_0001 | 2020-10-22 | 45 | 45 - 64 | Adult | Female | Secondary | Informal worker | 95 | 169 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | BRIQUETERIE_000_0002 | 2020-10-24 | 55 | 45 - 64 | Adult | Male | University | Salaried worker | 96 | 185 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
969 | TSINGAOLIGA_026_0002 | 2020-11-11 | 31 | 30 - 44 | Adult | Female | Secondary | Unemployed | 66 | 169 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
970 | TSINGAOLIGA_026_0003 | 2020-11-11 | 17 | 15 - 29 | Child | Female | Secondary | Unemployed | 67 | 162 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
971 rows × 53 columns
15.9.1 Practice Q: Rename Columns with rename()
- Rename the
age_category
column in theyao
DataFrame toage_cat
. Assign the result to a new DataFrame calledyao_age_cat
. Then print this new DataFrame.
# Your code here
15.10 Cleaning messy column names
For cleaning column names, you can use regular expressions with the str.replace()
method in pandas.
Here’s how you can do it on a test DataFrame with messy column names. Messy column names are names with spaces, special characters, or other non-alphanumeric characters.
= pd.DataFrame(
test_df "good_name": range(3), "bad name": range(3), "bad*@name*2": range(3)}
{
) test_df
good_name | bad name | bad*@name*2 | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 |
Such column names are not ideal because, for example, we cannot select them with the dot operator the way we can for clean names:
# this works test_df.good_name
0 0
1 1
2 2
Name: good_name, dtype: int64
But this does not work:
test_df.bad name
test_df.bad name
^
SyntaxError: invalid syntax
We can automatically clean such names using the str.replace()
method along with regular expressions.
= test_df.columns.str.replace(r'[^a-zA-Z0-9]', '_', regex=True) clean_names
The regular expression r'[^a-zA-Z0-9]'
matches any character that is not a letter (either uppercase or lowercase) or a digit. The str.replace()
method replaces these characters with an underscore (‘_’) to make the column names more legible and usable in dot notation.
Now we can replace the column names in the DataFrame with the cleaned names:
= clean_names
test_df.columns test_df
good_name | bad_name | bad__name_2 | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 |
15.10.1 Practice Q: Clean Column Names with Regex
- Consider the data frame defined below with messy column names. Use the
str.replace()
method to clean the column names.
= pd.DataFrame(
cleaning_practice "Aloha": range(3), "Bell Chart": range(3), "Animals@the zoo": range(3)}
{
) cleaning_practice
Aloha | Bell Chart | Animals@the zoo | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 |
2 | 2 | 2 | 2 |
15.11 Wrap up
Hopefully this lesson has shown you how intuitive and useful pandas is for data manipulation!
This is the first of a series of basic data wrangling techniques: see you in the next lesson to learn more.