Querying rows is one of the most frequently used operations in data analysis. It allows you to filter your dataset to focus on specific subsets of interest, enabling more targeted and efficient analysis.
In this lesson, we’ll explore various techniques to subset rows in pandas.
Let’s get started!
16.2 Learning objectives
You can use the query() method to keep or drop rows from a DataFrame.
You can specify conditions using relational operators like greater than (>), less than (<), equal to (==), not equal to (!=), and is an element of (isin()).
You can combine conditions with & and |.
You can negate conditions with ~.
You can use the isna() and notna() methods.
You can query based on string patterns using str.contains().
16.3 The Yaounde COVID-19 dataset
In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0
Let’s load the data into a pandas DataFrame.
import pandas as pdyaounde = pd.read_csv("data/yaounde_data.csv")# a smaller subset of variablesyao = yaounde[ ["age","sex","weight_kg","neighborhood","occupation","symptoms","is_smoker","is_pregnant","igg_result","igm_result", ]]yao.head()
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
0
45
Female
95
Briqueterie
Informal worker
Muscle pain
Non-smoker
No
Negative
Negative
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
2
23
Male
74
Briqueterie
Student
No symptoms
Smoker
NaN
Negative
Negative
3
20
Female
70
Briqueterie
Student
Rhinitis--Sneezing--Anosmia or ageusia
Non-smoker
No
Positive
Negative
4
55
Female
67
Briqueterie
Trader--Farmer
No symptoms
Non-smoker
No
Positive
Negative
16.4 Introducing query()
We can use the query() method to keep rows that satisfy a set of conditions. Let’s take a look at a simple example. If we want to keep just the male records, we run:
yao.query('sex == "Male"')
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
2
23
Male
74
Briqueterie
Student
No symptoms
Smoker
NaN
Negative
Negative
...
...
...
...
...
...
...
...
...
...
...
966
32
Male
54
Tsinga Oliga
Informal worker
Rhinitis--Sneezing--Diarrhoea
Smoker
NaN
Negative
Negative
968
35
Male
77
Tsinga Oliga
Informal worker
Headache
Smoker
NaN
Positive
Negative
422 rows × 10 columns
As you can see, the query() syntax is quite simple. (It may be a bit surprising to have to put code in quotes, but it is quite readable.)
Note the use of double equals (==) instead of single equals (=) there. The == sign tests for equality, while the single equals sign assigns a value. This is a common source of errors when you are a beginner, so watch out for it.
We can chain query() with shape[0] to count the number of male respondents.
yao.query('sex == "Male"').shape[0]
422
Reminder
The shape property returns the number of rows and columns in a DataFrame. The first element, shape[0], is the number of rows, and the second element, shape[1], is the number of columns.
For example:
yao.shape
(971, 10)
yao.shape[0] # rows
971
yao.shape[1] # columns
10
Key Point
Note that these subsets are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset of male respondents:
yao_male = yao.query('sex == "Male"')yao_male
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
2
23
Male
74
Briqueterie
Student
No symptoms
Smoker
NaN
Negative
Negative
...
...
...
...
...
...
...
...
...
...
...
966
32
Male
54
Tsinga Oliga
Informal worker
Rhinitis--Sneezing--Diarrhoea
Smoker
NaN
Negative
Negative
968
35
Male
77
Tsinga Oliga
Informal worker
Headache
Smoker
NaN
Positive
Negative
422 rows × 10 columns
But for ease of explanation, in the examples below, we are simply printing the result, without storing it in a variable.
Practice
16.4.1 Practice Q: Subset for Pregnant Respondents
Subset the yao data frame to respondents who were pregnant during the survey (The is_pregnant column contains “Yes”, “No” or NaN). Assign the result to a new DataFrame called yao_pregnant. Then print this new DataFrame. There should be 24 rows.
# Your code here
16.5 Relational operators
The == operator introduced above is an example of a “relational” operator, as it tests the relation between two values. Here is a list of some more of these operators. You will use these often when you are querying rows in your data.
Operator
is True if
A == B
A is equal to B
A != B
A is not equal to B
A < B
A is less than B
A <= B
A is less than or equal to B
A > B
A is greater than B
A >= B
A is greater than or equal to B
A.isin([B])
A is an element of B
Let’s see how to use these with query():
yao.query('sex == "Female"') # keep rows where `sex` is femaleyao.query('sex != "Male"') # keep rows where `sex` is not "Male"yao.query("age < 6") # keep respondents under 6yao.query("age >= 70") # keep respondents aged at least 70# keep respondents whose neighbourhood is "Tsinga" or "Messa"yao.query('neighborhood.isin(["Tsinga", "Messa"])')
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
605
55
Male
70
Messa
Informal worker
No symptoms
Non-smoker
NaN
Negative
Negative
606
59
Female
59
Messa
Trader
No symptoms
Non-smoker
No
Negative
Negative
...
...
...
...
...
...
...
...
...
...
...
902
25
Female
41
Tsinga
Unemployed
No symptoms
Non-smoker
No
Negative
Negative
903
28
Male
69
Tsinga
Trader
Sneezing--Headache
Ex-smoker
NaN
Negative
Negative
129 rows × 10 columns
Practice
16.5.1 Practice Q: Subset for Children
From yao, keep only respondents who were children (under 18). Assign the result to a new DataFrame called yao_children. There should be 291 rows.
# Your code here
Practice
16.5.2 Practice Q: Subset for Tsinga and Messa
With isin(), keep only respondents who live in the “Carriere” or “Ekoudou” neighborhoods. Assign the result to a new DataFrame called yao_carriere_ekoudou. There should be 426 rows.
# Your code here
16.6 Accessing external variables in query()
The query() method allows you to access variables outside the DataFrame using the @ symbol. This is useful when you want to use dynamic values in your query conditions.
For example, say you have the variable min_age that you want to use in your query. You can do this as follows:
min_age =25# Query using external variablesyao.query('age >= @min_age')
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
0
45
Female
95
Briqueterie
Informal worker
Muscle pain
Non-smoker
No
Negative
Negative
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
968
35
Male
77
Tsinga Oliga
Informal worker
Headache
Smoker
NaN
Positive
Negative
969
31
Female
66
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No
Negative
Negative
524 rows × 10 columns
This feature is helpful when you need to filter data based on values that may change or are determined at runtime.
Practice
16.6.1 Practice Q: Subset for Young Respondents
From yao, keep respondents who are less than or equal to the variable max_age, defined below. Assign the result to a new DataFrame called yao_young. There should be 590 rows.
max_age =30# Your code here
16.7 Combining conditions with & and |
We can pass multiple conditions to query() using & (the “ampersand” symbol) for AND and | (the “vertical bar” or “pipe” symbol) for OR.
For example, to keep respondents who are either younger than 18 OR older than 65, we can write:
yao.query("age < 18 | age > 65")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
5
17
Female
65
Briqueterie
Student
Fever--Cough--Rhinitis--Nausea or vomiting--Di...
Non-smoker
No
Negative
Negative
6
13
Female
65
Briqueterie
Student
Sneezing
Non-smoker
No
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
962
15
Male
44
Tsinga Oliga
Student
Fever--Cough--Rhinitis
Non-smoker
NaN
Positive
Negative
970
17
Female
67
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No response
Negative
Negative
331 rows × 10 columns
To keep respondents who are pregnant and are ex-smokers, we write:
Subset yao to only keep men who tested IgG positive. Assign the result to a new DataFrame called yao_igg_positive_men. There should be 148 rows after your query. Think carefully about whether to use & or |.
# Your code here
16.8 Negating conditions with the ~ operator
To negate conditions in query(), we use the ~ operator (pronounced “tilde”).
Let’s use this to drop respondents who are students:
yao.query('~ (occupation == "Student")')
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
0
45
Female
95
Briqueterie
Informal worker
Muscle pain
Non-smoker
No
Negative
Negative
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
969
31
Female
66
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No
Negative
Negative
970
17
Female
67
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No response
Negative
Negative
588 rows × 10 columns
Notice that we have to enclose the condition in parentheses.
We can also enclose multiple conditions in parentheses.
Imagine we want to give out a drug, but since it is a strong drug, we don’t want children or lightweight (under 30kg) respondents to take it. First, we can write a query to select the children and these light respondents:
yao.query("age < 18 | weight_kg < 30")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
5
17
Female
65
Briqueterie
Student
Fever--Cough--Rhinitis--Nausea or vomiting--Di...
Non-smoker
No
Negative
Negative
6
13
Female
65
Briqueterie
Student
Sneezing
Non-smoker
No
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
962
15
Male
44
Tsinga Oliga
Student
Fever--Cough--Rhinitis
Non-smoker
NaN
Positive
Negative
970
17
Female
67
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No response
Negative
Negative
291 rows × 10 columns
Now to drop these individuals, we can negate the condition with ~:
yao.query("~ (age < 18 | weight_kg < 30)")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
0
45
Female
95
Briqueterie
Informal worker
Muscle pain
Non-smoker
No
Negative
Negative
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
968
35
Male
77
Tsinga Oliga
Informal worker
Headache
Smoker
NaN
Positive
Negative
969
31
Female
66
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No
Negative
Negative
680 rows × 10 columns
This could also be written as:
yao.query("age >= 18 & weight_kg >= 30")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
0
45
Female
95
Briqueterie
Informal worker
Muscle pain
Non-smoker
No
Negative
Negative
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
968
35
Male
77
Tsinga Oliga
Informal worker
Headache
Smoker
NaN
Positive
Negative
969
31
Female
66
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No
Negative
Negative
680 rows × 10 columns
But sometimes negated conditions are easier to read.
Practice
16.8.1 Practice Q: Drop Smokers and drop those over 50
We want to avoid giving a drug to older individuals and smokers. From yao, drop respondents that are either above 50 or who are smokers. Use ~ to negate the conditions. Assign the result to a new DataFrame called yao_dropped. Your output should have 810 rows.
# Your code here
16.9NaN values
The relational operators introduced so far do not work with null values like NaN.
For example, the is_pregnant column contains (NA) values for men. To keep the rows with missing is_pregnant values, we could try writing:
yao.query("is_pregnant == NaN") # does not work
But this will not work. This is because NaN is a non-existent value. So the system cannot evaluate whether it is “equal to” or “not equal to” anything.
Instead, we can use the isna() method to select rows with missing values:
yao.query("is_pregnant.isna()")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
2
23
Male
74
Briqueterie
Student
No symptoms
Smoker
NaN
Negative
Negative
...
...
...
...
...
...
...
...
...
...
...
966
32
Male
54
Tsinga Oliga
Informal worker
Rhinitis--Sneezing--Diarrhoea
Smoker
NaN
Negative
Negative
968
35
Male
77
Tsinga Oliga
Informal worker
Headache
Smoker
NaN
Positive
Negative
422 rows × 10 columns
Or we can select rows that are not missing with notna():
yao.query("is_pregnant.notna()")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
0
45
Female
95
Briqueterie
Informal worker
Muscle pain
Non-smoker
No
Negative
Negative
3
20
Female
70
Briqueterie
Student
Rhinitis--Sneezing--Anosmia or ageusia
Non-smoker
No
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
969
31
Female
66
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No
Negative
Negative
970
17
Female
67
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No response
Negative
Negative
549 rows × 10 columns
Practice
16.9.1 Practice Q: Keep Missing Smoking Status
From the yao dataset, keep all the respondents who had NA records for the report of their smoking status.
# Your code here
16.10 Querying Based on String Patterns
Sometimes, we need to filter our data based on whether a string column contains a certain substring. This is particularly useful when dealing with multi-answer type variables, where responses may contain multiple values separated by delimiters. Let’s explore this using the occupation column in our dataset.
First, let’s take a look at the unique values in the occupation column:
As we can see, some respondents have multiple occupations, separated by “–”. To query based on string containment, we can use the str.contains() method within our query().
16.10.1 Basic String Containment
To find all respondents who are students (either solely or in combination with other occupations), we can use:
yao.query("occupation.str.contains('Student')")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
2
23
Male
74
Briqueterie
Student
No symptoms
Smoker
NaN
Negative
Negative
3
20
Female
70
Briqueterie
Student
Rhinitis--Sneezing--Anosmia or ageusia
Non-smoker
No
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
963
26
Female
63
Tsinga Oliga
Student
No symptoms
Non-smoker
No
Negative
Negative
964
28
Male
76
Tsinga Oliga
Student--Informal worker
No symptoms
Non-smoker
NaN
Negative
Negative
402 rows × 10 columns
This query will return all rows where the occupation column contains the word “Student”, regardless of whether it’s the only occupation or part of a multiple-occupation entry.
16.10.2 Negating String Containment
To find respondents who are not students (i.e., their occupation does not contain “Student”), you can use the ~ operator:
yao.query("~occupation.str.contains('Student')")
age
sex
weight_kg
neighborhood
occupation
symptoms
is_smoker
is_pregnant
igg_result
igm_result
0
45
Female
95
Briqueterie
Informal worker
Muscle pain
Non-smoker
No
Negative
Negative
1
55
Male
96
Briqueterie
Salaried worker
No symptoms
Ex-smoker
NaN
Positive
Negative
...
...
...
...
...
...
...
...
...
...
...
969
31
Female
66
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No
Negative
Negative
970
17
Female
67
Tsinga Oliga
Unemployed
No symptoms
Non-smoker
No response
Negative
Negative
569 rows × 10 columns
16.10.3 Using | with string containment
To find respondents who are students or farmers, we can use: