16 Querying rows

16.1 Intro

Querying rows is one of the most frequently used operations in data analysis. It allows you to filter your dataset to focus on specific subsets of interest, enabling more targeted and efficient analysis.

In this lesson, we’ll explore various techniques to subset rows in pandas.

Let’s get started!

16.2 Learning objectives

You can use the query() method to keep or drop rows from a DataFrame.
You can specify conditions using relational operators like greater than (>), less than (<), equal to (==), not equal to (!=), and is an element of (isin()).
You can combine conditions with & and |.
You can negate conditions with ~.
You can use the isna() and notna() methods.
You can query based on string patterns using str.contains().

16.3 The Yaounde COVID-19 dataset

In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.

You can download the dataset from this link: yaounde_data.csv

You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0

Let’s load the data into a pandas DataFrame.

import pandas as pd

yaounde = pd.read_csv("data/yaounde_data.csv")
# a smaller subset of variables
yao = yaounde[
    [
        "age",
        "sex",
        "weight_kg",
        "neighborhood",
        "occupation",
        "symptoms",
        "is_smoker",
        "is_pregnant",
        "igg_result",
        "igm_result",
    ]
]
yao.head()

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
0	45	Female	95	Briqueterie	Informal worker	Muscle pain	Non-smoker	No	Negative	Negative
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
2	23	Male	74	Briqueterie	Student	No symptoms	Smoker	NaN	Negative	Negative
3	20	Female	70	Briqueterie	Student	Rhinitis--Sneezing--Anosmia or ageusia	Non-smoker	No	Positive	Negative
4	55	Female	67	Briqueterie	Trader--Farmer	No symptoms	Non-smoker	No	Positive	Negative

16.4 Introducing `query()`

We can use the query() method to keep rows that satisfy a set of conditions. Let’s take a look at a simple example. If we want to keep just the male records, we run:

yao.query('sex == "Male"')

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
2	23	Male	74	Briqueterie	Student	No symptoms	Smoker	NaN	Negative	Negative
...	...	...	...	...	...	...	...	...	...	...
966	32	Male	54	Tsinga Oliga	Informal worker	Rhinitis--Sneezing--Diarrhoea	Smoker	NaN	Negative	Negative
968	35	Male	77	Tsinga Oliga	Informal worker	Headache	Smoker	NaN	Positive	Negative

422 rows × 10 columns

As you can see, the query() syntax is quite simple. (It may be a bit surprising to have to put code in quotes, but it is quite readable.)

Note the use of double equals (==) instead of single equals (=) there. The == sign tests for equality, while the single equals sign assigns a value. This is a common source of errors when you are a beginner, so watch out for it.

We can chain query() with shape[0] to count the number of male respondents.

yao.query('sex == "Male"').shape[0]

Reminder

The shape property returns the number of rows and columns in a DataFrame. The first element, shape[0], is the number of rows, and the second element, shape[1], is the number of columns.

For example:

yao.shape

(971, 10)

yao.shape[0] # rows

yao.shape[1]  # columns

Key Point

Note that these subsets are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset of male respondents:

yao_male = yao.query('sex == "Male"')
yao_male

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
2	23	Male	74	Briqueterie	Student	No symptoms	Smoker	NaN	Negative	Negative
...	...	...	...	...	...	...	...	...	...	...
966	32	Male	54	Tsinga Oliga	Informal worker	Rhinitis--Sneezing--Diarrhoea	Smoker	NaN	Negative	Negative
968	35	Male	77	Tsinga Oliga	Informal worker	Headache	Smoker	NaN	Positive	Negative

422 rows × 10 columns

But for ease of explanation, in the examples below, we are simply printing the result, without storing it in a variable.

Practice

16.4.1 Practice Q: Subset for Pregnant Respondents

Subset the yao data frame to respondents who were pregnant during the survey (The is_pregnant column contains “Yes”, “No” or NaN). Assign the result to a new DataFrame called yao_pregnant. Then print this new DataFrame. There should be 24 rows.

# Your code here

16.5 Relational operators

The == operator introduced above is an example of a “relational” operator, as it tests the relation between two values. Here is a list of some more of these operators. You will use these often when you are querying rows in your data.

Operator	is True if
A == B	A is equal to B
A != B	A is not equal to B
A < B	A is less than B
A <= B	A is less than or equal to B
A > B	A is greater than B
A >= B	A is greater than or equal to B
A.isin([B])	A is an element of B

Let’s see how to use these with query():

yao.query('sex == "Female"')  # keep rows where `sex` is female
yao.query('sex != "Male"')  # keep rows where `sex` is not "Male"
yao.query("age < 6")  # keep respondents under 6
yao.query("age >= 70")  # keep respondents aged at least 70

# keep respondents whose neighbourhood is "Tsinga" or "Messa"
yao.query('neighborhood.isin(["Tsinga", "Messa"])')

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
605	55	Male	70	Messa	Informal worker	No symptoms	Non-smoker	NaN	Negative	Negative
606	59	Female	59	Messa	Trader	No symptoms	Non-smoker	No	Negative	Negative
...	...	...	...	...	...	...	...	...	...	...
902	25	Female	41	Tsinga	Unemployed	No symptoms	Non-smoker	No	Negative	Negative
903	28	Male	69	Tsinga	Trader	Sneezing--Headache	Ex-smoker	NaN	Negative	Negative

129 rows × 10 columns

Practice

16.5.1 Practice Q: Subset for Children

From yao, keep only respondents who were children (under 18). Assign the result to a new DataFrame called yao_children. There should be 291 rows.

# Your code here

Practice

16.5.2 Practice Q: Subset for Tsinga and Messa

With isin(), keep only respondents who live in the “Carriere” or “Ekoudou” neighborhoods. Assign the result to a new DataFrame called yao_carriere_ekoudou. There should be 426 rows.

# Your code here

16.6 Accessing external variables in `query()`

The query() method allows you to access variables outside the DataFrame using the @ symbol. This is useful when you want to use dynamic values in your query conditions.

For example, say you have the variable min_age that you want to use in your query. You can do this as follows:

min_age = 25

# Query using external variables
yao.query('age >= @min_age')

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
0	45	Female	95	Briqueterie	Informal worker	Muscle pain	Non-smoker	No	Negative	Negative
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
968	35	Male	77	Tsinga Oliga	Informal worker	Headache	Smoker	NaN	Positive	Negative
969	31	Female	66	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No	Negative	Negative

524 rows × 10 columns

This feature is helpful when you need to filter data based on values that may change or are determined at runtime.

Practice

16.6.1 Practice Q: Subset for Young Respondents

From yao, keep respondents who are less than or equal to the variable max_age, defined below. Assign the result to a new DataFrame called yao_young. There should be 590 rows.

max_age = 30
# Your code here

16.7 Combining conditions with `&` and `|`

We can pass multiple conditions to query() using & (the “ampersand” symbol) for AND and | (the “vertical bar” or “pipe” symbol) for OR.

For example, to keep respondents who are either younger than 18 OR older than 65, we can write:

yao.query("age < 18 | age > 65")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
5	17	Female	65	Briqueterie	Student	Fever--Cough--Rhinitis--Nausea or vomiting--Di...	Non-smoker	No	Negative	Negative
6	13	Female	65	Briqueterie	Student	Sneezing	Non-smoker	No	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
962	15	Male	44	Tsinga Oliga	Student	Fever--Cough--Rhinitis	Non-smoker	NaN	Positive	Negative
970	17	Female	67	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No response	Negative	Negative

331 rows × 10 columns

To keep respondents who are pregnant and are ex-smokers, we write:

yao.query('is_pregnant == "Yes" & is_smoker == "Ex-smoker"')

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
273	25	Female	90	Carriere	Home-maker	Cough--Rhinitis--Sneezing	Ex-smoker	Yes	Positive	Negative

To keep all respondents who are pregnant or ex-smokers, we write:

yao.query('is_pregnant == "Yes" | is_smoker == "Ex-smoker"')

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
14	42	Male	71	Briqueterie	Trader	No symptoms	Ex-smoker	NaN	Negative	Negative
...	...	...	...	...	...	...	...	...	...	...
953	31	Female	90	Tsinga Oliga	Salaried worker	Fever--Cough--Sore throat--Headache	Ex-smoker	No	Positive	Negative
967	23	Female	76	Tsinga Oliga	Informal worker--Trader	Headache	Non-smoker	Yes	Negative	Negative

94 rows × 10 columns

Side note

To get the unique values in a column, you can use the value_counts() method.

yao.is_smoker.value_counts()

is_smoker
Non-smoker    859
Ex-smoker      71
Smoker         39
Name: count, dtype: int64

Practice

16.7.1 Practice Q: Subset for IgG Positive Men

Subset yao to only keep men who tested IgG positive. Assign the result to a new DataFrame called yao_igg_positive_men. There should be 148 rows after your query. Think carefully about whether to use & or |.

# Your code here

16.8 Negating conditions with the `~` operator

To negate conditions in query(), we use the ~ operator (pronounced “tilde”).

Let’s use this to drop respondents who are students:

yao.query('~ (occupation == "Student")')

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
0	45	Female	95	Briqueterie	Informal worker	Muscle pain	Non-smoker	No	Negative	Negative
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
969	31	Female	66	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No	Negative	Negative
970	17	Female	67	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No response	Negative	Negative

588 rows × 10 columns

Notice that we have to enclose the condition in parentheses.

We can also enclose multiple conditions in parentheses.

Imagine we want to give out a drug, but since it is a strong drug, we don’t want children or lightweight (under 30kg) respondents to take it. First, we can write a query to select the children and these light respondents:

yao.query("age < 18 | weight_kg < 30")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
5	17	Female	65	Briqueterie	Student	Fever--Cough--Rhinitis--Nausea or vomiting--Di...	Non-smoker	No	Negative	Negative
6	13	Female	65	Briqueterie	Student	Sneezing	Non-smoker	No	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
962	15	Male	44	Tsinga Oliga	Student	Fever--Cough--Rhinitis	Non-smoker	NaN	Positive	Negative
970	17	Female	67	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No response	Negative	Negative

291 rows × 10 columns

Now to drop these individuals, we can negate the condition with ~:

yao.query("~ (age < 18 | weight_kg < 30)")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
0	45	Female	95	Briqueterie	Informal worker	Muscle pain	Non-smoker	No	Negative	Negative
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
968	35	Male	77	Tsinga Oliga	Informal worker	Headache	Smoker	NaN	Positive	Negative
969	31	Female	66	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No	Negative	Negative

680 rows × 10 columns

This could also be written as:

yao.query("age >= 18 & weight_kg >= 30")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
0	45	Female	95	Briqueterie	Informal worker	Muscle pain	Non-smoker	No	Negative	Negative
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
968	35	Male	77	Tsinga Oliga	Informal worker	Headache	Smoker	NaN	Positive	Negative
969	31	Female	66	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No	Negative	Negative

680 rows × 10 columns

But sometimes negated conditions are easier to read.

Practice

16.8.1 Practice Q: Drop Smokers and drop those over 50

We want to avoid giving a drug to older individuals and smokers. From yao, drop respondents that are either above 50 or who are smokers. Use ~ to negate the conditions. Assign the result to a new DataFrame called yao_dropped. Your output should have 810 rows.

# Your code here

16.9 `NaN` values

The relational operators introduced so far do not work with null values like NaN.

For example, the is_pregnant column contains (NA) values for men. To keep the rows with missing is_pregnant values, we could try writing:

yao.query("is_pregnant == NaN")  # does not work

But this will not work. This is because NaN is a non-existent value. So the system cannot evaluate whether it is “equal to” or “not equal to” anything.

Instead, we can use the isna() method to select rows with missing values:

yao.query("is_pregnant.isna()")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
2	23	Male	74	Briqueterie	Student	No symptoms	Smoker	NaN	Negative	Negative
...	...	...	...	...	...	...	...	...	...	...
966	32	Male	54	Tsinga Oliga	Informal worker	Rhinitis--Sneezing--Diarrhoea	Smoker	NaN	Negative	Negative
968	35	Male	77	Tsinga Oliga	Informal worker	Headache	Smoker	NaN	Positive	Negative

422 rows × 10 columns

Or we can select rows that are not missing with notna():

yao.query("is_pregnant.notna()")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
0	45	Female	95	Briqueterie	Informal worker	Muscle pain	Non-smoker	No	Negative	Negative
3	20	Female	70	Briqueterie	Student	Rhinitis--Sneezing--Anosmia or ageusia	Non-smoker	No	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
969	31	Female	66	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No	Negative	Negative
970	17	Female	67	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No response	Negative	Negative

549 rows × 10 columns

Practice

16.9.1 Practice Q: Keep Missing Smoking Status

From the yao dataset, keep all the respondents who had NA records for the report of their smoking status.

# Your code here

16.10 Querying Based on String Patterns

Sometimes, we need to filter our data based on whether a string column contains a certain substring. This is particularly useful when dealing with multi-answer type variables, where responses may contain multiple values separated by delimiters. Let’s explore this using the occupation column in our dataset.

First, let’s take a look at the unique values in the occupation column:

yao.occupation.value_counts().to_dict()

{'Student': 383,
 'Informal worker': 189,
 'Trader': 111,
 'Unemployed': 68,
 'Home-maker': 65,
 'Salaried worker': 54,
 'Retired': 27,
 'Student--Informal worker': 13,
 'Other': 13,
 'No response': 9,
 'Farmer': 5,
 'Informal worker--Trader': 4,
 'Student--Trader': 4,
 'Trader--Farmer': 4,
 'Home-maker--Informal worker': 3,
 'Home-maker--Trader': 3,
 'Retired--Informal worker': 3,
 'Informal worker--Other': 2,
 'Home-maker--Farmer': 2,
 'Student--Other': 1,
 'Farmer--Other': 1,
 'Trader--Unemployed': 1,
 'Retired--Other': 1,
 'Informal worker--Unemployed': 1,
 'Retired--Trader': 1,
 'Home-maker--Informal worker--Farmer': 1,
 'Student--Informal worker--Other': 1,
 'Informal worker--Trader--Farmer--Other': 1}

As we can see, some respondents have multiple occupations, separated by “–”. To query based on string containment, we can use the str.contains() method within our query().

16.10.1 Basic String Containment

To find all respondents who are students (either solely or in combination with other occupations), we can use:

yao.query("occupation.str.contains('Student')")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
2	23	Male	74	Briqueterie	Student	No symptoms	Smoker	NaN	Negative	Negative
3	20	Female	70	Briqueterie	Student	Rhinitis--Sneezing--Anosmia or ageusia	Non-smoker	No	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
963	26	Female	63	Tsinga Oliga	Student	No symptoms	Non-smoker	No	Negative	Negative
964	28	Male	76	Tsinga Oliga	Student--Informal worker	No symptoms	Non-smoker	NaN	Negative	Negative

402 rows × 10 columns

This query will return all rows where the occupation column contains the word “Student”, regardless of whether it’s the only occupation or part of a multiple-occupation entry.

16.10.2 Negating String Containment

To find respondents who are not students (i.e., their occupation does not contain “Student”), you can use the ~ operator:

yao.query("~occupation.str.contains('Student')")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
0	45	Female	95	Briqueterie	Informal worker	Muscle pain	Non-smoker	No	Negative	Negative
1	55	Male	96	Briqueterie	Salaried worker	No symptoms	Ex-smoker	NaN	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
969	31	Female	66	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No	Negative	Negative
970	17	Female	67	Tsinga Oliga	Unemployed	No symptoms	Non-smoker	No response	Negative	Negative

569 rows × 10 columns

16.10.3 Using `|` with string containment

To find respondents who are students or farmers, we can use:

yao.query("occupation.str.contains('Student|Farmer')")

	age	sex	weight_kg	neighborhood	occupation	symptoms	is_smoker	is_pregnant	igg_result	igm_result
2	23	Male	74	Briqueterie	Student	No symptoms	Smoker	NaN	Negative	Negative
3	20	Female	70	Briqueterie	Student	Rhinitis--Sneezing--Anosmia or ageusia	Non-smoker	No	Positive	Negative
...	...	...	...	...	...	...	...	...	...	...
963	26	Female	63	Tsinga Oliga	Student	No symptoms	Non-smoker	No	Negative	Negative
964	28	Male	76	Tsinga Oliga	Student--Informal worker	No symptoms	Non-smoker	NaN	Negative	Negative

416 rows × 10 columns

Practice

16.10.4 Practice Q: Symptoms

The symptoms column contains a list of symptoms that respondents reported.

Query yao to find respondents who reported “Cough” or “Fever” as symptoms. Your answer should have 219 rows.

# Your code here

16.11 Wrap up

Great job! You’ve learned how to select specific columns and filter rows based on various conditions.

These skills allow you to focus on relevant data and create targeted subsets for analysis.

Next, we’ll explore how to modify and transform your data, further expanding your data wrangling toolkit. See you in the next lesson!

16.1 Intro

16.2 Learning objectives

16.3 The Yaounde COVID-19 dataset

16.4 Introducing query()

16.4.1 Practice Q: Subset for Pregnant Respondents

16.5 Relational operators

16.5.1 Practice Q: Subset for Children

16.5.2 Practice Q: Subset for Tsinga and Messa

16.6 Accessing external variables in query()

16.6.1 Practice Q: Subset for Young Respondents

16.7 Combining conditions with & and |

16.7.1 Practice Q: Subset for IgG Positive Men

16.8 Negating conditions with the ~ operator

16.8.1 Practice Q: Drop Smokers and drop those over 50

16.9 NaN values

16.9.1 Practice Q: Keep Missing Smoking Status

16.10 Querying Based on String Patterns

16.10.1 Basic String Containment

16.10.2 Negating String Containment

16.10.3 Using | with string containment

16.10.4 Practice Q: Symptoms

16.11 Wrap up

16.4 Introducing `query()`

16.6 Accessing external variables in `query()`

16.7 Combining conditions with `&` and `|`

16.8 Negating conditions with the `~` operator

16.9 `NaN` values

16.10.3 Using `|` with string containment