16  Querying rows

16.1 Intro

Querying rows is one of the most frequently used operations in data analysis. It allows you to filter your dataset to focus on specific subsets of interest, enabling more targeted and efficient analysis.

In this lesson, we’ll explore various techniques to subset rows in pandas.

Let’s get started!

16.2 Learning objectives

  1. You can use the query() method to keep or drop rows from a DataFrame.
  2. You can specify conditions using relational operators like greater than (>), less than (<), equal to (==), not equal to (!=), and is an element of (isin()).
  3. You can combine conditions with & and |.
  4. You can negate conditions with ~.
  5. You can use the isna() and notna() methods.
  6. You can query based on string patterns using str.contains().

16.3 The Yaounde COVID-19 dataset

In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.

You can download the dataset from this link: yaounde_data.csv

You can find out more about this dataset here: https://www.nature.com/articles/s41467-021-25946-0

Let’s load the data into a pandas DataFrame.

import pandas as pd

yaounde = pd.read_csv("data/yaounde_data.csv")
# a smaller subset of variables
yao = yaounde[
    [
        "age",
        "sex",
        "weight_kg",
        "neighborhood",
        "occupation",
        "symptoms",
        "is_smoker",
        "is_pregnant",
        "igg_result",
        "igm_result",
    ]
]
yao.head()
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
0 45 Female 95 Briqueterie Informal worker Muscle pain Non-smoker No Negative Negative
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
2 23 Male 74 Briqueterie Student No symptoms Smoker NaN Negative Negative
3 20 Female 70 Briqueterie Student Rhinitis--Sneezing--Anosmia or ageusia Non-smoker No Positive Negative
4 55 Female 67 Briqueterie Trader--Farmer No symptoms Non-smoker No Positive Negative

16.4 Introducing query()

We can use the query() method to keep rows that satisfy a set of conditions. Let’s take a look at a simple example. If we want to keep just the male records, we run:

yao.query('sex == "Male"')
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
2 23 Male 74 Briqueterie Student No symptoms Smoker NaN Negative Negative
... ... ... ... ... ... ... ... ... ... ...
966 32 Male 54 Tsinga Oliga Informal worker Rhinitis--Sneezing--Diarrhoea Smoker NaN Negative Negative
968 35 Male 77 Tsinga Oliga Informal worker Headache Smoker NaN Positive Negative

422 rows × 10 columns

As you can see, the query() syntax is quite simple. (It may be a bit surprising to have to put code in quotes, but it is quite readable.)

Note the use of double equals (==) instead of single equals (=) there. The == sign tests for equality, while the single equals sign assigns a value. This is a common source of errors when you are a beginner, so watch out for it.

We can chain query() with shape[0] to count the number of male respondents.

yao.query('sex == "Male"').shape[0]
422
Reminder

The shape property returns the number of rows and columns in a DataFrame. The first element, shape[0], is the number of rows, and the second element, shape[1], is the number of columns.

For example:

yao.shape
(971, 10)
yao.shape[0] # rows
971
yao.shape[1]  # columns
10
Key Point

Note that these subsets are not modifying the DataFrame itself. If we want a modified version, we create a new DataFrame to store the subset. For example, below we create a subset of male respondents:

yao_male = yao.query('sex == "Male"')
yao_male
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
2 23 Male 74 Briqueterie Student No symptoms Smoker NaN Negative Negative
... ... ... ... ... ... ... ... ... ... ...
966 32 Male 54 Tsinga Oliga Informal worker Rhinitis--Sneezing--Diarrhoea Smoker NaN Negative Negative
968 35 Male 77 Tsinga Oliga Informal worker Headache Smoker NaN Positive Negative

422 rows × 10 columns

But for ease of explanation, in the examples below, we are simply printing the result, without storing it in a variable.

Practice

16.4.1 Practice Q: Subset for Pregnant Respondents

Subset the yao data frame to respondents who were pregnant during the survey (The is_pregnant column contains “Yes”, “No” or NaN). Assign the result to a new DataFrame called yao_pregnant. Then print this new DataFrame. There should be 24 rows.

# Your code here

16.5 Relational operators

The == operator introduced above is an example of a “relational” operator, as it tests the relation between two values. Here is a list of some more of these operators. You will use these often when you are querying rows in your data.

Operator is True if
A == B A is equal to B
A != B A is not equal to B
A < B A is less than B
A <= B A is less than or equal to B
A > B A is greater than B
A >= B A is greater than or equal to B
A.isin([B]) A is an element of B

Let’s see how to use these with query():

yao.query('sex == "Female"')  # keep rows where `sex` is female
yao.query('sex != "Male"')  # keep rows where `sex` is not "Male"
yao.query("age < 6")  # keep respondents under 6
yao.query("age >= 70")  # keep respondents aged at least 70

# keep respondents whose neighbourhood is "Tsinga" or "Messa"
yao.query('neighborhood.isin(["Tsinga", "Messa"])')
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
605 55 Male 70 Messa Informal worker No symptoms Non-smoker NaN Negative Negative
606 59 Female 59 Messa Trader No symptoms Non-smoker No Negative Negative
... ... ... ... ... ... ... ... ... ... ...
902 25 Female 41 Tsinga Unemployed No symptoms Non-smoker No Negative Negative
903 28 Male 69 Tsinga Trader Sneezing--Headache Ex-smoker NaN Negative Negative

129 rows × 10 columns

Practice

16.5.1 Practice Q: Subset for Children

  • From yao, keep only respondents who were children (under 18). Assign the result to a new DataFrame called yao_children. There should be 291 rows.
# Your code here
Practice

16.5.2 Practice Q: Subset for Tsinga and Messa

  • With isin(), keep only respondents who live in the “Carriere” or “Ekoudou” neighborhoods. Assign the result to a new DataFrame called yao_carriere_ekoudou. There should be 426 rows.
# Your code here

16.6 Accessing external variables in query()

The query() method allows you to access variables outside the DataFrame using the @ symbol. This is useful when you want to use dynamic values in your query conditions.

For example, say you have the variable min_age that you want to use in your query. You can do this as follows:

min_age = 25

# Query using external variables
yao.query('age >= @min_age')
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
0 45 Female 95 Briqueterie Informal worker Muscle pain Non-smoker No Negative Negative
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
... ... ... ... ... ... ... ... ... ... ...
968 35 Male 77 Tsinga Oliga Informal worker Headache Smoker NaN Positive Negative
969 31 Female 66 Tsinga Oliga Unemployed No symptoms Non-smoker No Negative Negative

524 rows × 10 columns

This feature is helpful when you need to filter data based on values that may change or are determined at runtime.

Practice

16.6.1 Practice Q: Subset for Young Respondents

  • From yao, keep respondents who are less than or equal to the variable max_age, defined below. Assign the result to a new DataFrame called yao_young. There should be 590 rows.
max_age = 30
# Your code here

16.7 Combining conditions with & and |

We can pass multiple conditions to query() using & (the “ampersand” symbol) for AND and | (the “vertical bar” or “pipe” symbol) for OR.

For example, to keep respondents who are either younger than 18 OR older than 65, we can write:

yao.query("age < 18 | age > 65")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
5 17 Female 65 Briqueterie Student Fever--Cough--Rhinitis--Nausea or vomiting--Di... Non-smoker No Negative Negative
6 13 Female 65 Briqueterie Student Sneezing Non-smoker No Positive Negative
... ... ... ... ... ... ... ... ... ... ...
962 15 Male 44 Tsinga Oliga Student Fever--Cough--Rhinitis Non-smoker NaN Positive Negative
970 17 Female 67 Tsinga Oliga Unemployed No symptoms Non-smoker No response Negative Negative

331 rows × 10 columns

To keep respondents who are pregnant and are ex-smokers, we write:

yao.query('is_pregnant == "Yes" & is_smoker == "Ex-smoker"')
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
273 25 Female 90 Carriere Home-maker Cough--Rhinitis--Sneezing Ex-smoker Yes Positive Negative

To keep all respondents who are pregnant or ex-smokers, we write:

yao.query('is_pregnant == "Yes" | is_smoker == "Ex-smoker"')
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
14 42 Male 71 Briqueterie Trader No symptoms Ex-smoker NaN Negative Negative
... ... ... ... ... ... ... ... ... ... ...
953 31 Female 90 Tsinga Oliga Salaried worker Fever--Cough--Sore throat--Headache Ex-smoker No Positive Negative
967 23 Female 76 Tsinga Oliga Informal worker--Trader Headache Non-smoker Yes Negative Negative

94 rows × 10 columns

Side note

To get the unique values in a column, you can use the value_counts() method.

yao.is_smoker.value_counts()
is_smoker
Non-smoker    859
Ex-smoker      71
Smoker         39
Name: count, dtype: int64
Practice

16.7.1 Practice Q: Subset for IgG Positive Men

Subset yao to only keep men who tested IgG positive. Assign the result to a new DataFrame called yao_igg_positive_men. There should be 148 rows after your query. Think carefully about whether to use & or |.

# Your code here

16.8 Negating conditions with the ~ operator

To negate conditions in query(), we use the ~ operator (pronounced “tilde”).

Let’s use this to drop respondents who are students:

yao.query('~ (occupation == "Student")')
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
0 45 Female 95 Briqueterie Informal worker Muscle pain Non-smoker No Negative Negative
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
... ... ... ... ... ... ... ... ... ... ...
969 31 Female 66 Tsinga Oliga Unemployed No symptoms Non-smoker No Negative Negative
970 17 Female 67 Tsinga Oliga Unemployed No symptoms Non-smoker No response Negative Negative

588 rows × 10 columns

Notice that we have to enclose the condition in parentheses.

We can also enclose multiple conditions in parentheses.

Imagine we want to give out a drug, but since it is a strong drug, we don’t want children or lightweight (under 30kg) respondents to take it. First, we can write a query to select the children and these light respondents:

yao.query("age < 18 | weight_kg < 30")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
5 17 Female 65 Briqueterie Student Fever--Cough--Rhinitis--Nausea or vomiting--Di... Non-smoker No Negative Negative
6 13 Female 65 Briqueterie Student Sneezing Non-smoker No Positive Negative
... ... ... ... ... ... ... ... ... ... ...
962 15 Male 44 Tsinga Oliga Student Fever--Cough--Rhinitis Non-smoker NaN Positive Negative
970 17 Female 67 Tsinga Oliga Unemployed No symptoms Non-smoker No response Negative Negative

291 rows × 10 columns

Now to drop these individuals, we can negate the condition with ~:

yao.query("~ (age < 18 | weight_kg < 30)")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
0 45 Female 95 Briqueterie Informal worker Muscle pain Non-smoker No Negative Negative
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
... ... ... ... ... ... ... ... ... ... ...
968 35 Male 77 Tsinga Oliga Informal worker Headache Smoker NaN Positive Negative
969 31 Female 66 Tsinga Oliga Unemployed No symptoms Non-smoker No Negative Negative

680 rows × 10 columns

This could also be written as:

yao.query("age >= 18 & weight_kg >= 30")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
0 45 Female 95 Briqueterie Informal worker Muscle pain Non-smoker No Negative Negative
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
... ... ... ... ... ... ... ... ... ... ...
968 35 Male 77 Tsinga Oliga Informal worker Headache Smoker NaN Positive Negative
969 31 Female 66 Tsinga Oliga Unemployed No symptoms Non-smoker No Negative Negative

680 rows × 10 columns

But sometimes negated conditions are easier to read.

Practice

16.8.1 Practice Q: Drop Smokers and drop those over 50

We want to avoid giving a drug to older individuals and smokers. From yao, drop respondents that are either above 50 or who are smokers. Use ~ to negate the conditions. Assign the result to a new DataFrame called yao_dropped. Your output should have 810 rows.

# Your code here

16.9 NaN values

The relational operators introduced so far do not work with null values like NaN.

For example, the is_pregnant column contains (NA) values for men. To keep the rows with missing is_pregnant values, we could try writing:

yao.query("is_pregnant == NaN")  # does not work

But this will not work. This is because NaN is a non-existent value. So the system cannot evaluate whether it is “equal to” or “not equal to” anything.

Instead, we can use the isna() method to select rows with missing values:

yao.query("is_pregnant.isna()")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
2 23 Male 74 Briqueterie Student No symptoms Smoker NaN Negative Negative
... ... ... ... ... ... ... ... ... ... ...
966 32 Male 54 Tsinga Oliga Informal worker Rhinitis--Sneezing--Diarrhoea Smoker NaN Negative Negative
968 35 Male 77 Tsinga Oliga Informal worker Headache Smoker NaN Positive Negative

422 rows × 10 columns

Or we can select rows that are not missing with notna():

yao.query("is_pregnant.notna()")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
0 45 Female 95 Briqueterie Informal worker Muscle pain Non-smoker No Negative Negative
3 20 Female 70 Briqueterie Student Rhinitis--Sneezing--Anosmia or ageusia Non-smoker No Positive Negative
... ... ... ... ... ... ... ... ... ... ...
969 31 Female 66 Tsinga Oliga Unemployed No symptoms Non-smoker No Negative Negative
970 17 Female 67 Tsinga Oliga Unemployed No symptoms Non-smoker No response Negative Negative

549 rows × 10 columns

Practice

16.9.1 Practice Q: Keep Missing Smoking Status

From the yao dataset, keep all the respondents who had NA records for the report of their smoking status.

# Your code here

16.10 Querying Based on String Patterns

Sometimes, we need to filter our data based on whether a string column contains a certain substring. This is particularly useful when dealing with multi-answer type variables, where responses may contain multiple values separated by delimiters. Let’s explore this using the occupation column in our dataset.

First, let’s take a look at the unique values in the occupation column:

yao.occupation.value_counts().to_dict()
{'Student': 383,
 'Informal worker': 189,
 'Trader': 111,
 'Unemployed': 68,
 'Home-maker': 65,
 'Salaried worker': 54,
 'Retired': 27,
 'Student--Informal worker': 13,
 'Other': 13,
 'No response': 9,
 'Farmer': 5,
 'Informal worker--Trader': 4,
 'Student--Trader': 4,
 'Trader--Farmer': 4,
 'Home-maker--Informal worker': 3,
 'Home-maker--Trader': 3,
 'Retired--Informal worker': 3,
 'Informal worker--Other': 2,
 'Home-maker--Farmer': 2,
 'Student--Other': 1,
 'Farmer--Other': 1,
 'Trader--Unemployed': 1,
 'Retired--Other': 1,
 'Informal worker--Unemployed': 1,
 'Retired--Trader': 1,
 'Home-maker--Informal worker--Farmer': 1,
 'Student--Informal worker--Other': 1,
 'Informal worker--Trader--Farmer--Other': 1}

As we can see, some respondents have multiple occupations, separated by “–”. To query based on string containment, we can use the str.contains() method within our query().

16.10.1 Basic String Containment

To find all respondents who are students (either solely or in combination with other occupations), we can use:

yao.query("occupation.str.contains('Student')")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
2 23 Male 74 Briqueterie Student No symptoms Smoker NaN Negative Negative
3 20 Female 70 Briqueterie Student Rhinitis--Sneezing--Anosmia or ageusia Non-smoker No Positive Negative
... ... ... ... ... ... ... ... ... ... ...
963 26 Female 63 Tsinga Oliga Student No symptoms Non-smoker No Negative Negative
964 28 Male 76 Tsinga Oliga Student--Informal worker No symptoms Non-smoker NaN Negative Negative

402 rows × 10 columns

This query will return all rows where the occupation column contains the word “Student”, regardless of whether it’s the only occupation or part of a multiple-occupation entry.

16.10.2 Negating String Containment

To find respondents who are not students (i.e., their occupation does not contain “Student”), you can use the ~ operator:

yao.query("~occupation.str.contains('Student')")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
0 45 Female 95 Briqueterie Informal worker Muscle pain Non-smoker No Negative Negative
1 55 Male 96 Briqueterie Salaried worker No symptoms Ex-smoker NaN Positive Negative
... ... ... ... ... ... ... ... ... ... ...
969 31 Female 66 Tsinga Oliga Unemployed No symptoms Non-smoker No Negative Negative
970 17 Female 67 Tsinga Oliga Unemployed No symptoms Non-smoker No response Negative Negative

569 rows × 10 columns

16.10.3 Using | with string containment

To find respondents who are students or farmers, we can use:

yao.query("occupation.str.contains('Student|Farmer')")
age sex weight_kg neighborhood occupation symptoms is_smoker is_pregnant igg_result igm_result
2 23 Male 74 Briqueterie Student No symptoms Smoker NaN Negative Negative
3 20 Female 70 Briqueterie Student Rhinitis--Sneezing--Anosmia or ageusia Non-smoker No Positive Negative
... ... ... ... ... ... ... ... ... ... ...
963 26 Female 63 Tsinga Oliga Student No symptoms Non-smoker No Negative Negative
964 28 Male 76 Tsinga Oliga Student--Informal worker No symptoms Non-smoker NaN Negative Negative

416 rows × 10 columns

Practice

16.10.4 Practice Q: Symptoms

The symptoms column contains a list of symptoms that respondents reported.

Query yao to find respondents who reported “Cough” or “Fever” as symptoms. Your answer should have 219 rows.

# Your code here

16.11 Wrap up

Great job! You’ve learned how to select specific columns and filter rows based on various conditions.

These skills allow you to focus on relevant data and create targeted subsets for analysis.

Next, we’ll explore how to modify and transform your data, further expanding your data wrangling toolkit. See you in the next lesson!