10  Bivariate & Multivariate Graphs with Plotly Express

10.1 Introduction

In this lesson, you’ll learn how to create bivariate and multivariate graphs using Plotly Express. These types of graphs are essential for exploring relationships between two or more variables, whether they are quantitative or categorical. Understanding these relationships can provide deeper insights into your data.

Let’s dive in!

10.2 Learning Objectives

By the end of this lesson, you will be able to:

  • Create scatter plots for quantitative vs. quantitative data
  • Generate grouped histograms and violin plots for quantitative vs. categorical data
  • Create grouped, stacked, and percent-stacked bar charts for categorical vs. categorical data
  • Visualize time series data using bar charts and line charts
  • Create bubble charts to display relationships between three or more variables
  • Use faceting to compare distributions across subsets of data

10.3 Imports

This lesson requires plotly.express, pandas, numpy, and vega_datasets. Install them if you haven’t already.

import plotly.express as px
import pandas as pd
import numpy as np
from vega_datasets import data

10.4 Numeric vs. Numeric Data

When both variables are quantitative, scatter plots are an excellent way to visualize their relationship.

10.4.1 Scatter Plot

Let’s create a scatter plot to examine the relationship between total_bill and tip in the tips dataset. The tips dataset is included in Plotly Express and contains information about restaurant bills and tips that were collected by a waiter in a US restaurant.

First, we’ll load the dataset and preview it. Displaying a DataFrame shows its first and last five rows:

tips = px.data.tips()
tips
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

Next, we’ll create a basic scatter plot. We do this with the px.scatter function.

px.scatter(tips, x='total_bill', y='tip')

From the scatter plot, we can observe that as the total bill increases, the tip amount tends to increase as well.

Let’s enhance the scatter plot by adding labels and a title.

px.scatter(
    tips,
    x="total_bill",
    y="tip",
    labels={"total_bill": "Total Bill ($)", "tip": "Tip ($)"},
    title="Relationship Between Total Bill and Tip Amount",
)

Recall that you can see additional information about the function by typing px.scatter? in a cell and executing the cell.

px.scatter?
Practice

10.4.2 Practice Q: Life Expectancy vs. GDP Per Capita

Using the Gapminder dataset (the 2007 subset, g_2007, defined below), create a scatter plot showing the relationship between gdpPercap (GDP per capita) and lifeExp (life expectancy).

According to the plot, what is the relationship between GDP per capita and life expectancy?

gapminder = px.data.gapminder()
g_2007 = gapminder.query('year == 2007')
g_2007.head()
# Your code here
country continent year lifeExp pop gdpPercap iso_alpha iso_num
11 Afghanistan Asia 2007 43.828 31889923 974.580338 AFG 4
23 Albania Europe 2007 76.423 3600523 5937.029526 ALB 8
35 Algeria Africa 2007 72.301 33333216 6223.367465 DZA 12
47 Angola Africa 2007 42.731 12420476 4797.231267 AGO 24
59 Argentina Americas 2007 75.320 40301927 12779.379640 ARG 32

10.5 Numeric vs. Categorical Data

When one variable is quantitative and the other is categorical, we can use grouped histograms, violin plots, or box plots to visualize the distribution of the quantitative variable across different categories.

10.5.1 Grouped Histograms

First, here’s how you can create a regular histogram of all tips:

px.histogram(tips, x='tip')

To create a grouped histogram, use the color parameter to specify the categorical variable. Here, we’ll color the histogram by sex:

px.histogram(tips, x='tip', color='sex')

By default, the histograms for each category are stacked. To change this behavior, you can use the barmode parameter. For example, barmode='overlay' will create an overlaid histogram:

px.histogram(tips, x="tip", color="sex", barmode="overlay")

This creates two semi-transparent histograms overlaid on top of each other, allowing for direct comparison of the distributions.

Practice

10.5.2 Practice Q: Age Distribution by Gender

Using the la_riots dataset from vega_datasets, create a grouped histogram of age by gender. Compare the age distributions between different genders.

According to the plot, was the oldest victim male or female?

la_riots = data.la_riots()
la_riots.head()
# Your code here
first_name last_name age gender race death_date address neighborhood type longitude latitude
0 Cesar A. Aguilar 18.0 Male Latino 1992-04-30 2009 W. 6th St. Westlake Officer-involved shooting -118.273976 34.059281
1 George Alvarez 42.0 Male Latino 1992-05-01 Main & College streets Chinatown Not riot-related -118.234098 34.062690
2 Wilson Alvarez 40.0 Male Latino 1992-05-23 3100 Rosecrans Ave. Hawthorne Homicide -118.326816 33.901662
3 Brian E. Andrew 30.0 Male Black 1992-04-30 Rosecrans & Chester avenues Compton Officer-involved shooting -118.215390 33.903457
4 Vivian Austin 87.0 Female Black 1992-05-03 1600 W. 60th St. Harvard Park Death -118.304741 33.985667

10.5.3 Violin & Box Plots

Violin plots are useful for comparing the distribution of a quantitative variable across different categories. They show the probability density of the data at different values and can include a box plot to summarize key statistics.

First, let’s create a violin plot of all tips:

px.violin(tips, y="tip")

We can add a box plot to the violin plot by setting the box parameter to True:

px.violin(tips, y="tip", box=True)

For just the box plot, we can use px.box:

px.box(tips, y="tip")

To add jittered points to a violin or box plot, set the points parameter to 'all':

px.violin(tips, y="tip", points="all")

Now, to create a violin plot of tips by gender, use the x parameter to specify the categorical variable:

px.violin(tips, y="tip", x="sex", box=True)

We can also add a color axis to differentiate the violins:

px.violin(tips, y="tip", x="sex", color="sex", box=True)
Practice

10.5.4 Practice Q: Life Expectancy by Continent

Using the g_2007 dataset, create a violin plot showing the distribution of lifeExp by continent.

According to the plot, which continent has the highest median country life expectancy?

g_2007 = gapminder.query("year == 2007")
g_2007.head()
# Your code here
country continent year lifeExp pop gdpPercap iso_alpha iso_num
11 Afghanistan Asia 2007 43.828 31889923 974.580338 AFG 4
23 Albania Europe 2007 76.423 3600523 5937.029526 ALB 8
35 Algeria Africa 2007 72.301 33333216 6223.367465 DZA 12
47 Angola Africa 2007 42.731 12420476 4797.231267 AGO 24
59 Argentina Americas 2007 75.320 40301927 12779.379640 ARG 32

10.5.5 Summary Bar Charts (Mean and Standard Deviation)

Sometimes it’s useful to display the mean and standard deviation of a quantitative variable across different categories. This can be visualized using a bar chart with error bars.

First, let’s calculate the mean and standard deviation of tips for each gender. You have not yet learned how to do this, but you will in a later lesson.

# Calculate the mean and standard deviation
summary_df = (
    tips.groupby("sex")
    .agg(mean_tip=("tip", "mean"), std_tip=("tip", "std"))
    .reset_index()
)
summary_df
sex mean_tip std_tip
0 Female 2.833448 1.159495
1 Male 3.089618 1.489102

Next, we’ll create a bar chart using px.bar and add error bars using the error_y parameter:

# Create the bar chart
px.bar(summary_df, x="sex", y="mean_tip", error_y="std_tip")

This bar chart displays the average tip amount for each gender, with error bars representing the standard deviation.

Practice

10.5.6 Practice Q: Average Total Bill by Day

Using the tips dataset, create a bar chart of mean total_bill by day with standard deviation error bars. You should copy and paste the code from the example above and modify it to create this plot.

According to the plot, which day has the highest average total bill?

tips.head()  # View the tips dataset
# Your code here
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Side Note: Difference between px.bar and px.histogram

Notice that this is the first time we are using the px.bar function. For past plots, we have used px.histogram to make bar charts.

The bar chart function generally expects that the numeric variable being plotted is already in its own column, while the histogram function does the grouping for you.

For example, in the cell below, we use px.histogram to make a bar chart of the sex column. The resulting plot compares the number of male and female customers in the dataset.

px.histogram(tips, x='sex')

To make the same plot using px.bar, we first need to group by the sex column and count the number of rows for each sex.

sex_counts = tips['sex'].value_counts().reset_index()
sex_counts
sex count
0 Male 157
1 Female 87

We can then plot these counts using px.bar:

px.bar(sex_counts, x="sex", y="count")

This produces a bar chart with one bar for each sex.

10.6 Categorical vs. Categorical Data

When both variables are categorical, bar charts with a color axis are effective for visualizing the frequency distribution across categories. We will focus on three types of bar charts: stacked bar charts, percent-stacked bar charts, and grouped/clustered bar charts.

10.6.1 Stacked Bar Charts

Stacked bar charts show the total counts and the breakdown within each category. To make a stacked bar chart, use the color parameter to specify the categorical variable:

px.histogram(
    tips,
    x='day',
    color='sex'
)

Let’s add numbers to the bars to show the exact counts, and also improve the color palette with custom colors.

px.histogram(
    tips,
    x="day",
    color="sex",
    text_auto=True,
    color_discrete_sequence=["#deb221", "#2f828a"],
)

This stacked bar chart shows the total number of customers each day, broken down by gender.

Practice

10.6.2 Practice Q: High and Low Income Countries by Continent

Using the g_2007_income dataset, create a stacked bar chart showing the count of high and low income countries in each continent.

gap_dat = px.data.gapminder()

g_2007_income = (
    gap_dat.query("year == 2007")
    .drop(columns=["year", "iso_alpha", "iso_num"])
    .assign(
        income_group=lambda df: np.where(
            df.gdpPercap > 15000, "High Income", "Low & Middle Income"
        )
    )
)

g_2007_income.head()
# Your code here
country continent lifeExp pop gdpPercap income_group
11 Afghanistan Asia 43.828 31889923 974.580338 Low & Middle Income
23 Albania Europe 76.423 3600523 5937.029526 Low & Middle Income
35 Algeria Africa 72.301 33333216 6223.367465 Low & Middle Income
47 Angola Africa 42.731 12420476 4797.231267 Low & Middle Income
59 Argentina Americas 75.320 40301927 12779.379640 Low & Middle Income

10.6.3 Percent-Stacked Bar Charts

To show proportions instead of counts, we can create percent-stacked bar charts by setting the barnorm parameter to 'percent':

# Create the percent-stacked bar chart
px.histogram(tips, x="day", color="sex", barnorm="percent")

This chart normalizes the bar heights to represent percentages, showing the proportion of each gender for each day.

We can also add text labels to the bars to show the exact percentages:

px.histogram(tips, x="day", color="sex", barnorm="percent", text_auto=".1f")

The format string ".1f" passed to text_auto rounds the text labels to one decimal place.

Practice

10.6.4 Practice Q: Proportion of High and Low Income Countries by Continent

Again using the g_2007_income dataset, create a percent-stacked bar chart showing the proportion of high and low income countries in each continent. Add text labels to the bars to show the exact percentages.

According to the plot, which continent has the highest proportion of high income countries? Are there any limitations to this plot?

# Your code here

10.6.5 Clustered Bar Charts

For clustered bar charts, set the barmode parameter to 'group' to place the bars for each category side by side:

px.histogram(tips, x="day", color="sex", barmode="group")

This layout makes it easier to compare values across categories directly.

10.7 Time Series Data

Time series data represents observations collected at different points in time. It’s crucial for analyzing trends, patterns, and changes over time. Let’s explore some basic time series visualizations using Nigeria’s population data from the Gapminder dataset.

First, let’s prepare our data:

# Load the Gapminder dataset
gapminder = px.data.gapminder()

# Subset the data for Nigeria
nigeria_pop = gapminder.query('country == "Nigeria"')[['year', 'pop']]
nigeria_pop
year pop
1128 1952 33119096
1129 1957 37173340
1130 1962 41871351
1131 1967 47287752
1132 1972 53740085
1133 1977 62209173
1134 1982 73039376
1135 1987 81551520
1136 1992 93364244
1137 1997 106207839
1138 2002 119901274
1139 2007 135031164

10.7.1 Bar Chart

A bar chart can be used to plot time series data.

# Bar chart
px.bar(nigeria_pop, x="year", y="pop")

This bar chart gives us a clear view of how Nigeria’s population has changed over the years, with each bar representing the population at a specific year.

10.7.2 Line Chart

A line chart is excellent for showing continuous changes over time:

# Line chart
px.line(nigeria_pop, x="year", y="pop")

The line chart connects the population values, making it easier to see the overall trend of population growth.

Adding markers to a line chart can highlight specific data points:

# Line chart with points
px.line(nigeria_pop, x='year', y='pop', markers=True)

We can also compare the population growth of multiple countries by adding a color parameter:

nigeria_ghana = gapminder.query('country in ["Nigeria", "Ghana"]')
px.line(nigeria_ghana, x="year", y="pop", color="country", markers=True)

This chart allows us to compare the population trends of Nigeria and Ghana over time.

Practice

10.7.3 Practice Q: GDP per Capita Time Series

Using the Gapminder dataset, create a time series visualization for the GDP per capita of Iraq.

# Your code here

What happened to Iraq in the 1980s that might explain the graph shown?

10.8 Plots with three or more variables

Although bivariate visualizations are the most common types of visualizations, plots with three or more variables are also sometimes useful. Let’s explore a few examples.

10.8.1 Bubble Charts

Bubble charts show the relationship between three variables by mapping the size of the points to a third variable. Below, we plot the relationship between gdpPercap and lifeExp with the size of the points representing the population of the country.

px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop")

We can easily spot the largest countries by population, such as China, India, and the United States. We can also add a color axis to differentiate between continents:

px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop", color="continent")

Now we have four different variables being plotted:

  • gdpPercap on the x-axis
  • lifeExp on the y-axis
  • pop as the size of the points
  • continent as the color of the points
Practice

10.8.2 Practice Q: Tips Bubble Chart

Using the tips dataset, create a bubble chart showing the relationship between total_bill and tip with the size of the points representing the size of the party, and the color representing the day of the week.

Use the plot to answer the question:

  • The highest two tip amounts were on which days and what was the table size?
tips.head()
# Your code here
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

10.8.3 Facet Plots

Faceting splits a single plot into multiple plots, with each plot showing a different subset of the data. This is useful for comparing distributions across subsets.

For example, we can facet the bubble chart by continent:

px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
)

We can change the arrangement of the facets by changing the facet_col_wrap parameter. For example, facet_col_wrap=2 will wrap the facets into two columns:

px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
    facet_col_wrap=2,
)

Similarly, we can facet the violin plots of tips by day of the week:

px.violin(
    tips,
    x="sex",
    y="tip",
    color="sex",
    facet_col="day",
    facet_col_wrap=2,
)

Faceting allows us to compare distributions across different days, providing more granular insights.

Practice

10.8.4 Practice Q: Tips Facet Plot

Using the tips dataset, create a percent-stacked bar chart of the time column, colored by the sex column, and faceted by the day column.

Which day-time has the highest proportion of male customers (e.g. Friday Lunch, Saturday Dinner, etc.)?

tips.head()
# Your code here
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

10.9 Summary

In this lesson, you learned how to create bivariate and multivariate graphs using Plotly Express. Understanding these visualization techniques will help you explore and communicate relationships in your data more effectively.

See you in the next lesson!

10.10 Solutions

10.10.1 Solution Q: Life Expectancy vs. GDP Per Capita

px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    labels={"gdpPercap": "GDP per Capita", "lifeExp": "Life Expectancy"},
    title="Life Expectancy vs. GDP per Capita (2007)",
)

According to the plot, there is a positive relationship between GDP per capita and life expectancy, though it appears to level off at higher GDP values.

10.10.2 Solution Q: Age Distribution by Gender

px.histogram(
    la_riots,
    x="age",
    color="gender",
    barmode="overlay",
    title="Age Distribution by Gender in LA Riots Dataset"
)

According to the plot, the oldest victim was female.

10.10.3 Solution Q: Life Expectancy by Continent

px.violin(
    g_2007,
    y="lifeExp",
    x="continent",
    box=True,
    title="Life Expectancy Distribution by Continent (2007)",
)

According to the plot, Oceania has the highest median country life expectancy.

10.10.4 Solution Q: Average Total Bill by Day

# Calculate the mean and standard deviation
summary_df = (
    tips.groupby("day")
    .agg(mean_bill=("total_bill", "mean"), std_bill=("total_bill", "std"))
    .reset_index()
)

# Create the bar chart
px.bar(
    summary_df,
    x="day",
    y="mean_bill",
    error_y="std_bill",
    title="Average Total Bill by Day"
)

According to the plot, Sunday has the highest average total bill, with Saturday close behind.

10.10.5 Solution Q: High and Low Income Countries by Continent

px.histogram(
    g_2007_income,
    x="continent",
    color="income_group",
    title="Count of High and Low Income Countries by Continent",
    text_auto=True
)

10.10.6 Solution Q: Proportion of High and Low Income Countries by Continent

px.histogram(
    g_2007_income,
    x="continent",
    color="income_group",
    barnorm="percent",
    text_auto=".1f",
    title="Proportion of High and Low Income Countries by Continent"
)

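According to the plot, Oceania has the highest proportion of high-income countries. A limitation is that the plot hides how many countries are behind each bar: Oceania has only two countries in this dataset. The plot also treats all countries equally regardless of their population or size.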

10.10.7 Solution Q: GDP per Capita Time Series

iraq_gdp = gapminder.query('country == "Iraq"')
px.line(
    iraq_gdp,
    x="year",
    y="gdpPercap",
    markers=True,
    title="Iraq GDP per Capita Over Time"
)

The Iran-Iraq War (1980-1988) likely explains the significant drop in GDP per capita during the 1980s.

10.10.8 Solution Q: Tips Bubble Chart

px.scatter(
    tips,
    x="total_bill",
    y="tip",
    size="size",
    color="day",
    title="Tips vs Total Bill by Party Size and Day"
)

According to the plot, the highest two tips were both given on Saturday, one from a party of 3 and the other from a party of 4.

10.10.9 Solution Q: Tips Facet Plot

px.histogram(
    tips,
    x="time",
    color="sex",
    facet_col="day",
    barnorm="percent",
    text_auto=".1f",
    title="Proportion of Male vs Female Customers by Day and Time"
)

According to the plot, Sunday dinner has the highest proportion of male customers.