import plotly.express as px
import pandas as pd
import numpy as np
from vega_datasets import data
10 Bivariate & Multivariate Graphs with Plotly Express
10.1 Introduction
In this lesson, you’ll learn how to create bivariate and multivariate graphs using Plotly Express. These types of graphs are essential for exploring relationships between two or more variables, whether they are quantitative or categorical. Understanding these relationships can provide deeper insights into your data.
Let’s dive in!
10.2 Learning Objectives
By the end of this lesson, you will be able to:
- Create scatter plots for quantitative vs. quantitative data
- Generate grouped histograms and violin plots for quantitative vs. categorical data
- Create grouped, stacked, and percent-stacked bar charts for categorical vs. categorical data
- Visualize time series data using bar charts and line charts
- Create bubble charts to display relationships between three or more variables
- Use faceting to compare distributions across subsets of data
10.3 Imports
This lesson requires plotly.express, pandas, numpy, and vega_datasets. Install them if you haven’t already.
10.4 Numeric vs. Numeric Data
When both variables are quantitative, scatter plots are an excellent way to visualize their relationship.
10.4.1 Scatter Plot
Let’s create a scatter plot to examine the relationship between total_bill and tip in the tips dataset. The tips dataset is included in Plotly Express and contains information about restaurant bills and tips collected by a waiter in a US restaurant.
First, we’ll load the dataset and view the first five rows:
tips = px.data.tips()
tips
| | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
Next, we’ll create a basic scatter plot with the px.scatter function.
px.scatter(tips, x='total_bill', y='tip')
From the scatter plot, we can observe that as the total bill increases, the tip amount tends to increase as well.
Let’s enhance the scatter plot by adding labels and a title.
px.scatter(
    tips,
    x="total_bill",
    y="tip",
    labels={"total_bill": "Total Bill ($)", "tip": "Tip ($)"},
    title="Relationship Between Total Bill and Tip Amount",
)
Recall that you can see additional information about the function by typing px.scatter? in a cell and executing the cell.
px.scatter?
10.4.2 Practice Q: Life Expectancy vs. GDP Per Capita
Using the Gapminder dataset (the 2007 subset, g_2007, defined below), create a scatter plot showing the relationship between gdpPercap (GDP per capita) and lifeExp (life expectancy).
According to the plot, what is the relationship between GDP per capita and life expectancy?
gapminder = px.data.gapminder()
g_2007 = gapminder.query('year == 2007')

g_2007.head()
# Your code here
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
35 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.367465 | DZA | 12 |
47 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.231267 | AGO | 24 |
59 | Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.379640 | ARG | 32 |
10.5 Numeric vs. Categorical Data
When one variable is quantitative and the other is categorical, we can use grouped histograms, violin plots, or box plots to visualize the distribution of the quantitative variable across different categories.
10.5.1 Grouped Histograms
First, here’s how you can create a regular histogram of all tips:
px.histogram(tips, x='tip')
To create a grouped histogram, use the color parameter to specify the categorical variable. Here, we’ll color the histogram by sex:
px.histogram(tips, x='tip', color='sex')
By default, the histograms for each category are stacked. To change this behavior, you can use the barmode parameter. For example, barmode='overlay' will create an overlaid histogram:
px.histogram(tips, x="tip", color="sex", barmode="overlay")
This creates two semi-transparent histograms overlaid on top of each other, allowing for direct comparison of the distributions.
10.5.2 Practice Q: Age Distribution by Gender
Using the la_riots dataset from vega_datasets, create a grouped histogram of age by gender. Compare the age distributions between different genders.
According to the plot, was the oldest victim male or female?
la_riots = data.la_riots()

la_riots.head()
# Your code here
| | first_name | last_name | age | gender | race | death_date | address | neighborhood | type | longitude | latitude |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Cesar A. | Aguilar | 18.0 | Male | Latino | 1992-04-30 | 2009 W. 6th St. | Westlake | Officer-involved shooting | -118.273976 | 34.059281 |
1 | George | Alvarez | 42.0 | Male | Latino | 1992-05-01 | Main & College streets | Chinatown | Not riot-related | -118.234098 | 34.062690 |
2 | Wilson | Alvarez | 40.0 | Male | Latino | 1992-05-23 | 3100 Rosecrans Ave. | Hawthorne | Homicide | -118.326816 | 33.901662 |
3 | Brian E. | Andrew | 30.0 | Male | Black | 1992-04-30 | Rosecrans & Chester avenues | Compton | Officer-involved shooting | -118.215390 | 33.903457 |
4 | Vivian | Austin | 87.0 | Female | Black | 1992-05-03 | 1600 W. 60th St. | Harvard Park | Death | -118.304741 | 33.985667 |
10.5.3 Violin & Box Plots
Violin plots are useful for comparing the distribution of a quantitative variable across different categories. They show the probability density of the data at different values and can include a box plot to summarize key statistics.
First, let’s create a violin plot of all tips:
px.violin(tips, y="tip")
We can add a box plot to the violin plot by setting the box parameter to True:
px.violin(tips, y="tip", box=True)
For just the box plot, we can use px.box:
px.box(tips, y="tip")
To overlay the individual data points, with jitter, on a violin or box plot, set points='all':
px.violin(tips, y="tip", points="all")
Now, to create a violin plot of tips by gender, use the x parameter to specify the categorical variable:
px.violin(tips, y="tip", x="sex", box=True)
We can also add a color axis to differentiate the violins:
px.violin(tips, y="tip", x="sex", color="sex", box=True)
10.5.4 Practice Q: Life Expectancy by Continent
Using the g_2007 dataset, create a violin plot showing the distribution of lifeExp by continent.
According to the plot, which continent has the highest median country life expectancy?
g_2007 = gapminder.query("year == 2007")

g_2007.head()
# Your code here
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
35 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.367465 | DZA | 12 |
47 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.231267 | AGO | 24 |
59 | Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.379640 | ARG | 32 |
10.5.5 Summary Bar Charts (Mean and Standard Deviation)
Sometimes it’s useful to display the mean and standard deviation of a quantitative variable across different categories. This can be visualized using a bar chart with error bars.
First, let’s calculate the mean and standard deviation of tips for each gender. You have not yet learned how to do this, but you will in a later lesson.
# Calculate the mean and standard deviation
summary_df = (
    tips.groupby("sex")
    .agg(mean_tip=("tip", "mean"), std_tip=("tip", "std"))
    .reset_index()
)
summary_df
| | sex | mean_tip | std_tip |
---|---|---|---|
0 | Female | 2.833448 | 1.159495 |
1 | Male | 3.089618 | 1.489102 |
Next, we’ll create a bar chart using px.bar and add error bars using the error_y parameter:
# Create the bar chart
px.bar(summary_df, x="sex", y="mean_tip", error_y="std_tip")
This bar chart displays the average tip amount for each gender, with error bars representing the standard deviation.
10.5.6 Practice Q: Average Total Bill by Day
Using the tips dataset, create a bar chart of mean total_bill by day with standard deviation error bars. You should copy and paste the code from the example above and modify it to create this plot.
According to the plot, which day has the highest average total bill?
# View the tips dataset
tips.head()
# Your code here
| | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
px.bar and px.histogram
Notice that this is the first time we are using the px.bar function. For past plots, we have used px.histogram to make bar charts.
The bar chart function generally expects that the numeric variable being plotted is already in its own column, while the histogram function does the grouping for you.
For example, in the cell below, we use px.histogram to make a bar chart of the sex column. The resulting plot compares the number of male and female customers in the dataset.
px.histogram(tips, x='sex')
To make the same plot using px.bar, we first need to group by the sex column and count the number of rows for each sex.
sex_counts = tips['sex'].value_counts().reset_index()
sex_counts
| | sex | count |
---|---|---|
0 | Male | 157 |
1 | Female | 87 |
We can then plot these counts using px.bar:
px.bar(sex_counts, x="sex", y="count")
This produces a bar chart with one bar for each sex.
10.6 Categorical vs. Categorical Data
When both variables are categorical, bar charts with a color axis are effective for visualizing the frequency distribution across categories. We will focus on three types of bar charts: stacked bar charts, percent-stacked bar charts, and grouped/clustered bar charts.
10.6.1 Stacked Bar Charts
Stacked bar charts show the total counts and the breakdown within each category. To make a stacked bar chart, use the color parameter to specify the categorical variable:
px.histogram(
    tips,
    x='day',
    color='sex'
)
Let’s add numbers to the bars to show the exact counts, and also improve the color palette with custom colors.
px.histogram(
    tips,
    x="day",
    color="sex",
    text_auto=True,
    color_discrete_sequence=["#deb221", "#2f828a"],
)
This stacked bar chart shows the total number of customers each day, broken down by gender.
10.6.2 Practice Q: High and Low Income Countries by Continent
Using the g_2007_income dataset, create a stacked bar chart showing the count of high and low income countries in each continent.
gap_dat = px.data.gapminder()

g_2007_income = (
    gap_dat.query("year == 2007")
    .drop(columns=["year", "iso_alpha", "iso_num"])
    .assign(
        income_group=lambda df: np.where(
            df.gdpPercap > 15000, "High Income", "Low & Middle Income"
        )
    )
)

g_2007_income.head()
# Your code here
| | country | continent | lifeExp | pop | gdpPercap | income_group |
---|---|---|---|---|---|---|
11 | Afghanistan | Asia | 43.828 | 31889923 | 974.580338 | Low & Middle Income |
23 | Albania | Europe | 76.423 | 3600523 | 5937.029526 | Low & Middle Income |
35 | Algeria | Africa | 72.301 | 33333216 | 6223.367465 | Low & Middle Income |
47 | Angola | Africa | 42.731 | 12420476 | 4797.231267 | Low & Middle Income |
59 | Argentina | Americas | 75.320 | 40301927 | 12779.379640 | Low & Middle Income |
10.6.3 Percent-Stacked Bar Charts
To show proportions instead of counts, we can create percent-stacked bar charts by setting the barnorm parameter to 'percent':
# Create the percent-stacked bar chart
px.histogram(tips, x="day", color="sex", barnorm="percent")
This chart normalizes the bar heights to represent percentages, showing the proportion of each gender for each day.
We can also add text labels to the bars to show the exact percentages:
px.histogram(tips, x="day", color="sex", barnorm="percent", text_auto=".1f")
The format string ".1f" passed to the text_auto parameter formats the text labels to one decimal place.
10.6.4 Practice Q: Proportion of High and Low Income Countries by Continent
Again using the g_2007_income dataset, create a percent-stacked bar chart showing the proportion of high and low income countries in each continent. Add text labels to the bars to show the exact percentages.
According to the plot, which continent has the highest proportion of high income countries? Are there any limitations to this plot?
# Your code here
10.6.5 Clustered Bar Charts
For clustered bar charts, set the barmode parameter to 'group' to place the bars for each category side by side:
px.histogram(tips, x="day", color="sex", barmode="group")
This layout makes it easier to compare values across categories directly.
10.7 Time Series Data
Time series data represents observations collected at different points in time. It’s crucial for analyzing trends, patterns, and changes over time. Let’s explore some basic time series visualizations using Nigeria’s population data from the Gapminder dataset.
First, let’s prepare our data:
# Load the Gapminder dataset
gapminder = px.data.gapminder()

# Subset the data for Nigeria
nigeria_pop = gapminder.query('country == "Nigeria"')[['year', 'pop']]
nigeria_pop
| | year | pop |
---|---|---|
1128 | 1952 | 33119096 |
1129 | 1957 | 37173340 |
1130 | 1962 | 41871351 |
1131 | 1967 | 47287752 |
1132 | 1972 | 53740085 |
1133 | 1977 | 62209173 |
1134 | 1982 | 73039376 |
1135 | 1987 | 81551520 |
1136 | 1992 | 93364244 |
1137 | 1997 | 106207839 |
1138 | 2002 | 119901274 |
1139 | 2007 | 135031164 |
10.7.1 Bar Chart
A bar chart can be used to plot time series data.
# Bar chart
px.bar(nigeria_pop, x="year", y="pop")
This bar chart gives us a clear view of how Nigeria’s population has changed over the years, with each bar representing the population at a specific year.
10.7.2 Line Chart
A line chart is excellent for showing continuous changes over time:
# Line chart
px.line(nigeria_pop, x="year", y="pop")
The line chart connects the population values, making it easier to see the overall trend of population growth.
Adding markers to a line chart can highlight specific data points:
# Line chart with points
px.line(nigeria_pop, x='year', y='pop', markers=True)
We can also compare the population growth of multiple countries by adding a color parameter:
nigeria_ghana = gapminder.query('country in ["Nigeria", "Ghana"]')
px.line(nigeria_ghana, x="year", y="pop", color="country", markers=True)
This chart allows us to compare the population trends of Nigeria and Ghana over time.
10.7.3 Practice Q: GDP per Capita Time Series
Using the Gapminder dataset, create a time series visualization for the GDP per capita of Iraq.
# Your code here
What happened to Iraq in the 1980s that might explain the graph shown?
10.8 Plots with Three or More Variables
Although bivariate visualizations are the most common types of visualizations, plots with three or more variables are also sometimes useful. Let’s explore a few examples.
10.8.1 Bubble Charts
Bubble charts show the relationship between three variables by mapping the size of the points to a third variable. Below, we plot the relationship between gdpPercap and lifeExp with the size of the points representing the population of the country.
px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop")
We can easily spot the largest countries by population, such as China, India, and the United States. We can also add a color axis to differentiate between continents:
px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop", color="continent")
Now we have four different variables being plotted:
- gdpPercap on the x-axis
- lifeExp on the y-axis
- pop as the size of the points
- continent as the color of the points
10.8.2 Practice Q: Tips Bubble Chart
Using the tips dataset, create a bubble chart showing the relationship between total_bill and tip with the size of the points representing the size of the party, and the color representing the day of the week.
Use the plot to answer the question:
- The highest two tip amounts were on which days and what was the table size?
tips.head()
# Your code here
| | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
10.8.3 Facet Plots
Faceting splits a single plot into multiple plots, with each plot showing a different subset of the data. This is useful for comparing distributions across subsets.
For example, we can facet the bubble chart by continent:
px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
)
We can change the arrangement of the facets by changing the facet_col_wrap parameter. For example, facet_col_wrap=2 will wrap the facets into two columns:
px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
    facet_col_wrap=2,
)
Similarly, we can facet the violin plots of tips by day of the week:
px.violin(
    tips,
    x="sex",
    y="tip",
    color="sex",
    facet_col="day",
    facet_col_wrap=2,
)
Faceting allows us to compare distributions across different days, providing more granular insights.
10.8.4 Practice Q: Tips Facet Plot
Using the tips dataset, create a percent-stacked bar chart of the time column, colored by the sex column, and faceted by the day column.
Which day-time has the highest proportion of male customers (e.g. Friday Lunch, Saturday Dinner, etc.)?
tips.head()
# Your code here
| | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
10.9 Summary
In this lesson, you learned how to create bivariate and multivariate graphs using Plotly Express. Understanding these visualization techniques will help you explore and communicate relationships in your data more effectively.
See you in the next lesson!
10.10 Solutions
10.10.1 Solution Q: Life Expectancy vs. GDP Per Capita
px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    labels={"gdpPercap": "GDP per Capita", "lifeExp": "Life Expectancy"},
    title="Life Expectancy vs. GDP per Capita (2007)",
)
According to the plot, there is a positive relationship between GDP per capita and life expectancy, though it appears to level off at higher GDP values.
10.10.2 Solution Q: Age Distribution by Gender
px.histogram(
    la_riots,
    x="age",
    color="gender",
    barmode="overlay",
    title="Age Distribution by Gender in LA Riots Dataset",
)
According to the plot, the oldest victim was female.
10.10.3 Solution Q: Life Expectancy by Continent
px.violin(
    g_2007,
    y="lifeExp",
    x="continent",
    box=True,
    title="Life Expectancy Distribution by Continent (2007)",
)
According to the plot, Oceania has the highest median country life expectancy.
10.10.4 Solution Q: Average Total Bill by Day
# Calculate the mean and standard deviation
summary_df = (
    tips.groupby("day")
    .agg(mean_bill=("total_bill", "mean"), std_bill=("total_bill", "std"))
    .reset_index()
)

# Create the bar chart
px.bar(
    summary_df,
    x="day",
    y="mean_bill",
    error_y="std_bill",
    title="Average Total Bill by Day",
)
According to the plot, Sunday has the highest average total bill, slightly ahead of Saturday.
10.10.5 Solution Q: High and Low Income Countries by Continent
px.histogram(
    g_2007_income,
    x="continent",
    color="income_group",
    title="Count of High and Low Income Countries by Continent",
    text_auto=True,
)
10.10.6 Solution Q: Proportion of High and Low Income Countries by Continent
px.histogram(
    g_2007_income,
    x="continent",
    color="income_group",
    barnorm="percent",
    text_auto=".1f",
    title="Proportion of High and Low Income Countries by Continent",
)
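According to the plot, Oceania has the highest proportion of high-income countries (100%), but the dataset contains only two Oceanian countries (Australia and New Zealand). This points to one limitation: percentages hide how many countries each bar represents. The plot also treats all countries equally regardless of their population or size.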
10.10.7 Solution Q: GDP per Capita Time Series
iraq_gdp = gapminder.query('country == "Iraq"')

px.line(
    iraq_gdp,
    x="year",
    y="gdpPercap",
    markers=True,
    title="Iraq GDP per Capita Over Time",
)
The Iran-Iraq War (1980-1988) likely explains the significant drop in GDP per capita during the 1980s.
10.10.8 Solution Q: Tips Bubble Chart
px.scatter(
    tips,
    x="total_bill",
    y="tip",
    size="size",
    color="day",
    title="Tips vs Total Bill by Party Size and Day",
)
According to the plot, the highest two tips were both given on Saturday, from parties of 3 and 4 people.
10.10.9 Solution Q: Tips Facet Plot
px.histogram(
    tips,
    x="time",
    color="sex",
    facet_col="day",
    barnorm="percent",
    text_auto=".1f",
    title="Proportion of Male vs Female Customers by Day and Time",
)
According to the plot, Sunday dinner has the highest proportion of male customers. (The dataset contains no Saturday lunch records.)