import plotly.express as px
import pandas as pd
import numpy as np
from vega_datasets import data
10 Bivariate & Multivariate Graphs with Plotly Express
10.1 Introduction
In this lesson, you’ll learn how to create bivariate and multivariate graphs using Plotly Express. These types of graphs are essential for exploring relationships between two or more variables, whether they are quantitative or categorical. Understanding these relationships can provide deeper insights into your data.
Let’s dive in!
10.2 Learning Objectives
By the end of this lesson, you will be able to:
- Create scatter plots for quantitative vs. quantitative data
- Generate grouped histograms and violin plots for quantitative vs. categorical data
- Create grouped, stacked, and percent-stacked bar charts for categorical vs. categorical data
- Visualize time series data using bar charts and line charts
- Create bubble charts to display relationships between three or more variables
- Use faceting to compare distributions across subsets of data
10.3 Imports
This lesson requires plotly.express, pandas, numpy, and vega_datasets. Install them if you haven’t already.
10.4 Numeric vs. Numeric Data
When both variables are quantitative, scatter plots are an excellent way to visualize their relationship.
10.4.1 Scatter Plot
Let’s create a scatter plot to examine the relationship between total_bill and tip in the tips dataset. The tips dataset is included in Plotly Express and contains information about restaurant bills and tips that were collected by a waiter in a US restaurant.
First, we’ll load the dataset and preview its rows:
tips = px.data.tips()
tips
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
Next, we’ll create a basic scatter plot. We do this with the px.scatter function.
px.scatter(tips, x='total_bill', y='tip')
From the scatter plot, we can observe that as the total bill increases, the tip amount tends to increase as well.
Let’s enhance the scatter plot by adding labels and a title.
px.scatter(
    tips,
    x="total_bill",
    y="tip",
    labels={"total_bill": "Total Bill ($)", "tip": "Tip ($)"},
    title="Relationship Between Total Bill and Tip Amount",
)
Recall that you can see additional information about the function by typing px.scatter? in a cell and executing the cell.
px.scatter?
10.4.2 Practice Q: Life Expectancy vs. GDP Per Capita
Using the Gapminder dataset (the 2007 subset, g_2007, defined below), create a scatter plot showing the relationship between gdpPercap (GDP per capita) and lifeExp (life expectancy).
According to the plot, what is the relationship between GDP per capita and life expectancy?
gapminder = px.data.gapminder()
g_2007 = gapminder.query('year == 2007')
g_2007.head()
# Your code here
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
| 23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
| 35 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.367465 | DZA | 12 |
| 47 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.231267 | AGO | 24 |
| 59 | Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.379640 | ARG | 32 |
According to the plot, there is a positive relationship between GDP per capita and life expectancy, though it appears to level off at higher GDP values.
10.5 Numeric vs. Categorical Data
When one variable is quantitative and the other is categorical, we can use grouped histograms, violin plots, or box plots to visualize the distribution of the quantitative variable across different categories.
10.5.1 Grouped Histograms
First, here’s how you can create a regular histogram of all tips:
px.histogram(tips, x='tip')
To create a grouped histogram, use the color parameter to specify the categorical variable. Here, we’ll color the histogram by sex:
px.histogram(tips, x='tip', color='sex')
By default, the histograms for each category are stacked. To change this behavior, you can use the barmode parameter. For example, barmode='overlay' will create an overlaid histogram:
px.histogram(tips, x="tip", color="sex", barmode="overlay")
This creates two semi-transparent histograms overlaid on top of each other, allowing for direct comparison of the distributions.
10.5.2 Practice Q: Age Distribution by Gender
Using the la_riots dataset from vega_datasets, create a grouped histogram of age by gender. Compare the age distributions between different genders.
According to the plot, was the oldest victim male or female?
la_riots = data.la_riots()
la_riots.head()
# Your code here
| | first_name | last_name | age | gender | race | death_date | address | neighborhood | type | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cesar A. | Aguilar | 18.0 | Male | Latino | 1992-04-30 | 2009 W. 6th St. | Westlake | Officer-involved shooting | -118.273976 | 34.059281 |
| 1 | George | Alvarez | 42.0 | Male | Latino | 1992-05-01 | Main & College streets | Chinatown | Not riot-related | -118.234098 | 34.062690 |
| 2 | Wilson | Alvarez | 40.0 | Male | Latino | 1992-05-23 | 3100 Rosecrans Ave. | Hawthorne | Homicide | -118.326816 | 33.901662 |
| 3 | Brian E. | Andrew | 30.0 | Male | Black | 1992-04-30 | Rosecrans & Chester avenues | Compton | Officer-involved shooting | -118.215390 | 33.903457 |
| 4 | Vivian | Austin | 87.0 | Female | Black | 1992-05-03 | 1600 W. 60th St. | Harvard Park | Death | -118.304741 | 33.985667 |
According to the plot, the oldest victim was female.
10.5.3 Violin & Box Plots
Violin plots are useful for comparing the distribution of a quantitative variable across different categories. They show the probability density of the data at different values and can include a box plot to summarize key statistics.
First, let’s create a violin plot of all tips:
px.violin(tips, y="tip")
We can add a box plot to the violin plot by setting the box parameter to True:
px.violin(tips, y="tip", box=True)
For just the box plot, we can use px.box:
px.box(tips, y="tip")
To overlay the individual data points (with jitter) on a violin or box plot, set the points parameter to 'all':
px.violin(tips, y="tip", points="all")
Now, to create a violin plot of tips by gender, use the x parameter to specify the categorical variable:
px.violin(tips, y="tip", x="sex", box=True)
We can also add a color axis to differentiate the violins:
px.violin(tips, y="tip", x="sex", color="sex", box=True)
10.5.4 Practice Q: Life Expectancy by Continent
Using the g_2007 dataset, create a violin plot showing the distribution of lifeExp by continent.
According to the plot, which continent has the highest median country life expectancy?
g_2007 = gapminder.query("year == 2007")
g_2007.head()
# Your code here
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
| 23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
| 35 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.367465 | DZA | 12 |
| 47 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.231267 | AGO | 24 |
| 59 | Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.379640 | ARG | 32 |
According to the plot, Oceania has the highest median country life expectancy.
10.5.5 Summary Bar Charts (Mean and Standard Deviation)
Sometimes it’s useful to display the mean and standard deviation of a quantitative variable across different categories. This can be visualized using a bar chart with error bars.
First, let’s calculate the mean and standard deviation of tips for each gender. You have not yet learned how to do this, but you will in a later lesson.
# Calculate the mean and standard deviation
summary_df = (
    tips.groupby("sex")
    .agg(mean_tip=("tip", "mean"), std_tip=("tip", "std"))
    .reset_index()
)
summary_df
| | sex | mean_tip | std_tip |
|---|---|---|---|
| 0 | Female | 2.833448 | 1.159495 |
| 1 | Male | 3.089618 | 1.489102 |
Next, we’ll create a bar chart using px.bar and add error bars using the error_y parameter:
# Create the bar chart
px.bar(summary_df, x="sex", y="mean_tip", error_y="std_tip")
This bar chart displays the average tip amount for each gender, with error bars representing the standard deviation.
10.5.6 Practice Q: Average Total Bill by Day
Using the tips dataset, create a bar chart of mean total_bill by day with standard deviation error bars. You should copy and paste the code from the example above and modify it to create this plot.
According to the plot, which day has the highest average total bill?
tips.head() # View the tips dataset
# Your code here
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
According to the plot, Sunday has the highest average bill.
px.bar and px.histogram
Notice that this is the first time we are using the px.bar function. For past plots, we have used px.histogram to make bar charts.
The bar chart function generally expects that the numeric variable being plotted is already in its own column, while the histogram function does the grouping for you.
For example, in the cell below, we use px.histogram to make a bar chart of the sex column. The resulting plot compares the number of male and female customers in the dataset.
px.histogram(tips, x='sex')
To make the same plot using px.bar, we first need to group by the sex column and count the number of rows for each sex.
sex_counts = tips['sex'].value_counts().reset_index()
sex_counts
| | sex | count |
|---|---|---|
| 0 | Male | 157 |
| 1 | Female | 87 |
We can then plot these counts using px.bar:
px.bar(sex_counts, x="sex", y="count")
This produces a bar chart with one bar for each sex.
10.6 Categorical vs. Categorical Data
When both variables are categorical, bar charts with a color axis are effective for visualizing the frequency distribution across categories. We will focus on three types of bar charts: stacked bar charts, percent-stacked bar charts, and grouped/clustered bar charts.
10.6.1 Stacked Bar Charts
Stacked bar charts show the total counts and the breakdown within each category. To make a stacked bar chart, use the color parameter to specify the categorical variable:
px.histogram(
    tips,
    x='day',
    color='sex'
)
Let’s add numbers to the bars to show the exact counts, and also improve the color palette with custom colors.
px.histogram(
    tips,
    x="day",
    color="sex",
    text_auto=True,
    color_discrete_sequence=["#deb221", "#2f828a"],
)
This stacked bar chart shows the total number of customers each day, broken down by gender.
10.6.2 Practice Q: High and Low Income Countries by Continent
Using the g_2007_income dataset, create a stacked bar chart showing the count of high and low income countries in each continent.
gap_dat = px.data.gapminder()
g_2007_income = (
    gap_dat.query("year == 2007")
    .drop(columns=["year", "iso_alpha", "iso_num"])
    .assign(
        income_group=lambda df: np.where(
            df.gdpPercap > 15000, "High Income", "Low & Middle Income"
        )
    )
)
g_2007_income.head()
# Your code here
| | country | continent | lifeExp | pop | gdpPercap | income_group |
|---|---|---|---|---|---|---|
| 11 | Afghanistan | Asia | 43.828 | 31889923 | 974.580338 | Low & Middle Income |
| 23 | Albania | Europe | 76.423 | 3600523 | 5937.029526 | Low & Middle Income |
| 35 | Algeria | Africa | 72.301 | 33333216 | 6223.367465 | Low & Middle Income |
| 47 | Angola | Africa | 42.731 | 12420476 | 4797.231267 | Low & Middle Income |
| 59 | Argentina | Americas | 75.320 | 40301927 | 12779.379640 | Low & Middle Income |
10.6.3 Percent-Stacked Bar Charts
To show proportions instead of counts, we can create percent-stacked bar charts by setting the barnorm parameter to 'percent':
# Create the percent-stacked bar chart
px.histogram(tips, x="day", color="sex", barnorm="percent")
This chart normalizes the bar heights to represent percentages, showing the proportion of each gender for each day.
We can also add text labels to the bars to show the exact percentages:
px.histogram(tips, x="day", color="sex", barnorm="percent", text_auto=".1f")
The format string .1f in the text_auto parameter rounds the text labels to one decimal place.
10.6.4 Practice Q: Proportion of High and Low Income Countries by Continent
Again using the g_2007_income dataset, create a percent-stacked bar chart showing the proportion of high and low income countries in each continent. Add text labels to the bars to show the exact percentages.
According to the plot, which continent has the highest proportion of high income countries? Are there any limitations to this plot?
# Your code here
10.6.5 Clustered Bar Charts
For clustered bar charts, set the barmode parameter to 'group' to place the bars for each category side by side:
px.histogram(tips, x="day", color="sex", barmode="group")
This layout makes it easier to compare values across categories directly.
10.7 Time Series Data
Time series data represents observations collected at different points in time. It’s crucial for analyzing trends, patterns, and changes over time. Let’s explore some basic time series visualizations using Nigeria’s population data from the Gapminder dataset.
First, let’s prepare our data:
# Load the Gapminder dataset
gapminder = px.data.gapminder()
# Subset the data for Nigeria
nigeria_pop = gapminder.query('country == "Nigeria"')[['year', 'pop']]
nigeria_pop
| | year | pop |
|---|---|---|
| 1128 | 1952 | 33119096 |
| 1129 | 1957 | 37173340 |
| 1130 | 1962 | 41871351 |
| 1131 | 1967 | 47287752 |
| 1132 | 1972 | 53740085 |
| 1133 | 1977 | 62209173 |
| 1134 | 1982 | 73039376 |
| 1135 | 1987 | 81551520 |
| 1136 | 1992 | 93364244 |
| 1137 | 1997 | 106207839 |
| 1138 | 2002 | 119901274 |
| 1139 | 2007 | 135031164 |
10.7.1 Bar Chart
A bar chart can be used to plot time series data.
# Bar chart
px.bar(nigeria_pop, x="year", y="pop")
This bar chart gives us a clear view of how Nigeria’s population has changed over the years, with each bar representing the population at a specific year.
10.7.2 Line Chart
A line chart is excellent for showing continuous changes over time:
# Line chart
px.line(nigeria_pop, x="year", y="pop")
The line chart connects the population values, making it easier to see the overall trend of population growth.
Adding markers to a line chart can highlight specific data points:
# Line chart with points
px.line(nigeria_pop, x='year', y='pop', markers=True)
We can also compare the population growth of multiple countries by adding a color parameter:
nigeria_ghana = gapminder.query('country in ["Nigeria", "Ghana"]')
px.line(nigeria_ghana, x="year", y="pop", color="country", markers=True)
This chart allows us to compare the population trends of Nigeria and Ghana over time.
10.7.3 Practice Q: GDP per Capita Time Series
Using the Gapminder dataset, create a time series visualization for the GDP per capita of Iraq.
# Your code here
What happened to Iraq in the 1980s that might explain the graph shown?
10.8 Plots with Three or More Variables
Although bivariate visualizations are the most common types of visualizations, plots with three or more variables are also sometimes useful. Let’s explore a few examples.
10.8.1 Bubble Charts
Bubble charts show the relationship between three variables by mapping the size of the points to a third variable. Below, we plot the relationship between gdpPercap and lifeExp with the size of the points representing the population of the country.
px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop")
We can easily spot the largest countries by population, such as China, India, and the United States. We can also add a color axis to differentiate between continents:
px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop", color="continent")
Now we have four different variables being plotted:
- gdpPercap on the x-axis
- lifeExp on the y-axis
- pop as the size of the points
- continent as the color of the points
10.8.2 Practice Q: Tips Bubble Chart
Using the tips dataset, create a bubble chart showing the relationship between total_bill and tip with the size of the points representing the size of the party, and the color representing the day of the week.
Use the plot to answer the question:
- The highest two tip amounts were on which days and what was the table size?
tips.head()
# Your code here
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
10.8.3 Facet Plots
Faceting splits a single plot into multiple plots, with each plot showing a different subset of the data. This is useful for comparing distributions across subsets.
For example, we can facet the bubble chart by continent:
px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
)
We can change the arrangement of the facets by changing the facet_col_wrap parameter. For example, facet_col_wrap=2 will wrap the facets into two columns:
px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
    facet_col_wrap=2,
)
Similarly, we can facet the violin plots of tips by day of the week:
px.violin(
    tips,
    x="sex",
    y="tip",
    color="sex",
    facet_col="day",
    facet_col_wrap=2,
)
Faceting allows us to compare distributions across different days, providing more granular insights.
10.8.4 Practice Q: Tips Facet Plot
Using the tips dataset, create a percent-stacked bar chart of the time column, colored by the sex column, and faceted by the day column.
Which day-time has the highest proportion of male customers (e.g. Friday Lunch, Saturday Dinner, etc.)?
tips.head()
# Your code here
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
10.9 Summary
In this lesson, you learned how to create bivariate and multivariate graphs using Plotly Express. Understanding these visualization techniques will help you explore and communicate relationships in your data more effectively.
See you in the next lesson!