10 Bivariate & Multivariate Graphs with Plotly Express – Introduction to Data Science with Python

10.1 Introduction

In this lesson, you’ll learn how to create bivariate and multivariate graphs using Plotly Express. These types of graphs are essential for exploring relationships between two or more variables, whether they are quantitative or categorical. Understanding these relationships can provide deeper insights into your data.

Let’s dive in!

10.2 Learning Objectives

By the end of this lesson, you will be able to:

Create scatter plots for quantitative vs. quantitative data
Generate grouped histograms and violin plots for quantitative vs. categorical data
Create grouped, stacked, and percent-stacked bar charts for categorical vs. categorical data
Visualize time series data using bar charts and line charts
Create bubble charts to display relationships between three or more variables
Use faceting to compare distributions across subsets of data

10.3 Imports

This lesson requires the plotly (which provides plotly.express), pandas, numpy, and vega_datasets packages. Install them if you haven’t already.

import plotly.express as px
import pandas as pd
import numpy as np
from vega_datasets import data

10.4 Numeric vs. Numeric Data

When both variables are quantitative, scatter plots are an excellent way to visualize their relationship.

10.4.1 Scatter Plot

Let’s create a scatter plot to examine the relationship between total_bill and tip in the tips dataset. The tips dataset is included in Plotly Express and contains information about restaurant bills and tips that were collected by a waiter in a US restaurant.

First, we’ll load the dataset and preview it. Displaying the DataFrame in a notebook shows its first and last five rows:

tips = px.data.tips()
tips

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
...	...	...	...	...	...	...	...
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

244 rows × 7 columns

Next, we’ll create a basic scatter plot. We do this with the px.scatter function.

px.scatter(tips, x='total_bill', y='tip')

From the scatter plot, we can observe that as the total bill increases, the tip amount tends to increase as well.

Let’s enhance the scatter plot by adding labels and a title.

px.scatter(
    tips,
    x="total_bill",
    y="tip",
    labels={"total_bill": "Total Bill ($)", "tip": "Tip ($)"},
    title="Relationship Between Total Bill and Tip Amount",
)

Recall that you can see additional information about the function by typing px.scatter? in a cell and executing the cell.

px.scatter?

10.4.2 Practice Q: Life Expectancy vs. GDP Per Capita

Practice

Using the Gapminder dataset (the 2007 subset, g_2007, defined below), create a scatter plot showing the relationship between gdpPercap (GDP per capita) and lifeExp (life expectancy).

According to the plot, what is the relationship between GDP per capita and life expectancy?

gapminder = px.data.gapminder()
g_2007 = gapminder.query('year == 2007')
g_2007.head()
# Your code here

	country	continent	year	lifeExp	pop	gdpPercap	iso_alpha	iso_num
11	Afghanistan	Asia	2007	43.828	31889923	974.580338	AFG	4
23	Albania	Europe	2007	76.423	3600523	5937.029526	ALB	8
35	Algeria	Africa	2007	72.301	33333216	6223.367465	DZA	12
47	Angola	Africa	2007	42.731	12420476	4797.231267	AGO	24
59	Argentina	Americas	2007	75.320	40301927	12779.379640	ARG	32

According to the plot, there is a positive relationship between GDP per capita and life expectancy, though it appears to level off at higher GDP values.

10.5 Numeric vs. Categorical Data

When one variable is quantitative and the other is categorical, we can use grouped histograms, violin plots, or box plots to visualize the distribution of the quantitative variable across different categories.

10.5.1 Grouped Histograms

First, here’s how you can create a regular histogram of all tips:

px.histogram(tips, x='tip')

To create a grouped histogram, use the color parameter to specify the categorical variable. Here, we’ll color the histogram by sex:

px.histogram(tips, x='tip', color='sex')

By default, the histograms for each category are stacked. To change this behavior, you can use the barmode parameter. For example, barmode='overlay' will create an overlaid histogram:

px.histogram(tips, x="tip", color="sex", barmode="overlay")

This creates two semi-transparent histograms overlaid on top of each other, allowing for direct comparison of the distributions.

10.5.2 Practice Q: Age Distribution by Gender

Practice

Using the la_riots dataset from vega_datasets, create a grouped histogram of age by gender. Compare the age distributions between different genders.

According to the plot, was the oldest victim male or female?

la_riots = data.la_riots()
la_riots.head()
# Your code here

	first_name	last_name	age	gender	race	death_date	address	neighborhood	type	longitude	latitude
0	Cesar A.	Aguilar	18.0	Male	Latino	1992-04-30	2009 W. 6th St.	Westlake	Officer-involved shooting	-118.273976	34.059281
1	George	Alvarez	42.0	Male	Latino	1992-05-01	Main & College streets	Chinatown	Not riot-related	-118.234098	34.062690
2	Wilson	Alvarez	40.0	Male	Latino	1992-05-23	3100 Rosecrans Ave.	Hawthorne	Homicide	-118.326816	33.901662
3	Brian E.	Andrew	30.0	Male	Black	1992-04-30	Rosecrans & Chester avenues	Compton	Officer-involved shooting	-118.215390	33.903457
4	Vivian	Austin	87.0	Female	Black	1992-05-03	1600 W. 60th St.	Harvard Park	Death	-118.304741	33.985667

According to the plot, the oldest victim was female.

10.5.3 Violin & Box Plots

Violin plots are useful for comparing the distribution of a quantitative variable across different categories. They show the probability density of the data at different values and can include a box plot to summarize key statistics.

First, let’s create a violin plot of all tips:

px.violin(tips, y="tip")

We can add a box plot to the violin plot by setting the box parameter to True:

px.violin(tips, y="tip", box=True)

For just the box plot, we can use px.box:

px.box(tips, y="tip")

To show the individual data points as jittered markers alongside a violin or box plot, set the points parameter to 'all':

px.violin(tips, y="tip", points="all")

Now, to create a violin plot of tips by gender, use the x parameter to specify the categorical variable:

px.violin(tips, y="tip", x="sex", box=True)

We can also add a color axis to differentiate the violins:

px.violin(tips, y="tip", x="sex", color="sex", box=True)

10.5.4 Practice Q: Life Expectancy by Continent

Practice

Using the g_2007 dataset, create a violin plot showing the distribution of lifeExp by continent.

According to the plot, which continent has the highest median country life expectancy?

g_2007 = gapminder.query("year == 2007")
g_2007.head()
# Your code here

	country	continent	year	lifeExp	pop	gdpPercap	iso_alpha	iso_num
11	Afghanistan	Asia	2007	43.828	31889923	974.580338	AFG	4
23	Albania	Europe	2007	76.423	3600523	5937.029526	ALB	8
35	Algeria	Africa	2007	72.301	33333216	6223.367465	DZA	12
47	Angola	Africa	2007	42.731	12420476	4797.231267	AGO	24
59	Argentina	Americas	2007	75.320	40301927	12779.379640	ARG	32

According to the plot, Oceania has the highest median country life expectancy.

10.5.5 Summary Bar Charts (Mean and Standard Deviation)

Sometimes it’s useful to display the mean and standard deviation of a quantitative variable across different categories. This can be visualized using a bar chart with error bars.

First, let’s calculate the mean and standard deviation of tips for each gender. You have not yet learned how to do this, but you will in a later lesson.

# Calculate the mean and standard deviation
summary_df = (
    tips.groupby("sex")
    .agg(mean_tip=("tip", "mean"), std_tip=("tip", "std"))
    .reset_index()
)
summary_df

	sex	mean_tip	std_tip
0	Female	2.833448	1.159495
1	Male	3.089618	1.489102

Next, we’ll create a bar chart using px.bar and add error bars using the error_y parameter:

# Create the bar chart
px.bar(summary_df, x="sex", y="mean_tip", error_y="std_tip")

This bar chart displays the average tip amount for each gender, with error bars representing the standard deviation.

10.5.6 Practice Q: Average Total Bill by Day

Practice

Using the tips dataset, create a bar chart of mean total_bill by day with standard deviation error bars. You should copy and paste the code from the example above and modify it to create this plot.

According to the plot, which day has the highest average total bill?

tips.head()  # View the tips dataset
# Your code here

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

According to the plot, Sunday has the highest average bill.

Side Note: Difference between px.bar and px.histogram

Notice that this is the first time we are using the px.bar function. For past plots, we have used px.histogram to make bar charts.

The bar chart function generally expects the numeric variable being plotted to already be in its own column, while the histogram function does the grouping for you.

For example, in the cell below, we use px.histogram to make a bar chart of the sex column. The resulting plot compares the number of male and female customers in the dataset.

px.histogram(tips, x='sex')

To make the same plot using px.bar, we first need to count the number of rows for each value of the sex column. (The column names shown below, sex and count, are what pandas 2.0 and later produce.)

sex_counts = tips['sex'].value_counts().reset_index()
sex_counts

	sex	count
0	Male	157
1	Female	87

We can then plot the counts in sex_counts using px.bar:

px.bar(sex_counts, x="sex", y="count")

This produces a bar chart with one bar for each sex.

10.6 Categorical vs. Categorical Data

When both variables are categorical, bar charts with a color axis are effective for visualizing the frequency distribution across categories. We will focus on three types of bar charts: stacked bar charts, percent-stacked bar charts, and grouped/clustered bar charts.

10.6.1 Stacked Bar Charts

Stacked bar charts show the total counts and the breakdown within each category. To make a stacked bar chart, use the color parameter to specify the categorical variable:

px.histogram(
    tips,
    x='day',
    color='sex'
)

Let’s add numbers to the bars to show the exact counts, and also improve the color palette with custom colors.

px.histogram(
    tips,
    x="day",
    color="sex",
    text_auto=True,
    color_discrete_sequence=["#deb221", "#2f828a"],
)

This stacked bar chart shows the total number of customers each day, broken down by gender.

10.6.2 Practice Q: High and Low Income Countries by Continent

Practice

Using the g_2007_income dataset, create a stacked bar chart showing the count of high and low income countries in each continent.

gap_dat = px.data.gapminder()

g_2007_income = (
    gap_dat.query("year == 2007")
    .drop(columns=["year", "iso_alpha", "iso_num"])
    .assign(
        income_group=lambda df: np.where(
            df.gdpPercap > 15000, "High Income", "Low & Middle Income"
        )
    )
)

g_2007_income.head()
# Your code here

	country	continent	lifeExp	pop	gdpPercap	income_group
11	Afghanistan	Asia	43.828	31889923	974.580338	Low & Middle Income
23	Albania	Europe	76.423	3600523	5937.029526	Low & Middle Income
35	Algeria	Africa	72.301	33333216	6223.367465	Low & Middle Income
47	Angola	Africa	42.731	12420476	4797.231267	Low & Middle Income
59	Argentina	Americas	75.320	40301927	12779.379640	Low & Middle Income

10.6.3 Percent-Stacked Bar Charts

To show proportions instead of counts, we can create percent-stacked bar charts by setting the barnorm parameter to 'percent':

# Create the percent-stacked bar chart
px.histogram(tips, x="day", color="sex", barnorm="percent")

This chart normalizes the bar heights to represent percentages, showing the proportion of each gender for each day.

We can also add text labels to the bars to show the exact percentages:

px.histogram(tips, x="day", color="sex", barnorm="percent", text_auto=".1f")

The format string '.1f' passed to the text_auto parameter formats the text labels to one decimal place.

10.6.4 Practice Q: Proportion of High and Low Income Countries by Continent

Practice

Again using the g_2007_income dataset, create a percent-stacked bar chart showing the proportion of high and low income countries in each continent. Add text labels to the bars to show the exact percentages.

According to the plot, which continent has the highest proportion of high income countries? Are there any limitations to this plot?

# Your code here

10.6.5 Clustered Bar Charts

For clustered bar charts, set the barmode parameter to 'group' to place the bars for each category side by side:

px.histogram(tips, x="day", color="sex", barmode="group")

This layout makes it easier to compare values across categories directly.

10.7 Time Series Data

Time series data represents observations collected at different points in time. It’s crucial for analyzing trends, patterns, and changes over time. Let’s explore some basic time series visualizations using Nigeria’s population data from the Gapminder dataset.

First, let’s prepare our data:

# Load the Gapminder dataset
gapminder = px.data.gapminder()

# Subset the data for Nigeria
nigeria_pop = gapminder.query('country == "Nigeria"')[['year', 'pop']]
nigeria_pop

	year	pop
1128	1952	33119096
1129	1957	37173340
1130	1962	41871351
1131	1967	47287752
1132	1972	53740085
1133	1977	62209173
1134	1982	73039376
1135	1987	81551520
1136	1992	93364244
1137	1997	106207839
1138	2002	119901274
1139	2007	135031164

10.7.1 Bar Chart

A bar chart can be used to plot time series data.

# Bar chart
px.bar(nigeria_pop, x="year", y="pop")

This bar chart gives us a clear view of how Nigeria’s population has changed over the years, with each bar representing the population at a specific year.

10.7.2 Line Chart

A line chart is excellent for showing continuous changes over time:

# Line chart
px.line(nigeria_pop, x="year", y="pop")

The line chart connects the population values, making it easier to see the overall trend of population growth.

Adding markers to a line chart can highlight specific data points:

# Line chart with points
px.line(nigeria_pop, x='year', y='pop', markers=True)

We can also compare the population growth of multiple countries by adding a color parameter:

nigeria_ghana = gapminder.query('country in ["Nigeria", "Ghana"]')
px.line(nigeria_ghana, x="year", y="pop", color="country", markers=True)

This chart allows us to compare the population trends of Nigeria and Ghana over time.

10.7.3 Practice Q: GDP per Capita Time Series

Practice

Using the Gapminder dataset, create a time series visualization for the GDP per capita of Iraq.

# Your code here

What happened to Iraq in the 1980s that might explain the graph shown?

10.8 Plots with Three or More Variables

Although bivariate visualizations are the most common types of visualizations, plots with three or more variables are also sometimes useful. Let’s explore a few examples.

10.8.1 Bubble Charts

Bubble charts show the relationship between three variables by mapping the size of the points to a third variable. Below, we plot the relationship between gdpPercap and lifeExp with the size of the points representing the population of the country.

px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop")

We can easily spot the largest countries by population, such as China, India, and the United States. We can also add a color axis to differentiate between continents:

px.scatter(g_2007, x="gdpPercap", y="lifeExp", size="pop", color="continent")

Now we have four different variables being plotted:

gdpPercap on the x-axis
lifeExp on the y-axis
pop as the size of the points
continent as the color of the points

10.8.2 Practice Q: Tips Bubble Chart

Practice

Using the tips dataset, create a bubble chart showing the relationship between total_bill and tip with the size of the points representing the size of the party, and the color representing the day of the week.

Use the plot to answer the question:

On which days were the two highest tips given, and what was the party size in each case?

tips.head()
# Your code here

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

10.8.3 Facet Plots

Faceting splits a single plot into multiple plots, with each plot showing a different subset of the data. This is useful for comparing distributions across subsets.

For example, we can facet the bubble chart by continent:

px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
)

We can change the arrangement of the facets by changing the facet_col_wrap parameter. For example, facet_col_wrap=2 will wrap the facets into two columns:

px.scatter(
    g_2007,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    facet_col="continent",
    facet_col_wrap=2,
)

Similarly, we can facet the violin plots of tips by day of the week:

px.violin(
    tips,
    x="sex",
    y="tip",
    color="sex",
    facet_col="day",
    facet_col_wrap=2,
)

Faceting allows us to compare distributions across different days, providing more granular insights.

10.8.4 Practice Q: Tips Facet Plot

Practice

Using the tips dataset, create a percent-stacked bar chart of the time column, colored by the sex column, and faceted by the day column.

Which combination of day and time has the highest proportion of male customers (e.g. Friday Lunch, Saturday Dinner, etc.)?

tips.head()
# Your code here

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

10.9 Summary

In this lesson, you learned how to create bivariate and multivariate graphs using Plotly Express. Understanding these visualization techniques will help you explore and communicate relationships in your data more effectively.

See you in the next lesson!