import pandas as pd
import plotly.express as px
23 Reshaping Data with melt() and pivot()
23.1 Packages
23.2 Data
Run the code below to load and define the datasets to be used in this lesson.
# Temperatures dataset
= pd.DataFrame(
temperatures
{"country": ["Sweden", "Denmark", "Norway"],
"1994": [1, 2, 3],
"1995": [3, 4, 5],
"1996": [5, 6, 7],
}
)
# Fuels Wide dataset
= pd.read_csv(
fuels_wide "https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/oil_per_capita_wide.csv"
)
# Eurostat Births Wide dataset
= pd.read_csv(
euro_births_wide "https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/euro_births_wide.csv"
)
# Contracts dataset
= pd.read_csv(
contracts "https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/chicago_contracts_20_23.csv"
)
# Population dataset
= pd.read_csv(
population "https://raw.githubusercontent.com/the-graph-courses/idap_book/main/data/tidyr_population.csv"
)
23.3 Introduction
Reshaping is a data manipulation technique that involves re-orienting the rows and columns of a dataset. This is often required to make data easier to analyze or understand.
In this lesson, we will cover how to effectively reshape data using pandas
functions.
23.4 Learning Objectives
- Understand what wide data format is and what long data format is.
- Learn how to reshape wide data to long data using
melt()
. - Learn how to reshape long data to wide data using
pivot()
.
23.5 What Do “Wide” and “Long” Mean?
The terms “wide” and “long” are best understood in the context of example datasets. Let’s take a look at some now.
Imagine that you have three products for which you collect sales data over three months.
Wide Format:
Product | Jan | Feb | Mar |
---|---|---|---|
A | 100 | 120 | 110 |
B | 90 | 95 | 100 |
C | 80 | 85 | 90 |
Long Format:
Product | Month | Sales |
---|---|---|
A | Jan | 100 |
A | Feb | 120 |
A | Mar | 110 |
B | Jan | 90 |
B | Feb | 95 |
B | Mar | 100 |
C | Jan | 80 |
C | Feb | 85 |
C | Mar | 90 |
Take a minute to study the two datasets to make sure you understand the relationship between them.
In the wide dataset, each observational unit (each product) occupies only one row, and each measurement (sales in Jan, Feb, Mar) is in a separate column.
In the long dataset, on the other hand, each observational unit (each product) occupies multiple rows, with one row for each measurement.
Here’s another example with mock data, where the observational units are countries:
Long Format:
Country | Year | GDP |
---|---|---|
USA | 2020 | 21433 |
USA | 2021 | 22940 |
China | 2020 | 14723 |
China | 2021 | 17734 |
Wide Format:
Country | GDP_2020 | GDP_2021 |
---|---|---|
USA | 21433 | 22940 |
China | 14723 | 17734 |
The examples above are both time-series datasets because the measurements are repeated across time. But the concepts of wide and long are relevant to other kinds of data too.
Consider the example below, showing the number of employees in different departments of three companies:
Wide Format:
Company | HR | Sales | IT |
---|---|---|---|
A | 10 | 20 | 15 |
B | 8 | 25 | 20 |
C | 12 | 18 | 22 |
Long Format:
Company | Department | Employees |
---|---|---|
A | HR | 10 |
A | Sales | 20 |
A | IT | 15 |
B | HR | 8 |
B | Sales | 25 |
B | IT | 20 |
C | HR | 12 |
C | Sales | 18 |
C | IT | 22 |
In the wide dataset, each observational unit (each company) occupies only one row, with the repeated measurements (number of employees in different departments) spread across multiple columns.
In the long dataset, each observational unit is spread over multiple rows.
The observational units, sometimes called statistical units, are the primary entities or items described by the dataset.
In the first example, the observational units were products; in the second example, countries; and in the third example, companies.
23.6 Practice Q: Wide or Long?
Consider the temperatures
dataset created earlier:
temperatures
country | 1994 | 1995 | 1996 | |
---|---|---|---|---|
0 | Sweden | 1 | 3 | 5 |
1 | Denmark | 2 | 4 | 6 |
2 | Norway | 3 | 5 | 7 |
Is this data in a wide or long format?
23.7 When Should You Use Wide vs. Long Data?
The truth is, it really depends on what you want to do! The wide format is great for displaying data because it’s easy to visually compare values this way. Long data is best for some data analysis tasks, like grouping and plotting.
It is essential to know how to switch from one format to the other easily. Switching from the wide to the long format, or the other way around, is called reshaping.
23.8 From Wide to Long with melt()
To practice reshaping from a wide to a long format, we’ll consider data from Our World in Data on fossil fuel consumption per capita. You can find the data here.
Below, we view this data on fossil fuel consumption per capita:
fuels_wide
Entity | Code | y_1970 | y_1980 | y_1990 | y_2000 | y_2010 | y_2020 | |
---|---|---|---|---|---|---|---|---|
0 | Algeria | DZA | 1764.8470 | 3532.7976 | 4381.6636 | 3351.2180 | 5064.9863 | 4877.2680 |
1 | Argentina | ARG | 11677.9680 | 10598.3990 | 7046.2485 | 7146.8154 | 7966.7827 | 6399.2114 |
2 | Australia | AUS | 23040.4550 | 25007.4380 | 23046.9510 | 23976.3550 | 23584.3070 | 20332.4100 |
3 | Austria | AUT | 14338.8090 | 19064.0920 | 16595.1930 | 18189.0920 | 18424.1170 | 14934.0650 |
4 | Azerbaijan | AZE | NaN | NaN | 13516.0190 | 9119.3470 | 4031.9407 | 5615.1157 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
76 | United States | USA | 40813.9530 | 42365.6500 | 37525.5160 | 37730.1600 | 31791.3070 | 26895.4770 |
77 | Uzbekistan | UZB | NaN | NaN | 6324.8677 | 3197.1330 | 1880.1338 | 1859.1548 |
78 | Venezuela | VEN | 11138.2210 | 16234.0960 | 12404.5570 | 11239.9260 | 14948.3070 | 4742.6226 |
79 | Vietnam | VNM | 1757.6117 | 439.9465 | 523.2565 | 1280.3065 | 2296.7590 | 2927.7446 |
80 | World | OWID_WRL | 7217.8340 | 8002.0854 | 7074.2583 | 6990.4272 | 6879.6110 | 6216.8060 |
81 rows × 8 columns
We observe that each observational unit (each country) occupies only one row, with the repeated measurements of fossil fuel consumption (in kilowatt-hour equivalents) spread out across multiple columns. Hence, this dataset is in a wide format.
To convert it to a long format, we can use the convenient melt
function. Within melt
, can define the id variables, which we do not want to reshape:
= fuels_wide.melt(id_vars=["Entity", "Code"])
fuels_long fuels_long
Entity | Code | variable | value | |
---|---|---|---|---|
0 | Algeria | DZA | y_1970 | 1764.8470 |
1 | Argentina | ARG | y_1970 | 11677.9680 |
2 | Australia | AUS | y_1970 | 23040.4550 |
3 | Austria | AUT | y_1970 | 14338.8090 |
4 | Azerbaijan | AZE | y_1970 | NaN |
... | ... | ... | ... | ... |
481 | United States | USA | y_2020 | 26895.4770 |
482 | Uzbekistan | UZB | y_2020 | 1859.1548 |
483 | Venezuela | VEN | y_2020 | 4742.6226 |
484 | Vietnam | VNM | y_2020 | 2927.7446 |
485 | World | OWID_WRL | y_2020 | 6216.8060 |
486 rows × 4 columns
Very easy!
Let’s sort it so it’s easier to read:
= fuels_long.sort_values(by=['Entity', 'variable'])
fuels_long fuels_long
Entity | Code | variable | value | |
---|---|---|---|---|
0 | Algeria | DZA | y_1970 | 1764.8470 |
81 | Algeria | DZA | y_1980 | 3532.7976 |
162 | Algeria | DZA | y_1990 | 4381.6636 |
243 | Algeria | DZA | y_2000 | 3351.2180 |
324 | Algeria | DZA | y_2010 | 5064.9863 |
... | ... | ... | ... | ... |
161 | World | OWID_WRL | y_1980 | 8002.0854 |
242 | World | OWID_WRL | y_1990 | 7074.2583 |
323 | World | OWID_WRL | y_2000 | 6990.4272 |
404 | World | OWID_WRL | y_2010 | 6879.6110 |
485 | World | OWID_WRL | y_2020 | 6216.8060 |
486 rows × 4 columns
The years are now indicated in the variable variable
, and all the consumption values occupy a single variable, value
. We may wish to rename the variable
column to year
, and the value
column to oil_consumption
. This can be done directly in the melt
function:
= fuels_wide.melt(
fuels_long =['Entity', 'Code'],
id_vars='year',
var_name='oil_consumption'
value_name=['Entity', 'year'])
).sort_values(by fuels_long
Entity | Code | year | oil_consumption | |
---|---|---|---|---|
0 | Algeria | DZA | y_1970 | 1764.8470 |
81 | Algeria | DZA | y_1980 | 3532.7976 |
162 | Algeria | DZA | y_1990 | 4381.6636 |
243 | Algeria | DZA | y_2000 | 3351.2180 |
324 | Algeria | DZA | y_2010 | 5064.9863 |
... | ... | ... | ... | ... |
161 | World | OWID_WRL | y_1980 | 8002.0854 |
242 | World | OWID_WRL | y_1990 | 7074.2583 |
323 | World | OWID_WRL | y_2000 | 6990.4272 |
404 | World | OWID_WRL | y_2010 | 6879.6110 |
485 | World | OWID_WRL | y_2020 | 6216.8060 |
486 rows × 4 columns
You may also want to remove the y_
in front of each year. This can be achieved with a string operation.
'year'] = fuels_long['year'].str.replace('y_', '').astype(int)
fuels_long[ fuels_long
Entity | Code | year | oil_consumption | |
---|---|---|---|---|
0 | Algeria | DZA | 1970 | 1764.8470 |
81 | Algeria | DZA | 1980 | 3532.7976 |
162 | Algeria | DZA | 1990 | 4381.6636 |
243 | Algeria | DZA | 2000 | 3351.2180 |
324 | Algeria | DZA | 2010 | 5064.9863 |
... | ... | ... | ... | ... |
161 | World | OWID_WRL | 1980 | 8002.0854 |
242 | World | OWID_WRL | 1990 | 7074.2583 |
323 | World | OWID_WRL | 2000 | 6990.4272 |
404 | World | OWID_WRL | 2010 | 6879.6110 |
485 | World | OWID_WRL | 2020 | 6216.8060 |
486 rows × 4 columns
Here’s what we did above:
- Used
str.replace()
to remove they_
prefix from each year. - Converted the
year
column to integers usingastype(int)
. - Sorted the data by “Entity” and “year” using
sort_values()
.
23.9 Practice Q: Temperatures to Long
Convert the temperatures
dataset shown below into a long format. Your answer should have the following column names: “country”, “year”, and “avg_temp”.
# Your code here
temperatures
country | 1994 | 1995 | 1996 | |
---|---|---|---|---|
0 | Sweden | 1 | 3 | 5 |
1 | Denmark | 2 | 4 | 6 |
2 | Norway | 3 | 5 | 7 |
23.10 Practice Q: Eurostat Births to Long
For this practice question, you will use the euro_births_wide
dataset from Eurostat. It shows the annual number of births in 50 European countries:
euro_births_wide.head()
country | x2015 | x2016 | x2017 | x2018 | x2019 | x2020 | x2021 | |
---|---|---|---|---|---|---|---|---|
0 | Belgium | 122274.0 | 121896.0 | 119690.0 | 118319.0 | 117695.0 | 114350.0 | 118349.0 |
1 | Bulgaria | 65950.0 | 64984.0 | 63955.0 | 62197.0 | 61538.0 | 59086.0 | 58678.0 |
2 | Czechia | 110764.0 | 112663.0 | 114405.0 | 114036.0 | 112231.0 | 110200.0 | 111793.0 |
3 | Denmark | 58205.0 | 61614.0 | 61397.0 | 61476.0 | 61167.0 | 60937.0 | 63473.0 |
4 | Germany | 737575.0 | 792141.0 | 784901.0 | 787523.0 | 778090.0 | 773144.0 | 795492.0 |
The data is in a wide format. Convert it to a long format DataFrame that has the following column names: “country”, “year”, and “births_count”.
Remove the x prefix from the year columns and convert them to integers.
23.11 Using Long Data for Analysis
Let’s see why long data is often better for analysis.
Consider again the fuels_wide
dataset:
fuels_wide.head()
Entity | Code | y_1970 | y_1980 | y_1990 | y_2000 | y_2010 | y_2020 | |
---|---|---|---|---|---|---|---|---|
0 | Algeria | DZA | 1764.847 | 3532.7976 | 4381.6636 | 3351.2180 | 5064.9863 | 4877.2680 |
1 | Argentina | ARG | 11677.968 | 10598.3990 | 7046.2485 | 7146.8154 | 7966.7827 | 6399.2114 |
2 | Australia | AUS | 23040.455 | 25007.4380 | 23046.9510 | 23976.3550 | 23584.3070 | 20332.4100 |
3 | Austria | AUT | 14338.809 | 19064.0920 | 16595.1930 | 18189.0920 | 18424.1170 | 14934.0650 |
4 | Azerbaijan | AZE | NaN | NaN | 13516.0190 | 9119.3470 | 4031.9407 | 5615.1157 |
fuels_long.head()
Entity | Code | year | oil_consumption | |
---|---|---|---|---|
0 | Algeria | DZA | 1970 | 1764.8470 |
81 | Algeria | DZA | 1980 | 3532.7976 |
162 | Algeria | DZA | 1990 | 4381.6636 |
243 | Algeria | DZA | 2000 | 3351.2180 |
324 | Algeria | DZA | 2010 | 5064.9863 |
If we want to find the average fossil fuel consumption per country, this is very easy to do with the long format:
'Entity')['oil_consumption'].mean() fuels_long.groupby(
Entity
Algeria 3828.796750
Argentina 8472.570833
Australia 23164.652667
Austria 16924.228000
Azerbaijan 8070.605600
...
United States 36187.010500
Uzbekistan 3315.322325
Venezuela 11784.621600
Vietnam 1537.604133
World 7063.503650
Name: oil_consumption, Length: 81, dtype: float64
But with the wide format, this is not so easy:
'y_1970', 'y_1980', 'y_1990', 'y_2000', 'y_2010', 'y_2020']].mean(axis=1) fuels_wide[[
0 3828.796750
1 8472.570833
2 23164.652667
3 16924.228000
4 8070.605600
...
76 36187.010500
77 3315.322325
78 11784.621600
79 1537.604133
80 7063.503650
Length: 81, dtype: float64
Imagine if you had 100 years of data!
And mean is a fairly simple operation. How would you calculate the standard deviation of fossil fuel consumption per country?
Long data is also very useful for plotting.
For example, to plot the average fossil fuel consumption per country over time, we can use the following code:
= fuels_long.query('Entity in ["Peru", "Iran", "China"]')
subset ='year', y='oil_consumption', color='Entity', title='Average Fossil Fuel Consumption per Country') px.line(subset, x
To create a plot like this with the wide format is not directly possible, since the data you want to plot is scattered across multiple columns.
So as you can see, while wide data is great for display, long data is very useful for analysis and plotting.
23.12 From Long to Wide
Now you know how to reshape from wide to long with melt()
. How about going the other way, from long to wide? For this, you can use the pivot()
function.
But before we see how to use this function to manipulate long data, let’s first consider where you’re likely to run into long data.
While wide data tends to come from external sources (as we have seen above), long data, on the other hand, is likely to be created by you while data wrangling, especially in the course of grouped aggregations.
Let’s see an example of this now.
We will use a dataset of contracts granted by the city of Chicago from 2020 to 2023. You can find more information about the data here.
contracts
year | approval_date | description | contract_num | revision_num | specification_num | contract_type | start_date | end_date | department | vendor_name | vendor_id | address_1 | address_2 | city | state | zip | award_amount | procurement_type | contract_pdf | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020 | 2020-01-02 | LEASE | 24406 | 32 | 96136 | PROPERTY LEASE | NaN | NaN | NaN | 8700 BUILDING LLC | 89123305A | 7300 S NARRAGANSETT | NaN | BEDFORD PARK | Illinois | 60638 | 321.1 | NaN | NaN |
1 | 2020 | 2020-01-03 | DFSS-HHS-CS-CEL: | 113798 | 0 | 1070196 | DELEGATE AGENCY | 12/01/2019 | 11/30/2022 | DEPT OF FAMILY AND SUPPORT SERVICES | CATHOLIC CHARITIES OF THE ARCHDIOCESE OF CHICAGO | 102484615A | 1 E BANKS ST | NaN | CHICAGO | Illinois | 60670 | 17692515.0 | NaN | NaN |
2 | 2020 | 2020-01-03 | DFSS-HHS-CS-CEL: | 113819 | 0 | 1070196 | DELEGATE AGENCY | 12/01/2019 | 11/30/2022 | DEPT OF FAMILY AND SUPPORT SERVICES | KIMBALL DAYCARE CENTER & KINDERGARTEN INC | 105458567Z | 1636-1638 N KIMBALL AVE | NaN | CHICAGO | Illinois | 60647 | 11461500.0 | NaN | http://ecm.cityofchicago.org/eSMARTContracts/s... |
3 | 2020 | 2020-01-03 | DFSS-HHS-CS-CEL: | 113818 | 0 | 1070196 | DELEGATE AGENCY | 12/01/2019 | 11/30/2022 | DEPT OF FAMILY AND SUPPORT SERVICES | JUDAH INTERNATIONAL OUTREACH MINISTRIES, INC | 94219962X | 856 N PULASKI RD | NaN | CHICAGO | Illinois | 60651 | 2356515.0 | NaN | http://ecm.cityofchicago.org/eSMARTContracts/s... |
4 | 2020 | 2020-01-03 | DFSS-HHS-CS-CEL: | 113820 | 0 | 1070196 | DELEGATE AGENCY | 12/01/2019 | 11/30/2022 | DEPT OF FAMILY AND SUPPORT SERVICES | Marillac St. Vincent Family Services Inc DBA S... | 97791861L | 212 S FRANCISCO AVENUE EFT | NaN | CHICAGO | Illinois | 60612 | 3666015.0 | NaN | http://ecm.cityofchicago.org/eSMARTContracts/s... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
28823 | 2023 | 2023-12-29 | DFSS-CORP-HL-PSH: | 220413 | 3 | 1221503 | DELEGATE AGENCY | NaN | NaN | DEPT OF FAMILY AND SUPPORT SERVICES | INNER VOICE INC. | 6231926M | 1621 W WALNUT ST FL 1ST | NaN | CHICAGO | Illinois | 60612 | 0.0 | NaN | NaN |
28824 | 2023 | 2023-12-29 | DFSS-CORP-YS-OST: | 253846 | 0 | 1247493 | DELEGATE AGENCY | NaN | NaN | DEPT OF FAMILY AND SUPPORT SERVICES | AFTER-SCHOOL MATTERS, INC.|CLEANED-UP | 72580818P | 66 E RANDOLPH ST FL 1ST | NaN | CHICAGO | Illinois | 60601 | 32000.0 | NaN | NaN |
28825 | 2023 | 2023-12-29 | DFSS-IDHS-HL-INTHS: | 253843 | 0 | 1235949 | DELEGATE AGENCY | NaN | NaN | DEPT OF FAMILY AND SUPPORT SERVICES | BREAKTHROUGH URBAN MINISTRIES, INC. | 94722896V | 402 N ST LOUIS AVENUE EFT | NaN | CHICAGO | Illinois | 60624 | 14400.0 | NaN | NaN |
28826 | 2023 | 2023-12-29 | CDPH-RW-PA: ESS-HRSA PO 116685 CHICAGO HOUSE A... | 192085 | 1 | 1095441 | DELEGATE AGENCY | NaN | NaN | DEPARTMENT OF HEALTH | CHICAGO HOUSE & SOCIAL SERVICE AGENCY | 105470138T | 2229 S MICHIGAN AVE 304 EFT | NaN | CHICAGO | Illinois | 60616 | -32025.2 | NaN | NaN |
28827 | 2023 | 2023-12-29 | DFSS-HHS-CS-CEL: | 222199 | 1 | 1070196 | DELEGATE AGENCY | NaN | NaN | DEPT OF FAMILY AND SUPPORT SERVICES | ALLISON'S INFANT & TODDLER CENTER INC | 62751817Z | 234 E 115TH ST FL 1ST | NaN | CHICAGO | Illinois | 60628 | 141923.0 | NaN | NaN |
28828 rows × 20 columns
Each row corresponds to one contract, and we have each contract’s ID number, the year in which it was granted, the amount of the contract, and the vendor’s name and address, among other variables.
Now, consider the following grouped summary of the contracts
dataset, which shows the number of contracts by state of the vendor in each year:
= contracts.groupby("state")["year"].value_counts().reset_index()
contracts_summary contracts_summary
state | year | count | |
---|---|---|---|
0 | Alabama | 2023 | 7 |
1 | Alabama | 2021 | 2 |
2 | Alabama | 2020 | 1 |
3 | Alabama | 2022 | 1 |
4 | Arizona | 2020 | 3 |
... | ... | ... | ... |
128 | Washington | 2021 | 1 |
129 | Wisconsin | 2023 | 25 |
130 | Wisconsin | 2020 | 18 |
131 | Wisconsin | 2022 | 17 |
132 | Wisconsin | 2021 | 15 |
133 rows × 3 columns
The output of this grouped operation is a quintessentially “long” dataset. Each observational unit (each state) occupies multiple rows, with one row for each measurement (each year).
Now, let’s see how to convert such long data into a wide format with pivot()
.
The code is quite straightforward:
= contracts_summary.pivot(
contracts_wide ="state", columns="year", values="count"
index
).reset_index() contracts_wide.head()
year | state | 2020 | 2021 | 2022 | 2023 |
---|---|---|---|---|---|
0 | Alabama | 1.0 | 2.0 | 1.0 | 7.0 |
1 | Arizona | 3.0 | 1.0 | 3.0 | 2.0 |
2 | Arkansas | 1.0 | NaN | 1.0 | NaN |
3 | British Columbia | NaN | 1.0 | NaN | NaN |
4 | California | 36.0 | 42.0 | 43.0 | 38.0 |
As you can see, pivot()
has three important arguments:
index
defines which column(s) to use as the new index. In our case, it’s the “state” since we want each row to represent one state.columns
identifies which variable to use to define column names in the wide format. In our case, it’s the “year”. You can see that the years are now the column names.values
specifies which values will become the core of the wide data format. In our case, it’s the number of contracts “count”.
You might also want to have the years be your primary observational units, with each year occupying one row. This can be carried out similarly to the above example, but with year
as the index and state
as the columns:
= contracts_summary.pivot(
contracts_wide_year ="year", columns="state", values="count"
index
).reset_index() contracts_wide_year
state | year | Alabama | Arizona | Arkansas | British Columbia | California | Canada | Colorado | Connecticut | Delaware | ... | Oregon | Pennsylvania | Rhode Island | South Carolina | Tennessee | Texas | Vermont | Virginia | Washington | Wisconsin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020 | 1.0 | 3.0 | 1.0 | NaN | 36.0 | 1.0 | 6.0 | 1.0 | NaN | ... | 5.0 | 20.0 | 1.0 | 2.0 | 2.0 | 25.0 | NaN | 4.0 | NaN | 18.0 |
1 | 2021 | 2.0 | 1.0 | NaN | 1.0 | 42.0 | 1.0 | 2.0 | 5.0 | NaN | ... | NaN | 24.0 | NaN | 3.0 | 2.0 | 24.0 | 1.0 | 4.0 | 1.0 | 15.0 |
2 | 2022 | 1.0 | 3.0 | 1.0 | NaN | 43.0 | 1.0 | 7.0 | 3.0 | 1.0 | ... | NaN | 31.0 | NaN | 2.0 | 3.0 | 37.0 | NaN | 7.0 | NaN | 17.0 |
3 | 2023 | 7.0 | 2.0 | NaN | NaN | 38.0 | NaN | 6.0 | 5.0 | 2.0 | ... | NaN | 37.0 | NaN | 1.0 | 3.0 | 28.0 | NaN | 9.0 | NaN | 25.0 |
4 rows × 44 columns
Here, the unique observation units (our rows) are now the years (2020, 2021, 2022, 2023).
23.13 Practice Q: Temperatures back to Wide
Convert the long temperatures_long
dataset you created above back to a wide format. Your answer should have the following column names: “country”, “1994”, “1995”, and “1996”.
# Your code here
23.14 Practice Q: Population to Wide
The population
dataset shows the populations of 219 countries over time.
Reshape this data into a wide format.
population
country | year | population | |
---|---|---|---|
0 | Afghanistan | 1995 | 17586073 |
1 | Afghanistan | 1996 | 18415307 |
2 | Afghanistan | 1997 | 19021226 |
3 | Afghanistan | 1998 | 19496836 |
4 | Afghanistan | 1999 | 19987071 |
... | ... | ... | ... |
4040 | Zimbabwe | 2009 | 12888918 |
4041 | Zimbabwe | 2010 | 13076978 |
4042 | Zimbabwe | 2011 | 13358738 |
4043 | Zimbabwe | 2012 | 13724317 |
4044 | Zimbabwe | 2013 | 14149648 |
4045 rows × 3 columns
23.15 Reshaping Can Be Hard
We have mostly looked at very simple examples of reshaping here, but in the wild, reshaping can be difficult to do accurately.
When you run into such cases, we recommend looking at the official documentation of reshaping from the pandas team, as it is quite rich in examples.
23.16 Wrap-Up
Congratulations! You’ve gotten the hang of reshaping data with pandas.
You now understand the differences between wide and long formats and can skillfully use melt()
and pivot()
to transform your data as needed.
23.17 Solution Q: Wide or Long?
The data is in a wide format.
23.18 Solution Q: Temperatures to Long
# Melt the wide data into long format
= temperatures.melt(
temperatures_long =["country"], var_name="year", value_name="avgtemp"
id_vars
)
# Display the long format data
temperatures_long
country | year | avgtemp | |
---|---|---|---|
0 | Sweden | 1994 | 1 |
1 | Denmark | 1994 | 2 |
2 | Norway | 1994 | 3 |
3 | Sweden | 1995 | 3 |
4 | Denmark | 1995 | 4 |
5 | Norway | 1995 | 5 |
6 | Sweden | 1996 | 5 |
7 | Denmark | 1996 | 6 |
8 | Norway | 1996 | 7 |
23.19 Solution Q: Eurostat Births to Long
# Melt the wide data into long format
= euro_births_wide.melt(
births_long =["country"], var_name="year", value_name="births_count"
id_vars
)
# Display the long format data
births_long
country | year | births_count | |
---|---|---|---|
0 | Belgium | x2015 | 122274.0 |
1 | Bulgaria | x2015 | 65950.0 |
2 | Czechia | x2015 | 110764.0 |
3 | Denmark | x2015 | 58205.0 |
4 | Germany | x2015 | 737575.0 |
... | ... | ... | ... |
345 | Ukraine | x2021 | 212.0 |
346 | Armenia | x2021 | 271983.0 |
347 | Azerbaijan | x2021 | NaN |
348 | Georgia | x2021 | 112284.0 |
349 | NaN | x2021 | 45946.0 |
350 rows × 3 columns
23.20 Solution Q: Temperatures back to Wide
# Pivot the long data into wide format
= temperatures_long.pivot(
temperatures_wide ="country", columns="year", values="avgtemp"
index
).reset_index()
# Display the wide format data
temperatures_wide
year | country | 1994 | 1995 | 1996 |
---|---|---|---|---|
0 | Denmark | 2 | 4 | 6 |
1 | Norway | 3 | 5 | 7 |
2 | Sweden | 1 | 3 | 5 |
23.21 Solution Q: Population to Wide
# Pivot the long data into wide format
= population.pivot(
population_wide ="country", columns="year", values="population"
index
).reset_index()
# Display the wide format data
population_wide
year | country | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 17586073.0 | 18415307.0 | 19021226.0 | 19496836.0 | 19987071.0 | 20595360.0 | 21347782.0 | 22202806.0 | 23116142.0 | 24018682.0 | 24860855.0 | 25631282.0 | 26349243.0 | 27032197.0 | 27708187.0 | 28397812.0 | 29105480.0 | 29824536.0 | 30551674.0 |
1 | Albania | 3357858.0 | 3341043.0 | 3331317.0 | 3325456.0 | 3317941.0 | 3304948.0 | 3286084.0 | 3263596.0 | 3239385.0 | 3216197.0 | 3196130.0 | 3179573.0 | 3166222.0 | 3156608.0 | 3151185.0 | 3150143.0 | 3153883.0 | 3162083.0 | 3173271.0 |
2 | Algeria | 29315463.0 | 29845208.0 | 30345466.0 | 30820435.0 | 31276295.0 | 31719449.0 | 32150198.0 | 32572977.0 | 33003442.0 | 33461345.0 | 33960903.0 | 34507214.0 | 35097043.0 | 35725377.0 | 36383302.0 | 37062820.0 | 37762962.0 | 38481705.0 | 39208194.0 |
3 | American Samoa | 52874.0 | 53926.0 | 54942.0 | 55899.0 | 56768.0 | 57522.0 | 58176.0 | 58729.0 | 59117.0 | 59262.0 | 59117.0 | 58652.0 | 57919.0 | 57053.0 | 56245.0 | 55636.0 | 55274.0 | 55128.0 | 55165.0 |
4 | Andorra | 63854.0 | 64274.0 | 64090.0 | 63799.0 | 64084.0 | 65399.0 | 68000.0 | 71639.0 | 75643.0 | 79060.0 | 81223.0 | 81877.0 | 81292.0 | 79969.0 | 78659.0 | 77907.0 | 77865.0 | 78360.0 | 79218.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
212 | Wallis and Futuna Islands | 14143.0 | 14221.0 | 14309.0 | 14394.0 | 14460.0 | 14497.0 | 14501.0 | 14476.0 | 14422.0 | 14344.0 | 14246.0 | 14126.0 | 13988.0 | 13840.0 | 13697.0 | 13565.0 | 13451.0 | 13353.0 | 13272.0 |
213 | West Bank and Gaza Strip | 2598393.0 | 2722497.0 | 2851993.0 | 2980563.0 | 3099951.0 | 3204572.0 | 3291620.0 | 3363542.0 | 3426549.0 | 3489743.0 | 3559856.0 | 3638829.0 | 3725076.0 | 3817551.0 | 3914035.0 | 4012880.0 | 4114199.0 | 4218771.0 | 4326295.0 |
214 | Yemen | 15018201.0 | 15578640.0 | 16088019.0 | 16564235.0 | 17035531.0 | 17522537.0 | 18029989.0 | 18551068.0 | 19081306.0 | 19612696.0 | 20139661.0 | 20661714.0 | 21182162.0 | 21703571.0 | 22229625.0 | 22763008.0 | 23304206.0 | 23852409.0 | 24407381.0 |
215 | Zambia | 8841338.0 | 9073311.0 | 9320089.0 | 9577483.0 | 9839179.0 | 10100981.0 | 10362137.0 | 10625423.0 | 10894519.0 | 11174650.0 | 11470022.0 | 11781612.0 | 12109620.0 | 12456527.0 | 12825031.0 | 13216985.0 | 13633796.0 | 14075099.0 | 14538640.0 |
216 | Zimbabwe | 11639364.0 | 11846110.0 | 12045813.0 | 12229500.0 | 12384727.0 | 12503652.0 | 12586763.0 | 12640922.0 | 12673103.0 | 12693047.0 | 12710589.0 | 12724308.0 | 12740160.0 | 12784041.0 | 12888918.0 | 13076978.0 | 13358738.0 | 13724317.0 | 14149648.0 |
217 rows × 20 columns