5.12. Pandas pivot and melt#
The pandas.DataFrame.pivot
method is a powerful feature in the pandas library for reshaping data in a DataFrame. It allows you to transform your data from a long format (where data is stored in rows) to a wide format (where data is stored in columns) or vice versa. This can be particularly useful when you need to analyze or visualize your data in a different structure.
Syntax:
DataFrame.pivot(index=None, columns=None, values=None)
index
: This parameter specifies the column whose unique values will become the new index (row labels) of the pivoted DataFrame. It can be a column name or a list of column names.columns
: This parameter specifies the column whose unique values will become the new column headers of the pivoted DataFrame. It can be a column name or a list of column names.values
: This parameter specifies the column(s) containing the data to populate the pivoted DataFrame. It can be a column name or a list of column names. If not provided, all remaining columns not used asindex
orcolumns
will be used.
We can see the full syntax here.
Example: We can use pivot
to transform data from a long format (e.g., for time-series data) to a wide format, making it easier to analyze.
import pandas as pd
# Create a DataFrame with air quality data
# This dataset is a fictional dataset:
air_quality_data = pd.DataFrame({
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Pollutant': ['CO', 'NO2', 'CO', 'NO2'],
'Value': [2.5, 20.1, 2.7, 22.3]
})
# Print the original air quality data DataFrame
print("Original Air Quality Data:")
display(air_quality_data)
# Pivot the air quality data to a wide format
pivoted_air_quality = air_quality_data.pivot(index='Date', columns='Pollutant', values='Value')
# Print the pivoted air quality data
print("\nPivoted Air Quality Data (Wide Format):")
display(pivoted_air_quality)
Original Air Quality Data:
Date | Pollutant | Value | |
---|---|---|---|
0 | 2023-01-01 | CO | 2.5 |
1 | 2023-01-01 | NO2 | 20.1 |
2 | 2023-01-02 | CO | 2.7 |
3 | 2023-01-02 | NO2 | 22.3 |
Pivoted Air Quality Data (Wide Format):
Pollutant | CO | NO2 |
---|---|---|
Date | ||
2023-01-01 | 2.5 | 20.1 |
2023-01-02 | 2.7 | 22.3 |
The pandas.melt
function is a powerful tool for converting a DataFrame from wide format to long format, making it easier to work with certain types of data. It essentially “unpivots” a DataFrame by melting columns into rows. This can be particularly useful when you want to analyze or visualize data that is initially organized with column headers representing different variables.
Syntax:
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
frame
: This parameter specifies the DataFrame you want to melt.id_vars
: This parameter specifies the column(s) that should remain as identifier variables (columns that won’t be melted). It can be a column name or a list of column names.value_vars
: This parameter specifies the column(s) that should be melted (unpivoted). If not specified, all columns not listed inid_vars
will be melted.var_name
: This parameter specifies the name of the new column that will store the variable names (from the melted columns).value_name
: This parameter specifies the name of the new column that will store the values (from the melted columns).col_level
: This parameter is used when working with MultiIndex columns. It specifies which level of the column index should be melted.
We can see the full syntax here.
Example: Let’s apply the melt
function to the previous air quality data example:
# Reset the index to convert 'Date' back to a regular column
pivoted_air_quality.reset_index(inplace=True)
display(pivoted_air_quality)
# Use the melt function to transform the data to long format
melted_air_quality = pd.melt(pivoted_air_quality, id_vars=['Date'], value_vars=['CO', 'NO2'],
var_name='Pollutant', value_name='Value')
# Print the melted air quality data in long format
print("Melted Air Quality Data (Long Format):")
display(melted_air_quality)
Pollutant | Date | CO | NO2 |
---|---|---|---|
0 | 2023-01-01 | 2.5 | 20.1 |
1 | 2023-01-02 | 2.7 | 22.3 |
Melted Air Quality Data (Long Format):
Date | Pollutant | Value | |
---|---|---|---|
0 | 2023-01-01 | CO | 2.5 |
1 | 2023-01-02 | CO | 2.7 |
2 | 2023-01-01 | NO2 | 20.1 |
3 | 2023-01-02 | NO2 | 22.3 |
Example: Utilizing the groupby function in conjunction with pivot.
import numpy as np
import pandas as pd
# ===========================================
# Dataset
# ===========================================
# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)
# Define the number of rows in the DataFrame
num_rows = 100
# Generate random data using NumPy functions directly
data = {
'Product': rng.choice(['A', 'B'], num_rows),
'Region': rng.choice(['North', 'South'], num_rows),
'Sales': rng.integers(100, 201, size=num_rows),
'Price': rng.integers(10, 23, size=num_rows)
}
# Create a Pandas DataFrame from the generated data
df = pd.DataFrame(data)
display(df.head())
# ===========================================
# Calculate Total Sales by Region and Product
# ===========================================
# Grouping by 'Region' and 'Product' and calculating the total sales
grouped = df.groupby(['Region', 'Product'])['Sales'].sum().reset_index()
# Display the grouped DataFrame
print("Total Sales by Region and Product:")
display(grouped)
# ===========================================
# Pivot Data to Create a Summary Table
# ===========================================
# Pivoting the data to create a summary table
pivot_table = grouped.pivot(index='Region', columns='Product', values='Sales')
# Display the pivot table
print("\nSummary Table (Region vs. Product):")
display(pivot_table)
Product | Region | Sales | Price | |
---|---|---|---|---|
0 | A | South | 136 | 14 |
1 | B | North | 191 | 20 |
2 | B | South | 150 | 10 |
3 | A | North | 170 | 21 |
4 | A | South | 146 | 17 |
Total Sales by Region and Product:
Region | Product | Sales | |
---|---|---|---|
0 | North | A | 4242 |
1 | North | B | 3311 |
2 | South | A | 2922 |
3 | South | B | 4318 |
Summary Table (Region vs. Product):
Product | A | B |
---|---|---|
Region | ||
North | 4242 | 3311 |
South | 2922 | 4318 |