Pandas pivot and melt

5.12. Pandas pivot and melt#

The pandas.DataFrame.pivot method is a powerful feature in the pandas library for reshaping data in a DataFrame. It allows you to transform your data from a long format (where data is stored in rows) to a wide format (where data is stored in columns) or vice versa. This can be particularly useful when you need to analyze or visualize your data in a different structure.

Syntax:

DataFrame.pivot(index=None, columns=None, values=None)

index: This parameter specifies the column whose unique values will become the new index (row labels) of the pivoted DataFrame. It can be a column name or a list of column names.
columns: This parameter specifies the column whose unique values will become the new column headers of the pivoted DataFrame. It can be a column name or a list of column names.
values: This parameter specifies the column(s) containing the data to populate the pivoted DataFrame. It can be a column name or a list of column names. If not provided, all remaining columns not used as index or columns will be used.

We can see the full syntax here.

Example: We can use pivot to transform data from a long format (e.g., for time-series data) to a wide format, making it easier to analyze.

import pandas as pd

# Create a DataFrame with air quality data
# This dataset is a fictional dataset:
air_quality_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Pollutant': ['CO', 'NO2', 'CO', 'NO2'],
    'Value': [2.5, 20.1, 2.7, 22.3]
})

# Print the original air quality data DataFrame
print("Original Air Quality Data:")
display(air_quality_data)

# Pivot the air quality data to a wide format
pivoted_air_quality = air_quality_data.pivot(index='Date', columns='Pollutant', values='Value')

# Print the pivoted air quality data
print("\nPivoted Air Quality Data (Wide Format):")
display(pivoted_air_quality)

Original Air Quality Data:

	Date	Pollutant	Value
0	2023-01-01	CO	2.5
1	2023-01-01	NO2	20.1
2	2023-01-02	CO	2.7
3	2023-01-02	NO2	22.3

Pivoted Air Quality Data (Wide Format):

Pollutant	CO	NO2
Date
2023-01-01	2.5	20.1
2023-01-02	2.7	22.3

The pandas.melt function is a powerful tool for converting a DataFrame from wide format to long format, making it easier to work with certain types of data. It essentially “unpivots” a DataFrame by melting columns into rows. This can be particularly useful when you want to analyze or visualize data that is initially organized with column headers representing different variables.

Syntax:

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

frame: This parameter specifies the DataFrame you want to melt.
id_vars: This parameter specifies the column(s) that should remain as identifier variables (columns that won’t be melted). It can be a column name or a list of column names.
value_vars: This parameter specifies the column(s) that should be melted (unpivoted). If not specified, all columns not listed in id_vars will be melted.
var_name: This parameter specifies the name of the new column that will store the variable names (from the melted columns).
value_name: This parameter specifies the name of the new column that will store the values (from the melted columns).
col_level: This parameter is used when working with MultiIndex columns. It specifies which level of the column index should be melted.

We can see the full syntax here.

Example: Let’s apply the melt function to the previous air quality data example:

# Reset the index to convert 'Date' back to a regular column
pivoted_air_quality.reset_index(inplace=True)
display(pivoted_air_quality)

# Use the melt function to transform the data to long format
melted_air_quality = pd.melt(pivoted_air_quality, id_vars=['Date'], value_vars=['CO', 'NO2'], 
                              var_name='Pollutant', value_name='Value')

# Print the melted air quality data in long format
print("Melted Air Quality Data (Long Format):")
display(melted_air_quality)

Pollutant	Date	CO	NO2
0	2023-01-01	2.5	20.1
1	2023-01-02	2.7	22.3

Melted Air Quality Data (Long Format):

	Date	Pollutant	Value
0	2023-01-01	CO	2.5
1	2023-01-02	CO	2.7
2	2023-01-01	NO2	20.1
3	2023-01-02	NO2	22.3

Example: Utilizing the groupby function in conjunction with pivot.

import numpy as np
import pandas as pd

# ===========================================
# Dataset
# ===========================================

# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random data using NumPy functions directly
data = {
    'Product': rng.choice(['A', 'B'], num_rows),
    'Region': rng.choice(['North', 'South'], num_rows),
    'Sales': rng.integers(100, 201, size=num_rows),
    'Price': rng.integers(10, 23, size=num_rows)
}

# Create a Pandas DataFrame from the generated data
df = pd.DataFrame(data)
display(df.head())

# ===========================================
# Calculate Total Sales by Region and Product
# ===========================================

# Grouping by 'Region' and 'Product' and calculating the total sales
grouped = df.groupby(['Region', 'Product'])['Sales'].sum().reset_index()

# Display the grouped DataFrame
print("Total Sales by Region and Product:")
display(grouped)

# ===========================================
# Pivot Data to Create a Summary Table
# ===========================================

# Pivoting the data to create a summary table
pivot_table = grouped.pivot(index='Region', columns='Product', values='Sales')

# Display the pivot table
print("\nSummary Table (Region vs. Product):")
display(pivot_table)

	Product	Region	Sales	Price
0	A	South	136	14
1	B	North	191	20
2	B	South	150	10
3	A	North	170	21
4	A	South	146	17

Total Sales by Region and Product:

	Region	Product	Sales
0	North	A	4242
1	North	B	3311
2	South	A	2922
3	South	B	4318

Summary Table (Region vs. Product):

Product	A	B
Region
North	4242	3311
South	2922	4318