Pandas pivot and melt

5.12. Pandas pivot and melt#

The pandas.DataFrame.pivot method is a powerful feature in the pandas library for reshaping data in a DataFrame. It allows you to transform your data from a long format (where data is stored in rows) to a wide format (where data is stored in columns) or vice versa. This can be particularly useful when you need to analyze or visualize your data in a different structure.

Syntax:

DataFrame.pivot(index=None, columns=None, values=None)
  • index: This parameter specifies the column whose unique values will become the new index (row labels) of the pivoted DataFrame. It can be a column name or a list of column names.

  • columns: This parameter specifies the column whose unique values will become the new column headers of the pivoted DataFrame. It can be a column name or a list of column names.

  • values: This parameter specifies the column(s) containing the data to populate the pivoted DataFrame. It can be a column name or a list of column names. If not provided, all remaining columns not used as index or columns will be used.

We can see the full syntax here.

Example: We can use pivot to transform data from a long format (e.g., for time-series data) to a wide format, making it easier to analyze.

import pandas as pd

# Create a DataFrame with air quality data
# This dataset is a fictional dataset:
air_quality_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Pollutant': ['CO', 'NO2', 'CO', 'NO2'],
    'Value': [2.5, 20.1, 2.7, 22.3]
})

# Print the original air quality data DataFrame
print("Original Air Quality Data:")
display(air_quality_data)

# Pivot the air quality data to a wide format
pivoted_air_quality = air_quality_data.pivot(index='Date', columns='Pollutant', values='Value')

# Print the pivoted air quality data
print("\nPivoted Air Quality Data (Wide Format):")
display(pivoted_air_quality)
Original Air Quality Data:
Date Pollutant Value
0 2023-01-01 CO 2.5
1 2023-01-01 NO2 20.1
2 2023-01-02 CO 2.7
3 2023-01-02 NO2 22.3
Pivoted Air Quality Data (Wide Format):
Pollutant CO NO2
Date
2023-01-01 2.5 20.1
2023-01-02 2.7 22.3

The pandas.melt function is a powerful tool for converting a DataFrame from wide format to long format, making it easier to work with certain types of data. It essentially “unpivots” a DataFrame by melting columns into rows. This can be particularly useful when you want to analyze or visualize data that is initially organized with column headers representing different variables.

Syntax:

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
  • frame: This parameter specifies the DataFrame you want to melt.

  • id_vars: This parameter specifies the column(s) that should remain as identifier variables (columns that won’t be melted). It can be a column name or a list of column names.

  • value_vars: This parameter specifies the column(s) that should be melted (unpivoted). If not specified, all columns not listed in id_vars will be melted.

  • var_name: This parameter specifies the name of the new column that will store the variable names (from the melted columns).

  • value_name: This parameter specifies the name of the new column that will store the values (from the melted columns).

  • col_level: This parameter is used when working with MultiIndex columns. It specifies which level of the column index should be melted.

We can see the full syntax here.

Example: Let’s apply the melt function to the previous air quality data example:

# Reset the index to convert 'Date' back to a regular column
pivoted_air_quality.reset_index(inplace=True)
display(pivoted_air_quality)

# Use the melt function to transform the data to long format
melted_air_quality = pd.melt(pivoted_air_quality, id_vars=['Date'], value_vars=['CO', 'NO2'], 
                              var_name='Pollutant', value_name='Value')

# Print the melted air quality data in long format
print("Melted Air Quality Data (Long Format):")
display(melted_air_quality)
Pollutant Date CO NO2
0 2023-01-01 2.5 20.1
1 2023-01-02 2.7 22.3
Melted Air Quality Data (Long Format):
Date Pollutant Value
0 2023-01-01 CO 2.5
1 2023-01-02 CO 2.7
2 2023-01-01 NO2 20.1
3 2023-01-02 NO2 22.3

Example: Utilizing the groupby function in conjunction with pivot.

import numpy as np
import pandas as pd

# ===========================================
# Dataset
# ===========================================

# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random data using NumPy functions directly
data = {
    'Product': rng.choice(['A', 'B'], num_rows),
    'Region': rng.choice(['North', 'South'], num_rows),
    'Sales': rng.integers(100, 201, size=num_rows),
    'Price': rng.integers(10, 23, size=num_rows)
}

# Create a Pandas DataFrame from the generated data
df = pd.DataFrame(data)
display(df.head())

# ===========================================
# Calculate Total Sales by Region and Product
# ===========================================

# Grouping by 'Region' and 'Product' and calculating the total sales
grouped = df.groupby(['Region', 'Product'])['Sales'].sum().reset_index()

# Display the grouped DataFrame
print("Total Sales by Region and Product:")
display(grouped)

# ===========================================
# Pivot Data to Create a Summary Table
# ===========================================

# Pivoting the data to create a summary table
pivot_table = grouped.pivot(index='Region', columns='Product', values='Sales')

# Display the pivot table
print("\nSummary Table (Region vs. Product):")
display(pivot_table)
Product Region Sales Price
0 A South 136 14
1 B North 191 20
2 B South 150 10
3 A North 170 21
4 A South 146 17
Total Sales by Region and Product:
Region Product Sales
0 North A 4242
1 North B 3311
2 South A 2922
3 South B 4318
Summary Table (Region vs. Product):
Product A B
Region
North 4242 3311
South 2922 4318