6.6. Aggregation and Grouping in Pandas#
In data analysis and manipulation, the concepts of aggregation and grouping play a crucial role, allowing us to derive insights from complex datasets. This section will delve deeper into the fundamental concepts of aggregation and grouping in the context of pandas, a powerful Python library for data manipulation and analysis.
6.6.1. Grouping Data with groupby()#
Grouping data involves categorizing and splitting a dataset into smaller subsets based on specific criteria. The `groupby()` function in pandas is a fundamental tool for accomplishing this task. It allows us to group a DataFrame based on one or more columns, enabling subsequent analysis and computation within these groups [McKinney, 2022, Pandas Developers, 2023].
6.6.1.1. Basic Syntax#
The basic syntax for using the `groupby()` function is as follows [Pandas Developers, 2023]:
grouped = df.groupby('column_name')
The full syntax is documented in the pandas API reference.
Example: This dataset records meteorological observations for the “UNIVERSITY OF CALGARY” weather station. It was obtained from https://climate.weather.gc.ca/ and includes the following columns:
Station Name: The name of the weather station, which is “UNIVERSITY OF CALGARY” in this case.
Date/Time: The date and time of the observation in the format MM/DD/YYYY.
Year: The year of the observation, extracted from the Date/Time column.
Month: The month of the observation, extracted from the Date/Time column.
Day: The day of the observation, extracted from the Date/Time column.
Max Temp (°C): The maximum temperature recorded on that date in degrees Celsius.
Min Temp (°C): The minimum temperature recorded on that date in degrees Celsius.
Mean Temp (°C): The mean temperature recorded on that date in degrees Celsius, typically calculated as the average of the maximum and minimum temperatures.
Total Rain (mm): The total amount of rainfall in millimeters on that date.
Total Snow (cm): The total amount of snowfall in centimeters on that date.
Total Precip (mm): The cumulative total of precipitation, which combines rain and snow, in millimeters on that date.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Datasets/UofC_Daily_1990_Q1Q2.csv')
# Display the last few rows of the DataFrame
display(df.tail())
# Grouping by 'Month'
grouped = df.groupby('Month')
print('grouped would be an object as follows:')
print(grouped)
|   | Station Name | Date/Time | Year | Month | Day | Max Temp (°C) | Min Temp (°C) | Mean Temp (°C) | Total Rain (mm) | Total Snow (cm) | Total Precip (mm) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 176 | UNIVERSITY OF CALGARY | 6/26/1990 | 1990 | 6 | 26 | 26.0 | 11.5 | 18.8 | 0.0 | 0.0 | 0.0 |
| 177 | UNIVERSITY OF CALGARY | 6/27/1990 | 1990 | 6 | 27 | 22.0 | 10.0 | 16.0 | 6.8 | 0.0 | 6.8 |
| 178 | UNIVERSITY OF CALGARY | 6/28/1990 | 1990 | 6 | 28 | 21.0 | 8.0 | 14.5 | 1.4 | 0.0 | 1.4 |
| 179 | UNIVERSITY OF CALGARY | 6/29/1990 | 1990 | 6 | 29 | 22.0 | 7.0 | 14.5 | 3.3 | 0.0 | 3.3 |
| 180 | UNIVERSITY OF CALGARY | 6/30/1990 | 1990 | 6 | 30 | 24.5 | 12.0 | 18.3 | 0.5 | 0.0 | 0.5 |
grouped would be an object as follows:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F17FB728D0>
6.6.2. Aggregation: Extracting Insights from Groups#
Once you’ve successfully grouped your data using the `groupby()` function, the next step is to harness the power of aggregation. Aggregation involves summarizing and extracting meaningful information from the grouped data by applying various functions to each group. This process reveals patterns, trends, and characteristics within your dataset.
Pandas offers a multitude of aggregation functions that allow you to compute summary statistics and insights for each group. These functions include [McKinney, 2022, Pandas Developers, 2023]:
Mean: Calculates the average value of a specific column within each group.
Sum: Computes the total sum of values in a column for each group.
Count: Determines the number of non-null values within a column for each group.
Max: Identifies the maximum value within a column for each group.
Min: Finds the minimum value within a column for each group.
…and many more.
def Line(n=40):
print(n * "_")
grouped = df.groupby('Month')
# Applying aggregation functions on the groups
# Calculating the mean of 'Max Temp (°C)' for each group
average_values = grouped['Max Temp (°C)'].mean().round(2)
print("Average Values:")
print(average_values)
Line()
# Calculating the sum of 'Total Precip (mm)' for each group
sum_values = grouped['Total Precip (mm)'].sum().round(2)
print("Sum Values:")
print(sum_values)
Line()
# Counting the number of occurrences in each group
count_values = grouped['Max Temp (°C)'].count().round(2)
print("Count Values:")
print(count_values)
Line()
# Finding the maximum value in each group
max_value = grouped['Max Temp (°C)'].max().round(2)
print("Max Values:")
print(max_value)
Line()
# Finding the minimum value in each group
min_value = grouped['Max Temp (°C)'].min().round(2)
print("Min Values:")
print(min_value)
Line()
Average Values:
Month
1 0.73
2 0.70
3 7.71
4 10.50
5 14.58
6 19.13
Name: Max Temp (°C), dtype: float64
________________________________________
Sum Values:
Month
1 22.5
2 19.5
3 239.0
4 315.0
5 452.0
6 574.0
Name: Total Precip (mm), dtype: float64
________________________________________
Count Values:
Month
1 31
2 28
3 31
4 30
5 31
6 30
Name: Max Temp (°C), dtype: int64
________________________________________
Max Values:
Month
1 10.0
2 16.0
3 18.0
4 20.0
5 25.0
6 26.5
Name: Max Temp (°C), dtype: float64
________________________________________
Min Values:
Month
1 -26.0
2 -24.5
3 -9.0
4 -3.5
5 7.0
6 8.0
Name: Max Temp (°C), dtype: float64
________________________________________
The code then calculates various aggregation functions (`mean()`, `sum()`, `count()`, `max()`, `min()`) on the grouped data and prints out the results for each function. The `Line()` function is used to separate the output of each aggregation function for better readability.
Here is a list of commonly used aggregation functions available in pandas’ `groupby()` operation [Pandas Developers, 2023]:

| Aggregation Function | Description |
|---|---|
| `agg()` | Applies a custom aggregation function to each group. |
| `apply()` | Applies a function to each group. |
| `count()` | Counts the non-null values in each group. |
| `cummax()` | Computes the cumulative maximum of values in each group. |
| `cummin()` | Computes the cumulative minimum of values in each group. |
| `cumprod()` | Computes the cumulative product of values in each group. |
| `cumsum()` | Computes the cumulative sum of values in each group. |
| `describe()` | Generates summary statistics for each group. |
| `first()` | Returns the first element in each group. |
| `last()` | Returns the last element in each group. |
| `max()` | Finds the maximum value in each group. |
| `mean()` | Computes the mean (average) of values in each group. |
| `median()` | Calculates the median of values in each group. |
| `min()` | Finds the minimum value in each group. |
| `nth()` | Returns the nth element in each group. |
| `prod()` | Calculates the product of values in each group. |
| `quantile()` | Computes the qth quantile of values in each group. |
| `size()` | Determines the number of elements in each group. |
| `std()` | Computes the standard deviation of values in each group. |
| `sum()` | Calculates the sum of values in each group. |
| `transform()` | Applies a function to each group and broadcasts the result. |
| `var()` | Calculates the variance of values in each group. |

These aggregation functions provide a wide range of options for summarizing and analyzing data within each group formed by the `groupby()` operation in pandas.
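As a quick illustration of a few of these methods (using a small made-up DataFrame, not the weather dataset), note how `size` and `median` return one value per group while `cumsum` stays aligned with the original rows:

```python
import pandas as pd

# Small illustrative DataFrame (not the weather dataset)
df_demo = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'B'],
                        'Value': [10, 20, 5, 15, 25]})
g = df_demo.groupby('Group')['Value']

print(g.size())    # rows per group: A -> 2, B -> 3
print(g.median())  # median per group: A -> 15.0, B -> 15.0
print(g.cumsum())  # running sum within each group, aligned to the original rows
```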
6.6.3. Multiple aggregations#
You can apply multiple aggregation functions simultaneously using the `agg()` method.
Example - Exploring Store Sales Patterns: Imagine you’re analyzing sales data for a small boutique store. The store sells products from two categories, ‘A’ and ‘B’. You want to understand how the sales values for these categories differ.
You start by generating a random dataset using numpy and pandas to simulate the store’s sales records. You set a random seed for reproducibility and create 100 rows of data. Each row has a ‘Category’ column with either ‘A’ or ‘B’ values and a ‘Value’ column with random sales amounts between 1 and 100.
import numpy as np
import pandas as pd
# Set the random seed for reproducibility
rng = np.random.default_rng(42)
# Define the number of rows in the DataFrame
num_rows = 100
# Generate random alphabet column with only 'A' and 'B'
random_alphabet = [rng.choice(['A', 'B']) for _ in range(num_rows)]
# Generate random numeric column
random_numeric = rng.integers(1, 101, size=num_rows)
# Create a Pandas DataFrame
data = {'Category': random_alphabet, 'Value': random_numeric}
df = pd.DataFrame(data)
# Display the first few rows of the DataFrame
display(df.head())
|   | Category | Value |
|---|---|---|
| 0 | A | 84 |
| 1 | B | 20 |
| 2 | B | 81 |
| 3 | A | 1 |
| 4 | A | 80 |
# Grouping by 'Category'
grouped = df.groupby('Category')
# Applying multiple aggregation functions
agg_functions = {'Value': ['mean', 'sum', 'count', 'max', 'min']}
result = grouped.agg(agg_functions)
# or simply result = grouped.agg({'Value': ['mean', 'sum', 'count', 'max', 'min']})
display(result)
| Category | Value (mean) | Value (sum) | Value (count) | Value (max) | Value (min) |
|---|---|---|---|---|---|
| A | 46.458333 | 2230 | 48 | 100 | 1 |
| B | 52.519231 | 2731 | 52 | 100 | 4 |
This code snippet applies multiple aggregation functions to a grouped dataset. It calculates statistics like mean, sum, count, maximum, and minimum values for the ‘Value’ column within each group.
6.6.4. Custom aggregation (Optional Content)#
You can also define custom aggregation functions using lambda functions or regular functions.
# Custom aggregation function
custom_agg = lambda x: x.max() - x.min()
result = grouped['Value'].agg(custom_agg)
display(result.to_frame('Value'))
| Category | Value |
|---|---|
| A | 99 |
| B | 96 |
This code employs a custom aggregation function to compute the range (difference between maximum and minimum) of the ‘Value’ column within each group.
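Since the text mentions that regular functions work as well as lambdas, here is an equivalent sketch using a named function (with a small made-up DataFrame; `value_range` is an illustrative name, not a pandas built-in):

```python
import pandas as pd

def value_range(x):
    """Return the spread (max minus min) of a group's values."""
    return x.max() - x.min()

df_demo = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                        'Value': [1, 4, 9, 10]})

# Pass the named function to agg exactly like a lambda
result = df_demo.groupby('Category')['Value'].agg(value_range)
print(result)  # A -> 8, B -> 6
```

A named function has the advantage that its name appears in tracebacks and can be reused across several aggregations.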
6.6.5. Grouping by multiple columns#
Suppose we have a dataset containing sales data for different products in different regions. We want to group the data by both “Product” and “Region” to analyze the total sales and average price for each combination.
import numpy as np
import pandas as pd
# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)
# Define the number of rows in the DataFrame
num_rows = 100
# Generate random data using NumPy functions directly
data = {'Product': rng.choice(['A', 'B'], num_rows),
'Region': rng.choice(['North', 'South'], num_rows),
'Sales': rng.integers(100, 201, size=num_rows),
'Price': rng.integers(10, 23, size=num_rows)}
# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)
# Display the resulting DataFrame
display(df.head())
# Grouping by both "Product" and "Region" and aggregating Sales and Price
result = df.groupby(['Product', 'Region']).agg({'Sales': 'sum', 'Price': 'mean'})
# Display the result
display(result)
# reset_index moves the group keys ('Product', 'Region') back into regular columns.
result.reset_index(inplace=True)
# Display the result
display(result)
|   | Product | Region | Sales | Price |
|---|---|---|---|---|
| 0 | A | South | 136 | 14 |
| 1 | B | North | 191 | 20 |
| 2 | B | South | 150 | 10 |
| 3 | A | North | 170 | 21 |
| 4 | A | South | 146 | 17 |

| Product | Region | Sales | Price |
|---|---|---|---|
| A | North | 4242 | 16.214286 |
| A | South | 2922 | 16.150000 |
| B | North | 3311 | 16.000000 |
| B | South | 4318 | 15.103448 |

|   | Product | Region | Sales | Price |
|---|---|---|---|---|
| 0 | A | North | 4242 | 16.214286 |
| 1 | A | South | 2922 | 16.150000 |
| 2 | B | North | 3311 | 16.000000 |
| 3 | B | South | 4318 | 15.103448 |
In this example, we first create a DataFrame containing sales data for different products in different regions. Then, we group the data using `groupby(['Product', 'Region'])` to form multi-level groups based on both “Product” and “Region”. Finally, we calculate the total sales and average price for each combination using the `agg()` method with a dictionary specifying the aggregation functions for each column. The `reset_index()` method is used to convert the grouped results into a flat DataFrame for better presentation.
6.6.6. Applying different aggregations to different columns#
You can apply different aggregation functions to different columns using the `agg()` method with a dictionary. Consider a dataset containing information about sales of different products. We want to group the data by “Product” and calculate the total sales as well as the maximum and minimum prices for each product.
# Grouping by "Product"
grouped = df.groupby('Product')
# Applying different aggregations to different columns
result = grouped.agg({
'Sales': 'sum', # Calculate total sales for each product
'Price': ['max', 'min'] # Calculate maximum and minimum prices for each product
}).reset_index(drop = False)
display(result)
|   | Product | Sales (sum) | Price (max) | Price (min) |
|---|---|---|---|---|
| 0 | A | 7164 | 22 | 10 |
| 1 | B | 7629 | 22 | 10 |
This code groups the DataFrame `df` by the ‘Product’ column, then applies aggregation functions to calculate the total sales for each product and the maximum and minimum prices.
6.6.7. Named Aggregation#
Named Aggregation in pandas allows you to assign custom column names to the results of aggregation functions, making it easier to interpret and work with the summarized data. This feature was introduced in pandas version 0.25.
To use named aggregation, you pass keyword arguments to the `agg()` method: each keyword becomes a column name in the result, and its value is a `pd.NamedAgg` named tuple specifying the source column and the aggregation function. This gives you more descriptive column names in the resulting DataFrame.
import numpy as np
import pandas as pd
# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)
# Define the number of rows in the DataFrame
num_rows = 100
# Generate random data using NumPy functions directly
data = {'Product': rng.choice(['A', 'B'], num_rows),
'Region': rng.choice(['North', 'South'], num_rows),
'Sales': rng.integers(100, 201, size=num_rows),
'Price': rng.integers(10, 23, size=num_rows)}
# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)
# Group, aggregate, and display in one chain
grouped = df.groupby('Product').agg(TotalSales=pd.NamedAgg(column='Sales', aggfunc='sum'),
                                    MinPrice=pd.NamedAgg(column='Price', aggfunc='min'),
                                    AvgPrice=pd.NamedAgg(column='Price', aggfunc='mean'),
                                    MaxPrice=pd.NamedAgg(column='Price', aggfunc='max')
                                    ).reset_index(drop=False)
import re
# Adding spaces between words in column names that are in CamelCase.
grouped.columns = [re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', x) for x in grouped.columns]
display(grouped)
|   | Product | Total Sales | Min Price | Avg Price | Max Price |
|---|---|---|---|---|---|
| 0 | A | 7164 | 10 | 16.1875 | 22 |
| 1 | B | 7629 | 10 | 15.5000 | 22 |
The provided Python code uses the pandas library to work with a dataset containing product information, including sales and prices. It groups the data by product, and within each group, it calculates total sales, minimum price, average price, and maximum price. The code then organizes the results into a new DataFrame with appropriately named columns. It employs regular expressions to add spaces between words in column names that are in CamelCase.
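Named aggregation also accepts plain `(column, aggfunc)` tuples as a shorthand for `pd.NamedAgg`; the two spellings are equivalent. A minimal sketch with made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({'Product': ['A', 'A', 'B'],
                        'Sales': [100, 150, 200],
                        'Price': [10, 20, 15]})

# (column, aggfunc) tuples are shorthand for pd.NamedAgg
summary = df_demo.groupby('Product').agg(TotalSales=('Sales', 'sum'),
                                         AvgPrice=('Price', 'mean'))
print(summary)
```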
6.6.8. Grouping and Resampling Time Series Data#
6.6.8.1. resample Function for Time Series Analysis#
The `resample` function in pandas is a versatile tool designed explicitly for managing time series data. It lets you change the frequency of your time-dependent data and simultaneously carry out aggregation operations on the grouped data [Pandas Developers, 2023]:
When applied to pandas DataFrames or Series with a datetime index, the `resample` function partitions time series data into distinct time intervals (such as days, months, or years) while offering the flexibility to apply aggregation functions within each interval. The general syntax is:
new_resampled_object = df.resample(rule)
In this context, `rule` is a string that defines the desired resampling frequency, such as `'D'` for daily, `'M'` for monthly, or `'A'` for annually. (Recent pandas versions prefer the aliases `'ME'` and `'YE'` for month-end and year-end.)
The true power of the `resample` function lies in its adaptability, enabling a seamless exploration of trends, patterns, and aggregated insights across various time intervals.
A comprehensive description is available in the pandas documentation.
Example: Consider our initial illustration involving climate data from the University of Calgary station for the first half of 1990.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Datasets/UofC_Daily_1990_Q1Q2.csv',
parse_dates = ['Date/Time'])
df = df.set_index('Date/Time')
# Resampling by month and calculating the mean temperature for each month
result = df.resample('MS').agg({'Mean Temp (°C)': 'mean'})
display(result)
| Date/Time | Mean Temp (°C) |
|---|---|
| 1990-01-01 | -4.377419 |
| 1990-02-01 | -6.096429 |
| 1990-03-01 | 1.225806 |
| 1990-04-01 | 4.573333 |
| 1990-05-01 | 8.854839 |
| 1990-06-01 | 13.296667 |
In `result = df.resample('MS').agg({'Mean Temp (°C)': 'mean'})`, `resample` is applied to the DataFrame `df`:

- `resample('MS')` specifies the resampling frequency. ‘MS’ stands for “Month Start,” indicating that the data will be binned starting at the beginning of each month.
- `agg({'Mean Temp (°C)': 'mean'})` defines the aggregation operation performed during resampling. In this case, it calculates the mean of the ‘Mean Temp (°C)’ column for each month.
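Just like `groupby`, `resample` accepts several aggregations at once. A sketch using a small synthetic daily series (made-up data, not the station dataset):

```python
import pandas as pd

# Synthetic daily series spanning January and February 2023
idx = pd.date_range('2023-01-01', '2023-02-28', freq='D')
s = pd.DataFrame({'Temp': range(len(idx))}, index=idx)

# Monthly min, mean, and max in one pass
monthly = s.resample('MS').agg({'Temp': ['min', 'mean', 'max']})
print(monthly)
```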
6.6.9. Group Iteration (Optional Content)#
In Pandas, you can use the `groupby` method to group data in a DataFrame based on one or more columns. Once you have grouped the data, you might want to perform operations on each group or apply custom functions to each group. Iterating through the grouped data can be done using the `groupby` object.
Let’s go through an example to illustrate how to iterate through the groups:
Suppose you have the following DataFrame:
import numpy as np
import pandas as pd
import random
# Set the random seed for reproducibility (optional)
np.random.seed(100)
random.seed(100)
# Define the number of rows in the DataFrame
num_rows = 10
# Create a dictionary with the generated data
data = {'Category': [random.choice(['A', 'B']) for _ in range(num_rows)],
'Value': np.random.randint(1, 101, size=num_rows)}
# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)
# Print the resulting DataFrame
display(df.head())
|   | Category | Value |
|---|---|---|
| 0 | A | 9 |
| 1 | B | 25 |
| 2 | B | 68 |
| 3 | A | 88 |
| 4 | B | 80 |
Now, let’s group the data by the ‘Category’ column and iterate through the groups:
grouped = df.groupby('Category')
for name, group in grouped:
print(f"Group: {name}")
display(group)
Group: A
|   | Category | Value |
|---|---|---|
| 0 | A | 9 |
| 3 | A | 88 |
| 7 | A | 95 |
| 8 | A | 53 |
| 9 | A | 99 |
Group: B
|   | Category | Value |
|---|---|---|
| 1 | B | 25 |
| 2 | B | 68 |
| 4 | B | 80 |
| 5 | B | 49 |
| 6 | B | 11 |
As you can see, the `groupby` method has created two groups based on the unique values in the ‘Category’ column (‘A’ and ‘B’). We used a `for` loop to iterate through each group; in each iteration, the variable `name` contains the group name (‘A’ or ‘B’), and the variable `group` contains the corresponding DataFrame for that group.
You can perform various operations on each group inside the loop or apply custom functions using the `apply` method on the grouped object. The `groupby` method is a powerful tool for data aggregation and analysis, making it easier to work with subsets of data based on certain criteria.
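Related to iteration, a single group can also be pulled out directly with `get_group`, without looping at all. A minimal sketch with made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({'Category': ['A', 'B', 'A'],
                        'Value': [1, 2, 3]})
grouped = df_demo.groupby('Category')

# Retrieve one group as a DataFrame, keeping its original row indices
group_a = grouped.get_group('A')
print(group_a)
```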
6.6.10. Filtering Groups (Optional Content)#
In Pandas, after grouping the data using the `groupby` method, you may want to filter the groups based on specific conditions. This can be done using the `filter` function of the grouped object. The `filter` function applies a function to each group and returns only the groups that satisfy the specified condition [Pandas Developers, 2023].
Let’s go through an example, reusing the ‘Category’/‘Value’ DataFrame from the previous section. We group the data by the ‘Category’ column and keep only the groups whose sum of ‘Value’ is greater than 30:
grouped = df.groupby('Category')
filtered_groups = grouped.filter(lambda group: group['Value'].sum() > 30)
display(filtered_groups)
|   | Category | Value |
|---|---|---|
| 0 | A | 9 |
| 1 | B | 25 |
| 2 | B | 68 |
| 3 | A | 88 |
| 4 | B | 80 |
| 5 | B | 49 |
| 6 | B | 11 |
| 7 | A | 95 |
| 8 | A | 53 |
| 9 | A | 99 |
In this example, the lambda function inside the `filter` method checks whether the sum of ‘Value’ in each group is greater than 30. Only the groups that satisfy this condition are returned in the `filtered_groups` DataFrame. (With this data, both groups pass the test, so every row is kept.)
You can use any custom function inside the `filter` method to perform more complex filtering operations based on your specific requirements.
Remember that the result of filtering will be a new DataFrame that contains only the rows from the original DataFrame that belong to the groups meeting the specified condition.
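A sketch of a stricter filter, this time with a named function and a mean-based condition (made-up data; `high_average` is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                        'Value': [5, 7, 50, 60]})

def high_average(group):
    """Keep only groups whose mean 'Value' exceeds 20."""
    return group['Value'].mean() > 20

# Group 'A' has mean 6 and is dropped; group 'B' has mean 55 and is kept
kept = df_demo.groupby('Category').filter(high_average)
print(kept)
```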
6.6.11. Transformation (Optional Content)#
In Pandas, the `groupby` transform operation is used to transform the grouped data, broadcasting the results back to the original DataFrame. The `transform` function returns a new DataFrame or Series with the same shape as the original, but with the transformed values based on the grouped data.
The `transform` method is often used to perform operations on each group and align the results with the original DataFrame, preserving the index and columns. This is especially useful when you want to add aggregated values back to each row of the original DataFrame.
Let’s go through an example using `groupby` with `transform`, again reusing the ‘Category’/‘Value’ DataFrame from above. We add a new column ‘GroupSum’ containing the sum of ‘Value’ for each ‘Category’:
grouped = df.groupby('Category')
display(df)
df['GroupSum'] = grouped['Value'].transform('sum')
display(df)
|   | Category | Value |
|---|---|---|
| 0 | A | 9 |
| 1 | B | 25 |
| 2 | B | 68 |
| 3 | A | 88 |
| 4 | B | 80 |
| 5 | B | 49 |
| 6 | B | 11 |
| 7 | A | 95 |
| 8 | A | 53 |
| 9 | A | 99 |

|   | Category | Value | GroupSum |
|---|---|---|---|
| 0 | A | 9 | 344 |
| 1 | B | 25 | 233 |
| 2 | B | 68 | 233 |
| 3 | A | 88 | 344 |
| 4 | B | 80 | 233 |
| 5 | B | 49 | 233 |
| 6 | B | 11 | 233 |
| 7 | A | 95 | 344 |
| 8 | A | 53 | 344 |
| 9 | A | 99 | 344 |
As you can see, the ‘GroupSum’ column is added to the original DataFrame, and each row now contains the sum of ‘Value’ for its corresponding ‘Category’.
You can use any function compatible with `groupby` (e.g., ‘sum’, ‘mean’, ‘min’, ‘max’, custom functions) with the `transform` method. This allows you to apply various transformations to the grouped data and broadcast the results back to the original DataFrame.
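A common use of `transform` is centring values within each group, i.e., subtracting each group's mean from its members. A minimal sketch with made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                        'Value': [10, 20, 1, 3]})

# transform('mean') broadcasts each group's mean back to its rows,
# so the subtraction is aligned with the original index
df_demo['Deviation'] = df_demo['Value'] - df_demo.groupby('Category')['Value'].transform('mean')
print(df_demo)
```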
6.6.12. Pandas pivot and melt#
The `pandas.DataFrame.pivot` method is a powerful feature in the pandas library for reshaping data in a DataFrame. It allows you to transform your data from a long format (where data is stored in rows) to a wide format (where data is stored in columns) or vice versa. This can be particularly useful when you need to analyze or visualize your data in a different structure.
Syntax:
DataFrame.pivot(index=None, columns=None, values=None)
- `index`: specifies the column whose unique values will become the new index (row labels) of the pivoted DataFrame. It can be a column name or a list of column names.
- `columns`: specifies the column whose unique values will become the new column headers of the pivoted DataFrame. It can be a column name or a list of column names.
- `values`: specifies the column(s) containing the data to populate the pivoted DataFrame. It can be a column name or a list of column names. If not provided, all remaining columns not used as `index` or `columns` will be used.

The full syntax is documented in the pandas API reference.
Example: We can use `pivot` to transform data from a long format (e.g., for time-series data) to a wide format, making it easier to analyze.
import pandas as pd
# Create a DataFrame with air quality data
# This dataset is a fictional dataset:
air_quality_data = pd.DataFrame({
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'Pollutant': ['CO', 'NO2', 'CO', 'NO2'],
'Value': [2.5, 20.1, 2.7, 22.3]
})
# Print the original air quality data DataFrame
print("Original Air Quality Data:")
display(air_quality_data)
# Pivot the air quality data to a wide format
pivoted_air_quality = air_quality_data.pivot(index='Date', columns='Pollutant', values='Value')
# Print the pivoted air quality data
print("\nPivoted Air Quality Data (Wide Format):")
display(pivoted_air_quality)
Original Air Quality Data:
|   | Date | Pollutant | Value |
|---|---|---|---|
| 0 | 2023-01-01 | CO | 2.5 |
| 1 | 2023-01-01 | NO2 | 20.1 |
| 2 | 2023-01-02 | CO | 2.7 |
| 3 | 2023-01-02 | NO2 | 22.3 |

Pivoted Air Quality Data (Wide Format):

| Date | CO | NO2 |
|---|---|---|
| 2023-01-01 | 2.5 | 20.1 |
| 2023-01-02 | 2.7 | 22.3 |
The `pandas.melt` function is a powerful tool for converting a DataFrame from wide format to long format, making it easier to work with certain types of data. It essentially “unpivots” a DataFrame by melting columns into rows. This can be particularly useful when you want to analyze or visualize data that is initially organized with column headers representing different variables.
Syntax:
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
- `frame`: specifies the DataFrame you want to melt.
- `id_vars`: specifies the column(s) that should remain as identifier variables (columns that won’t be melted). It can be a column name or a list of column names.
- `value_vars`: specifies the column(s) that should be melted (unpivoted). If not specified, all columns not listed in `id_vars` will be melted.
- `var_name`: specifies the name of the new column that will store the variable names (from the melted columns).
- `value_name`: specifies the name of the new column that will store the values (from the melted columns).
- `col_level`: used when working with MultiIndex columns; it specifies which level of the column index should be melted.

The full syntax is documented in the pandas API reference.
Example: Let’s apply the `melt` function to the previous air quality data example:
# Reset the index to convert 'Date' back to a regular column
pivoted_air_quality.reset_index(inplace=True)
display(pivoted_air_quality)
# Use the melt function to transform the data to long format
melted_air_quality = pd.melt(pivoted_air_quality, id_vars=['Date'], value_vars=['CO', 'NO2'],
var_name='Pollutant', value_name='Value')
# Print the melted air quality data in long format
print("Melted Air Quality Data (Long Format):")
display(melted_air_quality)
|   | Date | CO | NO2 |
|---|---|---|---|
| 0 | 2023-01-01 | 2.5 | 20.1 |
| 1 | 2023-01-02 | 2.7 | 22.3 |

Melted Air Quality Data (Long Format):

|   | Date | Pollutant | Value |
|---|---|---|---|
| 0 | 2023-01-01 | CO | 2.5 |
| 1 | 2023-01-02 | CO | 2.7 |
| 2 | 2023-01-01 | NO2 | 20.1 |
| 3 | 2023-01-02 | NO2 | 22.3 |
Example: Utilizing the `groupby` function in conjunction with `pivot`.
import numpy as np
import pandas as pd
# ===========================================
# Dataset
# ===========================================
# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)
# Define the number of rows in the DataFrame
num_rows = 100
# Generate random data using NumPy functions directly
data = {
'Product': rng.choice(['A', 'B'], num_rows),
'Region': rng.choice(['North', 'South'], num_rows),
'Sales': rng.integers(100, 201, size=num_rows),
'Price': rng.integers(10, 23, size=num_rows)
}
# Create a Pandas DataFrame from the generated data
df = pd.DataFrame(data)
display(df.head())
# ===========================================
# Calculate Total Sales by Region and Product
# ===========================================
# Grouping by 'Region' and 'Product' and calculating the total sales
grouped = df.groupby(['Region', 'Product'])['Sales'].sum().reset_index()
# Display the grouped DataFrame
print("Total Sales by Region and Product:")
display(grouped)
# ===========================================
# Pivot Data to Create a Summary Table
# ===========================================
# Pivoting the data to create a summary table
pivot_table = grouped.pivot(index='Region', columns='Product', values='Sales')
# Display the pivot table
print("\nSummary Table (Region vs. Product):")
display(pivot_table)
|   | Product | Region | Sales | Price |
|---|---|---|---|---|
| 0 | A | South | 136 | 14 |
| 1 | B | North | 191 | 20 |
| 2 | B | South | 150 | 10 |
| 3 | A | North | 170 | 21 |
| 4 | A | South | 146 | 17 |

Total Sales by Region and Product:

|   | Region | Product | Sales |
|---|---|---|---|
| 0 | North | A | 4242 |
| 1 | North | B | 3311 |
| 2 | South | A | 2922 |
| 3 | South | B | 4318 |

Summary Table (Region vs. Product):

| Region | A | B |
|---|---|---|
| North | 4242 | 3311 |
| South | 2922 | 4318 |
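Note that `DataFrame.pivot` raises an error if any (index, columns) pair occurs more than once. `pandas.pivot_table` handles duplicates by aggregating them, which collapses the groupby-then-pivot pattern above into a single call. A sketch with a small made-up sales table containing a duplicate (North, A) pair:

```python
import pandas as pd

sales = pd.DataFrame({'Region': ['North', 'North', 'South', 'North'],
                      'Product': ['A', 'A', 'B', 'B'],
                      'Sales': [100, 150, 200, 50]})

# pivot_table aggregates duplicate (index, columns) pairs instead of raising
table = pd.pivot_table(sales, index='Region', columns='Product',
                       values='Sales', aggfunc='sum')
print(table)
```

Combinations with no data (here, South/A) come out as NaN rather than being dropped.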