6.6. Aggregation and Grouping in Pandas#

In data analysis and manipulation, the concepts of aggregation and grouping play a crucial role, allowing us to derive insights from complex datasets. This section will delve deeper into the fundamental concepts of aggregation and grouping in the context of pandas, a powerful Python library for data manipulation and analysis.

6.6.1. Grouping Data with groupby()#

Grouping data involves categorizing and splitting a dataset into smaller subsets based on specific criteria. The groupby() function in pandas is a fundamental tool for accomplishing this task. It allows us to group a DataFrame based on one or more columns, enabling subsequent analysis and computation within these groups [McKinney, 2022, Pandas Developers, 2023].

6.6.1.1. Basic Syntax#

The basic syntax for using the groupby() function is as follows [Pandas Developers, 2023]:

grouped = df.groupby('column_name')

The full syntax can be found in the pandas groupby() documentation.

Example: This dataset records meteorological observations for the “UNIVERSITY OF CALGARY” weather station. It was obtained from https://climate.weather.gc.ca/ and includes the following columns:

  1. Station Name: The name of the weather station, which is “UNIVERSITY OF CALGARY” in this case.

  2. Date/Time: The date of the observation in the format M/D/YYYY.

  3. Year: The year of the observation, extracted from the Date/Time column.

  4. Month: The month of the observation, extracted from the Date/Time column.

  5. Day: The day of the observation, extracted from the Date/Time column.

  6. Max Temp (°C): The maximum temperature recorded on that date in degrees Celsius.

  7. Min Temp (°C): The minimum temperature recorded on that date in degrees Celsius.

  8. Mean Temp (°C): The mean temperature recorded on that date in degrees Celsius, typically calculated as the average of the maximum and minimum temperatures.

  9. Total Rain (mm): The total amount of rainfall in millimeters on that date.

  10. Total Snow (cm): The total amount of snowfall in centimeters on that date.

  11. Total Precip (mm): The cumulative total of precipitation, which combines rain and snow, in millimeters on that date.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Datasets/UofC_Daily_1990_Q1Q2.csv')

# Display the last few rows of the DataFrame
display(df.tail())

# Grouping by 'Month'
grouped = df.groupby('Month')
print('grouped would be an object as follows:')
print(grouped)
Station Name Date/Time Year Month Day Max Temp (°C) Min Temp (°C) Mean Temp (°C) Total Rain (mm) Total Snow (cm) Total Precip (mm)
176 UNIVERSITY OF CALGARY 6/26/1990 1990 6 26 26.0 11.5 18.8 0.0 0.0 0.0
177 UNIVERSITY OF CALGARY 6/27/1990 1990 6 27 22.0 10.0 16.0 6.8 0.0 6.8
178 UNIVERSITY OF CALGARY 6/28/1990 1990 6 28 21.0 8.0 14.5 1.4 0.0 1.4
179 UNIVERSITY OF CALGARY 6/29/1990 1990 6 29 22.0 7.0 14.5 3.3 0.0 3.3
180 UNIVERSITY OF CALGARY 6/30/1990 1990 6 30 24.5 12.0 18.3 0.5 0.0 0.5
grouped would be an object as follows:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F17FB728D0>
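Before any aggregation, the DataFrameGroupBy object itself can be inspected. A minimal sketch on a small hypothetical DataFrame (not the weather data above):

```python
import pandas as pd

# A small hypothetical DataFrame for illustration
df = pd.DataFrame({'Month': [1, 1, 2], 'Temp': [-5.0, -3.0, 1.5]})
grouped = df.groupby('Month')

# .groups maps each group key to the row labels belonging to it
print(grouped.groups)

# .get_group() retrieves one group's rows as a DataFrame
print(grouped.get_group(1))

# .ngroups is the number of distinct groups
print(grouped.ngroups)  # 2
```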

6.6.2. Aggregation: Extracting Insights from Groups#

Once you’ve successfully grouped your data using the groupby() function, the next step is to harness the power of aggregation. Aggregation involves summarizing and extracting meaningful information from the grouped data by applying various functions to each group. This process unveils hidden patterns, trends, and characteristics within your dataset.

Pandas offers a multitude of aggregation functions that allow you to compute summary statistics and insights for each group. These functions include [McKinney, 2022, Pandas Developers, 2023]:

  • Mean: Calculates the average value of a specific column within each group.

  • Sum: Computes the total sum of values in a column for each group.

  • Count: Determines the number of non-null values within a column for each group.

  • Max: Identifies the maximum value within a column for each group.

  • Min: Finds the minimum value within a column for each group.

  • …and many more.

def Line(n=40):
    print(n * "_")

grouped = df.groupby('Month')
# Applying aggregation functions on the groups

# Calculating the mean of 'Max Temp (°C)' for each group
average_values = grouped['Max Temp (°C)'].mean().round(2)
print("Average Values:")
print(average_values)
Line()

# Calculating the sum of 'Max Temp (°C)' for each group
sum_values = grouped['Max Temp (°C)'].sum().round(2)
print("Sum Values:")
print(sum_values)
Line()

# Counting the number of non-null values in each group
count_values = grouped['Max Temp (°C)'].count()
print("Count Values:")
print(count_values)
Line()

# Finding the maximum value in each group
max_value = grouped['Max Temp (°C)'].max().round(2)
print("Max Values:")
print(max_value)
Line()

# Finding the minimum value in each group
min_value = grouped['Max Temp (°C)'].min().round(2)
print("Min Values:")
print(min_value)
Line()
Average Values:
Month
1     0.73
2     0.70
3     7.71
4    10.50
5    14.58
6    19.13
Name: Max Temp (°C), dtype: float64
________________________________________
Sum Values:
Month
1     22.5
2     19.5
3    239.0
4    315.0
5    452.0
6    574.0
Name: Max Temp (°C), dtype: float64
________________________________________
Count Values:
Month
1    31
2    28
3    31
4    30
5    31
6    30
Name: Max Temp (°C), dtype: int64
________________________________________
Max Values:
Month
1    10.0
2    16.0
3    18.0
4    20.0
5    25.0
6    26.5
Name: Max Temp (°C), dtype: float64
________________________________________
Min Values:
Month
1   -26.0
2   -24.5
3    -9.0
4    -3.5
5     7.0
6     8.0
Name: Max Temp (°C), dtype: float64
________________________________________

The code then calculates various aggregation functions (mean(), sum(), count(), max(), min()) on the grouped data and prints out the results for each function. The Line() function is used to separate the output of each aggregation function for better readability.

Here’s a list of common aggregation and related functions available through pandas’ groupby() operation [Pandas Developers, 2023]:

  • agg(func): Applies a custom aggregation function to each group.

  • apply(func): Applies a function to each group.

  • count(): Counts the non-null values in each group.

  • cummax(): Computes the cumulative maximum of values in each group.

  • cummin(): Computes the cumulative minimum of values in each group.

  • cumprod(): Computes the cumulative product of values in each group.

  • cumsum(): Computes the cumulative sum of values in each group.

  • describe(): Generates summary statistics for each group.

  • first(): Returns the first element in each group.

  • last(): Returns the last element in each group.

  • max(): Finds the maximum value in each group.

  • mean(): Computes the mean (average) of values in each group.

  • median(): Calculates the median of values in each group.

  • min(): Finds the minimum value in each group.

  • nth(n): Returns the nth element in each group.

  • prod(): Calculates the product of values in each group.

  • quantile(q): Computes the qth quantile of values in each group.

  • size(): Determines the number of elements in each group.

  • std(): Computes the standard deviation of values in each group.

  • sum(): Calculates the sum of values in each group.

  • transform(func): Applies a function to each group and broadcasts the result.

  • var(): Calculates the variance of values in each group.

These aggregation functions provide a wide range of options for summarizing and analyzing data within each group formed by the groupby() operation in pandas.
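One distinction from the list above worth illustrating: size() counts every row in a group, while count() counts only the non-null values in a column. A small sketch with made-up data:

```python
import pandas as pd
import numpy as np

# Hypothetical data with one missing value in group 'A'
df = pd.DataFrame({'Category': ['A', 'A', 'B'],
                   'Value': [1.0, np.nan, 3.0]})
grouped = df.groupby('Category')

print(grouped.size())            # rows per group, NaN included: A -> 2, B -> 1
print(grouped['Value'].count())  # non-null values only: A -> 1, B -> 1
```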

6.6.3. Multiple aggregations#

You can apply multiple aggregation functions simultaneously using the agg() method.

Example - Exploring Store Sales Patterns: Imagine you’re analyzing sales data for a small boutique store. The store sells products from two categories, ‘A’ and ‘B’. You want to understand how the sales values for these categories differ.

You start by generating a random dataset using numpy and pandas to simulate the store’s sales records. You set a random seed for reproducibility and create 100 rows of data. Each row has a ‘Category’ column with either ‘A’ or ‘B’ values and a ‘Value’ column with random sales amounts between 1 and 100.

import numpy as np
import pandas as pd

# Set the random seed for reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random alphabet column with only 'A' and 'B'
random_alphabet = [rng.choice(['A', 'B']) for _ in range(num_rows)]

# Generate random numeric column
random_numeric = rng.integers(1, 101, size=num_rows)

# Create a Pandas DataFrame
data = {'Category': random_alphabet, 'Value': random_numeric}
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame
display(df.head())
Category Value
0 A 84
1 B 20
2 B 81
3 A 1
4 A 80
# Grouping by 'Category'
grouped = df.groupby('Category')

#  Applying multiple aggregation functions
agg_functions = {'Value': ['mean', 'sum', 'count', 'max', 'min']}
result = grouped.agg(agg_functions)
# or simply result = grouped.agg({'Value': ['mean', 'sum', 'count', 'max', 'min']})
display(result)
Value
mean sum count max min
Category
A 46.458333 2230 48 100 1
B 52.519231 2731 52 100 4

This code snippet applies multiple aggregation functions to a grouped dataset. It calculates statistics like mean, sum, count, maximum, and minimum values for the ‘Value’ column within each group.

6.6.4. Custom aggregation (Optional Content)#

You can also define custom aggregation functions using lambda functions or regular functions.

# Custom aggregation function
custom_agg = lambda x: x.max() - x.min()
result = grouped['Value'].agg(custom_agg)
display(result.to_frame('Value'))
Value
Category
A 99
B 96

This code employs a custom aggregation function to compute the range (difference between maximum and minimum) of the ‘Value’ column within each group.
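A regular named function works just as well as the lambda above, and its name documents the computation. A sketch with hypothetical ‘Category’/‘Value’ data:

```python
import pandas as pd

def value_range(x):
    """Spread of a group's values (max minus min)."""
    return x.max() - x.min()

# Hypothetical data mirroring the Category/Value example above
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                   'Value': [1, 100, 4, 100]})

result = df.groupby('Category')['Value'].agg(value_range)
print(result)  # A -> 99, B -> 96
```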

6.6.5. Grouping by multiple columns#

Suppose we have a dataset containing sales data for different products in different regions. We want to group the data by both “Product” and “Region” to analyze the total sales and average price for each combination.

import numpy as np
import pandas as pd

# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random data using NumPy functions directly
data = {'Product': rng.choice(['A', 'B'], num_rows),
        'Region': rng.choice(['North', 'South'], num_rows),
        'Sales': rng.integers(100, 201, size=num_rows),
        'Price': rng.integers(10, 23, size=num_rows)}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the resulting DataFrame
display(df.head())

# Grouping by both "Product" and "Region" and aggregating Sales and Price
result = df.groupby(['Product', 'Region']).agg({'Sales': 'sum', 'Price': 'mean'})

# Display the result
display(result)

# Convert the MultiIndex ('Product', 'Region') back into regular columns.
result.reset_index(inplace=True)
# Display the result
display(result)
Product Region Sales Price
0 A South 136 14
1 B North 191 20
2 B South 150 10
3 A North 170 21
4 A South 146 17
Sales Price
Product Region
A North 4242 16.214286
South 2922 16.150000
B North 3311 16.000000
South 4318 15.103448
Product Region Sales Price
0 A North 4242 16.214286
1 A South 2922 16.150000
2 B North 3311 16.000000
3 B South 4318 15.103448

In this example, we first create a DataFrame containing sales data for different products in different regions. Then, we group the data using groupby(['Product', 'Region']) to form multi-level groups based on both “Product” and “Region”. Finally, we calculate the total sales and average price for each combination using the agg() function with a dictionary specifying the aggregation functions for each column. The reset_index() function is used to convert the grouped results into a DataFrame for better presentation.
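As an alternative to reset_index(), unstack() pivots the inner index level (‘Region’) into columns rather than flattening everything into rows. A sketch with hypothetical sales figures:

```python
import pandas as pd

# Hypothetical sales data with the same shape as the example above
df = pd.DataFrame({'Product': ['A', 'A', 'B', 'B'],
                   'Region': ['North', 'South', 'North', 'South'],
                   'Sales': [10, 20, 30, 40]})

result = df.groupby(['Product', 'Region']).agg({'Sales': 'sum'})

# Move the 'Region' index level into columns: one row per Product
wide = result.unstack('Region')
print(wide)
```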

6.6.6. Applying different aggregations to different columns#

You can apply different aggregate functions to different columns using the agg() method with a dictionary. Consider a dataset containing information about sales of different products. We want to group the data by “Product” and calculate the total sales as well as the maximum and minimum prices for each product.

# Grouping by "Product"
grouped = df.groupby('Product')

# Applying different aggregations to different columns
result = grouped.agg({
    'Sales': 'sum',      # Calculate total sales for each product
    'Price': ['max', 'min']  # Calculate maximum and minimum prices for each product
    }).reset_index(drop = False)

display(result)
Product Sales Price
sum max min
0 A 7164 22 10
1 B 7629 22 10

This code groups a DataFrame df by the ‘Product’ column, then applies aggregation functions to calculate the total sales for each product and the maximum and minimum prices.

6.6.7. Named Aggregation#

Named Aggregation in pandas allows you to assign custom column names to the results of aggregation functions, making it easier to interpret and work with the summarized data. This feature was introduced in pandas version 0.25.

To use named aggregation, pass keyword arguments to the agg() method: each keyword becomes an output column name, and its value is a pd.NamedAgg (a named tuple specifying the input column and the aggregation function). This gives the resulting DataFrame more descriptive column names.

import numpy as np
import pandas as pd

# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random data using NumPy functions directly
data = {'Product': rng.choice(['A', 'B'], num_rows),
        'Region': rng.choice(['North', 'South'], num_rows),
        'Sales': rng.integers(100, 201, size=num_rows),
        'Price': rng.integers(10, 23, size=num_rows)}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# Group, aggregate, and display in one chain
grouped = df.groupby('Product').agg(TotalSales=pd.NamedAgg(column='Sales', aggfunc='sum'),
                                    MinPrice = pd.NamedAgg(column='Price', aggfunc='min'),
                                    AvgPrice = pd.NamedAgg(column='Price', aggfunc='mean'),
                                    MaxPrice = pd.NamedAgg(column='Price', aggfunc='max')
                                    ).reset_index(drop = False)
import re
# Adding spaces between words in column names that are in CamelCase.
grouped.columns = [re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', x) for x in grouped.columns]
display(grouped)
Product Total Sales Min Price Avg Price Max Price
0 A 7164 10 16.1875 22
1 B 7629 10 15.5000 22

The provided Python code uses the pandas library to work with a dataset containing product information, including sales and prices. It groups the data by product, and within each group, it calculates total sales, minimum price, average price, and maximum price. The code then organizes the results into a new DataFrame with appropriately named columns. It employs regular expressions to add spaces between words in column names that are in CamelCase.
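Each pd.NamedAgg can also be written as a plain (column, aggfunc) tuple, which is more concise. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical product data for illustration
df = pd.DataFrame({'Product': ['A', 'A', 'B'],
                   'Sales': [100, 200, 150],
                   'Price': [10, 20, 15]})

# (column, aggfunc) tuples are shorthand for pd.NamedAgg
result = df.groupby('Product').agg(TotalSales=('Sales', 'sum'),
                                   AvgPrice=('Price', 'mean'))
print(result)
```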

6.6.8. Grouping and Resampling Time Series Data#

6.6.8.1. resample Function for Time Series Analysis#

The resample function in pandas is a versatile tool designed specifically for time series data. It lets you change the frequency of time-indexed data and simultaneously apply aggregation operations to the resulting groups [Pandas Developers, 2023].

When applied to a pandas DataFrame or Series with a datetime index, resample partitions the time series into regular intervals (such as days, months, or years) and lets you apply aggregation functions within each interval. The general syntax is:

new_resampled_object = df.resample(rule)

In this context, rule symbolizes a string that defines the desired resampling frequency, such as 'D' for daily, 'M' for monthly, 'A' for annually, and so forth.

The strength of the resample function lies in its adaptability, making it straightforward to explore trends, patterns, and aggregated insights across different time intervals.

A comprehensive description is available in the pandas resample() documentation.

Example: Consider our initial example involving climate data from the University of Calgary station for the first half of 1990.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/HatefDastour/ENGG_680/main/Datasets/UofC_Daily_1990_Q1Q2.csv',
                parse_dates = ['Date/Time'])
df = df.set_index('Date/Time')

# Resampling by month start and calculating the mean temperature for each month
result = df.resample('MS').agg({'Mean Temp (°C)': 'mean'})

display(result)
Mean Temp (°C)
Date/Time
1990-01-01 -4.377419
1990-02-01 -6.096429
1990-03-01 1.225806
1990-04-01 4.573333
1990-05-01 8.854839
1990-06-01 13.296667

result = df.resample('MS').agg({'Mean Temp (°C)': 'mean'}): Here, resample is applied to the DataFrame df.

  • resample('MS') specifies the resampling frequency. ‘MS’ stands for “Month Start,” indicating that the data will be resampled at the beginning of each month.

  • agg({'Mean Temp (°C)': 'mean'}) defines the aggregation operation to be performed during resampling. In this case, it calculates the mean of the ‘Mean Temp (°C)’ column for each month.
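The same pattern extends to other frequencies and to several aggregation functions at once. A minimal sketch on a synthetic daily series (not the weather file):

```python
import pandas as pd
import numpy as np

# Synthetic daily series: values 0..58 over Jan 1 - Feb 28, 1990
idx = pd.date_range('1990-01-01', periods=59, freq='D')
s = pd.Series(np.arange(59, dtype=float), index=idx)

# Resample to month start, computing several statistics per month
monthly = s.resample('MS').agg(['mean', 'min', 'max'])
print(monthly)
```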

6.6.9. Group Iteration (Optional Content)#

In Pandas, you can use the groupby method to group data in a DataFrame based on one or more columns. Once you have grouped the data, you might want to perform operations on each group or apply custom functions to each group. Iterating through the grouped data can be done using the groupby object.

Let’s go through an example to illustrate how to iterate through the groups:

Suppose you have the following DataFrame:

import numpy as np
import pandas as pd
import random

# Set the random seed for reproducibility (optional)
np.random.seed(100)
random.seed(100)

# Define the number of rows in the DataFrame
num_rows = 10


# Create a dictionary with the generated data
data = {'Category': [random.choice(['A', 'B']) for _ in range(num_rows)],
        'Value': np.random.randint(1, 101, size=num_rows)}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# Print the resulting DataFrame
display(df.head())
Category Value
0 A 9
1 B 25
2 B 68
3 A 88
4 B 80

Now, let’s group the data by the ‘Category’ column and iterate through the groups:

grouped = df.groupby('Category')

for name, group in grouped:
    print(f"Group: {name}")
    display(group)
Group: A
Category Value
0 A 9
3 A 88
7 A 95
8 A 53
9 A 99
Group: B
Category Value
1 B 25
2 B 68
4 B 80
5 B 49
6 B 11

As you can see, the groupby method has created two groups based on the unique values in the ‘Category’ column (‘A’ and ‘B’). We used a for loop to iterate through each group, and in each iteration, the variable name contains the group name (‘A’ or ‘B’), and the variable group contains the corresponding DataFrame for that group.

You can perform various operations on each group inside the loop or apply custom functions using the apply method on the grouped object. The groupby method is a powerful tool for data aggregation and analysis, making it easier to work with subsets of data based on certain criteria.
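The apply method mentioned above runs a function on each group and stitches the results back together. A sketch with hypothetical data, taking the two largest values per group:

```python
import pandas as pd

# Hypothetical Category/Value data for illustration
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                   'Value': [9, 25, 88, 11]})

# For each group, return its two largest 'Value' entries
top2 = df.groupby('Category')['Value'].apply(lambda s: s.nlargest(2))
print(top2)
```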

6.6.10. Filtering Groups (Optional Content)#

In Pandas, after grouping the data using the groupby method, you may want to filter the groups based on specific conditions. This can be done using the filter function of the grouped object. The filter function allows you to apply a function to each group and returns only the groups that satisfy the specified condition [Pandas Developers, 2023].

Let’s go through an example to demonstrate how to filter groups in Pandas:

Using the same ‘Category’/‘Value’ DataFrame from the previous section, let’s group the data by the ‘Category’ column and filter the groups based on a condition, for example, keeping only the groups with a sum of ‘Value’ greater than 30:

grouped = df.groupby('Category')

filtered_groups = grouped.filter(lambda group: group['Value'].sum() > 30)
display(filtered_groups)
Category Value
0 A 9
1 B 25
2 B 68
3 A 88
4 B 80
5 B 49
6 B 11
7 A 95
8 A 53
9 A 99

In this example, the lambda function inside the filter method checks whether the sum of ‘Value’ in each group is greater than 30. Only the groups that satisfy this condition are returned in the filtered_groups DataFrame.

You can use any custom function inside the filter method to perform more complex filtering operations based on your specific requirements.

Remember that the result of filtering will be a new DataFrame that contains only the rows from the original DataFrame that belong to the groups meeting the specified condition.
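The same group-level filter can also be expressed as a boolean mask built with transform, which broadcasts each group's sum to every row. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data: group sums are A=35, B=40, C=5
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'C'],
                   'Value': [20, 15, 40, 5]})

# filter(): keep whole groups whose sum of 'Value' exceeds 30
kept_filter = df.groupby('Category').filter(lambda g: g['Value'].sum() > 30)

# Equivalent boolean mask built with transform
kept_mask = df[df.groupby('Category')['Value'].transform('sum') > 30]

print(kept_filter.equals(kept_mask))  # True
```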

6.6.11. Transformation (Optional Content)#

In Pandas, the groupby transform operation is used to transform the grouped data, broadcasting the results back to the original DataFrame. The transform function returns a new DataFrame or Series with the same shape as the original, but with the transformed values based on the grouped data.

The transform method is often used to perform operations on each group and align the results with the original DataFrame, preserving the index and columns. This is especially useful when you want to add aggregated values back to each row of the original DataFrame.

Let’s go through an example to demonstrate how to use groupby with transform:

Using the same ‘Category’/‘Value’ DataFrame as before, let’s use the groupby transform to add a new column ‘GroupSum’ containing the sum of ‘Value’ for each ‘Category’:

grouped = df.groupby('Category')
display(df)
df['GroupSum'] = grouped['Value'].transform('sum')
display(df)
Category Value
0 A 9
1 B 25
2 B 68
3 A 88
4 B 80
5 B 49
6 B 11
7 A 95
8 A 53
9 A 99
Category Value GroupSum
0 A 9 344
1 B 25 233
2 B 68 233
3 A 88 344
4 B 80 233
5 B 49 233
6 B 11 233
7 A 95 344
8 A 53 344
9 A 99 344

As you can see, the ‘GroupSum’ column is added to the original DataFrame, and each row now contains the sum of ‘Value’ for its corresponding ‘Category’.

You can use any function compatible with groupby (e.g., ‘sum’, ‘mean’, ‘min’, ‘max’, custom functions) with the transform method. This allows you to apply various transformations to the grouped data and broadcast the results back to the original DataFrame.
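transform also accepts custom functions, provided they return a scalar or a like-indexed sequence. A common use, sketched here with hypothetical data, is centering each value on its group mean:

```python
import pandas as pd

# Hypothetical data: group means are A=15.0, B=40.0
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                   'Value': [10.0, 20.0, 30.0, 50.0]})

# Subtract each group's mean from its members
df['Centered'] = df.groupby('Category')['Value'].transform(lambda s: s - s.mean())
print(df)
```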

6.6.12. Pandas pivot and melt#

The pandas.DataFrame.pivot method is a powerful feature in the pandas library for reshaping data in a DataFrame. It allows you to transform your data from a long format (where data is stored in rows) to a wide format (where data is stored in columns) or vice versa. This can be particularly useful when you need to analyze or visualize your data in a different structure.

Syntax:

DataFrame.pivot(index=None, columns=None, values=None)

  • index: This parameter specifies the column whose unique values will become the new index (row labels) of the pivoted DataFrame. It can be a column name or a list of column names.

  • columns: This parameter specifies the column whose unique values will become the new column headers of the pivoted DataFrame. It can be a column name or a list of column names.

  • values: This parameter specifies the column(s) containing the data to populate the pivoted DataFrame. It can be a column name or a list of column names. If not provided, all remaining columns not used as index or columns will be used.

The full syntax can be found in the pandas pivot() documentation.

Example: We can use pivot to transform data from a long format (e.g., for time-series data) to a wide format, making it easier to analyze.

import pandas as pd

# Create a DataFrame with air quality data
# This dataset is a fictional dataset:
air_quality_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Pollutant': ['CO', 'NO2', 'CO', 'NO2'],
    'Value': [2.5, 20.1, 2.7, 22.3]
})

# Print the original air quality data DataFrame
print("Original Air Quality Data:")
display(air_quality_data)

# Pivot the air quality data to a wide format
pivoted_air_quality = air_quality_data.pivot(index='Date', columns='Pollutant', values='Value')

# Print the pivoted air quality data
print("\nPivoted Air Quality Data (Wide Format):")
display(pivoted_air_quality)
Original Air Quality Data:
Date Pollutant Value
0 2023-01-01 CO 2.5
1 2023-01-01 NO2 20.1
2 2023-01-02 CO 2.7
3 2023-01-02 NO2 22.3
Pivoted Air Quality Data (Wide Format):
Pollutant CO NO2
Date
2023-01-01 2.5 20.1
2023-01-02 2.7 22.3
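One caveat: pivot raises a ValueError if any (index, columns) pair occurs more than once. In that case pivot_table can be used instead, since it aggregates the duplicates (mean by default). A sketch with hypothetical duplicate readings:

```python
import pandas as pd

# Duplicate (Date, Pollutant) pairs: two CO readings on the same day
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-01'],
                   'Pollutant': ['CO', 'CO'],
                   'Value': [2.0, 3.0]})

# pivot_table aggregates the duplicates (mean by default)
wide = df.pivot_table(index='Date', columns='Pollutant',
                      values='Value', aggfunc='mean')
print(wide)  # CO on 2023-01-01 -> 2.5
```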

The pandas.melt function is a powerful tool for converting a DataFrame from wide format to long format, making it easier to work with certain types of data. It essentially “unpivots” a DataFrame by melting columns into rows. This can be particularly useful when you want to analyze or visualize data that is initially organized with column headers representing different variables.

Syntax:

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

  • frame: This parameter specifies the DataFrame you want to melt.

  • id_vars: This parameter specifies the column(s) that should remain as identifier variables (columns that won’t be melted). It can be a column name or a list of column names.

  • value_vars: This parameter specifies the column(s) that should be melted (unpivoted). If not specified, all columns not listed in id_vars will be melted.

  • var_name: This parameter specifies the name of the new column that will store the variable names (from the melted columns).

  • value_name: This parameter specifies the name of the new column that will store the values (from the melted columns).

  • col_level: This parameter is used when working with MultiIndex columns. It specifies which level of the column index should be melted.

The full syntax can be found in the pandas melt() documentation.

Example: Let’s apply the melt function to the previous air quality data example:

# Reset the index to convert 'Date' back to a regular column
pivoted_air_quality.reset_index(inplace=True)
display(pivoted_air_quality)

# Use the melt function to transform the data to long format
melted_air_quality = pd.melt(pivoted_air_quality, id_vars=['Date'], value_vars=['CO', 'NO2'], 
                              var_name='Pollutant', value_name='Value')

# Print the melted air quality data in long format
print("Melted Air Quality Data (Long Format):")
display(melted_air_quality)
Pollutant Date CO NO2
0 2023-01-01 2.5 20.1
1 2023-01-02 2.7 22.3
Melted Air Quality Data (Long Format):
Date Pollutant Value
0 2023-01-01 CO 2.5
1 2023-01-02 CO 2.7
2 2023-01-01 NO2 20.1
3 2023-01-02 NO2 22.3

Example: Utilizing the groupby function in conjunction with pivot.

import numpy as np
import pandas as pd

# ===========================================
# Dataset
# ===========================================

# Set the random seed to ensure reproducibility
rng = np.random.default_rng(42)

# Define the number of rows in the DataFrame
num_rows = 100

# Generate random data using NumPy functions directly
data = {
    'Product': rng.choice(['A', 'B'], num_rows),
    'Region': rng.choice(['North', 'South'], num_rows),
    'Sales': rng.integers(100, 201, size=num_rows),
    'Price': rng.integers(10, 23, size=num_rows)
}

# Create a Pandas DataFrame from the generated data
df = pd.DataFrame(data)
display(df.head())

# ===========================================
# Calculate Total Sales by Region and Product
# ===========================================

# Grouping by 'Region' and 'Product' and calculating the total sales
grouped = df.groupby(['Region', 'Product'])['Sales'].sum().reset_index()

# Display the grouped DataFrame
print("Total Sales by Region and Product:")
display(grouped)

# ===========================================
# Pivot Data to Create a Summary Table
# ===========================================

# Pivoting the data to create a summary table
pivot_table = grouped.pivot(index='Region', columns='Product', values='Sales')

# Display the pivot table
print("\nSummary Table (Region vs. Product):")
display(pivot_table)
Product Region Sales Price
0 A South 136 14
1 B North 191 20
2 B South 150 10
3 A North 170 21
4 A South 146 17
Total Sales by Region and Product:
Region Product Sales
0 North A 4242
1 North B 3311
2 South A 2922
3 South B 4318
Summary Table (Region vs. Product):
Product A B
Region
North 4242 3311
South 2922 4318