Pandas merge

5.9. Pandas `merge`#

Within the Pandas library, merge, join, and concat are fundamental functions that play a pivotal role in combining and merging data from various DataFrames. The merge function, in particular, facilitates the merging of two DataFrames based on common columns or indices, resembling the behavior of SQL joins. By specifying the on parameter, you can identify the shared column(s) for merging. This function supports various types of joins, such as inner, left, right, and outer joins. Here’s how you can use the merge function [Pandas Developers, 2023]:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
         copy=True, indicator=False, validate=None)

Parameters Explained:

left: The left DataFrame to be merged.
right: The right DataFrame to be merged.
how: The type of join to perform. Options include:
- 'inner': Only include rows with keys present in both DataFrames.
- 'outer': Include rows with keys from either DataFrame.
- 'left': Include all rows from the left DataFrame and matching rows from the right DataFrame.
- 'right': Include all rows from the right DataFrame and matching rows from the left DataFrame.
- 'cross': Perform a Cartesian product of rows from both DataFrames.
on: Column(s) to join on. It can be a single column name or a list of column names.
left_on: Column(s) from the left DataFrame to use as join keys.
right_on: Column(s) from the right DataFrame to use as join keys.
left_index: If True, use the left DataFrame’s index as a join key.
right_index: If True, use the right DataFrame’s index as a join key.
sort: Sort the result DataFrame by the join keys.
suffixes: Suffixes to add to overlapping column names from both DataFrames.
copy: If True, creates copies of input data.
indicator: If True, adds a column to indicate the source of each row.
validate: Validates the merge keys. Options include:
- 'one_to_one': Ensure that each key combination appears only once in both DataFrames.
- 'one_to_many': Ensure that each key combination appears only once in the left DataFrame and multiple times in the right DataFrame.
- 'many_to_one': Ensure that each key combination appears multiple times in the left DataFrame and only once in the right DataFrame.
- 'many_to_many': Allow multiple appearances of each key combination in both DataFrames.

For a comprehensive understanding of the function, you may refer to the complete documentation available here.

These are the different types of merge operations that you can specify when combining dataframes using the merge() function in pandas [McKinney, 2022]:

../_images/join_how.jpg — Fig. 5.17 Various types of circle overlaps and connections: **Left:** Two circles overlapping slightly on the left side. **Right:** Two circles overlapping slightly on the right side. **Outer:** Two circles with a larger overlap in the middle. **Inner:** One small circle entirely within a larger circle. **Cross:** Three pairs of numbers (1-1, 2-2, 3-3) connected by lines in a star or cross-like pattern within an encompassing circle.#

5.9.1. Inner Merge#

An inner merge in pandas returns only the rows where the keys (columns used for merging) exist in both dataframes. Think of it as finding the common ground between two datasets. For example, if you have two dataframes, one with employee names and IDs, and another with employee IDs and ages, an inner merge on the ID column will give you a dataframe that includes only the employees who appear in both dataframes, along with their names and ages.

Example:

import pandas as pd

# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3,], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 22]})

# Display the original DataFrames
print('df1:')
display(df1)
print('df1:')
display(df2)

# Perform an inner merge on 'ID' column
inner_merged = df1.merge(df2, on='ID', how='inner')

# Display the result of the inner merge
print('Mergerd df:')
display(inner_merged)

df1:

	ID	Name
0	1	Alice
1	2	Bob
2	3	Charlie

df1:

	ID	Age
0	2	25
1	3	30
2	4	22

Mergerd df:

	ID	Name	Age
0	2	Bob	25
1	3	Charlie	30

Fig. 5.18 illustrates the IDs present in two datasets, df1 and df2. The overlap shows IDs common to both datasets (2, 3), while the non-overlapping sections display unique IDs for each dataset (1 for df1 and 4 for df2). The merge using using inner is demonstrated.

../_images/pandas_merge_inner_example1.jpg — Fig. 5.18 Visualizing the inner merge for the above example.#

5.9.2. Outer Merge#

An outer merge, on the other hand, is more inclusive. It returns all rows from both dataframes, filling in missing values with NaN where there are no matches. This type of merge is useful when you want to combine all data from both sources, even if some entries don’t have corresponding matches. For instance, if one dataframe lists all employees and another lists only those who have completed a training program, an outer merge will show all employees, indicating who has completed the training and who hasn’t.

Example:

# Display the original DataFrames
print('df1:')
display(df1)
print('df2:')
display(df2)

# Perform an outer merge on 'ID' and store the result in 'outer_merged'
outer_merged = df1.merge(df2, on='ID', how='outer')

# Display the result of the outer merge
print('Merged DataFrame:')
display(outer_merged)

df1:

	ID	Name
0	1	Alice
1	2	Bob
2	3	Charlie

df2:

	ID	Age
0	2	25
1	3	30
2	4	22

Merged DataFrame:

	ID	Name	Age
0	1	Alice	NaN
1	2	Bob	25.0
2	3	Charlie	30.0
3	4	NaN	22.0

Fig. 5.19 illustrates the IDs present in two datasets, df1 and df2. The overlap shows IDs common to both datasets (2, 3), while the non-overlapping sections display unique IDs for each dataset (1 for df1 and 4 for df2). The merge using using outer is demonstrated.

../_images/pandas_merge_outer_example1.jpg — Fig. 5.19 Visualizing the outer merge for the above example.#

5.9.3. Left Merge#

A left merge focuses on retaining all rows from the left dataframe and matching them with rows from the right dataframe where possible. If a key exists in the left dataframe but not in the right, the resulting dataframe will still include the left dataframe’s rows, with NaN filling in for missing values from the right dataframe. This is particularly useful when you have a primary dataset and want to enrich it with additional information from another dataset, without losing any of the original data.

Example:

# Display the original DataFrames
print('df1:')
display(df1)
print('df2:')
display(df2)

# Perform a left merge on 'ID' and store the result in 'left_merged'
left_merged = df1.merge(df2, on='ID', how='left')

# Display the result of the left merge
print('Merged DataFrame:')
display(left_merged)

df1:

	ID	Name
0	1	Alice
1	2	Bob
2	3	Charlie

df2:

	ID	Age
0	2	25
1	3	30
2	4	22

Merged DataFrame:

	ID	Name	Age
0	1	Alice	NaN
1	2	Bob	25.0
2	3	Charlie	30.0

Fig. 5.20 illustrates the IDs present in two datasets, df1 and df2. The overlap shows IDs common to both datasets (2, 3), while the non-overlapping sections display unique IDs for each dataset (1 for df1 and 4 for df2). The merge using using left is demonstrated.

../_images/pandas_merge_left_example1.jpg — Fig. 5.20 Visualizing the left merge for the above example.#

5.9.4. Right Merge#

The right merge is similar to the left merge but works in the opposite direction. It keeps all rows from the right dataframe and includes matching rows from the left dataframe. If a key exists in the right dataframe but not in the left, the resulting dataframe will include the right dataframe’s rows, with NaN filling in for missing values from the left dataframe. This is handy when the right dataframe is your primary dataset and you want to add information from the left dataframe.

Example:

# Display the original DataFrames
print('df1:')
display(df1)
print('df2:')
display(df2)

# Perform a right merge on 'ID' and store the result in 'right_merged'
right_merged = df1.merge(df2, on='ID', how='right')

# Display the result of the right merge
print('Merged DataFrame:')
display(right_merged)

df1:

	ID	Name
0	1	Alice
1	2	Bob
2	3	Charlie

df2:

	ID	Age
0	2	25
1	3	30
2	4	22

Merged DataFrame:

	ID	Name	Age
0	2	Bob	25
1	3	Charlie	30
2	4	NaN	22

Fig. 5.21 illustrates the IDs present in two datasets, df1 and df2. The overlap shows IDs common to both datasets (2, 3), while the non-overlapping sections display unique IDs for each dataset (1 for df1 and 4 for df2). The merge using using right is demonstrated.

../_images/pandas_merge_right_example1.jpg — Fig. 5.21 Visualizing the right merge for the above example.#

5.9.5. Cross Join#

A cross join, also known as a Cartesian product, combines every row from the first dataframe with every row from the second dataframe. This results in a dataframe where the number of rows is the product of the number of rows in both input dataframes. While this can generate a large number of rows, it’s useful in scenarios where you need to pair each item from one dataset with every item from another, such as creating all possible combinations of products and stores for a sales analysis.

# Display the original DataFrames
print('df1:')
display(df1)
print('df2:')
display(df2)

# Perform a cross merge on 'ID' and store the result in 'cross_join'
cross_join = df1.merge(df2, how='cross', suffixes=('_df1', '_df2'))

# Display the result of the cross join
print('Merged DataFrame:')
display(cross_join)

df1:

	ID	Name
0	1	Alice
1	2	Bob
2	3	Charlie

df2:

	ID	Age
0	2	25
1	3	30
2	4	22

Merged DataFrame:

	ID_df1	Name	ID_df2	Age
0	1	Alice	2	25
1	1	Alice	3	30
2	1	Alice	4	22
3	2	Bob	2	25
4	2	Bob	3	30
5	2	Bob	4	22
6	3	Charlie	2	25
7	3	Charlie	3	30
8	3	Charlie	4	22

Table 5.5 Summary of Merge Types#
Merge Type	Description
`'inner'`	Retains rows with matching keys in both dataframes.
`'outer'`	Retains all rows, filling in missing values with `NaN` for non-matching keys.
`'left'`	Retains all rows from the left dataframe and matches them with rows from the right dataframe.
`'right'`	Retains all rows from the right dataframe and matches them with rows from the left dataframe.
`'cross'`	Generates a new dataframe with all possible combinations of rows from both dataframes.

Set Theory Notation for Joins

Inner Join:
- Set Theory: A ∩ B
- Description: Includes rows that have matching values in both A and B.
Outer Join (Full Outer Join):
- Set Theory: A ∪ B
- Description: Includes all rows from both A and B, with missing values for non-matching rows.
Left Join:
- Set Theory: A ∪ (A ∩ B)
- Description: Includes all rows from A and only the matching rows from B.
Right Join:
- Set Theory: B ∪ (A ∩ B)
- Description: Includes all rows from B and only the matching rows from A.
Cross Join (Cartesian Join):
- Set Theory: A × B
- Description: Creates all possible combinations of rows from A and B.
Semi-Join (Left or Right):
- Set Theory: A ∩ B
- Description: Includes rows from one DataFrame (A or B) that have matching values in the other DataFrame.
Anti-Join:
- Set Theory: B - (A ∩ B)
- Description: Includes rows from B that do not have matching values in A.

Table 5.6 summarizes the various types of joins using set theory notation and provides brief descriptions of each join operation.

Table 5.6 Summary of the various types of joins#
Join Type	Notation	Description
Inner Join	A ∩ B	Rows with matching values in both A and B
Outer Join	A ∪ B	All rows from both A and B, with missing values
Left Join	A ∪ (A ∩ B)	All rows from A, only matching rows from B
Right Join	B ∪ (A ∩ B)	All rows from B, only matching rows from A
Cross Join	A × B	All possible combinations of rows from A and B
Semi-Join	A ∩ B	Rows from one DataFrame (A or B) with matching values
Anti-Join	B - (A ∩ B)	Rows from B without matching values in A

5.9.6. Multi-Key Merging#

You can merge DataFrames on multiple columns by passing a list of column names to the on parameter. This allows you to combine data based on multiple keys, ensuring that only rows with matching values in all specified columns are merged.

merged_df = pd.merge(df1, df2, on=['key1', 'key2'])

Example:

import pandas as pd

# Create the first DataFrame
df1 = pd.DataFrame({'key1': ['A', 'B', 'C'],
                    'key2': ['X', 'Y', 'Z'],
                    'value': [10, 20, 30]
                    })

# Create the second DataFrame
df2 = pd.DataFrame({'key1': ['B', 'C', 'D'],
                    'key2': ['Y', 'Z', 'W'],
                    'value': [40, 50, 60]
                    })

# Display the first DataFrame
print('df1:')
display(df1)

# Display the second DataFrame
print('df2:')
display(df2)

# Merge the DataFrames based on the common keys 'key1' and 'key2'
merged_df = pd.merge(df1, df2, on=['key1', 'key2'])

# Display the merged DataFrame
print('Merged DataFrame:')
display(merged_df)

df1:

	key1	key2	value
0	A	X	10
1	B	Y	20
2	C	Z	30

df2:

	key1	key2	value
0	B	Y	40
1	C	Z	50
2	D	W	60

Merged DataFrame:

	key1	key2	value_x	value_y
0	B	Y	20	40
1	C	Z	30	50

The resulting DataFrame contains only the rows where both ‘key1’ and ‘key2’ values are common between df1 and df2. It shows the corresponding values from both DataFrames in separate columns (‘value_x’ and ‘value_y’).

5.9.7. Specifying Left and Right Prefixes#

When merging DataFrames with overlapping column names, you can specify suffixes to differentiate them. This is particularly useful to avoid confusion and ensure clarity in the resulting DataFrame.

merged_df = pd.merge(df1, df2, on='ID', suffixes=('_left', '_right'))

Example:

merged_df = pd.merge(df1, df2, on= ['key1', 'key2'], suffixes=('_left', '_right'))
display(merged_df)

	key1	key2	value_left	value_right
0	B	Y	20	40
1	C	Z	30	50

In this example, the resulting DataFrame will have columns with suffixes _left and _right to distinguish between the overlapping column names from df1 and df2.

5.9.8. Merging on Index#

You can merge DataFrames based on their indices using the left_index and right_index parameters. This is useful when the indices of the DataFrames contain the keys you want to merge on.

merged_df = pd.merge(df1, df2, left_index=True, right_index=True)

Example:

import pandas as pd

# Assuming df1 and df2 are already defined

# Set the index of df1 to 'key1'
df1.set_index('key1', inplace=True)

# Set the index of df2 to 'key1'
df2.set_index('key1', inplace=True)

# Merge the DataFrames based on their index (key1)
merged_df = pd.merge(df1, df2, left_index=True, right_index=True)

# Display the merged DataFrame
display(merged_df)

	key2_x	value_x	key2_y	value_y
key1
B	Y	20	Y	40
C	Z	30	Z	50

In this example, the DataFrames df1 and df2 are merged based on their indices, which are set to ‘key1’. The resulting DataFrame contains rows where the indices match in both DataFrames, combining the corresponding values from each DataFrame.

5.9.9. Merging and Filtering DataFrames in Pandas using merge() and query()#

In Pandas, the merge() function is used to combine two or more DataFrames based on a common column or index. The query() function is used to filter rows from a DataFrame based on a given condition. Although you can use both functions separately, there might be scenarios where you want to merge DataFrames and then filter the result using a query [Pandas Developers, 2023].

Example: Let’s assume we have two DataFrames, df1 and df2, and we want to merge them based on a common column, and then filter the merged result using a query condition.

import numpy as np
import pandas as pd

# Sample data for df1
df1 = pd.DataFrame({'ID': np.arange(0, 5),
                    'Value1': np.arange(0, 50, 10)
                    })
print('df1:')
display(df1)

# Sample data for df2
df2 = pd.DataFrame({'ID': np.arange(1, 8),
                    'Value2': np.arange(100, 800, 100)
                    })
print('df2:')
display(df2)

# Merge the two DataFrames on the 'ID' column using an inner join
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print('Merged:')
display(merged_df)

# Filter the merged DataFrame using the query function
filtered_df = merged_df.query('Value1 > 15 and Value2 < 400')
print('Filtered:')
display(filtered_df)

df1:

	ID	Value1
0	0	0
1	1	10
2	2	20
3	3	30
4	4	40

df2:

	ID	Value2
0	1	100
1	2	200
2	3	300
3	4	400
4	5	500
5	6	600
6	7	700

Merged:

	ID	Value1	Value2
0	1	10	100
1	2	20	200
2	3	30	300
3	4	40	400

Filtered:

	ID	Value1	Value2
1	2	20	200
2	3	30	300

In this example, we first create two DataFrames, df1 and df2, with a common column ‘ID’. We then merge these DataFrames on the ‘ID’ column using an inner join, which retains only the rows with matching ‘ID’ values in both DataFrames. After merging, we use the query() function to filter the merged DataFrame based on the condition that ‘Value1’ is greater than 15 and ‘Value2’ is less than 400. The resulting DataFrame contains only the rows that meet these criteria.

5.9.10. Semi-join and Anti-join#

Semi-join and anti-join are concepts in relational database management systems that involve combining two tables based on a common column but with specific filtering conditions. These concepts are useful for filtering and retrieving rows that meet certain criteria while excluding others.[Rioux, 2022, Pandas Developers, 2023].

../_images/semi_join.jpg — Fig. 5.22 Semi join and Anti-join#

Semi-Join: A semi-join is a type of join operation that returns rows from the first table that match the condition in the second table. In other words, it selects rows from one table where there is a match in the other table based on a specified column. However, unlike a regular join, only the columns from the first table are included in the result [Rioux, 2022, Pandas Developers, 2023].
Anti-Join: An anti-join, also known as an anti-semi-join or difference join, is the opposite of a semi-join. It returns rows from the first table that do not have a match in the second table based on a specified column. Essentially, it filters out rows from the first table that meet a certain condition found in the second table [Rioux, 2022, Pandas Developers, 2023].

Remark

Semi-joins are typically used to filter data based on a condition present in another table.
Anti-joins are useful for finding records that are not present in another table or for excluding certain records that meet a certain criterion.

Example:

import pandas as pd

# Assuming df1 and df2 are already defined

# Print and display df1
print('df1:')
display(df1)

# Print and display df2
print('df2:')
display(df2)

# Semi-Join: Get rows from df1 that have matching IDs in df2
semi_join_result = df1[df1['ID'].isin(df2['ID'])]

# Anti-Join: Get rows from df1 that do not have matching IDs in df2
anti_join_result = df1[~df1['ID'].isin(df2['ID'])]

# Print Semi-Join Result
print("Semi-Join Result:")
display(semi_join_result)

# Print Anti-Join Result
print("Anti-Join Result:")
display(anti_join_result)

df1:

	ID	Value1
0	0	0
1	1	10
2	2	20
3	3	30
4	4	40

df2:

	ID	Value2
0	1	100
1	2	200
2	3	300
3	4	400
4	5	500
5	6	600
6	7	700

Semi-Join Result:

	ID	Value1
1	1	10
2	2	20
3	3	30
4	4	40

Anti-Join Result:

	ID	Value1
0	0	0

In SQL, you can use the LEFT JOIN (for semi-join) and NOT IN or NOT EXISTS (for anti-join) constructs to achieve these types of joins.

In Pandas, you can simulate these types of joins using various methods. For example, to perform a semi-join, you could use the merge() function with the desired how parameter and then select the columns you want. For an anti-join, you could use the merge() function with the indicator parameter and then filter out the matching rows [Rioux, 2022, Pandas Developers, 2023].

Example:

# Semi Join: Customers who have made orders
semi_join_df = df1.merge(df2[['ID']], on='ID', how='left', indicator=True)
semi_join_df = semi_join_df.query('_merge == "both"').drop('_merge', axis=1)
print("Semi-Join Result:")
display(semi_join_df)

# Anti Join: Customers who have not made orders
anti_join_df = df1.merge(df2[['ID']], on='ID', how='left', indicator=True)
anti_join_df = anti_join_df.query('_merge == "left_only"').drop('_merge', axis=1)
print("Anti-Join Result:")
display(anti_join_df)

Semi-Join Result:

	ID	Value1
1	1	10
2	2	20
3	3	30
4	4	40

Anti-Join Result:

	ID	Value1
0	0	0

Comparison with Other Merge Types

Inner Join: Returns only the rows with matching keys in both DataFrames. Similar to a semi-join but includes columns from both DataFrames.
Left Join: Returns all rows from the left DataFrame and matching rows from the right DataFrame. Non-matching rows from the right DataFrame are filled with NaN.
Right Join: Returns all rows from the right DataFrame and matching rows from the left DataFrame. Non-matching rows from the left DataFrame are filled with NaN.
Outer Join: Returns all rows from both DataFrames, filling in NaN where there are no matches. This is more inclusive than a semi-join or anti-join.

Semi-joins and anti-joins are more specialized operations that focus on filtering data based on the presence or absence of matching keys in another DataFrame, whereas the standard merge operations (inner, left, right, outer) are used to combine data from multiple DataFrames in various ways.

5.9.11. Merge on Nearest Key#

The pd.merge_asof() function in Pandas is used to perform a merge based on the nearest key, particularly for ordered data. It’s often used when you want to match records in two datasets based on a key column, but the values in the key column may not match exactly. Instead of requiring exact matches, this function allows you to match the nearest key values within a specified tolerance [McKinney and others, 2010, Pandas Developers, 2023].

The syntax of the pd.merge_asof() function:

pd.merge_asof(left, right, on, by, left_on, right_on, left_index, right_index, direction, tolerance)

left: The left DataFrame to merge.
right: The right DataFrame to merge.
on: The column name to use as the key for merging. It must be present in both DataFrames.
by: A list of columns to be used as additional keys for merging.
left_on, right_on: Columns from the left and right DataFrames to use as keys instead of the on parameter.
left_index, right_index: If True, use the index of the left or right DataFrame as a key.
direction: Specifies whether to search for the nearest key in ‘backward’ or ‘forward’ direction.
tolerance: The maximum allowable difference between key values for a successful match.

Example: An explanation of the parameters using an example.

import pandas as pd

# Creating stock price and economic data DataFrames
stock_prices = pd.DataFrame({'timestamp': ['2023-01-01 09:00:00', '2023-01-01 09:15:00', '2023-01-01 09:30:00'],
                             'stock_symbol': ['AAPL', 'AAPL', 'AAPL'],
                             'price': [150.20, 152.40, 153.10]
                             })
print('Stock Prices:')
display(stock_prices)

economic_data = pd.DataFrame({'timestamp': ['2023-01-01 09:05:00', '2023-01-01 09:25:00', '2023-01-01 09:40:00'],
                              'indicator': ['GDP', 'Unemployment', 'Inflation'],
                              'value': [2.5, 4.8, 2.2]
                              })
print('Economic Data:')
display(economic_data)

# Converting timestamps to datetime format
stock_prices['timestamp'] = pd.to_datetime(stock_prices['timestamp'])
economic_data['timestamp'] = pd.to_datetime(economic_data['timestamp'])

# Merging based on the nearest key
merged_df = pd.merge_asof(stock_prices, economic_data, on='timestamp')

# Displaying the result
print('Merged DataFrame:')
display(merged_df)

Stock Prices:

	timestamp	stock_symbol	price
0	2023-01-01 09:00:00	AAPL	150.2
1	2023-01-01 09:15:00	AAPL	152.4
2	2023-01-01 09:30:00	AAPL	153.1

Economic Data:

	timestamp	indicator	value
0	2023-01-01 09:05:00	GDP	2.5
1	2023-01-01 09:25:00	Unemployment	4.8
2	2023-01-01 09:40:00	Inflation	2.2

Merged DataFrame:

	timestamp	stock_symbol	price	indicator	value
0	2023-01-01 09:00:00	AAPL	150.2	NaN	NaN
1	2023-01-01 09:15:00	AAPL	152.4	GDP	2.5
2	2023-01-01 09:30:00	AAPL	153.1	Unemployment	4.8

Remark

The choice between using pd.merge and pd.merge_asof depends on the specific use case and the desired merging behavior. Both methods are useful for merging datasets, but they have different purposes and are suited for different scenarios.

pd.merge is generally used for precise merges on exact match conditions, including merges on columns other than datetime. It provides options for different types of joins (e.g., inner, outer, left, right), and it requires an exact match on the specified merge column(s).

pd.merge_asof, on the other hand, is specifically designed for merging based on nearest key(s) with an optional tolerance. This method is particularly useful when you want to merge datasets based on datetime columns, and you want to match rows within a certain time range (tolerance) rather than requiring an exact match.

Here’s an example of when you might prefer to use pd.merge_asof:

import pandas as pd

# Sample DataFrames with datetime columns
data1 = {
    "datetime": ["2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01"],
    "value1": [10, 20, 30, 40],
}
df1 = pd.DataFrame(data1)

data2 = {
    "datetime": ["2023-01-15", "2023-03-01", "2023-03-15"],
    "value2": [100, 200, 300],
}
df2 = pd.DataFrame(data2)

# Convert the 'datetime' column to datetime format
df1['datetime'] = pd.to_datetime(df1['datetime'])
df2['datetime'] = pd.to_datetime(df2['datetime'])

# Merge based on nearest datetime with a tolerance
merged_df = pd.merge_asof(df1, df2, on='datetime', tolerance=pd.Timedelta(days=15))

# Print the merged DataFrame
print(merged_df)

    datetime  value1  value2
2023-01-01      10     NaN
2023-02-01      20     NaN
2023-03-01      30   200.0
2023-04-01      40     NaN

In this code, pd.merge_asof is used to merge df1 and df2 based on the “datetime” column with a tolerance of 15 days. This means rows from df2 will be matched to the nearest row in df1 based on the datetime column, allowing for a difference of up to 15 days.

5.9.12. Categorical Merging#

If your data has categorical variables, you can use categorical merging for better performance.

df1['category_column'] = df1['category_column'].astype('category')
df2['category_column'] = df2['category_column'].astype('category')
merged_df = pd.merge(df1, df2, on='category_column')

Example:

import pandas as pd
import numpy as np
import time

# Generating example data
n = int(1e4)  # Number of rows
categories = ['A', 'B', 'C', 'D', 'E']
np.random.seed(42)
df1 = pd.DataFrame({
    'id': np.arange(n),
    'category_column': np.random.choice(categories, size=n)
})
df2 = pd.DataFrame({
    'category_column': np.random.choice(categories, size=n),
    'value': np.random.randint(1, 100, size=n)
})

# Regular merging (without categorical data type)
start_time = time.time()
regular_merged_df = pd.merge(df1, df2, on='category_column')
regular_merge_time = time.time() - start_time

# Converting the 'category_column' to categorical data type
df1['category_column'] = df1['category_column'].astype('category')
df2['category_column'] = df2['category_column'].astype('category')

# Categorical merging
start_time = time.time()
categorical_merged_df = pd.merge(df1, df2, on='category_column')
categorical_merge_time = time.time() - start_time

# Displaying results
print(f"Regular Merge Time: {regular_merge_time:.3f}")
print(f"Categorical Merge Time: {categorical_merge_time:.3f}")

Regular Merge Time: 0.674
Categorical Merge Time: 0.561

In this example, we generate two DataFrames, df1 and df2, with a categorical column ‘category_column’. We then compare the performance of regular merging versus categorical merging.

Categorical merging can offer better performance, especially when dealing with large datasets, because categorical data types use integer-based codes internally, reducing memory usage and speeding up comparison operations. As a result, merging operations involving categorical columns can be more efficient.

Remember that the actual performance gain will depend on various factors including the size of your dataset and the specific operations being performed. Always consider the benefits of categorical data types based on your use case.

These merging techniques provide more flexibility and control over how you combine datasets using Pandas. Be sure to refer to the Pandas documentation for detailed information on these methods and their parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Pandas merge

Contents

5.9. Pandas merge#

5.9.1. Inner Merge#

5.9.2. Outer Merge#

5.9.3. Left Merge#

5.9.4. Right Merge#

5.9.5. Cross Join#

5.9.6. Multi-Key Merging#

5.9.7. Specifying Left and Right Prefixes#

5.9.8. Merging on Index#

5.9.9. Merging and Filtering DataFrames in Pandas using merge() and query()#

5.9.10. Semi-join and Anti-join#

5.9.11. Merge on Nearest Key#

5.9.12. Categorical Merging#

5.9. Pandas `merge`#