I would like to merge two DataFrames that contain repeated records in the merging column. Here’s an example:
Sample DataFrames:
import pandas as pd
import numpy as np
# Sample data for df1 and df2
df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
'tr2': [15, 45, 55, 35, 95, 75, 65]})
df = df1.merge(df2, how='outer')
This produces the following result:
df
id tr1 tr2
1 10 15.0
1 10 45.0
1 10 55.0
2 80 35.0
2 80 95.0
2 50 35.0
2 50 95.0
2 40 35.0
2 40 95.0
3 60 75.0
3 60 65.0
3 20 75.0
3 20 65.0
3 70 75.0
3 70 65.0
3 90 75.0
3 90 65.0
4 30 NaN
However, this isn’t quite correct. For instance, id = 1 should not have repeated values of tr1 as 10 for each of the tr2 values. To address this, I thought of adding record numbers to differentiate the rows and then discarding these numbers after merging:
df1 = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3, 4],
'tr1': [10, 80, 50, 40, 60, 20, 70, 90, 30],
'n1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
'tr2': [15, 45, 55, 35, 95, 75, 65],
'n2': [1, 2, 3, 4, 5, 6, 7]})
df = df1.merge(df2, how='outer')
# Remove duplicates based on record numbers
df['tr1'] = np.where(df.duplicated('n1', keep='first'), np.nan, df['tr1'])
df['tr2'] = np.where(df.duplicated('n2', keep='first'), np.nan, df['tr2'])
df = df[['id', 'tr1', 'tr2']]
# Drop rows where both 'tr1' and 'tr2' are NaN
df = df.dropna(subset=['tr1', 'tr2'], how='all')
This results in:
id tr1 tr2
1 10.0 15.0
1 NaN 45.0
1 NaN 55.0
2 80.0 35.0
2 NaN 95.0
2 50.0 NaN
2 40.0 NaN
3 60.0 75.0
3 NaN 65.0
3 20.0 NaN
3 70.0 NaN
3 90.0 NaN
4 30.0 NaN
This is an improvement, but still not exactly what I want. For example, id = 2 should not have NaN for tr1 in the second row. It should be:
2 80.0 35.0
2 50.0 95.0
2 40.0 NaN
And for id 3 for example:
3 60.0 75.0
3 20.0 65.0
3 70.0 NaN
3 90.0 NaN
The expected output should look like this:
Desired Output:
id tr1 tr2
1 10 15
1 nan 45
1 nan 55
2 80 35
2 50 95
2 40 nan
3 60 75
3 20 65
3 70 nan
3 90 nan
4 30 nan
Any idea how I can reach it?
0
You can solve this issue by adding a cumulative count to both DataFrames to identify each repeated value in the merging column uniquely. Then, merge the DataFrames on both the merging column and the cumulative count.
import pandas as pd
def merge_with_repeated_values(df1, df2, merge_column):
"""
Merge DataFrames with repeated values using cumulative count.
Args:
df1, df2 (pd.DataFrame): DataFrames to merge.
merge_column (str): Column to merge on.
Returns:
pd.DataFrame: Merged DataFrame with all aligned rows.
"""
# Add cumulative count to handle repeated values
temp_count_col = '__temp_count__'
df1[temp_count_col] = df1.groupby(merge_column).cumcount()
df2[temp_count_col] = df2.groupby(merge_column).cumcount()
# Merge on both original column and count
merged_df = pd.merge(df1, df2, on=[merge_column, temp_count_col], how='outer')
# Drop the temporary count column
return merged_df.drop(columns=[temp_count_col])
Code
When use merge
, attach the result of cumcount
to on
(left & right)
out = df1.merge(
df2,
left_on=['id', df1.groupby('id').cumcount()],
right_on=['id', df2.groupby('id').cumcount()],
how='outer'
)
out[['id', 'tr1', 'tr2']]
:
id tr1 tr2
0 1 10.0 15.0
1 1 NaN 45.0
2 1 NaN 55.0
3 2 80.0 35.0
4 2 50.0 95.0
5 2 40.0 NaN
6 3 60.0 75.0
7 3 20.0 65.0
8 3 70.0 NaN
9 3 90.0 NaN
10 4 30.0 NaN