I am trying to merge two dataframes in Python, pandas, df1 and df2.
I am trying to merge them on Column1, and then assign value of Column2 from df2 to df1.
This is my code:
df1 = df1.reset_index()
merged = df1.merge(df2, on='Column1', how='left', indicator=True, suffixes=('_df1', '_df2'))
mask = (
(df1['Column1'].notna()) &
(merged['_merge'] == 'both')
)
df1.loc[mask, 'Column2'] = merged.loc[mask, 'Column2_df2']
However, indices get mixed up. And when I print merged, the values are correctly matched. But when I print df1 after assigning values I can see that the values in Column2 are not assign properly, and I can see that the indices do not match indices printed out in merged table.
I have tried adding reset_index when using .loc, not reseting indices at all, but the result is the same.
Here is the example of my dataframes:
(I will include indices that are not related to this example of input, since I am working with big dataframes with milions of records, I cannot think of another way of putting this)
#input
df1 = pd.DataFrame({'Column1': ['A', 'B', None, 'D'],
'Column2': [None, None, None, None]
})
df2 = pd.DataFrame({'Column1': ['B', 'C', 'E', 'A'],
'Column2': [10, 20, 30, 40]
})
# output of merged:
Column1_df1 Column1_df2 Column2_df1 Column2_df2 _merge
44 A A None 40 both
57 B B None 10 both
# output of df1 after .loc:
Column1 Column2
0 A 30
1 B 10
3 None None
6 D None
# expected output of df1 after .loc:
Column1 Column2
44 A 30
57 B 40
3 None None
6 D None
Thank you in advance for your help!
4
IIUC, you could avoid using a merge
and rather go for a map
, which makes it easier to not have to track the indices:
df1['Column2'] = (df1['Column1']
.map(df2.set_index('Column1')['Column2'])
.fillna(df1['Column2'])
)
A fix of your approach would be:
merged = (df1.reset_index()
.merge(df2, on='Column1', how='left',
indicator=True,
suffixes=('_df1', '_df2'))
.set_index('index')
)
mask = df1['Column1'].notna() & merged['_merge'].eq('both')
df1.loc[mask, 'Column2'] = merged.loc[mask, 'Column2_df2']
Example:
# input
df1 = pd.DataFrame({'Column1': ['A', 'B', None, 'D'],
'Column2': [1, 2, 3, 4]
})
df2 = pd.DataFrame({'Column1': ['B', 'C', 'E', 'A'],
'Column2': [10, 20, 30, 40]
})
# output
Column1 Column2
0 A 40.0
1 B 10.0
2 None 3.0
3 D 4.0