data1 = {
'alias_cd': ['12345', '12345', '12345'],
'country_cd': ['AU', 'AU', 'AU2'],
'pos_name': ['st1', 'Jh', 'Jh'],
'ts_allocated': [100, 100, 100],
'tr_id': ['None', 'None', 'None'],
'ty_name': ['E2E', 'E2E', 'E2E']
}
data2 = {
'alias_cd': ['12345', '12345'],
'country_cd': ['AU', 'AU3'],
'pos_name': ['st1', 'st2'],
'ts_allocated': [200, 100],
'tr_id': ['None', 'None'],
'ty_name': ['E2E', 'E2E']
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
output should be
alias_cd country_cd pos_name ts_allocated tr_id ty_name etl_flag
1 12345 AU st1 200 None E2E U
2 12345 AU3 st2 100 None E2E D
3 12345 AU st1 100 None E2E I
4 12345 AU Jh 100 None E2E I
5 12345 AU2 Jh 100 None E2E I
Because:
The combination of alias_cd and country_cd acts as a primary key.
1.If a combination exists in df2 and df1 (12345 AU
), it will be marked for ‘Update’ in df2, and all corresponding rows in df1 for the same combination will be marked as ‘Insert’. for the above example for 12345 AU
records in df2 will be etl_flag= ‘Update’ and add records for the same combination from df1 to df2 with etl_flag as ‘Insert’
2.12345 AU3
exists in df2 but not in df1, so it will be tagged as ‘DELETE’ in the etl_flag column.
3.Any new combination that appears in df1 and not present in df2 will be tagged as ‘Insert’ in the etl_flag column.
How can I achieve this efficiently?
This is what I tried but its doesn’t give correct output:
df2['etl_flag'] = 'U'
to_insert = df1[~df1.apply(lambda x: (df2['alias_cd'] == x['alias_cd']) & (df2['country_cd'] == x['country_cd']), axis=1)]
to_insert['etl_flag'] = 'I'
df2 = pd.concat([df2, to_insert], ignore_index=True)
to_delete = df2[~df2.apply(lambda x: (df1['alias_cd'] == x['alias_cd']) & (df1['country_cd'] == x['country_cd']), axis=1)]
to_delete['etl_flag'] = 'D'
final_df = pd.concat([df2, to_delete], ignore_index=True)
final_df.sort_values(by=['alias_cd', 'country_cd'], inplace=True)
print(final_df[['alias_cd', 'country_cd', 'pos_name', 'ts_allocated', 'tr_id', 'ty_name', 'etl_flag']])