Consider I have a dataset which has a date column that is generated everyday as given below.
DF_A
ID name qty date
1 abc 20 17/01/2022
1 abc 10 18/01/2022
2 def 10 24/01/2022
2 def 40 25/01/2022
2 def 67 26/01/2022
DF_B
ID name price_dt price
1 abc 18/01/2022 23.56
1 abc 17/01/2022 10.56
1 abc 16/01/2022 44.33
1 abc 15/01/2022 56.11
2 def 25/01/2022 2.98
2 def 26/01/2022 4.92
2 def 27/01/2022 4.88
2 def 24/01/2022 3.33
2 def 23/01/2022 8.47
2 def 22/01/2022 3.89
I’m joining the DF_A with DF_B and I only need the recent price_dt record that is less than the date column. This can be done my joining the 2 DF’s and dropping the duplicates by sorting the price_dt DESC but the challenge is the DF size is so huge and joining is not feasible. so im looking to reduce the rows in DF_B before joining.
Code that I tried
DF_C = pd.merge(DF_A,DF_B,on='ID',how='left')
# (This actually give 40 rows which is not optimum way of doing for larger dataset)
Expected_DF = DF_C.sort_values(by=['price_dt'], ascending=False)
Expected_DF = Expected_DF.drop_duplicates(subset=['ID','name','date'],keep='first')
Expected_DF:
ID name qty date price_dt price
1 abc 20 17/01/2022 16/01/2022 44.33
1 abc 10 18/01/2022 17/01/2022 10.56
2 def 10 24/01/2022 23/01/2022 8.47
2 def 40 25/01/2022 24/01/2022 3.33
2 def 67 26/01/2022 25/01/2022 2.98
Im looking for feasible method when I can reduce the memory usage instead of fetching all the matching records from DF_B
1
Use merge_asof
with allow_exact_matches=False
parameter:
DF_A['date'] = pd.to_datetime(DF_A['date'], dayfirst=True)
DF_B['price_dt'] = pd.to_datetime(DF_B['price_dt'], dayfirst=True)
out = pd.merge_asof(DF_A.sort_values('date'),
DF_B.sort_values('price_dt'),
left_on='date',
right_on='price_dt',
by=['ID','name'],
allow_exact_matches=False)
print (out)
ID name qty date price_dt price
0 1 abc 20 2022-01-17 2022-01-16 44.33
1 1 abc 10 2022-01-18 2022-01-17 10.56
2 2 def 10 2022-01-24 2022-01-23 8.47
3 2 def 40 2022-01-25 2022-01-24 3.33
4 2 def 67 2022-01-26 2022-01-25 2.98
EDIT: If need backward and forward direction together use:
DF_A['date'] = pd.to_datetime(DF_A['date'], dayfirst=True)
DF_B['price_dt'] = pd.to_datetime(DF_B['price_dt'], dayfirst=True)
out1 = pd.merge_asof(DF_A.sort_values('date'),
DF_B.sort_values('price_dt'),
left_on='date',
right_on='price_dt',
by=['ID','name'],
allow_exact_matches=False)
out2 = pd.merge_asof(DF_A.sort_values('date'),
DF_B.sort_values('price_dt'),
left_on='date',
right_on='price_dt',
by=['ID','name'],
allow_exact_matches=False,
direction='forward').rename(columns={'price_dt':'price_dt1',
'price':'price1'})
out = out1.merge(out2, how='left')
print (out)
ID name qty date price_dt price price_dt1 price1
0 1 abc 20 2022-01-17 2022-01-16 44.33 2022-01-18 23.56
1 1 abc 10 2022-01-18 2022-01-17 10.56 NaT NaN
2 2 def 10 2022-01-24 2022-01-23 8.47 2022-01-25 2.98
3 2 def 40 2022-01-25 2022-01-24 3.33 2022-01-26 4.92
4 2 def 67 2022-01-26 2022-01-25 2.98 2022-01-27 4.88
3