I have a DataFrame as shown below
DateTime Order Val1 Val2 Val3 Category Item Quantity Cost
0 01-01-2000 00:00 Offline 906437.46 507604.062 217544.598 Appliances Oven 654 1108.79
1 01-01-2000 00:00 Offline 71512.00 40046.720 17162.880 Clothing Sweater 2554 22.40
2 01-01-2000 00:00 Offline 74987.50 41993.000 17997.000 Clothing Shirt 4285 14.00
3 01-01-2000 00:00 Offline 1097057.80 614354.020 263294.580 Gadgets Ipad 590 1487.54
4 01-01-2000 00:00 Offline 725866.48 406489.104 174209.616 Gadgets HeadPhone 1384 419.58
5 01-01-2000 00:00 Offline 177827.28 99586.242 42679.818 Gadgets Ipod 1059 134.34
6 01-01-2000 00:00 Offline 88381.21 49491.211 21210.519 Utensils FryingPan 1619 43.67
7 01-01-2000 00:00 Offline 40555.44 22708.728 9748.872 Utensils Spoon 1656 19.59
8 01-01-2000 00:00 Online 150752.00 84421.120 36180.480 Clothing Jacket 2692 44.80
9 01-01-2000 00:00 Online 58861.59 32961.313 14126.277 Utensils Pot 841 55.99
10 02-01-2000 00:00 Offline 752395.25 421326.425 180574.575 Appliances Oven 475 1267.19
11 02-01-2000 00:00 Offline 105920.00 59315.200 25420.800 Clothing Jacket 1655 51.20
12 02-01-2000 00:00 Offline 79540.00 44542.400 19089.600 Clothing Shirt 3977 16.00
13 02-01-2000 00:00 Offline 511879.06 286651.078 122850.462 Gadgets HeadPhone 854 479.51
14 02-01-2000 00:00 Offline 62826.73 35181.559 15077.811 Utensils FryingPan 1007 49.91
15 02-01-2000 00:00 Offline 39801.78 22287.006 9551.574 Utensils Spoon 1422 22.39
16 02-01-2000 00:00 Online 77792.00 43563.520 18670.080 Clothing Sweater 2431 25.60
17 02-01-2000 00:00 Online 554638.05 310597.308 133113.132 Gadgets Ipad 261 1700.04
18 02-01-2000 00:00 Online 173870.46 97368.726 41729.454 Gadgets Ipod 906 153.53
19 02-01-2000 00:00 Online 60632.42 33953.094 14551.326 Utensils Pot 758 63.99
I want to write a lag function in which I can lag the numerical columns by x number of days, which I plan to use on different datasets. The function will get the category columns, date column and the numerical columns I want to lag in the data as input. I want the output to look like:
DateTime Order Val1 Val2 Val3 Category Item Quantity Cost Quantity_lag
0 01-01-2000 00:00 Offline 906437.46 507604.062 217544.598 Appliances Oven 654 1108.79 Nan
1 01-01-2000 00:00 Offline 71512.00 40046.720 17162.880 Clothing Sweater 2554 22.40 Nan
2 01-01-2000 00:00 Offline 74987.50 41993.000 17997.000 Clothing Shirt 4285 14.00 Nan
3 01-01-2000 00:00 Offline 1097057.80 614354.020 263294.580 Gadgets Ipad 590 1487.54 Nan
4 01-01-2000 00:00 Offline 725866.48 406489.104 174209.616 Gadgets HeadPhone 1384 419.58 Nan
5 01-01-2000 00:00 Offline 177827.28 99586.242 42679.818 Gadgets Ipod 1059 134.34 Nan
6 01-01-2000 00:00 Offline 88381.21 49491.211 21210.519 Utensils FryingPan 1619 43.67 Nan
7 01-01-2000 00:00 Offline 40555.44 22708.728 9748.872 Utensils Spoon 1656 19.59 Nan
8 01-01-2000 00:00 Online 150752.00 84421.120 36180.480 Clothing Jacket 2692 44.80 Nan
9 01-01-2000 00:00 Online 58861.59 32961.313 14126.277 Utensils Pot 841 55.99 Nan
10 02-01-2000 00:00 Offline 752395.25 421326.425 180574.575 Appliances Oven 475 1267.19 Nan
11 02-01-2000 00:00 Offline 105920.00 59315.200 25420.800 Clothing Jacket 1655 51.20 Nan
12 02-01-2000 00:00 Offline 79540.00 44542.400 19089.600 Clothing Shirt 3977 16.00 Nan
13 02-01-2000 00:00 Offline 511879.06 286651.078 122850.462 Gadgets HeadPhone 854 479.51 Nan
14 02-01-2000 00:00 Offline 62826.73 35181.559 15077.811 Utensils FryingPan 1007 49.91 Nan
15 02-01-2000 00:00 Offline 39801.78 22287.006 9551.574 Utensils Spoon 1422 22.39 Nan
16 02-01-2000 00:00 Online 77792.00 43563.520 18670.080 Clothing Sweater 2431 25.60 Nan
17 02-01-2000 00:00 Online 554638.05 310597.308 133113.132 Gadgets Ipad 261 1700.04 Nan
18 02-01-2000 00:00 Online 173870.46 97368.726 41729.454 Gadgets Ipod 906 153.53 Nan
19 02-01-2000 00:00 Online 60632.42 33953.094 14551.326 Utensils Pot 758 63.99 Nan
I tried using the following code:
groupby_columns = [datecolumn] + category_columns
df[datecolumn] = pd.to_datetime(df[datecolumn], format = '%d-%m-%Y %H:%M')
grp_col = df.merge(df.assign(DateTime = df[datecolumn]+pd.Timedelta(days=lag)), on=groupby_columns, how='left', suffixes = ['','_lag'])
which gives me the desired output but it lags all the numerical columns that are present inside the data. Is there a better method for doing this?
katzing is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.