I have a dataset of transactions with, for example, the columns: UserID, IBAN, Timestamp, Amount.
I want to consider a rolling window of 24h for each user, taking into account a single IBAN: for example, I want to count the number of transactions towards the same IBAN from the same user in the 24h preceding the current transaction.
I’ve done it with a single element in the group by, considering only the UserID column, like this:
from typing import List, Union

import pandas as pd


def sort_and_reset(df_to_sort: pd.DataFrame, by: Union[List[str], str] = "Timestamp", inplace: bool = True):
    """
    Sorts a DataFrame by Timestamp (unless otherwise specified), then resets its index.
    With inplace=True both operations modify df_to_sort directly; otherwise a new DataFrame is returned.
    """
    if inplace:
        df_to_sort.sort_values(by=by, inplace=True)
        df_to_sort.reset_index(drop=True, inplace=True)
    else:
        df = df_to_sort.sort_values(by=by)
        df = df.reset_index(drop=True)
        return df
def add_rolling_count_feature(df: pd.DataFrame, time_column_name: str, groupby_on: str, agg_on: str, agg_period: str) -> pd.DataFrame:
    """
    Adds new columns to the DataFrame with the rolling count of occurrences
    and the rolling sum of the amounts of a specified feature within a given
    time period for each group.

    Parameters:
    - df: pd.DataFrame
        The original DataFrame containing the data.
    - time_column_name: str
        The name of the column containing the time or datetime values.
    - groupby_on: str
        The name of the column to group the data by (e.g., user ID, card number).
    - agg_on: str
        The name of the column over which the rolling aggregations are calculated, e.g. "Amount".
    - agg_period: str
        The rolling window period as a string (e.g., '30D' for 30 days, '24h' for 24 hours).

    Returns:
    - pd.DataFrame
        The original DataFrame with additional columns containing the rolling aggregations.

    Example:
        add_rolling_count_feature(df, time_column_name='Time', groupby_on='UserID', agg_on='Amount', agg_period='30D')
    """
    # Ensure the time column is in datetime format (this check may not be needed)
    df[time_column_name] = pd.to_datetime(df[time_column_name])
    # Sort by user and time, then reset the index
    df = sort_and_reset(df, by=["UserID", "Timestamp"], inplace=False)
    # Set the time column as the index
    df = df.set_index(time_column_name)
    # groupby_on is a single column here.
    # Compute the rolling count of the specified feature
    df["Xa1"] = (df.groupby(groupby_on)[agg_on]
                   .rolling(agg_period, min_periods=0, closed="left")
                   .count()
                   .reset_index(level=0, drop=True))
    # Compute the rolling sum and add it as a new column
    df["Xa2"] = (df.groupby(groupby_on)[agg_on]
                   .rolling(agg_period, min_periods=0, closed="left")
                   .sum()
                   .reset_index(level=0, drop=True))  # Drop the groupby index level
    # Reset the index to bring the time column back as a regular column
    df = df.reset_index()
    return df
But when I try with 2 columns in the group by, it doesn’t work, or better said, it returns all NaN values (the window is right, and it should return something). I also changed the .reset_index(drop=True) to drop all the levels. I think there is some problem with the multi-index.
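To illustrate what I think goes wrong, here is a minimal sketch (same column names as above) showing the index mismatch with a two-column group key:

```python
import pandas as pd

# Minimal reproduction of the suspected index mismatch when grouping
# on two columns (same column names as in the question).
df = pd.DataFrame({
    'UserID': [1, 1],
    'IBAN': ['A', 'A'],
    'Timestamp': pd.to_datetime(['2024-09-01 10:00', '2024-09-01 12:00']),
    'Amount': [100, 150],
}).set_index('Timestamp')

counts = (df.groupby(['UserID', 'IBAN'])['Amount']
            .rolling('24h', min_periods=0, closed='left')
            .count())

# The result carries a three-level index, not just the Timestamp:
print(list(counts.index.names))  # ['UserID', 'IBAN', 'Timestamp']

# Dropping only level 0 still leaves an (IBAN, Timestamp) MultiIndex,
# which cannot align with the frame's plain Timestamp index, hence NaN
# when it is assigned back as a new column:
print(counts.reset_index(level=0, drop=True).index.nlevels)  # 2
```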
An example of the input df:
df_x = pd.DataFrame({
'UserID': [1, 1, 1, 2, 2, 3, 3, 3],
'Timestamp': [
'2024-09-01 10:00:00',
'2024-09-01 11:00:00',
'2024-09-01 12:00:00',
'2024-09-01 10:00:00',
'2024-09-01 12:00:00',
'2024-09-01 09:00:00',
'2024-09-01 10:00:00',
'2024-09-01 11:00:00'
],
'IBAN': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B'],
'Amount': [100, 200, 150, 300, 250, 120, 180, 220]
})
df_x = add_rolling_count_feature(df_x, "Timestamp", "UserID", "Amount", "24h")
Your original approach messes up the index. Since you have duplicated dates across the groups, the safest option is to merge the data back:
def add_rolling_count_feature(df: pd.DataFrame, time_column_name: str, groupby_on: Union[str, List[str]], agg_on: str, agg_period: str) -> pd.DataFrame:
    # Ensure the time column is in datetime format (this check may not be needed)
    df[time_column_name] = pd.to_datetime(df[time_column_name])
    # Sort by user and time, then reset the index
    df = sort_and_reset(df, by=["UserID", "Timestamp"], inplace=False)
    # Prepare the groupby.rolling object
    r = (df.set_index(time_column_name)
           .groupby(groupby_on)[agg_on]
           .rolling(agg_period, min_periods=0, closed='left')
         )
    # Compute count/sum and merge back onto the input,
    # aligning on the shared group/time columns
    return (df.merge(pd.concat({'Xa1': r.count(), 'Xa2': r.sum()}, axis=1)
                       .reset_index(), how='left')
              .set_index(df.index)
            )
Example:
add_rolling_count_feature(df_x, 'Timestamp', ['UserID', 'IBAN'], 'Amount', '24h')
   UserID           Timestamp IBAN  Amount  Xa1    Xa2
0       1 2024-09-01 10:00:00    A     100  0.0    0.0
1       1 2024-09-01 11:00:00    B     200  0.0    0.0
2       1 2024-09-01 12:00:00    A     150  1.0  100.0
3       2 2024-09-01 10:00:00    A     300  0.0    0.0
4       2 2024-09-01 12:00:00    B     250  0.0    0.0
5       3 2024-09-01 09:00:00    A     120  0.0    0.0
6       3 2024-09-01 10:00:00    A     180  1.0  120.0
7       3 2024-09-01 11:00:00    B     220  0.0    0.0
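To sanity-check this end to end, here is a self-contained sketch (functions inlined, same data and names as above) that verifies the per-(UserID, IBAN) counts:

```python
import pandas as pd

# Self-contained check of the merge-based approach with a two-column
# group key; mirrors the function above, inlined for brevity.
def add_rolling_features(df, time_col, groupby_on, agg_on, period):
    df[time_col] = pd.to_datetime(df[time_col])
    df = df.sort_values(['UserID', time_col]).reset_index(drop=True)
    r = (df.set_index(time_col)
           .groupby(groupby_on)[agg_on]
           .rolling(period, min_periods=0, closed='left'))
    # Merge aligns on the shared UserID/IBAN/Timestamp columns
    return (df.merge(pd.concat({'Xa1': r.count(), 'Xa2': r.sum()}, axis=1)
                       .reset_index(), how='left')
              .set_index(df.index))

df_x = pd.DataFrame({
    'UserID': [1, 1, 1, 2, 2, 3, 3, 3],
    'Timestamp': ['2024-09-01 10:00:00', '2024-09-01 11:00:00',
                  '2024-09-01 12:00:00', '2024-09-01 10:00:00',
                  '2024-09-01 12:00:00', '2024-09-01 09:00:00',
                  '2024-09-01 10:00:00', '2024-09-01 11:00:00'],
    'IBAN': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B'],
    'Amount': [100, 200, 150, 300, 250, 120, 180, 220],
})

out = add_rolling_features(df_x, 'Timestamp', ['UserID', 'IBAN'], 'Amount', '24h')
# Only rows 2 (User 1 / IBAN A) and 6 (User 3 / IBAN A) have a prior
# transaction to the same IBAN within the preceding 24h.
print(out[['Xa1', 'Xa2']])
```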