I have a set of data where a user action generates multiple events: a Configuration event that logs the products used, and then, for each used product, a Product Available event with the amount. These events may not be received in order and sometimes arrive with a slight delay.
Timestamp | Event | Data |
---|---|---|
25/07/2024 08:58:49 | Configuration | A,B |
25/07/2024 08:58:49 | Product Available | A,1 |
25/07/2024 08:58:51 | Product Available | B,3 |
25/07/2024 10:17:53 | Product Available | B,1 |
25/07/2024 10:17:53 | Configuration | B,C |
25/07/2024 10:17:53 | Product Available | C,4 |
Previously, I split the data into a configuration DataFrame and one DataFrame per product, with the Timestamp set as the index.
That let me look up the amount for each product for each configuration like this:
iloc_idx = product_A.index.get_indexer([configuration_timestamp], method='nearest')
It is not pretty, but works.
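For illustration, the nearest-timestamp lookup described above can be sketched like this (a toy per-product frame; the `product_A` name and values mirror the question):

```python
import pandas as pd

# Toy per-product frame indexed by timestamp, as described in the question
product_A = pd.DataFrame(
    {'Amount': [1, 1]},
    index=pd.to_datetime(['2024-07-25 08:58:49', '2024-07-25 10:18:12']),
)

configuration_timestamp = pd.Timestamp('2024-07-25 08:58:49')
iloc_idx = product_A.index.get_indexer([configuration_timestamp], method='nearest')
amount = product_A['Amount'].iloc[iloc_idx[0]]  # -> 1
```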
Now, the user can stage some actions and that leads to some events occurring simultaneously, which in turn breaks my index based idea.
Timestamp | Event | Data |
---|---|---|
25/07/2024 10:18:11 | Product Available | C,2 |
25/07/2024 10:18:11 | Product Available | B,1 |
25/07/2024 10:18:11 | Configuration | A,B |
25/07/2024 10:18:11 | Configuration | B,C |
25/07/2024 10:18:11 | Product Available | B,3 |
25/07/2024 10:18:12 | Product Available | A,1 |
import pandas as pd

data = {
'Timestamp': ['25/07/2024 08:58:49', '25/07/2024 08:58:49', '25/07/2024 08:58:51', '25/07/2024 10:17:53', '25/07/2024 10:17:53', '25/07/2024 10:17:53', '25/07/2024 10:18:11', '25/07/2024 10:18:11', '25/07/2024 10:18:11', '25/07/2024 10:18:11', '25/07/2024 10:18:11', '25/07/2024 10:18:12'],
'Event': ['Configuration', 'Product Available', 'Product Available', 'Product Available', 'Configuration', 'Product Available', 'Product Available', 'Product Available', 'Configuration', 'Configuration', 'Product Available', 'Product Available'],
'Data': ['A,B', 'A,1', 'B,3', 'B,1', 'B,C', 'C,4', 'C,2', 'B,1', 'A,B', 'B,C', 'B,3', 'A,1']
}
df = pd.DataFrame(data)
I can’t sort the data based on the configuration, since the configurations overlap in products, but I could combine them.
Is there a way to group my data so that I can see, for each product, the amount used?
You would need to approach this in a few steps:
- Converting `Timestamp` to datetime:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M:%S')
- Then separating the configuration and product events:
config_df = df[df['Event'] == 'Configuration'].copy()
product_df = df[df['Event'] == 'Product Available'].copy()
- Then extracting product and amount from the `Data` column in `product_df`:
product_df[['Product', 'Amount']] = product_df['Data'].str.split(',', expand=True)
product_df['Amount'] = product_df['Amount'].astype(int)
product_df = product_df[['Timestamp', 'Product', 'Amount']]
- Renaming and sorting both DataFrames by `Timestamp`:
config_df = config_df.rename(columns={'Timestamp': 'Configuration_Timestamp'})
product_df = product_df.rename(columns={'Timestamp': 'Product_Timestamp'})
product_df = product_df.sort_values('Product_Timestamp')
config_df = config_df.sort_values('Configuration_Timestamp')
- Merging product events with configuration events on timestamp using `merge_asof`:
merged_df = pd.merge_asof(product_df,
config_df,
left_on='Product_Timestamp',
right_on='Configuration_Timestamp',
direction='backward')
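Note that `merge_asof` requires both frames to be sorted on their keys, and when the right frame has duplicate timestamps (as with the two simultaneous configurations at 10:18:11), the backward search matches the last such row. A minimal sketch with toy values:

```python
import pandas as pd

# Two configurations share the same timestamp, as in the question
left = pd.DataFrame({'t': pd.to_datetime(['2024-07-25 10:18:11'])})
right = pd.DataFrame({
    't': pd.to_datetime(['2024-07-25 10:18:11', '2024-07-25 10:18:11']),
    'Data': ['A,B', 'B,C'],
})

# Backward asof: nearest key at or before the left key; ties resolve to the last row
out = pd.merge_asof(left, right, on='t', direction='backward')
print(out['Data'].iloc[0])  # -> 'B,C'
```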
- Displaying the result:
result_df = merged_df.groupby(['Configuration_Timestamp', 'Product'])['Amount'].sum().reset_index()
# Rename columns for clarity
result_df = result_df.rename(columns={'Amount': 'Total_Amount'})
print(result_df)
which will give you:
 | Configuration_Timestamp | Product | Total_Amount |
---|---|---|---|
0 | 2024-07-25 08:58:49 | A | 1 |
1 | 2024-07-25 08:58:49 | B | 3 |
2 | 2024-07-25 10:17:53 | B | 1 |
3 | 2024-07-25 10:17:53 | C | 4 |
4 | 2024-07-25 10:18:11 | A | 1 |
5 | 2024-07-25 10:18:11 | B | 4 |
6 | 2024-07-25 10:18:11 | C | 2 |
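Putting the steps together, here is a self-contained run with the sample data from the question:

```python
import pandas as pd

# Sample data from the question
data = {
    'Timestamp': ['25/07/2024 08:58:49', '25/07/2024 08:58:49', '25/07/2024 08:58:51',
                  '25/07/2024 10:17:53', '25/07/2024 10:17:53', '25/07/2024 10:17:53',
                  '25/07/2024 10:18:11', '25/07/2024 10:18:11', '25/07/2024 10:18:11',
                  '25/07/2024 10:18:11', '25/07/2024 10:18:11', '25/07/2024 10:18:12'],
    'Event': ['Configuration', 'Product Available', 'Product Available', 'Product Available',
              'Configuration', 'Product Available', 'Product Available', 'Product Available',
              'Configuration', 'Configuration', 'Product Available', 'Product Available'],
    'Data': ['A,B', 'A,1', 'B,3', 'B,1', 'B,C', 'C,4', 'C,2', 'B,1', 'A,B', 'B,C', 'B,3', 'A,1'],
}
df = pd.DataFrame(data)
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d/%m/%Y %H:%M:%S')

# Separate configuration and product events
config_df = df[df['Event'] == 'Configuration'].copy()
product_df = df[df['Event'] == 'Product Available'].copy()

# Extract product and amount from the Data column
product_df[['Product', 'Amount']] = product_df['Data'].str.split(',', expand=True)
product_df['Amount'] = product_df['Amount'].astype(int)
product_df = product_df[['Timestamp', 'Product', 'Amount']]

# Rename and sort (merge_asof requires sorted keys)
config_df = config_df.rename(columns={'Timestamp': 'Configuration_Timestamp'})
product_df = product_df.rename(columns={'Timestamp': 'Product_Timestamp'})
product_df = product_df.sort_values('Product_Timestamp')
config_df = config_df.sort_values('Configuration_Timestamp')

# Attach each product event to the most recent configuration at or before it
merged_df = pd.merge_asof(product_df, config_df,
                          left_on='Product_Timestamp',
                          right_on='Configuration_Timestamp',
                          direction='backward')

# Sum amounts per configuration and product
result_df = (merged_df.groupby(['Configuration_Timestamp', 'Product'])['Amount']
             .sum().reset_index()
             .rename(columns={'Amount': 'Total_Amount'}))
print(result_df)
```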