I have a pandas DataFrame where each row has an identifier and the datetime when the row was created; the same identifier can appear in multiple rows.
id | creation_date |
---|---|
MP5694319-001 | 2024-05-29 12:58:00.489 |
MP5694319-001 | 2024-05-29 12:58:03.489 |
MP5694331-001 | 2024-05-29 12:58:05.489 |
MP5694319-001 | 2024-05-29 13:03:03.489 |
From this dataframe I need to keep the rows that share the same id and were created within 1 minute of each other. The expected result:
id | creation_date |
---|---|
MP5694319-001 | 2024-05-29 12:58:00.489 |
MP5694319-001 | 2024-05-29 12:58:03.489 |
The dataframe contains multiple rows with various ids, so I need to group them somehow and then compute the time differences, but I have no idea how to do this.
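A reproducible version of the sample data (taken from the table above):

import pandas as pd

df = pd.DataFrame({
    'id': ['MP5694319-001', 'MP5694319-001', 'MP5694331-001', 'MP5694319-001'],
    'creation_date': ['2024-05-29 12:58:00.489', '2024-05-29 12:58:03.489',
                      '2024-05-29 12:58:05.489', '2024-05-29 13:03:03.489'],
})
# parse the timestamps so time arithmetic works
df['creation_date'] = pd.to_datetime(df['creation_date'])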
You can use a self merge_asof with a tolerance; allow_exact_matches=False prevents a row from matching its own timestamp:
# data must be sorted by timestamp
df['creation_date'] = pd.to_datetime(df['creation_date'])
df = df.sort_values(by=['creation_date', 'id'])
out = (pd.merge_asof(df,
                     df[['id', 'creation_date']].assign(flag=True),
                     by='id', on='creation_date',
                     direction='nearest',
                     tolerance=pd.Timedelta('1min'),
                     allow_exact_matches=False)
         .loc[lambda x: x.pop('flag').eq(True)]
       )
Output:
              id           creation_date
0  MP5694319-001 2024-05-29 12:58:00.489
1  MP5694319-001 2024-05-29 12:58:03.489
If you want to identify which value was matched as the pair, rename the column in the second frame, then dropna:
out = (pd.merge_asof(df,
                     df[['id', 'creation_date']]
                       .rename(columns={'creation_date': 'creation_date_2'}),
                     by='id', left_on='creation_date',
                     right_on='creation_date_2',
                     direction='nearest',
                     tolerance=pd.Timedelta('1min'),
                     allow_exact_matches=False)
         .dropna(subset=['creation_date_2'])
       )
Output:
              id           creation_date         creation_date_2
0  MP5694319-001 2024-05-29 12:58:00.489 2024-05-29 12:58:03.489
1  MP5694319-001 2024-05-29 12:58:03.489 2024-05-29 12:58:00.489
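Note that merge_asof pairs each left row with at most one right row (the nearest same-id timestamp within the tolerance), so creation_date_2 only shows the closest neighbour even if several rows of the same id fall within the 1-minute window.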
You can use DataFrame.sort_values first, get the difference per group in both directions with DataFrameGroupBy.diff, and filter with Series.le (less than or equal) in boolean indexing:
df['creation_date'] = pd.to_datetime(df['creation_date'])
df = df.sort_values(['id','creation_date'])
g = df.groupby('id')['creation_date']
out = df[g.diff().le('1 min') | g.diff(-1).abs().le('1 min')]
print(out)
              id           creation_date
0  MP5694319-001 2024-05-29 12:58:00.489
1  MP5694319-001 2024-05-29 12:58:03.489
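To see why only the first two rows pass the filter, you can inspect the two intermediate differences (a small sketch reusing df and g from above; the diff_prev/diff_next names are just for illustration):

# hypothetical helper columns, only to visualise both directions
check = df.assign(diff_prev=g.diff(), diff_next=g.diff(-1).abs())
print(check)
# row 0: diff_prev NaT,       diff_next 3 seconds  -> kept
# row 1: diff_prev 3 seconds, diff_next 5 minutes  -> kept
# row 3: diff_prev 5 minutes, diff_next NaT        -> dropped
# row 2: diff_prev NaT,       diff_next NaT        -> dropped (only row for its id)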
Another idea is to back-fill the first value of the difference per group, so that the first row of each pair is included in the output:
df['creation_date'] = pd.to_datetime(df['creation_date'])
df = df.sort_values(['id','creation_date'])
g = df.groupby('id')['creation_date']
out = df[g.transform(lambda x: x.diff().bfill()).le('1 min')]
print(out)
              id           creation_date
0  MP5694319-001 2024-05-29 12:58:00.489
1  MP5694319-001 2024-05-29 12:58:03.489
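As a quick sanity check of the back-fill variant: an id that occurs only once keeps a NaT difference even after bfill, and a NaT comparison returns False, so such rows are dropped as well. A small sketch with a made-up extra id (MP0000000-001 is hypothetical):

# hypothetical lone id appended to the sample data
extra = pd.DataFrame({'id': ['MP0000000-001'],
                      'creation_date': pd.to_datetime(['2024-05-29 14:00:00.000'])})
df2 = pd.concat([df, extra], ignore_index=True).sort_values(['id', 'creation_date'])

g2 = df2.groupby('id')['creation_date']
out2 = df2[g2.transform(lambda x: x.diff().bfill()).le('1 min')]
# out2 still contains only the two MP5694319-001 rows created 3 seconds apart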