I have the following pandas dataframe:
# Create the DataFrame
df = pd.DataFrame({
'id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
'date': [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8, 9, 10, 11, 12],
'value': [11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25, 26, 27, 28]
})
df
id date value
0 1 1 11
1 1 2 12
2 1 3 13
3 1 4 14
4 1 5 15
5 1 6 16
6 1 7 17
7 1 8 18
8 2 5 21
9 2 6 22
10 2 7 23
11 2 8 24
12 2 9 25
13 2 10 26
14 2 11 27
15 2 12 28
I want to query the above dataframe, in a rolling window manner, for both ids
. The rolling window should be of size n
.
So, if n==2
, in the 1st iteration I would like to query this:
df.query('(id==1 and (date==1 or date==2)) or (id==2 and (date==5 or date==6))')
id date value
0 1 1 11
1 1 2 12
8 2 5 21
9 2 6 22
in the 2nd iteration I would like to query this:
df.query('(id==1 and (date==2 or date==3)) or (id==2 and (date==6 or date==7))')
id date value
1 1 2 12
2 1 3 13
9 2 6 22
10 2 7 23
in the 3rd iteration I would like to query this:
df.query('(id==1 and (date==3 or date==4)) or (id==2 and (date==7 or date==8))')
id date value
2 1 3 13
3 1 4 14
10 2 7 23
11 2 8 24
etc. How could I do that in pandas ? My data has around 500 ids
3
The exact expected logic is not fully clear, but assuming you want to loop over the groups/rolls, you could combine groupby.nth
with sliding_window_view
. By reusing the DataFrameGroupBy
object, will only need to compute the groups once:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv
n = 2
max_size = df['id'].value_counts(sort=False).max()
g = df.sort_values(by=['id', 'date']).groupby('id', sort=False)
for idx in swv(np.arange(max_size), n):
print(f'rows {idx}')
print(g.nth(idx))
Output:
rows [0 1]
id date value
0 1 1 11
1 1 2 12
8 2 5 21
9 2 6 22
rows [1 2]
id date value
1 1 2 12
2 1 3 13
9 2 6 22
10 2 7 23
rows [2 3]
id date value
2 1 3 13
3 1 4 14
10 2 7 23
11 2 8 24
rows [3 4]
id date value
3 1 4 14
4 1 5 15
11 2 8 24
12 2 9 25
rows [4 5]
id date value
4 1 5 15
5 1 6 16
12 2 9 25
13 2 10 26
rows [5 6]
id date value
5 1 6 16
6 1 7 17
13 2 10 26
14 2 11 27
rows [6 7]
id date value
6 1 7 17
7 1 8 18
14 2 11 27
15 2 12 28
Alternatively, and assuming groups with an identical size and sorted by id/date, using shifted indexing:
n = 2
ngroups = df['id'].nunique()
for idx in swv(np.arange(len(df)).reshape(-1, ngroups, order='F'), n, axis=0):
print(f'indices: {idx.ravel()}')
print(df.iloc[idx.flat])
Output:
indices: [0 1 8 9]
id date value
0 1 1 11
1 1 2 12
8 2 5 21
9 2 6 22
indices: [ 1 2 9 10]
id date value
1 1 2 12
2 1 3 13
9 2 6 22
10 2 7 23
indices: [ 2 3 10 11]
id date value
2 1 3 13
3 1 4 14
10 2 7 23
11 2 8 24
indices: [ 3 4 11 12]
id date value
3 1 4 14
4 1 5 15
11 2 8 24
12 2 9 25
indices: [ 4 5 12 13]
id date value
4 1 5 15
5 1 6 16
12 2 9 25
13 2 10 26
indices: [ 5 6 13 14]
id date value
5 1 6 16
6 1 7 17
13 2 10 26
14 2 11 27
indices: [ 6 7 14 15]
id date value
6 1 7 17
7 1 8 18
14 2 11 27
15 2 12 28
If I understand well what you expect, you can use a simple loop:
L = int(df.shape[0] / len(df["id"].unique())) # length for each id (here 16/2 = 8)
for i in range(L-1):
print(f"Iteration {i+1}:")
print(df.loc[[i, i+1, i+L, i+L+1]])
Iteration 1:
id date value
0 1 1 11
1 1 2 12
8 2 5 21
9 2 6 22
Iteration 2:
id date value
1 1 2 12
2 1 3 13
9 2 6 22
10 2 7 23
Iteration 3:
id date value
2 1 3 13
3 1 4 14
10 2 7 23
11 2 8 24
Iteration 4:
id date value
3 1 4 14
4 1 5 15
11 2 8 24
12 2 9 25
Iteration 5:
id date value
4 1 5 15
5 1 6 16
12 2 9 25
13 2 10 26
Iteration 6:
id date value
5 1 6 16
6 1 7 17
13 2 10 26
14 2 11 27
Iteration 7:
id date value
6 1 7 17
7 1 8 18
14 2 11 27
15 2 12 28
2
I attempted it by looping through each window and setting the start date for each unique id and create the query conditions for each ID dynamically.
n = 2
unique_ids = df['id'].unique()
start_dates = {uid: df[df['id'] == uid]['date'].min() for uid in unique_ids}
for iteration in range(len(df['date'].unique()) - n + 1):
query_conditions = []
for uid in unique_ids:
start_date = start_dates[uid] + iteration
end_date = start_date + n - 1
query_conditions.append(f"(id == {uid} and date >= {start_date} and date <= {end_date})")
query_string = " or ".join(query_conditions)
result = df.query(query_string)
print(f"Iteration {iteration + 1} Query:")
print(query_string)
print(result)
print("n")
Output
Iteration 1 Query:
(id == 1 and date >= 1 and date <= 2) or (id == 2 and date >= 5 and date <= 6)
id date value
0 1 1 11
1 1 2 12
8 2 5 21
9 2 6 22
Iteration 2 Query:
(id == 1 and date >= 2 and date <= 3) or (id == 2 and date >= 6 and date <= 7)
id date value
1 1 2 12
2 1 3 13
9 2 6 22
10 2 7 23
Iteration 3 Query:
(id == 1 and date >= 3 and date <= 4) or (id == 2 and date >= 7 and date <= 8)
id date value
2 1 3 13
3 1 4 14
10 2 7 23
11 2 8 24
Iteration 4 Query:
(id == 1 and date >= 4 and date <= 5) or (id == 2 and date >= 8 and date <= 9)
id date value
3 1 4 14
4 1 5 15
11 2 8 24
12 2 9 25