Consider the following pl.DataFrames:
import datetime
import polars as pl

df_orig = pl.DataFrame(
    {
        "symbol": [*["A"] * 10, *["B"] * 8],
        "date": [
            *pl.datetime_range(
                start=datetime.date(2024, 1, 1),
                end=datetime.date(2024, 1, 10),
                eager=True,
            ),
            *pl.datetime_range(
                start=datetime.date(2024, 1, 1),
                end=datetime.date(2024, 1, 8),
                eager=True,
            ),
        ],
        "data": [*range(10), *range(8)],
    }
)
df_helper = pl.DataFrame({"symbol": ["A", "B"], "start_idx": [[0, 5], [0, 4]]})
chunk_size = 5

print(df_orig)
print(df_helper)
shape: (18, 3)
┌────────┬─────────────────────┬──────┐
│ symbol ┆ date ┆ data │
│ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ i64 │
╞════════╪═════════════════════╪══════╡
│ A ┆ 2024-01-01 00:00:00 ┆ 0 │
│ A ┆ 2024-01-02 00:00:00 ┆ 1 │
│ A ┆ 2024-01-03 00:00:00 ┆ 2 │
│ A ┆ 2024-01-04 00:00:00 ┆ 3 │
│ A ┆ 2024-01-05 00:00:00 ┆ 4 │
│ … ┆ … ┆ … │
│ B ┆ 2024-01-04 00:00:00 ┆ 3 │
│ B ┆ 2024-01-05 00:00:00 ┆ 4 │
│ B ┆ 2024-01-06 00:00:00 ┆ 5 │
│ B ┆ 2024-01-07 00:00:00 ┆ 6 │
│ B ┆ 2024-01-08 00:00:00 ┆ 7 │
└────────┴─────────────────────┴──────┘
shape: (2, 2)
┌────────┬───────────┐
│ symbol ┆ start_idx │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞════════╪═══════════╡
│ A ┆ [0, 5] │
│ B ┆ [0, 4] │
└────────┴───────────┘
Now, I need to split the dataframe into two chunks of length 5 (chunk_size), grouped by the symbol column. The column start_idx indicates the row at which each chunk starts within its group. That is, group A will be split into two chunks of length 5 starting at rows 0 and 5, while the chunks of group B start at rows 0 and 4.
Finally, all chunks need to be concatenated on axis=0, with a new column split_idx indicating which split each row comes from.
Here’s what I am looking for:
shape: (20, 4)
┌───────────┬────────┬─────────────────────┬──────┐
│ split_idx ┆ symbol ┆ date ┆ data │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ datetime[μs] ┆ i64 │
╞═══════════╪════════╪═════════════════════╪══════╡
│ 0 ┆ A ┆ 2024-01-01 00:00:00 ┆ 0 │
│ 0 ┆ A ┆ 2024-01-02 00:00:00 ┆ 1 │
│ 0 ┆ A ┆ 2024-01-03 00:00:00 ┆ 2 │
│ 0 ┆ A ┆ 2024-01-04 00:00:00 ┆ 3 │
│ 0 ┆ A ┆ 2024-01-05 00:00:00 ┆ 4 │
│ 0 ┆ B ┆ 2024-01-01 00:00:00 ┆ 0 │
│ 0 ┆ B ┆ 2024-01-02 00:00:00 ┆ 1 │
│ 0 ┆ B ┆ 2024-01-03 00:00:00 ┆ 2 │
│ 0 ┆ B ┆ 2024-01-04 00:00:00 ┆ 3 │
│ 0 ┆ B ┆ 2024-01-05 00:00:00 ┆ 4 │
│ 1 ┆ A ┆ 2024-01-06 00:00:00 ┆ 5 │
│ 1 ┆ A ┆ 2024-01-07 00:00:00 ┆ 6 │
│ 1 ┆ A ┆ 2024-01-08 00:00:00 ┆ 7 │
│ 1 ┆ A ┆ 2024-01-09 00:00:00 ┆ 8 │
│ 1 ┆ A ┆ 2024-01-10 00:00:00 ┆ 9 │
│ 1 ┆ B ┆ 2024-01-04 00:00:00 ┆ 3 │
│ 1 ┆ B ┆ 2024-01-05 00:00:00 ┆ 4 │
│ 1 ┆ B ┆ 2024-01-06 00:00:00 ┆ 5 │
│ 1 ┆ B ┆ 2024-01-07 00:00:00 ┆ 6 │
│ 1 ┆ B ┆ 2024-01-08 00:00:00 ┆ 7 │
└───────────┴────────┴─────────────────────┴──────┘
Keep in mind that the list in column start_idx may have a different length in each row. The length of each list determines the number of chunks for that group.