Thiết kế website giá rẻ

Question

Consider the following pl.DataFrames:

import datetime

import polars as pl

df_orig = pl.DataFrame(
    {
        "symbol": [*["A"] * 10, *["B"] * 8],
        "date": [
            *pl.datetime_range(
                start=datetime.date(2024, 1, 1),
                end=datetime.date(2024, 1, 10),
                eager=True,
            ),
            *pl.datetime_range(
                start=datetime.date(2024, 1, 1),
                end=datetime.date(2024, 1, 8),
                eager=True,
            ),
        ],
        "data": [*range(10), *range(8)],
    }
)

df_helper = pl.DataFrame({"symbol": ["A", "B"], "start_idx": [[0, 5], [0, 3]]})

chunk_size = 5

print(df_orig)
print(df_helper)


shape: (18, 3)
┌────────┬─────────────────────┬──────┐
│ symbol ┆ date                ┆ data │
│ ---    ┆ ---                 ┆ ---  │
│ str    ┆ datetime[μs]        ┆ i64  │
╞════════╪═════════════════════╪══════╡
│ A      ┆ 2024-01-01 00:00:00 ┆ 0    │
│ A      ┆ 2024-01-02 00:00:00 ┆ 1    │
│ A      ┆ 2024-01-03 00:00:00 ┆ 2    │
│ A      ┆ 2024-01-04 00:00:00 ┆ 3    │
│ A      ┆ 2024-01-05 00:00:00 ┆ 4    │
│ …      ┆ …                   ┆ …    │
│ B      ┆ 2024-01-04 00:00:00 ┆ 3    │
│ B      ┆ 2024-01-05 00:00:00 ┆ 4    │
│ B      ┆ 2024-01-06 00:00:00 ┆ 5    │
│ B      ┆ 2024-01-07 00:00:00 ┆ 6    │
│ B      ┆ 2024-01-08 00:00:00 ┆ 7    │
└────────┴─────────────────────┴──────┘
shape: (2, 2)
┌────────┬───────────┐
│ symbol ┆ start_idx │
│ ---    ┆ ---       │
│ str    ┆ list[i64] │
╞════════╪═══════════╡
│ A      ┆ [0, 5]    │
│ B      ┆ [0, 3]    │
└────────┴───────────┘

Now I need to split each symbol group of the dataframe into chunks of length 5 (chunk_size). The column start_idx gives the rows, counted within the group, at which the chunks start. That is, group A is split into two chunks of length 5 starting at rows 0 and 5, while the chunks of group B start at rows 0 and 3 (so they overlap).
Finally, all chunks need to be concatenated on axis=0, with a new column split_idx indicating which split each row comes from.

Here’s what I am looking for:

shape: (20, 4)
┌───────────┬────────┬─────────────────────┬──────┐
│ split_idx ┆ symbol ┆ date                ┆ data │
│ ---       ┆ ---    ┆ ---                 ┆ ---  │
│ i64       ┆ str    ┆ datetime[μs]        ┆ i64  │
╞═══════════╪════════╪═════════════════════╪══════╡
│ 0         ┆ A      ┆ 2024-01-01 00:00:00 ┆ 0    │
│ 0         ┆ A      ┆ 2024-01-02 00:00:00 ┆ 1    │
│ 0         ┆ A      ┆ 2024-01-03 00:00:00 ┆ 2    │
│ 0         ┆ A      ┆ 2024-01-04 00:00:00 ┆ 3    │
│ 0         ┆ A      ┆ 2024-01-05 00:00:00 ┆ 4    │
│ 0         ┆ B      ┆ 2024-01-01 00:00:00 ┆ 0    │
│ 0         ┆ B      ┆ 2024-01-02 00:00:00 ┆ 1    │
│ 0         ┆ B      ┆ 2024-01-03 00:00:00 ┆ 2    │
│ 0         ┆ B      ┆ 2024-01-04 00:00:00 ┆ 3    │
│ 0         ┆ B      ┆ 2024-01-05 00:00:00 ┆ 4    │
│ 1         ┆ A      ┆ 2024-01-06 00:00:00 ┆ 5    │
│ 1         ┆ A      ┆ 2024-01-07 00:00:00 ┆ 6    │
│ 1         ┆ A      ┆ 2024-01-08 00:00:00 ┆ 7    │
│ 1         ┆ A      ┆ 2024-01-09 00:00:00 ┆ 8    │
│ 1         ┆ A      ┆ 2024-01-10 00:00:00 ┆ 9    │
│ 1         ┆ B      ┆ 2024-01-04 00:00:00 ┆ 3    │
│ 1         ┆ B      ┆ 2024-01-05 00:00:00 ┆ 4    │
│ 1         ┆ B      ┆ 2024-01-06 00:00:00 ┆ 5    │
│ 1         ┆ B      ┆ 2024-01-07 00:00:00 ┆ 6    │
│ 1         ┆ B      ┆ 2024-01-08 00:00:00 ┆ 7    │
└───────────┴────────┴─────────────────────┴──────┘

Keep in mind that the lists in column start_idx may have a different length in each row. The length of each list determines the number of chunks for that group.

Split a polars DataFrame into multiple chunks with groupby