Here’s the solution I came up with for the problem:
import polars as pl
import numpy as np
max_groups = 5
max_reps = 3
# print out all rows in our table, for the sake of convenience
pl.Config.set_tbl_rows(max_groups * max_reps)
num_groups = np.random.randint(3, max_groups + 1)
unique_ids = np.random.randint(97, 123, num_groups)
repetitions = np.random.randint(1, max_reps + 1, num_groups)
id_col = "id"
data_col = "point"
index_col = "ixs"
# # Generate data
# convert integers to ascii using `chr`
ids = pl.Series(
id_col,
[chr(code) for n, code in zip(repetitions, unique_ids) for _ in range(n)],
)
data = pl.Series(
data_col,
np.random.rand(len(ids)),
)
df = pl.DataFrame([ids, data])
# # Generate indices
df.sort(id_col, data_col).group_by(id_col).agg(
pl.col(data_col), pl.int_range(pl.len()).alias(index_col)
).explode(data_col, index_col).sort(id_col, data_col)
Can I do better? I sort twice, for instance: once before grouping, and once after. I can eliminate the need for the second sort by passing maintain_order=True to group_by:
# # Generate indices, but maintain_order in group_by
df.sort(id_col, data_col).group_by(id_col, maintain_order=True).agg(
pl.col(data_col), pl.int_range(pl.len()).alias(index_col)
).explode(data_col, index_col)
(Some simple, very naive timeit-based experiments suggest that maintain_order=True generally wins over sorting twice, but not by a large margin.)