I’m trying to make all groups for a given data frame have the same size. In Starting point below, I show an example of a data frame that I whish to transform. In Goal I try to demonstrate what I’m trying to achieve. I want to group by the column group
, make all groups have a size of 4
, and fill ‘missing’ values with null
– I hope it’s clear.
I have tried several approaches but have not been able to figure this one out.
Starting point
dfa = pl.DataFrame(data={'group': ['a', 'a', 'a', 'b', 'b', 'c'],
'value': ['a1', 'a2', 'a3', 'b1', 'b2', 'c1']})
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪═══════╡
│ a ┆ a1 │
│ a ┆ a2 │
│ a ┆ a3 │
│ b ┆ b1 │
│ b ┆ b2 │
│ c ┆ c1 │
└───────┴───────┘
Goal
>>> make_groups_uniform(dfa, group_by='group', group_size=4)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪═══════╡
│ a ┆ a1 │
│ a ┆ a2 │
│ a ┆ a3 │
│ a ┆ null │
│ b ┆ b1 │
│ b ┆ b2 │
│ b ┆ null │
│ b ┆ null │
│ c ┆ c1 │
│ c ┆ null │
│ c ┆ null │
│ c ┆ null │
└───────┴───────┘
Package version
polars: 1.1.0
1
You could use pl.repeat()
to generate the nulls and .append()
them.
group_size = 4
(df.group_by("group", maintain_order=True)
.agg(pl.all().append(pl.repeat(None, group_size - pl.len().cast(int))))
.explode(pl.exclude("group"))
)
shape: (12, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪═══════╡
│ a ┆ a1 │
│ a ┆ a2 │
│ a ┆ a3 │
│ a ┆ null │
│ b ┆ b1 │
│ b ┆ b2 │
│ b ┆ null │
│ b ┆ null │
│ c ┆ c1 │
│ c ┆ null │
│ c ┆ null │
│ c ┆ null │
└───────┴───────┘
☣ Warning ☣
pl.len()
is currently an unsigned int (that may change in the future)
We cast to an integer to avoid potential issues with overflow.
df.group_by("group").agg(int = 2 - pl.len().cast(int), uint32 = 2 - pl.len())
shape: (3, 3)
┌───────┬─────┬────────────┐
│ group ┆ int ┆ uint32 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═══════╪═════╪════════════╡
│ a ┆ -1 ┆ 4294967295 │ # <- pl.repeat(None, 4294967295) would be bad.
│ b ┆ 0 ┆ 0 │
│ c ┆ 1 ┆ 1 │
└───────┴─────┴────────────┘
- https://github.com/pola-rs/polars/issues/17722
1
You can find the largest group using .group_by().len().max()
, then groupby and convert each value in the groups to a struct, unnest the struct using a 'max_width'
strategy, then melt the wide data. After that you can pad up to the difference between the largest group and the desired group size.
def make_groups_uniform(df: pl.DataFrame, group_by: str, value_col: str, group_size: int):
gb = df.group_by(group_by)
largest_group_size = gb.len().max()['len'].item()
out = gb.all().with_columns(
pl.col(value_col)
.list
.to_struct(n_field_strategy='max_width')
).unnest(
value_col
).melt(group_by, value_name=value_col)[[group_by, value_col]]
n = group_size - largest_group_size
if n > 0:
keys = df[group_by].unique().to_list()
pad = pl.DataFrame({group_by: keys * n, value_col: ['null']*n*len(keys)})
out = pl.concat([out, pad])
return out.sort(group_by)
Running the code:
make_groups_uniform(dfa, 'group', 'value', 4)
# returns:
shape: (12, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪═══════╡
│ a ┆ a1 │
│ a ┆ a2 │
│ a ┆ a3 │
│ a ┆ null │
│ b ┆ b1 │
│ … ┆ … │
│ b ┆ null │
│ c ┆ c1 │
│ c ┆ null │
│ c ┆ null │
│ c ┆ null │
└───────┴───────┘
You can solve this using groupby
to group the df by the group column.
Then pad each group with NULL
values if a group <= 4 , then we concatenate all group after process it with pl.concat.
Here is a code snippet to guide you.
ef make_groups_uniform(df, group_by, group_size):
def padding_each_group(group: pl.DataFrame) -> pl.DataFrame:
missing_data_count = group_size - group.shape[0]
if missing_data_count > 0:
# missing values
missing_df = pl.DataFrame(
{
group_by: [group[group_by][0]]
* missing_data_count, # Fill with the group value
"value": [None] * missing_data_count, # Fill with None
}
)
# Concatenate the original group with the missing DataFrame
group = pl.concat([group, missing_df])
return group
# Collect each group as separate DataFrames
groups = []
for _, df_group in df.group_by(group_by, maintain_order=True):
padded_group = padding_each_group(df_group)
groups.append(padded_group)
# concat all groups
output = pl.concat(groups)
return output
print(output)
shape: (12, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪═══════╡
│ a ┆ a1 │
│ a ┆ a2 │
│ a ┆ a3 │
│ a ┆ null │
│ b ┆ b1 │
│ … ┆ … │
│ b ┆ null │
│ c ┆ c1 │
│ c ┆ null │
│ c ┆ null │
│ c ┆ null │
└───────┴───────┘```