I have a simple pl.DataFrame
with a number of columns that only contain boolean values.
import polars as pl
df = pl.DataFrame(
{"s1": [True, True, False], "s2": [False, True, True], "s3": [False, False, False]}
)
shape: (3, 3)
┌───────┬───────┬───────┐
│ s1 ┆ s2 ┆ s3 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ false ┆ false │
│ true ┆ true ┆ false │
│ false ┆ true ┆ false │
└───────┴───────┴───────┘
I need to add another column that contains lists of varying length. A list in any individual row should contain the column name where the values of the columns S1
, s2
, and s3
have a True
value.
Here’s what I am actually looking for:
shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1 ┆ s2 ┆ s3 │ list │
│ --- ┆ --- ┆ --- │ --- │
│ bool ┆ bool ┆ bool │ list[str] │
╞═══════╪═══════╪═══════╡══════════════╡
│ true ┆ false ┆ false │ ["s1"] │
│ true ┆ true ┆ false │ ["s1", "s2"] │
│ false ┆ true ┆ false │ ["s2"] │
└───────┴───────┴───────┴──────────────┘
List API
You could build a list of when/then expressions and then remove the nulls.
df.with_columns(
pl.concat_list(
pl.when(col).then(pl.lit(col)) for col in df.columns
)
.list.drop_nulls()
.alias("list")
)
shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1 ┆ s2 ┆ s3 ┆ list │
│ --- ┆ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool ┆ list[str] │
╞═══════╪═══════╪═══════╪══════════════╡
│ true ┆ false ┆ false ┆ ["s1"] │
│ true ┆ true ┆ false ┆ ["s1", "s2"] │
│ false ┆ true ┆ false ┆ ["s2"] │
└───────┴───────┴───────┴──────────────┘
Unpivot
If “raw performance” is of concern, it can be done at the frame level.
You can reshape with .unpivot()
and .group_by
to create the lists.
(df.with_row_index()
.unpivot(index="index")
.filter(pl.col.value)
.group_by("index", maintain_order=True)
.agg(pl.col.variable.alias("list"))
)
shape: (3, 2)
┌───────┬──────────────┐
│ index ┆ list │
│ --- ┆ --- │
│ u32 ┆ list[str] │
╞═══════╪══════════════╡
│ 0 ┆ ["s1"] │
│ 1 ┆ ["s1", "s2"] │
│ 2 ┆ ["s2"] │
└───────┴──────────────┘
As we’ve maintained the order, we can horizontally .concat()
to combine them.
pl.concat(
[
df,
df.with_row_index()
.unpivot(index="index")
.filter(pl.col.value)
.group_by("index", maintain_order=True)
.agg(pl.col.variable.alias("list"))
.drop("index") # optional
],
how = "horizontal"
)
shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1 ┆ s2 ┆ s3 ┆ list │
│ --- ┆ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool ┆ list[str] │
╞═══════╪═══════╪═══════╪══════════════╡
│ true ┆ false ┆ false ┆ ["s1"] │
│ true ┆ true ┆ false ┆ ["s1", "s2"] │
│ false ┆ true ┆ false ┆ ["s2"] │
└───────┴───────┴───────┴──────────────┘
Timing
As a basic comparison.
bigger_df = df.sample(2_000_000, with_replacement=True)
Name | Time |
---|---|
concat_list | 1.4s |
unpivot + concat | 0.2s |
1
You could melt
, filter
, group_by.agg
, and join
:
(df.with_row_index()
.join(df.with_row_index().melt('index').filter(pl.col('value'))
.group_by('index').agg(pl.col('variable')),
on='index', how='left'
)
)
Output:
┌───────┬───────┬───────┬───────┬──────────────┐
│ index ┆ s1 ┆ s2 ┆ s3 ┆ variable │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ bool ┆ bool ┆ bool ┆ list[str] │
╞═══════╪═══════╪═══════╪═══════╪══════════════╡
│ 0 ┆ true ┆ false ┆ false ┆ ["s1"] │
│ 1 ┆ true ┆ true ┆ false ┆ ["s1", "s2"] │
│ 2 ┆ false ┆ true ┆ false ┆ ["s2"] │
└───────┴───────┴───────┴───────┴──────────────┘
First You can create struct of all columns using struct
, then you can use map_elements
to apply a lambda func for each row.
The lambda func creates a list comprehension that includes the column name col
if its corresponding value val
is True. Then use alias for the new column as list
.
desired_output = df.with_columns(
[
pl.struct(df.columns)
.map_elements(lambda x: [col for col, val in x.items() if val])
.alias("list")
]
)
Alternatively you can do this
result = (
df.with_columns(
[
pl.concat_list(
[
pl.when(pl.col(col)).then(pl.lit(col)).otherwise(pl.lit(None))
for col in df.columns
]
).alias("temp_list")
]
)
.with_columns(
[
pl.col("temp_list")
.list.eval(pl.element().filter(pl.element().is_not_null()))
.alias("list")
]
)
.drop("temp_list")
)
print(desired_output)
shape: (3, 4)
┌───────┬───────┬───────┬──────────────┐
│ s1 ┆ s2 ┆ s3 ┆ list │
│ --- ┆ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool ┆ list[str] │
╞═══════╪═══════╪═══════╪══════════════╡
│ true ┆ false ┆ false ┆ ["s1"] │
│ true ┆ true ┆ false ┆ ["s1", "s2"] │
│ false ┆ true ┆ false ┆ ["s2"] │
└───────┴───────┴───────┴──────────────┘