I am working with polars expressions as below.
(
    pl.when(pl.col("a").str.contains("some value")).then(pl.concat_list("ArayColumn", pl.lit("A")))
    .when(pl.col("a").str.contains("some value")).then(pl.concat_list("ArayColumn", pl.lit("B")))
    .when(pl.col("a").str.contains("some value")).then(pl.concat_list("ArayColumn", pl.lit("C")))
    .otherwise(pl.lit("Z is Big")).alias("Bigger Value")
)
However, I’d like to generate these expressions dynamically (as in the following pseudo-code).
exprs = []
for condition, value in conditions:
    exprs.append(pl.when(condition).then(value))
exprs.otherwise(defaultValue).alias("Bigger Value")
Then, I’d like to evaluate exprs in the context of a dataframe.
df.with_columns(exprs)
So, for each row, I want it to evaluate the first when/then, then the second when/then, and so on.
It should evaluate every when/then condition and union the results,
so my final value may contain [A, B, C].
Approaches I tried:
- A for loop that applies the conditions one by one, but this breaks down when the DataFrame holds a lot of data.
- Expression chaining, but Polars uses None as the default otherwise for every condition I add, which makes it slow [a few ms].
Can you suggest another approach? I need to generate the expression first and apply it to the DataFrame later.
You can dynamically create a chain of .when/.then like this:
First do expr = pl, then in each iteration of the for loop, do expr = expr.when(_).then(_).
After the loop, do a select or with_columns for expr.otherwise(_).
import polars as pl

df = pl.DataFrame(
    {
        "string": ["1", "foo bar", "bar", "baz", "quux"],
    }
)
conditions = {"foo": "contains foo", "bar": "contains bar", "baz": "contains baz"}

# Start from the module itself so the first call in the loop is pl.when(...)
expr = pl
for key, value in conditions.items():
    expr = expr.when(pl.col("string").str.contains(key)).then(pl.lit(value))

# Close the chain with a default value for rows that match no condition
df = df.with_columns(output=expr.otherwise(pl.lit("other")))
print(df)
Output:
┌─────────┬──────────────┐
│ string  ┆ output       │
│ ---     ┆ ---          │
│ str     ┆ str          │
╞═════════╪══════════════╡
│ 1       ┆ other        │
│ foo bar ┆ contains foo │
│ bar     ┆ contains bar │
│ baz     ┆ contains baz │
│ quux    ┆ other        │
└─────────┴──────────────┘
You can also use pl.coalesce. For example, pl.when(cond).then(result).when(cond_2).then(result_2) becomes pl.coalesce(pl.when(cond).then(result), pl.when(cond_2).then(result_2)).
This makes the expression a little easier to build dynamically and saves the potentially confusing expr = pl initialisation. coalesce also operates in parallel like when/then, so performance should be similar. fill_null can be used in place of otherwise to provide a default value.
import polars as pl

df = pl.DataFrame({"string": ["1", "foo bar", "bar", "baz", "quux"]})
conditions = {"foo": "contains foo", "bar": "contains bar", "baz": "contains baz"}

# Each when/then yields null where its condition fails; coalesce keeps the first non-null
expr = pl.coalesce(
    pl.when(pl.col("string").str.contains(key)).then(pl.lit(value))
    for key, value in conditions.items()
)
df.with_columns(output=expr.fill_null(pl.lit("other")))
┌─────────┬──────────────┐
│ string  ┆ output       │
│ ---     ┆ ---          │
│ str     ┆ str          │
╞═════════╪══════════════╡
│ 1       ┆ other        │
│ foo bar ┆ contains foo │
│ bar     ┆ contains bar │
│ baz     ┆ contains baz │
│ quux    ┆ other        │
└─────────┴──────────────┘