When the number of operations is dynamic, and especially when a calculation depends on the result of a previous step, chaining Polars expressions becomes tricky.
Here is a demo operator that generates a new variable:
import polars as pl

def demo_operator_addition(var1: str, var2: str) -> pl.Expr:
    return pl.col(var1).add(pl.col(var2))
The above Callable is then used in a helper function:
from typing import Callable, List, Optional

def variable_calculation(
    data: pl.DataFrame,
    target_var: str,
    operator: Callable,
    reference_var: Optional[str] = None,
    by: Optional[List[str]] = None,
    col_name: Optional[str] = None,
) -> pl.DataFrame:
    data = data.lazy()
    by = by if by is not None else []
    data = (
        data
        .with_columns(
            operator(target_var, reference_var)
            .over([True, *by])
            # Choose the name inside alias(); when col_name is None,
            # the result overwrites target_var.
            .alias(col_name if col_name is not None else target_var)
        )
    )
    return data.collect()
I want to be able to call variable_calculation dynamically with different operations:
operation1 = variable_calculation(
    df,
    target_var="var1",
    reference_var="var2",
    operator=demo_operator_addition,
    col_name="var3",
)
operation2 = variable_calculation(
    operation1,
    target_var="var1",
    reference_var="var3",
    operator=demo_operator_addition,
    col_name="var4",
)
In the operations above, var3 is generated before var4, which requires var3 for its calculation. However, Polars only allows this as chained calls, df.with_columns(calculate var3).with_columns(calculate var4), because expressions inside a single with_columns call cannot reference columns created in that same call.
Is there an efficient way to chain these expressions together dynamically? (For example, I want to avoid the intermediate operation = xxx definition steps.)
What about defining variable_calculation as
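from typing import Callable, List, Optional, Tuple

def variable_calculation(
    data: pl.DataFrame,
    inputs: List[Tuple[str, Callable, Optional[str], Optional[List[str]], Optional[str]]],
) -> pl.DataFrame:
    # Stay lazy across all steps and collect once at the end.
    lf = data.lazy()
    for target_var, operator, reference_var, by, col_name in inputs:
        by = by if by is not None else []
        # Each step gets its own with_columns call, so later steps
        # can reference columns created by earlier ones.
        lf = lf.with_columns(
            operator(target_var, reference_var)
            .over([True, *by])
            .alias(col_name if col_name is not None else target_var)
        )
    return lf.collect()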
Then you could do:
variable_calculation(
    df,
    [
        # (target_var, operator, reference_var, by, col_name)
        ("var1", demo_operator_addition, "var2", None, "var3"),
        ("var1", demo_operator_addition, "var3", None, "var4"),
    ],
)
You could change inputs to take an inner dict instead of a tuple if you want to use parameter names instead of argument order.