I am trying to perform a pivot operation on a lazy polars dataframe.
If I read the data eagerly, I can do the pivot:
df = pl.read_parquet(path, low_memory=True)
pivoted_df = df.pivot(index=["ind1", "ind2", "ind3", "ind4"], columns="my_signal", values="Value", aggregate_function="mean")
These lines of code work.
However, if I use a lazy dataframe instead:
df = pl.scan_parquet(path, low_memory=True)
I cannot find a way to perform the same operation, because .pivot cannot be applied to a lazy object.
IMPORTANT: I do not want to collect my data with df.collect() at any point, as I have a very large dataset that doesn't fit in memory.
I want to save the lazy dataframe at the end with pivoted_df.sink_parquet(), so the whole pipeline has to stay lazy.
Thank you in advance!
I have tried to use .group_by instead of .pivot:
grouped_df = df.group_by(["ind1", "ind2", "ind3", "ind4"])
transformed_df = grouped_df.agg(**{f"{col}_mean": pl.col("Value").mean() for col in pl.col('my_signal').unique()})
But I am getting the error 'Expr' object is not iterable, which is expected, as pl.col('my_signal').unique() is an expression and not a concrete list of values.
Any alternatives?