Say I have a dataframe:
import polars as pl

df = pl.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2], 'b': [1, 2, 1, 2, 1, 2, 1, 2], 'c': [10, 11, 12, 13, 14, 15, 18, 17]})
print(df)
output:
shape: (8, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 10 │
│ 1 ┆ 2 ┆ 11 │
│ 1 ┆ 1 ┆ 12 │
│ 1 ┆ 2 ┆ 13 │
│ 2 ┆ 1 ┆ 14 │
│ 2 ┆ 2 ┆ 15 │
│ 2 ┆ 1 ┆ 18 │
│ 2 ┆ 2 ┆ 17 │
└─────┴─────┴─────┘
Now, within each group defined by a, I want to take the last value of c in each second-level group defined by b, and then find the max of those last values. The following code achieves this:
df.with_columns(pl.col("c").last().over(["a", "b"])).group_by(pl.col("a")).agg(pl.col("c").max())
output:
shape: (2, 2)
┌─────┬─────┐
│ a ┆ c │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2 ┆ 18 │
│ 1 ┆ 13 │
└─────┴─────┘
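To spell out the semantics I am after independently of polars, here is a minimal plain-Python sketch of the same two steps (last c per (a, b), then max per a). The names rows, last_c and max_last are only illustrative and are not part of my actual code:

```python
# Plain-Python reference for the intended result, using the same rows as df.
rows = [
    (1, 1, 10), (1, 2, 11), (1, 1, 12), (1, 2, 13),
    (2, 1, 14), (2, 2, 15), (2, 1, 18), (2, 2, 17),
]

# Step 1: last value of c within each (a, b) group;
# rows are visited in order, so later rows overwrite earlier ones.
last_c = {}
for a, b, c in rows:
    last_c[(a, b)] = c

# Step 2: max of those last values within each a group.
max_last = {}
for (a, _b), c in last_c.items():
    max_last[a] = max(max_last.get(a, c), c)

print(max_last)
```

This gives 13 for a = 1 and 18 for a = 2, matching the polars output above.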
Is there a way to do this with a single expression? Currently I would have to write a separate class for such cases that provides one expression for with_columns and another expression for the group_by. I can do that, but I am also trying to get better at polars and would like to know whether it can be done with just one expression.