Due to certain restrictions, polars version is 0.18.4
Given a dataframe:
df = pl.DataFrame({
"column_id": [1,2,3],
"column1": [[1,2,1],[2,2,2],[0,0,1]],
"column2": [[0,0,0],[0,1,2],[0,2,1]]
})
What I want to do is to sort both column 1 and 2 descending by column1, then sort by column2 if column 1 has a tie.
What I have managed to do is to sort both column1 and column2 by column1 descending:
df_ranked = df.with_columns(
rank=pl.col('column1').list.eval( pl.element().rank(method="ordinal", descending=True) )
)
explode_col = ['column1','column2','rank']
rank = df_ranked['rank']
df_full_rank = (df_ranked.explode(explode_col)
.select( pl.all().sort_by('rank').over('column_id') )
.groupby('column_id', maintain_order=True).agg(pl.col(explode_col))
.with_columns(rank=rank)
)
Output:
┌───────────┬───────────┬───────────┬───────────┐
│ column_id ┆ column1 ┆ column2 ┆ rank │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[u32] │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 1 ┆ [2, 1, 1] ┆ [0, 0, 0] ┆ [2, 1, 3] │
│ 2 ┆ [2, 2, 2] ┆ [0, 1, 2] ┆ [1, 2, 3] │
│ 3 ┆ [1, 0, 0] ┆ [1, 0, 2] ┆ [2, 3, 1] │
└───────────┴───────────┴───────────┴───────────┘
But this is inadequate since it does not take into account the values in column2 when there is a tie.
Specifically, in the second row, even though in column1 all elements are tied, but because in column2 it is [0,1,2]
, the rank should be [3,2,1]
, since the last element has the biggest number. Similar logic applies for the 3rd row.
The expected output should be:
┌───────────┬───────────┬───────────┬───────────┐
│ column_id ┆ column1 ┆ column2 ┆ rank │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[u32] │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 1 ┆ [2, 1, 1] ┆ [0, 0, 0] ┆ [2, 1, 3] │
│ 2 ┆ [2, 2, 2] ┆ [2, 1, 0] ┆ [3, 2, 1] │
│ 3 ┆ [1, 0, 0] ┆ [1, 2, 0] ┆ [3, 2, 1] │
└───────────┴───────────┴───────────┴───────────┘
Here’s one way:
First, explode the dataframe on column1
and column2
.
Then, compute the rank column and sorted column1/column2 columns separately.
Finally, join the two dataframes.
import polars as pl
df = pl.DataFrame(
{
"column_id": [1, 2, 3],
"column1": [[1, 2, 1], [2, 2, 2], [0, 0, 1]],
"column2": [[0, 0, 0], [0, 1, 2], [0, 2, 1]],
}
)
df = df.explode(["column1", "column2"])
rank = (
df.select(
"column_id",
rank=pl.struct("column1", "column2")
.rank("ordinal", descending=True)
.over("column_id"),
)
.group_by("column_id", maintain_order=True)
.agg(pl.col("rank"))
)
df = (
df.sort(["column_id", "column1", "column2"], descending=[False, True, True])
.group_by("column_id", maintain_order=True)
.agg(pl.col("column1"), pl.col("column2"))
.join(rank, on="column_id")
)
print(df)
Output:
shape: (3, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ column_id ┆ column1 ┆ column2 ┆ rank │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] ┆ list[u32] │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 1 ┆ [2, 1, 1] ┆ [0, 0, 0] ┆ [2, 1, 3] │
│ 2 ┆ [2, 2, 2] ┆ [2, 1, 0] ┆ [3, 2, 1] │
│ 3 ┆ [1, 0, 0] ┆ [1, 2, 0] ┆ [3, 2, 1] │
└───────────┴───────────┴───────────┴───────────┘