Let’s say I have an auth
field that use bit flags to indicate permissions (example bit-0 means add
and bit-1 means delete
).
How do I bitwise-OR
them together?
import polars as pl
df_in = pl.DataFrame(
{
"k": ["a", "a", "b", "b", "c"],
"auth": [1, 3, 1, 0, 0],
}
)
The dataframe:
df_in: shape: (5, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 1 │
│ a ┆ 3 │
│ b ┆ 1 │
│ b ┆ 0 │
│ c ┆ 0 │
└─────┴──────┘
When I group by and sum, things look good, I sum the auth
by k
dfsum = df_in.group_by("k").agg(pl.col("auth").sum())
dfsum: shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 4 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
So, it looks as if I am using group_by
and agg
correctly, when using sum
.
Not so good when using or_
.
dfor = df_in.group_by("k").agg(pl.col("auth").or_())
gives
dfor: shape: (3, 2)
┌─────┬───────────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a ┆ [1, 3] │
│ b ┆ [1, 0] │
│ c ┆ [0] │
└─────┴───────────┘
Expectations:
for the or_
I was expecting this result instead:
df_wanted_or: shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 3 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
Now, I did find a workaround by using map_batches
to call a Python function. Very simple something like
functools.reduce(lambda x,y: x|y)
but how do I do this without leaving Polars?
1
Update.
Bitwise aggregation was implemented in version 1.9.0
. So now you can use pl.Expr.bitwise_or()
:
(
df_in
.group_by("k", maintain_order=True)
.agg(pl.col.auth.bitwise_or())
)
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 3 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
Previous answer. Bitwise aggregation is not yet implemented in polars – issue.
There’re a few ways you could approach it though:
1. Pure polars solution.
unique()
– not strictly necessary, but can reduce size of the aggregated lists.list.to_struct()
to convert aggregated data to Struct..reduce()
to applybitwise or
operator.field()
to access all the fields of the Struct withinreduce
context.
(
df_in
.group_by("k")
.agg(pl.col.auth.unique())
.with_columns(pl.col.auth.list.to_struct())
.with_columns(
auth = pl.reduce(
lambda acc, x: acc | x,
exprs = pl.col.auth.struct.field("*")
).fill_null(0)
)
)
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ b ┆ 1 │
│ a ┆ 3 │
│ c ┆ 0 │
└─────┴──────┘
2. DuckDB integration with Polars.
You can use DuckDB integration with Polars and bit_or()
;
duckdb.sql("""
select
k,
bit_or(auth) as auth
from df_in
group by
k
""").pl()
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 3 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
3. Polars integration with NumPy
Another possible way to do that would be to use Polars integration with NumPy.
First, use pure polars to aggregate auth
columns to lists and convert them to arrays.
df_agg = df_in.group_by("k").agg("auth")
w = df_agg["auth"].list.len().max()
df_agg = (
df_agg
.with_columns(
pl.col.auth.list.concat(
pl.lit(0).repeat_by(w - pl.col.auth.list.len())
)
).with_columns(pl.col.auth.list.to_array(w))
)
shape: (3, 2)
┌─────┬───────────────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ array[i64, 2] │
╞═════╪═══════════════╡
│ b ┆ [1, 0] │
│ a ┆ [1, 3] │
│ c ┆ [0, 0] │
└─────┴───────────────┘
Now we can get auth
column as Series, convert it to 2d numpy array with to_numpy()
and use np.bitwise_or
and reduce()
:
(
df_agg
.with_columns(
auth = np.bitwise_or.reduce(df_agg["auth"].to_numpy(), axis=1)
)
)
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ b ┆ 1 │
│ a ┆ 3 │
│ c ┆ 0 │
└─────┴──────┘
2