Let's say I have a column named structures that is a list of structs, each containing fields “a” and “b”. I want to avoid explode and similar operations because my data structure is quite complex (and quite big); I want to work with the lists directly as much as possible.
import polars as pl

# Create a list of lists of structs
list_of_lists = [
    [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 4}]
]

# Create a DataFrame from the list of lists
df = pl.DataFrame({"structures": list_of_lists})
I want to perform some kind of “group all the elements by b” (and then agg using concat_list). In Spark the code looks like this (note how I can reference both x and y):
arrays_grouped = F.array_distinct(
    F.transform(
        F.col("structures"),
        lambda x: F.filter(
            F.col("structures"),
            lambda y: x["b"] == y["b"],
        ),
    )
)
However, in Polars the only comparable operator I can find is list.eval: https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.list.eval.html#polars.Expr.list.eval.
Inside the eval I can only reference pl.element(); I can't refer to anything from outside it, so I'm quite stuck.
Do I need to implement this as a plugin, or is there a way with the provided API? This is my current attempt, which does not work (probably influenced by having worked with Spark functions for quite a long time):
df = df.with_columns(
    # Get all unique "b" values
    pl.col("structures").list.eval(
        pl.element().struct.field("b")
    ).list.unique()
    # Try to filter the structures by each unique "b"
    .list.eval(
        pl.struct(
            base := pl.element(),
            df.get_column("structures").list.eval(
                # I don't know what this base value is, but it's not filtering;
                # if I replace it with a hardcoded 4 it does filter the 4s correctly
                pl.element().filter(pl.element().struct.field("b") == base)
            ),
        )
    )
)