I would like to get the value of the column whose name is stored in another column. My table has many columns, so doing this with pl.when().then() is not an option.
As input we have this dataframe:
foo | foo_count | bar | bar_count | baz | baz_count | largest |
---|---|---|---|---|---|---|
1 | 23 | 4 | 43 | 5 | 64 | baz |
2 | 45 | 6 | 45 | 1 | 43 | bar |
3 | 234 | 9 | 453 | 15 | 231 | baz |
4 | 55 | 2 | 67 | 3 | 94 | foo |
and would like to transform it with with_columns() into:
foo | foo_count | bar | bar_count | baz | baz_count | largest | largest_count |
---|---|---|---|---|---|---|---|
1 | 23 | 4 | 43 | 5 | 64 | baz | 64 |
2 | 45 | 6 | 45 | 1 | 43 | bar | 45 |
3 | 234 | 9 | 453 | 15 | 231 | baz | 231 |
4 | 55 | 2 | 67 | 3 | 94 | foo | 55 |
This (pseudo)code illustrates what I have in mind (but of course it doesn't work):
import polars as pl

df = pl.DataFrame({
    "foo": [1, 2, 3, 4],
    "foo_count": [23, 45, 234, 55],
    "bar": [4, 6, 9, 2],
    "bar_count": [43, 45, 453, 67],
    "baz": [5, 1, 15, 3],
    "baz_count": [64, 43, 231, 94],
    "largest": ["baz", "bar", "baz", "foo"],
})

# pseudocode: this does not work as intended
df.with_columns(
    pl.col(f"{pl.col('largest')}_count").alias("largest_count")
)
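For context, the f-string in that attempt cannot work: formatting a Polars expression inside an f-string only produces its textual representation, so pl.col() ends up looking for a column with that literal name. A minimal illustration (the exact printed text may vary by Polars version):

import polars as pl

# Formatting an expression yields its string form, not a per-row value,
# so the resulting column name does not exist in the dataframe.
print(f"{pl.col('largest')}_count")  # prints something like: col("largest")_count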
Here is a simple approach. First, we create expressions for the masked columns using a generator of when/then expressions. Then we rely on pl.coalesce
to combine the masked columns. Note that this still involves a plain Python loop over the unique values in largest.
df.with_columns(
    pl.coalesce(
        pl.when(pl.col("largest") == col).then(pl.col(f"{col}_count"))
        for col in df.get_column("largest").unique()
    ).alias("largest_count")
)
shape: (4, 8)
┌─────┬───────────┬─────┬───────────┬─────┬───────────┬─────────┬───────────────┐
│ foo ┆ foo_count ┆ bar ┆ bar_count ┆ baz ┆ baz_count ┆ largest ┆ largest_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═════╪═══════════╪═════╪═══════════╪═════╪═══════════╪═════════╪═══════════════╡
│ 1 ┆ 23 ┆ 4 ┆ 43 ┆ 5 ┆ 64 ┆ baz ┆ 64 │
│ 2 ┆ 45 ┆ 6 ┆ 45 ┆ 1 ┆ 43 ┆ bar ┆ 45 │
│ 3 ┆ 234 ┆ 9 ┆ 453 ┆ 15 ┆ 231 ┆ baz ┆ 231 │
│ 4 ┆ 55 ┆ 2 ┆ 67 ┆ 3 ┆ 94 ┆ foo ┆ 55 │
└─────┴───────────┴─────┴───────────┴─────┴───────────┴─────────┴───────────────┘
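If you would rather not materialize the unique values of largest (for example when working lazily), a variant of the same idea is to generate one when/then per *_count column found in the schema. This is only a sketch, assuming every value in largest has a matching <name>_count column:

df.with_columns(
    pl.coalesce(
        # take "<name>_count" on the rows where "largest" equals "<name>"
        pl.when(pl.col("largest") == name.removesuffix("_count")).then(pl.col(name))
        for name in df.columns
        if name.endswith("_count")
    ).alias("largest_count")
)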
Here is an approach that scales even with a very large number of columns. First, extract just the count columns, stripping the _count suffix from their names:
counts = df.select(pl.col("^.*_count$").name.map(lambda c: c.removesuffix("_count")))
shape: (4, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 23 ┆ 43 ┆ 64 │
│ 45 ┆ 45 ┆ 43 │
│ 234 ┆ 453 ┆ 231 │
│ 55 ┆ 67 ┆ 94 │
└─────┴─────┴─────┘
Then unpivot it into a long (normalized) table keyed by a row index:
norm_counts = counts.with_row_index().unpivot(index="index")
shape: (12, 3)
┌───────┬──────────┬───────┐
│ index ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 │
╞═══════╪══════════╪═══════╡
│ 0 ┆ foo ┆ 23 │
│ 1 ┆ foo ┆ 45 │
│ 2 ┆ foo ┆ 234 │
│ 3 ┆ foo ┆ 55 │
│ 0 ┆ bar ┆ 43 │
│ … ┆ … ┆ … │
│ 3 ┆ bar ┆ 67 │
│ 0 ┆ baz ┆ 64 │
│ 1 ┆ baz ┆ 43 │
│ 2 ┆ baz ┆ 231 │
│ 3 ┆ baz ┆ 94 │
└───────┴──────────┴───────┘
Then we can join the normalized counts to the original dataframe using a temporary index:
out = (
    df
    .with_row_index()
    .join(
        norm_counts,
        left_on=["index", "largest"],
        right_on=["index", "variable"],
        how="left",
    )
    .rename({"value": "largest_count"})
    .drop("index")
)
shape: (4, 8)
┌─────┬───────────┬─────┬───────────┬─────┬───────────┬─────────┬───────────────┐
│ foo ┆ foo_count ┆ bar ┆ bar_count ┆ baz ┆ baz_count ┆ largest ┆ largest_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ str ┆ i64 │
╞═════╪═══════════╪═════╪═══════════╪═════╪═══════════╪═════════╪═══════════════╡
│ 1 ┆ 23 ┆ 4 ┆ 43 ┆ 5 ┆ 64 ┆ baz ┆ 64 │
│ 2 ┆ 45 ┆ 6 ┆ 45 ┆ 1 ┆ 43 ┆ bar ┆ 45 │
│ 3 ┆ 234 ┆ 9 ┆ 453 ┆ 15 ┆ 231 ┆ baz ┆ 231 │
│ 4 ┆ 55 ┆ 2 ┆ 67 ┆ 3 ┆ 94 ┆ foo ┆ 55 │
└─────┴───────────┴─────┴───────────┴─────┴───────────┴─────────┴───────────────┘
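The same approach also works as one lazy chain, which avoids building the intermediate DataFrames eagerly. A sketch, assuming a Polars version that has unpivot and with_row_index (older releases call these melt and with_row_count):

norm_counts = (
    df.lazy()
    .select(pl.col("^.*_count$").name.map(lambda c: c.removesuffix("_count")))
    .with_row_index()
    .unpivot(index="index")
)

out = (
    df.lazy()
    .with_row_index()
    .join(
        norm_counts,
        left_on=["index", "largest"],
        right_on=["index", "variable"],
        how="left",
    )
    .rename({"value": "largest_count"})
    .drop("index")
    .collect()
)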