I have a dataset with N columns, say N = 3 to generate an example (n_1, n_2, n_3).
There are duplicate values based on some columns, as in:
import polars as pl

data = {
    'n_1': ['a', 'b', 'a', 'c', 'd', 'b', None, 'c', 'e'],
    'n_2': [1, 2, None, 3, 1, 2, 1, 3, 5],
    'n_3': [123, 345, 123, 567, 123, 987, 123, None, 923],
}
df = pl.DataFrame(data)
┌──────┬──────┬──────┐
│ n_1 ┆ n_2 ┆ n_3 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ a ┆ 1 ┆ 123 │
│ b ┆ 2 ┆ 345 │
│ a ┆ null ┆ 123 │
│ c ┆ 3 ┆ 567 │
│ d ┆ 1 ┆ 123 │
│ b ┆ 2 ┆ 987 │
│ null ┆ 1 ┆ 123 │
│ c ┆ 3 ┆ null │
│ e    ┆ 5    ┆ 923  │
└──────┴──────┴──────┘
I want to fill the null values in each column with the value that is present in similar records, so that e.g. rows 3 and 7 become ['a', 1, 123] and row 8 becomes ['c', 3, 567].
EDIT: I forgot to add that row 7 could become two rows, filling the value with both 'a' and 'd', so you end up with both ['d', 1, 123] and ['a', 1, 123].
Do you know the optimal strategy to achieve this in plain SQL, Polars, or Spark?
Thanks