Given a list of column names in the variable column_names (more than 20 in practice, but reduced to 3 in this example) and the dataframe below:
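import polars as pl

# Example list of column names (more than 20 in the real data)
column_names = ["column1", "column2", "column3"]

df = pl.DataFrame({
    "column1": [2, 1, 3],
    "column2": [0, 2, 0],
    "column3": [0, 0, 4],
})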
Output:
┌─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 │
│ ---     ┆ ---     ┆ ---     │
│ i64     ┆ i64     ┆ i64     │
╞═════════╪═════════╪═════════╡
│ 2       ┆ 0       ┆ 0       │
│ 1       ┆ 2       ┆ 0       │
│ 3       ┆ 0       ┆ 4       │
└─────────┴─────────┴─────────┘
The original dataframe is sparse, and I would like to condense it into the following format using these transformations:
- Condense the values into a column of lists named column_values. The corresponding column names appear in another column of lists named column_names
- Values at or below a certain threshold do not appear in column_values. In the example below, only values greater than 1 are included (so the 1 in the second row is intentionally omitted)
- The values in column_values should be sorted in descending order, with the corresponding names in column_names arranged in the same order
- Please try to minimize memory usage!
The output should look like this:
┌────────────────────────┬───────────────┐
│ column_names           ┆ column_values │
│ ---                    ┆ ---           │
│ list[str]              ┆ list[i64]     │
╞════════════════════════╪═══════════════╡
│ ["column1"]            ┆ [2]           │
│ ["column2"]            ┆ [2]           │
│ ["column3", "column1"] ┆ [4, 3]        │
└────────────────────────┴───────────────┘