I have a dataframe that contains “duplicated” data in all columns but one called source. I match these records one to one per source into groups.
(I asked similar question here: how-to-efficiently-set-column-values-based-on-multiple-other-columns)
Using following data:
id,str_id,partition_number,source,type,state,quantity,price,m_group,m_status
1,s1_1,111,1,A,1,10,100.0,,0
2,s1_2,111,1,A,1,10,100.0,,0
3,s1_3,222,1,B,2,20,150.0,,0
4,s1_4,333,1,C,1,30,200.0,,0
5,s1_5,111,1,A,1,10,100.0,,0
6,s1_6,111,1,A,1,10,100.0,,0
7,s2_1,111,5,A,1,10,100.0,,0
8,s2_2,111,5,A,1,10,100.0,,0
9,s2_3,111,5,A,1,10,100.0,,0
10,s2_4,222,5,B,2,20,150.0,,0
11,s2_5,444,5,D,1,40,250.0,,0
12,s3_1,111,6,A,1,10,100.0,,0
13,s3_2,111,6,A,1,10,100.0,,0
14,s3_3,111,6,A,1,10,100.0,,0
15,s3_4,222,6,B,2,20,150.0,,0
16,s3_5,444,6,D,1,40,250.0,,0
17,s3_6,333,6,C,1,30,200.0,,0
I obtain following dataframe:
┌─────┬──────────┬──────────┬──────────┬────────┬──────┬──────────┬──────────┬──────────┬──────────┐
│ id ┆ str_id ┆ part_ ┆ source ┆ type ┆ stat ┆ quantity ┆ price ┆ m_group ┆ m_status │
│ --- ┆ ┆ number ┆ ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ │
│ i64 ┆ --- ┆ --- ┆ --- ┆ str ┆ i64 ┆ i64 ┆ f64 ┆ --- ┆ --- │
│ ┆ str ┆ str ┆ i64 ┆ ┆ ┆ ┆ ┆ ┆ │
╞═════╪══════════╪══════════╪══════════╪══════╪══════╪══════════╪═════════╪═════════╪══════════╡
│ 1 ┆ s1_1 ┆ 111 ┆ 1 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 2 ┆ s1_2 ┆ 111 ┆ 1 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 3 ┆ s1_3 ┆ 222 ┆ 1 ┆ B ┆ 2 ┆ 20. ┆ 150.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 4 ┆ s1_4 ┆ 333 ┆ 1 ┆ C ┆ 1 ┆ 30. ┆ 200.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 5 ┆ s1_5 ┆ 111 ┆ 1 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 6 ┆ s1_6 ┆ 111 ┆ 1 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 7 ┆ s2_1 ┆ 111 ┆ 5 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 8 ┆ s2_2 ┆ 111 ┆ 5 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 9 ┆ s2_3 ┆ 111 ┆ 5 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 10 ┆ s2_4 ┆ 222 ┆ 5 ┆ B ┆ 2 ┆ 20. ┆ 150.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 11 ┆ s2_5 ┆ 444 ┆ 5 ┆ D ┆ 1 ┆ 40. ┆ 250.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 12 ┆ s3_1 ┆ 111 ┆ 6 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 13 ┆ s3_2 ┆ 111 ┆ 6 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 14 ┆ s3_3 ┆ 111 ┆ 6 ┆ A ┆ 1 ┆ 10. ┆ 100.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 15 ┆ s3_4 ┆ 222 ┆ 6 ┆ B ┆ 2 ┆ 20. ┆ 150.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 16 ┆ s3_5 ┆ 444 ┆ 6 ┆ D ┆ 1 ┆ 40. ┆ 250.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
│ 17 ┆ s3_6 ┆ 333 ┆ 6 ┆ C ┆ 1 ┆ 30. ┆ 200.0000 ┆ [] ┆ [] │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ 000 ┆ ┆ │
└─────┴──────────┴──────────┴──────────┴────────┴──────┴──────────┴──────────┴──────────┴──────────┘
After I match these, I have an output dataframe that contains three columns of [list] type that aggregete the ids, str_ids and sources into groups of “duplicated” records:
┌─────────────┬──────────────────────────┬────────────────┐
│ id ┆ str_id ┆ source │
│ --- ┆ --- ┆ --- │
│ list[i64] ┆ list[str] ┆ list[i64] │
╞═════════════╪══════════════════════════╪════════════════╡
│ [5, 9, 14] ┆ ["s1_5", "s2_3", "s3_3"] ┆ [1, 5, 6] │
│ [2, 8, 13] ┆ ["s1_2", "s2_2", "s3_2"] ┆ [1, 5, 6] │
│ [6] ┆ ["s1_6"] ┆ [1] │
│ [3, 10, 15] ┆ ["s1_3", "s2_4", "s3_4"] ┆ [1, 5, 6] │
│ [1, 7, 12] ┆ ["s1_1", "s2_1", "s3_1"] ┆ [1, 5, 6] │
│ [11, 16] ┆ ["s2_5", "s3_5"] ┆ [5, 6] │
│ [4, 17] ┆ ["s1_4", "s3_6"] ┆ [1, 6] │
└─────────────┴──────────────────────────┴────────────────┘
Using this aggregated dataframe above I can now create following dataframe using this code:
def set_m_status(df: pl.DataFrame) -> pl.DataFrame:
df = df.with_columns(pl.col("source").cast(pl.List(pl.Utf8)))
df = (
df.with_columns(
l = pl.col.source.len(),
has1 = pl.col.source.list.contains("1"),
excl1 = pl.col.source.list.set_difference(["1"]).list.eval(pl.element() + "_x")
).explode(pl.col("id","str_id","source"))
.select(
pl.col("id","str_id","source"),
m_status =
pl.when(pl.col.l >= 2, pl.col.source == "1").then(pl.col.excl1)
.when(pl.col.l >= 2, pl.col.has1).then(["1_x"])
.otherwise([])
)
.sort("id")
)
┌─────┬────────┬────────┬──────────────────┐
│ id ┆ str_id ┆ source ┆ m_status │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ list[i64] │
╞═════╪════════╪════════╪══════════════════╡
│ 1 ┆ s1_1 ┆ 1 ┆ ["6_x", "5_x"] │
│ 2 ┆ s1_2 ┆ 1 ┆ ["6_x", "5_x"] │
│ 3 ┆ s1_3 ┆ 1 ┆ ["6_x", "5_x"] │
│ 4 ┆ s1_4 ┆ 1 ┆ ["6_x"] │
│ 5 ┆ s1_5 ┆ 1 ┆ ["6_x", "5_x"] │
│ 6 ┆ s1_6 ┆ 1 ┆ [] │
│ 7 ┆ s2_1 ┆ 5 ┆ ["1_x"] │
│ 8 ┆ s2_2 ┆ 5 ┆ ["1_x"] │
│ 9 ┆ s2_3 ┆ 5 ┆ ["1_x"] │
│ 10 ┆ s2_4 ┆ 5 ┆ ["1_x"] │
│ 11 ┆ s2_5 ┆ 5 ┆ [] │
│ 12 ┆ s3_1 ┆ 6 ┆ ["1_x"] │
│ 13 ┆ s3_2 ┆ 6 ┆ ["1_x"] │
│ 14 ┆ s3_3 ┆ 6 ┆ ["1_x"] │
│ 15 ┆ s3_4 ┆ 6 ┆ ["1_x"] │
│ 16 ┆ s3_5 ┆ 6 ┆ [] │
│ 17 ┆ s3_6 ┆ 6 ┆ ["1_x"] │
└─────┴────────┴────────┴──────────────────┘
I would also like to set m_group column, where I set integer values preceded by source number for groups from the previous, aggregated dataframe. So the resulting dataframe would be something like:
┌─────┬─────────────────────────────────┬────────────────┬────────────────┬────────────────┐
│ id ┆ str_id ┆ source ┆ m_status ┆ m_group │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ list[str] ┆ list[str] │
╞═════╪═════════════════════════════════╪════════════════╪════════════════╪════════════════╡
│ 1 ┆ s1_1 ┆ 1 ┆ ["6_x", "5_x"] ┆ ["6_1", "5_1"] │
│ 2 ┆ s1_2 ┆ 1 ┆ ["6_x", "5_x"] ┆ ["6_2", "5_2"] │
│ 3 ┆ s1_3 ┆ 1 ┆ ["6_x", "5_x"] ┆ ["6_3", "5_3"] │
│ 4 ┆ s1_4 ┆ 1 ┆ ["6_x" ] ┆ ["6_4"] │
│ 5 ┆ s1_5 ┆ 1 ┆ ["6_x", "5_x"] ┆ ["6_5", "5_4"] │
│ 6 ┆ s1_6 ┆ 1 ┆ [] ┆ [] │
│ 7 ┆ s2_1 ┆ 5 ┆ ["1_x"] ┆ ["1_1"] │
│ 8 ┆ s2_2 ┆ 5 ┆ ["1_x"] ┆ ["1_2"] │
│ 9 ┆ s2_3 ┆ 5 ┆ ["1_x"] ┆ ["1_4"] │
│ 10 ┆ s2_4 ┆ 5 ┆ ["1_x"] ┆ ["1_3"] │
│ 11 ┆ s2_5 ┆ 5 ┆ [] ┆ [] │
│ 12 ┆ s3_1 ┆ 6 ┆ ["1_x"] ┆ ["1_1"] │
│ 13 ┆ s3_2 ┆ 6 ┆ ["1_x"] ┆ ["1_2"] │
│ 14 ┆ s3_3 ┆ 6 ┆ ["1_x"] ┆ ["1_5"] │
│ 15 ┆ s3_4 ┆ 6 ┆ ["1_x"] ┆ ["1_3"] │
│ 16 ┆ s3_5 ┆ 6 ┆ [] ┆ [] │
│ 17 ┆ s3_6 ┆ 6 ┆ ["1_x"] ┆ ["1_4"] │
└─────┴─────────────────────────────────┴────────────────┴────────────────┴────────────────┘
Having dict that stores latest m_group values for each source different than “1” (I always match other source with respect to source “1”) like:
m_group_dict = {
"5": 20,
"6": 25,
}
I wanted to incorporate pl.int_range(start=m_group_dict[source])
into the code above that sets m_status to somehow set the m_group as well, but I can’t make it work. I was thinking about doing somehting like:
m_status =
pl.when(pl.col.l >= 2, pl.col.representation == "1").then(pl.col.excl1)
.when(pl.col.l >= 2, pl.col.has1).then(["1"])
.otherwise([]),
m_group =
pl.when(pl.col.l >= 2, pl.col.representation == "1")
.then(pl.int_range(start=m_group_dict[pl.col.source]).cast(pl.List(pl.Utf8)).list.eval(f"{pl.col.representation}_" + pl.element()))
.when(pl.col.l >= 2, pl.col.has1)
.then(pl.int_range(start=m_group_dict[pl.col.source]).cast(pl.List(pl.Utf8)).list.eval(f"1_" + pl.element()))
.otherwise([])
But cannot get anything like this to work.