Thiết kế website giá rẻ

Question

I have a dataframe that contains “duplicated” data in all columns but one called source. I match these records one to one per source into groups.

(I asked similar question here: how-to-efficiently-set-column-values-based-on-multiple-other-columns)

Using following data:

id,str_id,partition_number,source,type,state,quantity,price,m_group,m_status
1,s1_1,111,1,A,1,10,100.0,,0
2,s1_2,111,1,A,1,10,100.0,,0
3,s1_3,222,1,B,2,20,150.0,,0
4,s1_4,333,1,C,1,30,200.0,,0
5,s1_5,111,1,A,1,10,100.0,,0
6,s1_6,111,1,A,1,10,100.0,,0
7,s2_1,111,5,A,1,10,100.0,,0
8,s2_2,111,5,A,1,10,100.0,,0
9,s2_3,111,5,A,1,10,100.0,,0
10,s2_4,222,5,B,2,20,150.0,,0
11,s2_5,444,5,D,1,40,250.0,,0
12,s3_1,111,6,A,1,10,100.0,,0
13,s3_2,111,6,A,1,10,100.0,,0
14,s3_3,111,6,A,1,10,100.0,,0
15,s3_4,222,6,B,2,20,150.0,,0
16,s3_5,444,6,D,1,40,250.0,,0
17,s3_6,333,6,C,1,30,200.0,,0

I obtain following dataframe:

┌─────┬──────────┬──────────┬──────────┬────────┬──────┬──────────┬──────────┬──────────┬──────────┐
│ id  ┆ str_id   ┆ part_    ┆ source   ┆ type   ┆ stat ┆ quantity ┆ price    ┆ m_group  ┆ m_status │
│ --- ┆          ┆ number   ┆          ┆ ---    ┆ ---  ┆ ---      ┆ ---      ┆          ┆          │ 
│ i64 ┆ ---      ┆ ---      ┆ ---      ┆ str    ┆ i64  ┆ i64      ┆ f64      ┆ ---      ┆ ---      │
│     ┆ str      ┆ str      ┆ i64      ┆        ┆      ┆          ┆          ┆         ┆       │
╞═════╪══════════╪══════════╪══════════╪══════╪══════╪══════════╪═════════╪═════════╪══════════╡
│ 1   ┆ s1_1     ┆ 111      ┆ 1        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 2   ┆ s1_2     ┆ 111      ┆ 1        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 3   ┆ s1_3     ┆ 222      ┆ 1        ┆ B      ┆ 2    ┆ 20.      ┆ 150.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 4   ┆ s1_4     ┆ 333      ┆ 1        ┆ C      ┆ 1    ┆ 30.      ┆ 200.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 5   ┆ s1_5     ┆ 111      ┆ 1        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 6   ┆ s1_6     ┆ 111      ┆ 1        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 7   ┆ s2_1     ┆ 111      ┆ 5        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 8   ┆ s2_2     ┆ 111      ┆ 5        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 9   ┆ s2_3     ┆ 111      ┆ 5        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 10  ┆ s2_4     ┆ 222      ┆ 5        ┆ B      ┆ 2    ┆ 20.      ┆ 150.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 11  ┆ s2_5     ┆ 444      ┆ 5        ┆ D      ┆ 1    ┆ 40.      ┆ 250.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 12  ┆ s3_1     ┆ 111      ┆ 6        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 13  ┆ s3_2     ┆ 111      ┆ 6        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 14  ┆ s3_3     ┆ 111      ┆ 6        ┆ A      ┆ 1    ┆ 10.      ┆ 100.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 15  ┆ s3_4     ┆ 222      ┆ 6        ┆ B      ┆ 2    ┆ 20.      ┆ 150.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 16  ┆ s3_5     ┆ 444      ┆ 6        ┆ D      ┆ 1    ┆ 40.      ┆ 250.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
│ 17  ┆ s3_6     ┆ 333      ┆ 6        ┆ C      ┆ 1    ┆ 30.      ┆ 200.0000 ┆ []     ┆ []        │
│     ┆          ┆          ┆          ┆        ┆      ┆          ┆      000 ┆        ┆          │
└─────┴──────────┴──────────┴──────────┴────────┴──────┴──────────┴──────────┴──────────┴──────────┘

After I match these, I have an output dataframe that contains three columns of [list] type that aggregete the ids, str_ids and sources into groups of “duplicated” records:

┌─────────────┬──────────────────────────┬────────────────┐
│ id          ┆ str_id                   ┆ source         │
│ ---         ┆ ---                      ┆ ---            │
│ list[i64]   ┆ list[str]                ┆ list[i64]      │
╞═════════════╪══════════════════════════╪════════════════╡
│ [5, 9, 14]  ┆ ["s1_5", "s2_3", "s3_3"] ┆ [1, 5, 6]      │
│ [2, 8, 13]  ┆ ["s1_2", "s2_2", "s3_2"] ┆ [1, 5, 6]      │
│ [6]         ┆ ["s1_6"]                 ┆ [1]            │
│ [3, 10, 15] ┆ ["s1_3", "s2_4", "s3_4"] ┆ [1, 5, 6]      │
│ [1, 7, 12]  ┆ ["s1_1", "s2_1", "s3_1"] ┆ [1, 5, 6]      │
│ [11, 16]    ┆ ["s2_5", "s3_5"]         ┆ [5, 6]         │
│ [4, 17]     ┆ ["s1_4", "s3_6"]         ┆ [1, 6]         │
└─────────────┴──────────────────────────┴────────────────┘

Using this aggregated dataframe above I can now create following dataframe using this code:

def set_m_status(df: pl.DataFrame) -> pl.DataFrame:
    df = df.with_columns(pl.col("source").cast(pl.List(pl.Utf8)))
    df = (
        df.with_columns(
            l = pl.col.source.len(),
            has1 = pl.col.source.list.contains("1"),
            excl1 = pl.col.source.list.set_difference(["1"]).list.eval(pl.element() + "_x")
        ).explode(pl.col("id","str_id","source"))
        .select(
            pl.col("id","str_id","source"),
            m_status = 
            pl.when(pl.col.l >= 2, pl.col.source == "1").then(pl.col.excl1)
            .when(pl.col.l >= 2, pl.col.has1).then(["1_x"])
            .otherwise([])        
        )
        .sort("id")
    )

    
┌─────┬────────┬────────┬──────────────────┐
│ id  ┆ str_id ┆ source ┆ m_status         │
│ --- ┆ ---    ┆ ---    ┆ ---              │
│ i64 ┆ str    ┆ i64    ┆ list[i64]        │
╞═════╪════════╪════════╪══════════════════╡
│ 1   ┆ s1_1   ┆ 1      ┆ ["6_x", "5_x"]   │
│ 2   ┆ s1_2   ┆ 1      ┆ ["6_x", "5_x"]   │
│ 3   ┆ s1_3   ┆ 1      ┆ ["6_x", "5_x"]   │
│ 4   ┆ s1_4   ┆ 1      ┆ ["6_x"]          │
│ 5   ┆ s1_5   ┆ 1      ┆ ["6_x", "5_x"]   │
│ 6   ┆ s1_6   ┆ 1      ┆ []               │
│ 7   ┆ s2_1   ┆ 5      ┆ ["1_x"]          │
│ 8   ┆ s2_2   ┆ 5      ┆ ["1_x"]          │
│ 9   ┆ s2_3   ┆ 5      ┆ ["1_x"]          │
│ 10  ┆ s2_4   ┆ 5      ┆ ["1_x"]          │
│ 11  ┆ s2_5   ┆ 5      ┆ []               │
│ 12  ┆ s3_1   ┆ 6      ┆ ["1_x"]          │
│ 13  ┆ s3_2   ┆ 6      ┆ ["1_x"]          │
│ 14  ┆ s3_3   ┆ 6      ┆ ["1_x"]          │
│ 15  ┆ s3_4   ┆ 6      ┆ ["1_x"]          │
│ 16  ┆ s3_5   ┆ 6      ┆ []               │
│ 17  ┆ s3_6   ┆ 6      ┆ ["1_x"]          │
└─────┴────────┴────────┴──────────────────┘

I would also like to set m_group column, where I set integer values preceded by source number for groups from the previous, aggregated dataframe. So the resulting dataframe would be something like:


┌─────┬─────────────────────────────────┬────────────────┬────────────────┬────────────────┐
│ id  ┆ str_id                          ┆ source         ┆ m_status       ┆ m_group        │
│ --- ┆ ---                             ┆ ---            ┆ ---            ┆ ---            │
│ i64 ┆ str                             ┆ str            ┆ list[str]      ┆ list[str]      │
╞═════╪═════════════════════════════════╪════════════════╪════════════════╪════════════════╡
│ 1   ┆ s1_1                            ┆ 1              ┆ ["6_x", "5_x"] ┆ ["6_1", "5_1"] │
│ 2   ┆ s1_2                            ┆ 1              ┆ ["6_x", "5_x"] ┆ ["6_2", "5_2"] │
│ 3   ┆ s1_3                            ┆ 1              ┆ ["6_x", "5_x"] ┆ ["6_3", "5_3"] │
│ 4   ┆ s1_4                            ┆ 1              ┆ ["6_x"       ] ┆ ["6_4"]        │
│ 5   ┆ s1_5                            ┆ 1              ┆ ["6_x", "5_x"] ┆ ["6_5", "5_4"] │
│ 6   ┆ s1_6                            ┆ 1              ┆ []             ┆ []             │
│ 7   ┆ s2_1                            ┆ 5              ┆ ["1_x"]        ┆ ["1_1"]        │
│ 8   ┆ s2_2                            ┆ 5              ┆ ["1_x"]        ┆ ["1_2"]        │
│ 9   ┆ s2_3                            ┆ 5              ┆ ["1_x"]        ┆ ["1_4"]        │
│ 10  ┆ s2_4                            ┆ 5              ┆ ["1_x"]        ┆ ["1_3"]        │
│ 11  ┆ s2_5                            ┆ 5              ┆ []             ┆ []             │
│ 12  ┆ s3_1                            ┆ 6              ┆ ["1_x"]        ┆ ["1_1"]        │
│ 13  ┆ s3_2                            ┆ 6              ┆ ["1_x"]        ┆ ["1_2"]        │
│ 14  ┆ s3_3                            ┆ 6              ┆ ["1_x"]        ┆ ["1_5"]        │
│ 15  ┆ s3_4                            ┆ 6              ┆ ["1_x"]        ┆ ["1_3"]        │
│ 16  ┆ s3_5                            ┆ 6              ┆ []             ┆ []             │
│ 17  ┆ s3_6                            ┆ 6              ┆ ["1_x"]        ┆ ["1_4"]        │
└─────┴─────────────────────────────────┴────────────────┴────────────────┴────────────────┘

Having dict that stores latest m_group values for each source different than “1” (I always match other source with respect to source “1”) like:

m_group_dict = {
    "5": 20,
    "6": 25,
}

I wanted to incorporate pl.int_range(start=m_group_dict[source]) into the code above that sets m_status to somehow set the m_group as well, but I can’t make it work. I was thinking about doing somehting like:

m_status = 
pl.when(pl.col.l >= 2, pl.col.representation == "1").then(pl.col.excl1)
.when(pl.col.l >= 2, pl.col.has1).then(["1"])
.otherwise([]),
m_group = 
pl.when(pl.col.l >= 2, pl.col.representation == "1")
.then(pl.int_range(start=m_group_dict[pl.col.source]).cast(pl.List(pl.Utf8)).list.eval(f"{pl.col.representation}_" + pl.element()))
.when(pl.col.l >= 2, pl.col.has1)           
.then(pl.int_range(start=m_group_dict[pl.col.source]).cast(pl.List(pl.Utf8)).list.eval(f"1_" + pl.element()))
.otherwise([])

But cannot get anything like this to work.

Thiết kế website giá rẻ

Danh mục

Setting int value for paired rows based on other columns