(Please edit the title if you find a more suiting one)
I have following dataframe that contains grouped records:
Table 1
┌──────────────────────────┬─────────────────┬─────────────┐
│ str_id ┆ source ┆ id │
│ --- ┆ --- ┆ --- │
│ list[str] ┆ list[str] ┆ list[i64] │
╞══════════════════════════╪═════════════════╪═════════════╡
│ ["s1_5", "s2_3", "s3_3"] ┆ ["1", "5", "6"] ┆ [5, 9, 14] │
│ ["s1_2", "s2_2", "s3_2"] ┆ ["1", "5", "6"] ┆ [2, 8, 13] │
│ ["s1_3", "s2_4", "s3_4"] ┆ ["1", "5", "6"] ┆ [3, 10, 15] │
│ ["s1_1", "s2_1", "s3_1"] ┆ ["1", "5", "6"] ┆ [1, 7, 12] │
│ ["s1_4", "s3_6"] ┆ ["1", "6"] ┆ [4, 17] │
│ ["s2_5", "s3_5"] ┆ ["5", "6"] ┆ [11, 16] │
│ ["s1_6"] ┆ ["1"] ┆ [6] │
└──────────────────────────┴─────────────────┴─────────────┘
I want to create a new column called m_group that contains int values linking records within group, preceded by f"{source}_"
. I map only source==1
to other sources, therefore records for source==1
will have multiple links, while other sources will only have one. Example: source==1, m_group==5_3
links to source==5 m_group==1_3
The output is following (please disregard column m_status for now):
Table 2
┌─────┬──────────────┬────────────────┬────────────────┬────────────────┐
│ id ┆ str_id ┆ source ┆ m_group ┆ m_status │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ list[str] ┆ list[str] │
╞═════╪══════════════╪════════════════╪════════════════╪════════════════╡
│ 1 ┆ s1_1 ┆ 1 ┆ ["5_3", "6_3"] ┆ ["6_1", "5_1"] │
│ 2 ┆ s1_2 ┆ 1 ┆ ["5_1", "6_1"] ┆ ["6_1", "5_1"] │
│ 3 ┆ s1_3 ┆ 1 ┆ ["5_2", "6_2"] ┆ ["6_1", "5_1"] │
│ 4 ┆ s1_4 ┆ 1 ┆ ["6_4"] ┆ ["6_1"] │
│ 5 ┆ s1_5 ┆ 1 ┆ ["5_0", "6_0"] ┆ ["6_1", "5_1"] │
│ 6 ┆ s1_6 ┆ 1 ┆ [] ┆ [] │
│ 7 ┆ s2_1 ┆ 5 ┆ ["1_3"] ┆ ["1_1"] │
│ 8 ┆ s2_2 ┆ 5 ┆ ["1_1"] ┆ ["1_1"] │
│ 9 ┆ s2_3 ┆ 5 ┆ ["1_0"] ┆ ["1_1"] │
│ 10 ┆ s2_4 ┆ 5 ┆ ["1_2"] ┆ ["1_1"] │
│ 11 ┆ s2_5 ┆ 5 ┆ [] ┆ [] │
│ 12 ┆ s3_1 ┆ 6 ┆ ["1_3"] ┆ ["1_1"] │
│ 13 ┆ s3_2 ┆ 6 ┆ ["1_1"] ┆ ["1_1"] │
│ 14 ┆ s3_3 ┆ 6 ┆ ["1_0"] ┆ ["1_1"] │
│ 15 ┆ s3_4 ┆ 6 ┆ ["1_2"] ┆ ["1_1"] │
│ 16 ┆ s3_5 ┆ 6 ┆ [] ┆ [] │
│ 17 ┆ s3_6 ┆ 6 ┆ ["1_4"] ┆ ["1_1"] │
└─────┴──────────────┴────────────────┴────────────────┴────────────────┘
To achieve this, I iterate over rows in Table 1 and create a list of lists with these links, that I later append to the dataframe and explode().
Code 1
m_g = []
num = 0
for row in df.iter_rows(named=True):
temp = []
if "1" in row["source"]:
for x in row["source"]:
if x == "1":
temp.append([f"{x}_" + str(num) for x in row["source"] if x!='1'])
else:
temp.append(["1_" + str(num)])
m_g.append(temp)
num += 1
else:
for x in row["source"]:
temp.append([])
m_g.append(temp)
# m_g:
# [[['5_0', '6_0'], ['1_0'], ['1_0']], [['5_1', '6_1'], ['1_1'], ['1_1']], [['5_2', '6_2'],
['1_2'], ['1_2']], [['5_3', '6_3'], ['1_3'], ['1_3']], [['6_4'], ['1_4']], [[], []], [[]]]
I add column m_group using list m_g and obtain following dataframe:
Table 3
df = df.with_columns(pl.Series(name="m_group", values=m_g))
┌──────────────────────────┬─────────────────┬─────────────┬─────────────────────────────────┐
│ str_id ┆ source ┆ id ┆ m_group │
│ --- ┆ --- ┆ --- ┆ --- │
│ list[str] ┆ list[str] ┆ list[i64] ┆ list[list[str]] │
╞══════════════════════════╪═════════════════╪═════════════╪═════════════════════════════════╡
│ ["s1_5", "s2_3", "s3_3"] ┆ ["1", "5", "6"] ┆ [5, 9, 14] ┆ [["5_0", "6_0"], ["1_0"], ["1_… │
│ ["s1_2", "s2_2", "s3_2"] ┆ ["1", "5", "6"] ┆ [2, 8, 13] ┆ [["5_1", "6_1"], ["1_1"], ["1_… │
│ ["s1_3", "s2_4", "s3_4"] ┆ ["1", "5", "6"] ┆ [3, 10, 15] ┆ [["5_2", "6_2"], ["1_2"], ["1_… │
│ ["s1_1", "s2_1", "s3_1"] ┆ ["1", "5", "6"] ┆ [1, 7, 12] ┆ [["5_3", "6_3"], ["1_3"], ["1_… │
│ ["s1_4", "s3_6"] ┆ ["1", "6"] ┆ [4, 17] ┆ [["6_4"], ["1_4"]] │
│ ["s2_5", "s3_5"] ┆ ["5", "6"] ┆ [11, 16] ┆ [[], []] │
│ ["s1_6"] ┆ ["1"] ┆ [6] ┆ [[]] │
└──────────────────────────┴─────────────────┴─────────────┴─────────────────────────────────┘
Then I can do:
Code 2
df = (
df.with_columns(
).explode(pl.col("id","str_id","source", "m_group"))
.select(
pl.col("id","str_id","source", "m_group"),
)
.sort("id")
)
And obtain following output:
Table 4
┌─────┬──────────────┬────────────────┬────────────────┐
│ id ┆ str_id ┆ source ┆ m_group │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ list[str] │
╞═════╪══════════════╪════════════════╪════════════════╡
│ 1 ┆ s1_1 ┆ 1 ┆ ["5_4", "6_4"] │
│ 2 ┆ s1_2 ┆ 1 ┆ ["5_2", "6_2"] │
│ 3 ┆ s1_3 ┆ 1 ┆ ["5_3", "6_3"] │
│ 4 ┆ s1_4 ┆ 1 ┆ ["6_0"] │
│ 5 ┆ s1_5 ┆ 1 ┆ ["5_5", "6_5"] │
│ 6 ┆ s1_6 ┆ 1 ┆ [] │
│ 7 ┆ s2_1 ┆ 5 ┆ ["1_4"] │
│ 8 ┆ s2_2 ┆ 5 ┆ ["1_2"] │
│ 9 ┆ s2_3 ┆ 5 ┆ ["1_5"] │
│ 10 ┆ s2_4 ┆ 5 ┆ ["1_3"] │
│ 11 ┆ s2_5 ┆ 5 ┆ [] │
│ 12 ┆ s3_1 ┆ 6 ┆ ["1_4"] │
│ 13 ┆ s3_2 ┆ 6 ┆ ["1_2"] │
│ 14 ┆ s3_3 ┆ 6 ┆ ["1_5"] │
│ 15 ┆ s3_4 ┆ 6 ┆ ["1_3"] │
│ 16 ┆ s3_5 ┆ 6 ┆ [] │
│ 17 ┆ s3_6 ┆ 6 ┆ ["1_0"] │
└─────┴──────────────┴────────────────┴────────────────┘
Question: I’m looking for a way to make it more efficient.
For example the way to set m_status (Table 2) using dataframe from Table 1 is:
Code 3
def set_m_status(df: pl.DataFrame) -> pl.DataFrame:
df = (
df.with_columns(
l = pl.col.source.len(),
has1 = pl.col.source.list.contains("1"),
excl1 = pl.col.source.list.set_difference([1]).list.eval(pl.element() + "_1")
).explode(pl.col("id","str_id","source", "m_group"))
.select(
pl.col("id","str_id","source", "m_group"),
m_status =
pl.when(pl.col.l >= 2, pl.col.source == "1").then(pl.col.excl1)
.when(pl.col.l >= 2, pl.col.has1).then(["1"])
.otherwise([])
)
.sort("id")
)
return df
Example data from the Table 2 for the dataframe:
str_id_series = pl.Series("str_id", [["s1_5", "s2_3", "s3_3"], ["s1_2", "s2_2", "s3_2"], ["s1_3", "s2_4", "s3_4"], ["s1_1", "s2_1"], ["s1_4", "s3_6"], ["s2_5", "s3_5"], ["s1_6"]])
source_series = pl.Series("source", [["1", "5", "6"], ["1", "5", "6"], ["1", "5", "6"], ["1", "7"], ["4", "17"], ["5", "6"], ["1"]])
id_series = pl.Series("id", [[5, 9, 14], [2, 8, 13], [3, 10, 15], [1, 7, 12], [4, 17], [11, 16], [6]])
df = pl.DataFrame([str_id_series, source_series, id_series])
df_data = pl.read_csv("data.csv")
should load