TL;DR: I need to map duplicated rows 1:1 within a dataframe. Is there any way to make a join result in a 1:1 mapping (i.e., enforce what the 'validate' parameter only checks)?
Example:
import polars as pl

schema = {
    "partition_number": pl.Utf8,
    "type": pl.Utf8,
    "state": pl.Int64,
    "price": pl.Float64,
    "quantity": pl.Int64,
    "source": pl.Int64,
    "some_other_random_column": pl.Int64,
}
df = pl.read_csv("data.csv", schema=schema)
source1 = df.filter(pl.col('source') == 1)
source2 = df.filter(pl.col('source') == 2)
overlap = source1.join(source2, on=['partition_number', 'type', 'state', 'price', 'quantity'],
how='inner', suffix='_other')
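For what it's worth, I know joins have a 'validate' parameter, but as far as I can tell it only checks the mapping and raises instead of enforcing it. A minimal sketch of what I mean (assuming ComputeError is the exception it raises):

keys = ['partition_number', 'type', 'state', 'price', 'quantity']
try:
    source1.join(source2, on=keys, how='inner', validate='1:1')
except pl.exceptions.ComputeError:
    # duplicated keys on either side fail 1:1 validation
    # instead of being paired up or deduplicated
    pass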
If there are duplicated rows in either of these, the join produces a one-to-many / many-to-many mapping. For example, inputs of shape (50, 7) and (55, 7) can produce 78 rows after the join; I want at most 50 rows, i.e., each row matched at most once. I can then check which rows are not present in the other dataframe (though I don't strictly need this):
all_rows = source1.join(source2, on=['partition_number', 'type', 'state', 'price', 'quantity'],
                        how='full', suffix='_other')
only_in_source1 = all_rows.filter(pl.col('quantity_other').is_null())  # 0 rows
only_in_source2 = all_rows.filter(pl.col('quantity').is_null())  # 5 rows
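As an aside, I believe the same two "only in" sets could come straight from anti joins, without the null filtering (a sketch, assuming how='anti' keeps the left rows that have no key match on the right):

# keys as defined above
only_in_source1 = source1.join(source2, on=keys, how='anti')  # rows with no match in source2
only_in_source2 = source2.join(source1, on=keys, how='anti')  # rows with no match in source1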
I think that in theory the second code block lets me count how many unique 1:1 mappings there are, but I want the rows actually mapped, not just counted (the order of the pairing doesn't matter). I could then assign each pair a "mapped_id" value.
It doesn't have to be done this way; I assumed a join-based approach would probably be the fastest, but any suggestion that helps me map rows 1:1 in an optimized way would be great.
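To make the goal concrete, here is one idea I had for the pairing itself (a minimal sketch; I'm assuming pl.int_range(pl.len()).over(...) numbers rows within each key group and that with_row_index is available in my Polars version; 'occurrence' and 'mapped_id' are names I made up):

# keys as defined above
# Number duplicates within each key group: first occurrence gets 0, the next 1, ...
occurrence = pl.int_range(pl.len()).over(keys).alias('occurrence')
source1_numbered = source1.with_columns(occurrence)
source2_numbered = source2.with_columns(occurrence)

# Joining on keys + occurrence pairs the i-th duplicate on one side with the
# i-th duplicate on the other, so each row matches at most once and the result
# has at most min(len(source1), len(source2)) rows.
mapped = source1_numbered.join(source2_numbered, on=keys + ['occurrence'],
                               how='inner', suffix='_other')
mapped = mapped.with_row_index('mapped_id')  # label each 1:1 pair

If I understand it correctly, keys + 'occurrence' would then be unique on both sides, so validate='1:1' should also pass on that join.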