Left Anti Join is not performing as expected – It is supposed to remove the records which are already there. But duplicates are coming.
Here is the code –
bronze_population_only_utc = final_population.filter(
(sqlf.col("utuc").isNotNull())
& (sqlf.col("utuc").isin("ut_bronze_pros", "uc_bronze_pros"))
)
bronze_population_without_utc = final_population.join(
bronze_population_only_utc.select("index_key"),
on=["index_key"],
how="left_anti",
)
So,bronze_population_without_utc should not contain any records which are already present in bronze_population_only_utc.
final_population is unique on index_key – that we already checked.
But
bronze_utuc_in_without_utuc = bronze_population_without_utuc.filter(
(sqlf.col("utuc").isNotNull())
& (sqlf.col("utuc").isin("ut_bronze_pros", "uc_bronze_pros"))
)
gives us different records in every run.
I have got a fix but I am not really understanding why Left Anti Join is not working in this case – what could be the possible explanations.