I have two tables with columns containing the same information (e.g. company_name
, address
, zipcode
), but there some caveats:
- in one of the two tables, one of the information can be missing (either
company_name
oraddress
orzip
) - the same company can have different name formats (e.g.
Apple
,Apple Inc
) - the same company can have a different address format (e.g.
123 Street
andInside Mall X
)
So far I have tried to:
- merge the two datasets
- left join the new dataset with itself on all the columns by finding the first available match (with Polars, the join is on
[pl.coalesce('company_name', 'address', 'zipcode')]
After this, I thought of selecting the original columns with a condition to fetch the first available value between the original columns and the joined column.
I need to find a way to address point 2 and 3, but what I am really looking for is some understanding on how to approach the join in the context of point 1. Do you have any tip?
I can use either plain SQL or Spark or Polars.
Thanks a lot!