Operating in PySpark==3.4.1
There are two datasets that share a primary key as well as other common columns whose values fall in a shared range. Datatypes match, and the schema may read the columns as either nullable=false or nullable=true.
When joining at scale (i.e. > 100,000 records):
dataset1.join(dataset2, on="primary_key", how="left")
dataset1.join(dataset2, on="shared_label", how="left")
The primary key in the resulting dataset is overwritten: for example, values that ranged from 1-10 now read 11-20, nearly doubling the symmetric difference between the primary-key sets (at scale a small percentage of primary keys still overlap, but the symmetric difference relative to the original key count is ~1.9998).
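How the drift was measured, roughly (a sketch, continuing from the frames above):

# Compare distinct primary keys before and after the join; after a left join
# on primary_key these sets should be identical (symmetric difference ~0)
keys_before = dataset1.select("primary_key").distinct()
keys_after = joined.select("primary_key").distinct()

only_before = keys_before.subtract(keys_after).count()
only_after = keys_after.subtract(keys_before).count()

# Symmetric difference relative to the original key count; at scale this
# comes out around 1.9998 instead of ~0
sym_diff_ratio = (only_before + only_after) / keys_before.count()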
Notably, this occurs after running PySpark's BisectingKMeans model.
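The clustering step follows the standard ML API, roughly like this (the feature column and k are placeholders):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans

# Assemble the (placeholder) numeric columns into a features vector
assembler = VectorAssembler(inputCols=["feature"], outputCol="features")
features_df = assembler.transform(dataset1)

# Fit BisectingKMeans; joins performed after this step show the key corruption
bkm = BisectingKMeans(k=2, featuresCol="features", predictionCol="cluster")
clustered = bkm.fit(features_df).transform(features_df)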
Removing this join or attempting to use a different key just pushes the issue downstream (the same issue appears in a different place).
Potential workarounds are switching to the SQL context, using rdd.join, or creating a monotonically increasing ID that maps back to (and later rewrites) the original primary-key set.
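A sketch of the third option, keeping a surrogate-key mapping so the original primary keys can be rewritten after the joins (column names as above; untested at scale):

from pyspark.sql import functions as F

# Attach a surrogate key; caching keeps the generated IDs stable if the
# DataFrame gets recomputed
dataset1_ids = dataset1.withColumn("row_id", F.monotonically_increasing_id()).cache()

# Mapping from the surrogate key back to the original primary key
pk_map = dataset1_ids.select("row_id", "primary_key")

# Run the join without relying on the (unstable) primary key column
intermediate = (
    dataset1_ids.drop("primary_key")
    .join(dataset2.drop("primary_key"), on="shared_label", how="left")
)

# Rewrite the original primary keys from the mapping afterwards
restored = intermediate.join(pk_map, on="row_id", how="left")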