This seems to be a known issue with PySpark in previous versions, but I'm still seeing it with:
python==3.9.12
pyspark==3.5.0
from functools import reduce

(res[0].join(res[1], on=['account_id']).count(),
 res[1].join(res[2], on=['account_id']).count(),
 res[0].join(res[2], on=['account_id']).count(),
 reduce(lambda x, y: x.join(y, on=['account_id']), res[:3]).count())
It returns
(72122, 1000000, 1000000, 0)
I have three DataFrames stored in res. Every pairwise join returns a non-zero count, but as soon as I chain-join all three of them the count drops to 0.
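For reference, here is a minimal, self-contained sketch of the join pattern I'm using. The DataFrames and column values below are made up purely to illustrate the pairwise vs. chained joins; they are not my real data and do not reproduce the counts above.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Three illustrative DataFrames sharing an 'account_id' key
res = [
    spark.createDataFrame([(i, f"v{i}") for i in range(5)], ["account_id", f"col{n}"])
    for n in range(3)
]

# Pairwise joins on the shared key
pairwise_counts = (
    res[0].join(res[1], on=["account_id"]).count(),
    res[1].join(res[2], on=["account_id"]).count(),
    res[0].join(res[2], on=["account_id"]).count(),
)

# Chained join over all three DataFrames via reduce
chained_count = reduce(lambda x, y: x.join(y, on=["account_id"]), res[:3]).count()

print(pairwise_counts, chained_count)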