This seems to be a known issue for PySpark in previous versions, but I'm still seeing this error with
python==3.9.12
pyspark==3.5.0:
from functools import reduce

(
    res[0].join(res[1], on=['account_id']).count(),
    res[1].join(res[2], on=['account_id']).count(),
    res[0].join(res[2], on=['account_id']).count(),
    reduce(lambda x, y: x.join(y, on=['account_id']), res[:3]).count(),
)
It returns
(72122, 1000000, 1000000, 0)
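For clarity, the reduce call is just shorthand for chaining the joins, so the last count corresponds to:

three_way = res[0].join(res[1], on=['account_id']).join(res[2], on=['account_id'])
three_way.count()  # equivalent to the reduce expression, which returns 0 above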
Note that I have three dataframes (all sharing the same 1,000,000 account_id values as rows) stored in res,
and no pairwise join returns 0, but as soon as I join all three the count becomes 0. In theory, each join followed by count
should always return 1,000,000, yet only (df1, df2) and (df0, df2) are correct. Joining (df0, df1) loses some rows (only 72122 remain) and joining (df0, df1, df2) loses everything.
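For reference, here is a rough sketch of the kind of setup I mean. My real res dataframes come from a longer pipeline, so the construction below is only illustrative and may not reproduce the problem on its own:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative only: three dataframes that all share the same 1,000,000 account_id values
base = spark.range(1_000_000).withColumnRenamed('id', 'account_id')
res = [base.withColumn(f'val_{i}', base['account_id'] * i) for i in range(3)]

print((
    res[0].join(res[1], on=['account_id']).count(),
    res[1].join(res[2], on=['account_id']).count(),
    res[0].join(res[2], on=['account_id']).count(),
    reduce(lambda x, y: x.join(y, on=['account_id']), res[:3]).count(),
))
# expected: (1000000, 1000000, 1000000, 1000000)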