Splitting a Spark Dataset/RDD into N smaller datasets, like randomSplit but without the randomness
I thought that Spark's randomSplit with equal weights would split a dataset into equal parts without duplicated or lost records. That seems to be a wrong assumption: https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc
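For context, here is the kind of deterministic behavior I'm after, sketched in plain Python: assign each record to a split by its index modulo the number of splits, so every record lands in exactly one split. (In Spark I assume the analogous approach would be `rdd.zipWithIndex()` followed by one `filter` per split; the function below just illustrates the idea.)

```python
def deterministic_split(records, num_splits):
    """Split records into num_splits lists deterministically.

    Record i goes to split i % num_splits, so each record appears
    in exactly one split: no duplicates, no losses, no randomness.
    The Spark analogue would be rdd.zipWithIndex() plus, for each
    split i, .filter(lambda kv: kv[1] % num_splits == i).
    """
    splits = [[] for _ in range(num_splits)]
    for i, rec in enumerate(records):
        splits[i % num_splits].append(rec)
    return splits

parts = deterministic_split(list(range(10)), 3)
# Every input record appears in exactly one of the three parts.
```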