I thought that Spark's randomSplit with equal weights would split a dataset into equal parts without duplicates or record losses. That seems to be a wrong assumption:
https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc
I'm doing this to split the dataset into 3 parts:
val splits: List[Dataset[MyEntity]] = entitiesDataset.randomSplit(Array(1.0D, 1.0D, 1.0D)).toList
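To make the concern concrete, this is roughly how I'd check whether the three parts actually add up to the original (a minimal sketch, using the same entitiesDataset as above; the counting is only for illustration):

val parts: Array[Dataset[MyEntity]] =
  entitiesDataset.randomSplit(Array(1.0D, 1.0D, 1.0D))

// If randomSplit truly partitioned the data, the split counts would always
// sum to the original count. According to the article above they may not,
// because each split can re-evaluate the parent plan and sample slightly
// different rows.
val originalCount = entitiesDataset.count()
val splitsTotal = parts.map(_.count()).sum
println(s"original = $originalCount, splits total = $splitsTotal")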
Is there an easy way to split a Spark dataset into X parts without duplicates or record losses?
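The only workaround I can think of, if I read the article correctly, is to persist the dataset before splitting so the sampling sees a stable input (a sketch, not something I'm sure is actually guaranteed):

// Assumption on my side: persisting fixes the rows the sampler sees, so the
// three samples should neither overlap nor drop records.
val persisted = entitiesDataset.persist()
val stableSplits: List[Dataset[MyEntity]] =
  persisted.randomSplit(Array(1.0D, 1.0D, 1.0D)).toList

But this still relies on randomSplit's sampling behaviour, so I'm wondering whether there is a more direct way to partition a dataset into X pieces.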