Tag Archive for apache-spark, hadoop, pyspark, hdfs

Spark fails at dropDuplicates() due to multiple ExecutorLostFailure

I have 479 Parquet files (each approx. 120 MB, totaling a little over 2 billion records) stored on HDFS. I was trying to determine the best Spark configuration for this dataset, but I cannot get the dropDuplicates() operation to complete. My HDFS configuration is:
