Spark fails at dropDuplicates() due to multiple ExecutorLostFailure
I have 479 Parquet files (approx. 120 MB each, a little over 2 billion records in total) stored on HDFS, and I am trying to determine the best Spark configuration for this dataset, but I cannot get the dropDuplicates() operation to complete: the job keeps dying with ExecutorLostFailure errors. My HDFS configuration is:
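For reference, the deduplication step itself is nothing exotic; a minimal sketch of the job, assuming hypothetical input and output paths on HDFS:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()

# Hypothetical paths; the real dataset locations differ.
df = spark.read.parquet("hdfs:///data/events")  # 479 Parquet files, ~2B rows

# dropDuplicates() forces a full shuffle of every row; the executors
# are lost during this shuffle stage.
deduped = df.dropDuplicates()

deduped.write.mode("overwrite").parquet("hdfs:///data/events_dedup")
```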
How to check which HDFS datanode IP is returned by the namenode to Spark?
If I’m reading or writing a DataFrame in PySpark while specifying the HDFS namenode hostname and port explicitly:
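Something like the following, where the namenode hostname, port, and paths are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-io").getOrCreate()

# Hypothetical namenode hostname/port and paths; substitute your own.
df = spark.read.parquet("hdfs://namenode.example.com:8020/user/data/input")

df.write.mode("overwrite").parquet(
    "hdfs://namenode.example.com:8020/user/data/output"
)
```

The namenode only serves metadata; the actual blocks are streamed from datanodes, so how can I see which datanode IPs the namenode hands back to Spark for these reads and writes?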