I am getting a Spark shuffle FetchFailedException while running a Spark job on EMR with 65 TB of input data. The job runs an aggregated-metrics Spark SQL query over Parquet data in S3, reading 30 days' worth of data.
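For reference, the job is roughly of this shape (a minimal hypothetical sketch only, since the actual test.py is not shown; the bucket, partition column, and metric columns below are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-metrics-aggregation").getOrCreate()

# Read the Parquet dataset from S3 and keep roughly the last 30 days
# (path and date column are assumptions, not taken from test.py).
events = spark.read.parquet("s3://example-bucket/events/")
last_30_days = events.where(F.col("event_date") >= F.date_sub(F.current_date(), 30))

# The group-by aggregation is what drives the large shuffle that fails.
metrics = (
    last_30_days
    .groupBy("event_date", "customer_id")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("amount").alias("total_amount"),
    )
)

metrics.write.mode("overwrite").parquet("s3://example-bucket/metrics/")
```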
**Cluster config:**
300 task nodes: r6g.16xlarge [64 vCore, 488 GiB memory]
**Spark submit command:**
spark-submit --conf spark.sql.files.maxPartitionBytes=268435456
--master yarn --deploy-mode cluster --conf spark.yarn.maxAppAttempts=1
--conf spark.sql.adaptive.enabled=true --conf spark.dynamicAllocation.enabled=false
--conf spark.sql.parquet.filterPushdown=true
--conf spark.sql.adaptive.coalescePartitions.enabled=true
--conf spark.sql.adaptive.advisoryPartitionSizeInBytes=134217728
--conf spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled=true
--conf spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor=.2
--conf spark.sql.adaptive.coalescePartitions.parallelismFirst=false
--conf spark.sql.adaptive.coalescePartitions.initialPartitionNum=42000
--conf spark.sql.adaptive.localShuffleReader.enabled=true
--conf spark.network.timeout=5400s --conf spark.files.fetchTimeout=600s
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=1g
--conf spark.memory.storageFraction=0.05 --conf spark.memory.fraction=.8
--conf spark.shuffle.compress=true --conf spark.shuffle.spill.compress=true
--conf spark.hadoop.fs.s3.multipart.th.fraction.parts.completed=0.99
--conf spark.sql.objectHashAggregate.sortBased.fallbackThreshold=4000000
--conf spark.reducer.maxReqsInFlight=1
--conf spark.network.timeout=1200s
--conf spark.executor.cores=5
--conf spark.executor.instances=3600
--conf spark.sql.shuffle.partitions=36000
--conf spark.executor.memory=36g --conf spark.driver.memory=60g --conf spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g
--conf spark.hadoop.fs.s3a.fast.output.enabled=true
--conf spark.executor.extraJavaOptions="-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -Djavax.net.ssl.trustStore=/home/hadoop/.config/certs/InternalAndExternalTrustStore.jks" --conf spark.driver.extraJavaOptions="-XX:+UseG1GC "
test.py
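For context, this is the back-of-the-envelope resource math implied by the executor settings above, assuming YARN packs executors evenly across the 300 task nodes:

```python
# Plain arithmetic on the numbers from the spark-submit command and cluster config.
nodes = 300
node_cores = 64            # vCores per r6g.16xlarge
node_mem_gib = 488         # memory per node as listed above

executors = 3600           # spark.executor.instances
executor_cores = 5         # spark.executor.cores
executor_mem_gib = 36 + 4  # spark.executor.memory + spark.executor.memoryOverhead

executors_per_node = executors // nodes                  # 12 executors per node
cores_per_node = executors_per_node * executor_cores     # 60 of 64 vCores used
mem_per_node_gib = executors_per_node * executor_mem_gib # 480 of 488 GiB used

print(executors_per_node, cores_per_node, mem_per_node_gib)  # 12 60 480
```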
**Error message:**
org.apache.spark.shuffle.FetchFailedException
at org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:437)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1233)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:971)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:86)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
...
**Spark UI:**