
Tag Archive for apache-spark, pyspark, apache-spark-sql, amazon-emr

Spark Repartition/shuffle optimization

I am trying to repartition before applying any transformation logic. This takes a lot of time. The code and a snapshot of the Spark UI are below. Can any optimization be applied here?

What does retry in the Spark UI mean?

I have Spark executed in two different instances (one with spark.sql.adaptive.coalescePartitions.enabled=false and the other with spark.sql.adaptive.coalescePartitions.enabled=true). In the first instance the stage graph has some retries and the other one doesn't. Can you help me understand what retry means here and why it is doing the extra data shuffle?

Spark EMR jobs: is the number of tasks defined by AQE (adaptive.enabled)?

I see that the number of tasks in the Spark job is only 1000 after the initial read, whereas the number of cores available is 9000 (1800 executors × 5 cores each). I have enabled AQE and coalesce shuffle partitions. In the picture below you can see there are only 1000 tasks running. However, the input stage with a 256 MB split has around 141,000 tasks. The code is in PySpark (SQL aggregate functions on top of S3, writing back to S3).
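The numbers in the question can be sanity-checked with back-of-envelope arithmetic (all figures below are from the question itself). The post-shuffle task count is set by the shuffle partition configuration, which AQE's partition coalescing can only merge downward, so 1000 tasks on 9000 cores leaves most of the cluster idle during that stage:

```python
# All figures taken from the question text.
executors = 1800
cores_per_executor = 5
total_cores = executors * cores_per_executor   # 1800 * 5 = 9000 task slots

input_tasks = 141_000    # initial read with ~256 MB splits
shuffle_tasks = 1000     # tasks observed after the shuffle

# Only 1000 tasks can run at once in the shuffle stage, so at most
# 1000 of the 9000 cores are busy during it.
utilization = shuffle_tasks / total_cores      # ~0.11, i.e. ~11% of cores
```

Since AQE coalescing only reduces the partition count, raising the starting point (for example via spark.sql.shuffle.partitions, or spark.sql.adaptive.coalescePartitions.initialPartitionNum when AQE is on) is the usual way to give the coalescer room to land nearer the core count.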