Spark EMR executor container failing due to Java heap space
One of my Spark jobs is failing because an executor container dies with "java.lang.OutOfMemoryError: Java heap space". Any recommendations would be appreciated.
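Not a fix specific to this job, but a minimal sketch of the executor-memory settings that are usually adjusted first when executors hit "Java heap space" on EMR/YARN. The concrete values below are illustrative assumptions, not tuned recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("heap-space-debug")
    # Give each executor JVM more heap; it must still fit inside the YARN container.
    .config("spark.executor.memory", "8g")
    # Extra off-heap/overhead room for the container beyond the JVM heap.
    .config("spark.executor.memoryOverhead", "2g")
    # Fewer cores per executor means fewer concurrent tasks sharing one heap.
    .config("spark.executor.cores", "2")
    # More shuffle partitions -> smaller partitions -> less memory per task.
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()
)
```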
AQEShuffleRead in Spark creating too few partitions even though advisoryPartitionSizeInBytes and initialPartitionNum are provided
I have set spark.sql.adaptive.advisoryPartitionSizeInBytes=268435456 and spark.sql.adaptive.enabled=true. However, the data size of each partition is still more than 256 MB. In the DAG I can see that AQEShuffleRead creates fewer partitions than spark.sql.adaptive.coalescePartitions.initialPartitionNum.
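A minimal sketch, assuming Spark 3.x: AQE coalescing can only merge shuffle partitions, never split them, so if the initial partitions are already larger than the advisory size, the 256 MB target cannot take effect until the initial partition count is raised. The numbers below are assumptions for illustration:

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Target post-shuffle partition size (256 MB).
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "268435456")
# Start with many small partitions so AQE has room to coalesce down to ~256 MB.
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "4000")
# In Spark 3.2+, parallelismFirst=true makes AQE favor parallelism over the
# advisory size; setting it to false lets the 256 MB target be respected.
spark.conf.set("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
```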
EMR Spark job creating a maximum of 1000 partitions/tasks when AQE is enabled
I always see 1000 tasks/partitions created for Spark jobs with AQE enabled. Whether I run the job on a month of data (4 weeks) or a single week, the number of shuffle partitions is the same, i.e. 1000 tasks are running. As a result the job throws memory errors. Is there a parameter that caps the number of partitions at 1000?
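A hedged sketch of one common cause (an assumption about this particular setup): with AQE, the number of shuffle partitions before coalescing is capped by spark.sql.adaptive.coalescePartitions.initialPartitionNum, which defaults to spark.sql.shuffle.partitions. If either is fixed at 1000 in the cluster or job configuration, every job tops out at 1000 tasks regardless of data volume. The values below are illustrative, not tuned:

```python
# Inspect the current cap, then raise it so larger inputs get more partitions.
print(spark.conf.get("spark.sql.shuffle.partitions"))
spark.conf.set("spark.sql.shuffle.partitions", "4000")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "4000")
```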
Spark aggregate on multiple columns or a hash
Suppose I want to drop duplicates or perform an aggregation on 3 columns in my Spark DataFrame. Would it be more optimal to operate on the 3 columns directly, or to derive a single hash of them first and operate on that?
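A minimal sketch of the two options being compared, assuming a DataFrame df with hypothetical columns "a", "b", "c":

```python
from pyspark.sql import functions as F

# Option 1: deduplicate / aggregate directly on the three columns.
dedup_direct = df.dropDuplicates(["a", "b", "c"])
agg_direct = df.groupBy("a", "b", "c").agg(F.count("*").alias("cnt"))

# Option 2: derive a single hash column first, then operate on that.
# F.hash returns a 32-bit int, so collisions are possible; sha2 over a
# concatenation is safer if the hash must stand in for the full key.
df_hashed = df.withColumn("key_hash", F.hash("a", "b", "c"))
dedup_hash = df_hashed.dropDuplicates(["key_hash"])
agg_hash = df_hashed.groupBy("key_hash").agg(F.count("*").alias("cnt"))
```

In both cases Spark shuffles on the grouping key, so hashing mainly changes the width of the key being shuffled rather than avoiding the shuffle itself.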
Pyspark job restart from within script running on EMR
I have a PySpark job that runs on an EMR cluster. Is there any way, from within the script itself, to fail the job and then restart it under certain conditions? Currently I throw an exception, but that just fails the job and stops it. I want it to start again automatically.
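A hedged sketch of one approach: rather than letting the exception kill the application, wrap the driver-side logic in a retry loop inside the same script. run_job and RetryableError are hypothetical placeholders for the job's own logic and failure condition:

```python
import time

MAX_ATTEMPTS = 3

class RetryableError(Exception):
    """Raised when the job should be retried rather than fail permanently."""

def run_job(spark):
    ...  # existing job logic; raise RetryableError on the retry condition

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        run_job(spark)
        break
    except RetryableError as err:
        print(f"Attempt {attempt} failed: {err}")
        if attempt == MAX_ATTEMPTS:
            raise  # give up and let the step fail
        time.sleep(60)  # back off before retrying
```

Alternatively, restarts can be handled outside the script, e.g. spark.yarn.maxAppAttempts lets YARN re-run the driver in cluster mode, or the orchestration layer that submits the EMR step can resubmit it on failure.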
Pyspark job piling up memory
I have the following scenario coded in a PySpark job: