I use Spark 3.4.1 on Kubernetes in cluster mode. The problem I experienced was that I had a syntax error in the log4j2 settings for the executor, so each executor pod fails as soon as it is started. The driver then creates another executor, which also fails around 10-20 seconds later, and so forth…
I wonder whether there is a Spark setting that prevents the creation of executor pods beyond a certain failure threshold.
According to https://spark.apache.org/docs/latest/configuration.html there is a setting for Spark 3.5.0, spark.executor.maxNumFailures. But I would like to avoid upgrading my Spark version to 3.5.0, because there is a rather big list of other dependencies (like Kafka) which I would need to upgrade and test as well.
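For context, this is roughly how I would expect to use that setting if I were on 3.5.0 (the jar path and image name are placeholders from my setup, not something anyone else should copy literally):

```shell
# Hypothetical spark-submit on Spark 3.5.0, where the documented
# spark.executor.maxNumFailures would cap executor failures before
# the application is failed instead of respawning pods forever.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<my-spark-image> \
  --conf spark.executor.maxNumFailures=5 \
  local:///opt/spark/app/my-app.jar
```

I am looking for something equivalent that works on 3.4.1.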
Furthermore, spark.dynamicAllocation.maxExecutors seems to cap the number of executors running concurrently for a job, but not to limit the total number of executor pods started by the Spark driver in case of failures.
Is there another mechanism that prevents the Spark driver from creating an unbounded number of executor pods for a task?
Thanks!