I have a Spark remote server set up in k8s. I have a dataset of 70 million records on which I perform some joins and then write the result to a db. All my data is successfully written to the db and there is no activity in the Spark UI; however, my Spark client application hangs and does not exit most of the time.
Here is my Spark conf:
spark.app.name spark
spark.submit.deployMode client
spark.scheduler.mode FIFO
spark.ui.port 4040
spark.connect.grpc.binding.port 15002
spark.kubernetes.namespace default
spark.kubernetes.container.image.pullSecrets pregistrykey
spark.driver.bindAddress 0.0.0.0
spark.driver.host connect-server-headless-service.default.svc.cluster.local
spark.speculation false
spark.executor.instances 9
spark.dynamicAllocation.executorIdleTimeout 180s
spark.driver.memory 8g
spark.executor.memory 21g
spark.executor.cores 6
spark.sql.autoBroadcastJoinThreshold 2m
spark.network.timeout 100000s
spark.kubernetes.executor.container.image xxxxx
spark.executor.extraClassPath /opt/spark/jars/postgresql-42.7.1.jar
spark.kubernetes.container.image.pullPolicy Always
spark.master k8s://https://kubernetes.default.svc
spark.kubernetes.authenticate.executor.serviceAccountName spark
spark.kubernetes.authenticate.driver.serviceAccountName spark
spark.stage.maxConsecutiveAttempts 1
spark.kubernetes.executor.deleteOnTermination true
spark.task.maxFailures 1
What could be the possible reasons for this?
Here are the last logs from the driver:
24/06/24 13:36:54 INFO BlockManagerInfo: Removed broadcast_1130_piece0 on 10.244.3.136:35345 in memory (size: 4.1 KiB, free: 12.4 GiB)
24/06/24 13:36:54 INFO BlockManagerInfo: Removed broadcast_1130_piece0 on 10.244.4.185:36247 in memory (size: 4.1 KiB, free: 12.4 GiB)
24/06/24 13:36:54 INFO BlockManagerInfo: Removed broadcast_1130_piece0 on 10.244.6.84:36919 in memory (size: 4.1 KiB, free: 12.4 GiB)
24/06/24 13:36:54 INFO BlockManagerInfo: Removed broadcast_1130_piece0 on 10.244.2.33:45801 in memory (size: 4.1 KiB, free: 12.4 GiB)
24/06/24 13:36:54 INFO BlockManagerInfo: Removed broadcast_1203_piece0 on connect-server-headless-service.default.svc.cluster.local:38109 in memory (size: 16.6 KiB, free: 4.6 GiB)
24/06/24 13:36:54 INFO BlockManagerInfo: Removed broadcast_1203_piece0 on 10.244.7.193:39017 in memory (size: 16.6 KiB, free: 12.4 GiB)
I tried increasing and decreasing the number of partitions and unpersisting the DataFrames, without any effect.
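
One check that might narrow this down is to list which threads are still alive in the client process once the write has finished, since a non-daemon thread left running will keep a Python process from exiting. The following is only a minimal sketch, assuming the client is a Python application (the post does not say which language it uses); the helper name `live_non_daemon_threads` is hypothetical and uses only the standard library, not any Spark API.

```python
import threading


def live_non_daemon_threads():
    """Return the names of live non-daemon threads other than the main thread.

    Non-daemon threads are the ones the interpreter waits for at shutdown,
    so anything listed here can keep the client process from exiting.
    """
    main = threading.main_thread()
    return [
        t.name
        for t in threading.enumerate()
        if t is not main and not t.daemon and t.is_alive()
    ]


if __name__ == "__main__":
    # Assumption: this runs as the last statement of the client script,
    # after the write to the db has returned.
    print("Threads that can block exit:", live_non_daemon_threads())
```

If the list is empty and the process still hangs, the cause is probably outside Python threads (for example an open connection or a child process), so this only rules one possibility in or out.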