I’m using Airflow’s SparkSubmitOperator to submit a Spark Streaming job to a Kubernetes cluster; since it’s a streaming job, it runs indefinitely by design. The Airflow task itself runs in its own Kubernetes pod (Kubernetes executor).
After around 30 hours of runtime, the task’s pod gets OOMKilled (Out of Memory Killed). By default, the resource configuration for the pod is:
```yaml
Limits:
  cpu:     500m
  memory:  512Mi
Requests:
  cpu:     500m
  memory:  512Mi
```
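From what I can tell, overriding the task pod’s resources from the DAG would look roughly like this. This is just a sketch I haven’t run: the DAG/task ids, application path, and memory/CPU values are placeholders, and I’m assuming the per-task `pod_override` mechanism of the Kubernetes executor applies to my setup.

```python
# Sketch only: raise the Airflow task pod's requests/limits via executor_config.
# All ids, paths, and resource values below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="spark_streaming_example",  # placeholder
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_streaming_job = SparkSubmitOperator(
        task_id="submit_streaming_job",
        application="/opt/jobs/stream_job.py",  # placeholder path
        conn_id="spark_default",
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",  # the Kubernetes executor names the task container "base"
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "500m", "memory": "2Gi"},
                                limits={"cpu": "1", "memory": "2Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )
```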
The default 512Mi doesn’t seem sufficient for a job that runs this long, and I’m unsure about best practices for sizing resource limits for these kinds of tasks in Airflow. How can I prevent the pod from getting OOMKilled for long-running Spark Streaming jobs?
I haven’t adjusted the memory settings yet, but I suspect that increasing the memory might solve the immediate problem. However, I’m looking for guidance on:
- How to set appropriate resource requests and limits for long-running Spark Streaming jobs (my rough attempt is sketched after this list).
- Best practices for handling resource allocation to avoid OOMKilled errors for streaming tasks.
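For the first point, my assumption is that the task pod’s limit mainly needs to cover the spark-submit client process (and the driver too, if I run in client mode), while executor memory comes from the Spark configuration rather than the Airflow pod spec. This is roughly what I would pair with the pod override above; again, every value, path, and id is an illustrative placeholder, not something I’ve validated.

```python
# Spark-side memory knobs I believe matter alongside the pod limits;
# all values, ids, and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_streaming_example_memory",  # placeholder
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    submit_streaming_job = SparkSubmitOperator(
        task_id="submit_streaming_job",
        application="/opt/jobs/stream_job.py",  # placeholder path
        conn_id="spark_default",
        driver_memory="2g",    # relevant to the task pod only if the driver runs in client mode
        executor_memory="4g",  # executors run in their own pods, sized by Spark, not by Airflow
        num_executors=2,
        conf={
            # off-heap / native memory headroom, a frequent cause of container OOM kills
            "spark.executor.memoryOverhead": "1g",
            "spark.driver.memoryOverhead": "512m",
        },
    )
```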
Any help would be appreciated!