Airflow is deployed on GKE, on an Autopilot Kubernetes cluster. I'm experiencing an issue with Apache Airflow and Kubernetes where the status of the pods in Kubernetes is not accurately reflected in Airflow. Here are the details (the CeleryKubernetesExecutor is being used; Airflow version 2.7.1).
Issue Summary:
When triggering multiple Airflow tasks that require significant resources (pod-size=XL), some pods get stuck in the "ContainerCreating" state due to resource constraints. After 120 seconds, the Airflow tasks move to the `up_for_retry` state. Eventually, the pods get the necessary resources and complete successfully, but the Airflow tasks remain in the `up_for_retry` state.
Example:
- Trigger multiple Airflow tasks with significant resource requirements (pod-size=XL); the task configuration is sketched after this list.
- Some pods get stuck in the "ContainerCreating" state due to insufficient memory and CPU resources.
- After 120 seconds, the Airflow tasks move to the `up_for_retry` state.
- Eventually, the pods get the necessary resources and complete successfully.
- Despite the pods completing successfully, the Airflow tasks remain in the `up_for_retry` state.
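For context, the tasks are defined along these lines. This is a minimal sketch only: the DAG name, task name, image, and the concrete CPU/memory figures behind pod-size=XL are illustrative placeholders, not our exact values.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="heavy_workload",           # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    heavy_task = KubernetesPodOperator(
        task_id="heavy_task",          # placeholder task name
        name="heavy-task",
        image="example/image:latest",  # placeholder image
        # Route to the Kubernetes side of the CeleryKubernetesExecutor.
        queue="kubernetes",
        # Roughly what pod-size=XL resolves to (illustrative numbers):
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "32Gi"},
            limits={"cpu": "8", "memory": "32Gi"},
        ),
        retries=3,                     # the retries that leave tasks in up_for_retry
        retry_delay=timedelta(minutes=5),
    )
```

On Autopilot, requests of this size typically have to wait for node provisioning, which is consistent with the FailedScheduling events below.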
Error Logs:
```
[2024-07-01, 15:24:16 JST] {pod.py:560} ERROR - Pod Event: FailedScheduling - 0/8 nodes are available: 1 Insufficient memory, 8 Insufficient cpu. preemption: 0/8 nodes are available: 8 No preemption victims found for incoming pod.
```
Steps Taken:
- Adjusted `startup_timeout_seconds` to 600 seconds (the default is 120 seconds) to give the pods more time to start, but this did not resolve the issue; see the sketch after this list.
- Investigated potential IP address exhaustion, but found no issues.
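Concretely, the first step was applied like this (a sketch using the same placeholder operator as above):

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

heavy_task = KubernetesPodOperator(
    task_id="heavy_task",          # placeholder, as in the earlier sketch
    name="heavy-task",
    image="example/image:latest",  # placeholder image
    # The window the task waits for the pod to leave Pending/ContainerCreating
    # before failing into up_for_retry; raised from the 120 s default.
    startup_timeout_seconds=600,
)
```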