I’m noticing some unusual behaviour with an Airflow DAG of mine: One of the tasks (a PythonOperator) will fail. The downstream tasks will then show a status of “upstream failed”. Then, after another second or so, the original task that failed will suddenly be marked as “success”.
The task that fails/succeeds doesn’t show anything about the failure in its own log, and it has no retries, so it isn’t failing on the first attempt and then succeeding on a retry.
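For reference, the wiring is roughly like this. This is a simplified sketch, not my actual code: the dag_id/task_id match what shows up in the logs below, but the downstream task name and the callables are just placeholders.

```python
# Simplified sketch of the layout -- dag_id/task_id match the log lines below;
# the downstream task name and the callables are placeholders, not my real code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def my_task_callable(**context):
    # Does the real work and logs progress with logging.info(...) along the way.
    ...


def downstream_callable(**context):
    ...


with DAG(
    dag_id="my_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    my_task = PythonOperator(
        task_id="my_task",
        python_callable=my_task_callable,
        retries=0,  # no retries, so the fail -> success flip isn't a retry
    )

    downstream = PythonOperator(
        task_id="my_downstream_task",  # placeholder name
        python_callable=downstream_callable,
        retries=2,  # downstream tasks do have retries, but they never fire here
    )

    my_task >> downstream
```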
Looking in the environment logs, I can see one entry that reads:
Setting task instance to failed for task_id: my_task
AFTER that line (2 seconds later, going by the log entry timestamps), I can see more log entries emitted by the task’s own logging.info statements!
Then several more seconds later in the log, the task is marked as a success:
Marking task as SUCCESS. dag_id=my_dag, task_id=my_task, …
It looks like Airflow is preemptively deciding that the task failed before it has really even started, and then changing that failure to a success.
This behaviour seems weird to me, but the real problem is what happens downstream: when the task is first marked as failed, all downstream tasks are marked as “upstream failed”, and when the task then switches to success, the downstream tasks STAY as “upstream failed”; Airflow doesn’t correct them. This leaves the DAG run stuck. The downstream tasks even have retries configured, but they never retry in this situation.
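For what it’s worth, the end state is easy to confirm from the metadata DB with something along these lines (a rough sketch; the run_id is a placeholder): my_task ends up as success while everything downstream stays stuck in upstream_failed.

```python
# Rough sketch only -- lists the task-instance states for the stuck run.
# The run_id value is a placeholder; dag_id/task_id are the real ones.
from airflow.models import TaskInstance
from airflow.utils.session import create_session

RUN_ID = "manual__..."  # placeholder for the stuck DAG run's run_id

with create_session() as session:
    tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == "my_dag",
            TaskInstance.run_id == RUN_ID,
        )
        .all()
    )
    for ti in tis:
        # In my case: my_task -> "success", downstream tasks -> "upstream_failed"
        print(ti.task_id, ti.state)
```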
Does anyone have any ideas about what might be causing this behaviour, or how to further debug it? I thought I was finally starting to understand how Airflow works, but now I find myself even more confused by this sort of thing.
Specs: Airflow 2.6.3, running on Cloud Composer (GCP).