I wrote a Flask application that runs an asynchronous data-processing job as a Celery task in a Kubernetes (k8s) worker pod. We use a PostgreSQL database as the Celery result backend. Normally we have no issues: when a task fails, we can easily get the state of the Celery result, along with a traceback to help debug.
Recently, however, we’ve encountered situations where tasks are marked as failed, but the result and traceback in the backend are both null. The error cannot be reproduced: when we reprocess the file, it completes normally.
I suspected the issue could be caused by the worker pod going down for some reason, killing the task without updating its status, but I cannot find any obvious evidence of an outage around the time the errors occurred.
I explored the option of automatically retrying the task on failure, but I have had issues in the past with these failing to execute properly: retries would get marked as PENDING and simply never run. I’ve also seen that others have had similar issues with retried tasks getting stuck in PENDING.
Is this a known Celery issue? Are there any other ways to handle this more gracefully?