A Kubernetes Job can fail for several different reasons, one of which is that the associated container image cannot be pulled from the registry. However, after the job has completed, I can't figure out a way to definitively determine that the failure was due to an image pull failure rather than some other error that caused the deadline to be exceeded.
Consider the case where I create a job similar to the following YAML:
kind: Job
apiVersion: batch/v1
metadata:
  name: test-job-image-pull
  namespace: mynamespace
spec:
  completions: 1
  activeDeadlineSeconds: 300
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: mycontainer
        command:
        - mycommand
        imagePullPolicy: IfNotPresent
        image: 'my/nonexistentcontainer:latest'
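(For reference, I create the job with a plain kubectl apply; the file name here is just an example.)
kubectl apply -f test-job-image-pull.yaml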
When the job starts, it will create the pod and try to pull the my/nonexistentcontainer:latest image, which will fail. While the pod is attempting to pull the image, I can check the pod status and see that the container status is in the waiting state with reason ErrImagePull.
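For example, something like the following (just one way to surface the waiting reason, using the job-name label that the Job controller adds to its pods) prints ErrImagePull or ImagePullBackOff while the pod still exists:
kubectl get pods -n mynamespace -l job-name=test-job-image-pull \
  -o jsonpath='{.items[*].status.containerStatuses[*].state.waiting.reason}'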
But after the deadline is exceeded, the job will fail and the pod will be automatically deleted, so I can no longer retrieve any information about the failure from the pod itself. The Job itself will have a status like the following:
status:
  conditions:
  - type: Failed
    status: 'True'
    lastProbeTime: '2024-07-30T15:49:01Z'
    lastTransitionTime: '2024-07-30T15:49:01Z'
    reason: DeadlineExceeded
    message: Job was active longer than specified deadline
  startTime: '2024-07-30T15:48:31Z'
  failed: 1
  uncountedTerminatedPods: {}
  ready: 0
So I can see that the job failed because of DeadlineExceeded, but I can no longer definitively determine that the failure was due to an image pull error. Is there a way to get Kubernetes to keep the pod around for inspection when the image pull fails? Or is there another way to definitively determine the cause of failure?