We have a Kubernetes v1.27.7 on-prem cluster with 3 nodes. One node was stopped and went into the NotReady state, but its pods are not rescheduled onto any other node, even after a couple of hours.
kubectl get nodes
NAME    STATUS     ROLES           AGE   VERSION
node1   Ready      control-plane   84d   v1.27.7
node2   Ready      control-plane   84d   v1.27.7
node3   NotReady   <none>          84d   v1.27.7
node3 has two taints:
Taints:             node.kubernetes.io/unreachable:NoSchedule
                    node.kubernetes.io/unreachable:NoExecute
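For reference, the taints above were read from the node object; they can also be dumped directly (node name taken from the output above):

kubectl get node node3 -o jsonpath='{.spec.taints}{"\n"}'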
Multiple pods on this node are still showing "Ready" and some are showing "Terminated", all stuck like that for hours. They all have fairly standard settings; here is kubectl describe for one of them:
kubectl describe pod <...>
<...>
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   True
  PodScheduled      True
<...>
Tolerations:        node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
(The empty Events section is even stranger, because for the first hour it did contain a "Node is not ready" warning.)
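For reference, the effective tolerations on a stuck pod can also be dumped directly (the pod name is a placeholder):

kubectl get pod <pod-name> -o jsonpath='{.spec.tolerations}{"\n"}'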
In other words, I don't get the expected behavior of pods being rescheduled from the down node onto other nodes. Not every pod gets stuck like this: some go through the whole eviction flow without an issue, even though they come from near-identical Deployments, with no significant difference apart from the service image itself and the ENV variables (a way to diff two such Deployments is sketched below).
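For example, comparing a Deployment whose pod gets stuck with one whose pod is evicted normally (Deployment names are placeholders):

diff <(kubectl get deployment <stuck-deploy> -o yaml) <(kubectl get deployment <ok-deploy> -o yaml)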
In events:
kubectl get events
<...>
12m     Warning   NodeNotReady           pod1              Node is not ready
7m33s   Normal    TaintManagerEviction   pod1              Marking for deletion Pod
7m32s   Normal    Scheduled              pod1              Successfully assigned to node1
7m21s   Normal    AddedInterface         pod1              Add eth0 from k8s-pod-network
7m20s   Normal    Pulled                 pod1              Container image on machine
7m20s   Normal    Created                pod1              Created container
7m20s   Normal    Started                pod1              Started container
7m32s   Normal    SuccessfulCreate       replicaset/pod2   Created pod
<...>
13m     Warning   NodeNotReady           podA              Node is not ready
8m26s   Normal    TaintManagerEviction   podA              Marking for deletion Pod podA
<...>
Basically, podA was supposed to be deleted, but instead it is stuck showing "Ready" and never moved on to the next stage of eviction.
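One thing worth checking on such a pod is whether the API server ever set a deletion timestamp on it and whether any finalizers are holding it:

kubectl get pod podA -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'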
What should I check, and where should I look, to get to the bottom of why this happens? The further complication is that the real issue is on a different (production) cluster with a different installation: the Kubernetes version there is even older, v1.25.6 rather than v1.27.7. On every other cluster I experiment with, I cannot replicate the issue and everything is evicted correctly after 5 minutes. Other questions I found here were about eviction-settings problems on older Kubernetes versions, but that is not the case here. So I'm stuck playing "spot 10 differences" and need advice on where to dig.