Did anyone get an answer for this? I am having a similar issue on EKS 1.28. The issue is intermittent: out of, say, 10 nodes, some cannot ping each other on the vxlan.calico interface, so pods on these faulty nodes cannot communicate with pods on other nodes, or even with pods on the same node. Many of the nodes work fine, however.

These are self-managed nodes that are part of an ASG. If I kill the faulty nodes, the new nodes created by the ASG work just fine. However, we can't always do that, and I am not sure what's going on. There are no network policies.

So far I have tried upgrading EKS to 1.29 and upgrading Calico (the CNI that assigns private IP addresses to all pods) to the latest version. It is not a security group issue, as I allowed all inbound and outbound ports while troubleshooting. There are no NACLs involved. I tried killing the Calico DaemonSet pods on the faulty nodes (they were immediately replaced with new ones), restarting kubelet, and rebooting the faulty nodes, but nothing seems to work. Pods get scheduled onto the faulty nodes without complaint, but communication is broken, so they are useless. Any help is greatly appreciated.
The issue shouldn’t be intermittent.
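In case it helps anyone narrow down which nodes are actually the faulty ones, here is a rough sketch of how the node-to-node ping results could be tallied. It is only an illustration: the node names, the `find_suspect_nodes` helper, and the 0.5 threshold are made up for this example, and the ping results are assumed to have been collected by hand (pinging each node's Calico VXLAN address from every other node). It also assumes the faulty nodes are a minority, as in my cluster; if most nodes are broken, healthy nodes will be flagged too.

```python
from collections import defaultdict


def find_suspect_nodes(ping_results, threshold=0.5):
    """Return nodes involved in a failing ping more often than `threshold`.

    ping_results maps (source_node, destination_node) -> True if the ping
    over the VXLAN interface succeeded, False otherwise.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)

    for (src, dst), ok in ping_results.items():
        # A failed ping counts against both ends, since either could be at fault.
        for node in (src, dst):
            attempts[node] += 1
            if not ok:
                failures[node] += 1

    # Flag nodes whose failure ratio is strictly above the threshold.
    return sorted(
        node for node in attempts
        if failures[node] / attempts[node] > threshold
    )


# Hypothetical results: node-d cannot talk to anyone, the rest are fine.
results = {
    ("node-a", "node-b"): True,
    ("node-a", "node-c"): True,
    ("node-b", "node-c"): True,
    ("node-a", "node-d"): False,
    ("node-b", "node-d"): False,
    ("node-c", "node-d"): False,
}

suspects = find_suspect_nodes(results)
print(suspects)
```

A node that fails against every peer stands out clearly this way, which at least tells you where to compare routes and interface state against a healthy node.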
Amol Pali is a new contributor to this site. Take care in asking for clarification, commenting, and answering.