I have a 14-node cluster (kubeadm) on Azure VMs (Ubuntu 22.04). Lately the service has been very slow to almost unresponsive. I tried pinging one node from another and the RTT is <1 ms; pinging pod to pod also looks fine. I don't have iperf on the nodes and can't install any network tools for now.
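For reference, this is roughly how I did those checks, using only what's already on the nodes and in the images (pod name and IPs below are placeholders):

ping -c 5 10.8.170.17                                    # node to node, run from another VM
kubectl get pods -o wide                                 # to get pod IPs and which node each pod runs on
kubectl exec -it frontend-pod -- ping -c 5 10.42.1.25    # pod to pod, assuming ping exists in the image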
I tried to pathping my UI endpoint, and here are the results:
pathping mydomain.com
Tracing route to mydomain.com [10.42.3.18]
over a maximum of 30 hops:
0 yy.mydomain.com [10.8.51.174]
1 10.8.0.2
2 10.8.170.145
3 10.8.170.17
4 192.168.170.88
5 10.42.0.110
6 * * *
Computing statistics for 125 seconds...
                 Source to Here    This Node/Link
Hop  RTT    Lost/Sent = Pct   Lost/Sent = Pct   Address
  0                                             yy.mydomain.com [10.8.51.174]
  1    0ms     0/ 100 =  0%                     10.8.0.2
  2    0ms     0/ 100 =  0%                     10.8.170.145
  3    1ms     0/ 100 =  0%                     10.8.170.17
  4     --   100/ 100 = 100%   98/ 100 = 98%    192.168.170.88
  5  202ms     2/ 100 =  2%     0/ 100 =  0%    10.42.0.110
I'm not sure what the 192.168.x.x address at hop 4 is. I use flannel, and the pod CIDR is 10.*.
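For context, these are the checks I can still run without installing anything, to see where that 192.168 address might come from (the flannel namespace below is a guess; depending on the install it may be kube-flannel or kube-system):

kubectl get nodes -o wide                                                       # INTERNAL-IP of every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR   # per-node pod CIDRs
ip route get 192.168.170.88          # on a node: which interface/gateway would be used to reach that hop
ip addr show flannel.1               # overlay interface address on the node
kubectl -n kube-flannel get pods -o wide   # flannel daemonset pods (or -n kube-system)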
The communication flow is the following: a request is sent to a pod (3 replicas) that serves the frontend part of the app. When I open a page that needs to call another pod and fetch results, that's where it gets stuck, or it loads only after a really long time. I can see in the logs that the request times out, but I can't find the reason. I checked all the metrics from node-exporter and everything seems fine, and the Azure metrics for the VMs also look fine. I have enough resources on the nodes, and I even tried restarting flannel. I really have no idea about this internal 192.x IP or how to troubleshoot it further. Communication from the UI pod to the DB pod works fine when I ping it.
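If it matters, this is the kind of check I can still do from inside the frontend pod without installing anything (the service name, port, path, and pod IP below are placeholders, and it assumes wget is already in the image):

kubectl exec -it frontend-pod -- sh
# inside the pod: first against the backend Service, then against the pod IP directly (to rule out DNS / kube-proxy)
time wget -qO- http://backend-svc:8080/api/health
time wget -qO- http://10.42.2.15:8080/api/health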