We had an issue with our cluster that resulted in an outage caused by failing outgoing connections.
The symptom was that a large portion of outgoing connections (connections to SQL Server and to other HTTPS APIs) were failing with timeouts.
Accessing these resources from outside the cluster worked fine, so the problem must reside inside the cluster.
We found that a single service, which was connecting to an external service that had gone down, was using up a lot of connections and seemingly exhausting everything on the VM.
We isolated the issue to services running on the same VM as the service that kept trying to connect to the failing external service.
Furthermore, we checked the number of failed SNAT connections. Although the metrics do show some failed SNAT connections during the first outage, they did not occur at the time services initially started failing.
During the second outage, no failed SNAT connections occurred at all.
Hence our conclusion: some resource tied to the VM must be getting exhausted before SNAT connections start failing. Are there any limits that could result in VM-wide connection issues?
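One way to look for such VM-wide exhaustion is to count TCP sockets per state, for example to see whether connection attempts to the dead external service pile up in SYN_SENT. The following is only a minimal sketch, assuming a Linux node where /proc/net/tcp is readable. The function name count_tcp_states is hypothetical, and /proc/net/tcp only lists IPv4 sockets in the network namespace of the process that reads it, so it would have to be read from inside the pod (or the host namespace) of interest.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text content of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()  # 4th column is the socket state
        counts[TCP_STATES.get(code, code)] += 1
    return counts
```

To use it, pass the result of reading /proc/net/tcp as a string to count_tcp_states and compare the counts on an affected VM with those on a healthy one.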
We found and fixed a range of issues, caused by a very high Max Pool Size on some legacy services, which has greatly lowered the number of active connections.
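To illustrate why Max Pool Size matters here, the sketch below computes an upper bound on the pooled connections that the services on one VM could hold open. It assumes ADO.NET-style pooling, where each process keeps one pool per distinct connection string. The function name and all figures are hypothetical and are not measurements from our cluster.

```python
def worst_case_pooled_connections(services):
    """Upper bound on pooled connections the services on one node could open.

    `services` is a list of tuples:
    (max_pool_size, pools_per_instance, instances_on_node).
    """
    return sum(size * pools * instances for size, pools, instances in services)

# Hypothetical figures for illustration only.
legacy = [(1000, 1, 3), (500, 2, 2)]  # very high Max Pool Size
tuned = [(100, 1, 3), (100, 2, 2)]    # Max Pool Size lowered to 100

print(worst_case_pooled_connections(legacy))  # 5000
print(worst_case_pooled_connections(tuned))   # 700
```

The bound is only reached when every pool fills up, which is most likely when a downstream dependency hangs and requests keep opening new connections while waiting for timeouts.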
Numbers from the AKS load balancer: