During load testing with many concurrent requests, Istio sometimes terminates connections unexpectedly, which results in EOF errors.
We have 2 services, each deployed as a Deployment object. Service A accepts requests from clients and makes requests to service B over HTTP/2 using gRPC. Both services have Istio sidecars injected and a fairly simple configuration: no destination rules, 2 replicas per deployment, and a single virtual service that exposes service A to clients outside the mesh via the Istio gateway. Both service A and service B report EOF in their logs. Service A is written in Node.js and service B is written in Go.
Istio version: 1.24.2
Kubernetes version: EKS 1.30
How do we reproduce the problem?
We run a simple load test scenario with warm-up, ramp-up, and spike phases. The problem appears in the last phase, at approximately 100-150+ concurrent requests.
What actions mitigate the problem?
- Adding an annotation to service B to prevent Istio sidecar injection. In this case there are no EOF errors between service A and service B.
- Keeping the Istio sidecars but reducing service B to 1 replica. This also removes the EOF errors between service A and service B.
What we have tried:
- Experimented with different DestinationRule configurations
- Added keep-alive settings on the Istio side
- Injected a fault and set an artificial timeout of 300s for service B. After 300s, service A successfully received the response from service B and returned the result to the client
- Tried different simple load balancer modes for service B in the DestinationRule of service A. Also tried one consistent hash mode using the source IP
- Tried adding 9 retries with a 5s interval to mitigate the problem (partial success, but out of ~20k requests, 10-20 connections were still terminated)