Our Kafka topic has 40 partitions, and we are running 20 pods with 2 threads each. During peak data rates, some partitions show high consumer lag.
This is not specific to a single pod. We see it on more than one pod: for example, one thread in pod P1 and one thread in pod P2 are not able to keep up with the partitions assigned to them.
CPU and memory usage stay well below the limits we have set, even during the peak data rate when the lag appears.
Also, the processing delay (time when data is sent/sinked from the job minus time when data reaches the job) is close to 100 ms, which suggests that processing the data is not the problem.
We are using Kafka Streams in a Spring Boot application running on Kubernetes.
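For reference, here is a minimal sketch of the setup described above, not our actual code. It assumes the thread count is set through the Kafka Streams `num.stream.threads` property; the class name, helper methods, and `application.id` value are hypothetical. It also shows the arithmetic behind the layout: with 40 partitions and 20 pods with 2 threads each, an even assignment would give each thread one partition.

```java
import java.util.Properties;

public class StreamsSetupSketch {

    // Builds the Kafka Streams settings relevant to this question.
    // "lag-demo-app" is a placeholder, not our real application id.
    static Properties streamsProps(int threadsPerPod) {
        Properties props = new Properties();
        props.put("application.id", "lag-demo-app");
        props.put("num.stream.threads", Integer.toString(threadsPerPod));
        return props;
    }

    // Average number of partitions per stream thread,
    // assuming partitions are spread evenly across all threads.
    static double partitionsPerThread(int partitions, int pods, int threadsPerPod) {
        return (double) partitions / (pods * threadsPerPod);
    }

    public static void main(String[] args) {
        Properties props = streamsProps(2);
        int threads = Integer.parseInt(props.getProperty("num.stream.threads"));
        // 40 partitions, 20 pods, 2 threads per pod
        System.out.println("partitions per thread: "
                + partitionsPerThread(40, 20, threads));
    }
}
```

If the assignment is even, a lagging partition should map to exactly one thread, which matches what we observe (one thread per affected pod falling behind).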
We need some guidance on finding the root cause of this issue.