I am working with a Kafka consumer that occasionally hangs or lags while processing messages from a partition. This raises several concerns, and I need guidance on best practices and configurations to address these issues:
Fallback Consumer: If a Kafka consumer hangs or lags, how can I start another consumer to take over and read messages from the same partition?
Unprocessed Messages: If a consumer reads messages but hangs/does not process them, what happens to those unprocessed messages? Will they remain in the partition, or will they be considered processed?
Resolving Consumer Hang/Lag: What are the common reasons for a Kafka consumer to hang or lag, and what configurations or properties can be adjusted to prevent this? Specifically:
Which Kafka consumer properties (e.g., session.timeout.ms, max.poll.interval.ms, etc.) should be modified from their default values to handle such situations?
Are there other strategies, such as monitoring, error handling, or retries, that I can implement?
Here’s a simple example of the issue:
A Kafka consumer starts reading messages from a topic and processes them.
During processing, the consumer hangs or lags, and the offsets of the messages it read are never committed.
I want to ensure that these unprocessed messages are handled correctly and prevent message loss or duplication.
Any guidance on resolving these issues and best practices for configuring Kafka consumers to handle such scenarios would be greatly appreciated.
- There is no guarantee the new consumer won't hang as well.
- The new consumer would have to join the same consumer group so that a rebalance hands it the partition. The alternative is the manual assignment API (`assign` rather than `subscribe`), but then you are managing offsets yourself for the "stuck" consumer.
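As a sketch of that takeover path: a standby consumer started with the same `group.id` will be handed the partition once the stuck one is evicted, i.e. after `max.poll.interval.ms` elapses with no `poll()` call. Illustrative values only, not tuned recommendations; the group name is hypothetical:

```properties
group.id=my-consumer-group         # hypothetical group name; standby must match
session.timeout.ms=45000           # how long until a dead (not slow) consumer is evicted
heartbeat.interval.ms=15000        # typically ~1/3 of session.timeout.ms
max.poll.interval.ms=300000        # max gap between poll() calls before rebalance
max.poll.records=100               # smaller batches = faster poll loop, less chance of timing out
```

Lowering `max.poll.records` and raising `max.poll.interval.ms` are the usual first levers: together they keep the poll loop short enough that a slow-but-alive consumer is not evicted unnecessarily.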
- The broker doesn't track processing; the consumer group tracks committed offsets. Records also don't know if/when/where processing of the data happens... the broker is dumb/basic. Records that were read but never committed simply stay in the partition (until retention expires) and are re-delivered from the last committed offset.
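The mechanics can be modeled in a few lines (a toy simulation, not the real client): the partition is an append-only log, and the only state the group keeps is a committed offset, so anything fetched but not committed is fetched again by whoever polls next.

```python
class Partition:
    """Toy model of one Kafka partition plus a consumer group's committed offset.
    The broker never marks records 'processed'; it stores the log and,
    separately, the last offset the group committed."""

    def __init__(self, records):
        self.log = list(records)   # append-only log; records are never removed by reads
        self.committed = 0         # the consumer group's committed offset

    def poll(self, max_records=10):
        # Delivery always resumes from the committed offset, not from
        # wherever a previous (possibly hung) consumer had read to.
        return self.log[self.committed:self.committed + max_records]

    def commit(self, offset):
        self.committed = offset


p = Partition(["a", "b", "c"])
first = p.poll()                  # a consumer reads all three records...
assert first == ["a", "b", "c"]
# ...but hangs before committing. A replacement consumer polls and gets
# the very same records, because the committed offset never moved.
assert p.poll() == ["a", "b", "c"]
p.commit(3)                       # commit only after successful processing
assert p.poll() == []             # nothing left to re-deliver
```

This is also why committing *before* processing risks loss, and committing *after* processing risks duplicates: exactly-once delivery has to come from transactions or idempotent processing, not from the commit alone.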
My best guess: you have some consumer/broker config mismatch with what the producer has set, and the records are being dropped somewhere in the network. E.g. the producer is allowed to send records > 1 MB, and the broker can store those too, but your consumer is just using default fetch settings. See the question on max message/fetch bytes: How can I send large messages with Kafka (over 15MB)?
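A sketch of the three size settings that have to agree (the 15 MB figure is taken from the linked question; values are illustrative, not recommendations):

```properties
# Producer: permitted request size
max.request.size=15728640
# Broker (or per-topic max.message.bytes): must accept records that large
message.max.bytes=15728640
# Consumer: the defaults (~1 MB per partition) can stall on oversized
# records, so raise the fetch ceilings to match
max.partition.fetch.bytes=15728640
fetch.max.bytes=15728640
```

If the consumer's ceilings are left at defaults while producer and broker allow large records, the consumer side is where the pipeline appears to "hang".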
"monitoring, error handling, or retries, that I can implement"
All of the above, yes.
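For the retry/error-handling piece, one common pattern is to retry transient failures with backoff and park poison records in a dead-letter destination rather than blocking the partition. A minimal sketch; the function and parameter names here are my own, not a Kafka API:

```python
import time


def process_with_retries(record, handler, max_attempts=3, base_delay=0.01,
                         dead_letters=None):
    """Retry transient failures with exponential backoff; after the final
    attempt, park the record in a dead-letter list instead of raising."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(record)
        except Exception:
            if attempt == max_attempts:
                if dead_letters is not None:
                    dead_letters.append(record)
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))


# A handler that fails twice, then succeeds: the wrapper absorbs the failures.
calls = {"n": 0}
def flaky(record):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return record.upper()

dlq = []
assert process_with_retries("msg", flaky, dead_letters=dlq) == "MSG"
assert dlq == []
# A handler that always fails ends up in the dead-letter list, not a crash.
assert process_with_retries("bad", lambda r: 1 / 0, dead_letters=dlq) is None
assert dlq == ["bad"]
```

In a real consumer you would commit the offset only after `handler` succeeds or the record is routed to the dead-letter topic, and export the retry/DLQ counts to your monitoring system.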
"want to ensure that these unprocessed messages are handled correctly and prevent message loss or duplication"
- replication factor >= 3 + min.insync.replicas >= 2 (replication factor minus 1)
- producer acks=all
- producer transactions enabled and used in code, even for single records (producer idempotence, which transactions build on, has been the default for several recent versions)
- make consumer processing idempotent, so re-delivered records are handled safely
- Regularly test for network faults between clients and brokers.
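Pulling those settings together, a sketch of the end-to-end durability configuration (the `transactional.id` is a hypothetical name you would choose per application instance):

```properties
# Topic/broker side (assuming replication.factor=3 at topic creation)
min.insync.replicas=2
# Producer side
acks=all
enable.idempotence=true
transactional.id=my-app-tx-1       # hypothetical; required to use transactions
# Consumer side: only read records from committed transactions
isolation.level=read_committed
```

With `acks=all` plus `min.insync.replicas=2`, a write is acknowledged only once a majority of replicas have it; transactions plus `read_committed` keep consumers from seeing aborted writes; idempotent processing on the consumer side covers the duplicates that re-delivery can still produce.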