Currently I am running a cluster of 3 Kafka machines. Two of them are hosted in the same data center and the third is in a different one. At the end of the post I will include the node property file and JVM params.
Recently I migrated my cluster from ZooKeeper to KRaft. The cluster is working properly and Kafka is accessible 100% of the time, but I am worried about things that have appeared in the logs since the migration.
It is hard to find any information on whether these are harmless or whether they affect cluster performance in any significant way. I assume they are related to internet connection hiccups between the nodes, but I would like to know whether this is normal and, if it is not, how I can minimize or even eliminate these issues.
The first thing is the quorum leader being set to none.
This can happen when a node gets disconnected for some reason, or when the candidate itself experiences some sort of “metadata event”. The former appears at random moments a few times a week. The latter can be logged multiple times per hour, but mostly on the two machines hosted in the same data center (I can't find any traces of that “metadata event” in the logs beyond the fact that it happened).
First case:
[2024-05-20 09:06:09,859] INFO [QuorumController id=1] In the new epoch 13004, the leader is (none). (org.apache.kafka.controller.QuorumController)
[2024-05-20 09:06:09,998] INFO [BrokerToControllerChannelManager id=1 name=heartbeat] Client requested disconnect from node 2 (org.apache.kafka.clients.NetworkClient)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Completed transition to Unattached(epoch=13005, voters=[1, 2, 3], electionTimeoutMs=11) from Unattached(epoch=13004, voters=[1, 2, 3], electionTimeoutMs=628) (org.apache.kafka.raft.QuorumState)
[2024-05-20 09:06:10,449] INFO [RaftManager id=1] Vote request VoteRequestData(clusterId='ba92tKAvQY2zT-PzieD7sA', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=13005, candidateId=3, lastOffsetEpoch=13001, lastOffset=5515535)])]) with epoch 13005 is rejected (org.apache.kafka.raft.KafkaRaftClient)
[2024-05-20 09:06:10,449] INFO [QuorumController id=1] In the new epoch 13005, the leader is (none). (org.apache.kafka.controller.QuorumController)
Second case:
[2024-05-20 09:06:09,358] WARN [QuorumController id=2] Renouncing the leadership due to a metadata log event. We were the leader at epoch 13001, but in the new epoch 13002, the leader is (none). Reverting to last stable offset 5515581. (org.apache.kafka.controller.QuorumController)
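For context, this is how I have been inspecting the quorum when these elections happen. `kafka-metadata-quorum.sh` ships with Kafka 3.3+; the bootstrap address below is a placeholder for one of my nodes, and the commands need a running cluster:

```shell
# Current leader, leader epoch, high watermark and voter set
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# Per-voter replication details: log end offset, lag, last fetch
# and last caught-up timestamps (useful for spotting a slow voter)
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication
```

On my cluster the `--replication` output is what shows whether the remote-DC node is lagging behind the other two voters.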
Another thing is partitions being marked as failed. I assume these are related to a node not having caught up to the current epoch state (I may well be wrong). It happens for all or almost all topics on individual nodes. (I have 3 nodes with replication factor 3.)
[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-40 marked as failed (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition enrichment_topology_2-0 has an older epoch (67) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,589] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition enrichment_topology_2-0 marked as failed (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-36 has an older epoch (67) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-36 marked as failed (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-4 has an older epoch (67) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaFetcherThread)
[2024-05-19 04:02:03,590] WARN [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Partition __consumer_offsets-4 marked as failed (kafka.server.ReplicaFetcherThread)
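When this happens I check whether replication actually fell behind or whether the fetchers recovered on their own. `--under-replicated-partitions` and `--under-min-isr-partitions` are standard `kafka-topics.sh` flags; the bootstrap address is again a placeholder:

```shell
# Partitions whose ISR is smaller than the replica set
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Partitions that dropped below min.insync.replicas (=2 in my setup)
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-min-isr-partitions
```

So far both commands come back empty a short while after the WARNs, which is part of why I suspect transient network hiccups rather than a persistent problem.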
The last thing I noticed is the ZK migration state log entry. I assume it is harmless, but I am confused as to why it is logged at WARN level.
[2024-05-20 09:06:10,477] WARN [QuorumController id=1] Performing controller activation. Loaded ZK migration state of NONE. (org.apache.kafka.controller.QuorumController)
Those are all the worries I have regarding my current Kafka cluster.
I would really appreciate it if someone could tell me whether this is intended behavior or not.
If it is not, I would appreciate pointers on how to debug Kafka better to spot where the issue lies.
Thanks in advance for all your help!
My Kafka heap options are as follows:
KAFKA_HEAP_OPTS=-Xmx6g -Xms6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+ExplicitGCInvokesConcurrent
This is my node property file (I excluded entries that are most likely not useful, like logging level etc.):
process.roles=broker,controller
quorum.type=raft
inter.broker.listener.name=PLAINTEXT
# (the 3rd node needs its IP explicitly stated here, since the data center resolves its host name strangely)
advertised.listeners=PLAINTEXT://:9092
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
metadata.replication.factor=3
log.message.format.version=3.4
num.partitions=1
default.replication.factor=3
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
ssl.cipher.suites=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
ssl.enabled.protocols=TLSv1.2
ssl.protocol=TLSv1.2
ssl.endpoint.identification.algorithm=HTTPS
broker.id=(1|2|3)
controller.quorum.voters=xxx
cluster.id=ba92tKAvQY2zT-PzieD7sA
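In case it matters: one idea I have been considering is raising the Raft timeouts to better tolerate the cross-DC latency, along these lines (the defaults are 1000 ms and 2000 ms respectively; the values below are just my guess, not something I have validated):

```
controller.quorum.election.timeout.ms=3000
controller.quorum.fetch.timeout.ms=6000
```

Would that be a reasonable way to reduce the spurious elections, or does it just mask the underlying issue?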