Our Kafka cluster always enters an abnormal state after running for about two days. Messages get stuck on some partitions of a topic: followers cannot replicate them and consumers cannot consume them. Sometimes producers cannot send messages to those partitions either.
Env:
- jdk: 11.0.9.11-4
- os: CentOS 7 (aarch64)
- kafka: kafka_2.12-2.8.0
- kafka cluster: three brokers (ids 1, 2, 3)
The brokers' server.log:
Broker1:
<code>[2024-05-28 04:22:20,955] ERROR Error while appending records to biz-crm-8 in dir /data/kafka-2.8.0/kafka-logs (kafka.server.LogDirFailureChannel)
java.io.IOException: no space left on device
</code>
The JVM exited because the disk was full.
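Until the retention/disk problem is fixed, we are thinking about a simple free-space guard in our own monitoring agent so we get alerted before the log dir fills up again. A minimal sketch; the directory path and the 10% threshold are just placeholders for our environment:
<code>import java.io.File;

// Minimal free-space check for the Kafka log directory (monitoring-side sketch).
// The directory path and the 10% threshold are placeholders, not actual settings.
public class LogDirSpaceCheck {
    public static void main(String[] args) {
        File logDir = new File("/data/kafka-2.8.0/kafka-logs");
        long usable = logDir.getUsableSpace();  // bytes this process can still write
        long total = logDir.getTotalSpace();    // total size of the underlying volume
        double freeRatio = (total == 0) ? 0.0 : (double) usable / total;
        System.out.printf("usable=%d bytes, total=%d bytes, free=%.1f%%%n",
                usable, total, freeRatio * 100);
        if (freeRatio < 0.10) {
            // In a real agent this would fire an alert instead of just printing.
            System.err.println("WARNING: less than 10% free space on the Kafka log dir");
        }
    }
}
</code>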
Broker2:
<code>[2024-05-28 04:22:22,156] ERROR [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=1] Error for partition biz-data-8 at offset 4 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.KafkaStorageException: Disk error when trying to access log file on the disk.
....
[2024-05-28 04:22:22,900] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=1] Error sending fetch request (sessionId=2016142534, epoch=5627796) to node 1: (org.apache.kafka.clients.FetchSessionHandler)
java.io.IOException: Connection to 1 was disconnected before the response was read
...
[2024-05-28 04:22:38,031] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions HashSet(rule-mq-1, t-alert-rule-1, t
[2024-05-28 04:22:38,290] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions HashSet(multistage-data-8, tm-geoLoc
[2024-05-28 04:22:38,546] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2024-05-28 04:22:38,546] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherTh
[2024-05-28 04:22:38,553] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=1] Shutting down (kafka.server.ReplicaFetcherThread)
[2024-05-28 04:22:38,553] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=1] Stopped (kafka.server.ReplicaFetcherThread)
[2024-05-28 04:22:38,553] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=1] Shutdown completed (kafka.server.ReplicaFetcherTh
[2024-05-28 04:24:47,834] INFO [Log partition=rule-kafka-5, dir=/data/kafka-2.8.0/kafka-logs] Deleting segment LogSegment(baseOffset=8
[2024-05-28 04:24:47,835] INFO [Log partition=rule-kafka-5, dir=/data/kafka-2.8.0/kafka-logs] Incremented log start offset to 9692530
[2024-05-28 04:25:47,836] INFO [Log partition=rule-kafka-5, dir=/data/kafka-2.8.0/kafka-logs] Deleting segment files LogSegment(baseOf
</code>
Broker3 is basically the same.
Then we restarted Broker1 at 09:00.
Broker1:
<code>[2024-05-28 09:20:44,550] ERROR [Log partition=multistage-data-1, dir=/data/kafka-2.8.0/kafka-logs] Could not find offset index file corresponding to log file /data/kafka-2.8.0/kafka-logs/biz-stage-1/00000000000000000063.log, recovering segment and rebuilding index files... (kafka.log.Log)
[2024-05-28 09:27:44,380] WARN [Log partition=tm-property-1, dir=/data/kafka-2.8.0/kafka-logs] Corruption found in segment 9978054, truncating to offset 10719933 (kafka.log.Log)
...
[2024-05-28 09:28:06,705] INFO Registered broker 1 at path /brokers/ids/1 with addresses: PLAINTEXT://10.12.11.77:9092, czxid (broker epoch): 30064809294 (kafka.zk.KafkaZkClient)
...
[2024-05-28 09:32:54,710] ERROR [ReplicaManager broker=1] Error processing append operation on partition biz-crm-3 (kafka.server.ReplicaManager)
java.lang.ArrayIndexOutOfBoundsException: -1
at sun.nio.ch.NativeThreadSet.remove(NativeThreadSet.java:76)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:215)
at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:89)
at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
at kafka.log.LogSegment.append(LogSegment.scala:157)
...
[2024-05-28 09:34:41,221] ERROR [ReplicaManager broker=1] Error processing fetch with max size 1015821 from consumer on partition biz-crm-3: PartitionData(fetchOffset=14871332, logStartOffset=-1, maxBytes=1048576, currentLeaderEpoch=Optional.empty, lastFetchedEpoch=Optional.empty) (kafka.server.ReplicaManager)
java.lang.ArrayIndexOutOfBoundsException
[2024-05-28 10:08:07,491] ERROR [ReplicaManager broker=1] Error processing append operation on partition biz-crm-6 (kafka.server.ReplicaManager)
org.apache.kafka.common.InvalidRecordException: Inconsistent batch offset range [0, 16] and count of records 18 in topic partition biz-crm-6.
...
[2024-05-28 10:59:24,799] ERROR [ReplicaManager broker=1] Error processing append operation on partition biz-crm-3 (kafka.server.ReplicaManager)
org.apache.kafka.common.InvalidRecordException: Inconsistent batch offset range [0, 12] and count of records 15 in topic partition biz-crm-3.
</code>
Broker2:
<code>[2024-05-28 10:13:22,816] ERROR [ReplicaManager broker=2] Error processing append operation on partition biz-crm-7 (kafka.server.ReplicaManager)
org.apache.kafka.common.InvalidRecordException: Inconsistent batch offset range [0, 13] and count
</code>
After restarting, Broker1 entered an inconsistent state and kept printing this exception:
<code>java.lang.ArrayIndexOutOfBoundsException: -1
at sun.nio.ch.NativeThreadSet.remove(NativeThreadSet.java:76)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:215)
at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:89)
at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
at kafka.log.LogSegment.append(LogSegment.scala:157)
</code>
I have reviewed the source code of FileChannelImpl and NativeThreadSet:
<code>public int write(ByteBuffer src) throws IOException {
    ensureOpen();
    if (!writable)
        throw new NonWritableChannelException();
    synchronized (positionLock) {
        if (direct)
            Util.checkChannelPositionAligned(position(), alignment);
        int n = 0;
        int ti = -1;
        try {
            begin();
            ti = threads.add();
            if (!isOpen())
                return 0;
            do {
                n = IOUtil.write(fd, src, -1, direct, alignment, nd);
            } while ((n == IOStatus.INTERRUPTED) && isOpen());
            return IOStatus.normalize(n);
        } finally {
            threads.remove(ti);
            end(n > 0);
            assert IOStatus.check(n);
        }
    }
}
</code>
The only way threads.remove(-1) can be called here is if threads.add() returned -1 (ti is initialized to -1 and is only overwritten by add()'s return value), but as far as I can tell add() should never return -1:
<code>int add() {
    long th = NativeThread.current();
    // 0 and -1 are treated as placeholders, not real thread handles
    if (th == 0)
        th = -1;
    synchronized (this) {
        int start = 0;
        if (used >= elts.length) {
            int on = elts.length;
            int nn = on * 2;
            long[] nelts = new long[nn];
            System.arraycopy(elts, 0, nelts, 0, on);
            elts = nelts;
            start = on;
        }
        for (int i = start; i < elts.length; i++) {
            if (elts[i] == 0) {
                elts[i] = th;
                used++;
                return i;
            }
        }
        assert false;
        return -1;
    }
}
</code>
Then I reviewed a heap dump of Broker1 and found that NativeThreadSet.used had a negative value.
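If used really is negative, then add() can no longer grow the array (used >= elts.length stays false) and, once every slot is occupied, it falls through the loop and returns -1 because assertions are disabled; write() then hands that -1 to remove(), which (judging from the stack trace) indexes elts with it. A minimal standalone model of that bookkeeping (not the real JDK class; the initial values just mimic the corrupted state from the heap dump) reproduces the same exception:
<code>// Simplified model of NativeThreadSet's add()/remove() bookkeeping -- NOT the JDK class,
// just a sketch that starts from the corrupted state observed in the heap dump.
public class NativeThreadSetModel {
    private long[] elts = { 11L, 22L }; // pretend both slots already hold thread handles
    private int used = -1;              // corrupted counter, as seen in the heap dump

    int add() {
        long th = 42L; // stand-in for NativeThread.current()
        synchronized (this) {
            // used (-1) is never >= elts.length (2), so the array is never grown
            if (used >= elts.length) {
                long[] nelts = new long[elts.length * 2];
                System.arraycopy(elts, 0, nelts, 0, elts.length);
                elts = nelts;
            }
            for (int i = 0; i < elts.length; i++) {
                if (elts[i] == 0) { // never true: every slot is occupied
                    elts[i] = th;
                    used++;
                    return i;
                }
            }
            assert false;   // skipped because assertions are disabled by default
            return -1;      // this is the value write() later passes to remove()
        }
    }

    void remove(int i) {
        synchronized (this) {
            elts[i] = 0;    // i == -1 -> ArrayIndexOutOfBoundsException for index -1
            used--;
        }
    }

    public static void main(String[] args) {
        NativeThreadSetModel set = new NativeThreadSetModel();
        int ti = set.add(); // returns -1: no free slot found and no growth happened
        set.remove(ti);     // throws ArrayIndexOutOfBoundsException, like the broker log
    }
}
</code>
So a corrupted used field would explain the -1, but it does not explain how used went negative in the first place.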
I really don't know what could have happened to cause this.