While testing whether quorum queues work well with our existing infrastructure, I noticed that our current setup is not as resilient to broker failures/restarts as I had previously assumed (NB: the problem also occurs with mirrored classic queues; I verified that after noticing the problem).
For context: We are using the Bitnami chart for our RabbitMQ deployment (see: https://github.com/bitnami/charts/tree/main/bitnami/rabbitmq), with three replicas. We opted for a Service of type LoadBalancer, so that external connections to the broker are possible. This means my application uses spring.rabbitmq.addresses=<LOADBALANCER IP>.
Now, when the broker is restarted (e.g. by issuing kubectl rollout restart statefulset/rabbitmq, or by draining nodes for maintenance) and my application is connected to one of the restarting pods while publishing messages, I observe some rather unexpected behavior: depending on various factors, I lose up to 20,000 messages out of a million. I then, of course, tried to understand the problem and find solutions, but I feel quite stuck at the moment.
What I really want:
On temporary broker connection errors, like the ones that happen during a restart, I want to make sure all my messages are delivered to the broker after the connection is restored.
If that is not possible I would want to get a log message for each message that could not be delivered.
For reference, I built a small project to reproduce my “current” problem; you can find it at https://github.com/Linus9000/spring-amqp-demo/tree/master
Now, what did I actually find and learn?
First: My code publishing a message looked something like this:
package com.example.rabbitmqdemo;

import lombok.extern.apachecommons.CommonsLog;
import org.springframework.amqp.core.Message;
import org.springframework.amqp.core.MessageDeliveryMode;
import org.springframework.amqp.core.MessageProperties;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/")
@CommonsLog
public class RabbitController {

    private final RabbitTemplate rabbitTemplate;

    public RabbitController(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    @GetMapping
    public ResponseEntity<String> index() {
        // publish one million small messages
        for (int i = 0; i < 1_000_000; i++) {
            try {
                this.sendMessage("myexchange", "myrouting", String.valueOf(i));
            } catch (Exception e) {
                log.error("Could not send message", e);
            }
        }
        return ResponseEntity.noContent().build();
    }

    private void sendMessage(String exchange, String routingKey, Object content) {
        MessageProperties properties = new MessageProperties();
        // persistent, so messages should survive a broker restart
        properties.setDeliveryMode(MessageDeliveryMode.PERSISTENT);
        Message message = new Message(content.toString().getBytes(), properties);
        this.rabbitTemplate.convertAndSend(exchange, routingKey, message);
    }
}
Now, when I restart the broker while I am in the for loop (i.e. waiting for my controller call to finish) I get one(!) error log:
2024-06-18T09:15:30.370+02:00 ERROR 25988 --- [rabbitmq-demo] [nio-8080-exec-1] c.example.rabbitmqdemo.RabbitController : Could not send message
org.springframework.amqp.AmqpIOException: java.net.SocketException: Connection reset by peer
Caused by: java.net.SocketException: Connection reset by peer
But when I check the broker (via the management UI) I see only 959,372 messages in my queue. WTF happened here?
Naturally, this concerns me. After some digging I find the spring.rabbitmq.template.retry configs, leading to:
spring:
  rabbitmq:
    template:
      retry:
        enabled: true
        max-attempts: 100
        multiplier: 2
        max-interval: 120000
Do note the quite absurd values here.
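As far as I understand, these properties simply put a Spring Retry RetryTemplate on the auto-configured RabbitTemplate, so each (blocking) send is retried with exponential backoff. Roughly this, done programmatically (a sketch based on my reading of the Boot docs; the 1-second initial interval is Boot's default as far as I know):

import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.retry.backoff.ExponentialBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

// Rough programmatic equivalent of the spring.rabbitmq.template.retry.* properties above.
void configureRetry(RabbitTemplate rabbitTemplate) {
    SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy();
    retryPolicy.setMaxAttempts(100);             // max-attempts: 100

    ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
    backOff.setInitialInterval(1_000);           // initial-interval: Boot default of 1s (assumption)
    backOff.setMultiplier(2.0);                  // multiplier: 2
    backOff.setMaxInterval(120_000);             // max-interval: 120000

    RetryTemplate retryTemplate = new RetryTemplate();
    retryTemplate.setRetryPolicy(retryPolicy);
    retryTemplate.setBackOffPolicy(backOff);
    rabbitTemplate.setRetryTemplate(retryTemplate);
}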
Unfortunately, this only leads to more missing messages; I’m down to 909,813 messages that actually make it into the queue.
So, next experiment. How about spring.rabbitmq.template.mandatory?
spring:
  rabbitmq:
    template:
      mandatory: true
      retry:
        enabled: true
        max-attempts: 100
        multiplier: 2
        max-interval: 120000
Still, only 947,848 messages.
Okay then, time for the next experiment: spring.rabbitmq.publisher-confirm-type=correlated. Due to performance penalties I now only try to put 150,000 messages into the queue:
spring:
  rabbitmq:
    publisher-confirm-type: correlated
    template:
      mandatory: true
      retry:
        enabled: true
        max-attempts: 100
        multiplier: 2
        max-interval: 120000
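If I read the Spring AMQP docs correctly, the intended way to use correlated confirms is to attach a CorrelationData to each send and register a confirm callback that reports acks/nacks asynchronously. Something like this inside my RabbitController (just a sketch, not part of the demo project; I'm simply using the payload as the correlation id):

    public RabbitController(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
        // Register once; called asynchronously with an ack/nack for every publish.
        this.rabbitTemplate.setConfirmCallback((correlationData, ack, cause) -> {
            if (!ack) {
                log.error("Broker nacked message "
                        + (correlationData != null ? correlationData.getId() : "?")
                        + ": " + cause);
            }
        });
    }

    // Attach a CorrelationData to every send so the callback can identify the message.
    // (needs import org.springframework.amqp.rabbit.connection.CorrelationData)
    private void sendMessage(String exchange, String routingKey, Object content) {
        MessageProperties properties = new MessageProperties();
        properties.setDeliveryMode(MessageDeliveryMode.PERSISTENT);
        Message message = new Message(content.toString().getBytes(), properties);
        this.rabbitTemplate.convertAndSend(exchange, routingKey, message,
                new CorrelationData(content.toString()));
    }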
Now we’re getting somewhere: 149,970 messages in the queue. Only 30 messages missing! Time for the last puzzle piece, spring.rabbitmq.publisher-returns:
spring:
  rabbitmq:
    publisher-confirm-type: correlated
    publisher-returns: true
    template:
      mandatory: true
      retry:
        enabled: true
        max-attempts: 100
        multiplier: 2
        max-interval: 120000
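My understanding is that publisher-returns together with mandatory means the broker sends unroutable messages back to the client, and I would need a returns callback to actually see them. A sketch of what I would register, e.g. right next to the confirm callback in the constructor (setReturnsCallback is the newer Spring AMQP API, if I remember correctly; older versions had setReturnCallback instead):

        // Called for every message the broker returns as unroutable.
        // (needs import org.springframework.amqp.core.ReturnedMessage)
        this.rabbitTemplate.setReturnsCallback((ReturnedMessage returned) ->
                log.error("Message returned by broker: replyText=" + returned.getReplyText()
                        + ", exchange=" + returned.getExchange()
                        + ", routingKey=" + returned.getRoutingKey()));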
Now, we get additional error logs:
2024-06-18T09:39:53.980+02:00 ERROR 22904 --- [rabbitmq-demo] [nio-8080-exec-1] o.s.a.r.c.CachingConnectionFactory : Could not configure the channel to receive publisher confirms
java.io.IOException: null
Caused by: com.rabbitmq.client.ShutdownSignalException: connection error
Caused by: java.net.SocketException: Connection reset
And also:
2024-06-18T09:39:54.198+02:00 ERROR 22904 --- [rabbitmq-demo] [nio-8080-exec-1] c.example.rabbitmqdemo.RabbitController : Could not send message
org.springframework.amqp.AmqpException: PublisherCallbackChannel is closed
And still, only 149,972 messages.
Maybe it's wrong that I catch the exception when sending? Maybe the retry config already handles that?
Let’s try!
    @GetMapping
    public ResponseEntity<String> index() {
        for (int i = 0; i < 150_000; i++) {
            this.sendMessage("myexchange", "myrouting", String.valueOf(i));
        }
        return ResponseEntity.noContent().build();
    }
Now it just stops after getting disconnected once, which in this case means only 30,524 messages made it into the queue.