I have designed an Artemis (v2.19.1) symmetric cluster with 3 live-backup pairs to avoid the split-brain issue. Due to some constraints, I had to adopt replication instead of shared storage. I have also grouped each live-backup pair as group1, group2 and group3 using .

Recently I faced a production issue where the live broker from group-2 went down with a Java heap memory issue. The log showed the following errors, in sequence:
ERROR AMQ224088 : Timeout (10 seconds) on acceptor during protocol handshake with /xxx.x.xxx.xx:xxxx has occurred.
WARN AMQ222196 : Could not find binding with id=3,979,507,184 on routeFromCluster for message=CoreMessage[……..
ERROR AMQ224013 : failed to expire messages for queue: xxxx
WARN AMQ212037 : Connection Failure to /xxx.xx.xxx.xx has been detected: Java heap Space [code=GENERIC_EXCEPTION]
ERROR AMQ224016 : Caught exception: ActiveMQIllegalStateException[errortype=ILLEGAL_STATE message=AMQ229027: Could not find reference on consumer ID=0, messageId = 3,979,701,033 queue= xxxxxx-xxxx-xxxx-xxxx]
ERROR AMQ229028: Consumer 0 doesn’t exist on the server
(... continues in the same sequence for around 1 minute)
Finally:
16:02 ERROR There is a possible split brain on nodeID xxxxxxxxx. Topology update ignored.
Then the backup from group-2 became live, but it could not process messages. Its logs are as follows:
16:02 INFO Initiating quorum vote
16:02 INFO Received all quorum votes
16:02 INFO Failing over based on quorum vote results
16:02 INFO Server is now live
.
.
.
16:03 INFO Auto removing Queue xxxxxxx
16:17 WARN Connection failure to xxx.x.xxx.xx
16:19 Auto removing address
Meanwhile, the Artemis cluster was restarted at around 16:43.
17:02 INFO AMQ221000: backup Message broker is starting with configuration Broker Configuration.
My questions are:
- Why is there no logging in the backup server's logs about data replication? Or did replication not happen at all from 16:02 until 16:43?
- How does replication actually happen, and if the master is already out of the network, how can data be replicated to the backup?
- Why were other queues and consumers connected to the live nodes from group1 and group3 also impacted? They were supposed to keep working as expected. Can we configure specific queues to connect to a specific node?
Since my application needs immediate failover, and the backup is expected to start processing the stuck messages immediately or within a short time, what would be a better design?
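
For context, the replication setup described at the top is shaped roughly like the sketch below. This is an illustrative fragment, not the production broker.xml: it assumes the pairs are tied together with `group-name` (the mechanism name is missing from the text above), and the values shown, including `check-for-live-server`, `vote-on-replication-failure` and `allow-failback`, are assumptions used only to show where such settings live.

```xml
<!-- broker.xml of the live broker in group2 (illustrative values only) -->
<ha-policy>
  <replication>
    <master>
      <!-- assumed: pairs a live broker with the backup sharing this name -->
      <group-name>group2</group-name>
      <!-- assumed: on restart, check whether a backup already took over -->
      <check-for-live-server>true</check-for-live-server>
      <!-- assumed: ask the cluster for a quorum vote if replication is lost -->
      <vote-on-replication-failure>true</vote-on-replication-failure>
    </master>
  </replication>
</ha-policy>

<!-- broker.xml of the backup broker in group2 (separate file, illustrative) -->
<ha-policy>
  <replication>
    <slave>
      <!-- assumed: must match the live broker's group-name -->
      <group-name>group2</group-name>
      <!-- assumed: hand control back when the original live returns -->
      <allow-failback>true</allow-failback>
    </slave>
  </replication>
</ha-policy>
```

group1 and group3 would follow the same pattern with their own `group-name` values.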