Hi everyone.
I am using AWS ElastiCache Redis OSS (cluster mode) with 3 shards, and each shard has 3 nodes, making a total of 9 nodes.
I’m also using the go-redis library in Golang to connect to the Redis cluster.
The issue I’m facing is that I keep encountering MOVED errors frequently.
Here’s a snippet of the code I’m using for the initial setup:
option := new(redis.ClusterOptions)
option.Addrs = []string{"qekc.clustercfg.usw1.cache.amazonaws.com:7480"} // configuration endpoint
option.Username = config.Username
option.Password = config.Password
option.ReadTimeout = time.Duration(config.Timeout) * time.Second
option.PoolSize = config.PoolSize
option.MaxRetries = 10
option.MaxRedirects = 20
option.ReadOnly = false
option.RouteByLatency = false
option.RouteRandomly = false
cluster.client = redis.NewClusterClient(option)
// Verify connectivity once at startup; only the error is needed here.
if pingErr := cluster.client.Ping(ctx).Err(); pingErr != nil {
    log.Error().Msgf("ping err: %v", pingErr)
}
This is my initial setup code.
From what I’ve researched online, I’ve tried adding all the node endpoints in the Addrs field, and also tried using just a single node endpoint. But the issue still persists.
When I reduce the MaxRedirects to a very low value, the frequency of MOVED errors seems to increase, which suggests that it’s related to the redirection logic.
As far as I understand, when using ClusterClient, MOVED errors can occur when nodes are shuffled in Redis, but the client should automatically find the correct node and complete the request. Therefore, I would expect a MaxRedirects value of 20 to be more than sufficient, but I’m still seeing MOVED errors and failed data retrievals.
I’ve been stuck on this issue for days and would appreciate any help or insights into what might be going wrong. If anyone could point out what I might be missing, I would be very grateful.
Thank you!
MOVED errors occur when your cluster topology changes, which can happen for many reasons: failover, disconnection, scaling in or out, manual changes, role changes, and more.
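For reference, a MOVED reply carries the hash slot and the address of the node that now owns it, for example "MOVED 3999 127.0.0.1:6381". Here is a minimal sketch of pulling those two values out of the error text. parseMoved is a hypothetical helper I made up for illustration; go-redis does its own parsing internally, so you would only need something like this for logging or debugging.

```go
package main

import (
	"strconv"
	"strings"
)

// parseMoved extracts the slot and the target address from a Redis Cluster
// redirection error such as "MOVED 3999 127.0.0.1:6381".
// Hypothetical helper for illustration only, not part of go-redis.
func parseMoved(msg string) (int, string, bool) {
	fields := strings.Fields(msg)
	if len(fields) != 3 || fields[0] != "MOVED" {
		return 0, "", false
	}
	slot, err := strconv.Atoi(fields[1])
	if err != nil {
		return 0, "", false
	}
	return slot, fields[2], true
}
```

Logging the slot and target address of each MOVED you receive can show whether the redirects keep pointing at the same node, which would hint at a stale topology view on the client side.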
I've checked go-redis, and I think the logic behind its topology refresh (in their terminology, the "state") is not ideal, to put it kindly, and you can't even configure a periodic check.
Problems with detecting topology changes are common in almost all clients, and I deal with them daily, but it seems that go-redis doesn't trigger a refresh at all unless an error occurs, which is an extreme case.
If you see many MOVED errors, try calling ClusterClient.ReloadState() every 30 to 60 seconds. That should solve it in most cases.
Note that each call has some performance cost, but your cluster is small, so it is cheap for you, and with fewer redirects you will get much better performance overall.
There is no need to add more than a single endpoint; it is actually a bad idea.
The state is retrieved by sending CLUSTER SLOTS and getting the topology view from the server.
Using more than one endpoint would not help, and since node endpoints can change but the configuration endpoint does not, you should use only the configuration endpoint to avoid errors.
What might be happening with the MOVED redirects is that the handling is not built well, and the thread that refreshes the topology after a MOVED takes a while to update the others.
After a MOVED, it takes time for all the connections to become aware of the new cluster view.
So you keep getting MOVED on the next few commands, and lowering the redirect limit does not help; it just sends more commands to the wrong shard.
But this is just an assumption.
Feel free to reach out with more questions or follow-ups.