Let's assume I have a 3-node setup with a replication factor of 3 and a database of around 150 GB.
-
Then I scale it up to 6 nodes. I assume YugabyteDB will automatically rebalance all the replicas, correct?
-
Going forward, if I now scale it down to 3 nodes again, I assume the same would happen (I guess I can drain the nodes first). But what worries me is that later I have to scale it up again. What would be the preferred way for YugabyteDB?
-
Should the new nodes have a clean disk, or can they keep old stale data (say, from a few days or weeks ago)?
I ask this to understand whether we should just remove the machines' disks or PVCs, or keep them for later use.
Because depending on how this replication/verification is done, it might be too “expensive” to verify/sync the old data compared to just copying it again.
-
Is there an easy API/metric where I can see that the rebalance is “done”?
Then I scale it up to 6 nodes. I assume YugabyteDB will automatically rebalance all the replicas, correct?
Yes.
Going forward, if I now scale it down to 3 nodes again, the same would happen (I guess I can drain the nodes first).
Yes, but you need to do this gradually, not just kill 3 nodes at once, because you might take down all 3 replicas of a single tablet.
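A minimal sketch of draining one node at a time before stopping it, using the master blacklist. The addresses are placeholders; this assumes yb-admin is on the PATH, the masters listen on the default RPC port 7100, and the tserver being removed is node4:9100:

```shell
# Placeholder endpoints; substitute your real master and tserver addresses.
MASTERS="node1:7100,node2:7100,node3:7100"

# Blacklist the tserver so the load balancer moves its replicas elsewhere.
yb-admin -master_addresses "$MASTERS" change_blacklist ADD node4:9100

# Check how far the data move has progressed.
yb-admin -master_addresses "$MASTERS" get_load_move_completion

# Only stop the yb-tserver process on node4 once the move reports complete,
# then repeat for the next node.
```

This way, at no point are two replicas of the same tablet at risk at the same time.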
Should the new nodes have a clean disk, or can they keep old stale data (say, from a few days or weeks ago)?
Generally, the yb-tservers will need a clean bootstrap if the data is older than the WAL retention (15 minutes by default): https://docs.yugabyte.com/preview/reference/configuration/yb-tserver/#log-min-seconds-to-retain
After 15 minutes, beyond the WAL retention, the old replicas cannot catch up (because of log_min_seconds_to_retain). New replicas are bootstrapped on another node (because of follower_unavailable_considered_failed_sec). If the old nodes come back, they will be seen as new nodes, and some replicas will move to them to balance the load. After a few days or weeks, that's the expected behavior: keeping weeks of WAL and replaying it would not be efficient.
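For reference, the two timeouts involved are yb-tserver flags, both defaulting to 900 seconds (15 minutes). A sketch of the relevant entries as they would appear on the yb-tserver command line, shown with their default values:

```
--log_min_seconds_to_retain=900
--follower_unavailable_considered_failed_sec=900
```

Raising WAL retention only delays the problem; for multi-day outages a clean bootstrap is still the expected path.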
I ask this to understand whether we should just remove the machines' disks or PVCs, or keep them for later use.
Just removing them is cleaner in this case.
Because depending on how this replication/verification is done, it might be too “expensive” to verify/sync the old data compared to just copying it again.
It’s just network bandwidth, not expensive in any other way.
Let's assume we have a database of around 150 GB.
Just use clean disks, much less complicated.
You should follow the steps in https://docs.yugabyte.com/preview/manage/change-cluster-config/ (it covers both adding and removing nodes).
Is there an easy API/metric where I can see that the rebalance is “done”?
See get_load_move_completion in https://docs.yugabyte.com/preview/manage/change-cluster-config/
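A minimal polling sketch, with placeholder master addresses and yb-admin assumed on the PATH. get_load_move_completion reports progress of blacklist-driven moves; the exact output format varies by version, so adjust the grep accordingly:

```shell
MASTERS="node1:7100,node2:7100,node3:7100"

# Poll until the data move reports 100% complete
# (assumes "100" appears in the output only at completion).
until yb-admin -master_addresses "$MASTERS" get_load_move_completion \
    | grep -q '100'; do
  sleep 10
done

# If your version supports it, also confirm the load balancer is idle.
yb-admin -master_addresses "$MASTERS" get_is_load_balancer_idle
echo "rebalance done"
```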