With a clean install of Ceph version 18.2.4, I tried to add a new OSD to a HEALTH_OK cluster without triggering data movement.
Setting the “norebalance” flag did not prevent Ceph from immediately moving data to the new OSD.
Here is the trace of the commands I used:
<code># initial state: 3 hosts, 4 OSDs
$ ceph orch host ls
HOST ADDR LABELS STATUS
ceph01 192.168.101.10 _admin
ceph02 192.168.101.11 _admin
ceph03 192.168.101.12 _admin
3 hosts in cluster
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.02197 root default
-3 0.01099 host ceph01
0 hdd 0.00549 osd.0 up 1.00000 1.00000
1 hdd 0.00549 osd.1 up 1.00000 1.00000
-5 0.01099 host ceph02
2 hdd 0.00549 osd.2 up 1.00000 1.00000
3 hdd 0.00549 osd.3 up 1.00000 1.00000
$ ceph -s
cluster:
id: 1ae2e354-6b87-11ef-aa6e-738d3ffe6d7d
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 6m)
mgr: ceph01.mbuuym(active, since 7m), standbys: ceph03.acyoah
osd: 4 osds: 4 up (since 6m), 4 in (since 6m)
data:
pools: 2 pools, 33 pgs
objects: 111 objects, 404 MiB
usage: 1007 MiB used, 21 GiB / 22 GiB avail
pgs: 33 active+clean
$ ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.00549 1.00000 5.6 GiB 219 MiB 158 MiB 1 KiB 61 MiB 5.4 GiB 3.83 0.87 12 up
1 hdd 0.00549 1.00000 5.6 GiB 293 MiB 245 MiB 1 KiB 48 MiB 5.3 GiB 5.12 1.16 21 up
2 hdd 0.00549 1.00000 5.6 GiB 173 MiB 130 MiB 1 KiB 44 MiB 5.4 GiB 3.03 0.69 11 up
3 hdd 0.00549 1.00000 5.6 GiB 321 MiB 273 MiB 1 KiB 48 MiB 5.3 GiB 5.62 1.28 22 up
TOTAL 22 GiB 1007 MiB 806 MiB 6.2 KiB 201 MiB 21 GiB 4.40
MIN/MAX VAR: 0.69/1.28 STDDEV: 1.03
# adding a new OSD
$ ceph osd set norebalance
$ ceph orch daemon add osd ceph03:/dev/sdb
Created osd(s) 4 on host 'ceph03'
# final state: 3 hosts, 5 OSDs
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.02747 root default
-3 0.01099 host ceph01
0 hdd 0.00549 osd.0 up 1.00000 1.00000
1 hdd 0.00549 osd.1 up 1.00000 1.00000
-5 0.01099 host ceph02
2 hdd 0.00549 osd.2 up 1.00000 1.00000
3 hdd 0.00549 osd.3 up 1.00000 1.00000
-7 0.00549 host ceph03
4 hdd 0.00549 osd.4 up 1.00000 1.00000
$ ceph -s
cluster:
id: 1ae2e354-6b87-11ef-aa6e-738d3ffe6d7d
health: HEALTH_WARN
norebalance flag(s) set
Reduced data availability: 3 pgs inactive, 1 pg peering
services:
mon: 3 daemons, quorum ceph01,ceph03,ceph02 (age 11m)
mgr: ceph01.mbuuym(active, since 12m), standbys: ceph03.acyoah
osd: 5 osds: 5 up (since 7s), 5 in (since 19s)
flags norebalance
data:
pools: 2 pools, 33 pgs
objects: 104 objects, 378 MiB
usage: 988 MiB used, 27 GiB / 28 GiB avail
pgs: 9.091% pgs unknown
15.152% pgs not active
23 active+clean
4 peering
3 unknown
2 active+undersized+remapped
1 remapped+peering
$ sleep 5m
$ ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.00549 1.00000 5.6 GiB 182 MiB 121 MiB 1 KiB 61 MiB 5.4 GiB 3.19 0.87 9 up
1 hdd 0.00549 1.00000 5.6 GiB 221 MiB 173 MiB 1 KiB 48 MiB 5.4 GiB 3.86 1.05 14 up
2 hdd 0.00549 1.00000 5.6 GiB 163 MiB 102 MiB 1 KiB 61 MiB 5.4 GiB 2.85 0.78 8 up
3 hdd 0.00549 1.00000 5.6 GiB 246 MiB 198 MiB 1 KiB 48 MiB 5.3 GiB 4.30 1.17 18 up
4 hdd 0.00549 1.00000 5.6 GiB 240 MiB 214 MiB 1 KiB 26 MiB 5.4 GiB 4.20 1.14 17 up
TOTAL 28 GiB 1.0 GiB 808 MiB 7.8 KiB 245 MiB 27 GiB 3.68
MIN/MAX VAR: 0.78/1.17 STDDEV: 0.57
# new OSD #4 is already filled despite the "norebalance" flag
</code>
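For completeness, here is the sequence I had expected to need in order to keep data from moving. This is an untested sketch, not part of the trace above: the extra “nobackfill”/“norecover” flags and the “osd_crush_initial_weight” setting are my own assumptions about how to hold back data movement, and the final reweight value simply mirrors the weight of the other OSDs.
<code># untested sketch (my assumption, not something I ran): hold back all data movement
$ ceph osd set norebalance
$ ceph osd set nobackfill
$ ceph osd set norecover
# assumption: have new OSDs join with zero CRUSH weight so no PGs map to them yet
$ ceph config set osd osd_crush_initial_weight 0
$ ceph orch daemon add osd ceph03:/dev/sdb
# later, bring the new OSD in gradually and clear the flags
$ ceph osd crush reweight osd.4 0.00549
$ ceph osd unset norecover
$ ceph osd unset nobackfill
$ ceph osd unset norebalance
</code>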
I’m particularly interested in understanding what happens in Ceph when a new
OSD is added to a healthy cluster, and why the “norebalance” flag doesn’t prevent
data movement in this situation.
Here are a few questions I’d like answered:
- what “events” (rebalancing, backfilling, recovering…) are triggered when adding a new OSD to a healthy cluster?
- why are some PGs in the “undersized+remapped” state just after adding the new OSD, whereas they were not “undersized” before?
- what does “rebalancing” really mean in Ceph? In which situations does it occur?
- why doesn’t the “norebalance” flag prevent data movement when adding a new OSD to a healthy cluster?
- in which situations is the “norebalance” flag useful?