This table summarizes UTF-8’s variable-width encoding scheme.
After noticing that not all the available 2-byte encodings are used, I extended the table as shown below, and saw that this was also the case for the 3-byte and 4-byte encodings (delta > 0).

| number of bytes | bits available for encodings | number of available encodings | number of codepoints | delta = # avail encodings - # codepoints |
|-----------------|------------------------------|-------------------------------|-----------------------------------------------------|-------------------------------------------|
| 1 | 7 | 2^7 = 128 | 0x80 - 0x00 = 8*16 = 128 | 128 - 128 = 0 |
| 2 | 5+1*6 = 11 | 2^11 = 2,048 | 0x800 - 0x080 = 8*16^2 - 8*16 = 1,920 | 2,048 - 1,920 = 128 |
| 3 | 4+2*6 = 16 | 2^16 = 65,536 | 0x1_0000 - 0x800 = 16^4 - 8*16^2 = 63,488 | 65,536 - 63,488 = 2,048 |
| 4 | 3+3*6 = 21 | 2^21 = 2,097,152 | 0x11_0000 - 0x1_0000 = 0x10_0000 = 16^5 = 1,048,576 | 2,097,152 - 1,048,576 = 1,048,576 |
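
To double-check the arithmetic in the table, here is a minimal Python sketch that recomputes each row from the payload bit counts and the code point ranges. The names `ROWS` and `summarize` are just illustrative helpers of mine, and the ranges are the same ones used in the table (so, like the table, they simply count every value in each range). It also asks Python's built-in UTF-8 encoder for the byte length at each range boundary, as a sanity check that the ranges are right.

```python
# Each row: (byte count, payload bits, first code point, one past the last code point).
# The ranges mirror the table above; ROWS and summarize are illustrative names.
ROWS = [
    (1, 7, 0x00, 0x80),
    (2, 5 + 1 * 6, 0x80, 0x800),
    (3, 4 + 2 * 6, 0x800, 0x1_0000),
    (4, 3 + 3 * 6, 0x1_0000, 0x11_0000),
]


def summarize(rows=ROWS):
    """Return (bytes, bits, available encodings, code points, delta) per row."""
    result = []
    for n_bytes, bits, start, stop in rows:
        available = 2 ** bits      # bit patterns the payload bits can hold
        code_points = stop - start  # code points assigned to this length
        result.append((n_bytes, bits, available, code_points, available - code_points))
    return result


if __name__ == "__main__":
    for n_bytes, bits, available, code_points, delta in summarize():
        print(f"{n_bytes} byte(s): {bits} bits, {available:,} encodings, "
              f"{code_points:,} code points, delta = {delta:,}")

    # Sanity check: the first and last code point of each range should
    # encode to the expected number of bytes.
    for n_bytes, _, start, stop in ROWS:
        assert len(chr(start).encode("utf-8")) == n_bytes
        assert len(chr(stop - 1).encode("utf-8")) == n_bytes
```

Running it reproduces the deltas 0, 128, 2,048 and 1,048,576.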
Is there a reason that so many encodings are unused?