There are many questions on this site regarding Unicode and wchar_t. I thought I had grasped the concept, but then found something that, if true, proves most (if not all) answers wrong. On this page, Microsoft claims that one wchar_t character can hold any Unicode character (emphasis mine):
A wide character is a 2-byte multilingual character code. Any character in use in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification as a wide character. Developed and maintained by a large consortium that includes Microsoft, the Unicode standard is now widely accepted.
A wide character is of type wchar_t. A wide-character string is represented as a wchar_t[] array. You point to the array with a wchar_t* pointer.
Since this statement is from Microsoft directly, I am quite worried now:
How can a “two-byte multilingual character code”, which has only 65,536 distinct values, hold any character of the Unicode character set, which already contains around 150,000 assigned characters? [ Plus, if we take into account the private-use code points, surrogates, reserved code points, and so on, it would be over 1,000,000 code points. ]
I hope that this question is not a duplicate because its core is that Microsoft itself states something which seems to be plain wrong, and I really would like to know what I have misunderstood specifically on the Microsoft page I have linked.
By the way, there is also this page, which contradicts the first one and may be the one telling the truth:
Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as one or two 16-bit values.
So obviously, we sometimes need two wchar_t values (4 bytes) to represent a single Unicode code point. Well, that would make sense somehow, but given the contradictory documentation, I am completely unsure now.
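To make the second statement concrete, here is a minimal sketch of how UTF-16 turns one code point into one or two 16-bit values. It is only an illustration under my own assumptions: the function name encode_utf16 is hypothetical, and I use char16_t instead of wchar_t so that the code units are 16 bits wide on every platform (on Windows, wchar_t is 16 bits as well).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Code points below U+10000 fit into one unit; everything above needs a
// surrogate pair (two units).
std::u16string encode_utf16(char32_t cp) {
    // Only valid scalar values: at most U+10FFFF and not a surrogate itself.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;                                               // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

With this sketch, U+0041 ("A") yields one code unit, while U+1F600 yields the two units 0xD83D and 0xDE00.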
If somebody is interested in how that question originated:
In one of my projects, I have a string in which the character at a certain fixed position must be replaced by another character. This happens in a loop and must be done as fast as possible. This is a no-brainer with normal char[] strings. But the string in question is of type wchar_t[], and I don’t have control over the replacement characters.
Depending on which of the above Microsoft statements is true, this is either a no-brainer as well (if the first statement is true) or quite a mess (if the second statement is true): in that case I could not just replace the wchar_t at the respective index with the replacement character, because the original character might occupy one wchar_t while the replacement might need two, or vice versa.
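If the second statement is the correct one, the replacement would have to look roughly like the following sketch. This is only my assumption of what would be needed, not a tested solution: replace_code_point and the two surrogate helpers are hypothetical names, and I again use std::u16string in place of a wchar_t buffer so that the units are 16 bits regardless of platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers: classify a UTF-16 code unit.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Replace the code point that starts at code-unit index `pos` with `cp`.
// Returns the number of code units the replacement occupies (1 or 2).
std::size_t replace_code_point(std::u16string& s, std::size_t pos, char32_t cp) {
    assert(pos < s.size());
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    // How many units does the existing code point occupy?
    std::size_t old_len = 1;
    if (is_high_surrogate(s[pos]) && pos + 1 < s.size() &&
        is_low_surrogate(s[pos + 1])) {
        old_len = 2;
    }

    // Encode the replacement as one unit or as a surrogate pair.
    char16_t buf[2];
    std::size_t new_len;
    if (cp < 0x10000) {
        buf[0] = static_cast<char16_t>(cp);
        new_len = 1;
    } else {
        const char32_t v = cp - 0x10000;
        buf[0] = static_cast<char16_t>(0xD800 + (v >> 10));
        buf[1] = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
        new_len = 2;
    }

    // When old_len != new_len, the tail of the string has to be shifted.
    s.replace(pos, old_len, buf, new_len);
    return new_len;
}
```

When both the old and the new character occupy the same number of units, this is still a plain overwrite; only the mixed case forces the rest of the string to move, which is exactly the mess I would like to avoid in a tight loop.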
That’s why I’d like to know which documentation is true.