I’m working on a cross-platform C++ project that currently doesn’t handle Unicode, and I need to change it to support Unicode.
There are two options, and I need to decide between them:
- Use UTF-8 (std::string), which makes it easy to support POSIX systems.
- Use UTF-32 (std::wstring), which makes it easy to call the Windows API.
For option #1 (UTF-8), the benefit is that few code changes are needed. My concern is that some basic assumptions break with UTF-8, for example:
- string.size() will no longer equal the number of characters.
- Searching for a ‘/’ in a path will be hard to implement (I’m not 100% sure).
Does anyone have more experience with this? Which one should I choose?
Use UTF-8. string.size() won’t equal the number of code points, but that is mostly a useless metric anyway. In almost all cases, you should care either about the number of user-perceived characters/glyphs (and for that, UTF-32 fails just as badly), or about the number of bytes of storage used (for this, UTF-32 offers no advantage and uses more bytes to boot).
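To make the three counts concrete, here is a minimal sketch of counting code points in a std::string. It assumes the string already holds valid UTF-8, and count_code_points is a hypothetical helper name, not something from the standard library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// assumed to contain valid UTF-8 (no validation is performed).
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does NOT match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "caf\xC3\xA9" (café with a precomposed é), size() is 5 and the code point count is 4. For "e\xCC\x81" (e followed by a combining acute accent), size() is 3 and the code point count is 2, yet the user sees a single character. That mismatch is why the code point count rarely answers the question you actually have, in any encoding.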
Searching for an ASCII character, such as /, will actually be easier than with other encodings, because you can simply use any byte/ASCII-based search routine (even old C strstr if your strings are 0-terminated). UTF-8 is designed such that all ASCII characters use the same byte representation in UTF-8, and no non-ASCII character shares any byte with any ASCII character.
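As an illustration of the point about byte-based searching, the following sketch splits a UTF-8 path on / using nothing but std::string::find. The function name split_path and the choice to skip empty components are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 encoded path on '/' with a plain
// byte search. This is safe because the byte 0x2F ('/') can never occur
// inside a multi-byte UTF-8 sequence. Empty components (from a leading,
// trailing, or doubled '/') are skipped.
std::vector<std::string> split_path(const std::string& utf8_path) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start <= utf8_path.size()) {
        std::string::size_type slash = utf8_path.find('/', start);
        if (slash == std::string::npos) {
            slash = utf8_path.size();  // last component runs to the end
        }
        if (slash > start) {
            parts.push_back(utf8_path.substr(start, slash - start));
        }
        start = slash + 1;
    }
    return parts;
}
```

Calling split_path on "/home/\xE6\x97\xA5\xE6\x9C\xAC/caf\xC3\xA9.txt" yields three components, and the non-ASCII ones come back with their bytes intact because the search never lands in the middle of a multi-byte sequence.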
The Windows API uses UTF-16 (on Windows, std::wstring holds 16-bit wchar_t units, so it is UTF-16 there, not UTF-32), and UTF-16 doesn’t offer string.size() == code_point_count either. It also shares all downsides of UTF-32, more or less. Furthermore, making the application handle Unicode probably won’t be as simple as making all strings UTF-{8,16,32}; good Unicode support can require some tricky logic like normalizing text, handling silly code points well (this can become a security issue for some applications), making string manipulations such as slicing and iteration work with glyphs or code points instead of bytes, etc.
There are more reasons to use UTF-8 (and reasons not to use UTF-{16,32}) than I can reasonably describe here. Please refer to the UTF-8 manifesto if you need more convincing.