When is it beneficial to use encodings other than UTF-8? Aside from dealing with pre-Unicode documents, that is. And more importantly, why isn’t UTF-8 the default in most languages? That is, why do I often need to set it explicitly?
8
For an external encoding (i.e., an encoding of things not inside your program) it is very hard to beat UTF-8; it supports every character your users might ever reasonably need, and it is widely supported by operating systems and tools. (The one exception is file names, where you must use the platform’s conventions if you want any kind of interoperability at all. Fortunately, many platforms now use UTF-8 for file names, so the warning is moot there.)
For an internal encoding, things are more complex. The issue is that a character in UTF-8 is not a constant number of bytes, which makes all sorts of operations rather more complex than you might hope. In particular, indexing into the string by character (a very common operation when doing string processing!) changes from an O(1) operation into an O(N) operation, and that can be a very significant performance issue. There are a number of possible workarounds, such as using a rope data structure or converting the string into a fixed-width character format (typically ASCII, ISO 8859-1, UTF-16 or UTF-32, depending on the maximum Unicode value of the characters in the string). The problems that plague such formats (limited character support and/or endianness issues) don’t actually apply here, because you only pick a format that can represent every character actually in the string, and the encoding never leaves your program.
Don’t think that you can get away with storing that internal encoding to disk or giving it to another program. It might be “convenient” but it’s a problem waiting to happen; send/store the data as UTF-8.
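As a concrete illustration of both points, here is a minimal Python sketch (the language and the choice of UTF-32 as the fixed-width internal buffer are just illustrative, one of the options mentioned above): raw UTF-8 bytes cannot be indexed by character position, a fixed-width buffer can, and the text goes back to UTF-8 at the boundary.

```python
# Raw UTF-8 bytes cannot be indexed by character position.
utf8_bytes = "naïve café".encode("utf-8")   # 10 characters, 12 bytes
print(utf8_bytes[3])                        # 175: a continuation byte of 'ï', not a character

# A fixed-width internal form (here UTF-32, 4 bytes per character) restores O(1) indexing.
utf32 = utf8_bytes.decode("utf-8").encode("utf-32-le")

def char_at(buf: bytes, i: int) -> str:
    """Return character i of a little-endian UTF-32 buffer in O(1)."""
    return buf[4 * i : 4 * i + 4].decode("utf-32-le")

print(char_at(utf32, 2))    # 'ï'
print(char_at(utf32, 9))    # 'é'

# At the boundary (files, sockets, other programs) convert back to UTF-8.
with open("out.txt", "w", encoding="utf-8") as f:
    f.write(utf32.decode("utf-32-le"))
```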
And don’t forget that there’s a lot of legacy data out there, far too much to dismiss. Of particular concern are various East Asian languages, which have complex encodings that can be quite a bit shorter than UTF-8, resulting in less pressure to convert; but there are many other issues lurking even in Western systems. (I don’t want to know what is happening in major bank databases…)
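For example (a Python sketch; the exact savings depend on the text), common Japanese characters take two bytes in Shift_JIS but three in UTF-8:

```python
text = "日本語のテキスト"                  # 8 Japanese characters

print(len(text.encode("shift_jis")))   # 16 bytes: 2 bytes per character
print(len(text.encode("utf-8")))       # 24 bytes: 3 bytes per character
```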
13
The answer is that UTF-8 is by far the best general-purpose data interchange encoding, and is almost mandatory if you are using any of the other protocols that build on it (mail, XML, HTML, etc.).
However, UTF-8 is a multi-byte encoding and relatively new, so there are lots of situations where it is a poor choice. Here are a few.
- Internal encoding in Windows/C/C++/C#/Java/ObjectiveC. These environments do not support UTF-8 internally (or any multibyte encoding); strings are ANSI, UCS-2 or UTF-16, depending on the environment (see the sketch after this list).
- Legacy code, especially C/C++. Strings are typically ANSI/ISO/UTF-16/UTF-32.
- Legacy data. There are vast mountains of textual data already encoded in some 8-bit format, including various code pages, JIS, etc.
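As a rough sketch of the first point (using Python only because it exposes the same codecs): those environments store strings as 16-bit code units, which is fixed-width only for the Basic Multilingual Plane; anything beyond it needs a surrogate pair, and none of it is UTF-8.

```python
bmp = "A"       # U+0041, inside the Basic Multilingual Plane
astral = "𝄞"    # U+1D11E, musical symbol G clef, outside the BMP

print(len(bmp.encode("utf-16-le")))     # 2 bytes: one 16-bit code unit
print(len(astral.encode("utf-16-le")))  # 4 bytes: a surrogate pair
print(len(astral.encode("utf-8")))      # 4 bytes in UTF-8 too, but laid out differently
```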
The remaining cases involve the use of text files. They will likely remain an issue as long as plain old text files remain popular. The point is that text files do not record their own encoding, so the reader and writer have to make assumptions. Yes, there is something called a Byte Order Mark, but it is neither required nor recommended for UTF-8 files, so any file containing 8-bit characters is of uncertain encoding.
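To see why the BOM doesn’t settle the matter, consider this Python sketch: when a BOM is present the utf-8-sig codec will strip it, but the very same bytes also decode "successfully" as an 8-bit code page, so a reader without out-of-band information can only guess.

```python
data = b"\xef\xbb\xbf" + "café".encode("utf-8")   # UTF-8 text preceded by a BOM

print(data.decode("utf-8-sig"))   # 'café'        -- BOM recognised and stripped
print(data.decode("utf-8"))       # '\ufeffcafé'  -- BOM kept as an extra character
print(data.decode("cp1252"))      # 'ï»¿cafÃ©'    -- decodes without error, but is wrong
```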
Here are some examples of text-file situations where there is little reason to allow or use UTF-8.
- Software tools. Things like sed, awk, tr, etc. may or may not work with UTF-8. It’s often easier not to try.
- Compilers. Most computer languages are defined in terms of 7-bit ASCII and read plain text files from disk, with special tricks for extended characters.
- Log files, simple protocols, embedded systems. Sometimes 7- or 8-bit ASCII is just the easiest.
- Not always needed. Most Western European languages can be encoded in code page 850 or 1252, with possible savings in space and in coding logic (see the sketch just below).
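A Python sketch of that last point, with an arbitrary German example string: in code page 1252 every character fits in one byte, while UTF-8 spends two bytes on each accented letter.

```python
text = "Fußgängerübergänge"             # 18 characters, 4 of them non-ASCII

print(len(text.encode("cp1252")))       # 18 bytes: one byte per character
print(len(text.encode("utf-8")))        # 22 bytes: ß, ä, ü each take two bytes
```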
I confidently expect that many of these will go away over time, but they are real reasons to avoid UTF-8 in certain situations until then.