When is it beneficial to use encodings other than UTF-8? Aside from dealing with pre-Unicode documents, that is. And more importantly, why isn’t UTF-8 the default in most languages? That is, why do I often need to set it explicitly?
8
For an external encoding (i.e., an encoding of things not inside your program) it is very hard to beat UTF-8; it supports every character your users might ever reasonably need, and it is widely supported by operating systems and tools. (The one exception is file names, where you must use the platform’s conventions if you want any kind of interoperability at all. Fortunately, many platforms now use UTF-8 for file names, so the warning is moot there.)
For an internal encoding, things are more complex. The issue is that a character in UTF-8 is not a constant number of bytes, which makes all sorts of operations rather more complex than you might hope. In particular, indexing into the string by character (a very common operation when doing string processing!) changes from an O(1) operation into an O(N) operation, and that can be a very significant performance issue. There are a number of possible workarounds, such as using a rope data structure or converting the string into a fixed-width character format (typically ASCII, ISO 8859-1, UTF-16 or UTF-32, depending on the maximum Unicode value of the characters in the string). The problems that plague such formats (limited character support and/or endianness issues) don’t actually apply here, because you only pick a format that can represent every character actually in the string, and the encoding never leaves your program.
Don’t think that you can get away with storing that internal encoding to disk or giving it to another program. It might be “convenient” but it’s a problem waiting to happen; send/store the data as UTF-8.
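As a concrete illustration of both points, here is a minimal Python sketch (the language and the choice of UTF-32 as the fixed-width internal buffer are just illustrative, one of the options mentioned above): raw UTF-8 bytes cannot be indexed by character position, a fixed-width buffer can, and the text goes back to UTF-8 at the boundary.

```python
# Raw UTF-8 bytes cannot be indexed by character position.
utf8_bytes = "naïve café".encode("utf-8")   # 10 characters, 12 bytes
print(utf8_bytes[3])                        # 175: a continuation byte of 'ï', not a character

# A fixed-width internal form (here UTF-32, 4 bytes per character) restores O(1) indexing.
utf32 = utf8_bytes.decode("utf-8").encode("utf-32-le")

def char_at(buf: bytes, i: int) -> str:
    """Return character i of a little-endian UTF-32 buffer in O(1)."""
    return buf[4 * i : 4 * i + 4].decode("utf-32-le")

print(char_at(utf32, 2))    # 'ï'
print(char_at(utf32, 9))    # 'é'

# At the boundary (files, sockets, other programs) convert back to UTF-8.
with open("out.txt", "w", encoding="utf-8") as f:
    f.write(utf32.decode("utf-32-le"))
```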
And don’t forget that there’s a lot of legacy data out there, far too much to dismiss. Of particular concern are various East Asian languages, which have complex encodings that can be quite a bit shorter than UTF-8, resulting in less pressure to convert; but there are many other issues lurking even in Western systems. (I don’t want to know what is happening in major bank databases…)
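For example (a Python sketch; the exact savings depend on the text), common Japanese characters take two bytes in Shift_JIS but three in UTF-8:

```python
text = "日本語のテキスト"                  # 8 Japanese characters

print(len(text.encode("shift_jis")))   # 16 bytes: 2 bytes per character
print(len(text.encode("utf-8")))       # 24 bytes: 3 bytes per character
```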
13
The answer is that UTF-8 is by far the best general-purpose data interchange encoding, and is almost mandatory if you are using any of the other protocols that build on it (mail, XML, HTML, etc.).
However, UTF-8 is a multi-byte encoding and relatively new, so there are lots of situations where it is a poor choice. Here are a few.
- Internal encoding in Windows/C/C++/C#/Java/ObjectiveC. These environments do not support UTF-8 internally (or any multibyte encoding); strings are ANSI, UCS-2 or UTF-16, depending on the environment (see the sketch after this list).
- Legacy code, especially C/C++. Strings are typically ANSI/ISO/UTF-16/UTF-32.
- Legacy data. There are vast mountains of textual data already encoded in some 8-bit format, including various code pages, JIS, etc.
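As a rough sketch of the first point (using Python only because it exposes the same codecs): those environments store strings as 16-bit code units, which is fixed-width only for the Basic Multilingual Plane; anything beyond it needs a surrogate pair, and none of it is UTF-8.

```python
bmp = "A"       # U+0041, inside the Basic Multilingual Plane
astral = "𝄞"    # U+1D11E, musical symbol G clef, outside the BMP

print(len(bmp.encode("utf-16-le")))     # 2 bytes: one 16-bit code unit
print(len(astral.encode("utf-16-le")))  # 4 bytes: a surrogate pair
print(len(astral.encode("utf-8")))      # 4 bytes in UTF-8 too, but laid out differently
```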
The remaining cases involve the use of text files. They will likely remain an issue as long as plain old text files remain popular. The point is that text files do not record their own encoding, so the reader and writer have to make assumptions. Yes, there is something called a Byte Order Mark, but it is neither required nor recommended for UTF-8 files, so any file containing 8-bit characters is of uncertain encoding.
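To see why the BOM doesn’t settle the matter, consider this Python sketch: when a BOM is present the utf-8-sig codec will strip it, but the very same bytes also decode "successfully" as an 8-bit code page, so a reader without out-of-band information can only guess.

```python
data = b"\xef\xbb\xbf" + "café".encode("utf-8")   # UTF-8 text preceded by a BOM

print(data.decode("utf-8-sig"))   # 'café'        -- BOM recognised and stripped
print(data.decode("utf-8"))       # '\ufeffcafé'  -- BOM kept as an extra character
print(data.decode("cp1252"))      # 'ï»¿cafÃ©'    -- decodes without error, but is wrong
```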
Here are some examples of text-file situations where there is little reason to allow or use UTF-8.
- Software tools. Things like sed, awk, tr, etc. may or may not work with UTF-8. It’s often easier not to try.
- Compilers. Most computer languages are defined in terms of 7-bit ASCII and read plain text files from disk, with special tricks for extended characters.
- Log files, simple protocols, embedded systems. Sometimes 7- or 8-bit ASCII is just the easiest.
- Not always needed. Most Western European languages can be encoded in code page 850 or 1252, with possible savings in space and in coding logic (see the sketch just below).
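A Python sketch of that last point, with an arbitrary German example string: in code page 1252 every character fits in one byte, while UTF-8 spends two bytes on each accented letter.

```python
text = "Fußgängerübergänge"             # 18 characters, 4 of them non-ASCII

print(len(text.encode("cp1252")))       # 18 bytes: one byte per character
print(len(text.encode("utf-8")))        # 22 bytes: ß, ä, ü each take two bytes
```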
I confidently expect that many of these will go away over time, but they are real reasons to avoid UTF-8 in certain situations until then.