Should Latin-1 be used over UTF-8 when it comes to database configuration?

We are using MySQL at the company I work for, and we build both client-facing and internal applications using Ruby on Rails.

When I started working here, I ran into a problem that I had never encountered before: the database on the production server is set to Latin-1, meaning that the MySQL gem throws an exception whenever a user copies and pastes UTF-8 characters into an input.

My boss calls these “bad characters” since most of them are non-printable characters, and says that we need to strip them out. I’ve found a few ways to do this, but eventually we’ve ended up in a circumstance where a UTF-8 character was needed. Plus it’s a bit of a hassle, especially since it seems like the only solution I ever read about for this issue is to just set the database to UTF-8 (makes sense to me).

The only argument that I’ve heard for sticking with Latin-1 is that allowing non-printable UTF-8 characters can mess up text/full-text searches in MySQL. Is this really true?

Are there other reasons one should use Latin-1 over UTF-8? It’s my understanding that it is superior and becoming more ubiquitous.

24

Unicode is certainly difficult, and the UTF-8 encoding has a couple of inconvenient properties. However, UTF-8 has become the de-facto standard encoding on the web, surpassing ASCII, Latin-1, UCS-2 and UTF-16. Just use UTF-8 everywhere.

The most important reason why you should support Unicode is that you shouldn’t make unnecessary assumptions about user input. I have no idea what your domain is, but things like Hebrew usernames, a blog post about China, a comment with Emoji, or simply well styled text – like “this” – should be possible… Oh, those were typographically correct quotation marks (“” rather than ""), en dashes, and an ellipsis, which are characters that are common in English text, but not supported by ASCII or Latin-1. Not supporting other scripts isn’t just a big f*ck you to other cultures; sticking to Latin-1 doesn’t even allow you to write proper English.
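If you want to see this for yourself, you can simply try to encode a few strings. This is a rough Python sketch, not anything from your Rails stack; the sample strings and the `fits` helper are made up for illustration. One caveat: Python's `iso-8859-1` codec is strict ISO 8859-1, while MySQL's `latin1` is closer to Windows-1252 (`cp1252`), which does include curly quotes and the ellipsis but still has no room for Hebrew or Emoji.

```python
# Hypothetical sample inputs, written with escapes so the file itself stays ASCII.
samples = {
    "plain": "this",
    "curly quotes": "\u201cthis\u201d",
    "ellipsis": "\u2026",
    "hebrew": "\u05e9\u05dc\u05d5\u05dd",
    "emoji": "\U0001F600",
}


def fits(text, codec):
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# For each sample: (fits strict ISO 8859-1, fits Windows-1252)
report = {
    name: (fits(text, "iso-8859-1"), fits(text, "cp1252"))
    for name, text in samples.items()
}

for name, result in report.items():
    print(name, result)
```

Either way, anything outside Western European text is rejected, which is the exception the question describes.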

The notion that Unicode only allows “bad characters” is wrong. Yes, text is really complicated, and Unicode won’t hide that from you. Your boss may be thinking about composed characters, where one base codepoint such as a is modified by subsequent codepoints (e.g. diacritics) to form one visual character such as á. This doesn’t really get in your way when searching if you do some kind of normalization. For example, you could store all text in the NFC form, which collapses such sequences into their precomposed form if one is available. When searching, you could also strip all combining characters from the text, but this may substantially change the meaning of words in some languages.
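To make the normalization point concrete, here is a small Python sketch using the standard `unicodedata` module. The `strip_marks` helper is a made-up name for this example, and whether stripping accents is acceptable depends entirely on the languages you need to support.

```python
import unicodedata

decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # the single codepoint for a-acute

# The two strings render the same but compare unequal until normalized.
same_before = decomposed == precomposed
nfc = unicodedata.normalize("NFC", decomposed)
same_after = nfc == precomposed


def strip_marks(text):
    """Decompose text (NFD) and drop combining marks, for accent-insensitive search keys."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


search_key = strip_marks("caf\u00e9")
print(same_before, same_after, search_key)
```

Normalizing to NFC on the way in keeps stored text comparable; a stripped key like the one above would only be used for matching, never as the stored value.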

Unicode also adds a lot of unprintable characters – but even ASCII has loads of them. Will you handle a NUL in the middle of a string? How about 0x1C, a “File Separator”? I’ve never seen half of those. Latin-1 adds a soft hyphen that indicates word break opportunities, but is otherwise invisible. Does that also break your full-text search? In other words, even ASCII and Latin-1 allow you to completely break your input if you assume it’s all just printable text!
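If invisible characters are the real worry, they can be filtered by Unicode category instead of by banning everything outside Latin-1. The following Python sketch is only one possible approach; `strip_invisible` and the sample string are invented for this example.

```python
import unicodedata

# ASCII alone already contains 33 control characters (0x00-0x1F plus 0x7F).
ascii_controls = [cp for cp in range(128) if unicodedata.category(chr(cp)) == "Cc"]


def strip_invisible(text, keep="\n\t"):
    """Drop characters in the Unicode "Other" categories (control, format, etc.).

    Caution: format characters include the zero-width joiner used in some
    scripts and Emoji sequences, so this filter is an assumption that suits
    plain Western text, not a universal rule.
    """
    return "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )


# Soft hyphen (Latin-1), NUL and File Separator (ASCII), plus curly quotes.
dirty = "co\u00adoperate\x00 \u201cquoted\u201d\x1c"
clean = strip_invisible(dirty)
print(len(ascii_controls), repr(clean))
```

Note that the curly quotes survive while the ASCII and Latin-1 invisibles are removed, so the filter targets the actual problem rather than the encoding.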

9

I think beyond the technical question, your boss may not have the time to keep up to date on current standards.

Since his stance is not completely out to lunch, just out-dated, respect his position when discussing this matter (and you need to remember to discuss, not argue), and try to work through concerns he has with regards to UTF-8. I suspect the underlying issue is not a technical issue and may require some level of soft-skill negotiation.

3

Which of us is right?

Once upon a time, your boss was. But as time goes by, things change. Nowadays, you are (but before running to your boss, be sure to read Nelson’s answer too).

Old versions of MySQL, and old versions of mostly everything, dealt much better with the older Latin1/ISO-8859-1(5) than UTF8.

There is a reason why UTF8 has been created, evolved, and pushed mostly everywhere: if properly implemented, it works much better. There are some performance and storage issues stemming from the fact that a Latin1 character is always 8 bits, while a UTF8 character may be anywhere from 8 to 32 bits long. So when planning VARCHAR you need to take this into account. And your search routines will be a tad slower. They will be able to do more things (e.g. searches with or without accent sensitivity, which you can’t do in Latin1 without extensive work), but they will take a bit more time.
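The 8-to-32-bit range is easy to check. This is a quick Python illustration with four arbitrarily chosen characters, one for each possible UTF-8 width.

```python
# One sample character per UTF-8 width: ASCII letter, e-acute, euro sign, Emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

# Number of bytes each character occupies once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in samples]

# The same e-acute takes a single byte in Latin-1.
latin1_width = len("\u00e9".encode("iso-8859-1"))

print(widths, latin1_width)
```

So plain ASCII text costs nothing extra, accented Latin letters cost one additional byte each, and only characters outside the Basic Multilingual Plane need the full four bytes.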

But on the other hand, storage is cheap, the realistic overhead on file sizes is less than 2-3%, computing power is also cheap and getting cheaper in good accord with Moore’s Law; while your time and your customers’ expectations definitely aren’t.

You might have to worry about search tools etc. if you were the one developing such tools. But you probably aren’t. You use those tools; even those that were not completely UTF8 compliant yesterday (as the earlier MySQLs weren’t) are today, or soon will be (e.g. MySQL with utf8mb4 support).

So by carefully planning and implementing UTF8 the right way (not slapping it over Latin1 as an afterthought) you can have code that is very reasonably future-proof, which, if you plan on ever doing business with any Asiatic country, is a Very Good Thing. And if you have no such plans, other people will have, and those people could be your customers, suppliers, or partners.

So when they start sending you UTF8 data, you’ll have to set up a complicated thingamajig to convert to and from Latin1, and deal with unsolvable cases.

When you factor into the budget the cost of several skirmishes against the evil mojibake ninjas, and consider that they are not going to go away – as you already discovered – then you’ll realize that going UTF8 is not only simpler, it’s going to be cheaper as well.

Restricting the character set to ASCII may make sense in a few situations: limited-choice fields (e.g. status fields), because you strictly control the values that can appear there, and foreign keys or references to external systems, because there is rarely any reason for them to contain anything but alphanumeric characters and a few symbols.

For any other texts, just use UTF-8.

4

To begin with the answer: it doesn’t matter how your server is configured. The character encoding in MySQL can be configured per column (meaning the same table can easily hold text in multiple encodings). For example, my server (and a number of legacy databases on it) is configured for cp1251 by default, for old clients that are unable to set the correct collation on connect (various hardware clients), but the main production databases all use UTF-8.

Speaking of “wasted space” – you can’t realistically call important data a waste, can you? The increase in storage space, however, will differ depending on the language your data is in: from an insignificant increase (less than 1%) if your site is primarily in English, up to 100% if it mainly uses characters outside the ASCII range, and even more if you move further east. MySQL’s later UTF-8 character set (utf8mb4) allows up to 4 bytes per code point.
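Those percentages can be estimated by comparing encoded sizes directly. This Python sketch is illustrative only: the `overhead` helper and the sample strings are made up, and real data with spaces, digits, and markup will land somewhere below the worst case.

```python
def overhead(text, legacy_codec):
    """Relative size increase when storing text as UTF-8 instead of legacy_codec."""
    legacy_size = len(text.encode(legacy_codec))
    utf8_size = len(text.encode("utf-8"))
    return (utf8_size - legacy_size) / legacy_size


english = overhead("hello world", "iso-8859-1")                 # pure ASCII text
russian = overhead("\u041f\u0440\u0438\u0432\u0435\u0442", "cp1251")  # Cyrillic letters only
chinese = overhead("\u4e2d\u6587", "gbk")                       # two CJK characters

print(english, russian, chinese)
```

ASCII-only text costs nothing, Cyrillic letters double from one byte to two, and the CJK sample grows from two bytes per character to three.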

And to “who’s right”… Truth is, this is a social question more than it is technical. There could be valid reasons for specific server setups, but you must know the implications. But if you ask me, there’s no reason to not use UTF-8. It’s the one kind to rule all texts in the world.

2

Just explain to him that UTF-8 is the default for web traffic, and any user can enter any valid Unicode character in their browser.

It’s just much easier to have UTF-8/Unicode all the way from front end to back end than to deal with the many and various issues that result from UTF-8 -> Latin-1 -> UTF-8.
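Here is a small Python sketch of what goes wrong in that round trip. The sample string is made up, and Python's strict `iso-8859-1` codec stands in for the Latin-1 layer, so treat it as an illustration of the failure modes rather than a reproduction of your stack.

```python
original = "caf\u00e9 \u201cmenu\u201d"   # e-acute plus curly quotes

# Failure mode 1: UTF-8 bytes misread as Latin-1 give mojibake.
utf8_bytes = original.encode("utf-8")
mojibake = utf8_bytes.decode("iso-8859-1")

# This kind of damage is reversible, but only if you know it happened.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

# Failure mode 2: forcing the text into Latin-1 silently loses characters.
lossy = original.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

print(repr(mojibake))
print(repaired == original, lossy)
```

The e-acute survives the forced conversion because Latin-1 has it, but the curly quotes come back as question marks, and no later step can recover them.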
