How to detect the encoding of a file?

On my filesystem (Windows 7) I have some text files (These are SQL script files, if that matters).

When opened with Notepad++, in the “Encoding” menu some of them are reported to have an encoding of “UCS-2 Little Endian” and some of “UTF-8 without BOM”.

What is the difference here? They all seem to be perfectly valid scripts. How can I tell which encoding each file has without Notepad++?

7

Files generally indicate their encoding with a file header. There are many examples here. However, even reading the header you can never be sure what encoding a file is really using.

For example, a file with the first three bytes 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.

Notepad++ does its best to guess what encoding a file is using, and most of the time it gets it right. Sometimes it does get it wrong though – that’s why that ‘Encoding’ menu is there, so you can override its best guess.

For the two encodings you mention:

  • The “UCS-2 Little Endian” files are UTF-16 files (based on what I understand from the info here) so probably start with 0xFF,0xFE as the first 2 bytes. From what I can tell, Notepad++ describes them as “UCS-2” since it doesn’t support certain facets of UTF-16.
  • The “UTF-8 without BOM” files don’t have any header bytes. That’s what the “without BOM” bit means.
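To make the header check concrete, here is a minimal Python sketch that looks for a BOM in the first bytes of a file. The function names `sniff_bom` and `sniff_file` are made up for illustration, and, as said above, a match is only a strong hint, not proof (for instance, FF FE 00 00 could also be a UTF-16 LE file whose first character is NUL).

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path: str):
    """Read only the first four bytes of a file and check them for a BOM."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))
```

A result of `None` corresponds to what Notepad++ would show as "without BOM"; it says nothing about which encoding the file actually uses.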

11

You cannot. If you could do that, there would not be so many websites or text files with “random gibberish” out there. That’s why the encoding is usually sent along with the payload as metadata.

In case it’s not, all you can do is a “smart guess” but the result is often ambiguous since the same byte sequence might be valid in several encodings.
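A small Python sketch of that ambiguity: the same two bytes decode without any error in several encodings, and only one of the results is what the author meant. The helper name `decode_all` is just for illustration.

```python
def decode_all(data: bytes, encodings):
    """Decode the same bytes under each encoding; skip encodings that reject them."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results


# The two bytes 0xC3 0xA9 are "é" in UTF-8, but they are also perfectly
# valid (if odd-looking) text in the 8-bit code pages.
sample = "é".encode("utf-8")
guesses = decode_all(sample, ["utf-8", "cp1252", "cp437"])
for name, text in guesses.items():
    print(name, repr(text))
```

All three decodes succeed, so nothing in the bytes themselves says which reading is right.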

5

The character encoding can generally not be determined completely. However, there are many hints:

  1. ASCII contains only bytes with values from 0x00 to 0x7F. It was originally a 7-bit encoding, and the byte values are simply zero-padded, so the highest bit is always zero;
  2. UTF-8 contains ASCII + additional bytes for which the highest bit is set, e.g. the three most significant bits may be set to 110 to indicate that two bytes are used instead of one to encode a character. UTF-8 may contain a Byte Order Mark (BOM), but usually it doesn’t. The encoding is identical to ASCII if only ASCII characters are present.
  3. UTF-16 is usually prefixed with a BOM, as it is a 16-bit encoding that may use either big- or little-endian byte order (the order of the bytes, not of the bits inside the bytes). Because ASCII characters occupy only the lowest 7 bits of each 16-bit unit, it is usually easy for humans to recognize, and easily distinguished using statistical heuristics as well.
  4. There are many, many 8 bit encoding schemes, such as Windows-1252 also known as CP-1252. This encoding extends ISO-8859-1 which encodes the Latin-1 character set. This by itself is a form of extended ASCII.

UTF-16 is generally easy to recognize due to the common BOM and many bytes set to zero – at least for Western languages that use Latin-1. UTF-8 usually doesn’t have a BOM, but the encoding scheme for additional characters is relatively easy to recognize.
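The UTF-8 bit patterns mentioned in the list can be read straight off the first byte of each character. A minimal Python sketch (the function name is made up, and it only looks at the bit pattern; it does not reject every byte that the UTF-8 specification forbids, such as 0xC0 or 0xF5 and above):

```python
def utf8_sequence_length(lead: int):
    """Return how many bytes a UTF-8 sequence starting with `lead` spans, or None."""
    if lead < 0x80:             # 0xxxxxxx: plain ASCII, one byte
        return 1
    if lead >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    return None                 # 10xxxxxx continuation byte, or not a lead byte
```

Every byte after the lead byte of a multi-byte sequence must look like 10xxxxxx, which is why random 8-bit text rarely passes as valid UTF-8.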

A text editor that sees only ASCII bytes will usually treat the file as UTF-8 (more and more the default) or Windows-1252. Sometimes applications and languages simply stick to the system default. Nowadays, however, many applications do not do this and default to UTF-8 for all text. It has been the common default on Linux and Android for a long time now.


For older systems, a system-specific code page was usually used. One of the more recognizable ones, at least for Westerners, is IBM code page 437, as it was used for text-based windowing systems and a lot of ANSI art (sometimes incorrectly called ASCII art), going back to the time of DOS. However, quite often these code pages are not easily recognizable, which is why ASCII art often doesn’t look good when a text file is opened: the editor quite often falls back to the system default, such as Windows-1252.

It is extremely uncommon, but sometimes other character encodings are used. Some of those are “dialects” of ASCII such as IA5 that are just slightly different. More commonly though they would be text files using a code page for another country, where the first 128 codes are ASCII compatible.

If you come across such an encoding then you could convert to UTF-8 which is generally recognized easily, and it contains all the possible characters from the various code pages.
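Once you know (or have decided on) the source code page, the conversion itself is short. A Python sketch, assuming the whole file fits in memory; `convert_to_utf8` is a hypothetical helper name and the encoding names are the ones Python's codec registry uses:

```python
def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    text = data.decode(source_encoding)  # raises UnicodeDecodeError on invalid input
    return text.encode("utf-8")


# "café" as stored by a DOS program (code page 437) and by a
# Windows-1252 program: different bytes, same UTF-8 result.
from_dos = convert_to_utf8(b"caf\x82", "cp437")
from_windows = convert_to_utf8(b"caf\xe9", "cp1252")
```

If the wrong source encoding is passed in, the call usually still succeeds and silently produces the wrong characters, so check the result by eye.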

Assume you have a file that is given to you just as a sequence of bytes, with no indication of the encoding, and you want to either determine an encoding consistent with the bytes or reject the file.

You can first check whether the bytes are consistent with an encoding. For example, many byte sequences are not valid ASCII, or valid UTF-8, or valid UTF-16 or UTF-32.
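This consistency check can be sketched in Python by attempting a strict decode with each candidate. The function name and the default candidate list are assumptions for illustration. Note that 8-bit encodings such as Latin-1 accept any byte sequence, so adding them to the list tells you nothing.

```python
def consistent_encodings(data: bytes,
                         candidates=("ascii", "utf-8", "utf-16", "utf-32")):
    """Return the candidate encodings under which `data` decodes without error."""
    consistent = []
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        consistent.append(name)
    return consistent
```

For example, `consistent_encodings(b"SELECT 1;")` leaves only ASCII and UTF-8, because nine bytes cannot be a whole number of 16-bit or 32-bit units. An empty result means the file should be rejected, at least for the candidates tried.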

And then you can check whether your data looks reasonable in some encoding. For example, lots of data might be valid in some Chinese encoding, but look like complete nonsense. That has to be done carefully. For example, base-85 encoded data looks like nonsense even though it is valid ASCII or UTF-8.

Note that UTF-16 is interesting. In practice you can almost always detect that a file is UTF-16. For example, the bytes in a file containing “hello” in UTF-16 consist of 5 ASCII bytes, each preceded or followed by a zero byte. But most pairs of two ASCII bytes are valid UTF-16, so many files with an even number of ASCII bytes could be UTF-16 with some very, very strange contents.
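A rough Python sketch of that zero-byte heuristic. The name `guess_utf16` and the 0.4 threshold are arbitrary choices for illustration, and the approach only works for text that is mostly ASCII or Latin-1 characters:

```python
def guess_utf16(data: bytes, threshold: float = 0.4):
    """Guess 'utf-16-le' or 'utf-16-be' from where the zero bytes fall, else None."""
    if len(data) < 2 or len(data) % 2:
        return None  # UTF-16 always has an even number of bytes
    pairs = len(data) // 2
    zeros_even = data[0::2].count(0) / pairs  # zero bytes at offsets 0, 2, 4, ...
    zeros_odd = data[1::2].count(0) / pairs   # zero bytes at offsets 1, 3, 5, ...
    if zeros_odd >= threshold > zeros_even:
        return "utf-16-le"  # low byte first, zero high byte second
    if zeros_even >= threshold > zeros_odd:
        return "utf-16-be"  # zero high byte first
    return None
```

Plain ASCII data returns `None` here even when its length is even, which matches the intuition above: it could technically be UTF-16, but nothing suggests it is.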

I tried to identify the encoding of three files that actually ended up being encrypted, without any headers, footers or checksums. chardet was no good, and hexadecimal comparison or string extraction produced nothing. What did work, however, was to assess the software that created the files.

But, for your situation, you would want to use a hexadecimal dump and a visual inspector. As Windows 7 has none, a batch script would work. You could perhaps use copy or pipe tools. Here’s one someone else made: https://stackoverflow.com/questions/27575910/how-to-convert-binary-to-hex-in-batch

This batch complements the other answers I think, as it’s an actual technique that would likely work on win7 (I haven’t tried it though!)
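If Python happens to be installed on the machine (an assumption, it does not ship with Windows 7), the same kind of visual inspection can be sketched in a few lines instead of batch. `hex_preview` is a made-up helper name:

```python
def hex_preview(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values and printable ASCII, one row per `width` bytes."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part.ljust(width * 3 - 1)}  {text_part}")
    return "\n".join(rows)
```

Calling it on the first few dozen bytes of a file (opened with `open(path, "rb")`) is enough to spot a leading `ff fe` or `ef bb bf`, or the alternating zero bytes typical of UTF-16.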
