The UTF-8 encoding allows, but does not require, a BOM at the beginning of a byte sequence. This seems to create a subtle ambiguity, because the BOM bytes are exactly the UTF-8 encoding of the Unicode character U+FEFF.
For example, what character string does the following UTF-8 byte sequence (in a hex format) represent?
EF, BB, BF, 42, 43, 44
It can represent the character string “BCD” (containing 3 characters), with the first 3 bytes (EF, BB, BF) regarded as the BOM sequence. This seems to be the usual interpretation.
However, it can also represent the character string “[U+FEFF]BCD” (containing 4 characters), with the first 3 bytes (EF, BB, BF) regarded not as a BOM but as an ordinary UTF-8 encoding of the Unicode character U+FEFF (ZERO WIDTH NO-BREAK SPACE).
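The two interpretations can be demonstrated with Python's standard codecs: the plain `utf-8` codec keeps a leading U+FEFF as an ordinary character, while the `utf-8-sig` codec treats it as a BOM and strips it. This is just an illustration of the ambiguity, not a claim about which interpretation is correct:

```python
# The byte sequence from the example above.
data = bytes([0xEF, 0xBB, 0xBF, 0x42, 0x43, 0x44])

# Plain 'utf-8': the leading EF BB BF decodes to the character U+FEFF.
with_bom = data.decode("utf-8")

# 'utf-8-sig': a leading EF BB BF is treated as a BOM and removed.
without_bom = data.decode("utf-8-sig")

print(len(with_bom), repr(with_bom))        # 4 '\ufeffBCD'
print(len(without_bom), repr(without_bom))  # 3 'BCD'
```

So the same bytes decode to either a 3-character or a 4-character string, depending on which convention the decoder follows.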
So how should this ambiguity be handled? Does UTF-8 have a rule that, if the byte sequence EF, BB, BF appears at the beginning of the whole byte sequence, it must be interpreted as a BOM rather than as an encoding of the character U+FEFF? If so, then UTF-8 would be unable to encode certain Unicode character strings, namely any string that starts with the character U+FEFF.
Other Unicode encodings, such as UTF-16, seem to have a similar problem.
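The UTF-16 case can be sketched the same way. In Python, the plain `utf-16` codec consumes a leading BOM to determine byte order, whereas the endianness-specific codecs (`utf-16-be`, `utf-16-le`) have no BOM concept and keep a leading U+FEFF as a character. Again, this only illustrates the ambiguity under Python's conventions:

```python
# Bytes FE FF could be a big-endian BOM, or the character U+FEFF in UTF-16BE.
data = b"\xfe\xff\x00B"

# 'utf-16': FE FF is read as a BOM (big-endian) and stripped.
as_utf16 = data.decode("utf-16")

# 'utf-16-be': FE FF is decoded as the ordinary character U+FEFF.
as_utf16_be = data.decode("utf-16-be")

print(repr(as_utf16))     # 'B'
print(repr(as_utf16_be))  # '\ufeffB'
```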