If I copy and paste some UTF-8 text (e.g. “Wands!”) into a TMemo, it displays as expected.
If I generate a string containing the 3 bytes (as characters) for '“' (i.e. 0xE2, 0x80, 0x9C) and use Memo1.Lines.Add(x), it displays as 'â' (0xE2 in extended ASCII), which it has stored as 0xC3, 0xA2 (UTF-8). The other two bytes of the string are stored as 0xC2, 0x80 and 0xC2, 0x9C.
How can I indicate that the string that I am adding already has UTF-8 multi-byte characters? And why is the string pasted into the Memo not treated the same way?
I am trying to process text extracted from ePub files. Originally the idea was to generate sort versions of text containing characters with diacritics by replacing them with the un-accented characters, but I ran into this problem of inconsistent displays.
TMemo (and more generally, TStrings) works with Delphi’s native string type only, which in Delphi 2009+ is a UTF-16 encoded UnicodeString.
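Since the Add() method in your case expects a normal UTF-16 UnicodeString, you can’t add UTF-8 encoded bytes using this method. Each byte you put into the string as a character is treated as a separate UTF-16 code unit (U+00E2, U+0080, U+009C), which is why you see 'â' followed by two invisible control characters.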
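To make the mis-encoding concrete, here is a minimal console sketch that builds the string the way the question describes and then re-encodes it as UTF-8. It assumes a Unicode Delphi with namespaced units (XE2 or later); the program name is made up for illustration.

```delphi
program Utf8Mojibake;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
  Encoded: TBytes;
  B: Byte;
begin
  // Each UTF-8 byte becomes its own UTF-16 code unit: U+00E2, U+0080, U+009C.
  S := Char($E2) + Char($80) + Char($9C);

  // Re-encoding the string as UTF-8 turns every code unit above $7F
  // into a two-byte sequence.
  Encoded := TEncoding.UTF8.GetBytes(S);

  // Print the resulting bytes in hex.
  for B in Encoded do
    Write(IntToHex(B, 2), ' ');
  Writeln;
end.
```

The six bytes printed should be the same ones reported in the question (C3 A2, C2 80, C2 9C), since each of the three characters is encoded on its own.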
If you have UTF-8 bytes in memory, you have to either:
- decode the UTF-8 first, such as with TEncoding.UTF8.GetString(), e.g.:
    Memo1.Lines.Add(TEncoding.UTF8.GetString(utf8Bytes));
- put the UTF-8 bytes into a UTF8String, which the RTL can decode into a UnicodeString, e.g.:
    var utf8Str: UTF8String;
    SetString(utf8Str, PAnsiChar(utf8Bytes), utf8Length);
    Memo1.Lines.Add(string(utf8Str));
As for why things work OK when copy/pasting, it is because the text is extracted from the clipboard as UTF-16 when pasted into TMemo. The copier has to choose whether to place text on the clipboard using either the ANSI (CF_TEXT) or UTF-16 (CF_UNICODETEXT) format (the clipboard doesn’t natively support UTF-8, but the copier can use CF_LOCALE to specify a locale when using CF_TEXT). The clipboard automatically converts the text to UTF-16 if it is not already in UTF-16.
Best practice is to convert data to/from UTF-16 at the boundaries where the data enters/leaves your app, and then operate only with UTF-16 in memory.
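One way to apply this at the boundary, sketched under the assumption that the ePub content arrives as a TStream of UTF-8 bytes (for example, after being extracted from the archive). The unit name and the two helper routines are hypothetical; TStreamReader and TEncoding are standard RTL classes.

```delphi
unit Utf8Boundary;

interface

uses
  System.SysUtils, System.Classes;

// Hypothetical helpers: decode on the way in, encode on the way out.
function ReadUtf8Stream(Source: TStream): string;
procedure WriteUtf8Stream(Dest: TStream; const Text: string);

implementation

function ReadUtf8Stream(Source: TStream): string;
var
  Reader: TStreamReader;
begin
  // The reader decodes UTF-8 into a UTF-16 string; it does not own Source.
  Reader := TStreamReader.Create(Source, TEncoding.UTF8);
  try
    Result := Reader.ReadToEnd;
  finally
    Reader.Free;
  end;
end;

procedure WriteUtf8Stream(Dest: TStream; const Text: string);
var
  Bytes: TBytes;
begin
  // Encode back to UTF-8 only when the data leaves the application.
  Bytes := TEncoding.UTF8.GetBytes(Text);
  if Length(Bytes) > 0 then
    Dest.WriteBuffer(Bytes[0], Length(Bytes));
end;

end.
```

With helpers like these, everything between reading and writing (including replacing accented characters and calling Memo1.Lines.Add()) works on ordinary UTF-16 strings. For files on disk, Memo1.Lines.LoadFromFile(FileName, TEncoding.UTF8) performs the same decoding in a single call.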