(I posted a similar problem here, but this new question is not a duplicate).
Small reproducible problem:
b = "L\xF6sé侍"
Here we have a string in which one of the bytes is illegal in UTF-8 (the byte with hex value F6). From Ruby's point of view, the encoding of the string is Encoding::UTF_8. Looking at the byte sequence, we can see
p b.bytes.to_a
=> [76, 246, 115, 195, 169, 228, 190, 141]
My goal is to remove from the string all bytes that are illegal in UTF-8. In this simple example, I want to end up with a string with content "Lsé侍".
I tried
c1 = b.encode('UTF-8', invalid: :replace, replace: '')
but c1 has the same content as b. Then I tried
b.force_encoding(Encoding::ASCII_UTF8)
c2 = b.encode('UTF-8', invalid: :replace, replace: '')
but this also erases the characters é and 侍, since they are not valid in ASCII.
I was also thinking of putting together a hard-coded list of the byte values that are invalid in UTF-8 and simply deleting them from the string, but this is ugly.
Any ideas how this can be done?
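For reference, here is a minimal sketch of one approach that might do what is described above. It assumes Ruby 2.1 or later, where String#scrub is available, and it assumes the source file is read as UTF-8 (the default since Ruby 2.0). The variable name c is only for illustration.

```ruby
# Sketch only: assumes Ruby >= 2.1 (String#scrub) and a UTF-8 source file.
b = "L\xF6sé侍"

# The string is tagged as UTF-8 but contains the illegal byte 0xF6.
p b.valid_encoding?

# scrub replaces every invalid byte sequence with the given string;
# passing an empty string drops the invalid bytes instead.
c = b.scrub('')

p c.valid_encoding?
p c.bytes
```

Unlike forcing the string to an ASCII encoding first, this keeps the string tagged as UTF-8 throughout, so the valid multi-byte characters é and 侍 should be left alone and only the stray byte is removed.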