I am reading files into Ruby strings, and these strings are later processed further (for instance, using the CSV module). The external encoding of the files is a parameter, and supposedly, the files to be processed should be of that specified encoding.
During reading, I convert the files from the supposed external encoding into UTF-8.
Occasionally, I get erroneous files which are encoded in a different way than the specified encoding.
If the encoding is merely wrong, my program will of course read garbage. But if the file also contains byte sequences which are illegal under the supposed encoding, I get an exception when processing the file.
The specification requires that byte sequences which cannot be decoded because of the incorrect encoding are simply removed from the input instead of causing the program to abort.
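To illustrate the intended behaviour, here is a minimal sketch. The sample string and the choice of US-ASCII as the declared encoding are assumptions made purely for illustration; they are not taken from my real data. The byte 0xE4 is not valid in US-ASCII, so transcoding to UTF-8 with an empty replacement should drop it:

```ruby
# Hypothetical input: the bytes are declared as US-ASCII, but 0xE4 is not a
# valid US-ASCII byte.
raw = "caf\xE4;42".dup.force_encoding(Encoding::US_ASCII)

# Transcode to UTF-8, replacing anything that cannot be decoded with "".
cleaned = raw.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')

puts cleaned           # caf;42
puts cleaned.encoding  # UTF-8
```

Here the source and target encodings differ, so a real transcoding step takes place and the offending byte disappears.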
To implement this, I am reading a file into a string like this:
UTF8_CONVERTER = ->(field) { field.encode('utf-8', invalid: :replace, undef: :replace, replace: "") }
read_flags = {
  external_encoding: ext_enc, # e.g. Encoding::ISO_8859_1
  internal_encoding: Encoding::UTF_8,
  converters: UTF8_CONVERTER
}
file_content = IO.read(file_path, read_flags)
IMO, this should make file_content a valid UTF-8 encoded string. If my program later decides that this string should be parsed as CSV, it invokes the CSV parser like this:
e_enc = file_content.encoding
i_enc = Encoding::UTF_8
...
csv_opt = { col_sep: ';', row_sep: :auto, external_encoding: e_enc, internal_encoding: i_enc}
CSV.foreach(file_content, csv_opt) { .... }
The reason I redundantly specify the encoding here as well is that the method which processes the CSV is general purpose and should also work if the strings have a different encoding.
However, this does not work:
If I am processing a file which is supposed to be UTF-8 (i.e. ext_enc equals Encoding::UTF_8), but which in reality was encoded in, for instance, Windows-1252, and it contains byte sequences which are illegal under UTF-8, CSV.foreach raises the exception ArgumentError: invalid byte sequence in UTF-8.
I conclude from this that my UTF8_CONVERTER did not remove the incorrect bytes.
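The symptom can be reproduced in isolation. The following is a minimal sketch, not my real program: the sample bytes and the temporary file are made up for illustration, and the converters entry is left out so that only the two encoding options are exercised. It reads Windows-1252 bytes while declaring both the external and the internal encoding as UTF-8, then checks the result with String#valid_encoding?:

```ruby
require 'tempfile'

# Hypothetical sample: "Grüße" encoded as Windows-1252, where "ü" is the
# single byte 0xFC and "ß" is 0xDF. Neither forms a valid UTF-8 sequence.
bytes = "Gr\xFC\xDFe;1\n".b

content = Tempfile.create('sample') do |f|
  f.binmode
  f.write(bytes)
  f.flush
  # Same shape as my read: external and internal encoding are both UTF-8.
  File.read(f.path, external_encoding: Encoding::UTF_8,
                    internal_encoding: Encoding::UTF_8)
end

puts content.encoding         # UTF-8
puts content.valid_encoding?  # false
```

So the string comes back tagged as UTF-8 but still contains the illegal bytes, which matches what the CSV parser later complains about.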
Can anybody see what I’m doing wrong here?