I have a messy CSV file where one of the city columns is half bytes and half UTF-8: the values are wrapped in double quotes and carry a b' prefix.
Example Row: column1,"b'\xc5\x81\xc3\xb3d\xc5\xba, Poland'", column3
Since it is already a string, I am unable to call .decode('utf-8') on it directly; I would have to encode it to bytes again, which creates a double encoding. By itself:
b'\xc5\x81\xc3\xb3d\xc5\xba, Poland'.decode('utf-8')
works in a Jupyter notebook and gives the correct result:
'Łódź, Poland'
When trying with:
column3.encode('utf-8').decode('utf-8')
the result is:
"b'Åx81ódź, Poland'"
How can I correctly decode this half-bytes / half-UTF-8 string? Splitting on and replacing the b' prefix and the surrounding quotes doesn't seem to work.
You mean your column's actual content is a Python bytes literal stored as a string? Not sure why in the world you'd have something like that, but the way to decode an actual string out of it is to parse/evaluate it first to get the bytes, and then decode the bytes. Something like this:
import ast

# The cell text is the repr of a bytes literal, so evaluate it back into bytes first.
wrapped_bytes = r"b'\xc5\x81\xc3\xb3d\xc5\xba, Poland'"
encoded_bytes = ast.literal_eval(wrapped_bytes)
decoded = encoded_bytes.decode('utf-8')
print(decoded)  # Łódź, Poland
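If a whole column needs the same treatment, here is a minimal sketch (assuming pandas is being used, the file has no header row, and the city text sits in the 2nd column; the file name input.csv is only illustrative):

import ast
import pandas as pd

df = pd.read_csv('input.csv', header=None)
# Evaluate each "b'...'" cell into bytes, then decode as UTF-8.
df[1] = df[1].map(lambda s: ast.literal_eval(s).decode('utf-8'))
print(df.head())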
Edit: just in case the wrapped bytes are actually UTF-8 character codes already, you can simply pull them out as, well, character codes. That said, the correct place to fix this is in whatever creates the offending CSV.
wrapped_bytes = "b'xc5x81xc3xb3dxc5xba, Poland'"
just_bytes = bytes(ord(c) for c in wrapped_bytes[2:-1])
str = just_bytes.decode()
print(str)
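If that is the situation, an equivalent shortcut is to encode the inner slice as Latin-1, since code points U+0000 through U+00FF map one-to-one to byte values (a sketch under the same assumption about the data):

wrapped_bytes = "b'\xc5\x81\xc3\xb3d\xc5\xba, Poland'"
print(wrapped_bytes[2:-1].encode('latin-1').decode('utf-8'))  # Łódź, Poland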
Given your example row in input.csv, the code below converts the 2nd column by evaluating it as a Python bytes literal, then decoding it as UTF-8:
import csv
import ast

with open('input.csv', newline='') as fin:
    reader = csv.reader(fin)
    for row in reader:
        # Turn the "b'...'" cell text back into bytes, then decode it as UTF-8.
        row[1] = ast.literal_eval(row[1]).decode()
        print(row)
Output:
['column1', 'Łódź, Poland', ' column3']
The incorrect CSV could have been written using the csv module as well, but someone encoded the city column:
import csv

with open('incorrect.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['column1', 'Łódź, Poland'.encode(), 'column3'])
incorrect.csv:
column1,"b'xc5x81xc3xb3dxc5xba, Poland'",column3
It should not be encoded but simply left as 'Łódź, Poland', and the file itself should be opened as UTF-8 to write it properly:
import csv

with open('correct.csv', 'w', encoding='utf8', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['column1', 'Łódź, Poland', 'column3'])
correct.csv (UTF-8-encoded):
column1,"Łódź, Poland",column3
As a side note, to read correct.csv from Excel, use encoding='utf-8-sig' to write the file instead. Excel assumes a localized encoding such as cp1252 unless the file is UTF-8 with a BOM; utf-8-sig writes a signature at the beginning of the file indicating UTF-8.
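For example, a minimal sketch (the file name correct_excel.csv is only illustrative):

import csv

# Same data as correct.csv, but written with a UTF-8 BOM so Excel detects the encoding.
with open('correct_excel.csv', 'w', encoding='utf-8-sig', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['column1', 'Łódź, Poland', 'column3'])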
All you need to do is:
- change the doubled backslashes to single ones,
- delete the 'b' at the beginning,
- remove the quote marks at the head and tail,
- convert the string to bytes with a raw encoding and then decode it back.
I think this could help:
column3 = "b'xc5x81xc3xb3dxc5xba, Poland'"
s_string = eval(repr(column3).replace('\\', '\').replace('b', '', 1)[1:-1])
result = s_string.encode('raw_unicode_escape').decode('utf-8')
print(result)
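If you'd rather avoid eval(), here is a sketch of the same idea using codec round-trips (assuming the cell really holds the literal \xNN text):

column3 = r"b'\xc5\x81\xc3\xb3d\xc5\xba, Poland'"
inner = column3[2:-1]  # strip the b' prefix and the trailing quote
# unicode_escape turns the \xNN text into characters; raw_unicode_escape maps them back to raw bytes.
chars = inner.encode('ascii').decode('unicode_escape')
print(chars.encode('raw_unicode_escape').decode('utf-8'))  # Łódź, Poland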