I want a Python script that reads a text file and makes a best guess at its character encoding using the chardetng library. I'm using the chardetng-py binding to the chardetng Rust library.
I've written the following Python script:
# Import sys module to allow processing of command-line arguments.
import sys
# Import the detect function from chardetng-py.
from chardetng_py import detect

# Open the file, read it, and analyze its character encoding.
def detect_encoding(file_name):
    try:
        with open(file_name, 'rb') as file:
            # Read the entire file as bytes.
            file_data = file.read()
            # Show how many bytes were read.
            print(file.tell(), "bytes read from file:", file_name)
            # Analyze the raw bytes.
            result = detect(file_data)
            # The with block closes the file automatically, so no explicit
            # close() is needed.
            return result
    except FileNotFoundError as e:
        print(f"Error: File not found: {e}")
        sys.exit(1)
    except IOError as e:
        print(f"Error reading file: {e}")
        sys.exit(1)

def main():
    if len(sys.argv) != 2:
        print("Usage: python3 charenc2.py <filename>")
        sys.exit(1)
    file_name = sys.argv[1]
    encoding = detect_encoding(file_name)
    # Show the result.
    print(f"The character encoding is most likely: {encoding}")

if __name__ == "__main__":
    main()
This almost always reports 'windows-1252', whereas chardet and chardetect (which came with Ubuntu Linux) get it right. I've tried it on the test files from:
https://github.com/stain/encoding-test-files
as well as some old subtitle files. Of the test files above, it only gets KOI8-R right; everything else is reported as windows-1252, even the utf8.txt file. These are small files, so perhaps chardetng needs more text than chardet and chardetect do?
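For comparison, my chardet check is essentially just this (a rough sketch of what I ran, from memory; only the utf8.txt file name is taken from the test set above):

    # Rough equivalent of the chardet check I used for comparison.
    import chardet

    with open('utf8.txt', 'rb') as file:
        data = file.read()

    # chardet.detect returns a dict with 'encoding' and 'confidence' keys.
    result = chardet.detect(data)
    print(result['encoding'], result['confidence'])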
Am I misusing chardetng-py, or is there perhaps a threshold setting that needs adjusting?
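Looking through the chardetng-py documentation, I also see a lower-level EncodingDetector class. If I'm reading it correctly (and I may not be), something like the sketch below should be equivalent to calling detect(), and the allow_utf8 and tld parameters look like the only knobs available, so maybe one of those is what I'm missing:

    # A sketch based on my reading of the chardetng-py docs; I have not
    # confirmed that EncodingDetector, feed() and guess() behave exactly
    # as shown here.
    from chardetng_py import EncodingDetector

    def guess_encoding(file_name):
        with open(file_name, 'rb') as file:
            data = file.read()
        detector = EncodingDetector()
        # Feed the whole buffer; last=True signals that no more data follows.
        detector.feed(data, last=True)
        # allow_utf8=True may matter for the utf8.txt case; tld apparently
        # biases the guess toward encodings common for that top-level domain.
        return detector.guess(tld=None, allow_utf8=True)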
Any advice?
Thanks