As I mentioned yesterday, I started with Python recently and decided to make a Telegram bot that uses a Markov chain. To train the Markov chain, I downloaded the chat history of my group of friends and wrote a script to filter out and keep only the messages from the group. Some errors appeared in the JSON, and since the file is very large (7193776 lines), correcting it manually is unfeasible, so I wrote a script that automates the correction. I handle exceptions in it, and it is reporting this decoding error:
It was not possible correct the JSON: Extra data: line 1 column 3 (char 2)
JSON decoding error on line: Extra data: line 1 column 7 (char 6)
I think it's because the JSON has more than one object per line. I'd like your opinion: what do you think it could be?
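To make the message concrete, here is a minimal sketch of when `json.loads` reports "Extra data": it happens whenever a complete JSON value is followed by more text. The helper name `describe_parse` and both sample strings are illustrative assumptions, not taken from the real file or script.

```python
import json

def describe_parse(text):
    """Return the parsed value, or the decoder's error message as a string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        return str(e)

# Two objects on one line: the first one parses, the rest is "extra data".
two_objects = '{"a": 1} {"b": 2}'
print(describe_parse(two_objects))

# A single line taken out of a pretty-printed object: the string "type"
# is a complete JSON value on its own, so the ': "message",' that follows
# it is also reported as "extra data".
single_line = '"type": "message",'
print(describe_parse(single_line))
```

Both cases raise the same kind of error, so the message alone may not be enough to tell "several objects on one line" apart from "one object spread over several lines".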
Here is a snippet of code from the script (I won't post it in full so as not to make this post too long):
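import json

def fix_json_line(line):
    # Try to parse a single line as a standalone JSON document
    try:
        item = json.loads(line)
        return item
    except json.JSONDecodeError as e:
        print(f"Unable to fix JSON: {e}")
        return None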
# Parsed lines and lines that failed to parse
fixed_messages = []
unfixed_lines = []

# Open input file
with open(input_file, 'r', encoding='utf-8') as in_file:
    for line in in_file:
        # Try to fix the line
        fixed_line = fix_json_line(line.strip())
        # Compare against None so that falsy JSON values (0, "", []) still count as parsed
        if fixed_line is not None:
            fixed_messages.append(fixed_line)
        else:
            unfixed_lines.append(line)

# Saves corrected messages to a new JSON file
with open(output_file, 'w', encoding='utf-8') as out_file:
    json.dump(fixed_messages, out_file, ensure_ascii=False, indent=2)

# Saves lines that could not be corrected in a separate file
if unfixed_lines:
    with open('unfixed_lines.txt', 'w', encoding='utf-8') as unfixed_file:
        unfixed_file.writelines(unfixed_lines)
    print("Lines that could not be corrected were saved in 'unfixed_lines.txt'")
else:
    print("All lines were successfully corrected.")

print(f"Lines corrected and saved in {output_file}")
I've already tried searching, but since the error description is very vague (I couldn't narrow it down further), Google returned several possible causes and I don't know which one applies. I asked ChatGPT, but some days it seems like it doesn't work properly... I also asked my technical course teacher, but he couldn't answer me! This is an example of how the JSON is formatted:
    {
      "id": 610775,
      "type": "message",
      "date": "2024-06-27T13:55:13",
      "date_unixtime": "1719507313",
      "from": "Swelve",
      "from_id": "user5957514107",
      "reply_to_message_id": 610761,
      "text": "antes de pisar numa universidade ele deveria revisar esse português dele",
      "text_entities": [
        {
          "type": "plain",
          "text": "antes de pisar numa universidade ele deveria revisar esse português dele"
        }
      ]
    },
    {
      "id": 610776,
      "type": "message",
      "date": "2024-06-27T13:56:31",
      "date_unixtime": "1719507391",
      "from": "Old dirty bastard λ",
      "from_id": "user1758042831",
      "text": "No mínimo",
      "text_entities": [
        {
          "type": "plain",
          "text": "No mínimo"
        }
      ]
    }
  ]
}
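Since every message in this excerpt spans many lines, one thing that might be worth trying is parsing the whole file as a single JSON document instead of line by line. The sketch below rests on several assumptions: that the file as a whole is valid JSON, that the messages sit in an array under a top-level key (called `"messages"` here, which the excerpt does not show), and that the file fits in memory. The file name `result.json` and the helpers `load_messages` and `plain_texts` are hypothetical.

```python
import json
import os
import tempfile

def load_messages(path, key="messages"):
    """Parse the whole file as ONE JSON document and return the list under `key`.

    `key` is an assumption: the excerpt only shows the end of an array.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data[key]

def plain_texts(messages):
    """Keep the text of entries of type "message" whose text is a non-empty string."""
    return [
        m["text"]
        for m in messages
        if m.get("type") == "message"
        and isinstance(m.get("text"), str)
        and m["text"]
    ]

# Small stand-in for the export, pretty-printed across many lines like the excerpt.
sample = {
    "messages": [
        {"id": 610775, "type": "message", "from": "Swelve",
         "text": "antes de pisar numa universidade ele deveria revisar esse português dele"},
        {"id": 610776, "type": "message", "from": "Old dirty bastard λ",
         "text": "No mínimo"},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=1)
    texts = plain_texts(load_messages(path))

print(texts)
```

If `json.load` itself fails on the real file, the exception's `lineno` and `colno` attributes point at the first place where the document stops being valid JSON, which may be a more useful starting point than per-line errors.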
From my point of view there is nothing wrong with the JSON, but since I have just started with Python and machine learning I could be very wrong. (Before this I was a functional-programming-loving hippie, but here in my country there are more Python jobs, so I started studying Python to have a chance at a job.) I'm asking because, in order to correct the error, I first have to know what the error is, and I don't even know that… If anyone can help me, I'll be extremely grateful!