I had a task about analysing data in a newline-delimited JSON file using Python. I was asked to give two approaches: one that optimizes time spent, and one that optimizes memory used.
Each line of the JSON represented an object, something like { "animal": "cow", other fields... }
The questions were things like “Get the top 5 animals that appear in the most objects”.
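To be concrete, the file looks roughly like this (one object per line; the extra fields aren't relevant to the question and the values are purely illustrative):

{"animal": "cow", ...}
{"animal": "sheep", ...}
{"animal": "cow", ...}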
After toying with it a bit, I found that the approach that was both fastest and used the least memory was to use orjson and read the file line by line. Trying to load the whole JSON into memory (to optimize time over memory) yielded worse results on both ends. Other approaches, like multithreading, concurrent.futures and pandas, were all worse as well.
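The time-optimized variant was essentially along these lines (a rough sketch rather than the exact code; the counting part was the same as in the script further down):

import orjson

def load_all(file_path):
    # Read the entire file into memory up front, then parse every line.
    # The idea was to trade memory for speed, but it ended up slower
    # and heavier than the streaming version.
    with open(file_path, 'rb') as f:
        data = f.read()
    return [orjson.loads(line) for line in data.splitlines() if line]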
As such I’m stumped: I have a single approach that is best on both ends. Maybe someone with more knowledge on the matter can suggest what alternate approaches might be better suited for each of the two optimizations?
It was also suggested that I use cloud solutions, but as I understand it, that would still mean uploading the JSON to the cloud and then running the analysis there. That isn’t faster, and I’m not sure it counts as being more memory efficient either.
My current code looks like this:
import orjson
from collections import defaultdict


def analyse(file_path: str):
    # Count how many objects each animal appears in
    objects_per_animal = defaultdict(lambda: {"total": 0})
    for obj in json_streaming(file_path):
        animal = obj["animal"]
        objects_per_animal[animal]["total"] += 1
    # Sort animals by count, descending, and keep the top 5
    top_animals = sorted(objects_per_animal.keys(), key=lambda x: objects_per_animal[x]["total"], reverse=True)[:5]
    return top_animals


def json_streaming(file_path):
    # Parse one line at a time so the whole file is never held in memory
    with open(file_path, 'rb') as f:
        for line in f:
            yield orjson.loads(line)
The JSON is 500 MB, and this script takes around 2 seconds, of which about 1.2 s is spent in json_streaming alone. So reading and parsing the JSON is the bottleneck, but I really can’t make it faster or use less memory. Memory usage starts at about 25 MB (before the script runs), rises to 32 MB, and that’s it.
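In case it helps, the numbers above can be reproduced with something like this (a minimal sketch; psutil is assumed to be installed for the RSS reading, and the measure wrapper is just for illustration):

import time
import psutil

def measure(file_path):
    # analyse() and json_streaming() are the functions shown above
    proc = psutil.Process()
    print(f"RSS before: {proc.memory_info().rss / 1024 / 1024:.1f} MB")

    # Time the parsing pass alone by just consuming the generator
    start = time.perf_counter()
    for _ in json_streaming(file_path):
        pass
    print(f"json_streaming only: {time.perf_counter() - start:.2f} s")

    # Time the full analysis (this parses the file a second time)
    start = time.perf_counter()
    result = analyse(file_path)
    print(f"analyse total: {time.perf_counter() - start:.2f} s")

    print(f"RSS after: {proc.memory_info().rss / 1024 / 1024:.1f} MB")
    return result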