I have a multiprocessing thread pool in which each job returns a requested batch of JSON (an array of objects). I want all results written to a single file, but without holding the full result in memory as a single list, due to RAM constraints (the full result is about 1.5 million records, totaling about 1.5 GB).
I see solutions that suggest using `json-stream`, but these do not seem to account for batching (they handle either a list or dicts, not a list of lists of dicts). I also see solutions suggesting string manipulation: open the file, write a `[`, write each batch of objects separated by writes of `,`, and finally trim the trailing `,` and write a closing `]`.
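If I understand those suggestions correctly, they amount to something like this (a hypothetical sketch, with `batches` standing in for the pool results):

```python
import json

# Hypothetical sketch of the string-manipulation approach: `batches` stands
# in for the pool results (an iterable of lists of dicts).
with open("tmp.json", "wb") as f:
    f.write(b"[")
    for batch in batches:
        for element in batch:
            f.write(json.dumps(element).encode("utf-8"))
            f.write(b",")
    f.seek(-1, 2)  # step back over the trailing ","...
    f.truncate()   # ...and cut it off
    f.write(b"]")  # note: breaks if every batch is empty ("[" gets trimmed)
```

The manual trimming and bracket bookkeeping feel error-prone, hence the question.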
Is there a way to do this, either using an imported library or in a “safer” manner?
```python
from multiprocessing.pool import ThreadPool

from tqdm import tqdm


def thread(worker, jobs, threads):
    pool = ThreadPool(threads)
    with open("tmp.json", "w") as f:
        for result in tqdm(pool.imap_unordered(worker, jobs), total=len(jobs)):
            # 1. stream each object from each result array
            for element in result:
                ### CODE GOES HERE ###
            # 2. or just stream the result array
            ### CODE GOES HERE ###
```
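One way I can imagine filling in placeholder 1 (a sketch, assuming each `result` is a list of JSON-serializable dicts): write the `,` *before* every element except the first, tracked with a `first` flag, so there is never a trailing comma to trim:

```python
import json
from multiprocessing.pool import ThreadPool

from tqdm import tqdm


def thread(worker, jobs, threads):
    pool = ThreadPool(threads)
    with open("tmp.json", "w") as f:
        f.write("[")
        first = True
        for result in tqdm(pool.imap_unordered(worker, jobs), total=len(jobs)):
            for element in result:
                # Separator goes *before* each element except the first,
                # so no trailing "," ever needs trimming.
                if not first:
                    f.write(",")
                json.dump(element, f)
                first = False
        f.write("]")
    pool.close()
    pool.join()
```

This only ever holds one batch in memory at a time, but it is still manual bracket bookkeeping rather than a real streaming writer, which is why I'm asking whether a library handles this more safely.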