How to efficiently parse a large gzipped JSON file with ijson without encountering “trailing garbage” errors?

I am working with a large gzipped JSON file containing review data, formatted as a list of JSON objects. Each object is separated by a newline character. My goal is to efficiently extract the review_text field from each object using ijson without loading the entire file into memory, as the file contains over 15 million records.

However, when trying to parse the file using ijson, I encounter the following error:

IncompleteJSONError: parse error: trailing garbage
          votes": 16, "n_comments": 0} {"user_id": "8842281e1d1347389f
                     (right here) ------^

This is an actual sample of the data:

['{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "24375664", "review_id": "5cd416f3efc3f944fce4ce2db2290d5e", "rating": 5, "review_text": "Mind blowingly cool. Best science fiction I've read in some time. I just loved all the descriptions of the society of the future - how they lived in trees, the notion of owning property or even getting married was gone. How every surface was a screen. \n The undulations of how society responds to the Trisolaran threat seem surprising to me. Maybe its more the Chinese perspective, but I wouldn't have thought the ETO would exist in book 1, and I wouldn't have thought people would get so over-confident in our primitive fleet's chances given you have to think that with superior science they would have weapons - and defenses - that would just be as rifles to arrows once were. \n But the moment when Luo Ji won as a wallfacer was just too cool. I may have actually done a fist pump. Though by the way, if the Dark Forest theory is right - and I see no reason why it wouldn't be - we as a society should probably stop broadcasting so much signal out into the universe.", "date_added": "Fri Aug 25 13:55:02 -0700 2017", "date_updated": "Mon Oct 09 08:55:59 -0700 2017", "read_at": "Sat Oct 07 00:00:00 -0700 2017", "started_at": "Sat Aug 26 00:00:00 -0700 2017", "n_votes": 16, "n_comments": 0}n', '{"user_id": "8842281e1d1347389f2ab93d60773d4d", "book_id": "18245960", "review_id": "dfdbb7b0eb5a7e4c26d59a937e2e5feb", "rating": 5, "review_text": "This is a special book. It started slow for about the first third, then in the middle third it started to get interesting, then the last third blew my mind. This is what I love about good science fiction - it pushes your thinking about where things can go. \n It is a 2015 Hugo winner, and translated from its original Chinese, which made it interesting in just a different way from most things I've read. 
For instance the intermixing of Chinese revolutionary history - how they kept accusing people of being \"reactionaries\", etc. \n It is a book about science, and aliens. The science described in the book is impressive - its a book grounded in physics and pretty accurate as far as I could tell. Though when it got to folding protons into 8 dimensions I think he was just making stuff up - interesting to think about though. \n But what would happen if our SETI stations received a message - if we found someone was out there - and the person monitoring and answering the signal on our side was disillusioned? That part of the book was a bit dark - I would like to think human reaction to discovering alien civilization that is hostile would be more like Enders Game where we would band together. \n I did like how the book unveiled the Trisolaran culture through the game. It was a smart way to build empathy with them and also understand what they've gone through across so many centuries. And who know a 3 body problem was an unsolvable math problem? But I still don't get who made the game - maybe that will come in the next book. \n I loved this quote: \n \"In the long history of scientific progress, how many protons have been smashed apart in accelerators by physicists? How many neutrons and electrons? Probably no fewer than a hundred million. Every collision was probably the end of the civilizations and intelligences in a microcosmos. In fact, even in nature, the destruction of universes must be happening at every second--for example, through the decay of neutrons. Also, a high-energy cosmic ray entering the atmosphere may destroy thousands of such miniature universes....\"", "date_added": "Sun Jul 30 07:44:10 -0700 2017", "date_updated": "Wed Aug 30 00:00:26 -0700 2017", "read_at": "Sat Aug 26 12:05:52 -0700 2017", "started_at": "Tue Aug 15 13:23:18 -0700 2017", "n_votes": 28, "n_comments": 1}n']

Here is the original dataset, listed under book reviews:
https://mengtingwan.github.io/data/goodreads.html#datasets

What I have tried so far:

  • Using ijson to parse the file directly, which leads to the "trailing garbage" error.
  • Cleaning up each line before parsing it as JSON, which partially works but isn't efficient for large files.

How can I efficiently parse the large gzipped JSON file using ijson or any other approach that avoids loading the entire file into memory and does not result in the “trailing garbage” error? What adjustments can I make to handle this file format correctly?

Here is my current code, which produces that error:

import ijson
import pandas as pd
import gzip

review_texts = []

gzip_file_path = 'goodreads_dataset.json.gz'

with gzip.open(gzip_file_path, 'rt', encoding='utf-8') as f:
    objects = ijson.items(f, 'item')  # Use 'item' if it's a top-level array

    for obj in objects:
        if 'review_text' in obj:
            review_texts.append(obj['review_text'])

df = pd.DataFrame(review_texts, columns=['review_text'])
df.to_pickle('reviews.pkl')

print(f"Saved {len(df)} review_text entries to 'reviews.pkl'")

The data file contains one JSON object per line, so its format is actually JSON Lines,
and the archive extension should arguably be .jsonl.gz.
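
One quick way to confirm this, as a rough sketch: read the first few lines of the archive and check that each one parses as a complete JSON document on its own. The helper name `looks_like_json_lines` and the sample size are illustrative assumptions, and the check is only a heuristic (a minified array stored on a single line would also pass).

```python
import gzip
import itertools
import json


def looks_like_json_lines(path, sample_size=5):
    """Return True if the first non-empty lines each parse as standalone JSON.

    Hypothetical helper for illustration; it only inspects a small sample.
    """
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        # Lazily skip blank lines and keep at most `sample_size` of the rest.
        non_empty = (line for line in f if line.strip())
        sample = list(itertools.islice(non_empty, sample_size))
    if not sample:
        return False
    try:
        for line in sample:
            json.loads(line)
    except json.JSONDecodeError:
        return False
    return True
```
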

You can simply read the file line by line, and use the regular json module to parse the JSON on every line:

import gzip
import json

review_texts = []

gzip_file_path = 'goodreads_dataset.json.gz'

with gzip.open(gzip_file_path, 'rt', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        review_texts.append(obj['review_text'])
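
If holding all 15 million strings in a list is itself too much, the same loop can be turned into a generator so that only one record is in memory at a time. This is a sketch under assumptions: the function names `iter_review_texts` and `write_review_texts`, the output file, and the choice to skip blank lines and records without `review_text` are mine, not part of the dataset or the code above.

```python
import gzip
import json


def iter_review_texts(path):
    """Yield the review_text of each record in a gzipped JSON Lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            obj = json.loads(line)
            if 'review_text' in obj:
                yield obj['review_text']


def write_review_texts(src_path, dst_path):
    """Stream review texts into a new gzipped JSON Lines file; return the count."""
    count = 0
    with gzip.open(dst_path, 'wt', encoding='utf-8') as out:
        for text in iter_review_texts(src_path):
            # json.dumps escapes embedded newlines, so each text stays on one line.
            out.write(json.dumps(text) + '\n')
            count += 1
    return count
```

For example, `write_review_texts('goodreads_dataset.json.gz', 'review_texts.jsonl.gz')` would extract the texts without ever building the full list (the output file name here is hypothetical).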

Did you see this answer? It sounds like it's your case. It's also in the ijson FAQ.
