Given a jsonl file like this:
{"abc1": "hello world", "foo2": "foo bar"}
{"foo2": "bar bar blah", "foo3": "blah foo"}
I could convert it to a dataframe like this:
import json

import numpy as np
import pandas as pd

with open('mydata.jsonl') as fin:
    # json.loads is safer than eval and handles JSON null/true/false
    df = pd.json_normalize([json.loads(line) for line in fin])

# Sometimes there are encoding issues, so this is done
# (note: x == np.nan is always False, so pd.isna is used instead):
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].apply(lambda x: np.nan if pd.isna(x) else str(x).encode('utf8', 'replace').decode('utf8'))

df.to_parquet('mydata.parquet')
The above works for a small dataset, but for a dataset with billions of lines, reading everything into RAM is excessive and won't fit on a normal machine.
Are there other ways to convert the dataset to parquet efficiently, without reading all of the data into RAM?
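One direction I've been considering is a rough, untested sketch like the one below: stream the file in chunks with pandas.read_json(lines=True, chunksize=...) and append each chunk to a single parquet file with pyarrow.parquet.ParquetWriter. The chunk size is an arbitrary guess, the clean_encoding helper is just my cleanup loop from above wrapped in a function, it uses read_json rather than json_normalize (so nested objects wouldn't be flattened the same way), and it assumes every chunk ends up with the same columns/schema, which may not hold for ragged JSON lines.

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def clean_encoding(chunk):
    # Same per-column encoding cleanup as above, applied chunk by chunk
    for col in chunk.columns:
        if chunk[col].dtype == object:
            chunk[col] = chunk[col].apply(lambda x: np.nan if pd.isna(x) else str(x).encode('utf8', 'replace').decode('utf8'))
    return chunk

writer = None
# chunksize of 100_000 lines is an arbitrary guess, not a recommendation
for chunk in pd.read_json('mydata.jsonl', lines=True, chunksize=100_000):
    table = pa.Table.from_pandas(clean_encoding(chunk))
    if writer is None:
        # Lock in the schema from the first chunk for the whole file
        writer = pq.ParquetWriter('mydata.parquet', table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()

Is something along these lines reasonable, or is there a better-suited tool for this?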