I have collections of CSV files, up to 1000 files, each ~1 GB uncompressed, and I want to create a single Parquet dataset from them.
In doing this, I want to record which file each set of rows comes from.
I want to do all of this in less than ~10 GB of RAM, in Python.
The obvious place to start was with Dask.
If I do something like:
import dask.dataframe as dd

for infile in file_list:
    ddf = dd.read_csv(infile)
    ddf = ddf.assign(filename=infile)  # record the source file for every row
    ddf.to_parquet("output_parquet_path",
                   append=True,
                   write_index=False,
                   write_metadata_file=True,
                   compute=True)
Then I get an error message about the column types from the second file onwards – it seems that the text columns are of type string[pyarrow] in the parquet file, but object in the Dask dataframe (see ValueError: Appended dtypes differ when appending two simple tables with dask).
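One thing I have considered, but not verified, is forcing every text column to a pyarrow-backed string dtype before writing, so the dtypes of each appended frame match what the already-written partitions report. A minimal sketch of that idea (the choice of "string[pyarrow]" as the target dtype is my guess, not something the Dask docs told me to do):

import dask.dataframe as dd

for infile in file_list:
    ddf = dd.read_csv(infile)
    ddf = ddf.assign(filename=infile)
    # Cast every object (text) column to pyarrow-backed strings so all files
    # present identical dtypes when appended to the same parquet dataset.
    str_cols = [col for col, dt in ddf.dtypes.items() if dt == object]
    ddf = ddf.astype({col: "string[pyarrow]" for col in str_cols})
    ddf.to_parquet("output_parquet_path",
                   append=True,
                   write_index=False,
                   write_metadata_file=True,
                   compute=True)

I don't know whether this actually makes the append-time dtype check pass, or whether it just moves the mismatch somewhere else.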
If I try to rely on the lazy nature of Dask, and do:
frame_list = list()
for infile in file_list:
    ddf = dd.read_csv(infile)
    ddf = ddf.assign(filename=infile)
    frame_list.append(ddf)

full_frame = dd.concat(frame_list)
full_frame.to_parquet("output_parquet_path",
                      write_index=False,
                      write_metadata_file=True,
                      compute=True)
Then a compute is triggered early and it tries to load all the frames into memory at once.
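For reference, the fallback I could write by hand (outside Dask) would stream each CSV through pyarrow's ParquetWriter in bounded chunks. This is only a sketch, assuming all CSVs share the same columns and compatible types and that a single output file (rather than a partitioned dataset) is acceptable; I'd rather not maintain it if Dask can do this for me:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for infile in file_list:
    # Read each CSV in bounded chunks so memory stays well under the ~10 GB limit.
    for chunk in pd.read_csv(infile, chunksize=1_000_000):
        chunk["filename"] = infile  # record the source file for these rows
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            # Use the first chunk's schema for the whole output
            # (assumes all CSVs have the same columns and compatible types).
            writer = pq.ParquetWriter("output.parquet", table.schema)
        writer.write_table(table)
if writer is not None:
    writer.close()

Is there a way to get Dask to do this without either the dtype mismatch on append or materialising everything at once?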