I would like to lazily load a large Parquet file and then process it in batches, because I'm writing the data into a database that limits how many rows can be written at once. LazyFrame has no write_database method, so I'm collecting each batch into a DataFrame in order to use DataFrame.write_database.
The problem is that collecting the DataFrame takes longer as I work through the batches, but only for certain datasets. If both columns in the frame are strings, collecting is fairly consistently fast: the first collect takes ~0.02 seconds, and subsequent collects take about 1.5 seconds. But if one column is i64, the first collect is fast at ~0.02 seconds, while subsequent collects take 10+ seconds. The estimated size of the various DataFrame batches is about the same across datasets: 7-8 MB.
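For context, the timings are simple wall-clock measurements around each collect call, and the sizes come from DataFrame.estimated_size; a rough sketch of that measurement (not my exact script):

import time

t0 = time.perf_counter()
df_batch = lf_batch.collect()  # lf_batch is one slice of the LazyFrame (see the loop below)
print(f"collect: {time.perf_counter() - t0:.2f}s, "
      f"estimated size: {df_batch.estimated_size('mb'):.1f} MB")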
Here’s simplified code:
import polars as pl

lf = pl.scan_parquet(file)

n_batch = 100_000
row = 0
while row < 500_000:
    # take the next slice of rows and materialize it
    lf_batch = lf.slice(row, n_batch)
    df_batch = lf_batch.collect()
    # the real code writes df_batch to the database here (see below)
    row += n_batch
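In the real (non-simplified) loop, each collected batch then goes to the database with DataFrame.write_database, roughly like this; the table name, connection URI, and if_table_exists choice below are placeholders, not my real values:

# placeholder table name and connection URI
df_batch.write_database(
    table_name="my_table",
    connection="postgresql://user:password@localhost:5432/mydb",
    if_table_exists="append",
)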
Is there a better way to collect batches?