Currently, I’m using a PyArrow `RecordBatchReader` to process potentially quite large datasets. What I need is to create an appropriate reader for the PyArrow Dataset `write_dataset` function.
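For context, this is roughly what my current approach looks like; the schema and the batch generator are placeholders standing in for my real pipeline:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder schema and batch source standing in for my real data.
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

def batch_iter():
    # Yield RecordBatches one at a time so the full dataset
    # never has to sit in memory at once.
    for i in range(3):
        yield pa.record_batch(
            [pa.array([i, i + 1]), pa.array([0.1 * i, 0.2 * i])],
            schema=schema,
        )

reader = pa.RecordBatchReader.from_batches(schema, batch_iter())
ds.write_dataset(reader, "out", format="parquet")
```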
I’m considering using a PyArrow `BufferReader` instead, so that I can skip the step of creating batches from my dataset.
However, according to the `write_dataset` function’s docstring, it seems that I can’t pass a `BufferReader` as the `data` parameter:

```
data : Dataset, Table/RecordBatch, RecordBatchReader, list of Table/RecordBatch, or iterable of RecordBatch
```
Do you have any experience with either of these approaches? Could you please share your opinion on this?
I have tried using a `RecordBatchReader`, but it requires batching the dataset first.
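To illustrate what I mean by the batching step (assuming, just for this sketch, that the source is already an in-memory `Table`, which in reality it may be too large to be):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Toy stand-in for my source data.
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# to_batches() splits the table into RecordBatches; this is the
# batching step that the RecordBatchReader approach forces on me.
batches = table.to_batches(max_chunksize=2)
reader = pa.RecordBatchReader.from_batches(table.schema, batches)
ds.write_dataset(reader, "out", format="parquet")
```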
I have tried using a `BufferReader`, but it seems that I have to load the whole dataset into memory while creating the stream that the `BufferReader` requires.
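Here is a sketch of what I mean; the IPC serialization round-trip is my assumption about how the buffer would have to be built:

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Serialize the data into an in-memory IPC stream: at this point
# the entire dataset is materialized in the buffer.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    for batch in table.to_batches():
        writer.write_batch(batch)
buf = sink.getvalue()

# BufferReader only wraps the fully materialized buffer, and
# write_dataset would not accept it directly anyway; it has to be
# re-opened as a RecordBatchStreamReader first.
reader = pa.ipc.open_stream(pa.BufferReader(buf))
```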