
Tag Archive for python, parquet, pyarrow

What are the advantages and disadvantages of using PyArrow RecordBatchReader vs BufferReader?

Currently, I’m using a PyArrow RecordBatchReader to process potentially very large datasets. What I need is an appropriate reader to pass to the write_dataset function of PyArrow Dataset.
I’m considering using a PyArrow BufferReader instead, so that I can skip the step where I create batches from my dataset.
However, according to the write_dataset function’s docstring, it seems to me that I can’t use the BufferReader as the data parameter:

Filename output when partitioning on timestamp using PyArrow

I am currently using pyarrow to partition the data in a pyarrow dataframe on a column called ‘req_moment’. The partitioning process itself works, but the timestamp shown in the filename is polluted with extra characters (I think they represent the white spaces and “:”). I am running the code below: