When reading Parquet files into a pandas DataFrame, I use the following:
df = pd.read_parquet(PATH_TO_FILE, dtype_backend='pyarrow')
In this case, my DataFrame is quite large: 52 million rows by 9 columns.
When I try to subset the DataFrame with a relatively simple operation like:
df.loc[df['example_column'].str.startswith('B')].shape
I get the following error:
ArrowInvalid: offset overflow while concatenating arrays
However, when I first process the DataFrame in batches, for example 2 million rows at a time, the same operation completes with no errors. Roughly, the batched version looks like this (the exact batch size, the iloc slicing, and the pd.concat step below are just a sketch of my approach, not my exact code):
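import pandas as pd

df = pd.read_parquet(PATH_TO_FILE, dtype_backend='pyarrow')

batch_size = 2_000_000  # illustrative; roughly 2 million rows per batch
parts = []
for start in range(0, len(df), batch_size):
    batch = df.iloc[start:start + batch_size]
    # apply the same .str.startswith filter, but per batch
    parts.append(batch.loc[batch['example_column'].str.startswith('B')])

result = pd.concat(parts, ignore_index=True)
print(result.shape)

Why does the batched version work when the single .loc call raises the overflow error?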