Why does a pyarrow backend df need more RAM than a numpy backend?
I am reading a large parquet file with int, string and date columns. When I use dtype_backend="pyarrow" instead of the default dtype_backend="numpy_nullable", df.info() reports 15.6 GB instead of 14.6 GB. For other datasets I have seen an even larger relative overhead with the pyarrow backend.
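For reference, a minimal sketch of the comparison (the file path is a placeholder, since the question does not show its code; requires pandas 2.0+ for the dtype_backend argument):

```python
import pandas as pd

path = "data.parquet"  # placeholder path

# Read the same file with both backends.
df_np = pd.read_parquet(path, dtype_backend="numpy_nullable")
df_pa = pd.read_parquet(path, dtype_backend="pyarrow")

# memory_usage="deep" makes pandas measure string/object columns exactly
# instead of estimating them, which matters for string-heavy data.
df_np.info(memory_usage="deep")
df_pa.info(memory_usage="deep")
```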
pyarrow.lib.ArrowCapacityError when creating a string column
I'd like to create a new string column, but pandas with the pyarrow backend throws an ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 3525828799.
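A minimal sketch, assuming the new column is an Arrow-backed string Series (the data here is a placeholder). Arrow's plain string type stores offsets as 32-bit integers, so a single contiguous array cannot hold more than roughly 2 GiB of character data, which matches the 2147483646-byte limit in the error; casting to large_string, which uses 64-bit offsets, is one possible workaround:

```python
import pandas as pd
import pyarrow as pa

# Placeholder data; a real column with > 2 GiB of total string bytes
# triggers the ArrowCapacityError when stored as a plain Arrow string.
s = pd.Series(["some text"] * 1_000, dtype=pd.ArrowDtype(pa.string()))

# Cast to large_string (64-bit offsets) to lift the ~2 GiB per-array limit.
s_large = s.astype(pd.ArrowDtype(pa.large_string()))
print(s_large.dtype)  # large_string[pyarrow]
```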
get seconds from pandas timedelta with pyarrow dtype
I have a dataframe with pyarrow dtypes such as `duration[ns][pyarrow]`.
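A minimal sketch of extracting seconds from such a column; the data is a placeholder, since the question only shows the dtype:

```python
import pandas as pd
import pyarrow as pa

# Placeholder timedelta column with the pyarrow duration dtype.
td = pd.Series(pd.to_timedelta(["1h", "90s", "2d"])).astype(pd.ArrowDtype(pa.duration("ns")))

# Recent pandas versions expose the .dt accessor for pyarrow duration dtypes.
print(td.dt.total_seconds())

# Fallback that avoids the Arrow-backed accessor: convert to numpy
# timedelta64[ns] first, then use the regular .dt.total_seconds().
print(td.astype("timedelta64[ns]").dt.total_seconds())
```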