Polars consistently produces unexpectedly large IPC (feather) files when persisting to disk. With Parquet, file sizes are comparable to pandas or slightly smaller, as expected.
Rough ballpark of my problem: pandas feather: ~250MB – polars ipc: ~2.5GB
Most files are inflated by a factor of 1.5x to 10x, and the inflation seems to scale with the number of columns.
I do not think this is a pyarrow problem, as pandas produces the expected file size, give or take.
I don’t think this is a polars-python bug; I assume this is a me/my computer problem…
Maybe someone can help?
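To put an exact number on the inflation factor instead of reading it off a directory listing, a small standard-library helper along these lines could be used. This is only a sketch: `size_ratio` is a hypothetical name introduced here, and the two paths are whichever pair of files is being compared.

```python
import os


def size_ratio(baseline_path: str, candidate_path: str) -> float:
    """Return the candidate file size divided by the baseline file size."""
    baseline = os.path.getsize(baseline_path)    # size in bytes
    candidate = os.path.getsize(candidate_path)  # size in bytes
    if baseline == 0:
        raise ValueError("baseline file is empty, ratio is undefined")
    return candidate / baseline
```

For example, `size_ratio("b.feather", "b_pl.feather")` would report how many times larger the polars file is than the pandas one for the integer example further down.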
Minimal code to reproduce:
import os
import subprocess
import pandas as pd  # needed for pd.DatetimeIndex / pd.DataFrame below
import polars as pl
import yfinance
os.environ["POLARS_VERBOSE"] = "1"
# datetime example
data = yfinance.download(
tickers="^AXJO ^N225", start="2005-01-01", end="2024-06-01", interval="1d", threads=True, prepost=True
)
data.columns = data.columns.get_level_values(1) + "_" + data.columns.get_level_values(0)
data.index = pd.DatetimeIndex(data.index, name="date_utc").tz_localize(None)
data = data.reset_index()
# both result in datetime[ns] / float64 for all columns
data.to_feather("banana.feather")
pl.DataFrame(data).write_ipc("banana_pl.feather")
# integer example w larger data
data2 = {
"vals1": range(50_000),
"vals2": range(50_000),
"vals3": range(50_000)
}
# rule out that this is a pandas -> polars conversion artefact
pd.DataFrame(data2).to_feather("b.feather")
pl.DataFrame(data2).write_ipc("b_pl.feather")
# list the written files with human-readable sizes
command = "ls -lha | grep feather"
subprocess.run(command, shell=True)
>>>-rw-r--r-- 1 xxx.xxx 884741199 589K Jun 9 11:41 b.feather
>>>-rw-r--r-- 1 xxx.xxx 884741199 1.1M Jun 9 11:41 b_pl.feather
>>>-rw-r--r-- 1 xxx.xxx 884741199 338K Jun 9 11:41 banana.feather
>>>-rw-r--r-- 1 xxx.xxx 884741199 523K Jun 9 11:41 banana_pl.feather
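A back-of-the-envelope check on the integer example: three columns of 50,000 values, if stored as 64-bit integers with no compression, take 1,200,000 bytes of raw data, which a human-readable listing shows as roughly 1.1M, the size of `b_pl.feather` above. That would be consistent with the polars file being written uncompressed and the pandas file being compressed, so the `compression` argument of `write_ipc` may be worth ruling out. The sketch below assumes int64 storage and uses `zlib` from the standard library purely as a stand-in for the codecs Arrow IPC actually supports (LZ4, ZSTD), so the compressed size it prints is illustrative only.

```python
import zlib
from array import array

N_ROWS = 50_000
N_COLS = 3
BYTES_PER_INT64 = 8  # assumption: every column is stored as int64

# Raw, uncompressed payload of the three integer columns.
raw_bytes = N_ROWS * N_COLS * BYTES_PER_INT64
print(f"raw column data: {raw_bytes} bytes ({raw_bytes / 2**20:.2f} MiB)")

# One column laid out as contiguous 8-byte integers.
column = array("q", range(N_ROWS)).tobytes()

# zlib is NOT what Arrow uses; it only illustrates that such regular
# data shrinks noticeably under a general-purpose compressor.
compressed = zlib.compress(column)
print(f"one column: {len(column)} bytes raw, {len(compressed)} bytes with zlib")
```

If the raw figure matches the polars file and the pandas file is clearly below it, the difference is more likely a default-compression setting than a problem with the machine.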
I have gotten the same results for the following two configs:
--------Version info---------
Polars: 0.20.31
Index type: UInt32
Platform: macOS-13.6.5-arm64-arm-64bit
Python: 3.10.6 (main, Aug 7 2023, 13:38:39) [Clang 14.0.3 (clang-1403.0.22.14.1)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.0
nest_asyncio: 1.5.8
numpy: 1.26.0
openpyxl: 3.1.2
pandas: 2.2.2
pyarrow: 16.1.0
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.22
torch: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
and
--------Version info---------
Polars: 0.20.19
Index type: UInt32
Platform: macOS-13.6.5-arm64-arm-64bit
Python: 3.10.6 (main, Aug 7 2023, 13:38:39) [Clang 14.0.3 (clang-1403.0.22.14.1)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: 3.0.0
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.0
nest_asyncio: 1.5.8
numpy: 1.26.0
openpyxl: 3.1.2
pandas: 2.1.1
pyarrow: 13.0.0
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: 2.0.22
xlsx2csv: <not installed>
xlsxwriter: <not installed>