I’m working with huge datasets and considering using Polars.
My goal is to write partitioned data in NDJSON format to S3.
My data is usually much bigger than my SSD.
There are a few problems:

- When I have a `LazyFrame`, there is a `sink_parquet_cloud`, but there is no `sink_ndjson_cloud`. So how can I write it to AWS S3 without storing it locally first? (The data is big.)
- As I understand it, `PartitionedWriter` is available for `Ipc` only, so for JSON it could be done with a series of filters. However, to filter by the distinct values of the partitioning column, I would first need to collect all the data. So my question is: is there a more efficient way?