I have an S3 bucket in a Hive-style layout where the partition value is an ISO-formatted date. The S3 object keys are of the form:
staging/extract/contracts/record_date=2024-01-01/contracts_0_0_2024-02-15T16:21:51.975005+00:00.parquet
I.e. the table is staging/extract/contracts, partitioned on record_date, which is a date.
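To make the layout concrete, here is a minimal stdlib-only sketch of how I understand the key to break down. The helper name parse_hive_key is hypothetical (it is not part of polars), and it assumes every directory segment containing "=" is a partition and that the last segment is the file name:

```python
from datetime import date


def parse_hive_key(key: str) -> tuple[str, dict[str, str]]:
    """Split an object key into (table path, {partition column: raw string value}).

    Hypothetical helper for illustration only; not a polars API.
    """
    table_parts: list[str] = []
    partitions: dict[str, str] = {}
    # The last path segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        if "=" in segment:
            column, _, value = segment.partition("=")
            partitions[column] = value
        else:
            table_parts.append(segment)
    return "/".join(table_parts), partitions


key = (
    "staging/extract/contracts/record_date=2024-01-01/"
    "contracts_0_0_2024-02-15T16:21:51.975005+00:00.parquet"
)
table, partitions = parse_hive_key(key)

# The partition value in the key is a plain string; it only becomes a date
# once it is parsed explicitly.
record_date = date.fromisoformat(partitions["record_date"])
print(table, partitions, record_date)
```

The point is that the value stored in the key is just the text 2024-01-01, and it round-trips with date.isoformat(), which is why the string comparison further down matches.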
I want polars to interpret this partition column as a date, but I can’t figure out what I’ve got wrong.
This “works”, but requires the LazyFrame to be filtered with a string:
frame = polars.scan_parquet(url, storage_options=storage_options)
results = frame.filter(polars.col("record_date") == target_date.isoformat()).collect()
By “works” I mean that the debug output includes many lines saying:
parquet file can be skipped, the statistics were sufficient to apply the predicate.
This doesn’t work:
frame = polars.scan_parquet(url, storage_options=storage_options, hive_schema={"record_date": polars.Date})
results = frame.filter(polars.col("record_date") == target_date).collect()
By “doesn’t work” I mean that every file appears to be read, with debug lines saying:
parquet file must be read, statistics not sufficient for predicate.
Note that we are new to Hive layouts; if the key format is incorrect, we can change the keys.