I’m trying to export a DuckDB table to Parquet files using hive partitioning. Unfortunately, DuckDB writes the partition both as a hive partition and as a column in the file. This doesn’t seem correct, and most tools fail to read the file when they find the same column twice (once in the file and once in the hive partition).
Is there a way to tell DuckDB to avoid writing the partition column in the Parquet file?
To reproduce it:
Create a table with
CREATE TABLE t (
    id BIGINT,
    other_col VARCHAR,
    part VARCHAR
);
INSERT INTO t
VALUES
(1, '1', 'part1'),
(2, '2', 'part2');
Try to export the table to Parquet, partitioning by the column part
with
COPY t
TO 'table_partitioned'
(FORMAT PARQUET, PARTITION_BY part);
Output path: table_partitioned/part=part1/data_0.parquet
File schema:
message duckdb_schema {
    OPTIONAL INT64 id (INT_64);
    OPTIONAL BYTE_ARRAY other_col (UTF8);
    OPTIONAL BYTE_ARRAY part (UTF8);
}
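For illustration, one way to see the clash from DuckDB itself is to read the partitioned output back with hive partitioning enabled; the hive_partitioning flag derives a part column from the part=... directory names, which then overlaps with the part column already stored in each file (a minimal sketch using DuckDB's read_parquet):
-- Read the exported files back; hive_partitioning=true adds a "part" column
-- derived from the directory names, on top of the "part" column that is
-- already present inside each file.
SELECT *
FROM read_parquet('table_partitioned/*/*.parquet', hive_partitioning = true);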
As of commit 311124a26f (August 7), omitting the partition columns from the written Parquet files is the default behaviour.
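If you are not sure whether your build already includes that commit, DuckDB reports both its version and the source commit hash via PRAGMA version (the source_id column), which you can compare against the hash above:
-- Show the running DuckDB build; source_id contains the commit hash.
PRAGMA version;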
To verify this, use:
SELECT * FROM parquet_metadata( <pathname_to_one_of_the_parquet_files> );
Using the OP's example,
SELECT * FROM parquet_metadata('table_partitioned/part=part1/data_0.parquet');
produces just two rows, one for each of the non-partitioned columns. To see their names, you could SELECT path_in_schema FROM ...
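For example, against the file path used above, a query along these lines should return only id and other_col:
SELECT path_in_schema
FROM parquet_metadata('table_partitioned/part=part1/data_0.parquet');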