I have JSONL files partitioned by user_id and report_date. I am converting these JSONL files into Parquet files and saving them in the same folder, using the following DuckDB command from Python:
import duckdb as db

jsonl_file_path = '/users/user_id=123/report_date=2024-04-30/data.jsonl'
out_path = '/users/user_id=123/report_date=2024-04-30/data.parquet'

# Read the whole JSONL file and write it out as a single Parquet file
db.sql(
    f"""
    COPY (
        SELECT * FROM read_json_auto(
            '{jsonl_file_path}',
            maximum_depth=-1,
            sample_size=-1,
            ignore_errors=true
        )
    )
    TO '{out_path}' (
        FORMAT PARQUET,
        ROW_GROUP_SIZE 100000,
        OVERWRITE_OR_IGNORE 1
    );
    """
)
It works fine, but the problem is that DuckDB inserts the hive partition values (user_id and report_date) into the Parquet file, even though these values are not in the JSONL file. I tried adding hive_partitioning = false, but the problem persists. Does anyone know how to solve this?
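
For reference, here is a minimal sketch of where that option would go, assuming it is passed as a named argument to read_json_auto (the question does not say exactly where it was added). The helper name build_copy_sql is hypothetical and only builds the SQL string, so the placement of the option is easy to see.

```python
def build_copy_sql(jsonl_file_path: str, out_path: str,
                   hive_partitioning: bool = False) -> str:
    """Build the COPY statement from above, with hive_partitioning set on the reader.

    Assumption: the option belongs inside read_json_auto(...), not in the
    COPY ... TO (...) options list.
    """
    flag = "true" if hive_partitioning else "false"
    return f"""
COPY (
    SELECT * FROM read_json_auto(
        '{jsonl_file_path}',
        maximum_depth=-1,
        sample_size=-1,
        ignore_errors=true,
        hive_partitioning={flag}
    )
)
TO '{out_path}' (
    FORMAT PARQUET,
    ROW_GROUP_SIZE 100000,
    OVERWRITE_OR_IGNORE 1
);
"""


if __name__ == "__main__":
    # Print the statement for the paths used in the question.
    print(build_copy_sql(
        '/users/user_id=123/report_date=2024-04-30/data.jsonl',
        '/users/user_id=123/report_date=2024-04-30/data.parquet',
    ))
```

The resulting string can be run with db.sql(build_copy_sql(jsonl_file_path, out_path)). When checking the output, it may be worth reading the Parquet file back with hive_partitioning = false as well, since the reader can also derive user_id and report_date from the path rather than from the file contents.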