I have streaming data coming in as JSON that I transform into a Polars DataFrame and then write out as Parquet, partitioned by two columns. I've noticed that when a new record lands in an existing partition, instead of writing an additional file into that folder, Polars overwrites the old data with the new data. I want to keep the old data and write the new data alongside it in the partition folder.
Here is a reproducible example:
import polars as pl

# Create the first record
df_a = pl.DataFrame(
    {
        'type': ['a'],
        'date': ['2024-08-15'],
        'value': [68],
    }
)

# Create a second record with the same partition keys
df_b = pl.DataFrame(
    {
        'type': ['a'],
        'date': ['2024-08-15'],
        'value': [39],
    }
)

# Write the first record partitioned by type and date
df_a.write_parquet('./data/example.parquet', partition_by=['type', 'date'])

# Write the second record, whose type and date match the first record
df_b.write_parquet('./data/example.parquet', partition_by=['type', 'date'])

# Read the written data back
check = pl.read_parquet('./data/example.parquet/')
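Inspecting check confirms the overwrite (the assertion below just restates what I'm observing):

print(check)  # a single row: type 'a', date '2024-08-15', value 39
assert check['value'].to_list() == [39]  # df_a's 68 is gone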
As the check shows, only df_b's data is retained; I want both rows kept. I also do not want to read the existing data back in and append the new data to it with pl.concat() before writing it all out again.
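For what it's worth, the behavior I'm after looks like the sketch below, which falls back to PyArrow's pq.write_to_dataset (the append_partitioned helper name is mine, not a Polars or PyArrow API). Giving each batch a unique basename_template means new files land next to the old ones inside a partition directory instead of replacing them. Ideally I'd get the same effect from write_parquet itself.

import uuid

import polars as pl
import pyarrow.parquet as pq

def append_partitioned(df: pl.DataFrame, root: str, cols: list[str]) -> None:
    # Workaround sketch: write each batch as a NEW file inside the matching
    # partition directories rather than replacing what is already there.
    pq.write_to_dataset(
        df.to_arrow(),
        root_path=root,
        partition_cols=cols,
        # A unique basename per batch, so files from earlier batches survive
        # ('{i}' is required by PyArrow and numbers the files within a batch).
        basename_template=f'part-{uuid.uuid4().hex}-{{i}}.parquet',
        existing_data_behavior='overwrite_or_ignore',
    )

append_partitioned(df_a, './data/example.parquet', ['type', 'date'])
append_partitioned(df_b, './data/example.parquet', ['type', 'date'])
pl.read_parquet('./data/example.parquet/')  # now returns both rows (68 and 39)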