I am using pyarrow to write a partitioned dataset, partitioning on a column called ‘req_moment’. The partitioning itself works, but the timestamp in the partition directory name is polluted with other characters (I think they represent the whitespace and the “:”). I am running the code below:
import os
import pyarrow.parquet as pq

pq.write_to_dataset(
    new_df.to_arrow(),
    f"abfss://{os.environ['STORAGE_ACCOUNT_CONTAINER']}/{os.environ['STORAGE_ACCOUNT_PATH']}",
    partition_cols=['req_moment'],
    filesystem=fs,
)
This results in file paths that look like this:
/req_moment=2023-12-15%2009%3A31%3A18/d2c327fa52dd4c1ebd6afdf8f4cea7fe-0.parquet
The desired format is:
/req_moment=2023-12-15 HH:mm:ss/d2c327fa52dd4c1ebd6afdf8f4cea7fe-0.parquet
Is this possible, or is the encoded form normal behavior?
I have tried converting the data type of the column, among other things, but none of it helps.
The %20 and %3A sequences are the percent-encoded (URL-encoded) forms of the space and the “:”. Run unquote on each filename to get the actual text:
import urllib.parse

# filenames: the encoded paths, however you list them from storage
decoded = [urllib.parse.unquote(filename) for filename in filenames]
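As a quick check, here is a minimal sketch that applies unquote to the exact path from the question and then parses the partition value into a datetime. The helper partition_value is hypothetical (it is not part of pyarrow), and the "%Y-%m-%d %H:%M:%S" format is an assumption based on the example path.

```python
from datetime import datetime
from urllib.parse import unquote


def partition_value(path: str, key: str) -> str:
    """Hypothetical helper: return the decoded value of a `key=value` path segment."""
    for segment in path.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name == key:
            # Decode only after splitting, so an encoded "/" (%2F) in a value
            # is not mistaken for a path separator.
            return unquote(value)
    raise KeyError(key)


encoded_path = "/req_moment=2023-12-15%2009%3A31%3A18/d2c327fa52dd4c1ebd6afdf8f4cea7fe-0.parquet"

# Decode the whole path for display.
decoded_path = unquote(encoded_path)
print(decoded_path)

# Decode just the partition value and parse it (format assumed from the example).
moment = datetime.strptime(partition_value(encoded_path, "req_moment"), "%Y-%m-%d %H:%M:%S")
print(moment)
```

This only changes how the names are read back; the directories on storage keep their encoded names.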