I have a big parquet table stored in multiple partitions, with each partition containing multiple files, like:
s3
├── gender=M
│   ├── age=10
│   │   ├── part-000.parquet
│   │   ├── part-001.parquet
│   │   └── part-002.parquet
│   └── age=20
│       ├── part-000.parquet
│       ├── part-001.parquet
│       └── part-002.parquet
└── gender=W
    ├── age=10
    │   ├── part-000.parquet
    │   ├── part-001.parquet
    │   └── part-002.parquet
    └── age=20
        ├── part-000.parquet
        ├── part-001.parquet
        └── part-002.parquet
I want to apply transformations to some columns of the data. These transformations will not change the schema or dimensions of the table, nor will they depend on data "outside" of the current row (for example, a transformation that adds 1 to the values of a numerical column).
Since the partitions can be very large, my approach is to apply these transformations to each parquet file separately and then overwrite the original parquet file with the transformed one. In pseudo-code, it goes like this:
for each .parquet file, do the following:
(taking as an example the file 's3://gender=M/age=10/part-000.parquet')

    move the file to a temporary location:
        's3://gender=M/age=10/part-000.parquet' ⟼ 's3://tmp/gender=M/age=10/part-000.parquet'

    use spark to read from the temporary location and transform it:
        df = spark.read.parquet('s3://tmp/gender=M/age=10/part-000.parquet')
        transformed_df = transform(df)

    use spark to write the transformed data to a second temporary location:
        transformed_df.coalesce(1).write.parquet('s3://tmp/transformed/gender=M/age=10/')
        # this will create a file with a different name, for example s3://tmp/transformed/gender=M/age=10/part-xyz.parquet

    move the transformed file to the original location, under the original file name:
        's3://tmp/transformed/gender=M/age=10/part-xyz.parquet' ⟼ 's3://gender=M/age=10/part-000.parquet'

    remove 's3://tmp/'
After testing this approach, the resulting data looks correct, but I don't know to what extent it might cause problems in future uses of the data. Is it problematic to do something like this? In terms of table metadata, do I lose anything by doing so (could the change to the _SUCCESS file inside each partition cause conflicts?)? And am I mistaken to think that this is an efficient solution, considering that I can parallelize the process across files using multiple threads/cores?
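For reference, the parallelism I have in mind is nothing more than mapping the per-file step over the list of files with a thread pool (`process_file` being a hypothetical helper that does the move/transform/move for one file):

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(paths, process_file, max_workers=16):
    # Each parquet file is independent, so the per-file step can simply
    # be mapped over the file list; S3 I/O is network-bound, so threads
    # (rather than processes) should be enough.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any worker exception surfaces here
        list(pool.map(process_file, paths))
```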