I’m new to Azure Databricks.
Using an Azure Databricks notebook, I’m reading data from a Parquet file into a DataFrame (Bronze layer), then doing some transformations such as date formatting.
Now I want to save this data back to the Bronze layer. For this I have two options:
df.write.format("delta").save("/path/to/output")
df.write.format("parquet").save("/path/to/output")
- How do I choose which one to use?
- Do I need to specify the mode as overwrite, or is that the default?
- If I want to store this data in the Silver layer, which option is suitable: Parquet or Delta?
I tried the approach below.
To the Bronze Layer:
bronze_delta_path = "abfss://[email protected]/bronze/delta-table"
df_transformed.write.format("delta").mode("overwrite").save(bronze_delta_path)
To the Silver Layer:
silver_delta_path = "abfss://[email protected]/silver/delta-table"
df_transformed.write.format("delta").mode("overwrite").save(silver_delta_path)
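Note that overwrite is not the default: if you omit .mode(...), Spark uses errorifexists, so a re-run fails once the path already contains data. A minimal illustration, reusing the Bronze path above:

# Default mode is "errorifexists": this raises an error if the path already holds data
df_transformed.write.format("delta").save(bronze_delta_path)

# "overwrite" must be set explicitly to replace the existing data
df_transformed.write.format("delta").mode("overwrite").save(bronze_delta_path)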
In the above, I wrote to ADLS as a Delta table, because Delta gives you ACID transactions, schema enforcement, and time travel capabilities.
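For example, time travel lets you read the table as it looked at an earlier version. A minimal sketch, where versionAsOf 0 is just an illustrative version number:

from delta.tables import DeltaTable

# Read the Bronze table as of an earlier version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(bronze_delta_path)

# Inspect the table's write history (overwrites, merges, etc.)
DeltaTable.forPath(spark, bronze_delta_path).history().show()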
I agree with @Ganesh Chandrasekaran
If you have frequent updates, deletes, or complex transformations, Delta Lake is the better choice: a single MERGE statement takes care of both inserts and updates, so it covers append- and overwrite-style loads in one operation. That makes it well suited for saving data into the Silver layer; a sketch follows below.
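A minimal sketch of such a MERGE into the Silver table, assuming an id key column exists in both the source DataFrame and the target table:

from delta.tables import DeltaTable

silver_table = DeltaTable.forPath(spark, silver_delta_path)

(silver_table.alias("target")
    .merge(df_transformed.alias("source"), "target.id = source.id")  # match rows on the key
    .whenMatchedUpdateAll()       # update rows that already exist
    .whenNotMatchedInsertAll()    # insert rows that are new
    .execute())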
Reference: Delta Lake vs. Parquet Comparison
Delta should be the go-to format: under the hood it stores the data as Parquet anyway, but adds a transactional layer on top that gives you schema evolution, ACID transactions, and time travel.
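For instance, schema evolution can be enabled per write with the mergeSchema option, so an append whose data contains new columns succeeds instead of failing. A sketch reusing the Silver path from the answer above:

(df_transformed
    .write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # allow new columns from the incoming data
    .save(silver_delta_path))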