Every day I generate and save a dataset partitioned by the field 'Hotel', and I would like to know whether reading a partitioned dataset is slower than reading the same data without partitions.
Let's say that my main Parquet file (without partitioning) is:
- main_bookings.parquet
I also have, as I said, the same data saved with its partitions. For example, some of the partitions are:
- main_bookings_partitioned.parquet/Hotel=Barcelona_1
- main_bookings_partitioned.parquet/Hotel=Barcelona_2
- main_bookings_partitioned.parquet/Hotel=Barcelona_3
- main_bookings_partitioned.parquet/Hotel=Madrid_1
- main_bookings_partitioned.parquet/Hotel=…
Specifically, from Databricks and using PySpark, is reading main_bookings.parquet (without partitions) much faster than reading main_bookings_partitioned.parquet (the same data, but partitioned)?
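To make the comparison concrete, here is a minimal sketch of how the two reads could be timed. It assumes a Databricks notebook where the `spark` session is already defined; the helper name `time_read` is hypothetical, and the two paths are the relative names used above, which would need to be replaced with the real locations. Note that `count()` may not read every column, so this only gives a rough indication.

```python
import time


def time_read(spark, path, hotel=None):
    """Read a Parquet dataset, optionally filter on Hotel, and time a count().

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    Returns a tuple (row_count, elapsed_seconds).
    """
    start = time.perf_counter()
    # For the partitioned layout, Spark discovers the Hotel=... directories
    # here and exposes Hotel as a regular column.
    df = spark.read.parquet(path)
    if hotel is not None:
        # On the partitioned dataset this filter can be answered by reading
        # only the matching Hotel=... directory (partition pruning).
        df = df.filter(df["Hotel"] == hotel)
    rows = df.count()  # action that triggers the actual read
    return rows, time.perf_counter() - start
```

It could be called once per layout, with and without a filter, for example `time_read(spark, "main_bookings.parquet")`, `time_read(spark, "main_bookings_partitioned.parquet")`, and `time_read(spark, "main_bookings_partitioned.parquet", hotel="Barcelona_1")`.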
I would like an answer that also explains why any difference in read speed occurs.
Thank you in advance.