I'm trying to optimize query performance for SQL queries over parquet files in Azure Synapse Analytics (the files are written by a PySpark job). My data set is billions of records, so any bit of performance I can get is great. My basic question: does the columnar storage of parquet really help me with a WHERE clause on Year, or must I use /Year=2023 in the OPENROWSET path to get that real performance boost?
Details
From the Spark job that creates the partitioned parquet files, I calculate two fields relevant to this post: Year and event_year. The calculation for both is the same: year(event_date). I create the field twice because when I save the parquet files I use partitionBy on Year (as well as Month and Day), and partitionBy removes the partition columns from the files themselves; event_year is my way of keeping the value inside the actual output. A sketch of the write is below.
The output goes to /myFiles/events/**Year**=XXXX/Month=XX/Day=XX
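For reference, a minimal sketch of the write (the events_raw DataFrame name, the write mode, and the abfss:// URL are placeholders; only the duplicated event_year and the partitionBy on Year/Month/Day reflect my actual job):

```python
from pyspark.sql import functions as F

# events_raw is a placeholder DataFrame from earlier steps in the job;
# event_date is assumed to be a date/timestamp column.
events = (
    events_raw
    .withColumn("event_year", F.year("event_date"))  # stays inside the files
    .withColumn("Year", F.year("event_date"))        # consumed by partitionBy
    .withColumn("Month", F.month("event_date"))
    .withColumn("Day", F.dayofmonth("event_date"))
)

# partitionBy moves Year/Month/Day into the folder structure
# (.../Year=2023/Month=07/Day=15/...) and drops them from the parquet
# files themselves, which is why event_year is duplicated above.
(
    events.write
    .mode("append")  # illustrative; overwrite in a backfill scenario
    .partitionBy("Year", "Month", "Day")
    .parquet("abfss://container@account.dfs.core.windows.net/myFiles/events")
)
```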
When I query the data with OPENROWSET I can obviously specify directories if I want a particular year, but my goal is to create a single view in the Azure Synapse Analytics serverless pool, called dbo.events, so that users can simply query WHERE Year = XXXX and still be as efficient as possible.
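This is the shape of view I have in mind. It's a sketch, not my exact definition: the storage URL is a placeholder, and since partitionBy stripped Year out of the files, I'm assuming the serverless pool's filepath() function over a wildcarded path is the way to surface the partition folders as columns:

```sql
-- Sketch only: the storage URL is a placeholder. filepath(N) returns the
-- text matched by the Nth * wildcard in the BULK path, so filepath(1) is
-- the value of the Year= folder.
CREATE VIEW dbo.events AS
SELECT
    r.*,
    CAST(r.filepath(1) AS INT) AS [Year],
    CAST(r.filepath(2) AS INT) AS [Month],
    CAST(r.filepath(3) AS INT) AS [Day]
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/mycontainer/myFiles/events/Year=*/Month=*/Day=*/*.parquet',
    FORMAT = 'PARQUET'
) AS r;
```

Whether a filter on a column derived this way actually prunes folders, versus a filter on event_year stored inside the files, is part of what I'm trying to pin down.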
I have years ranging from 2020 through the current year; 2023 and 2024 are the previous and current years at the time of this post.
I come from a traditional SQL world where an index on Year would guarantee me what I'm looking for. However, in this serverless pool, querying dbo.events WHERE event_year = 2023 doesn't seem to perform any better than the OPENROWSET method with **Year=2023** hard-coded in the path (/myFiles/events/**Year=2023**).
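Concretely, these are the two query shapes I'm comparing (the storage URL is again a placeholder):

```sql
-- 1) Filtering on a column stored inside the parquet files, via the view:
SELECT COUNT(*) FROM dbo.events WHERE event_year = 2023;

-- 2) Hard-coding the Year partition folder in the OPENROWSET path:
SELECT COUNT(*)
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/mycontainer/myFiles/events/Year=2023/Month=*/Day=*/*.parquet',
    FORMAT = 'PARQUET'
) AS r;
```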
So my question is: does the columnar storage of parquet really help me with WHERE event_year = 2023, or must I use Year=2023 in the OPENROWSET file path to get that real performance boost?
Thanks!