I'm trying to optimize queries over a large number of Parquet files; most of the files are hundreds of MB in size. They contain the fields source_ip and destination_ip, among others, and my query searches for a specific IP in either of the two fields.
I have two sets of Parquet files, written differently, so I can compare query times.
Set A:
spark.sql("set spark.sql.files.maxRecordsPerFile=6000000")
df.coalesce(80).write.mode('append').option("parquet.block.size",134217728).parquet("set_A")
Set B:
spark.sql("set spark.sql.files.maxRecordsPerFile=6000000")
spark.sql("set spark.parquet.enabled.dictionary=true")
df.coalesce(20).sortWithinPartitions("source_ip").write.mode('append').option("parquet.block.size",134217728).parquet("set_B")
Query:
from pyspark.sql.functions import col
df_result = df.filter((col('source_ip') == "1.1.1.1") | (col('destination_ip') == "1.1.1.1"))
Querying Set A took about 75 seconds and querying Set B took about 60 seconds. If I query Set B on source_ip only, it takes just 2 seconds.
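For reference, this is roughly how I run and time the query against each set (a sketch: the time()/count() measurement is just how I benchmark, spark is the active SparkSession, and the paths match the write commands above):

from time import time
from pyspark.sql.functions import col

df = spark.read.parquet("set_B")  # or "set_A"

start = time()
df_result = df.filter((col('source_ip') == "1.1.1.1") | (col('destination_ip') == "1.1.1.1"))
df_result.count()  # force an action so the scan and filter actually execute
print("query took %.1f seconds" % (time() - start))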
My problem is that I need to query on both source_ip and destination_ip. Also, the choice to sortWithinPartitions() on source_ip was purely for the performance test; I could just as well have sorted by destination_ip.
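For illustration, the mirrored write sorted on destination_ip would look like this (a sketch based on the Set B command above; the output path set_B_dst is made up):

df.coalesce(20).sortWithinPartitions("destination_ip").write.mode('append').option("parquet.block.size",134217728).parquet("set_B_dst")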
Preferably, I would like to achieve the same kind of 2-second query speed on both fields. Is there a way to do that?
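For example, is keeping two sorted copies and splitting the predicate the only option? Something like this (a sketch using the set_B and set_B_dst paths from above; rows where both fields match the IP would need deduplication):

from pyspark.sql.functions import col

df_src = spark.read.parquet("set_B")      # copy sorted by source_ip
df_dst = spark.read.parquet("set_B_dst")  # copy sorted by destination_ip

df_result = df_src.filter(col('source_ip') == "1.1.1.1") \
    .unionByName(df_dst.filter(col('destination_ip') == "1.1.1.1")) \
    .dropDuplicates()

That doubles the storage, though, so I'm hoping there is a cleaner way.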