I have two parquet datasources in hadoop both of which have runtime filters on them. I have a query which joins some of the data from each table together into a results set. Each table has an overlap of runtime filters for example runtime_partition1 = 0
exists on each table and means the same thing.
I join these two tables together in my query using an inner join therefore I don’t have to have WHERE runtime_partition1 = 0
set in both WHERE
clauses to return the correct set of data.
My question is, although I don’t have to have the WHERE
clause set on both tables, is it beneficial to specifiy it in my query on both tables? As it is also a clause on a runtime filter will there be any performance gain from leaving it in potentially?
In my particular case I did not observe a performance benefit from leaving it in, but the underlying systems are complicated and there are layers of caching so it is hard to get a “true” baseline here. This question is also largely for curiosity and to improve my understanding. My expectation is that the query planner basically knows the runtime filters, and partition keys of the data sources that have been set, it sees the predicates set and the join type and therefore optimises accordingly.