I have a PySpark join:
dfnew = dfnew.join(
    df,
    on=(
        (f.col('id1') == f.col('id2'))
        & (f.col('time').between(f.col("StartTime"), f.col("endTime")))
        & (f.col('tag1') == f.col('tag2'))
    ),
    how="inner",
)
When I run this query, Databricks generates a message suggesting that I enable the range join optimization to improve performance.
So I changed the code as follows:
dfnew = dfnew.join(
    df.hint("range_join", 60*60),
    on=(
        (f.col('id1') == f.col('id2'))
        & (f.col('time').between(f.col("StartTime"), f.col("endTime")))
        & (f.col('tag1') == f.col('tag2'))
    ),
    how="inner",
)
I used df.hint("range_join", 60*60) because I know I am extracting close to one hour of data and the time columns are timestamps, but the documentation is not particularly clear to me.
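For reference, here is a minimal, self-contained sketch of how I currently read the hint. The toy data, column values, and the events/intervals/joined names are made up for illustration; my assumption, which I would like confirmed, is that the second argument to the hint is the bin size and that for timestamp columns it is interpreted in seconds (so 60*60 means one-hour bins).

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Toy "event" rows with a single timestamp per row.
events = (
    spark.createDataFrame(
        [(1, 'A', '2023-01-01 10:15:00')],
        ['id1', 'tag1', 'time'],
    )
    .withColumn('time', f.col('time').cast('timestamp'))
)

# Toy "interval" rows with start/end timestamps.
intervals = (
    spark.createDataFrame(
        [(1, 'A', '2023-01-01 10:00:00', '2023-01-01 11:00:00')],
        ['id2', 'tag2', 'StartTime', 'endTime'],
    )
    .withColumn('StartTime', f.col('StartTime').cast('timestamp'))
    .withColumn('endTime', f.col('endTime').cast('timestamp'))
)

# My assumption: the second hint argument is the bin size, and for
# timestamp range conditions it is in seconds, so 60*60 = 1-hour bins.
joined = events.join(
    intervals.hint('range_join', 60 * 60),
    on=(
        (f.col('id1') == f.col('id2'))
        & (f.col('time').between(f.col('StartTime'), f.col('endTime')))
        & (f.col('tag1') == f.col('tag2'))
    ),
    how='inner',
)

joined.explain()  # I look for a range-join operator in the physical plan here.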
Is this the correct way to enable the range join optimization?