Suppose I have the following 2 tables:
Table1:
date | symbol | time | value |
---|---|---|---|
20090101 | A | 0930 | 1 |
20090101 | A | 0931 | 2 |
20090101 | A | 0932 | 5 |
20090101 | B | 0930 | 1 |
20090101 | B | 0931 | 6 |
20090101 | B | 0932 | 3 |
20090102 | A | 0930 | 7 |
20090102 | A | 0931 | 9 |
20090102 | A | 0932 | 4 |
20090102 | B | 0930 | 8 |
20090102 | B | 0931 | 2 |
20090102 | B | 0932 | 5 |
Table2:
date | symbol | starttime | endtime |
---|---|---|---|
20090101 | A | 0930 | 0931 |
20090101 | B | 0931 | 0932 |
20090102 | A | 0932 | 0932 |
The goal is, for each row of Table2, to sum the values in Table1 whose time falls within [starttime, endtime] for the matching date and symbol.
In the above example, the output would be:
date | symbol | starttime | endtime | agg_value |
---|---|---|---|---|
20090101 | A | 0930 | 0931 | 3 |
20090101 | B | 0931 | 0932 | 9 |
20090102 | A | 0932 | 0932 | 4 |
In Table1, (date, symbol, time) is unique across rows. In Table2, (date, symbol) can contain duplicates.
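For concreteness, here is a minimal schema sketch in standard SQL. The table names, column types, and the choice of a SQL database are assumptions; the uniqueness constraint mirrors the description above.

```sql
-- Illustrative schema only; types are assumptions.
-- Dates are stored as integers (e.g. 20090101) and times as
-- zero-padded strings (e.g. '0930') so that range comparisons work.
CREATE TABLE table1 (
    date    INT     NOT NULL,
    symbol  VARCHAR(16) NOT NULL,
    time    CHAR(4) NOT NULL,
    value   INT     NOT NULL,
    PRIMARY KEY (date, symbol, time)  -- unique across rows, per the description
);

CREATE TABLE table2 (
    date      INT     NOT NULL,
    symbol    VARCHAR(16) NOT NULL,
    starttime CHAR(4) NOT NULL,
    endtime   CHAR(4) NOT NULL
    -- (date, symbol) may repeat, so no uniqueness constraint here
);
```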
A naïve approach would be to do a right outer join of Table1 and Table2 on (date, symbol), remove rows where time falls outside of [starttime, endtime], and then aggregate; a sketch of this is shown below. The issue is that Table1 is very large (a few billion rows) and Table2 can contain 50 million rows, so the intermediate join result becomes enormous.
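The sketch below writes out that naïve approach in standard SQL; the table names follow the schema sketch above and are assumptions, not an existing implementation.

```sql
-- Naïve approach: right outer join Table1 to Table2 on (date, symbol),
-- drop rows whose time falls outside [starttime, endtime], then aggregate
-- per Table2 interval.  The time filter in WHERE effectively turns the
-- outer join into an inner join.
SELECT t2.date,
       t2.symbol,
       t2.starttime,
       t2.endtime,
       SUM(t1.value) AS agg_value
FROM table1 AS t1
RIGHT OUTER JOIN table2 AS t2
  ON  t1.date   = t2.date
  AND t1.symbol = t2.symbol
WHERE t1.time BETWEEN t2.starttime AND t2.endtime
GROUP BY t2.date, t2.symbol, t2.starttime, t2.endtime;
```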
What would be an optimal query to accomplish this?