I am working with two tables in BigQuery:
- qt: A small table (~a few thousand records) with a single string column called rowKey.
- tt: A massive external table (~5 TB) from Bigtable, accessed through an external table connection in BigQuery.

My goal is to query tt to fetch only the rows whose rowKey matches a value in qt.rowKey.
Here’s what I’ve tried so far:
Attempt 1: Using a subquery with IN
```sql
SELECT t.*
FROM tt t
WHERE t.rowKey IN (SELECT rowKey FROM qt)
```
Issue: This query runs indefinitely and does not finish.
Attempt 2: Using a LEFT JOIN
```sql
SELECT t.*
FROM qt q
LEFT JOIN tt t ON t.rowKey = q.rowKey
```
Issue: This query also takes an extremely long time and does not finish.
Attempt 3: Hardcoding a small subset of keys
```sql
SELECT t.*
FROM tt t
WHERE t.rowKey IN ("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
```
Result: This query runs successfully in a few seconds.
It seems like the IN clause and the JOIN both cause a full table scan of the external table (tt); when I hardcode a small set of rowKey values, the filter is evidently applied efficiently and the query performs as expected.
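Based on the result of Attempt 3, the only workaround I can think of is to build the literal key list from qt with BigQuery scripting and run the filter as dynamic SQL. A minimal sketch of that idea (untested; it assumes the keys contain no double quotes, and qt/tt stand in for the fully qualified table names):

```sql
-- Sketch of a possible workaround: turn qt.rowKey into a literal list,
-- then inject it so tt sees a constant filter, as in Attempt 3.
DECLARE key_list STRING;

-- Concatenate the keys from qt into a quoted, comma-separated list.
-- Assumes rowKey values contain no double quotes (FORMAT does not escape).
SET key_list = (
  SELECT STRING_AGG(FORMAT('"%s"', rowKey), ', ')
  FROM qt
);

-- Run the query with the keys inlined as literals.
EXECUTE IMMEDIATE FORMAT("""
  SELECT t.*
  FROM tt t
  WHERE t.rowKey IN (%s)
""", key_list);
```

I haven't verified whether this actually avoids the full scan, or whether an IN list with a few thousand literals is even practical, so I'd rather understand the proper approach.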
I want to efficiently query tt for the rowKey values present in qt.rowKey without causing a full scan of the massive tt table.
How can I optimize this query to avoid a full table scan on the tt external table? Are there specific techniques or best practices for querying Bigtable data in BigQuery with filters from a smaller table?