I have a gigantic indexed (104 million row) table of docs, pages and words:
CREATE TABLE my_table (
doc_id integer,
page_id integer,
word_id integer
);
CREATE INDEX my_table_word ON my_table (word_id);
CREATE INDEX my_table_doc ON my_table (doc_id);
CREATE INDEX my_table_page my_table (page_id);
I want to find pages within the same documents that have both word A and word B. My current attempts are as follows:
Attempt 1 – aggregate things:
SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2
having count(distinct word_id) = 2
-- ~39k row result, took 20 seconds
Attempt 2) with CTEs, marginally faster
with foo as (
select doc_id, page_id
from my_table
where word_id = 123 -- foo -- 44k rows
),
bar as (
select doc_id, page_id
from my_table
where word_id = 456 -- bar -- 439k rows
)
select f.doc_id, f.page_id
from foo f
inner join bar b on f.doc_id = b.doc_id and f.page_id = b.page_id
-- same results, takes 15 seconds
Attempt 3) – doing an INTERSECT
between the two CTEs is exactly the same 15 seconds, probably same query plan.
Is there a faster way to do this? I’m hoping to get this down to < 1 second for a web app with somewhat impatient users.
5
Try a self join:
SELECT DISTINCT doc_id, page_id
FROM my_table AS a
JOIN my_table AS b USING(doc_id, page_id)
WHERE b.word_id = 456
AND a.word_id = 123
With an index
CREATE INDEX ON my_table (word_id, page_id, doc_id);
which is a covering index, allowing an index-only query.
9
A basic intersect
seems to perform fine:
demo at db<>fiddle
select doc_id, page_id
from my_table
where word_id = 456
INTERSECT
select doc_id, page_id
from my_table
where word_id = 123;
In my test on 800k docs, 100 pages each, using 1k words, it runs about as fast as your second example, while being simpler and duplicate-free because it defaults to INTERSECT DISTINCT
.
You can also use an exists
:
select distinct doc_id, page_id
from my_table as a
where word_id = 456
and exists(select from my_table as b
where a.doc_id=b.doc_id
and a.page_id=b.page_id
and word_id = 123);
And that runs about just as fast.
Still, you were right to try and find everything you needed in a single pass, a single query, using aggregation and having
:
select doc_id, page_id
from my_table
where word_id in(456,123)
group by doc_id, page_id
having bool_or(word_id=456)
and bool_or(word_id=123);
The two bool_or()
‘s check if any of the words on the page is your first word, and the same for the other one.
As suggested by @Bohemian, a fully covering index can speed things up significantly:
CREATE INDEX my_table_word_page_doc ON my_table (word_id,doc_id,page_id);
SELECT DISTINCT doc_id, page_id
FROM my_table a
JOIN my_table b using (doc_id,page_id)
WHERE b.word_id = 456
AND a.word_id = 123
Besides the answers about how to write the SQL, you may be able to get a speedup by turning your indexes into covering indexes.
CREATE INDEX my_table_word ON my_table (word_id, page_id, doc_id);
For example, with this simple query:
SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2
Since all the values involved are in the index, the DB should not have to go read the rows from the table to get the values of doc_id
and page_id
because they are in the index.
How well this works undoubtedly differs from RDBMS to RDBMS.
2