Fastest way to find SQL intersection between 2 subsections of a huge dataset?

I have a gigantic indexed (104 million row) table of docs, pages and words:

CREATE TABLE my_table (
    doc_id integer,
    page_id integer,
    word_id integer
);

CREATE INDEX my_table_word ON my_table (word_id);
CREATE INDEX my_table_doc ON my_table (doc_id);
CREATE INDEX my_table_page my_table (page_id);

I want to find pages within the same documents that have both word A and word B. My current attempts are as follows:

Attempt 1 – aggregate things:

SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2 
having count(distinct word_id) = 2

-- ~39k row result, took 20 seconds

Attempt 2) with CTEs, marginally faster

with foo as (
    select doc_id, page_id
    from my_table
    where word_id = 123 -- foo -- 44k rows
),

bar as (
    select doc_id, page_id
    from my_table
    where word_id = 456 -- bar -- 439k rows
)

select f.doc_id, f.page_id
from foo f
inner join bar b on f.doc_id = b.doc_id and f.page_id = b.page_id

-- same results, takes 15 seconds

Attempt 3) – doing an INTERSECT between the two CTEs is exactly the same 15 seconds, probably same query plan.

Is there a faster way to do this? I’m hoping to get this down to < 1 second for a web app with somewhat impatient users.

Try a self join:

SELECT DISTINCT doc_id, page_id
FROM my_table AS a
JOIN my_table AS b USING(doc_id, page_id)
WHERE b.word_id = 456
  AND a.word_id = 123

With an index

CREATE INDEX ON my_table (word_id, page_id, doc_id);

which is a covering index, allowing an index-only query.

A basic intersect seems to perform fine:
_{demo at db<>fiddle}

select doc_id, page_id
from my_table
where word_id = 456
INTERSECT
select doc_id, page_id
from my_table
where word_id = 123;

In my test on 800k docs, 100 pages each, using 1k words, it runs about as fast as your second example, while being simpler and duplicate-free because it defaults to INTERSECT DISTINCT.

You can also use an exists:

select distinct doc_id, page_id
from my_table as a
where word_id = 456
  and exists(select from my_table as b 
             where a.doc_id=b.doc_id
               and a.page_id=b.page_id
               and word_id = 123);

And that runs about just as fast.

Still, you were right to try and find everything you needed in a single pass, a single query, using aggregation and having:

select doc_id, page_id
from my_table
where word_id in(456,123)
group by doc_id, page_id
having bool_or(word_id=456) 
   and bool_or(word_id=123);

The two bool_or()‘s check if any of the words on the page is your first word, and the same for the other one.

As suggested by @Bohemian, a fully covering index can speed things up significantly:

CREATE INDEX my_table_word_page_doc ON my_table (word_id,doc_id,page_id);

SELECT DISTINCT doc_id, page_id
FROM my_table a
JOIN my_table b using (doc_id,page_id)
WHERE b.word_id = 456
  AND a.word_id = 123

Besides the answers about how to write the SQL, you may be able to get a speedup by turning your indexes into covering indexes.

CREATE INDEX my_table_word ON my_table (word_id, page_id, doc_id);

For example, with this simple query:

SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2

Since all the values involved are in the index, the DB should not have to go read the rows from the table to get the values of doc_id and page_id because they are in the index.

How well this works undoubtedly differs from RDBMS to RDBMS.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: Kiến thức lập trình - @ 08:49

Thẻ: postgresqlindexingquery-optimization

Thiết kế website giá rẻ

Danh mục

Fastest way to find SQL intersection between 2 subsections of a huge dataset?