I have two columns: `column1` with two distinct values (0, 1) and `column2` with three distinct values ('A', 'B', 'C'). `column1` is always 0 when `column2` is 'A' or 'B', but when `column2` is 'C' the distribution of `column1` is (0: 15%, 1: 85%). There is also a third column, `departmentid`: in some departments the distribution of `column1` for `column2 = 'C'` is (0: 0%, 1: 100%). So for some departments, when I run a query like
```sql
SELECT * FROM mytable
WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0
ORDER BY id LIMIT 10;
```
PostgreSQL chooses an index scan on `id` and estimates ~25,000 matching rows, but in fact no rows in the table match this filter. The query therefore walks the whole table via the `id` index, which takes far too long. If the planner chose a bitmap scan instead, it would be about 50x faster (based on other queries against this table). Indexes exist on all of these columns.
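For reference, this is how I inspect the plan and the misestimate (names as in the minimal example above):

```sql
-- Shows the chosen plan, the estimated vs. actual row counts,
-- and the buffer usage that makes the full index walk visible.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM mytable
WHERE departmentid = 42 AND column2 = 'C' AND column1 = 0
ORDER BY id LIMIT 10;
```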
I have two questions:
- I created an extended statistics object on `column1`, `column2`, and `departmentid`, but the two dependencies I care about are missing: `column2 → column1` and `(column2, departmentid) → column1`. Why? In a statistics object on just `column1` and `column2`, `dependencies` is null. I did run `ANALYZE`, of course.
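This is roughly what I ran (the statistics object name is mine; everything else matches the columns above):

```sql
-- Extended statistics over the three columns involved in the filter.
CREATE STATISTICS mytable_dep_stats (dependencies)
    ON column1, column2, departmentid FROM mytable;
ANALYZE mytable;

-- Checking the result: the dependencies column comes back null
-- for the (column1, column2) statistics object.
SELECT statistics_name, dependencies
FROM pg_stats_ext
WHERE tablename = 'mytable';
```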
- How can I speed up this query? Is there a way to do it without creating a dedicated four-column index on `(departmentid, column2, column1, id)`? The real production query has many other filters and different `ORDER BY` clauses (but the problem reproduces with this minimal query).
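For clarity, the dedicated index I would prefer to avoid (index name assumed) would be:

```sql
-- One such index would be needed per filter/order combination in production,
-- which is why I am looking for an alternative.
CREATE INDEX mytable_dept_c2_c1_id_idx
    ON mytable (departmentid, column2, column1, id);
```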
PostgreSQL 16.1, autovacuum and autoanalyze run daily, ~3 million rows in the table.