Change to Postgres indexing scheme for IS NOT NULL?

I have a Postgres table that has numerous columns which frequently show up in the where clause of select queries. The table has been indexed accordingly, with indexes on all of these columns (mostly single-column indexes, but with some composite indexes thrown in). However, there is one new kind of query that this indexing isn’t fully supporting: queries with a deleted_at is null condition in the where clause (we soft-delete records using this column). Some queries with this are running very slowly despite all of the other columns they use being indexed. Naturally, I want to find a way to improve these queries with a change to our indexing.

An example would be:

<code>select count(distinct user_id)

from my_table

where group_id = '123' and deleted_at is null

</code>

<code>select count(distinct user_id) from my_table where group_id = '123' and deleted_at is null </code>

select count(distinct user_id)
from my_table
where group_id = '123' and deleted_at is null

In this example, both user_id and group_id are indexed. Without and deleted_at is null, the query runs quickly. With it, slowly.

I have four competing solutions. I plan on testing them, but I really want to know if any old hands are able to look at this situation and have a simple explanation for why one should be expected to perform better than the other. I’m just getting the hang of thinking about indexing after being spoiled by Snowflake for years so I’m really looking for how one would reason about this.

My solutions:

An index on the expression (docs) deleted_at is null. Basically: CREATE INDEX deleted_at_is_null ON my_table ((deleted_at is null));. This is the simplest solution. It’s just one more index, and one with a clear purpose. I’m not sure, though, if it should actually be expected to help in queries where we have other indexed columns in the where clause! Can Postgres use them separately or do they need to be composite?
Replace each of the current indexes (like the ones on user_id and group_id above) with composite indexes on that column, plus deleted_at is null.
Same as 2, but instead of replacing the indexes, add the composite indexes in addition to the currently-existing indexes. This feels wrong and redundant, but I am not sure.
Add a new partial index for each of the currently-existing indexes with a where deleted_at is not null condition. Like number 3, this feels like too many indexes.

I’m assuming that an index on deleted_at itself is overkill since I never need to query for specific ranges/values of it – only for whether it is null. Please correct me if I am wrong, though!

One other thing to note is that the vast majority of the records have null deleted_at.

Any help would be much appreciated! Just looking for some intuition and best practices around this problem.

New contributor

You can see these tested on db<>fiddle here.

Adding just a plain index on my_table((deleted_at is null)) would likely speed up queries using only that exact expression – you are not wrong about that. It’s cheaper to use the info from another one it just scanned to jump to heap and check deleted_at there directly, rather than search this index as an addition, intersect which parts of the heap they both point at, and only then do the jump.
Adding (deleted_at is null) to your other indexes is simple, cheap and effective. It matters whether you place it on the first or last position, though. If it leads, you might get just an index scan, but a more thorough one. If it follows, you’ll get a bitmap index scan with a re-check on the heap, but with high enough selectivity, it might finish faster, especially since you need the user_id from the heap anyways.
Duplicating the indexes just so that each one has its expanded version added separately would only make sense if this addition weighed those one-column indices down significantly, which it does not. It’s just one more junction on top/bottom of the tree and other queries that don’t care about deleted_at will not mind that it’s there.
Adding the predicate indexes without removing the regular ones does make sense because queries that mismatch the predicate disqualify that index, so keeping the regulars offers something to fall back on.

As an addition to your list:

Replace the indexes you have with their partial versions, if your queries always want the (deleted_at is null). If you rarely ever query for deleted_at is not null and/or you can afford a seq scan in those cases, you can stop there. Otherwise, it could make sense to replace the base index with one that uses the opposite predicate – that way you’re covered in either case. That means setting both of these up:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>create index idx4_a on my_table(group_id)where(deleted_at is null);
create index idx4_b on my_table(group_id)where(deleted_at is not null);
</code>
<code>create index idx4_a on my_table(group_id)where(deleted_at is null); create index idx4_b on my_table(group_id)where(deleted_at is not null); </code>
```
create index idx4_a on my_table(group_id)where(deleted_at is null);
create index idx4_b on my_table(group_id)where(deleted_at is not null);
```
A silly quirk of that is when you’re not specifying whether deleted_at should be null or not and you don’t have a non-predicate index to fall back on, you should actually specify it could be either, so this:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>(deleted_at is not null or deleted_at is null)
</code>
<code>(deleted_at is not null or deleted_at is null) </code>
```
(deleted_at is not null or deleted_at is null)
```
rather than omitting that entirely, even though these conditions cancel each other out. The reason is that the planner is matching your conditions against the predicates so listing them convinces the planner it can run a BitmapOr scanning both. Omitting the column causes a seq scan instead if it’s missing the base index at that point.
The two indexes combined are as small as the base index was and they can cover all 3 cases (null, not null, unspecified), although the last one’s a bit inconvenient and slightly slower than the base one.
Don’t add (deleted_at is null) as either a key column or a partial index predicate but rather strap deleted_at on as payload using include.

The former is a missing combination of those you already established, the latter could work sort of against what the documentation clearly says about non-key column inclusion:

A non-key column cannot be used in an index scan search qualification.

And it is not used in the qualification, but it is used in the scan, speeding things up by saving a whole subsequent heap scan. If you just add deleted_at, Postgres still prefers a plain index on group_id, then a re-check on the heap because it needs to consult both deleted_at as well as user_id it’s looking for.
If you add both as payload:

<code>create index on my_table(group_id)include(user_id,deleted_at);

</code>

<code>create index on my_table(group_id)include(user_id,deleted_at); </code>

create index on my_table(group_id)include(user_id,deleted_at);

Everything is in the index. Now Postgres sees deleted_at is in the index it’s already using, so both the output and the filter can re-use that:
_{demo at db<>fiddle}

QUERY PLAN
Aggregate (cost=4.50..4.51 rows=1 width=8) (actual time=0.071..0.072 rows=1 loops=1)
Output: count(DISTINCT user_id)
-> Index Only Scan using idx2 on public.my_table (cost=0.42..4.49 rows=1 width=4) (actual time=0.055..0.057 rows=2 loops=1)
Output: group_id, user_id, deleted_at
Index Cond: (my_table.group_id = 123)
Filter: (my_table.deleted_at IS NULL)
Rows Removed by Filter: 2
Heap Fetches: 0
Planning Time: 0.108 ms
Execution Time: 0.095 ms

That’s on 100k random group_id‘s and user_id‘s spread over 300k rows with 20% deleted_at IS NULL.

While an inert column in the index simplifies its traversal compared to adding stuff as additional key columns, include doesn’t support expressions, so it might actually get larger than a version with an expression on second position. For not- null the whole timestamp gets pulled in there.
This sort of configuration only works if you set up these indexes to contain everything you’re querying, making them covering indexes. This might be a problem with many, especially wide columns, or even a few columns if they’re wide enough.
You might want to look for more combinations of columns that are queried together or can at least re-use the same index. It’s even possible to use partials to speed up statements that reference the opposite of the predicate and none of its indexed columns.
The tests I did are only a simple reference. You should run pgbench on something mirroring your actual cluster config and data characteristics.

The alternative to adding deleted_at to your other indexes is to partition your table on whether deleted_at is set. This will create two tables which appear to be one: one table has deleted rows, the other has active rows. If you include where deleted_at is null in a query PostgreSQL will simply query the appropriate partition and PostgreSQL can treat the query as if deleted_at wasn’t there at all. It also makes it more efficient to remove “deleted” rows without blocking other queries.

There are caveats. First, you can’t partition an existing table. You have to make a new table, partition it, and copy your data over.

Second, if you have a primary key the partition key has to part of it. And nulls aren’t allowed in primary keys. To get around this, we’ll only put a primary key on the active partition.

For convenience and safety, I’ll create a view which only selects non-deleted rows.

<code>-- Move the old table out of the way.

alter table things rename to things_original;

-- Create the new partitioned table

-- It has no primary key, that is handled by the partitions.

create table things_all (

id serial,

name text not null,

deleted_at timestamp default null

) partition by list(deleted_at);

-- Make a partition for the active rows.

create table things_active partition of things_all for values in (null);

-- Add a primary key

alter table things_active add primary key(id);

-- And one for the inactive rows.

create table things_deleted partition of things_all default;

-- Copy rows from the old table. They'll automatically be partitioned.

insert into things_all select * from things_original;

-- Add your indexes to things_active (not shown)

-- Create a view to only get active rows.

create view things as

select *

from things_all

where deleted_at is null;

-- This will only query the things_active table, no deleted_at index required.

select * from things;

</code>

<code>-- Move the old table out of the way. alter table things rename to things_original; -- Create the new partitioned table -- It has no primary key, that is handled by the partitions. create table things_all ( id serial, name text not null, deleted_at timestamp default null ) partition by list(deleted_at); -- Make a partition for the active rows. create table things_active partition of things_all for values in (null); -- Add a primary key alter table things_active add primary key(id); -- And one for the inactive rows. create table things_deleted partition of things_all default; -- Copy rows from the old table. They'll automatically be partitioned. insert into things_all select * from things_original; -- Add your indexes to things_active (not shown) -- Create a view to only get active rows. create view things as select * from things_all where deleted_at is null; -- This will only query the things_active table, no deleted_at index required. select * from things; </code>

-- Move the old table out of the way.
alter table things rename to things_original;

-- Create the new partitioned table
-- It has no primary key, that is handled by the partitions.
create table things_all (
  id serial,
  name text not null,
  deleted_at timestamp default null
) partition by list(deleted_at);

-- Make a partition for the active rows.
create table things_active partition of things_all for values in (null);

-- Add a primary key
alter table things_active add primary key(id);

-- And one for the inactive rows.
create table things_deleted partition of things_all default;

-- Copy rows from the old table. They'll automatically be partitioned.
insert into things_all select * from things_original;

-- Add your indexes to things_active (not shown)

-- Create a view to only get active rows.
create view things as 
select * 
from things_all 
where deleted_at is null;

-- This will only query the things_active table, no deleted_at index required.
select * from things;

Demonstration

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: Kiến thức lập trình - @ 17:24

Thẻ: databasepostgresqlindexingrdbms

Thiết kế website giá rẻ

Danh mục

Change to Postgres indexing scheme for IS NOT NULL?