I’ve been working on improving the query times for few common query patterns in Postgres. The table structure can be simplified to the following
create table file_group (
id serial primary key,
tag varchar(64),
creation_time timestamp
);
create table file (
id serial primary key,
file_group_id integer not null references file_group,
name text,
ordering int
);
For a given file_group
, we can have 0-N files
referencing it. For all queries, I need to determine the most recent file_group
for each unique tag value, and then retrieve the files
associated with them. All of the common queries follow a structure similar to this
select * from
(
select distinct on (file_group.tag) *
from file_group
where file_group.tag < 'C' -- dynamic user supplied filter
order by file_group.tag asc, file_group.creation_time desc
) fg
left join file on fg.id = file.file_group_id -- left join in case of 0 files in group
order by -- returned ordering of files is important
fg.tag asc,
fg.creation_time desc,
file.ordering asc;
I’ve tried a few different indexing approaches to improve the speed of this query pattern. Currently there is an index on file_group (tag, creation_time desc)
which works well for the dynamic user supplied filters that are applied to the where clause.
However, I’m currently stuck on the distinct on
subquery. When there are a large number of matching rows, the planner uses relatively slow nested loops to join the subquery and file
table. I’ve been struggling to come up with a solution without subqueries using some window function or similar.
The desired output can be seen in this fiddle here.
Some other details
- In reality the
file_group
table has multipletag
columns, previously this was stored in a single HSTORE column - The
file_group
table has ~50M records and thefile
table has ~60M. Most file groups have a single file in them, but some have 0 and some have many.