Imagine I have a table defined like:
CREATE TABLE my_table (
    id SERIAL PRIMARY KEY,
    fruit VARCHAR(50),
    pet VARCHAR(50),
    country VARCHAR(50),
    color VARCHAR(50),
    car_brand VARCHAR(50),
    month DATE,
    value DECIMAL(10, 2)
);
Filled with data (currently around 6 million rows) like:
INSERT INTO my_table (fruit, pet, country, color, car_brand, month, value)
SELECT
    -- Randomly select one of 10 fruits
    (ARRAY['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry', 'Fig', 'Grape', 'Honeydew', 'Kiwi', 'Lemon'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 pets
    (ARRAY['Dog', 'Cat', 'Hamster', 'Parrot', 'Fish', 'Rabbit', 'Turtle', 'Lizard', 'Snake', 'Frog'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 countries
    (ARRAY['USA', 'Canada', 'Mexico', 'Brazil', 'UK', 'France', 'Germany', 'Italy', 'Spain', 'Japan'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 colors
    (ARRAY['Red', 'Green', 'Blue', 'Yellow', 'Purple', 'Orange', 'Black', 'White', 'Pink', 'Brown'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 car brands
    (ARRAY['Toyota', 'Ford', 'Honda', 'Chevrolet', 'Nissan', 'BMW', 'Mercedes', 'Volkswagen', 'Hyundai', 'Audi'])[floor(random() * 10 + 1)],
    -- Randomly select a month between 2010 and 2030
    date_trunc('month', '2010-01-01'::date + (random() * (365*20)) * '1 day'::interval),
    -- Randomly generate a decimal value between 0 and 100000
    round((random() * 100000)::numeric, 2)
FROM generate_series(1, 6000000);
and I need to return the sum of value grouped by each of the other columns individually, optionally filtered by zero or more of those columns, and always restricted so that month falls between two given months. This can be done as several queries, or as a single query if that is beneficial.
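For the single-query variant, I believe something along these lines with GROUPING SETS would express all five groupings in one pass (a sketch only, not benchmarked; I've written the month restriction as a plain range on the column, using the effective bounds from the plan below):

```sql
-- One pass over the table producing all five per-column groupings;
-- in each output row only the grouped column is non-NULL.
SELECT fruit, pet, country, color, car_brand, sum(value) AS value
FROM my_table
WHERE month >= DATE '2013-01-01'
  AND month < DATE '2024-12-01'
GROUP BY GROUPING SETS ((fruit), (pet), (country), (color), (car_brand));
```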
An example of one of the queries could be (in this case not filtered by any other column):
SELECT
    fruit,
    sum(value) AS value
FROM
    my_table
WHERE
    date_trunc('month', month) >= date_trunc('month', TIMESTAMP '2013-01-01')
    AND date_trunc('month', month) < date_trunc('month', TIMESTAMP '2024-12-31')
GROUP BY
    fruit
HAVING
    sum(value) >= 100;
With no indexes, this one grouping alone takes around 1.3 s.
None of the indexes I have tried are used by the query planner, so they had no effect on execution time:
CREATE INDEX idx_my_table_month_fruit ON my_table (month, fruit);
CREATE INDEX idx_my_table_month_fruit_include_value ON my_table (month, fruit) INCLUDE (value);
CREATE INDEX idx_my_table_month_fruit_value ON my_table (month, fruit, value);
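(To double-check whether the planner is able to use these indexes at all, one can temporarily discourage sequential scans for the session; a diagnostic sketch only, not a setting to leave on:)

```sql
-- Diagnostic only: makes sequential scans artificially expensive in this
-- session, so the planner will prefer any index it considers usable.
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT fruit, sum(value)
FROM my_table
WHERE date_trunc('month', month) >= date_trunc('month', TIMESTAMP '2013-01-01')
  AND date_trunc('month', month) < date_trunc('month', TIMESTAMP '2024-12-31')
GROUP BY fruit;
RESET enable_seqscan;
```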
Here is the output from EXPLAIN (ANALYZE, BUFFERS, VERBOSE):
Finalize GroupAggregate  (cost=125540.63..125637.11 rows=3 width=38) (actual time=1383.800..1465.536 rows=10 loops=1)
  Output: fruit, sum(value)
  Group Key: my_table.fruit
  Filter: (sum(my_table.value) >= '100'::numeric)
  Buffers: shared hit=61266
  ->  Gather Merge  (cost=125540.63..125636.81 rows=20 width=38) (actual time=1372.784..1465.513 rows=30 loops=1)
        Output: fruit, (PARTIAL sum(value))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=61266
        ->  Partial GroupAggregate  (cost=124540.60..124634.48 rows=10 width=38) (actual time=1273.888..1367.455 rows=10 loops=3)
              Output: fruit, PARTIAL sum(value)
              Group Key: my_table.fruit
              Buffers: shared hit=61266
              Worker 0:  actual time=1272.827..1365.943 rows=10 loops=1
                Buffers: shared hit=20096
              Worker 1:  actual time=1268.420..1363.470 rows=10 loops=1
                Buffers: shared hit=20340
              ->  Sort  (cost=124540.60..124571.85 rows=12500 width=14) (actual time=1263.047..1293.971 rows=1192562 loops=3)
                    Output: fruit, value
                    Sort Key: my_table.fruit
                    Sort Method: quicksort  Memory: 98529kB
                    Buffers: shared hit=61266
                    Worker 0:  actual time=1262.223..1293.317 rows=1173014 loops=1
                      Sort Method: quicksort  Memory: 96704kB
                      Buffers: shared hit=20096
                    Worker 1:  actual time=1257.874..1289.398 rows=1186804 loops=1
                      Sort Method: quicksort  Memory: 97264kB
                      Buffers: shared hit=20340
                    ->  Parallel Seq Scan on public.my_table  (cost=0.00..123690.00 rows=12500 width=14) (actual time=0.033..1069.481 rows=1192562 loops=3)
                          Output: fruit, value
                          Filter: ((date_trunc('month'::text, (my_table.month)::timestamp with time zone) >= '2013-01-01 00:00:00'::timestamp without time zone) AND (date_trunc('month'::text, (my_table.month)::timestamp with time zone) < '2024-12-01 00:00:00'::timestamp without time zone))
                          Rows Removed by Filter: 807438
                          Buffers: shared hit=61190
                          Worker 0:  actual time=0.028..1054.352 rows=1173014 loops=1
                            Buffers: shared hit=20058
                          Worker 1:  actual time=0.034..1063.693 rows=1186804 loops=1
                            Buffers: shared hit=20302
Planning Time: 0.316 ms
Execution Time: 1467.926 ms
Obviously it doesn’t get better when I want to do the same for pet, country, color, and car_brand at the same time.
Any ideas on how to optimize this would be greatly appreciated. I am open to hardware upgrades, configuration changes, and of course query/index/schema changes; the latter are obviously preferred.
Note: in reality there are not only 10 distinct values per column. Some columns have tens, some hundreds, some thousands. So far I have not noticed any difference in behavior compared to 10 values, so for simplicity I’ve kept it that way in the example.
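(For reference, the real per-column cardinalities can be checked with something like the following sketch:)

```sql
-- Distinct-value counts per grouping column, to show the real cardinalities.
SELECT
    count(DISTINCT fruit)     AS fruits,
    count(DISTINCT pet)       AS pets,
    count(DISTINCT country)   AS countries,
    count(DISTINCT color)     AS colors,
    count(DISTINCT car_brand) AS car_brands
FROM my_table;
```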