Imagine I have a table defined like:
CREATE TABLE my_table (
    id SERIAL PRIMARY KEY,
    fruit VARCHAR(50),
    pet VARCHAR(50),
    country VARCHAR(50),
    color VARCHAR(50),
    car_brand VARCHAR(50),
    month DATE,
    value DECIMAL(10, 2)
);
Filled with data (currently around 6 million rows) like:
INSERT INTO my_table (fruit, pet, country, color, car_brand, month, value)
SELECT
    -- Randomly select one of 10 fruits
    (ARRAY['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry', 'Fig', 'Grape', 'Honeydew', 'Kiwi', 'Lemon'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 pets
    (ARRAY['Dog', 'Cat', 'Hamster', 'Parrot', 'Fish', 'Rabbit', 'Turtle', 'Lizard', 'Snake', 'Frog'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 countries
    (ARRAY['USA', 'Canada', 'Mexico', 'Brazil', 'UK', 'France', 'Germany', 'Italy', 'Spain', 'Japan'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 colors
    (ARRAY['Red', 'Green', 'Blue', 'Yellow', 'Purple', 'Orange', 'Black', 'White', 'Pink', 'Brown'])[floor(random() * 10 + 1)],
    -- Randomly select one of 10 car brands
    (ARRAY['Toyota', 'Ford', 'Honda', 'Chevrolet', 'Nissan', 'BMW', 'Mercedes', 'Volkswagen', 'Hyundai', 'Audi'])[floor(random() * 10 + 1)],
    -- Randomly select a month between 2010 and 2030
    date_trunc('month', '2010-01-01'::date + (random() * (365*20)) * '1 day'::interval),
    -- Randomly generate a decimal value between 0 and 100000
    round((random() * 100000)::numeric, 2)
FROM generate_series(1, 6000000);
and I need to return the sum of value grouped by each of the other columns individually, optionally filtered by zero or more of those columns, and always restricted so that month falls between two given months. This can be done as several queries, or as a single query if that is beneficial.
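For the single-query variant, I believe something along these lines with GROUPING SETS would express all five groupings in one pass (a sketch only, not benchmarked; I've written the month restriction as a plain range on the column, using the effective bounds from the plan below):

```sql
-- One pass over the table producing all five per-column groupings;
-- in each output row only the grouped column is non-NULL.
SELECT fruit, pet, country, color, car_brand, sum(value) AS value
FROM my_table
WHERE month >= DATE '2013-01-01'
  AND month < DATE '2024-12-01'
GROUP BY GROUPING SETS ((fruit), (pet), (country), (color), (car_brand));
```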
An example of one of the queries could be (in this case not filtered by any other column):
SELECT
    fruit,
    sum(value) AS value
FROM
    my_table
WHERE
    date_trunc('month', month) >= date_trunc('month', TIMESTAMP '2013-01-01')
    AND date_trunc('month', month) < date_trunc('month', TIMESTAMP '2024-12-31')
GROUP BY
    fruit
HAVING
    sum(value) >= 100;
With no indexes, this one grouping alone takes around 1.3 s.
None of the indexes I have tried are used by the query planner, so they had no effect on execution time:
CREATE INDEX idx_my_table_month_fruit ON my_table (month, fruit);
CREATE INDEX idx_my_table_month_fruit_include_value ON my_table (month, fruit) INCLUDE (value);
CREATE INDEX idx_my_table_month_fruit_value ON my_table (month, fruit, value);
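(To double-check whether the planner is able to use these indexes at all, one can temporarily discourage sequential scans for the session; a diagnostic sketch only, not a setting to leave on:)

```sql
-- Diagnostic only: makes sequential scans artificially expensive in this
-- session, so the planner will prefer any index it considers usable.
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT fruit, sum(value)
FROM my_table
WHERE date_trunc('month', month) >= date_trunc('month', TIMESTAMP '2013-01-01')
  AND date_trunc('month', month) < date_trunc('month', TIMESTAMP '2024-12-31')
GROUP BY fruit;
RESET enable_seqscan;
```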
Here is the output from EXPLAIN (ANALYZE, BUFFERS, VERBOSE):
Finalize GroupAggregate  (cost=125540.63..125637.11 rows=3 width=38) (actual time=1383.800..1465.536 rows=10 loops=1)
  Output: fruit, sum(value)
  Group Key: my_table.fruit
  Filter: (sum(my_table.value) >= '100'::numeric)
  Buffers: shared hit=61266
  ->  Gather Merge  (cost=125540.63..125636.81 rows=20 width=38) (actual time=1372.784..1465.513 rows=30 loops=1)
        Output: fruit, (PARTIAL sum(value))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=61266
        ->  Partial GroupAggregate  (cost=124540.60..124634.48 rows=10 width=38) (actual time=1273.888..1367.455 rows=10 loops=3)
              Output: fruit, PARTIAL sum(value)
              Group Key: my_table.fruit
              Buffers: shared hit=61266
              Worker 0:  actual time=1272.827..1365.943 rows=10 loops=1
                Buffers: shared hit=20096
              Worker 1:  actual time=1268.420..1363.470 rows=10 loops=1
                Buffers: shared hit=20340
              ->  Sort  (cost=124540.60..124571.85 rows=12500 width=14) (actual time=1263.047..1293.971 rows=1192562 loops=3)
                    Output: fruit, value
                    Sort Key: my_table.fruit
                    Sort Method: quicksort  Memory: 98529kB
                    Buffers: shared hit=61266
                    Worker 0:  actual time=1262.223..1293.317 rows=1173014 loops=1
                      Sort Method: quicksort  Memory: 96704kB
                      Buffers: shared hit=20096
                    Worker 1:  actual time=1257.874..1289.398 rows=1186804 loops=1
                      Sort Method: quicksort  Memory: 97264kB
                      Buffers: shared hit=20340
                    ->  Parallel Seq Scan on public.my_table  (cost=0.00..123690.00 rows=12500 width=14) (actual time=0.033..1069.481 rows=1192562 loops=3)
                          Output: fruit, value
                          Filter: ((date_trunc('month'::text, (my_table.month)::timestamp with time zone) >= '2013-01-01 00:00:00'::timestamp without time zone) AND (date_trunc('month'::text, (my_table.month)::timestamp with time zone) < '2024-12-01 00:00:00'::timestamp without time zone))
                          Rows Removed by Filter: 807438
                          Buffers: shared hit=61190
                          Worker 0:  actual time=0.028..1054.352 rows=1173014 loops=1
                            Buffers: shared hit=20058
                          Worker 1:  actual time=0.034..1063.693 rows=1186804 loops=1
                            Buffers: shared hit=20302
Planning Time: 0.316 ms
Execution Time: 1467.926 ms
Obviously it doesn’t get better when I want to do the same for pet, country, color, and car_brand at the same time.
Any ideas on how to optimize this would be greatly appreciated. I am open to hardware upgrades, configuration changes, and of course query/index/schema changes; the latter are obviously preferred.
Note: in reality there are not only 10 distinct values per column. Some columns have tens, some hundreds, some thousands. So far I have not noticed any difference in behavior compared to 10 values, so for simplicity I’ve kept it that way in the example.
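(For reference, the real per-column cardinalities can be checked with something like the following sketch:)

```sql
-- Distinct-value counts per grouping column, to show the real cardinalities.
SELECT
    count(DISTINCT fruit)     AS fruits,
    count(DISTINCT pet)       AS pets,
    count(DISTINCT country)   AS countries,
    count(DISTINCT color)     AS colors,
    count(DISTINCT car_brand) AS car_brands
FROM my_table;
```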