I’ve got two tables:
utilities:
id | timestamp | action |
---|---|---|
901 | 2024-08-11 09:59:25.000 | on power |
902 | 2024-08-11 09:59:35.000 | on water |
903 | 2024-08-11 09:59:55.000 | off power |
904 | 2024-08-11 10:01:25.000 | on gas |
905 | 2024-08-11 10:02:35.000 | off water |
906 | 2024-08-11 10:11:18.000 | off power |
907 | 2024-08-11 10:31:28.000 | off gas |
908 | 2024-08-11 11:15:37.000 | on power |
items:
id | timestamp | action |
---|---|---|
906 | 2024-08-11 09:59:45.000 | on lights |
907 | 2024-08-11 09:59:58.000 | off lights |
908 | 2024-08-11 10:15:34.000 | on tap |
909 | 2024-08-11 10:18:25.000 | on heating |
910 | 2024-08-11 10:21:44.000 | off heating |
911 | 2024-08-11 11:02:35.000 | off tap |
912 | 2024-08-11 12:01:08.000 | open door |
913 | 2024-08-11 12:11:28.000 | close door |
I’m trying to combine them without the ids clashing (generate a new id?), and then apply date_trunc('hour', timestamp) as time, COUNT(*) as metric to work out how many actions happen per hour, so the end result would be:
time | metric |
---|---|
2024-08-11 09:00:00.000 | 5 |
2024-08-11 10:00:00.000 | 7 |
2024-08-11 11:00:00.000 | 2 |
2024-08-11 12:00:00.000 | 2 |
I tried this query, but it complains about:
Utilities.timestamp must appear in the GROUP BY …
WITH one AS (
SELECT
date_trunc('hour', timestamp) as timeOne,
COUNT(*) as utilities_count
FROM Utilities
ORDER BY timeOne
),
two AS (
SELECT
date_trunc('hour', timestamp) as timeTwo,
COUNT(*) as item_count
FROM Items
ORDER BY timeTwo
)
SELECT
SUM(utilities_count, item_count) as metric,
timeOne as time
FROM one, two
ORDER BY 1;
Any idea how to get the data combined and counted with the correct hourly binning?
Simply:
SELECT date_trunc('hour', timestamp) AS time, count(*) AS metric
FROM (
SELECT timestamp FROM utilities
UNION ALL
SELECT timestamp FROM items
) sub
GROUP BY 1
ORDER BY 1;
Clashing IDs are not relevant to this query. Simply append rows from both tables with UNION ALL (not just UNION!), use date_trunc() and aggregate.
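If you want to sanity-check the binning outside Postgres, here is a minimal Python sketch using the question’s sample data, with sqlite3’s strftime('%Y-%m-%d %H:00:00', …) standing in for date_trunc('hour', …) (an assumption for portability; the SQL shape is otherwise the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE utilities (id INTEGER, ts TEXT, action TEXT)")
cur.execute("CREATE TABLE items (id INTEGER, ts TEXT, action TEXT)")

# Sample rows from the question; note the ids overlap between tables.
utilities = [
    (901, "2024-08-11 09:59:25", "on power"),
    (902, "2024-08-11 09:59:35", "on water"),
    (903, "2024-08-11 09:59:55", "off power"),
    (904, "2024-08-11 10:01:25", "on gas"),
    (905, "2024-08-11 10:02:35", "off water"),
    (906, "2024-08-11 10:11:18", "off power"),
    (907, "2024-08-11 10:31:28", "off gas"),
    (908, "2024-08-11 11:15:37", "on power"),
]
items = [
    (906, "2024-08-11 09:59:45", "on lights"),
    (907, "2024-08-11 09:59:58", "off lights"),
    (908, "2024-08-11 10:15:34", "on tap"),
    (909, "2024-08-11 10:18:25", "on heating"),
    (910, "2024-08-11 10:21:44", "off heating"),
    (911, "2024-08-11 11:02:35", "off tap"),
    (912, "2024-08-11 12:01:08", "open door"),
    (913, "2024-08-11 12:11:28", "close door"),
]
cur.executemany("INSERT INTO utilities VALUES (?,?,?)", utilities)
cur.executemany("INSERT INTO items VALUES (?,?,?)", items)

# UNION ALL both tables, truncate to the hour, count per hour.
rows = cur.execute("""
    SELECT strftime('%Y-%m-%d %H:00:00', ts) AS time, count(*) AS metric
    FROM (SELECT ts FROM utilities UNION ALL SELECT ts FROM items) sub
    GROUP BY strftime('%Y-%m-%d %H:00:00', ts)
    ORDER BY time
""").fetchall()
for time, metric in rows:
    print(time, metric)
# 2024-08-11 09:00:00 5
# 2024-08-11 10:00:00 7
# 2024-08-11 11:00:00 2
# 2024-08-11 12:00:00 2
```

The ids never enter the query, which is why the clash doesn’t matter.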
I removed a more complex query that would only make sense with (unlikely) expression indexes on underlying tables.
If you like negligible performance gains, you can overcomplicate things with the overkill below. It results in an emulated index skip scan:
with recursive cte as (
(select date_trunc('hour',recorded_at) as t,count(*)
from utilities
group by 1 order by 1 limit 1)
union all
select lat.t,lat.count
from cte cross join lateral
( select date_trunc('hour',recorded_at) as t,count(*)
from utilities
where date_trunc('hour',recorded_at)>cte.t
group by 1 order by 1 limit 1) lat )
,cte2 as (
(select date_trunc('hour',recorded_at) as t,count(*)
from items
group by 1 order by 1 limit 1)
union all
select lat.t,lat.count
from cte2 cross join lateral
( select date_trunc('hour',recorded_at) as t,count(*)
from items
where date_trunc('hour',recorded_at)>cte.t
group by 1 order by 1 limit 1) lat )
select t,coalesce(cte.count,0)+coalesce(cte2.count,0) as count
from cte full join cte2 using (t);
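The final full join with coalesce() just merges the two per-hour tallies, treating an hour missing from one side as zero. A quick illustration in Python (hour labels shortened, counts taken from the question’s sample data):

```python
# Per-hour counts from each table (question's sample data).
utilities_hours = {"09:00": 3, "10:00": 4, "11:00": 1}
items_hours = {"09:00": 2, "10:00": 3, "11:00": 1, "12:00": 2}

# FULL JOIN ... USING (t) + coalesce(..., 0): take every hour seen on
# either side, defaulting a missing side to 0 before summing.
merged = {
    h: utilities_hours.get(h, 0) + items_hours.get(h, 0)
    for h in sorted(utilities_hours.keys() | items_hours.keys())
}
print(merged)
# {'09:00': 5, '10:00': 7, '11:00': 2, '12:00': 2}
```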
If you set up 2 tables, populate them with 100k random timestamps each, and equip both with a covering index on the expression:
select setseed(.42);
create table utilities(recorded_at)as
select (now()-random()*'3 months'::interval)::timestamp
from generate_series(1,1e5);
create table items(recorded_at)as
select (now()-random()*'3 months'::interval)::timestamp
from generate_series(1,1e5);
create index utilities_idx on utilities(date_trunc('hour',recorded_at) asc)
include(recorded_at)--necessary to upgrade index scan to index-only scan
with(fillfactor=100);
create index items_idx on items(date_trunc('hour',recorded_at) asc)
include(recorded_at)
with(fillfactor=100);
Then make sure they are freshly analysed (and actually even clustered, to help out seq scans), and you’ll find how little this much code and effort contributes to performance:
Explain plans and performance tests at db<>fiddle.
variant | mode | avg | min | max | stddev |
---|---|---|---|---|---|
joined_index_skip_scans_zegarek | 00:00:00.0693 | 00:00:00.082845 | 00:00:00.0693 | 00:00:00.280274 | 00:00:00.031557 |
grouped_union_of_ungrouped_erwin | 00:00:00.084824 | 00:00:00.100407 | 00:00:00.084824 | 00:00:00.185166 | 00:00:00.020863 |
grouped_union_of_groups_schwern | 00:00:00.086139 | 00:00:00.100492 | 00:00:00.086139 | 00:00:00.187825 | 00:00:00.021184 |
grouped_union_of_groups_erwin | 00:00:00.087176 | 00:00:00.106341 | 00:00:00.087176 | 00:00:00.312548 | 00:00:00.038265 |
It does look like 30%, but that’s 30% of very little already – and it’s a bit of a pain to get there.
An interesting observation is that not only is the index strictly necessary for this to work, it also has to be additionally weighed down to include the raw timestamp as payload on top of the date_trunc() expression just to enable an index-only scan, even though the query never uses the raw timestamp, making it feel like it’s just an expensive way to parley with the planner.
Meanwhile, the seq-scanning queries don’t really need the index at all.
count is an aggregate function which combines the values of multiple rows together. You have to tell Postgres how to group rows with a GROUP BY clause. You want to group by hour, so GROUP BY date_trunc('hour', timestamp) or GROUP BY 1.
Also, you do not want a cross join, which joins each row in one with every row in two. You could union both tables together and then group, but that can’t take advantage of indexes. Instead, count each table separately (which can use an index), then union the small per-hour results, group, and sum the counts together (only a handful of rows).
WITH hours AS (
SELECT
date_trunc('hour', recorded_at) as "hour",
COUNT(*) as "count"
FROM utilities
GROUP BY 1
UNION ALL
SELECT
date_trunc('hour', recorded_at) as "hour",
COUNT(*) as "count"
FROM items
GROUP BY 1
)
SELECT
"hour",
SUM("count") as "count"
FROM hours
GROUP BY "hour"
ORDER BY "hour";
Demonstration.
Other notes…
- Ordering the subqueries will just slow things down. Only the final query needs to be ordered.
- Avoid naming columns using keywords like timestamp. They risk errors and don’t have a lot of descriptive power. Follow the thing_at convention for timestamps and thing_on for dates.
- There are a lot of SQL keywords and they’re hard to avoid; quote them if you use them to avoid ambiguity and bizarre errors.