I have a number of data sets which I’m trying to get a seven figure summary for (see https://en.wikipedia.org/wiki/Seven-number_summary#Bowley%E2%80%99s_seven-figure_summary for background)
It’s pretty trivial to get each of the various numbers (min, max, first and third quartiles, first and last deciles and median) from a single result set.
However, if I want to get it off various stats, I find myself dealing with a lot more redundant code than there should be.
For example:
create table example (x integer, y integer, z interval);
insert into example values
(1,2,'1 min'),
(3,10,'20 seconds'),
(3,40,'3 seconds'),
(2,1,'3 minutes'),
(10,10,'1 hour');
select percentile_disc(0) within group(order by value) as minimum,
percentile_disc(0.10) within group(order by value) as first_decile,
percentile_disc(0.25) within group(order by value) as first_quartile,
percentile_disc(0.5) within group(order by value) as median,
percentile_disc(0.75) within group(order by value) as third_quartile,
percentile_disc(0.9) within group(order by value) as last_decile,
percentile_disc(1) within group(order by value)
from (
select x as value from example
) data;
select percentile_disc(0) within group(order by value) as minimum,
percentile_disc(0.10) within group(order by value) as first_decile,
percentile_disc(0.25) within group(order by value) as first_quartile,
percentile_disc(0.5) within group(order by value) as median,
percentile_disc(0.75) within group(order by value) as third_quartile,
percentile_disc(0.9) within group(order by value) as last_decile,
percentile_disc(1) within group(order by value)
from (
select y as value from example
) data;
select percentile_disc(0) within group(order by value) as minimum,
percentile_disc(0.10) within group(order by value) as first_decile,
percentile_disc(0.25) within group(order by value) as first_quartile,
percentile_disc(0.5) within group(order by value) as median,
percentile_disc(0.75) within group(order by value) as third_quartile,
percentile_disc(0.9) within group(order by value) as last_decile,
percentile_disc(1) within group(order by value)
from (
select z as value from example
) data;
When run, this gives me:
postgres=# i ~/cc/bowley-example.sql
CREATE TABLE
INSERT 0 5
minimum | first_decile | first_quartile | median | third_quartile | last_decile | percentile_disc
---------+--------------+----------------+--------+----------------+-------------+-----------------
1 | 1 | 2 | 3 | 3 | 10 | 10
(1 row)
minimum | first_decile | first_quartile | median | third_quartile | last_decile | percentile_disc
---------+--------------+----------------+--------+----------------+-------------+-----------------
1 | 1 | 2 | 10 | 10 | 40 | 40
(1 row)
minimum | first_decile | first_quartile | median | third_quartile | last_decile | percentile_disc
----------+--------------+----------------+----------+----------------+-------------+-----------------
00:00:03 | 00:00:03 | 00:00:20 | 00:01:00 | 00:03:00 | 01:00:00 | 01:00:00
(1 row)
postgres=#
This is fine as output but all but one of those lines in each of the three queries is identical. Code duplication being a nasty small, I feel like there has to be a neater solution.
Note that the real data sets I’m working with are much bigger so performance is a consideration. I’m also intending on using first (see https://wiki.postgresql.org/wiki/First/last_(aggregate) ) to get the best and worst examples in certain cases but I’ve omitted that in the interest of simplicity.