I am onboarding data onto a platform that can visualize data only at hour grain or day grain. I provided it with a connection string to my DB; the system pulls data from it, shows it as a graph, and supports a bunch of slice-and-dice operations. The problem is that its chart-based visualization works only at hour or day grain, while my data arrives once every 4 hours. The platform treats this as missing data, and the graph comes up as dots instead of a continuous line. So I thought of exposing a view that gap-fills the data at hour grain, carrying the last value forward, which the platform can then use. This is presenting some issues.
DB schema: record_date, segment_key, segment_value, avg_consumption, avg_charge … (more numeric columns)
Each segment_key, segment_value combination is updated every 4 hours, and I was able to form a gap-filling query following the guide Fill time gaps and imputing missing values.
But that solution uses a temporary table, not a view, and the original table can contain a lot of data (a few years' worth), which can make this non-performant. Also, the start/end times for the temp table cannot be hard-coded, as this is rolling data.
Is there a more elegant way to gap-fill without using any joins? Or, if joins are required, one that is still performant?
I tried a query following the Microsoft guide Fill time gaps and imputing missing values, but it reads all the data from the table before filling and exposing it.
From the description, there are two parts to this problem:
- Get a contiguous range of dates/times – even if any point in the range is not present in the data.
- In the result, for any missing point, pull the latest non-NULL value from previous points in the range.
For Azure SQL Database, SQL Server 2022, and the four people using Managed Instance, it is quite easy to generate a contiguous series of dates or times using GENERATE_SERIES. It is going to be both simpler and more efficient than a solution that loops to populate a #temp table. You could create a table-valued function, depending on the interval you want, and pass in start/end dates. This one handles days or hours, and could easily be expanded to handle minutes, weeks, months, etc.:
CREATE OR ALTER FUNCTION dbo.GetInterval
(
  @part  varchar(4),
  @start date,
  @end   date
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
(
  SELECT period_start = s,
         period_end = CASE @part
           WHEN 'day'  THEN DATEADD(DAY,  1, s)
           WHEN 'hour' THEN DATEADD(HOUR, 1, s)
         END
  FROM
  (
    SELECT s = DATEADD(HOUR, Value - 1, CONVERT(datetime, @start))
      FROM GENERATE_SERIES(1, 1 + DATEDIFF(HOUR, @start, @end))
     WHERE @part = 'hour'
    UNION ALL
    SELECT s = DATEADD(DAY, Value - 1, @start)
      FROM GENERATE_SERIES(1, 1 + DATEDIFF(DAY, @start, @end))
     WHERE @part = 'day'
  ) AS x
);
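For example (a quick smoke test; the dates here are just illustrative), the hour branch returns one row per hour from @start's midnight through @end's midnight, inclusive:

```sql
-- hourly buckets: 25 rows, 2024-01-01 00:00 through 2024-01-02 00:00
SELECT period_start, period_end
  FROM dbo.GetInterval('hour', '20240101', '20240102');

-- daily buckets: 5 rows, 2024-01-01 through 2024-01-05
SELECT period_start, period_end
  FROM dbo.GetInterval('day', '20240101', '20240105');
```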
Now for the data. You didn’t provide any, so I’ll make some up:
CREATE TABLE #sales
(
record_date date,
segment_key int,
segment_value int,
consumption decimal(10,2),
charge decimal(12,2)
);
INSERT #sales VALUES
('20240101',1,1,10,40),('20240101',1,2,20,80),
('20240101',1,1,20,70),('20240101',1,2,30,100),
('20240102',1,1,10,90),('20240102',1,2,25,150),
/* no rows at all for Jan 3 */
('20240104',1,1,12,48), /* missing segment_value */
('20240105',1,1,15,75),('20240105',1,2,11,44);
SELECT *
FROM #sales
ORDER BY record_date;
Output:

| record_date | segment_key | segment_value | consumption | charge |
|-------------|-------------|---------------|-------------|--------|
| 2024-01-01  | 1           | 1             | 10.00       | 40.00  |
| 2024-01-01  | 1           | 2             | 20.00       | 80.00  |
| 2024-01-01  | 1           | 1             | 20.00       | 70.00  |
| 2024-01-01  | 1           | 2             | 30.00       | 100.00 |
| 2024-01-02  | 1           | 1             | 10.00       | 90.00  |
| 2024-01-02  | 1           | 2             | 25.00       | 150.00 |
| 2024-01-04  | 1           | 1             | 12.00       | 48.00  |
| 2024-01-05  | 1           | 1             | 15.00       | 75.00  |
| 2024-01-05  | 1           | 2             | 11.00       | 44.00  |
I’m also going to make up your desired results, because you didn’t provide those, either. I’m guessing you want aggregates per combo for every {period}, and if a row is missing, take the most recent one; by day that would be:

| record_date | segment_key | segment_value | avg_consumption | avg_charge |
|-------------|-------------|---------------|-----------------|------------|
| 2024-01-01  | 1           | 1             | 15.00           | 55.00      |
| 2024-01-01  | 1           | 2             | 25.00           | 90.00      |
| 2024-01-02  | 1           | 1             | 10.00           | 90.00      |
| 2024-01-02  | 1           | 2             | 25.00           | 150.00     |
| 2024-01-03  | 1           | 1             | 10.00           | 90.00      |
| 2024-01-03  | 1           | 2             | 25.00           | 150.00     |
| 2024-01-04  | 1           | 1             | 12.00           | 48.00      |
| 2024-01-04  | 1           | 2             | 25.00           | 150.00     |
| 2024-01-05  | 1           | 1             | 15.00           | 75.00      |
| 2024-01-05  | 1           | 2             | 11.00           | 44.00      |
The problem now is twofold: you need to fill in gaps for days that are missing altogether, and you need to fill in gaps for segment combos that are missing. An ordinary aggregate query:
SELECT record_date, segment_key, segment_value,
avg_consumption = AVG(consumption),
avg_charge = AVG(charge)
FROM #sales
GROUP BY record_date, segment_key, segment_value;
will produce these results, with nothing for January 3rd and no (1, 2) row for January 4th:

| record_date | segment_key | segment_value | avg_consumption | avg_charge |
|-------------|-------------|---------------|-----------------|------------|
| 2024-01-01  | 1           | 1             | 15.00           | 55.00      |
| 2024-01-01  | 1           | 2             | 25.00           | 90.00      |
| 2024-01-02  | 1           | 1             | 10.00           | 90.00      |
| 2024-01-02  | 1           | 2             | 25.00           | 150.00     |
| 2024-01-04  | 1           | 1             | 12.00           | 48.00      |
| 2024-01-05  | 1           | 1             | 15.00           | 75.00      |
| 2024-01-05  | 1           | 2             | 11.00           | 44.00      |
To fill those gaps, you must first generate a list of all possible segment combos and all possible dates. You can generate that full set using:
DECLARE @start date = '20240101', @end date = '20240105';
;WITH all_possible_segments AS
(
SELECT segment_key, segment_value
FROM #sales
WHERE record_date >= @start
AND record_date < DATEADD(DAY, 1, @end)
GROUP BY segment_key, segment_value
),
all_possible_dates AS
(
SELECT period_start, next_period_start = period_end
FROM dbo.GetInterval('day', @start, @end)
)
SELECT * FROM all_possible_segments
CROSS JOIN all_possible_dates;
In this case it produces 10 rows (2 combos × 5 days), like so:

| segment_key | segment_value | period_start | next_period_start |
|-------------|---------------|--------------|-------------------|
| 1           | 1             | 2024-01-01   | 2024-01-02        |
| 1           | 1             | 2024-01-02   | 2024-01-03        |
| 1           | 1             | 2024-01-03   | 2024-01-04        |
| 1           | 1             | 2024-01-04   | 2024-01-05        |
| 1           | 1             | 2024-01-05   | 2024-01-06        |
| 1           | 2             | 2024-01-01   | 2024-01-02        |
| 1           | 2             | 2024-01-02   | 2024-01-03        |
| 1           | 2             | 2024-01-03   | 2024-01-04        |
| 1           | 2             | 2024-01-04   | 2024-01-05        |
| 1           | 2             | 2024-01-05   | 2024-01-06        |
Now you can outer join to your actual data, e.g.:
DECLARE @start date = '20240101', @end date = '20240105';
;WITH all_possible_segments AS
(
SELECT segment_key, segment_value
FROM #sales
WHERE record_date >= @start
AND record_date < DATEADD(DAY, 1, @end)
GROUP BY segment_key, segment_value
),
all_possible_dates AS
(
SELECT period_start, next_period_start = period_end
FROM dbo.GetInterval('day', @start, @end)
)
SELECT dates.period_start,
combos.segment_key,
combos.segment_value,
avg_consumption = AVG(sales.consumption),
avg_charge = AVG(sales.charge)
FROM all_possible_segments AS combos
CROSS JOIN all_possible_dates AS dates
LEFT OUTER JOIN #sales AS sales
ON sales.segment_key = combos.segment_key
AND sales.segment_value = combos.segment_value
AND sales.record_date >= dates.period_start
AND sales.record_date < dates.next_period_start
GROUP BY dates.period_start,
combos.segment_key,
combos.segment_value;
But this still produces NULL for those missing rows (or 0 if you use COALESCE()), which makes:

| period_start | segment_key | segment_value | avg_consumption | avg_charge |
|--------------|-------------|---------------|-----------------|------------|
| 2024-01-01   | 1           | 1             | 15.00           | 55.00      |
| 2024-01-01   | 1           | 2             | 25.00           | 90.00      |
| 2024-01-02   | 1           | 1             | 10.00           | 90.00      |
| 2024-01-02   | 1           | 2             | 25.00           | 150.00     |
| 2024-01-03   | 1           | 1             | NULL            | NULL       |
| 2024-01-03   | 1           | 2             | NULL            | NULL       |
| 2024-01-04   | 1           | 1             | 12.00           | 48.00      |
| 2024-01-04   | 1           | 2             | NULL            | NULL       |
| 2024-01-05   | 1           | 1             | 15.00           | 75.00      |
| 2024-01-05   | 1           | 2             | 11.00           | 44.00      |
To fill in any missing row’s value with the previous value (which is a bit weird because it misrepresents the actual data, unless you were filling in rolling SUMs, but if making the line look continuous is paramount, so be it), you can use the new IGNORE NULLS option for LAST_VALUE(). This says: pull the value from a different (usually the most recent) row that has a non-NULL value:
DECLARE @start date = '20240101', @end date = '20240105';
;WITH all_possible_segments AS
(
SELECT segment_key, segment_value
FROM #sales
WHERE record_date >= @start
AND record_date < DATEADD(DAY, 1, @end)
GROUP BY segment_key, segment_value
),
all_possible_dates AS
(
SELECT period_start, next_period_start = period_end
FROM dbo.GetInterval('day', @start, @end)
)
SELECT period_start, segment_key, segment_value,
avg_consumption = LAST_VALUE(avg_cons) IGNORE NULLS OVER
(PARTITION BY segment_key, segment_value ORDER BY period_start),
avg_charge = LAST_VALUE(avg_chrg) IGNORE NULLS OVER
(PARTITION BY segment_key, segment_value ORDER BY period_start)
FROM
(
SELECT dates.period_start,
combos.segment_key,
combos.segment_value,
avg_cons = AVG(sales.consumption),
avg_chrg = AVG(sales.charge)
FROM all_possible_segments AS combos
CROSS JOIN all_possible_dates AS dates
LEFT OUTER JOIN #sales AS sales
ON sales.segment_key = combos.segment_key
AND sales.segment_value = combos.segment_value
AND sales.record_date >= dates.period_start
AND sales.record_date < dates.next_period_start
GROUP BY dates.period_start,
combos.segment_key,
combos.segment_value
) AS grp
ORDER BY period_start,
segment_key,
segment_value;
Output:

| period_start | segment_key | segment_value | avg_consumption | avg_charge |
|--------------|-------------|---------------|-----------------|------------|
| 2024-01-01   | 1           | 1             | 15.00           | 55.00      |
| 2024-01-01   | 1           | 2             | 25.00           | 90.00      |
| 2024-01-02   | 1           | 1             | 10.00           | 90.00      |
| 2024-01-02   | 1           | 2             | 25.00           | 150.00     |
| 2024-01-03   | 1           | 1             | 10.00           | 90.00      |
| 2024-01-03   | 1           | 2             | 25.00           | 150.00     |
| 2024-01-04   | 1           | 1             | 12.00           | 48.00      |
| 2024-01-04   | 1           | 2             | 25.00           | 150.00     |
| 2024-01-05   | 1           | 1             | 15.00           | 75.00      |
| 2024-01-05   | 1           | 2             | 11.00           | 44.00      |
If this doesn’t perform well, it’s probably because the underlying table isn’t properly indexed to support range queries (e.g. covering and leading on date/time), or your range was too large to do anything but scan. Or the performance problem is not the query at all, but simply trying to return / render too many rows. If you get three months of minute intervals and aggregate against 400 products, well, that’s a lot of data to return (over 50 million rows) no matter how fast the query executes to find those rows.
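As a sketch of such an index, using the column names from the question (the index name is made up, and whether the INCLUDE columns pay off depends on your workload):

```sql
-- hypothetical index: leading on the date column so the range
-- predicate can seek, covering the measures to avoid lookups
CREATE INDEX ix_consumption_record_date
  ON dbo.consumption (record_date, segment_key, segment_value)
  INCLUDE (avg_consumption, avg_charge);
```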
(Compare timings of SELECT to SELECT … INTO #trash to see if this might be contributing to the problem.)
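And since the original goal was a view (which can't take parameters) over a rolling window, one option is to compute the bounds inside the view and feed them into the same query. A sketch, assuming the real table is called dbo.consumption with the schema from the question, a hypothetical view name, and an arbitrary 30-day rolling window at hour grain:

```sql
CREATE OR ALTER VIEW dbo.ConsumptionGapFilled
AS
WITH bounds AS
(
  -- rolling window: last 30 days up to today's midnight (adjust to taste;
  -- widen the end bound if you also want the current day's hours)
  SELECT start_date = CONVERT(date, DATEADD(DAY, -30, GETDATE())),
         end_date   = CONVERT(date, GETDATE())
),
all_possible_segments AS
(
  SELECT c.segment_key, c.segment_value
    FROM dbo.consumption AS c
   CROSS JOIN bounds AS b
   WHERE c.record_date >= b.start_date
     AND c.record_date < DATEADD(DAY, 1, b.end_date)
   GROUP BY c.segment_key, c.segment_value
),
all_possible_dates AS
(
  SELECT i.period_start, next_period_start = i.period_end
    FROM bounds AS b
   CROSS APPLY dbo.GetInterval('hour', b.start_date, b.end_date) AS i
)
SELECT period_start, segment_key, segment_value,
       avg_consumption = LAST_VALUE(avg_cons) IGNORE NULLS OVER
         (PARTITION BY segment_key, segment_value ORDER BY period_start),
       avg_charge = LAST_VALUE(avg_chrg) IGNORE NULLS OVER
         (PARTITION BY segment_key, segment_value ORDER BY period_start)
  FROM
  (
    SELECT dates.period_start,
           combos.segment_key,
           combos.segment_value,
           avg_cons = AVG(c.avg_consumption),
           avg_chrg = AVG(c.avg_charge)
      FROM all_possible_segments AS combos
     CROSS JOIN all_possible_dates AS dates
      LEFT OUTER JOIN dbo.consumption AS c
        ON c.segment_key = combos.segment_key
       AND c.segment_value = combos.segment_value
       AND c.record_date >= dates.period_start
       AND c.record_date < dates.next_period_start
     GROUP BY dates.period_start,
           combos.segment_key,
           combos.segment_value
  ) AS grp;
```

The platform can then SELECT from dbo.ConsumptionGapFilled with no hard-coded dates, since the window rolls with GETDATE().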
For readers on older versions: well, you can't use IGNORE NULLS, but there are other ideas, at least for replacing GENERATE_SERIES, here:
- Time Slots for a schedule between two Datetimes