I am running the following query:
/*Put the grouped data into a temp table */
DROP TABLE IF EXISTS _TEMP_FORECAST_PARTITION_DATA
SELECT
SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY,
ANALYSIS_COMMENT, COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER,
PIPELINE_VALUATION_DATE, PIPELINE_PROBABILITY, TECH_ORIGINAL_USER,
TECH_ORIGINAL_ORIGIN, TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY,
SUM(AMOUNT) AS AMOUNT,
SUM(ORIGINAL_AMOUNT) AS ORIGINAL_AMOUNT
INTO
_TEMP_FORECAST_PARTITION_DATA
FROM
AW_FCT_000002_000001
GROUP BY
SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY,
ANALYSIS_COMMENT, COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER,
PIPELINE_VALUATION_DATE, PIPELINE_PROBABILITY, TECH_ORIGINAL_USER,
TECH_ORIGINAL_ORIGIN, TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY
/* Clear the original source table, then populate it with the temp table */
TRUNCATE TABLE AW_FCT_000002_000001
INSERT INTO AW_FCT_000002_000001
(OID, SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY, ANALYSIS_COMMENT,
COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER, PIPELINE_VALUATION_DATE,
PIPELINE_PROBABILITY, AMOUNT, ORIGINAL_AMOUNT,
TECH_ORIGINAL_USER, TECH_ORIGINAL_ORIGIN,
TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY)
SELECT
NEWID(), SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY,
ANALYSIS_COMMENT, COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER,
PIPELINE_VALUATION_DATE, PIPELINE_PROBABILITY,
AMOUNT, ORIGINAL_AMOUNT, TECH_ORIGINAL_USER,
TECH_ORIGINAL_ORIGIN, TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY
FROM
_TEMP_FORECAST_PARTITION_DATA
The query is taking 20 minutes to run – is there a more efficient way to do this?
There’s no functional issue, but it seems to take a disproportionate amount of time compared to other operations I’m performing on the table.
Query execution plan here: https://www.brentozar.com/pastetheplan/?id=HydaVBp2C
Statistics from the run:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 5 ms.
(54830198 rows affected)
(1 row affected)
SQL Server Execution Times:
CPU time = 1852172 ms, elapsed time = 544239 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 79 ms, elapsed time = 75 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 7 ms.
(54830198 rows affected)
(1 row affected)
SQL Server Execution Times:
CPU time = 348640 ms, elapsed time = 405714 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Completion time: 2024-09-10T06:00:59.0020584+01:00
Table definition as follows:
USE [TGK04_DATA_002]
GO
/****** Object: Table [dbo].[AW_FCT_000002_000001] Script Date: 9/10/2024 3:06:40 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[AW_FCT_000002_000001](
[OID] [varchar](36) NOT NULL,
[SCENARIO] [varchar](15) NULL,
[PERIOD] [varchar](2) NULL,
[DEPARTMENT] [varchar](30) NULL,
[ACCOUNT] [varchar](30) NULL,
[CATEGORY] [varchar](30) NULL,
[CURRENCY] [varchar](5) NULL,
[PROJECT] [varchar](30) NULL,
[PARTNERSHIP] [varchar](30) NULL,
[EMPLOYEE] [varchar](30) NULL,
[BOOKING_DATE] [varchar](30) NULL,
[AMOUNT] [numeric](27, 9) NULL,
[ORIGINAL_AMOUNT] [numeric](27, 9) NULL,
[ORIGINAL_CURRENCY] [varchar](3) NULL,
[NOTE] [varchar](8000) NULL,
[ANALYSIS_COMMENT] [varchar](255) NULL,
[PIPELINE_VALUATION_DATE] [varchar](30) NULL,
[COMMENT_TYPE] [varchar](30) NULL,
[TECH_ORIGINAL_USER] [varchar](255) NULL,
[TECH_ORIGINAL_ORIGIN] [varchar](255) NULL,
[TECH_ORIGINAL_DATEUPD] [datetime] NULL,
[ALLOCATION_EXPLAINER] [varchar](2000) NULL,
[PIPELINE_PROBABILITY] [varchar](30) NULL,
[DEPARTMENT_VISIBILITY] [varchar](30) NULL,
[DAY_RATE] [numeric](27, 9) NULL,
[DAY_RATE_ORIGINAL_CURRENCY] [numeric](27, 9) NULL,
[DAYS] [numeric](27, 9) NULL,
[PROVENIENZA] [varchar](80) NULL,
[USERUPD] [varchar](255) NULL,
[DATEUPD] [datetime] NULL,
[EN_VERSION] [numeric](5, 0) NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0.000000000)) FOR [DAY_RATE]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0.000000000)) FOR [DAY_RATE_ORIGINAL_CURRENCY]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0.000000000)) FOR [DAYS]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0)) FOR [EN_VERSION]
GO
If this runs once each night for 20 minutes, you might consider living with it for now and focusing your efforts on designing and developing a completely new process to replace it, one that separates daily data from summarized/aggregated data. But if you insist…
If you have the ability to alter the table definition, the current process can be improved if you:
- Set column OID as the PRIMARY KEY.
- Add a persisted computed column containing a cryptographic hash calculated from the values of the 20+ currently grouped columns.
- Add an index on this new column that also includes the two amount columns. (Column OID is also needed, but is implicitly included as the primary key.)
- Modify your process to update one of the rows in place and delete any remaining duplicate rows. Single (non-duplicated) rows should not be touched.
Using a single persisted computed column should greatly improve the efficiency of the grouping operation. Indexing that column is even better. Having the amount columns included in that index allows the initial scan and aggregation to be performed using a narrow index scan instead of a wide table scan. Finally, selecting one existing row to update in place saves the need to copy all those other columns and significantly reduces the insert/delete overhead.
Although the persisted computed column and associated index do have some overhead during the initial inserts, I believe this is more than offset by the potential savings.
For the persisted computed column definition, the following add-column and index DDL can be used:
ALTER TABLE AW_FCT_000002_000001
ADD HashColumn AS CONVERT(VARBINARY(32), HASHBYTES(
'SHA2_256', -- SHA1 may be faster, but is deprecated
ISNULL(CONVERT(VARBINARY(MAX), SCENARIO), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PERIOD), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), DEPARTMENT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ACCOUNT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), CATEGORY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), CURRENCY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PROJECT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PARTNERSHIP), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), EMPLOYEE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), BOOKING_DATE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ORIGINAL_CURRENCY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ANALYSIS_COMMENT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), COMMENT_TYPE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), NOTE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ALLOCATION_EXPLAINER), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PIPELINE_VALUATION_DATE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PIPELINE_PROBABILITY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), TECH_ORIGINAL_USER), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), TECH_ORIGINAL_ORIGIN), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), TECH_ORIGINAL_DATEUPD), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), DEPARTMENT_VISIBILITY), 0x)
))
PERSISTED NOT NULL
GO
CREATE INDEX IX_AW_FCT_000002_000001_HashColumn
ON AW_FCT_000002_000001(HashColumn)
INCLUDE(AMOUNT, ORIGINAL_AMOUNT)
-- I'm assuming that OID is the PK, so is implicitly included.
The above calculation has a few limitations:
- A null byte 0x00 is used to delimit column values, which should be acceptable for columns containing normal text values that do not contain null CHAR(0) characters.
- A null column is treated as equivalent to an empty value ''.
- Text values are hashed exactly as represented, so matching is case-sensitive, accent-sensitive, etc. (If you need to match case-insensitive values, you can add UPPER() to each value above, as sketched just below.)
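For example, applying UPPER() to the first term of the hash expression above would give:

ISNULL(CONVERT(VARBINARY(MAX), UPPER(SCENARIO)), 0x)  -- uppercase before hashing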
I expect this should be sufficient for your use case. A more robust expression can be found in this answer. (I also recall having seen an approach that maps all of the columns to JSON or XML and then calculates the hash, but I can’t find that right now.)
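For illustration only, a rough sketch of that JSON idea might look like the following. Note that it contains a subquery, so it could be used in an ad-hoc duplicate check but not in a persisted computed column:

-- Sketch only: hash a JSON serialization of the grouped columns.
-- INCLUDE_NULL_VALUES keeps NULL distinguishable from an empty string.
SELECT
    F.OID,
    HASHBYTES('SHA2_256', (
        SELECT F.SCENARIO, F.PERIOD, F.DEPARTMENT, F.ACCOUNT, F.CATEGORY, F.CURRENCY
               -- ... and the remaining grouped columns
        FOR JSON PATH, INCLUDE_NULL_VALUES
    )) AS JsonHash
FROM AW_FCT_000002_000001 F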
The following SQL can then be used to consolidate the data:
DECLARE @Temp TABLE (
HashColumn VARBINARY(32) NOT NULL,
OidToKeep VARCHAR(36),
CombinedAmount numeric(27, 9) NULL,
CombinedOriginalAmount numeric(27, 9) NULL
)
SET XACT_ABORT ON
BEGIN TRANSACTION
-- Find and combine duplicates
INSERT @Temp
SELECT
F.HashColumn,
MIN(F.OID) AS OidToKeep,
SUM(F.AMOUNT) AS CombinedAmount,
SUM(F.ORIGINAL_AMOUNT) AS CombinedOriginalAmount
FROM AW_FCT_000002_000001 F
GROUP BY F.HashColumn
HAVING COUNT(*) >= 2
-- Update the row to be kept
UPDATE F
SET
AMOUNT = T.CombinedAmount,
ORIGINAL_AMOUNT = T.CombinedOriginalAmount
FROM @Temp T
JOIN AW_FCT_000002_000001 F
ON F.HashColumn = T.HashColumn
AND F.OID = T.OidToKeep
-- Delete the other rows
DELETE F
FROM @Temp T
JOIN AW_FCT_000002_000001 F
ON F.HashColumn = T.HashColumn
AND F.OID <> T.OidToKeep
COMMIT
The first statement selects and aggregates the data based on the HashColumn value and also limits the results to those HashColumn values having duplicate (2 or more) rows. The script then updates one chosen survivor row and removes the other, sacrificial rows.
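As a hypothetical sanity check (not part of the process itself), you can capture the row count and totals before the transaction and compare them afterwards: the sums should be unchanged, while the row count drops by the number of duplicates removed.

-- Run before and after the consolidation; the SUMs should match exactly.
SELECT
    COUNT(*) AS RowCnt,
    SUM(AMOUNT) AS TotalAmount,
    SUM(ORIGINAL_AMOUNT) AS TotalOriginalAmount
FROM AW_FCT_000002_000001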
If you had a date-added column and only needed to consider data added after a certain date/time (e.g., since the last successful run), you could add something like:
WHERE F.HashColumn IN (
SELECT DISTINCT F1.HashColumn
FROM AW_FCT_000002_000001 F1
WHERE F1.DateAdded >= @SinceDateParameter
)
The above should be paired with an index: CREATE INDEX IX_xxx ON AW_FCT_000002_000001(DateAdded) INCLUDE (HashColumn).
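In context, the first statement of the consolidation script would then become something like this (a sketch, assuming the hypothetical DateAdded column and a @SinceDateParameter variable):

DECLARE @SinceDateParameter DATETIME = DATEADD(DAY, -1, GETDATE())  -- e.g., time of the last successful run

INSERT @Temp
SELECT
    F.HashColumn,
    MIN(F.OID) AS OidToKeep,
    SUM(F.AMOUNT) AS CombinedAmount,
    SUM(F.ORIGINAL_AMOUNT) AS CombinedOriginalAmount
FROM AW_FCT_000002_000001 F
WHERE F.HashColumn IN (
    SELECT DISTINCT F1.HashColumn
    FROM AW_FCT_000002_000001 F1
    WHERE F1.DateAdded >= @SinceDateParameter
)
GROUP BY F.HashColumn
HAVING COUNT(*) >= 2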
If you wanted to specifically preserve the row having the earliest date, you could change MIN(OID) AS OidToKeep
to something like:
(
SELECT TOP 1 F1.OID
FROM AW_FCT_000002_000001 F1
WHERE F1.HashColumn = F.HashColumn
ORDER BY F1.DateAdded
) AS OidToKeep
If the above is used, the DateAdded column should be added to the earlier IX_AW_FCT_000002_000001_HashColumn index: CREATE INDEX IX_AW_FCT_000002_000001_HashColumn ON AW_FCT_000002_000001(HashColumn, DateAdded) INCLUDE(AMOUNT, ORIGINAL_AMOUNT). Your DATEUPD or EN_VERSION columns might be candidate substitutes for DateAdded when selecting the earliest row.
See this db<>fiddle for a demo.