I am running the following query:
/*Put the grouped data into a temp table */
DROP TABLE IF EXISTS _TEMP_FORECAST_PARTITION_DATA
SELECT
SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY,
ANALYSIS_COMMENT, COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER,
PIPELINE_VALUATION_DATE, PIPELINE_PROBABILITY, TECH_ORIGINAL_USER,
TECH_ORIGINAL_ORIGIN, TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY,
SUM(AMOUNT) AS AMOUNT,
SUM(ORIGINAL_AMOUNT) AS ORIGINAL_AMOUNT
INTO
_TEMP_FORECAST_PARTITION_DATA
FROM
AW_FCT_000002_000001
GROUP BY
SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY,
ANALYSIS_COMMENT, COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER,
PIPELINE_VALUATION_DATE, PIPELINE_PROBABILITY, TECH_ORIGINAL_USER,
TECH_ORIGINAL_ORIGIN, TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY
/* Clear the original source table, then populate it with the temp table */
TRUNCATE TABLE AW_FCT_000002_000001
INSERT INTO AW_FCT_000002_000001
(OID, SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY, ANALYSIS_COMMENT,
COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER, PIPELINE_VALUATION_DATE,
PIPELINE_PROBABILITY, AMOUNT, ORIGINAL_AMOUNT,
TECH_ORIGINAL_USER, TECH_ORIGINAL_ORIGIN,
TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY)
SELECT
NEWID(), SCENARIO, PERIOD, DEPARTMENT, ACCOUNT, CATEGORY, CURRENCY,
PROJECT, PARTNERSHIP, EMPLOYEE, BOOKING_DATE, ORIGINAL_CURRENCY,
ANALYSIS_COMMENT, COMMENT_TYPE, NOTE, ALLOCATION_EXPLAINER,
PIPELINE_VALUATION_DATE, PIPELINE_PROBABILITY,
AMOUNT, ORIGINAL_AMOUNT, TECH_ORIGINAL_USER,
TECH_ORIGINAL_ORIGIN, TECH_ORIGINAL_DATEUPD, DEPARTMENT_VISIBILITY
FROM
_TEMP_FORECAST_PARTITION_DATA
The query is taking 20 minutes to run – is there a more efficient way to do this?
There’s no functional issue, but it seems to take a disproportionate amount of time compared to other operations I’m performing on the table.
Query execution plan here: https://www.brentozar.com/pastetheplan/?id=HydaVBp2C
Statistics from the run:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 5 ms.
(54830198 rows affected)
(1 row affected)
SQL Server Execution Times:
CPU time = 1852172 ms, elapsed time = 544239 ms.
Warning: Null value is eliminated by an aggregate or other SET operation.
SQL Server Execution Times:
CPU time = 79 ms, elapsed time = 75 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 7 ms.
(54830198 rows affected)
(1 row affected)
SQL Server Execution Times:
CPU time = 348640 ms, elapsed time = 405714 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Completion time: 2024-09-10T06:00:59.0020584+01:00
Table definition as follows:
USE [TGK04_DATA_002]
GO
/****** Object: Table [dbo].[AW_FCT_000002_000001] Script Date: 9/10/2024 3:06:40 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[AW_FCT_000002_000001](
[OID] [varchar](36) NOT NULL,
[SCENARIO] [varchar](15) NULL,
[PERIOD] [varchar](2) NULL,
[DEPARTMENT] [varchar](30) NULL,
[ACCOUNT] [varchar](30) NULL,
[CATEGORY] [varchar](30) NULL,
[CURRENCY] [varchar](5) NULL,
[PROJECT] [varchar](30) NULL,
[PARTNERSHIP] [varchar](30) NULL,
[EMPLOYEE] [varchar](30) NULL,
[BOOKING_DATE] [varchar](30) NULL,
[AMOUNT] [numeric](27, 9) NULL,
[ORIGINAL_AMOUNT] [numeric](27, 9) NULL,
[ORIGINAL_CURRENCY] [varchar](3) NULL,
[NOTE] [varchar](8000) NULL,
[ANALYSIS_COMMENT] [varchar](255) NULL,
[PIPELINE_VALUATION_DATE] [varchar](30) NULL,
[COMMENT_TYPE] [varchar](30) NULL,
[TECH_ORIGINAL_USER] [varchar](255) NULL,
[TECH_ORIGINAL_ORIGIN] [varchar](255) NULL,
[TECH_ORIGINAL_DATEUPD] [datetime] NULL,
[ALLOCATION_EXPLAINER] [varchar](2000) NULL,
[PIPELINE_PROBABILITY] [varchar](30) NULL,
[DEPARTMENT_VISIBILITY] [varchar](30) NULL,
[DAY_RATE] [numeric](27, 9) NULL,
[DAY_RATE_ORIGINAL_CURRENCY] [numeric](27, 9) NULL,
[DAYS] [numeric](27, 9) NULL,
[PROVENIENZA] [varchar](80) NULL,
[USERUPD] [varchar](255) NULL,
[DATEUPD] [datetime] NULL,
[EN_VERSION] [numeric](5, 0) NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0.000000000)) FOR [DAY_RATE]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0.000000000)) FOR [DAY_RATE_ORIGINAL_CURRENCY]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0.000000000)) FOR [DAYS]
GO
ALTER TABLE [dbo].[AW_FCT_000002_000001] ADD DEFAULT ((0)) FOR [EN_VERSION]
GO
If this runs once each night for 20 minutes, you might consider living with it for now and focusing your efforts on designing and developing a completely new process to replace it, one that separates daily data from summarized/aggregated data. But if you insist…
If you have the ability to alter the table definition, the current process can be improved if you:
- Set column OID as the PRIMARY KEY.
- Add a persisted computed column containing a cryptographic hash calculated from the values of the 20+ currently grouped columns.
- Add an index on this new column that also includes the two amount columns. (Column OID is also needed, but is implicitly included as the primary key.)
- Modify your process to update one of the rows in place and delete any remaining duplicate rows. Single (non-duplicated) rows should not be touched.
Using a single persisted computed column should greatly improve the efficiency of the grouping operation. Indexing that column is even better. Having the amount columns included in that index allows the initial scan and aggregation to be performed using a narrow index scan instead of a wide table scan. Finally, selecting one existing row to update in place saves the need to copy all those other columns and significantly reduces the insert/delete overhead.
Although the persisted computed column and associated index do have some overhead during the initial inserts, I believe this is more than offset by the potential savings.
For the persisted computed column definition, the following add-column and index DDL can be used:
ALTER TABLE AW_FCT_000002_000001
ADD HashColumn AS CONVERT(VARBINARY(32), HASHBYTES(
'SHA2_256', -- SHA1 may be faster, but is deprecated
ISNULL(CONVERT(VARBINARY(MAX), SCENARIO), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PERIOD), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), DEPARTMENT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ACCOUNT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), CATEGORY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), CURRENCY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PROJECT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PARTNERSHIP), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), EMPLOYEE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), BOOKING_DATE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ORIGINAL_CURRENCY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ANALYSIS_COMMENT), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), COMMENT_TYPE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), NOTE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), ALLOCATION_EXPLAINER), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PIPELINE_VALUATION_DATE), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), PIPELINE_PROBABILITY), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), TECH_ORIGINAL_USER), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), TECH_ORIGINAL_ORIGIN), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), TECH_ORIGINAL_DATEUPD), 0x)
+ 0x00 + ISNULL(CONVERT(VARBINARY(MAX), DEPARTMENT_VISIBILITY), 0x)
))
PERSISTED NOT NULL
GO
CREATE INDEX IX_AW_FCT_000002_000001_HashColumn
ON AW_FCT_000002_000001(HashColumn)
INCLUDE(AMOUNT, ORIGINAL_AMOUNT)
-- I'm assuming that OID is the PK, so is implicitly included.
The above calculation has a few limitations:
- A null byte 0x00 is used to delimit column values, which should be acceptable for columns containing normal text values that do not contain null CHAR(0) characters.
- A null column is treated as equivalent to an empty value ''.
- Text values are hashed exactly as represented, so matching is case-sensitive, accent-sensitive, etc. (If you need to match case-insensitive values, you can add UPPER() to each value above, as sketched just below.)
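For example, applying UPPER() to the first term of the hash expression above would give:

ISNULL(CONVERT(VARBINARY(MAX), UPPER(SCENARIO)), 0x)  -- uppercase before hashing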
I expect this should be sufficient for your use case. A more robust expression can be found in this answer. (I also recall having seen an approach that maps all of the columns to JSON or XML and then calculates the hash, but I can’t find that right now.)
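For illustration only, a rough sketch of that JSON idea might look like the following. Note that it contains a subquery, so it could be used in an ad-hoc duplicate check but not in a persisted computed column:

-- Sketch only: hash a JSON serialization of the grouped columns.
-- INCLUDE_NULL_VALUES keeps NULL distinguishable from an empty string.
SELECT
    F.OID,
    HASHBYTES('SHA2_256', (
        SELECT F.SCENARIO, F.PERIOD, F.DEPARTMENT, F.ACCOUNT, F.CATEGORY, F.CURRENCY
               -- ... and the remaining grouped columns
        FOR JSON PATH, INCLUDE_NULL_VALUES
    )) AS JsonHash
FROM AW_FCT_000002_000001 F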
The following SQL can then be used to consolidate the data:
DECLARE @Temp TABLE (
HashColumn VARBINARY(32) NOT NULL,
OidToKeep VARCHAR(36),
CombinedAmount numeric(27, 9) NULL,
CombinedOriginalAmount numeric(27, 9) NULL
)
SET XACT_ABORT ON
BEGIN TRANSACTION
-- Find and combine duplicates
INSERT @Temp
SELECT
F.HashColumn,
MIN(F.OID) AS OidToKeep,
SUM(F.AMOUNT) AS CombinedAmount,
SUM(F.ORIGINAL_AMOUNT) AS CombinedOriginalAmount
FROM AW_FCT_000002_000001 F
GROUP BY F.HashColumn
HAVING COUNT(*) >= 2
-- Update the row to be kept
UPDATE F
SET
AMOUNT = T.CombinedAmount,
ORIGINAL_AMOUNT = T.CombinedOriginalAmount
FROM @Temp T
JOIN AW_FCT_000002_000001 F
ON F.HashColumn = T.HashColumn
AND F.OID = T.OidToKeep
-- Delete the other rows
DELETE F
FROM @Temp T
JOIN AW_FCT_000002_000001 F
ON F.HashColumn = T.HashColumn
AND F.OID <> T.OidToKeep
COMMIT
The first statement selects and aggregates the data based on the HashColumn value and also limits the results to those HashColumn values having duplicate (2 or more) rows. The script then updates one chosen survivor row and removes the other, sacrificial rows.
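As a hypothetical sanity check (not part of the process itself), you can capture the row count and totals before the transaction and compare them afterwards: the sums should be unchanged, while the row count drops by the number of duplicates removed.

-- Run before and after the consolidation; the SUMs should match exactly.
SELECT
    COUNT(*) AS RowCnt,
    SUM(AMOUNT) AS TotalAmount,
    SUM(ORIGINAL_AMOUNT) AS TotalOriginalAmount
FROM AW_FCT_000002_000001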
If you had a date-added column and only needed to consider data added after a certain date/time (e.g., since the last successful run), you could add something like:
WHERE F.HashColumn IN (
SELECT DISTINCT F1.HashColumn
FROM AW_FCT_000002_000001 F1
WHERE F1.DateAdded >= @SinceDateParameter
)
The above should be paired with an index: CREATE INDEX IX_xxx ON AW_FCT_000002_000001(DateAdded) INCLUDE (HashColumn).
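In context, the first statement of the consolidation script would then become something like this (a sketch, assuming the hypothetical DateAdded column and a @SinceDateParameter variable):

DECLARE @SinceDateParameter DATETIME = DATEADD(DAY, -1, GETDATE())  -- e.g., time of the last successful run

INSERT @Temp
SELECT
    F.HashColumn,
    MIN(F.OID) AS OidToKeep,
    SUM(F.AMOUNT) AS CombinedAmount,
    SUM(F.ORIGINAL_AMOUNT) AS CombinedOriginalAmount
FROM AW_FCT_000002_000001 F
WHERE F.HashColumn IN (
    SELECT DISTINCT F1.HashColumn
    FROM AW_FCT_000002_000001 F1
    WHERE F1.DateAdded >= @SinceDateParameter
)
GROUP BY F.HashColumn
HAVING COUNT(*) >= 2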
If you wanted to specifically preserve the row having the earliest date, you could change MIN(OID) AS OidToKeep
to something like:
(
SELECT TOP 1 F1.OID
FROM AW_FCT_000002_000001 F1
WHERE F1.HashColumn = F.HashColumn
ORDER BY F1.DateAdded
) AS OidToKeep
If the above is used, the DateAdded column should be added to the earlier IX_AW_FCT_000002_000001_HashColumn index: CREATE INDEX IX_AW_FCT_000002_000001_HashColumn ON AW_FCT_000002_000001(HashColumn, DateAdded) INCLUDE(AMOUNT, ORIGINAL_AMOUNT). Your DATEUPD or EN_VERSION columns might be candidate substitutes for DateAdded when selecting the earliest row.
See this db<>fiddle for a demo.