Context
The script loads only some of the data, even though a single run (roughly 43 iterations of the loop) is expected to load all of it, so I have to run the script many times to get all of the records. My hunch is that it has something to do with the FETCH NEXT, OFFSET, and/or OPTIMIZE FOR statements.
There are about 2 million records to load from the CSV file. The code below is supposed to:
- bulk insert the file into a #tmp table, in batches of about 500k rows at a time
- then iterate roughly 43 times, each iteration translating and inserting about 50k of the records that are not already in the database
Problem
A complete run loads only some of the data; I have to run this script many times to get all of the records.
I have some other code, removed here for brevity, that tracks the migration history. It shows that each time I run the script it catches more of the records and adds them to the database. Honestly, I’m at a loss as to why. I’ve tried logging and investigating for about a week and I’m stuck. Below you can see each run: the number of rows in the CSV as ChecksumMigration, the row count in the table before the run as ChecksumTableBefore, and the row count after the run as ChecksumTableAfter.
| AppliedDateTime | ChecksumMigration | ChecksumTableBefore | ChecksumTableAfter |
|---|---|---|---|
| 2024-05-06 00:20:05 | 2036473 | 1986473 | 2036473 |
| 2024-05-06 00:06:27 | 2036473 | 1936473 | 1986473 |
| 2024-05-05 23:57:51 | 2036473 | 1786473 | 1936473 |
| 2024-05-05 23:54:07 | 2036473 | 1536473 | 1786473 |
| 2024-05-05 23:49:35 | 2036473 | 1036473 | 1536473 |
| 2024-05-05 23:42:20 | 2036473 | 0 | 1036473 |
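For reference, this is roughly the check I use between runs to see what is still missing (a sketch, assuming #tmp has been re-populated from the CSV the same way the script below does):

```sql
-- Count the CSV rows that have not yet made it into food
-- (same quote cleanup and cast as in the loop below).
SELECT COUNT(*) AS missing_rows
FROM #tmp AS t
WHERE NOT EXISTS (
    SELECT 1 FROM food AS d
    WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
);
```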
USE FoodData_Central;
DECLARE @beforeChecksum INT = 0;
DECLARE @afterChecksum INT = 0;
DECLARE @migrationName NVARCHAR(40) = N'2024 April Full from 2021 - ';
DECLARE @pathToInputFolder NVARCHAR(40) = N'C:\FoodData_Central_csv_2024-04-18';
DECLARE @tableName NVARCHAR(40);
DECLARE @startTime DATETIME2 = GETDATE();
BEGIN TRY
set @tableName = 'food';
-- CHECKSUM
DECLARE @SQL NVARCHAR(MAX) = 'SELECT @ResultVariable = count(*) FROM ' + @tableName;
EXEC sp_executesql @SQL, N'@ResultVariable INT OUTPUT', @ResultVariable = @beforeChecksum output;
-- TRUNCATE TABLE food; --only truncated the first time and then use fetch next to get through the data, solving problem data along the way
-- Mapping
DROP TABLE IF EXISTS #tmp;
create table #tmp(
fdc_id NVARCHAR(max) NOT NULL,
data_type NVARCHAR(max) NULL,
description NVARCHAR(max) NULL,
food_category_id NVARCHAR(max) NULL,
publication_date NVARCHAR(max) NULL
)
bulk insert #tmp
From 'C:\FoodData_Central_csv_2024-04-18\food.csv' -- update file name, (!sometimes the file names are different <crosses eyes>)
WITH
(
CODEPAGE = '65001'
,FIRSTROW = 2
,FIELDTERMINATOR = '","'
,ROWTERMINATOR = '0x0A' -- rows end with a line feed (LF)
,batchsize=500000
,TABLOCK
);
DECLARE @i int = 1
DECLARE @offsetCount int = 1;
DECLARE @nextCount int = 50000;
WHILE @i < 43
BEGIN
SET @i = @i + 1
-- DDL
insert into food(fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE table name and columns
select
CAST(REPLACE(t.fdc_id,'"','') AS INT) AS fdc_id
, t.data_type
, t.description
, CAST(t.food_category_id AS SMALLINT) AS food_category_id
, CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
from #tmp t
WHERE NOT EXISTS ( -- skip duplicates
SELECT 1 FROM food AS d --UPDATE
WHERE d.fdc_id = CAST(REPLACE(t.fdc_id,'"','') AS INT)
)
ORDER BY fdc_id DESC
OFFSET @offsetCount - 1 ROWS
FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );
set @offsetCount = @offsetCount + 50000;
set @nextCount = @nextCount + 50000;
END
-- CLEANUP
DROP TABLE IF EXISTS #tmp;
END TRY
BEGIN CATCH
PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(10));
PRINT 'Error Message: ' + ERROR_MESSAGE();
-- CLEANUP
DROP TABLE IF EXISTS #tmp;
END CATCH;
GO
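To illustrate my suspicion about the OFFSET/FETCH part: the NOT EXISTS filter is re-evaluated on every iteration against the growing food table, so the row set being paged shrinks while the OFFSET keeps growing. Here is a minimal toy repro of what I think is happening (made-up table names and numbers, not my real schema):

```sql
-- Page through 10 rows, 3 at a time, while excluding already-inserted
-- rows in the same query, the way the loop above does.
DROP TABLE IF EXISTS #src;
DROP TABLE IF EXISTS #dst;
CREATE TABLE #src (id INT PRIMARY KEY);
CREATE TABLE #dst (id INT PRIMARY KEY);
INSERT INTO #src (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

DECLARE @offset INT = 0;
WHILE @offset < 10
BEGIN
    INSERT INTO #dst (id)
    SELECT s.id
    FROM #src AS s
    WHERE NOT EXISTS (SELECT 1 FROM #dst AS d WHERE d.id = s.id)
    ORDER BY s.id
    OFFSET @offset ROWS FETCH NEXT 3 ROWS ONLY;

    SET @offset = @offset + 3;
END

-- Returns 6, not 10: ids 4, 5, 6, and 10 never arrive, because each
-- new OFFSET is applied to the already-filtered (shrunken) row set.
SELECT COUNT(*) AS loaded FROM #dst;
```

If that is really the cause, it would also explain the table above: every re-run restarts the paging against whatever is still missing, so each run picks up another chunk.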