I have an Azure Synapse dedicated SQL pool. I use COPY INTO to load CSV files from Azure Blob Storage into an existing table. It consistently loads 3 GB / 27M rows in under 10 minutes.
Target table
CREATE TABLE xyz.some_table (
    some_table_id int null,
    some_event datetime null,
    /* snip */
    col_p nvarchar(256) null,
    col_q nvarchar(256) null,
    col_r nvarchar(256) null
)
WITH (
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);
COPY INTO
COPY INTO xyz.some_table
FROM 'https://account.blob.core.windows.net/container/folder'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FIRSTROW = 2,
    FILE_TYPE = 'CSV',
    COMPRESSION = 'NONE',
    ROWTERMINATOR = '0X0A',
    ENCODING = 'UTF8'
);
It turns out this implementation was incorrect: the source columns are actually nvarchar(max), so after a year of running well I started getting truncation errors.
I changed the corresponding columns in the Synapse target table to nvarchar(max) as well:
...
    col_p nvarchar(max) null,
    col_q nvarchar(max) null,
    col_r nvarchar(max) null
...
and it now takes 2 to 3 hours to load the same files that used to take 10 minutes. If I redefine these columns from nvarchar(max) back to nvarchar(4000), performance returns to where it was.
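(For illustration, a CTAS rebuild plus rename is one way to make that redefinition in a dedicated pool; this is a sketch only, the "_shrunk"/"_old" names are placeholders, and it assumes the existing values fit within 4,000 characters, since the CAST will silently truncate anything longer.)
-- Sketch: rebuild the table with nvarchar(4000) columns via CTAS
CREATE TABLE xyz.some_table_shrunk
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS
SELECT
    some_table_id,
    some_event,
    /* snip */
    CAST(col_p AS nvarchar(4000)) AS col_p,
    CAST(col_q AS nvarchar(4000)) AS col_q,
    CAST(col_r AS nvarchar(4000)) AS col_r
FROM xyz.some_table;

-- Swap the new table in under the original name
RENAME OBJECT xyz.some_table TO some_table_old;
RENAME OBJECT xyz.some_table_shrunk TO some_table;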
I am familiar with SQL Server internals and how LOBs are stored differently from other data types. I have no intuition about Synapse, however, or what may be causing this performance degradation. I'm asking for insights, links, or hints, please, as to what's going on here.
For better performance you need to minimize the size of your data types, which shortens the row length. That is, declare each character column with the longest length it actually needs instead of MAX: an nvarchar(max) column has a maximum storage size of 2 GB per value and is treated as large-object data, which typically cannot be processed as efficiently as fixed-length, in-row data; check here.
You can also refer here, where it is mentioned that, for better performance:
When possible, use NVARCHAR(4000) or VARCHAR(8000) instead of NVARCHAR(MAX) or VARCHAR(MAX).
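If you are unsure what length is actually needed, you can measure the longest value already loaded and size the columns from that (sketch only; the table and column names are taken from the question, so adjust them to your schema):
-- Find the longest value per column to pick a safe nvarchar length
-- (xyz.some_table and col_p/col_q/col_r come from the question)
SELECT
    MAX(LEN(col_p)) AS max_len_col_p,
    MAX(LEN(col_q)) AS max_len_col_q,
    MAX(LEN(col_r)) AS max_len_col_r
FROM xyz.some_table;
If every result is 4000 or less, nvarchar(4000) is safe and restores the load performance you saw before the change to nvarchar(max).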