I am trying to optimize my COPY INTO command.
Loading data from ADLS Gen2 into Snowflake via an external stage.
ADLS folder pattern:
YYYY-MM-DDTHH24.MI.SSZ/METADATA/<TABLE>/<TABLE>.json
YYYY-MM-DDTHH24.MI.SSZ/<TABLE>/<TABLE>.csv
I have two separate COPY commands to load the data into two separate tables, since the file formats are different.
Things tried:
- Passing a path in the stage reference to limit the scan to a specific day (table name and transform SQL are placeholders):

  COPY INTO <TABLE>
  FROM ( SELECT ... FROM @EXT_STAGE/2024-08-01 )
  PATTERN = 'T[0-9]{2}.[0-9]{2}.[0-9]{2}Z/.*.json$'
It still takes a lot of time. Folders are created roughly every 10 minutes in ADLS, so a single day holds ~144 of them.
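One variant I am considering (table name, column list, and file format below are placeholders, and I am assuming partial folder names work as stage path prefixes) is pushing the prefix down to the hour, so each COPY's directory listing covers only the ~6 folders in that hour instead of the whole day:

```sql
-- Hypothetical sketch: one COPY per hour instead of per day.
-- MY_TABLE and the SELECT list are placeholders.
COPY INTO MY_TABLE
FROM ( SELECT $1 FROM @EXT_STAGE/2024-08-01T10 )
PATTERN = '.*.json$'
FILE_FORMAT = (TYPE = 'JSON');
```

This would trade one large listing for several smaller ones, which may or may not help overall.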
- Tried different patterns:
PATTERN = 'T[0-9]{2}.[0-9]{2}.[0-9]{2}Z/.*.json$' -- Found most effective
PATTERN = '.*T[0-9]{2}.[0-9]{2}.[0-9]{2}Z/.*.json$' -- Second Best
PATTERN = '.*.json$' -- Third Best
The difference in run time between them is very small, though.
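As an aside, an unescaped `.` in a regex matches any character, so the strictly escaped form of the first pattern (with backslashes doubled, since I believe backslash is the escape character in Snowflake string literals) would be:

```sql
PATTERN = 'T[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}Z/.*\\.json$'
```

I have not measured whether this changes anything beyond the correctness of the match.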
- Stats:
  For ~100k files processed, it takes about 12 minutes:
    Listing external stage: 5 min
    Processing: 7 min
  Rows inserted: 94,479
  Scan progress: 100.00%
  External bytes scanned: 2.35 GB
  Bytes written: 17.10 MB
Given these stats, only ~2 GB is scanned and just ~17 MB written, yet the execution time is quite high.
Let me know if there is any way to improve the execution time, either by improving the PATTERN or by any other means.
*Things that are not in line with Snowflake recommendations:
- Files are not compressed (compressing them is not an option).