I have a storage container into which a source file is published daily. I need to compare the contents of this file with a table in an Azure SQL database, and copy rows from the source file to the DB only if they do not already exist there.
I came across the following post, which deals with a similar use case: Azure Data factory – insert a row in the azure sql database only if it doesn’t exist
I have tried to replicate this using a data flow, comparing the two datasets with an 'Exists' transformation on a derived column (a hash column built with sha2, sketched after the list below) instead of joins/filter/select. However, I have the following issues:
- Both the file and the table are massive: ~370 columns and ~7M rows.
- The dataset contains a mixture of data types: string, double, timestamp, etc.
- When I create the source in my data flow, the inferred data types get mixed up and do not match the data types in the DB.
- So in my data flow, every row from the CSV file ends up flagged as not matching, because the derived hash columns never agree.
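For context, the derived column I add on each side is meant to be a whole-row fingerprint, roughly along these lines (a sketch, not my exact expression; `columns()` is the mapping data flow function that feeds every column's value into the hash):

```
sha2(256, columns())
```

Because `columns()` hashes each value as rendered under its inferred type, any drift between the CSV projection and the SQL projection (say, a double reading 1.0 versus a decimal reading 1.00) changes the hash, which matches what I am seeing.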
How do I deal with this?
Here is a snapshot of my data flow: