Say I have three tables: table A, table B, and table C, where table C is the result of joining table A and table B.

A data pipeline runs daily to perform this join. Because table A is huge and we want to avoid full scans on it, an 'offset' is kept for table A. The daily execution is therefore "incremental": only the latest data from table A (newer than the 'offset') is joined with table B and loaded into table C.
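
For concreteness, the daily job does roughly the following (a simplified PySpark sketch; the table names, the offset column event_ts, the join key ref_id, and how the offset is persisted are placeholders, not the actual implementation):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder: the 'offset' (high-water mark) is persisted between runs;
# here it is simply read from a one-row bookkeeping table.
offset = spark.table("pipeline_state").first()["last_offset"]

# Incremental slice of table A: only rows newer than the stored offset,
# so the huge table is never fully scanned.
a_increment = (
    spark.table("table_a")
         .filter(F.col("event_ts") > F.lit(offset))
)

# Left join keeps rows of A even when the matching reference row has not
# arrived in B yet; those rows land in C with NULL B-columns.
c_increment = a_increment.join(spark.table("table_b"), on="ref_id", how="left")

c_increment.write.mode("append").saveAsTable("table_c")

# Advance the offset to the newest row just processed (None if the
# increment was empty, in which case the offset stays unchanged).
new_offset = a_increment.agg(F.max("event_ts")).first()[0]
```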
My question: table B can also be updated (existing records are unlikely to change, but new records are added). Some records already loaded from table A reference records that only appear in table B after those rows of A have been processed. For example, a row lands in table A on Monday pointing at a table B record that is only inserted on Tuesday; Monday's run writes that row to table C with the table B columns missing. After the join, these records in table C are therefore missing information from table B.
Currently we have a separate pipeline, also running daily, that backfills the missing information in table C: it finds records whose table B columns are missing and joins them against table B again. I am wondering whether there is a more elegant approach.
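
The backfill job is essentially the following (again a sketch; detecting 'missing information' via a NULL B-side column, and the column names, are my simplifications):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

table_b = spark.table("table_b")
table_c = spark.table("table_c")

# Rows in C whose B-side column is still NULL, i.e. the reference row
# had not landed in B when the incremental join ran. "b_attribute" is
# a placeholder for whichever table B column we check.
stale = table_c.filter(F.col("b_attribute").isNull())

# Re-join the stale rows against the current state of B to pick up the
# late-arriving reference records.
repaired = (
    stale.drop("b_attribute")
         .join(table_b.select("ref_id", "b_attribute"), on="ref_id", how="inner")
)

# In the real pipeline the repaired rows are then merged (upserted)
# back into table C, keyed on C's primary key.
```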
Can Delta Live Tables from Databricks or Dynamic Tables from Snowflake provide a better solution for this kind of scenario?