My case is a bit more complex, but let's consider a simpler example.
Let's say we've got a big dataset that needs to be updated incrementally. Updates arrive several times a day, and each time we receive all of the data for the current day (a small amount at first, then more and more toward the end of the day). So each batch contains new records plus the records from the previous batch, some of which may have been changed or deleted in the meantime.
To make this efficient, we only need to overwrite the current day's data each time a new batch arrives, leaving the data for all previous days untouched. With plain Spark we could achieve this by reading only the new files, partitioning the output dataset by date, and setting spark.sql.sources.partitionOverwriteMode = dynamic. That would suit our case perfectly: today's partition gets overwritten each time new data comes in.
But as far as I understand, I can't use such settings in Foundry; I tried it, and the setting seems to be simply ignored.
Using the incremental decorator, on the other hand, lets us read only the new data, but the output write mode can only be set to modify (append only) or replace (complete overwrite). So there's no option for partial overwrites.
Does anyone know how such scenarios (where append is not enough and a full replacement of the output is too costly and inefficient) can be handled in Foundry?
Thanks!