I’m trying to find a solution that lets our users stay in sync with an API that updates often. Right now we export our data once a month in JSON Lines format. We want users to keep up with daily changes to records via the API, but that’s difficult because we update our records A LOT. We release a lot of new features, and when we do, we update 15M to 20M records per day (out of 250M total).
What we would like to do is create daily changefiles which can be downloaded via S3. But each document is fairly large, including an abstract, authors, institutions, etc. If we dump all ~15M changed records as full JSON Lines documents, the files will be huge and the transfer costs will likely be unsustainable. Instead, we’d like to dump only what actually changed in each record — field-level updates, plus additions and deletions. That would keep the file sizes small and likely be very sustainable.
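To make it concrete, here’s roughly the shape we have in mind — a minimal Python sketch that diffs two snapshots keyed by record ID and emits one delta per line. The `action`/`set`/`unset` field names are just made up for illustration, not any standard:

```python
import json

def diff_records(old, new):
    """Compare two snapshots (dicts keyed by record ID) and yield
    JSON-serializable delta entries: additions, deletions, and
    field-level updates. Field names here are illustrative only."""
    # Records present only in the new snapshot -> additions
    for rec_id in new.keys() - old.keys():
        yield {"id": rec_id, "action": "add", "record": new[rec_id]}
    # Records present only in the old snapshot -> deletions
    for rec_id in old.keys() - new.keys():
        yield {"id": rec_id, "action": "delete"}
    # Records in both -> emit only fields that changed or were removed
    for rec_id in old.keys() & new.keys():
        changed = {k: v for k, v in new[rec_id].items()
                   if old[rec_id].get(k) != v}
        removed = [k for k in old[rec_id] if k not in new[rec_id]]
        if changed or removed:
            yield {"id": rec_id, "action": "update",
                   "set": changed, "unset": removed}

# Toy snapshots (record IDs and fields are hypothetical)
old = {"W1": {"title": "Foo", "cited_by": 3},
       "W2": {"title": "Bar"}}
new = {"W1": {"title": "Foo", "cited_by": 4},
       "W3": {"title": "Baz"}}

# One JSON object per line -- this is the daily changefile
changefile = "\n".join(json.dumps(d, sort_keys=True)
                       for d in diff_records(old, new))
print(changefile)
```

In this toy example the changefile carries only the updated `cited_by` value for W1 rather than the whole document, which is where the size savings would come from at 15M+ records/day.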
Has anybody done something like this? We checked out a project called ResourceSync (https://www.openarchives.org/rs/toc), but I don’t think it caught on. Maybe there’s a project or solution out there that I’m not aware of? Would love to hear your thoughts!