Assume I have a Django application with the following features:
- there’s a Project model which has a `created_at` field
- the application supports a plugin system where a developer can create custom API endpoints and install them onto the application, exposing them publicly. The endpoints can run arbitrary code, but cannot schedule long-running tasks (e.g. Celery tasks)
My goal is to create a third-party analytics service for this application. To do that, the analytics service needs to poll the application periodically for new data; specifically, it needs to fetch any new projects.
The first step is to create a plugin that acts as an “adapter”: it exposes an API endpoint that presents the projects in a format useful to the analytics service. We have full control over this endpoint, which parameters it accepts, and so on. Assume its responses are paginated.
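For concreteness, here is a minimal sketch of the behaviour such an adapter endpoint could have, written as plain Python rather than an actual Django view (the `Project` rows are modeled as dicts, and names like `projects_page` and `PAGE_SIZE` are illustrative, not part of the application):

```python
from datetime import datetime, timezone

PAGE_SIZE = 2  # illustrative; a real endpoint would likely use a larger page


def projects_page(projects, since, page):
    """Return one page of projects created after `since`, plus a flag
    indicating whether more pages follow. `projects` stands in for the
    Project queryset."""
    newer = [p for p in projects if p["created_at"] > since]
    newer.sort(key=lambda p: p["created_at"])
    start = page * PAGE_SIZE
    chunk = newer[start:start + PAGE_SIZE]
    return {"results": chunk, "has_next": start + PAGE_SIZE < len(newer)}


# Example data standing in for Project rows
projects = [
    {"id": 1, "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"id": 3, "created_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]

page = projects_page(projects, datetime(2024, 1, 1, tzinfo=timezone.utc), 0)
```

In an actual plugin the filter would of course be a queryset expression such as `Project.objects.filter(created_at__gt=since)` rather than a list comprehension.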
Now the question is: assuming the analytics service performs one request per hour to get the new projects, what pattern should it use to ask only for new projects?
There are two techniques which I’ve thought of, both with their advantages and disadvantages.
- Use a query parameter `since` specifying the timestamp of the last time the analytics service fetched the projects. The analytics app will run in a loop to get all the pages from the adapter endpoint, then save the current time as the last timestamp. The pro is that this is a very simple approach. The main con is that any projects created during the retrieval of the pages may never make it to the analytics service.
- Specify the list of project IDs the analytics service has already fetched in the request body / query params. This prevents starvation, but the requests will eventually become huge.
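To make the first technique concrete, here is a sketch of the hourly sync step on the analytics side; `fetch_page` is a hypothetical stand-in for the HTTP call to the adapter endpoint, and the timestamp handling shows where the starvation gap comes from:

```python
from datetime import datetime, timezone


def sync_new_projects(fetch_page, last_sync):
    """Technique 1: pull every page of projects created after `last_sync`,
    then advance the timestamp. Any project created while the pages are
    being retrieved may fall after the data actually seen but before the
    new timestamp, which is the risk described above."""
    sync_started = datetime.now(timezone.utc)
    collected, page = [], 0
    while True:
        resp = fetch_page(since=last_sync, page=page)
        collected.extend(resp["results"])
        if not resp["has_next"]:
            break
        page += 1
    return collected, sync_started  # sync_started becomes the next `since`


# Fake endpoint standing in for the adapter plugin (5 projects, 2 per page)
def fake_fetch(since, page):
    data = [{"id": i} for i in range(5)]
    start = page * 2
    return {"results": data[start:start + 2], "has_next": start + 2 < len(data)}


found, next_since = sync_new_projects(
    fake_fetch, datetime(2024, 1, 1, tzinfo=timezone.utc)
)
```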
Is there a better way?