I'm running a service that crawls many websites daily. The crawlers are run as jobs processed by a number of independent background worker processes that pick up the jobs as they get enqueued.
Currently I'm doing the throttling "in-process", meaning that each thread does a random sleep between requests. However, having many asynchronous background workers that may from time to time be processing jobs for the same domain is causing concurrency trouble, and I need a central monitoring system that my workers can talk to and get directions from. I have been looking into a few patterns, e.g. the leaky bucket and rate limiting with Redis, but I feel I have yet to find the silver bullet.
My main concern is that not only do I have to track requests and throttle them, but I also need to capture feedback such as 429 response codes and use it to adapt the rate on a per-domain basis.
Do any of you have good ideas about patterns you have used, good resources on the topic, general advice, or other recommendations?
Thanks!
I suggest looking into Zookeeper to help you with this. We use Zookeeper to allow the servers to coordinate with each other as they're tasked with talking to domains.
The pattern works as follows:
- Persistent nodes representing your domain set are populated into Zookeeper.
- Jobs affecting those domains "lock" those domains with ephemeral nodes. You can tune how many locks are allowed in your code, thus specifying the maximum number of concurrent consumers (see the sketch at the end of this answer).
- You can additionally use the node body itself to track individual requests, either to limit the rate across your entire cluster or to specify the rate for each individual consumer, depending on your desired approach. You could even use an "AtomicLong" recipe (look into Curator's DistributedAtomicLong) to throttle by download size.
- You then assign listeners to the relevant nodes to allow the state information to be propagated to your consumers as it updates. FIFO ordering is guaranteed so that’s not a concern for you.
- Optionally, you can propagate additional cluster worker information to a centralized node and have a leader node (see the Curator 'Leader Election' recipe) handle auditing that information so you can review actual performance over time.
This is how we do it and it works as advertised.
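To make the domain-locking step concrete, here is a minimal sketch using Curator's InterProcessSemaphoreV2 recipe, which keeps ephemeral lease nodes under a per-domain path and caps how many leases can exist at once. The connection string, the /crawler/domains path, and the class name are placeholders, not something from your setup:

```java
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2;
import org.apache.curator.framework.recipes.locks.Lease;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Caps how many workers may talk to a given domain at once, cluster-wide.
public class DomainThrottle {
    private final CuratorFramework client;
    private final int maxConcurrentConsumers;

    public DomainThrottle(String zkConnectionString, int maxConcurrentConsumers) {
        this.client = CuratorFrameworkFactory.newClient(
                zkConnectionString, new ExponentialBackoffRetry(1000, 3));
        this.client.start();
        this.maxConcurrentConsumers = maxConcurrentConsumers;
    }

    // Tries to acquire one of the per-domain leases (ephemeral nodes under the
    // domain's path). Returns null if the domain is already at its limit.
    public Lease tryAcquire(String domain) throws Exception {
        InterProcessSemaphoreV2 semaphore = new InterProcessSemaphoreV2(
                client, "/crawler/domains/" + domain, maxConcurrentConsumers);
        return semaphore.acquire(100, TimeUnit.MILLISECONDS);
    }

    // Closing the lease deletes the ephemeral node; it also disappears on its own
    // if the worker's ZooKeeper session dies, so crashed workers don't hold locks.
    public void release(Lease lease) throws Exception {
        lease.close();
    }
}
```

A worker would call tryAcquire(domain) before starting a job for that domain and release the lease once the job is done; a job that cannot get a lease simply goes back on the queue for later.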
I would like to suggest an architecture that scales reasonably well and performs better than random sleeps.
Each domain is associated with a queue of known pages that are to be crawled, and two additional fields:
- The time between requests for this domain.
- The earliest time when the next page can be requested.
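As a concrete illustration (a minimal sketch; the class and field names are placeholders), such a domain record could look like this. The page queue is a concurrent queue so that newly discovered URLs can be appended even while a worker currently owns the domain, which becomes relevant further down:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// One unit of work: a domain, its pending pages, and the two pacing fields.
class Domain implements Comparable<Domain> {
    final String name;
    final Queue<String> pendingPages = new ConcurrentLinkedQueue<>();
    Duration delayBetweenRequests; // time between requests for this domain
    Instant nextRequestAt;         // earliest time the next page may be requested

    Domain(String name, Duration initialDelay) {
        this.name = name;
        this.delayBetweenRequests = initialDelay;
        this.nextRequestAt = Instant.now();
    }

    // Priority-queue ordering: the domain whose next request is due earliest comes first.
    @Override
    public int compareTo(Domain other) {
        return this.nextRequestAt.compareTo(other.nextRequestAt);
    }
}
```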
We now have a priority queue of domains, sorted by the time of the next request. These domain objects are the units of work. Worker threads take units of work from the queue, and the queue guarantees that the unit handed out is the domain whose next request is due earliest. When a worker receives a unit of work, it first checks whether the specified request time lies in the future. If so, the thread sleeps until then. Otherwise, or after waking up:
- The next page for that domain is requested and processed. Possibly, new pages are added to that domain to be crawled.
- If the server requests rate limiting, then the time between requests for that domain is increased.
- The unit of work is given back to the job queue, and a new job is requested.
When the job queue receives ownership of the domain object back, it first checks whether there are any further pages to be crawled. If so, the time for the next request is calculated. This may be the time between requests added to the time of the previous request, or a random value with that sum as its minimum or mean.
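Here is a minimal sketch of both sides, building on the Domain class above. The central job queue is realized as a monitor object with synchronized methods rather than a dedicated thread, the doubling back-off and fetchAndProcess are placeholders, and next-request times are simply derived from the completion time:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Central job queue: hands out domains in next-request order and re-schedules
// them when workers hand them back. All shared state is touched only inside
// synchronized methods, so the scheduler is the single point of coordination.
class DomainScheduler {
    private final PriorityQueue<Domain> ready = new PriorityQueue<>();
    private final Set<Domain> checkedOut = new HashSet<>(); // currently owned by a worker

    // Blocks until a domain is available, then transfers ownership to the caller.
    synchronized Domain take() throws InterruptedException {
        while (ready.isEmpty()) {
            wait();
        }
        Domain domain = ready.poll();
        checkedOut.add(domain);
        return domain;
    }

    // Called by a worker after it has processed one page of the domain.
    // rateLimited is true if the server asked us to slow down (e.g. a 429).
    synchronized void complete(Domain domain, boolean rateLimited) {
        checkedOut.remove(domain);
        if (rateLimited) {
            // Doubling is just one possible back-off policy.
            domain.delayBetweenRequests = domain.delayBetweenRequests.multipliedBy(2);
        }
        if (!domain.pendingPages.isEmpty()) {
            domain.nextRequestAt = Instant.now().plus(domain.delayBetweenRequests);
            ready.add(domain);
            notifyAll();
        }
        // A domain with no pending pages stays out of the queue until new URLs
        // for it arrive (see the point about unknown/owned domains below).
    }
}
```

The worker side is then a simple loop:

```java
// Worker loop (simplified): wait until the domain's request time, fetch one page,
// then hand the domain back to the scheduler.
void workerLoop(DomainScheduler scheduler) throws Exception {
    while (true) {
        Domain domain = scheduler.take();
        long sleepMillis = Duration.between(Instant.now(), domain.nextRequestAt).toMillis();
        if (sleepMillis > 0) {
            Thread.sleep(sleepMillis);
        }
        String url = domain.pendingPages.poll();
        // fetchAndProcess is hypothetical: it crawls the page, adds discovered
        // links to the system, and returns true if the server answered with a 429.
        boolean rateLimited = fetchAndProcess(url, domain);
        scheduler.complete(domain, rateLimited);
    }
}
```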
This architecture has the advantage that ownership is clearly defined, so only one thread is making requests to any given domain at a time. The disadvantages are that the job queue does a lot of work and that there is some overhead from communication between threads.
There is an additional point to take care of: How are pages to be crawled added to a domain that is still unknown to the system, or that is currently owned by another thread? This should be handled by the job queue in order to avoid concurrency issues.
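In the sketch above, that would be one more synchronized method on the DomainScheduler, plus a registry of known domains. The initial delay and the linear contains() check on the priority queue are simplifications that are fine for a sketch; imports for Map and HashMap would need to be added:

```java
// Extension of the DomainScheduler sketch: the single entry point for newly
// discovered URLs, whether the domain is new, idle, queued, or owned by a worker.
private final Map<String, Domain> domains = new HashMap<>();

synchronized void submitUrl(String domainName, String url) {
    Domain domain = domains.computeIfAbsent(domainName,
            name -> new Domain(name, Duration.ofSeconds(2))); // placeholder initial delay
    domain.pendingPages.add(url); // safe even while a worker owns the domain (concurrent queue)
    // If the domain is already queued, the URL is picked up in due course.
    // If it is currently owned by a worker, complete() will re-queue it.
    // Only a new or idle domain has to be inserted into the priority queue here.
    if (!checkedOut.contains(domain) && !ready.contains(domain)) {
        domain.nextRequestAt = Instant.now().plus(domain.delayBetweenRequests);
        ready.add(domain);
        notifyAll();
    }
}
```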
Performance can be increased by having each worker thread request multiple pages simultaneously using asynchronous operations. This way, the time between sending an HTTP request and receiving an response can be used for working on other things.
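As one way to do that with Java's built-in HttpClient (Java 11+): the worker fires the request, immediately asks the scheduler for its next unit of work, and the domain is handed back in the callback once the response arrives. The class name and the response handling are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Asynchronous variant of the worker's fetch step.
class AsyncFetchStep {
    private final HttpClient client = HttpClient.newHttpClient();

    CompletableFuture<Void> fetch(Domain domain, String url, DomainScheduler scheduler) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenAccept(response -> {
                    boolean rateLimited = response.statusCode() == 429;
                    // Placeholder: parse the body here and feed any discovered
                    // links back through the scheduler's submitUrl(...) method.
                    scheduler.complete(domain, rateLimited);
                });
    }
}
```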