I'm running a service that crawls many websites daily. The crawlers are run as jobs processed by a number of independent background worker processes that pick up the jobs as they get enqueued.
Currently I'm doing the throttling "in-process", meaning that each thread does a random sleep between requests. However, having many asynchronous background workers that may from time to time be processing jobs for the same domain is causing concurrency trouble, and I need a central monitoring system that my workers can talk to and get directions from. I have been looking into a few patterns, e.g. the leaky bucket and rate limiting with Redis, but I feel I have yet to find the silver bullet.
My main concern is that not only do I have to track requests and throttle them, but I also need to capture feedback such as 429 response codes and use it to adapt the rate on a per-domain basis.
Do any of you have good ideas about patterns you have used, good resources on the topic, general advice, or other recommendations?
Thanks!
I suggest looking into Zookeeper to help you with this. We use Zookeeper to allow the servers to coordinate with each other as they're tasked with talking to domains.
The pattern works as follows:
- Persistent nodes representing your domain set are populated into Zookeeper.
- Jobs affecting those domains "lock" those domains with ephemeral nodes. You can tune how many locks are allowed in your code, thus specifying the maximum number of concurrent consumers (see the sketch at the end of this answer).
- You can additionally use the node body itself to track individual requests, either to limit the rate across your entire cluster or to specify the rate for each individual consumer, depending on your desired approach. You could even use an "AtomicLong" recipe (look into Curator's DistributedAtomicLong) to throttle by download size.
- You then assign listeners to the relevant nodes to allow the state information to be propagated to your consumers as it updates. FIFO ordering is guaranteed so that’s not a concern for you.
- Optionally, you can propagate additional cluster worker information to a centralized node and have a leader node (see the Curator 'Leader Election' recipe) handle auditing that information so you can review actual performance over time.
This is how we do it and it works as advertised.
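To make the domain-locking step concrete, here is a minimal sketch using Curator's InterProcessSemaphoreV2 recipe, which keeps ephemeral lease nodes under a per-domain path and caps how many leases can exist at once. The connection string, the /crawler/domains path, and the class name are placeholders, not something from your setup:

```java
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2;
import org.apache.curator.framework.recipes.locks.Lease;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Caps how many workers may talk to a given domain at once, cluster-wide.
public class DomainThrottle {
    private final CuratorFramework client;
    private final int maxConcurrentConsumers;

    public DomainThrottle(String zkConnectionString, int maxConcurrentConsumers) {
        this.client = CuratorFrameworkFactory.newClient(
                zkConnectionString, new ExponentialBackoffRetry(1000, 3));
        this.client.start();
        this.maxConcurrentConsumers = maxConcurrentConsumers;
    }

    // Tries to acquire one of the per-domain leases (ephemeral nodes under the
    // domain's path). Returns null if the domain is already at its limit.
    public Lease tryAcquire(String domain) throws Exception {
        InterProcessSemaphoreV2 semaphore = new InterProcessSemaphoreV2(
                client, "/crawler/domains/" + domain, maxConcurrentConsumers);
        return semaphore.acquire(100, TimeUnit.MILLISECONDS);
    }

    // Closing the lease deletes the ephemeral node; it also disappears on its own
    // if the worker's ZooKeeper session dies, so crashed workers don't hold locks.
    public void release(Lease lease) throws Exception {
        lease.close();
    }
}
```

A worker would call tryAcquire(domain) before starting a job for that domain and release the lease once the job is done; a job that cannot get a lease simply goes back on the queue for later.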
I would like to suggest an architecture that scales reasonably well and performs better than random sleeps.
Each domain is associated with a queue of known pages that are to be crawled, and two additional fields:
- The time between requests for this domain.
- The earliest time when the next page can be requested.
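As a concrete illustration (a minimal sketch; the class and field names are placeholders), such a domain record could look like this. The page queue is a concurrent queue so that newly discovered URLs can be appended even while a worker currently owns the domain, which becomes relevant further down:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// One unit of work: a domain, its pending pages, and the two pacing fields.
class Domain implements Comparable<Domain> {
    final String name;
    final Queue<String> pendingPages = new ConcurrentLinkedQueue<>();
    Duration delayBetweenRequests; // time between requests for this domain
    Instant nextRequestAt;         // earliest time the next page may be requested

    Domain(String name, Duration initialDelay) {
        this.name = name;
        this.delayBetweenRequests = initialDelay;
        this.nextRequestAt = Instant.now();
    }

    // Priority-queue ordering: the domain whose next request is due earliest comes first.
    @Override
    public int compareTo(Domain other) {
        return this.nextRequestAt.compareTo(other.nextRequestAt);
    }
}
```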
We now have a priority queue of domains, sorted by the time of the next request. These domain objects are the units of work. Worker threads take units of work from the queue, and the queue guarantees that the unit handed out is the domain whose next request is due earliest. When a worker receives a unit of work, it first checks whether the specified request time lies in the future. If so, the thread sleeps until then. Otherwise, or after waking up:
- The next page for that domain is requested and processed. Possibly, new pages are added to that domain to be crawled.
- If the server requests rate limiting, then the time between requests for that domain is increased.
- The unit of work is given back to the job queue, and a new job is requested.
When the job queue receives ownership of the domain object back, it first checks whether there are any further pages to be crawled. If so, the time for the next request is calculated. This may be the time between requests added to the time of the previous request, or a random value with that sum as its minimum or mean.
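Here is a minimal sketch of both sides, building on the Domain class above. The central job queue is realized as a monitor object with synchronized methods rather than a dedicated thread, the doubling back-off and fetchAndProcess are placeholders, and next-request times are simply derived from the completion time:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Central job queue: hands out domains in next-request order and re-schedules
// them when workers hand them back. All shared state is touched only inside
// synchronized methods, so the scheduler is the single point of coordination.
class DomainScheduler {
    private final PriorityQueue<Domain> ready = new PriorityQueue<>();
    private final Set<Domain> checkedOut = new HashSet<>(); // currently owned by a worker

    // Blocks until a domain is available, then transfers ownership to the caller.
    synchronized Domain take() throws InterruptedException {
        while (ready.isEmpty()) {
            wait();
        }
        Domain domain = ready.poll();
        checkedOut.add(domain);
        return domain;
    }

    // Called by a worker after it has processed one page of the domain.
    // rateLimited is true if the server asked us to slow down (e.g. a 429).
    synchronized void complete(Domain domain, boolean rateLimited) {
        checkedOut.remove(domain);
        if (rateLimited) {
            // Doubling is just one possible back-off policy.
            domain.delayBetweenRequests = domain.delayBetweenRequests.multipliedBy(2);
        }
        if (!domain.pendingPages.isEmpty()) {
            domain.nextRequestAt = Instant.now().plus(domain.delayBetweenRequests);
            ready.add(domain);
            notifyAll();
        }
        // A domain with no pending pages stays out of the queue until new URLs
        // for it arrive (see the point about unknown/owned domains below).
    }
}
```

The worker side is then a simple loop:

```java
// Worker loop (simplified): wait until the domain's request time, fetch one page,
// then hand the domain back to the scheduler.
void workerLoop(DomainScheduler scheduler) throws Exception {
    while (true) {
        Domain domain = scheduler.take();
        long sleepMillis = Duration.between(Instant.now(), domain.nextRequestAt).toMillis();
        if (sleepMillis > 0) {
            Thread.sleep(sleepMillis);
        }
        String url = domain.pendingPages.poll();
        // fetchAndProcess is hypothetical: it crawls the page, adds discovered
        // links to the system, and returns true if the server answered with a 429.
        boolean rateLimited = fetchAndProcess(url, domain);
        scheduler.complete(domain, rateLimited);
    }
}
```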
This architecture has the advantage that ownership is clearly defined, so only one thread is making requests to any given domain at a time. The disadvantages are that the job queue does a lot of work and that there is some overhead from communication between threads.
There is an additional point to take care of: How are pages to be crawled added to a domain that is still unknown to the system, or that is currently owned by another thread? This should be handled by the job queue in order to avoid concurrency issues.
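In the sketch above, that would be one more synchronized method on the DomainScheduler, plus a registry of known domains. The initial delay and the linear contains() check on the priority queue are simplifications that are fine for a sketch; imports for Map and HashMap would need to be added:

```java
// Extension of the DomainScheduler sketch: the single entry point for newly
// discovered URLs, whether the domain is new, idle, queued, or owned by a worker.
private final Map<String, Domain> domains = new HashMap<>();

synchronized void submitUrl(String domainName, String url) {
    Domain domain = domains.computeIfAbsent(domainName,
            name -> new Domain(name, Duration.ofSeconds(2))); // placeholder initial delay
    domain.pendingPages.add(url); // safe even while a worker owns the domain (concurrent queue)
    // If the domain is already queued, the URL is picked up in due course.
    // If it is currently owned by a worker, complete() will re-queue it.
    // Only a new or idle domain has to be inserted into the priority queue here.
    if (!checkedOut.contains(domain) && !ready.contains(domain)) {
        domain.nextRequestAt = Instant.now().plus(domain.delayBetweenRequests);
        ready.add(domain);
        notifyAll();
    }
}
```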
Performance can be increased by having each worker thread request multiple pages simultaneously using asynchronous operations. This way, the time between sending an HTTP request and receiving an response can be used for working on other things.
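As one way to do that with Java's built-in HttpClient (Java 11+): the worker fires the request, immediately asks the scheduler for its next unit of work, and the domain is handed back in the callback once the response arrives. The class name and the response handling are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Asynchronous variant of the worker's fetch step.
class AsyncFetchStep {
    private final HttpClient client = HttpClient.newHttpClient();

    CompletableFuture<Void> fetch(Domain domain, String url, DomainScheduler scheduler) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenAccept(response -> {
                    boolean rateLimited = response.statusCode() == 429;
                    // Placeholder: parse the body here and feed any discovered
                    // links back through the scheduler's submitUrl(...) method.
                    scheduler.complete(domain, rateLimited);
                });
    }
}
```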