I was wondering: if I scrape one site every few seconds, is DNS caching something I need to set up myself to avoid repeated DNS lookups, or is it done automatically?
Practical example: I have a script that checks a marketplace for new products every few seconds. Let's say the marketplace domain is martketpalce.com. To convert that domain to an IP address, a DNS lookup is sent to a DNS server. To avoid sending a DNS lookup every time I scrape martketpalce.com, should I do something in particular or not?
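To make the scenario concrete, here is a quick way to check whether something underneath the script is already caching (a sketch in Python; example.com stands in for the marketplace domain, which is just a placeholder):

```python
import socket
import time

# Time two consecutive lookups of the same name. If the second one is not
# noticeably faster, nothing between this script and the DNS server is
# caching. example.com stands in for the real marketplace domain.
for attempt in (1, 2):
    start = time.perf_counter()
    socket.getaddrinfo("example.com", 443)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"lookup {attempt}: {elapsed_ms:.1f} ms")
```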
Thank you
The short answer is: yes.
I drafted a couple of versions of this answer, but they all come down to a single underlying reason:
You should control your scraper's network behavior yourself, to ensure it loads pages consistently and reliably (one way to do that is sketched after the list below). If you leave this to whatever happens to sit underneath your script, your scraping sessions will fall apart in the real world.
A brief but incomplete list:
- Your utility (if you call `curl` or `wget`) might cache (or worse, might not).
- Your OS might have undocumented settings or middleware (`nscd`, anyone?).
- You might have (or might not have) a local caching DNS resolver, and it may differ between systems. Ask Microsoft for a detailed and full history of their Windows DNS caching.
- You might have different DNS configs if you move the tool from system to system.
- You might be behind various “transparent” proxies or load balancers.
- Your HTTP library might cache DNS, but with performance that varies with load, connection-pooling limits, poor error handling, etc.
- You run into something equally irritating that I haven’t mentioned yet.
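Here is the sketch I mentioned. If the script is in Python (an assumption; the question doesn't say), you can take resolution into your own hands with a small TTL cache in front of the standard resolver. Because `requests` and `urllib` both end up calling `socket.getaddrinfo`, one patch covers them. This is a minimal sketch, not a production implementation: the TTL and the simplified cache key are arbitrary choices, and it ignores thread safety.

```python
import socket
import time

# Minimal in-process DNS cache: wrap socket.getaddrinfo with a TTL cache.
# requests/urllib resolve hostnames through socket.getaddrinfo, so patching
# it covers them. Not thread-safe; the TTL below is an arbitrary example.

_original_getaddrinfo = socket.getaddrinfo
_cache = {}        # (host, port) -> (expires_at, resolved_result)
TTL_SECONDS = 300  # pick something sane for the target site

def cached_getaddrinfo(host, port, *args, **kwargs):
    key = (host, port)  # simplified: ignores family/type/proto in the key
    now = time.monotonic()
    entry = _cache.get(key)
    if entry is not None and entry[0] > now:
        return entry[1]  # cache hit: no network lookup
    result = _original_getaddrinfo(host, port, *args, **kwargs)
    _cache[key] = (now + TTL_SECONDS, result)
    return result

socket.getaddrinfo = cached_getaddrinfo
```

With that in place, hitting martketpalce.com every few seconds resolves the name at most once per TTL window, no matter what the OS, libc, or HTTP library underneath decides to do. The same idea applies in any language: resolve once, remember the answer, and re-resolve on your own schedule (or when connections start failing, since the site's IP can legitimately change).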