I would like to prevent Amazon from scraping product data on my website. So I found this document: https://developer.amazon.com/amazonbot
And this example:
User-agent: Amazonbot # Amazon's user agent
Disallow: /do-not-crawl/ # disallow this directory
So, if I add:
User-agent: Amazonbot # Amazon's user agent
Disallow: / # disallow access to all the website
or maybe
User-agent: Amazonbot # Amazon's user agent
Disallow: /Technology/ # disallow access to Technology category page
In particular, would the second rule prevent access to all product pages under the Technology category on the website?
What also concerns me is the mention of crawl-delay on their Help page.
I currently have:
User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 1
User-agent: Amazonbot
Disallow: /
This obviously includes a Crawl-delay directive, yet their Help page contains this comment:
Today, AmazonBot does not support the crawl-delay directive in robots.txt and robots meta tags on HTML pages such as “nofollow” and “noindex”.
Setting a robots.txt policy does not "prevent" any crawler from indexing your website. It is more like politely asking it not to. The robot may respect your request or ignore it.
Since Amazon explicitly states that they respect robots.txt but don't support crawl-delay, I would expect them to do just that. You can, however, block their crawler by other means, using the data they provide on their website:
- Identification by User-Agent
You may check for user agents containing the string "Amazonbot" (or simply "amazon") and answer them with a 403 error. There are multiple ways to do this; if you use Apache, you may add the following to your vhost configuration:
RewriteEngine on
# Match requests whose User-Agent contains "amazonbot", case-insensitively ([NC])
RewriteCond %{HTTP_USER_AGENT} amazonbot [NC]
# Serve a 403 Forbidden response and stop processing further rules
RewriteRule ^ - [R=403,L]
You may also use your firewall for this. If you use Cloudflare, it provides a built-in feature for blocking specific user agents. You could use iptables as well; however, this would take a toll on performance, since every packet would have to be inspected. Nevertheless, if you want to set it up, take a look at iptables -m string -h.
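As a starting point, a minimal sketch of such a rule might look like the following. Note the assumptions: it only works for plain HTTP on port 80, since the string match cannot see the User-Agent header inside encrypted HTTPS traffic, and the pattern is matched case-sensitively:
# Drop inbound HTTP packets whose payload contains the string "Amazonbot"
iptables -A INPUT -p tcp --dport 80 -m string --algo bm --string "Amazonbot" -j DROP
Unlike the Apache rule above, this silently drops the connection instead of returning a 403, which is one reason checking at the web server or CDN level is usually the cleaner option.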
- Identification by IP address and reverse DNS
This is a little more involved, but in general you can take the IP address of a bot accessing your website, run a reverse DNS lookup on it, and check whether the resulting hostname contains "amazon". I would certainly not suggest doing this for every request to your website, since it would take a heavy performance toll; however, you might retroactively scan your log files with such a script to look for a disobedient Amazonbot that doesn't respect robots.txt, as sketched below.
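A minimal sketch of such a log-scanning script in Python, assuming the common Apache combined log format where the client IP is the first field. The default log path and the plain "amazon" substring check are assumptions; for stronger verification you could additionally forward-resolve the returned hostname and confirm it maps back to the same IP:
#!/usr/bin/env python3
# Scan an access log for client IPs whose reverse DNS points at an "amazon" hostname.
import functools
import socket
import sys

@functools.lru_cache(maxsize=None)
def reverse_dns(ip):
    # Return the PTR hostname for ip, or "" if the lookup fails; cached per IP.
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""

def main(logfile):
    with open(logfile) as f:
        for line in f:
            ip = line.split(" ", 1)[0]  # first field of the combined log format
            host = reverse_dns(ip)
            if "amazon" in host.lower():
                print(ip, host, sep="\t")

if __name__ == "__main__":
    # Hypothetical default path; pass your own log file as the first argument.
    main(sys.argv[1] if len(sys.argv) > 1 else "/var/log/apache2/access.log")
Any IPs this flags are candidates for a firewall or web server block as described above.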