I’ve been thinking about a side project that involves web data scraping.
I read the question “Getting data from a webpage in a stable and efficient way”, and the discussion gave me some insights.
In the discussion, Joachim Sauer stated that you can contact the owners of the sites and work out some way for them to provide the data you want. The problem I see is that these websites are generally badly built and apparently seldom change their HTML, so I don’t think the owners will help me, but the data is relevant. I have suffered a lot using those sites, so I would like to aggregate the data and show it in a better way.
So, is going with scraping, specifically Scrapy (for Python), a problematic approach? I read that parse.ly uses scraping (Python and Scrapy), but in another context.
Given my context, is there a better approach than scraping? If I do go with scraping, how do I deal with changes in a website’s structure?
Downloading the contents of a website can cause a wide range of problems for the website owners.
- You might bottleneck the server by using all available resources to feed your script’s requests.
- You might make a mistake and perform requests that look like an attack.
- You might get stuck in what is called a robot trap and keep downloading the same page because the URL constantly changes.
- You might ignore the robots.txt file and access parts of the website the owners don’t want you to.
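To make the points in the list above concrete, here is a minimal Python sketch of the checks a well-behaved script would need before each request: consult robots.txt, wait between requests, never revisit a URL, and stop after a fixed number of pages. It is only an illustration under assumptions. The class name `PoliteFetchPolicy`, the `example-bot` user agent, and the default delay and page cap are all made up for this example, and a page cap is only a crude guard against robot traps, since a trap generates URLs that are always new.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetchPolicy:
    """Decides whether and when a URL may be requested (hypothetical helper)."""

    def __init__(self, robots_lines, user_agent, min_delay=1.0, max_pages=1000):
        self.user_agent = user_agent
        self.max_pages = max_pages      # hard cap: crude guard against robot traps
        self.seen = set()               # URLs already requested
        self.last_request = None        # clock value of the previous request
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)  # lines of the site's robots.txt
        # Use the site's Crawl-delay when it is longer than our own minimum.
        declared = self.robots.crawl_delay(user_agent)
        self.delay = max(min_delay, declared or 0)

    def allowed(self, url):
        """True if the URL is new, under the page cap, and permitted by robots.txt."""
        if url in self.seen or len(self.seen) >= self.max_pages:
            return False
        return self.robots.can_fetch(self.user_agent, url)

    def wait_and_mark(self, url, sleep=time.sleep, clock=time.monotonic):
        """Sleep until the delay has elapsed, then record the request."""
        if self.last_request is not None:
            remaining = self.delay - (clock() - self.last_request)
            if remaining > 0:
                sleep(remaining)
        self.last_request = clock()
        self.seen.add(url)


# Usage: call allowed(url) first, then wait_and_mark(url) right before fetching.
policy = PoliteFetchPolicy(
    ["User-agent: *", "Disallow: /private/", "Crawl-delay: 2"],
    user_agent="example-bot",  # assumed name, pick your own
)
```

A dedicated crawler handles all of this, and more, out of the box.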
It’s best practice to use a proper web crawling tool. Using the right tool for the job will ensure that you respect the performance, security and usage of the web server. These simple Python/PHP scripts for scraping websites do nothing but harm to the servers they ambush with thousands of web requests in an uncontrolled manner.
You should use a web crawler like Heritrix to download the website to an archive file. Once the archive file is created, you can process it using Python/PHP all you want. Since it’s stored locally on your hard drive, there is no harm in how many times you read it.
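As a rough illustration of that offline step, the Python sketch below pulls text out of a page that is already on disk, using only the standard library. It assumes the page’s HTML has already been extracted from the archive into a string. The helper names `ClassTextExtractor` and `extract` and the `item` class are hypothetical, and reading the archive format itself is not shown.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1           # entering a new match
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, css_class):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]
```

Because this works on the local copy, you can rerun it as often as you like while adjusting the extraction logic, and the website never sees another request.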
The ethical and legal issues of using content from another website are a completely different matter. I’m not even going to go there, because that’s between you and the website owner. What I don’t want to see is people hammering websites needlessly as they try to download the content. Be respectful and crawl by the same rules that companies like Google, Bing and Yahoo follow.