What is the way to go to extract data from websites? [closed]

I’ve been thinking about a side project that involves web data scraping.

I read the question “Getting data from a webpage in a stable and efficient way”, and the discussion there gave me some insights.

In that discussion, Joachim Sauer suggested contacting the owners of the sites and working out some way for them to provide the data I want. The problem I see is that the websites are generally badly built and their HTML apparently seldom changes (I don’t think the owners will help me), but the data is relevant. I have suffered a lot using those sites, so I would like to aggregate their data and present it in a better way.

So, is going with scraping, specifically Scrapy (for Python), a problematic approach? I read that parse.ly uses scraping (Python and Scrapy), but in another context.

Given my context, is there a better approach than scraping? And if I do go with scraping, how should I deal with changes to a website’s structure?

Downloading the contents of a website can cause a wide range of problems for the website owners. You might:

Bottleneck the server by using all of its available resources to serve your script’s requests.
Make a mistake and send requests that look like an attack.
Get stuck in what is called a robot trap and keep downloading the same page because its URL constantly changes.
Ignore the robots.txt file and access parts of the website the owners don’t want you to.
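
To make those risks concrete, here is a minimal Python sketch of the safeguards a well-behaved crawler applies: it checks robots.txt rules, pauses between requests, skips URLs it has already seen, and stops after a fixed number of pages. The user agent name, the delay, the page limit and the function names are all assumptions made up for illustration, and this is not how any particular crawling tool is implemented.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urldefrag

# Hypothetical bot name; use one that identifies your own project.
USER_AGENT = "example-side-project-bot"


def make_robots(robots_txt_lines):
    """Build a robots.txt rule checker from the lines of a robots.txt file."""
    robots = urllib.robotparser.RobotFileParser()
    robots.parse(robots_txt_lines)
    return robots


def fetch_with_urllib(url):
    """Download one page, identifying the bot through the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def polite_crawl(urls, robots, fetch, delay_seconds=2.0, max_pages=100):
    """Fetch each allowed URL at most once, pausing between requests."""
    seen = set()
    pages = {}
    for url in urls:
        if len(pages) >= max_pages:
            break  # hard cap, so a robot trap cannot run forever
        url, _fragment = urldefrag(url)  # "#fragment" does not change the page
        if url in seen:
            continue  # never download the same URL twice
        seen.add(url)
        if not robots.can_fetch(USER_AGENT, url):
            continue  # the owners asked crawlers to stay out of this path
        if pages:
            time.sleep(delay_seconds)  # give the server room to breathe
        pages[url] = fetch(url)
    return pages
```

In real use you would pass fetch_with_urllib as the fetch argument and load the site's actual robots.txt. Note that a trap which keeps generating new URLs is only bounded here by max_pages, not detected.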

It’s best practice to use a proper web crawling tool. Using the right tool for the job helps ensure that you respect the performance, security and usage rules of the web server. Simple Python/PHP scripts for scraping websites tend to do nothing but harm to the servers they ambush with thousands of uncontrolled web requests.

You should use a web crawler like Heritrix to download the website to an archive file. Once the archive file is created, you can process it using Python/PHP all you want. Since it’s stored locally on your hard drive, it does no harm however many times you read it.
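
As a rough illustration of the "process it locally" step, the Python sketch below pulls the page title out of HTML files stored on disk. It assumes the crawled pages have already been exported from the archive into a directory of plain .html files, which is an assumption for this example and not something the crawler does for you. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data  # text may arrive in several chunks


def extract_title(html_text):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


def titles_in_directory(directory):
    """Map each .html file name in a local directory to its page title."""
    titles = {}
    for path in sorted(Path(directory).glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        titles[path.name] = extract_title(text)
    return titles
```

Because this only reads local files, you can rerun it and rework the extraction logic as often as you like without sending a single request to the original server.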

The ethics and legal issues of using content from another website are a completely different matter. I’m not even going to go there, because that’s between you and the website owner. What I don’t want to see is people hammering websites needlessly as they try to download the content. Be respectful and crawl the web by the same rules that companies like Google, Bing and Yahoo follow.

Filed under: softwareengineering - @ 07:26

Tags: architecture, python, web-crawler, web-scraping
