Title: How to Efficiently Scrape Press Release Pages from 1000 Company Websites?
Body:
I am currently working on a project where I need to scrape the news pages of anywhere from 10 to at most 2,000 different company websites. The project is divided into two parts: an initial run to populate a database, and subsequent weekly (or otherwise periodic) updates.
I am stuck on the first step, initializing the database. My boss wants a “write-once, generalizable” solution that essentially mimics the behavior of search engines. However, based on my web scraping experience, I have some concerns:
- Dynamic content: many pages render their content client-side with JavaScript, so libraries like requests often return an empty shell, and I would need Selenium (or a similar browser-automation tool) to render the page; see the first sketch after this list.
- Pagination: even when I can fetch the first page, robustly walking every paginated archive during the initial database population is a significant challenge; see the second sketch after this list.
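To make the first concern concrete, here is roughly the tiered fetch I have in mind: try plain requests first and fall back to headless Selenium only when the raw HTML looks like an empty JavaScript shell. The function name and the size-based heuristic are just placeholders I made up, not something battle-tested:

```python
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_html(url: str, min_html_bytes: int = 2000) -> str:
    """Fetch a page cheaply; fall back to a rendered browser if needed."""
    resp = requests.get(url, timeout=15,
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    # Heuristic (assumption): a very small HTML body usually means the
    # real content is injected client-side, so re-fetch with a browser.
    if len(resp.text) >= min_html_bytes:
        return resp.text
    options = Options()
    options.add_argument("--headless=new")  # needs a recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # For a real crawl you would reuse one driver across URLs
        # instead of paying browser startup cost per page.
        driver.quit()
```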
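For the second concern, the most generic pagination walker I can think of follows `rel="next"` links (with an anchor-text fallback) until they run out. The selectors here are assumptions, and plenty of sites would still need per-site overrides:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def iter_pages(start_url: str, max_pages: int = 50):
    """Yield (url, html) pairs while a plausible 'next' link exists."""
    url, seen = start_url, set()
    for _ in range(max_pages):
        if url in seen:  # guard against pagination loops
            break
        seen.add(url)
        html = requests.get(url, timeout=15).text
        yield url, html
        soup = BeautifulSoup(html, "html.parser")
        # Prefer the semantic hint, if the site provides one.
        nxt = soup.find("link", rel="next") or soup.find("a", rel="next")
        if nxt is None:
            # Fallback (assumption): an anchor whose text looks like "next".
            nxt = soup.find("a", string=lambda s: s and
                            s.strip().lower() in {"next", ">", "»"})
        if nxt is None or not nxt.get("href"):
            break
        url = urljoin(url, nxt["href"])
```

This obviously does nothing for infinite-scroll pages or POST-driven pagers, which is exactly where I get stuck.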
My boss understands Python but is not deeply familiar with the intricacies of web scraping. He suggested researching how search engines handle this task to understand our limitations. While search engines have vastly more resources, our target is relatively small. The primary issue seems to be the complexity of the code required to handle pagination robustly. For a small team, implementing deep learning just for pagination seems overkill.
Could anyone provide insights or potential solutions for effectively scraping news pages from these websites? Any advice on handling dynamic content and pagination at scale would be greatly appreciated.
I’ve tried Selenium before, but page structures vary widely from site to site. If it is worth analyzing each company’s pages individually anyway, it would be even better to start with plain requests for the companies whose pages are static, but my boss did not accept this idea. 🙁
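For what it’s worth, that rejected idea would amount to a small per-site registry like the one below, promoting a site from requests to Selenium only when the cheap path fails. The domains and selectors are made-up examples:

```python
# Hypothetical per-site registry; every entry here is a made-up example.
SITE_CONFIG = {
    "example-a.com": {"fetch": "requests", "next_selector": "a.next"},
    "example-b.com": {"fetch": "selenium", "next_selector": "li.pager a"},
}

def strategy_for(domain: str) -> dict:
    # Default to the cheap path and only promote a domain to Selenium
    # once its requests-based fetch is observed to come back empty.
    return SITE_CONFIG.get(domain,
                           {"fetch": "requests", "next_selector": None})
```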