I am seeking assistance with a critical issue we are facing in our cross-border e-commerce auction and proxy purchase platform. Our system relies heavily on web crawling: it fetches Yahoo Auction pages, parses the HTML, and extracts the data needed for features such as product search, category products, a seller's selling products, product details, the next bid price, and the auction time.
Problem Description:
During peak hours, when user traffic is high, the pages we attempt to crawl frequently come back as error pages, specifically 404 (Not Found) and 500 (Internal Server Error). This severely impacts our system's ability to display data to our customers.
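To illustrate how the failures surface in our crawler, here is a simplified sketch (the URL and the `fetchListingPage` helper are placeholders for illustration, not our production code):

```js
const axios = require('axios');

// Simplified illustration of the failure pattern we observe.
async function fetchListingPage(url) {
  const response = await axios.get(url, {
    validateStatus: null, // resolve on every status so we can inspect it
    timeout: 10000,
  });

  if (response.status === 404 || response.status === 500) {
    // During peak hours this branch is hit for a large share of requests,
    // and response.data is an error HTML page instead of the listing.
    console.warn(`Crawl failed with ${response.status} for ${url}`);
    return null;
  }

  return response.data; // normal HTML that we parse for product data
}
```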
Environment:
Operating Systems: Linux and Windows Server
Programming Language / HTTP Client: Node.js with Axios (we rotate client info and the User-Agent header per request); a sketch of this rotation is shown below.
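The header rotation is roughly the following (a minimal sketch; the User-Agent pool and the `fetchWithRotatedHeaders` helper are simplified placeholders, not our production code):

```js
const axios = require('axios');

// Small pool of desktop browser User-Agent strings (illustrative values only).
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
];

// Build a fresh header set with a randomly chosen User-Agent for each request.
function rotatedHeaders() {
  const userAgent = USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
  return {
    'User-Agent': userAgent,
    Accept: 'text/html,application/xhtml+xml',
    'Accept-Language': 'ja,en;q=0.8',
  };
}

async function fetchWithRotatedHeaders(url) {
  return axios.get(url, { headers: rotatedHeaders(), timeout: 10000 });
}
```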
Solutions I have tried:
- Load Balancing: We deployed multiple servers with load balancing, each having its own public IP.
- IP Rotation: We assigned multiple IP addresses to each server and rotate the outgoing IP across crawl attempts; see the sketch after this list.
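The per-server IP rotation binds each outgoing connection to one of the server's public addresses, roughly like this (a sketch; the addresses below are placeholders for our real interface IPs):

```js
const https = require('https');
const axios = require('axios');

// Public IPs assigned to this server's interfaces (placeholder values).
const LOCAL_ADDRESSES = ['203.0.113.10', '203.0.113.11', '203.0.113.12'];

// One keep-alive agent per source address; the agent binds the outgoing socket
// to that local address, so requests leave from different public IPs.
const agents = LOCAL_ADDRESSES.map(
  (localAddress) => new https.Agent({ localAddress, keepAlive: true })
);

let counter = 0;

async function fetchWithRotatedIp(url) {
  const httpsAgent = agents[counter++ % agents.length]; // round-robin
  return axios.get(url, { httpsAgent, timeout: 10000 });
}
```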
Despite these measures, the problem persists.
Specific Use Cases Impacted:
- Product Search
- Category Products
- Seller's Selling Products
- Product Details
- Next Bid Price
- Auction Time
Error Messages:
- 404 Not Found
- 500 Internal Server Error
Request for Assistance:
I would greatly appreciate any advice or best practices on web crawling and infrastructure deployment that could help us overcome these challenges. Specifically, insights into handling high-volume crawling without triggering such errors would be invaluable.
Thank you very much for your time and assistance.