I recently wrote a scraping script in Python that uses SelectorLib to scrape data from some Amazon search result pages. I'm only running it against 9 or 10 pages, and it works fine for pages 1-4, but as soon as it reaches pages 5-10 it raises a "'NoneType' object is not iterable" error and doesn't scrape anything.
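As far as I understand, that message appears whenever Python iterates over a value that happens to be None, e.g. these two lines reproduce the exact error; I just can't tell where the None is coming from in my case:

for item in None:
    pass
# TypeError: 'NoneType' object is not iterable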
I have opened those pages separately in my browser to check whether the page structure is different, and it doesn't seem to be: the SelectorLib browser extension extracts data from all of the URLs just fine.
It doesn't appear to be a problem with running multiple pages in one go either, since the error still occurs when I run a single URL at a time.
I would appreciate any help. I'm just starting out and this is my first scraper, so please excuse the code quality.
Example URL (only the page number changes from one URL to the next):
https://www.amazon.com/s?k=gaming+laptops&i=computers&rh=n%3A13896617011%2Cp_n_deal_type%3A23566065011%2Cp_n_condition-type%3A2224371011%2Cp_n_feature_twenty-seven_browse-bin%3A23710032011&page=9
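For illustration, the ten URLs in my search_results_urls.txt file (read by the script below) only differ in that page parameter; they could be generated with something like this (the query string is copied from the URL above):

# Build one search URL per line for pages 1-10
base = ("https://www.amazon.com/s?k=gaming+laptops&i=computers"
        "&rh=n%3A13896617011%2Cp_n_deal_type%3A23566065011"
        "%2Cp_n_condition-type%3A2224371011"
        "%2Cp_n_feature_twenty-seven_browse-bin%3A23710032011"
        "&page={}")

with open("search_results_urls.txt", "w") as f:
    for page in range(1, 11):
        f.write(base.format(page) + "\n")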
I have tried everything I could think of: changing user agents, introducing delays between requests, and using proxies, but I still can't figure out what's going wrong. As I said, this is my first scraper, so maybe I'm doing something wrong entirely.
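For example, one of the user-agent/delay variants I tried looked roughly like this (the list is shortened to two placeholder strings and the delay bounds are just examples):

import random
from time import sleep
import requests

# Shortened sketch of the user-agent rotation + delay variant I tried;
# the strings below are placeholders, not the full list I used.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
]

def fetch(url):
    headers = {'user-agent': random.choice(USER_AGENTS)}
    r = requests.get(url, headers=headers)
    sleep(random.uniform(2, 5))  # wait a little between requests
    return r

The full script (minus the selector YAML) is below: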
from selectorlib import Extractor
import requests
import json
from time import sleep

# Create an Extractor from the YAML selector file
# (the filename is just what mine is called; the YAML itself isn't shown)
e = Extractor.from_yaml_file('search_results.yml')

def scrape(url):
    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)

    # Check whether the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None

    # Pass the page HTML to the extractor and return the structured data
    return e.extract(r.text)
with open("search_results_urls.txt", 'r') as urllist, open('search_results_output.jsonl', 'w') as outfile:
    for url in urllist.read().splitlines():
        try:
            data = scrape(url)
            if data is not None:
                for product in data['products']:
                    product['search_url'] = url
                    print("Saving Product: %s" % product['title'])
                    # Write one JSON object per line to the .jsonl file
                    json.dump(product, outfile)
                    outfile.write("\n")
            else:
                print(f"No data retrieved for URL: {url}")
            print("Waiting for 1 second before scraping the next page...")
            sleep(1)
        except Exception as exc:
            print(f"Error occurred while processing URL: {url}")
            print(f"Error message: {str(exc)}")
            continue
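As mentioned above, I also tried single URLs on their own to rule out the multi-page loop; that check looked roughly like this (page 9 hard-coded just for the test):

# Quick single-URL check to see what the extractor actually returns for a failing page
test_url = "https://www.amazon.com/s?k=gaming+laptops&i=computers&rh=n%3A13896617011%2Cp_n_deal_type%3A23566065011%2Cp_n_condition-type%3A2224371011%2Cp_n_feature_twenty-seven_browse-bin%3A23710032011&page=9"
print(scrape(test_url))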
Sorry for the long code