I am attempting to scrape business listing data from Yell.com (Yellow Pages UK) using Python, but I continually encounter a 403 Forbidden error. Despite trying various methods to make my requests look like a regular browser session, the site consistently blocks my scraping attempts.
What I’ve Tried:
Changing User-Agents: I rotated through several User-Agent strings to emulate different browsers, but none of them got past the 403 (a condensed sketch of the rotation and delay logic is shown after this list).
Using Headers: I included headers such as Referer and Accept-Language in my requests to mimic a typical browser session.
Session Management: I used requests.Session() to manage cookies and maintain a session across multiple requests.
Rate Limiting: I introduced random delays between requests to simulate human browsing timing (covered in the same sketch below).
Selenium: I also tried Selenium with a Chrome WebDriver to execute JavaScript and handle dynamically loaded content, assuming that part of the content I need is loaded that way (that attempt is shown after the requests code below).
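For reference, the User-Agent rotation and delay logic was roughly like the following. This is a condensed sketch: the User-Agent strings, the 2 to 6 second delay range, and the fetch() helper are illustrative rather than my exact values.

import random
import time

import requests

# Example pool of User-Agent strings to rotate through (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0',
]

session = requests.Session()

def fetch(url):
    # Pick a fresh User-Agent for each request
    session.headers.update({'User-Agent': random.choice(USER_AGENTS)})
    response = session.get(url, timeout=10)
    # Pause between requests to look less like a bot
    time.sleep(random.uniform(2, 6))
    return response

resp = fetch('https://www.yell.com/ucs/UcsSearchAction.do?keywords=electricians&location=London')
print(resp.status_code)  # still 403

My main requests + BeautifulSoup attempt is below: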
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'https://www.google.com'
}

# Reuse one session so any cookies the site sets persist across requests
session = requests.Session()
session.headers.update(headers)

try:
    response = session.get(
        'https://www.yell.com/ucs/UcsSearchAction.do?keywords=electricians&location=London',
        timeout=10
    )
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    soup = BeautifulSoup(response.text, 'html.parser')
except requests.exceptions.HTTPError as e:
    print(f'HTTP error occurred: {e}')  # this is what fires: 403 Forbidden
except Exception as e:
    print(f'An error occurred: {e}')
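The Selenium attempt looked roughly like this. It is a simplified sketch: the headless flag and the user-agent override are illustrative options rather than my exact setup, and I am only printing the title and the start of the page source to show that the response is still a block page.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
)

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.yell.com/ucs/UcsSearchAction.do?keywords=electricians&location=London')
    # Inspect what came back; in my case the page still indicates access is denied
    print(driver.title)
    print(driver.page_source[:500])
finally:
    driver.quit()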