I have written a web scraping script in Python with BeautifulSoup (bs4) and urllib3, and I still sometimes get this error: "'latin-1' codec can't encode character '\u0103' in position 52: ordinal not in range(256)".
For now I catch the error and skip the URL, but I really need to fetch and parse those links too instead of skipping them.
Here is my code snippet:
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
            "referer": current_url
        }
        try:
            response = requests.get(current_url, verify=False, allow_redirects=True, headers=headers, timeout=15)
            response.raise_for_status()
            response.encoding = 'utf-8'
            # Default to '' so a missing Content-Type header does not raise a TypeError
            if "text/html" not in response.headers.get('Content-Type', ''):
                continue
        except Exception as e:
            print(f"[{datetime.datetime.now().strftime('%d-%b%Y %H:%M')}]{current_url}: {e}")
            continue
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser", from_encoding="utf-8")...
    except Exception as e:
        print(f"[{datetime.datetime.now().strftime('%d-%b%Y %H:%M')}]'{current_url}': {e}")
What can I do about this? By the way, I already have my default encoding set to utf-8.
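
One thing that may be worth checking (a guess, not a confirmed diagnosis): Python's http.client encodes HTTP header values as latin-1, so a current_url containing a character such as 'ă' and passed as the referer header could produce this kind of error. Below is a minimal sketch of percent-encoding the URL before using it as a header value. The helper name header_safe_url and the example URL are hypothetical, not part of my script.

```python
from urllib.parse import quote


def header_safe_url(url: str) -> str:
    """Percent-encode non-ASCII characters so the URL can go into an HTTP header.

    Hypothetical helper for illustration only.
    """
    # Leave URL delimiters and existing %-escapes untouched;
    # every other character outside the unreserved set is encoded as UTF-8 bytes.
    return quote(url, safe=":/?#[]@!$&'()*+,;=%~")


# Made-up URL containing U+0103 (the character from the error message).
example_url = "https://example.com/bun\u0103"
encoded = header_safe_url(example_url)

# The encoded form is plain ASCII, so encoding it as latin-1 no longer fails.
encoded.encode("latin-1")
print(encoded)
```

If this turns out to be the cause, the headers dict could use "referer": header_safe_url(current_url) instead of the raw URL.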