I scraped some data from the web using:
import requests
from bs4 import BeautifulSoup

def get_lines_from_url(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        lines = soup.get_text("\n").strip().splitlines()
        return lines
While printing them with:
all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)
I encounter UnicodeEncodeError: 'charmap' codec can't encode character '\u2588' in position 0: character maps to <undefined>
I printed the encoded version of each line with:
for line in all_lines:
    print(line.encode('utf-8'))
The culprit was a line containing b'\xe2\x96\x88'.
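For reference, the failure can be reproduced without scraping anything. This is a minimal sketch that assumes the console encoding is cp1252 (a common Windows default, and only an assumption here, since the error message names just 'charmap'):

```python
ch = "\u2588"  # FULL BLOCK, the character named in the error message

# UTF-8 can represent the character; these are the bytes seen above.
utf8_bytes = ch.encode("utf-8")

# cp1252 is assumed to be the 'charmap' codec in use; it has no slot for U+2588.
try:
    ch.encode("cp1252")
    cp1252_ok = True
    error_text = ""
except UnicodeEncodeError as exc:
    cp1252_ok = False
    error_text = str(exc)

print(utf8_bytes)
print(cp1252_ok)
print(error_text)
```

So the string itself seems fine; the error only appears when it has to be converted to bytes in an encoding that lacks the character.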
I did some research and found that all lines print when I include sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
as in:
import sys
import codecs
# other imports...
# `get_lines_from_url` declaration...
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
all_lines = get_lines_from_url(link)
for line in all_lines:
    print(line)
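To look at the wrapper in isolation, here is a minimal sketch that swaps the real console for an in-memory buffer. The `raw` object is a hypothetical stand-in for whatever `sys.stdout.detach()` returns, so this only illustrates the encoding step, not the actual console behaviour:

```python
import codecs
import io

# Hypothetical stand-in for the binary stream returned by sys.stdout.detach().
raw = io.BytesIO()

# codecs.getwriter("utf-8") returns a StreamWriter class; calling it wraps the
# binary stream so that str written to it is encoded as UTF-8 bytes.
writer = codecs.getwriter("utf-8")(raw)
writer.write("\u2588\n")

written = raw.getvalue()
print(written)
```

The bytes that land in the buffer are the UTF-8 form of the character, which suggests the wrapper bypasses the console's default codec.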
What does sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach()) do?
Why couldn't Python print b'\xe2\x96\x88' directly?