I am trying to access the contents of a webpage using urllib and bs4:
<code>import bs4
from urllib.request import Request, urlopen
url = "https://ar5iv.labs.arxiv.org/html/2309.10034"
req = Request(url=url, headers={'User-Agent': 'Mozilla/7.0'})
webpage = str(urlopen(req).read())
soup = bs4.BeautifulSoup(webpage)
text = soup.get_text()
</code>
However, the result contains all kinds of non-ASCII characters and escape sequences like \n, \xc2, \x89, subscripts, and so on. I want to remove all those characters and extract the plain text only. Is this possible, and how can I do it?
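I suspect the \xc2 and \x89 sequences show up because I call str() on the raw bytes instead of decoding them, so my tentative fix (assuming the page is UTF-8 and that the built-in html.parser is fine here) looks like the sketch below, but I am not sure this is the right way to strip the remaining non-ASCII characters:
<code>import bs4
from urllib.request import Request, urlopen

url = "https://ar5iv.labs.arxiv.org/html/2309.10034"
req = Request(url=url, headers={'User-Agent': 'Mozilla/7.0'})

# decode the response bytes instead of wrapping them in str(),
# so multi-byte UTF-8 characters are not left as \xc2... escapes
html = urlopen(req).read().decode('utf-8')

soup = bs4.BeautifulSoup(html, 'html.parser')
text = soup.get_text()

# drop anything still outside the ASCII range and collapse whitespace
ascii_text = text.encode('ascii', errors='ignore').decode('ascii')
plain = ' '.join(ascii_text.split())
</code>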