!pip install httpx
!pip install selectolax
import httpx
from selectolax.parser import HTMLParser
After importing these libraries, I run the following code to extract the text of every ‘p’ tag from each URL in a list of URLs.
import requests
urls = ['http://toofab.com/2017/05/08/real-housewives-atlanta-kandi-burruss-rape-phaedra-parks-porsha-williams/', 'https://www.today.com/style/see-people-s-choice-awards-red-carpet-looks-t141832', 'https://www.zerchoo.com/entertainment/gossip-girl-10-years-later-how-upper-east-siders-shocked-the-world-changed-pop-culture-forever/', 'www.intouchweekly.com/posts/gwen-stefani-dumped-156076']
text_column = []
for url_index in range(len(urls)):
    try:
        resp = requests.get(urls[url_index],
                            headers={
                                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
                            }
                            )
        html = HTMLParser(resp.text)
        # Select a specific tag with css(), in this example 'p', then iterate over
        # the occurrences using a variable called element.
        article = [element.text().strip() for element in html.css('p')]
        # Convert the list to a string.
        article_text = ''.join(map(str, article))
        text_column.append(article_text)
        # article_body = soup.select("p")
    except (httpx.ConnectError, httpx.HTTPStatusError):
        print(f"Connection error for URL: {urls[url_index]}")
        continue  # skip to the next URL
Although I added the except clause to skip the bad URLs, I still receive this error:
SSLError: HTTPSConnectionPool(host='www.zerchoo.com', port=443): Max retries exceeded with url: /entertainment/gossip-girl-10-years-later-how-upper-east-siders-shocked-the-world-changed-pop-culture-forever/ (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))
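One thing worth noting about the traceback: HTTPSConnectionPool and this SSLError come from requests (via urllib3), while the except clause above only lists httpx exceptions, so it would not match an error raised by requests.get. A minimal sketch of catching the requests exception hierarchy instead is shown here. The helper name fetch_html, the shortened User-Agent, and the 10-second timeout are assumptions made for illustration, not part of the original code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        resp = requests.get(url, headers=headers, timeout=timeout)
        # Turn 4xx/5xx responses into requests.exceptions.HTTPError.
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class for SSLError, ConnectionError,
        # HTTPError, Timeout, MissingSchema and the other requests errors.
        print(f"Skipping {url}: {type(exc).__name__}")
        return None
    return resp.text
```

In the loop, a None result could be used to skip to the next URL before calling HTMLParser. The same handler should also cover the last entry in the list, which has no http:// or https:// scheme and is rejected by requests before any connection is attempted.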