I want to extract the text content of an a-tag element. The code looks like this:
<a data-autid="article-url" href="linkToTheWebsite">HERE STANDS THE TEXT I WANT TO EXTRACT</a>
In the past, these a-tag elements didn’t have a “data-” attribute but a normal “id” attribute, which was super simple to extract. Now I have no idea how this should work. I tried this, but it doesn’t appear to do the job:
self.article_title = item.select_one('a', data_autid='article-url').text.strip()
Any idea what I could do?
You can use an [attr=value] CSS selector:
"Represents elements with an attribute name of attr whose value is exactly value."
To use a CSS selector, use the .select_one() method instead of find().
In your example:
from bs4 import BeautifulSoup
html = """<a data-autid="article-url" href="linkToTheWebsite">HERE STANDS THE TEXT I WANT TO EXTRACT</a>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one('a[data-autid="article-url"]').text)
# Output: HERE STANDS THE TEXT I WANT TO EXTRACT
Or, if you want to use find():
print(soup.find("a", attrs={"data-autid": "article-url"}).text)
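Both calls return None when no matching tag exists, so chaining .text directly raises an AttributeError on pages where the link is missing. A minimal sketch of a guarded version follows; the div.item wrapper, the sample markup, and the extract_title helper are hypothetical and only stand in for whatever `item` is in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two items, only the first contains the tagged link.
html = """
<div class="item"><a data-autid="article-url" href="linkToTheWebsite">  First title  </a></div>
<div class="item"><a href="other">No data attribute here</a></div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_title(item):
    """Return the stripped link text, or None if the link is missing."""
    link = item.select_one('a[data-autid="article-url"]')
    # select_one() returns None when nothing matches, so check before reading text
    return link.get_text(strip=True) if link is not None else None


titles = [extract_title(item) for item in soup.select("div.item")]
print(titles)  # ['First title', None]
```

Using get_text(strip=True) replaces the separate .text.strip() step from the question.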
You can try this:
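from lxml import html
import requests
# Use a separate name for the response so it does not shadow the lxml "html" module
response = requests.get('yoururl')
tree = html.fromstring(response.content)
# xpath() returns a list of matching text nodes, not a single string
yourtext = tree.xpath('//a[@data-autid="article-url"]/text()')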
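Note that xpath() gives back a list, so the text still has to be picked out of it. A small sketch of that last step, parsing an inline string instead of fetching a URL so it runs offline; the sample markup and the lxml_html alias are assumptions for illustration:

```python
from lxml import html as lxml_html

# Hypothetical page content; in practice this would be response.content
page = (
    '<div><a data-autid="article-url" href="linkToTheWebsite">'
    "HERE STANDS THE TEXT I WANT TO EXTRACT</a></div>"
)

tree = lxml_html.fromstring(page)

# xpath() returns a list of matching text nodes (empty if nothing matches)
matches = tree.xpath('//a[@data-autid="article-url"]/text()')

# Take the first match if there is one, otherwise fall back to None
title = matches[0].strip() if matches else None
print(title)  # HERE STANDS THE TEXT I WANT TO EXTRACT
```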