I am trying to scrape the link behind the "previous" button on this website. (The goal is to enrich the data for a RAG chatbot.)
<code>https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm
</code>
The prev/next buttons are in the top right corner. The link that has to be extracted on the given example subpage would be this one:
<code>href="https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/Prinect/measuring/measuring-3.htm"
</code>
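Comparing the two URLs above suggests that the `#t=` fragment holds the topic path relative to the help root, percent-encoded. The sketch below rests on that assumption, which the site does not confirm, and derives the direct URL of the current topic from the hash URL using only the standard library. The function name `topic_url_from_hash` is hypothetical, and no network request is made.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def topic_url_from_hash(url: str) -> str:
    """Turn '<base>/#t=<encoded topic path>' into '<base>/<topic path>'.

    Assumption: the fragment has the form 't=<percent-encoded path>' and the
    topic lives at that path relative to the part of the URL before '#'.
    """
    parts = urlsplit(url)
    # parse_qs also decodes the percent-encoding (%2F becomes '/').
    params = parse_qs(parts.fragment)
    topic_path = params["t"][0]
    # Drop the fragment to get the help root, then resolve the topic against it.
    base = parts._replace(fragment="").geturl()
    return urljoin(base, topic_path)


if __name__ == "__main__":
    hash_url = (
        "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/"
        "Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
    )
    print(topic_url_from_hash(hash_url))
```

This only rewrites the address of the page being viewed (`measuring-4.htm`); it does not by itself reveal which page the previous button points to.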
I tried the standard approach with BeautifulSoup:
<code>from bs4 import BeautifulSoup
import requests
url = "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
# get full html section
test1 = soup.find(id="browseSeqBack")
print(test1)
# get full html section test 2
test2 = soup.find("div", class_="brs_previous").children
print(test2)
# get link directly test 3
secBackButton = soup.find(id="browseSeqBack")
href = secBackButton.attrs.get('href', None)
print(href)
</code>
However, tests 1 and 2 do not return the complete HTML section, and the direct query for the link (test 3) does not work either.
Test 1 returns only this section:
<code><a class="wBSBackButton" data-attr="href:.l.brsBack" data-css="visibility: @.l.brsBack?'visible':'hidden'" data-rhwidget="Basic" id="browseSeqBack">
<span aria-hidden="true" class="rh-hide" data-html="@KEY_LNG.Prev"></span>
</code>
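One possible reading of this output, offered as an assumption rather than a confirmed diagnosis: the anchor carries no `href` attribute in the static HTML, and the `data-attr="href:.l.brsBack"` binding looks like a placeholder that the page's JavaScript fills in later, which `requests` never executes. The sketch below only checks the first part, using the standard-library `html.parser` on the snippet exactly as returned. The class name `AnchorAttrs` is hypothetical.

```python
from html.parser import HTMLParser

# The markup returned by test 1, copied verbatim (it is cut off after the span).
SNIPPET = '''<a class="wBSBackButton" data-attr="href:.l.brsBack" data-css="visibility: @.l.brsBack?'visible':'hidden'" data-rhwidget="Basic" id="browseSeqBack">
<span aria-hidden="true" class="rh-hide" data-html="@KEY_LNG.Prev"></span>
'''


class AnchorAttrs(HTMLParser):
    """Collect the attributes of the <a> element with a given id."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.attrs = None  # stays None if the element is never seen

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("id") == self.target_id:
            self.attrs = attr_map


parser = AnchorAttrs("browseSeqBack")
parser.feed(SNIPPET)
parser.close()

# List the attribute names that are present and check for a real href.
print(sorted(parser.attrs))
print("href" in parser.attrs)
```

If the assumption holds, any approach that reads only the raw response would see the same placeholder, whichever parser is used.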
Thanks in advance 🙂