I’ve gotten to a point where
print(soup.td.a)
results in
<a href="/?p=section&a=details&id=37627">Some Text Here</a>
I’m trying to figure out how I can filter further so all that results is
37627
I’ve tried a number of things including urlparse and re.compile but I’m just not getting the syntax correct. Plus I feel like there is probably an easier way that I’m just not finding. I appreciate any help given.
You can use the parse_qs() function from urllib.parse to parse the query string:
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
html_content = '''
<td>
<a href="/?p=section&a=details&id=37627">Some Text Here</a>
</td>
'''
# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
# Find the <a> tag
a_tag = soup.find('a')
# Extract the href attribute
href = a_tag.get('href')
# Parse the URL to get the query parameters
parsed_url = urlparse(href)
# for py2 (after "import urlparse"): parsed_url = urlparse.urlparse(href)
query_params = parse_qs(parsed_url.query)
# Get the 'id' parameter
id_value = query_params.get('id', [None])[0]
print(id_value) # Output: 37627
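If you need this for more than one link, the same steps can be wrapped in a small helper. This is only a sketch: the function name get_query_param is made up for illustration, and it assumes you pass it the href string you already pulled out of the tag (for example soup.td.a.get('href')).

```python
from urllib.parse import urlparse, parse_qs

def get_query_param(href, name):
    """Return the first value of query parameter `name` in `href`, or None.

    `get_query_param` is a hypothetical helper name, not part of any library.
    """
    if not href:
        # The tag had no href attribute (Tag.get returned None) or it was empty
        return None
    # parse_qs maps each parameter name to a list of its values
    values = parse_qs(urlparse(href).query).get(name)
    return values[0] if values else None

print(get_query_param("/?p=section&a=details&id=37627", "id"))
```

Returning None for a missing href or a missing parameter means the caller can skip links that have no id instead of hitting an exception.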
There are two ways to do this. You can cut the string from ‘id=’ up to the character that ends the value (the closing double quote, or ‘&’ if the link can have more query params), or you can use a regex. I would prefer the regex, as it is simpler and more accurate.
FIRST SOLUTION: (Cut the string)
# work on the href string, not on the Tag itself: Tag.find() searches for child tags, not substrings
href = soup.td.a['href']
# check that the string contains the part 'id='
id_start = href.find('id=')
if id_start != -1:
    # id_start is the index of 'id='; add its length to get the index where the value starts
    id_start += len('id=')
    # inside the href the value ends at the next '&', or at the end of the string if 'id' is the last param
    id_end = href.find('&', id_start)
    if id_end == -1:
        id_end = len(href)
    id_value = href[id_start:id_end]
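The same cut can also be written with str.partition, which avoids the index arithmetic. This is a rough sketch, and cut_id is a made-up name; it assumes the href string has already been read from the tag. Like the version above, it matches any 'id=' substring, so a parameter such as 'pid=' would also be picked up.

```python
def cut_id(href):
    """Return the text between the first 'id=' and the next '&', or None.

    `cut_id` is a hypothetical helper name used only for this example.
    """
    # partition splits at the first 'id='; `found` is '' when it is absent
    _, found, rest = href.partition('id=')
    if not found:
        return None
    # keep only what comes before the next '&' (the whole rest if there is none)
    return rest.partition('&')[0]

print(cut_id("/?p=section&a=details&id=37627"))
```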
SECOND SOLUTION:
import re
# Regular expression to find the 'id' parameter (\d+ matches one or more digits)
regex = r'id=(\d+)'
# Search for the pattern in the href string (re.search needs a string, not a Tag)
match = re.search(regex, soup.td.a['href'])
if match:
    id_value = match.group(1)
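One thing to watch with a bare 'id=' pattern is that it also matches inside longer parameter names such as 'pid='. A slightly stricter variant, shown here as a sketch with a made-up function name (extract_id) and assuming the href string is passed in, requires 'id=' to come right after '?' or '&' or at the start of the string:

```python
import re

# 'id=' must follow '?', '&', or the start of the string, so 'pid=' does not match
ID_PATTERN = re.compile(r'(?:^|[?&])id=(\d+)')

def extract_id(href):
    """Return the numeric id from the query string as a str, or None.

    `extract_id` is a hypothetical helper name used only for this example.
    """
    match = ID_PATTERN.search(href)
    return match.group(1) if match else None

print(extract_id("/?p=section&a=details&id=37627"))
```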
Hope it works!!