I am in the process of building a web scraper to extract the URL, Headline, Name of Publisher, Company Name, Phone Number and the Email.
https://www.prnewswire.com/news-releases/atrenew-to-report-second-quarter-2024-financial-results-on-august-20-2024-302215138.html (sample)
There are hundreds of news articles every day.
I have no issues with the Headline & URL as the tags are constant for every news article.
I use regex to extract the email and the phone number; no issues there either.
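For reference, a minimal standalone sketch of that regex step, using patterns of the kind the script relies on. The function name `extract_email_and_phone` and the sample contact line are made up for illustration, and the patterns are deliberately loose rather than a full validation of either format.

```python
import re

# Loose illustrative patterns: not a complete definition of a valid email or phone number.
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")


def extract_email_and_phone(text):
    """Return (email, phone) found in text; each is None when there is no match."""
    email_match = EMAIL_RE.search(text)
    phone_match = PHONE_RE.search(text)
    email = email_match.group(0) if email_match else None
    phone = phone_match.group(0) if phone_match else None
    return email, phone


# Hypothetical contact line, not taken from a real press release.
sample = "Jane Doe, Acme Corp, +1 (555) 010-0199, jane.doe@example.com"
email, phone = extract_email_and_phone(sample)
```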
Where I have a problem is with the Name, Company Name and Job Title.
The name is sometimes in an xn span class, which is easy to extract.
But on some pages the name, company name and job title are in plain text, which is hard for me to extract correctly.
At first I thought of separating them by commas, but their position in the text is not constant; they can appear in a different order.
Is there a proper method to extract name, Company name and Job Title?
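To make the positional problem concrete: one way to avoid relying on order is to label each comma-separated part by what it contains. The sketch below is only a rough illustration of that idea. The keyword lists `TITLE_WORDS` and `COMPANY_WORDS` and the function `classify_parts` are hypothetical, and a company without a recognisable suffix can still be mistaken for a person's name.

```python
import re

# Hypothetical keyword lists for illustration; real data would need much longer ones.
TITLE_WORDS = {"director", "manager", "officer", "president", "head",
               "vp", "ceo", "cfo", "coordinator", "specialist"}
COMPANY_WORDS = {"inc", "ltd", "llc", "corp", "corporation", "group",
                 "company", "co", "plc", "gmbh"}

# Two to four capitalised words, e.g. "Jane Doe" or "Mary Ann O'Neil".
NAME_RE = re.compile(r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*){1,3}")


def classify_parts(contact_text):
    """Label comma-separated parts as name, job_title or company by their content."""
    result = {"name": "", "job_title": "", "company": ""}
    for part in (p.strip() for p in contact_text.split(",")):
        if not part:
            continue
        words = {w.lower().strip(".") for w in part.split()}
        if words & TITLE_WORDS:
            key = "job_title"
        elif words & COMPANY_WORDS:
            key = "company"
        elif NAME_RE.fullmatch(part):
            key = "name"
        else:
            continue  # Unrecognised part: leave it unlabelled
        if not result[key]:  # Keep only the first match for each field
            result[key] = part
    return result


first = classify_parts("Jane Doe, Director of Communications, Acme Corp")
second = classify_parts("Acme Corp, Jane Doe, Director of Communications")
```

Both calls give the same labels even though the order of the parts differs, which is the property a purely positional split lacks.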
I have zero programming skills and am using ChatGPT to help.
For the time being, my Python script uses BeautifulSoup, requests, the built-in html.parser and Selenium with a headless browser (without Selenium the email appears as protected).
The extracted information in the CSV files sometimes says "Not found" for the name, company name and job title. Sometimes the company name is extracted as the name. It's a mess.
import re
from bs4 import BeautifulSoup, NavigableString

# `driver` (the headless Selenium browser) and `article_urls` are assumed to be defined earlier.
for url in article_urls:
    # Fetch the webpage
    driver.get(url)
    html_content = driver.page_source

    # Parse HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract headline
    headline_tag = soup.find('h1')  # Adjust this if the headline is under a different tag or class
    headline = headline_tag.text.strip() if headline_tag else "Headline not found"

    # Extract Media Contact details
    contact_section = soup.find('b', string="Media Contact")
    if contact_section:
        contact_section = contact_section.find_parent()  # Find the parent element containing the details
        # find_next_sibling returns None when there is no following <p>, so guard against that
        contact_paragraph = contact_section.find_next_sibling('p')
        contact_info = contact_paragraph.text.strip() if contact_paragraph else ""

        # Initialize variables
        name = company = phone = "Not found"
        email = "Not found"

        # Extract the email
        email_match = re.search(r'[\w.-]+@[\w.-]+', contact_info)
        if email_match:
            email = email_match.group(0)
            contact_info = contact_info.replace(email, '')

        # Extract the phone number
        phone_match = re.search(r'\+?\d[\d\s()-]{7,}\d', contact_info)
        if phone_match:
            phone = phone_match.group(0)
            contact_info = contact_info.replace(phone, '')

        # Extract the remaining contact info parts
        parts = [part.strip() for part in contact_info.split(',') if part.strip()]
        if len(parts) > 0:
            name = parts[0]
        if len(parts) > 1:
            company = ', '.join(parts[1:])

        # Remove any URLs from the company name
        company = re.sub(r'https?://\S+', '', company).strip()

        # Remove text after the last comma
        if ',' in company:
            company = company.rsplit(',', 1)[0].strip()

        # Find and exclude #text elements following the email tag
        for sibling in contact_section.next_siblings:
            if isinstance(sibling, NavigableString) and sibling.strip():
                sibling_text = sibling.strip()
                if sibling_text.startswith(('http://', 'https://')):
                    company = company.replace(sibling_text, '').strip()
                elif '@' in sibling_text:
                    continue
                else:
                    break

        # Avoid using "Editor" as a name
        if name.lower() == 'editor':
            name = "Name not provided"
    else:
        name = company = phone = email = "Media Contact details not found"
Qazi Matiullah is a new contributor to this site. Take care in asking for clarification, commenting, and answering.