I am building a web scraper to extract the URL, headline, name of the publisher, company name, phone number and email from news articles.
Sample article: https://www.prnewswire.com/news-releases/atrenew-to-report-second-quarter-2024-financial-results-on-august-20-2024-302215138.html
There are hundreds of news articles every day.
I have no issues with the Headline & URL as the tags are constant for every news article.
I use regex to extract the email and the phone number, and there are no issues there either.
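For context, a minimal sketch of the kind of regex extraction described above. The two patterns and the sample contact line are assumptions made up for illustration, not the patterns from my actual script, and real contact blocks may need looser rules.

```python
import re

# Assumed patterns for illustration only; tune them against real articles.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text):
    """Return (emails, phones) found in a block of text."""
    emails = EMAIL_RE.findall(text)
    phones = [p.strip() for p in PHONE_RE.findall(text)]
    return emails, phones


# Made-up contact line, not taken from a real article.
sample = "Contact: Jane Doe, Example Corp, +1 (212) 555-0100, jane.doe@example.com"
emails, phones = extract_contacts(sample)
print(emails, phones)
```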
What I have a problem with is the name, company name and job title.
The name is sometimes in a span with an xn class, which is easy to extract.
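A rough sketch of that easy case, assuming the class name starts with "xn-" (for example "xn-person"; the exact class names are an assumption here). It uses Python's built-in html.parser instead of BeautifulSoup so that it runs without extra installs, and the HTML snippet is made up.

```python
from html.parser import HTMLParser


class XnSpanParser(HTMLParser):
    """Collect text inside <span> tags whose class starts with 'xn-'.

    Sketch only: nested spans inside an xn span are not handled.
    """

    def __init__(self):
        super().__init__()
        self.found = []        # list of (class_name, text) pairs
        self._current = None   # class of the xn span currently being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and self._current is None:
            cls = dict(attrs).get("class") or ""
            if cls.startswith("xn-"):  # assumed class prefix
                self._current = cls
                self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.found.append((self._current, "".join(self._buffer).strip()))
            self._current = None


# Made-up snippet, not copied from a real article.
html = '<p>Contact: <span class="xn-person">Jane Doe</span>, Example Corp</p>'
parser = XnSpanParser()
parser.feed(html)
print(parser.found)
```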
But in some cases the name, company name and job title are embedded in plain text, which is hard for me to extract correctly.
First I thought of splitting the text on commas, but the position of each item in the text is not constant; they can appear in a different order.
Is there a proper method to extract the name, company name and job title?
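A small sketch of the comma-splitting idea and why it breaks. The function name split_contact, the assumed order (name, job title, company) and both sample lines are hypothetical, made up for illustration.

```python
def split_contact(line):
    """Naive guess: 1st part is the name, 2nd the job title, 3rd the company."""
    parts = [p.strip() for p in line.split(",")]
    parts += [""] * (3 - len(parts))  # pad so unpacking never fails
    name, title, company = parts[:3]
    return {"name": name, "title": title, "company": company}


# Works when the order matches the assumption...
a = split_contact("Jane Doe, Head of Investor Relations, Example Corp")
# ...but the company lands in the "name" field when the order changes.
b = split_contact("Example Corp, Jane Doe, Head of Investor Relations")
print(a)
print(b)
```

The second call puts the company name in the name field, which is the same kind of mix-up I see in my output.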
I have zero programming skills and I am using ChatGPT to help.
For the time being my Python script uses BeautifulSoup, requests, html.parser and Selenium with a headless browser (without Selenium the email appears as protected).
In the extracted CSV files, the name, company name and job title sometimes say "not found". Sometimes the company name is extracted as the name. It's a mess.