I want to extract some information about a specific drug (let's say Rolvedon) from this site.
I tried BeautifulSoup and Scrapy, but both seem to depend heavily on the page format. I want the code to be more flexible and reusable for other links, [like this one](https://www.accessdata.fda.gov/drugsatfda_docs/label/2022/761148Orig1s000Corrected_lbl.pdf) or similar.
BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'https://www.drugs.com/rolvedon.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
p_tags = soup.find_all('p')
drug_class = None
for p in p_tags:
    if 'Drug class:' in p.text:
        drug_class = p.text.split('Drug class:')[-1].strip()
        break
if drug_class:
    print(f"Drug Class: {drug_class}")
else:
    print("Drug class information not found.")
Output – Drug Class: Colony stimulating factors
But this works only if I know which HTML tag holds the content, which I might not in many cases.
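One direction that might reduce the tag dependence, as a minimal sketch only: search the text nodes for the label itself, then walk up through the enclosing elements until some text follows the label. The helper names `find_labeled_value` and `_text_after_label` and the sample HTML are hypothetical, made up for illustration. This covers HTML pages only; a PDF link like the one above would first need a separate text-extraction step.

```python
import re
from bs4 import BeautifulSoup


def _text_after_label(text, label):
    """Collapse whitespace and return whatever follows `label`, or '' if nothing does."""
    normalized = " ".join(text.split())
    _, found, rest = normalized.partition(label)
    return rest.strip() if found else ""


def find_labeled_value(html, label):
    """Return the text that follows `label`, regardless of which tag wraps it.

    Hypothetical helper: it assumes the value sits in the same element as the
    label or in one of the label's enclosing elements.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Find the first text node containing the label, whatever its tag is.
    node = soup.find(string=re.compile(re.escape(label)))
    if node is None:
        return None
    # Case 1: label and value share one text node ("Drug class: X").
    value = _text_after_label(str(node), label)
    if value:
        return value
    # Case 2: the value lives in a sibling tag, so climb the ancestors
    # until the combined text contains something after the label.
    for parent in node.parents:
        value = _text_after_label(parent.get_text(" "), label)
        if value:
            return value
    return None


# Made-up markup where the label and the value sit in different tags.
sample_html = (
    '<div><span><b>Drug class:</b></span> '
    '<a href="#">Colony stimulating factors</a></div>'
)
print(find_labeled_value(sample_html, "Drug class:"))  # Colony stimulating factors
```

The returned value is everything after the label inside the first enclosing element that has any, so on a real page it can include extra trailing text and may still need trimming.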
Please help and suggest a better way to achieve this.