I am attempting to scrape news articles from the website Phoenix News using Python and the BeautifulSoup library. My goal is to extract specific information from each article and store it in a DataFrame for further analysis. However, the script is not functioning as expected and fails to find any articles, resulting in an empty DataFrame.
HTML Structure:
Each article on the webpage is encapsulated within a div element with the class outCard. Inside this div, various nested div and span elements contain the article’s details, such as the title, publication date, time, and a brief description. Here’s an example of the HTML structure for a news article:
<div class="outCard">
<div class="card-blog-color card-blog-color-selected card-body-important card">
<span class="counter colorRedShadow"></span>
<div class="card-header-important card-header">
<div class="news-color blogColor"></div>
<span class="news-title">
<div class="align-items-news">
<img title="Blogs" src="data:image/png;base64,...">
<span>BINANCE BLOG</span> - Ecosystem
</div>
<div class="align-items-news cursor-pointer">
<span class="dateNews">21/05/2024</span>
<span class="hourNews"> 14:20:42</span>
<span class="milisecondsNews"> 657</span>
</div>
</span>
</div>
<span class="">
<div class="card-body">
<span class="card-text-title">
BNB: <a href="https://www.binance.com/en/blog/ecosystem/binance-labs-invests-in-aevo-to-support-the-future-of-l2-blockchain-innovations-1035611236859767594" target="_blank">
<span class="backgroundColorPink">Binance</span> Labs Invests In <span class="backgroundColorPink">Aevo</span> To Support The Future Of <span class="backgroundColorPink">L2</span> Blockchain Innovations
</a>
</span>
<span class="card-text-description">Binance Labs has invested in Aevo, a high-performance Layer 2 (L2) built on top of the OP Stack that allows perpetual trading, pre-launch futures, and options, all on the same platform with a single margin account.</span>
<span class="token-button-container">...</span>
</div>
</span>
</div>
</div>
Python Script:
Here is the Python script I am using:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_phoenix_news():
    url = "https://phoenixnews.io/"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    news_items = []
    # Each article should live in a <div class="outCard"> container.
    articles = soup.find_all('div', class_='outCard')
    print(f"Found {len(articles)} articles")

    for article in articles:
        try:
            title_tag = article.find('span', class_='news-title')
            title = title_tag.text.strip() if title_tag else 'N/A'

            card_body = article.find('div', class_='card-body')
            summary_tag = card_body.find('span', class_='card-text-title')
            summary = summary_tag.text.strip() if summary_tag else 'N/A'

            timestamp_date_tag = article.find('span', class_='dateNews')
            timestamp_date = timestamp_date_tag.text.strip() if timestamp_date_tag else 'N/A'
            timestamp_time_tag = article.find('span', class_='hourNews')
            timestamp_time = timestamp_time_tag.text.strip() if timestamp_time_tag else 'N/A'
            timestamp = f"{timestamp_date} {timestamp_time}"

            link_tag = card_body.find('a')
            link = link_tag['href'] if link_tag else 'N/A'

            print(f"Title: {title}")
            print(f"Summary: {summary}")
            print(f"Timestamp: {timestamp}")
            print(f"Link: {link}")

            news_items.append({
                'title': title,
                'summary': summary,
                'link': link,
                'timestamp': timestamp
            })
            print(f"Scraped Article: {title}")
        except AttributeError as e:
            # card_body can be None if a card's layout differs; skip that card.
            print(f"Error: {e}")
            continue

    news_df = pd.DataFrame(news_items)
    return news_df


if __name__ == "__main__":
    news = get_phoenix_news()
    print(news)
```
Here is a step-by-step breakdown of what the script does:
- Import Libraries: I am using the requests library to fetch the webpage content and BeautifulSoup from the bs4 package to parse the HTML. The pandas library is used to store the extracted data in a DataFrame.
- Fetch Webpage Content: The script starts by sending a GET request to the website and storing the HTML content in a BeautifulSoup object for parsing.
- Find Articles: It attempts to find all div elements with the class outCard, which should contain the articles.
- Extract Data: For each article found, the script extracts the title, summary, timestamp (combining date and time), and the link to the full article (see the selector sketch after this list).
- Handle Missing Data: If any of these elements are missing, the script handles the exception and continues to the next article.
- Store Data: The extracted information is stored in a list of dictionaries, which is then converted into a DataFrame.
- Output DataFrame: Finally, the script prints the DataFrame containing the extracted data.
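For the extraction step, CSS selectors can be a bit more precise than chained find() calls. Below is a sketch based purely on the HTML sample above (parse_card is a hypothetical helper name I made up); note that in that sample the headline sits in span.card-text-title and the blurb in span.card-text-description, so this sketch maps them that way:

```python
from bs4 import BeautifulSoup  # assumes bs4 is installed


def parse_card(card):
    """Sketch: pull one article's fields out of a div.outCard element.

    Hypothetical helper; the selectors come from the HTML sample above.
    """
    def text_of(selector):
        tag = card.select_one(selector)
        # The ' ' separator keeps words apart when text spans nested tags.
        return tag.get_text(' ', strip=True) if tag else 'N/A'

    link_tag = card.select_one('span.card-text-title a')
    return {
        'title': text_of('span.card-text-title'),
        'summary': text_of('span.card-text-description'),
        'link': link_tag['href'] if link_tag else 'N/A',
        'timestamp': f"{text_of('span.dateNews')} {text_of('span.hourNews')}",
    }
```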
Current Issue:
- No Articles Found: The script prints “Found 0 articles”, indicating that it fails to locate any div elements with the class outCard (a quick diagnostic for this is sketched after this list).
- Empty DataFrame: As a result, the DataFrame remains empty.
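Before debugging the selectors, it may be worth confirming what the server actually returns. A minimal check, assuming only that requests is installed:

```python
import requests

response = requests.get("https://phoenixnews.io/")
print(response.status_code)        # e.g. 200, 403, ...
print(len(response.text))          # how much HTML actually came back
print('outCard' in response.text)  # is the card markup present at all?
```

If 'outCard' never appears in response.text, the cards are most likely rendered client-side by JavaScript, which requests alone cannot execute.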
Goal:
- Correctly Parse Articles: I aim to correctly navigate and parse the HTML structure to extract the desired information.
- Populate DataFrame: Successfully store the extracted data in a DataFrame for further analysis (a sketch of turning the timestamp strings into a datetime column follows below).
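Once the DataFrame is populated, the combined date/time strings can be converted into a proper datetime column for analysis. A sketch assuming the day/month/year format shown in the sample HTML (the row here is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical row using the values from the sample HTML above.
news_df = pd.DataFrame([{'title': 'BNB: ...', 'timestamp': '21/05/2024 14:20:42'}])
news_df['timestamp'] = pd.to_datetime(news_df['timestamp'],
                                      format='%d/%m/%Y %H:%M:%S')
print(news_df['timestamp'].dt.date)  # 2024-05-21
```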
I am looking for guidance on:
- Correctly identifying and navigating the HTML elements to extract the required information (an offline test of the parsing logic against the sample above is sketched below).
- Any corrections or improvements to the script that would allow it to achieve the desired outcome.
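To separate the two possible problems (fetching vs. parsing), the selector logic can be tested against the static HTML sample quoted earlier. A minimal sketch, assuming sample_html holds that snippet:

```python
from bs4 import BeautifulSoup

# sample_html stands in for the article markup quoted earlier;
# paste the full snippet between the quotes to run this.
sample_html = '''
<div class="outCard">
  ...
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
cards = soup.find_all('div', class_='outCard')
print(f"Found {len(cards)} cards in the static sample")
# If this finds the card but the live request finds none, the selectors
# are fine and the problem is the fetched HTML, not the parsing.
```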
Thank you for your assistance!