I want to collect all articles from the newspaper website “The Hindu” whose headings match the keywords I provide.
Script Inputs / Outputs
[IN] – Keyword to search for in headings
[OUT] – Links to articles whose heading matches the keyword
import requests
from bs4 import BeautifulSoup

def search_articles_with_word(url, search_word):
    # Send a GET request to the URL
    print(f"Fetching page from: {url}")
    response = requests.get(url)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        print("Page retrieved successfully.")
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')
        # Convert the whole page text to lowercase for case-insensitive search
        html_content = soup.get_text().lower()
        # Check if the search word is in the page text
        if search_word.lower() in html_content:
            print(f"\nArticles containing '{search_word}':\n")
            # Find all <a> tags and iterate through them to extract relevant information
            for a_tag in soup.find_all('a'):
                title = a_tag.text.strip()
                href = a_tag.get('href')
                if title and href:
                    if search_word.lower() in title.lower():
                        print(f"{title} - {href}")
        else:
            print(f"\nNo articles found containing '{search_word}'.")
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

# URL of the page to search (The Hindu Opinion section)
url = "https://www.thehindu.com/opinion/"
# Get user input for the word to search
search_word = input("Enter a word to search for in articles: ")
# Call the function to search articles with the given word
search_articles_with_word(url, search_word)
This script fetches the article links on that page that match the keyword I enter, as intended.
Problem
How can I enhance this script so that it fetches the next 100 pages of the website, searches them for the keyword, and returns the relevant article links? The problem with The Hindu's website is that it shows a “SHOW MORE” option; clicking it loads the next set of articles.
SHOW MORE CODE
Inspect element shows the following markup:
<a class="small-link show-more" data-show-more="fragment/showmoreopinion?page=5&variant=one">SHOW MORE<span class="slider"></span></a>
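The data-show-more value looks like a relative path with a page parameter in its query string. If that is what it is, each page's address could be built by swapping the page number. A minimal sketch of that idea follows; it assumes (unverified) that the path resolves against the site root, and show_more_url and BASE_URL are hypothetical names, not anything the site documents.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit, urlunsplit

# Assumption: the data-show-more path is relative to the site root.
BASE_URL = "https://www.thehindu.com/"


def show_more_url(data_show_more, page, base_url=BASE_URL):
    """Return an absolute URL for `page`, keeping the other query parameters."""
    parts = urlsplit(urljoin(base_url, data_show_more))
    query = parse_qs(parts.query)
    query["page"] = [str(page)]  # overwrite whatever page number was there
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


if __name__ == "__main__":
    attribute = "fragment/showmoreopinion?page=5&variant=one"
    for page in (1, 2, 3):
        print(show_more_url(attribute, page))
```

Opening one of the printed URLs in a browser would be a quick way to check whether the server really returns the extra articles there.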
Some websites change the URL from page to page, for example by adding page=2 to the link, but this website keeps the same URL and loads more articles into the same page when SHOW MORE is clicked. I want the script to keep clicking SHOW MORE automatically and then parse the article headings.
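One direction that might avoid simulating clicks altogether is to request the fragment path from the data-show-more attribute directly, once per page number, and run the same heading match on each response. The sketch below is only that: it assumes (unverified) that the fragment URL resolves against the site root, returns plain HTML containing the article links, and comes back empty once the pages run out. SITE, FRAGMENT, LinkCollector, fetch_with_requests and search_pages are hypothetical names. The standard library's HTMLParser stands in for BeautifulSoup here purely so the parsing part has no dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SITE = "https://www.thehindu.com/"
# Assumption: this path, copied from data-show-more, can be requested directly.
FRAGMENT = "fragment/showmoreopinion?page={page}&variant=one"


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for each <a href=...> that has visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())  # collapse whitespace
            if title:
                self.links.append((title, self._href))
            self._href = None


def fetch_with_requests(url):
    """Default fetcher: return the body text, or '' on a non-200 response."""
    import requests  # imported here so the parsing code works without it
    response = requests.get(url, timeout=10)
    return response.text if response.status_code == 200 else ""


def search_pages(search_word, max_pages=100, fetch=fetch_with_requests):
    """Return unique (title, url) pairs whose title contains search_word."""
    word = search_word.lower()
    matches, seen = [], set()
    for page in range(1, max_pages + 1):
        collector = LinkCollector()
        collector.feed(fetch(urljoin(SITE, FRAGMENT.format(page=page))))
        if not collector.links:  # no links: assume there are no more pages
            break
        for title, href in collector.links:
            url = urljoin(SITE, href)
            if word in title.lower() and url not in seen:
                seen.add(url)
                matches.append((title, url))
    return matches


if __name__ == "__main__":
    keyword = input("Enter a word to search for in articles: ")
    for title, url in search_pages(keyword):
        print(f"{title} - {url}")
```

If the server turns out to need a real browser (for example, if the fragment is only served to requests carrying particular headers or cookies), this approach would not work as written and a browser-automation tool that actually clicks the link would be the fallback.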