I am currently trying to scrape data from a website, CCMT 2021 OR CR, which has a dynamic structure. The table is paginated; to reach the other pages you click ‘Next’ or a page number (‘2’, ‘3’, etc.). I want to extract all the data and save it to an Excel file.
Previously, I extracted data from a similar link, CCMT 2023 OR CR, using the same approach. Scraping one page (21 rows) took approximately 23 seconds, so the full run over 383 pages took roughly 2.5 hours (383 × 23 s ≈ 8,800 s). The 2021 data set is similar in size.
My Goal: I want to make the scraping process faster and more efficient.
Here is the code I used:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
# Specify the path to chromedriver if not in PATH
chrome_driver_path = 'chromedriver.exe' # Adjust the path as needed
# Initialize Chrome WebDriver
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service)
# Function to scrape data from all pages
def scrape_all_pages(base_url):
    driver.get(base_url)  # Open the base URL
    # Initial wait to allow the page to load
    time.sleep(60)  # Increase this if the page loads very slowly
    # Container to store all the data
    all_data = []
    while True:
        try:
            # Wait until the table is present
            WebDriverWait(driver, 90).until(
                EC.presence_of_element_located((By.XPATH, '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr'))
            )
            # Locate the table rows on the current page
            data_elements = driver.find_elements(By.XPATH, '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr')
            # Extract data from each row
            for row in data_elements:
                cols = row.find_elements(By.TAG_NAME, 'td')
                row_data = [col.text for col in cols]
                all_data.append(row_data)
            print(f"Scraped {len(data_elements)} rows from current page.")
            # Check if there is a "Next" button to go to the next page
            try:
                next_button = driver.find_element(By.XPATH, '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[4]/pagination-controls/pagination-template/ul/li[10]')
                if "disabled" not in next_button.get_attribute("class"):
                    next_button.click()
                    # Wait for the next page to load
                    time.sleep(20)  # Increase this if the page loads very slowly
                else:
                    break  # Break the loop if the next page button is disabled
            except Exception as e:
                print(f"Next button not found or other exception: {e}")
                break  # Break the loop if the next button is not found
        except TimeoutException as e:
            print(f"Timeout waiting for table to load: {e}")
            break  # Break the loop if waiting for the table times out
    driver.quit()  # Close the browser
    return all_data
# Base URL of the webpage
base_url = "https://admissions.nic.in/admiss/admissions/orcrjacd/105012121"
# Scrape data from all pages
print(f"Scraping data from {base_url}...")
all_data = scrape_all_pages(base_url)
# Check if any data was scraped
if all_data:
    # Convert the scraped data to a DataFrame
    df = pd.DataFrame(all_data, columns=['SNo.', 'Round', 'Institute', 'PG Program', 'Group', 'Category', 'Max GATE Score', 'Min GATE Score'])
    # Save DataFrame to an Excel file
    output_file = 'scraped_data_2021.xlsx'
    df.to_excel(output_file, index=False)
    print(f"Data has been successfully scraped and saved to {output_file}.")
else:
    print("No data was scraped.")
Attempts to Improve Efficiency:
- Headless Mode:
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(service=service, options=chrome_options)
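While staying in headless mode, one more option I am considering is blocking image downloads to cut page-load time. I believe Chrome exposes this through prefs, though I have not tested whether the table still renders with it (untested sketch):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
# Ask Chrome not to download images (2 = block). Untested assumption that
# the table on this page does not depend on images to render.
chrome_options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
driver = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)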
- Reduced Wait Times:
- Reduced the time.sleep calls to 10 seconds.
- Reduced the WebDriverWait timeout to 10 seconds.
However, the speed remained approximately the same.
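What I have not yet tried is replacing the fixed sleeps with explicit waits that return as soon as the new page has rendered. Would something like this be the right replacement for the time.sleep(20) after clicking Next? (Untested; it assumes the table rows are re-rendered, and therefore go stale, when the page changes.)

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

ROW_XPATH = '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr'

def go_to_next_page(driver, next_button, timeout=30):
    # Keep a handle to a row on the current page, click Next, then wait
    # until that row is detached from the DOM (old page gone) and the
    # new page's rows are present.
    old_row = driver.find_element(By.XPATH, ROW_XPATH)
    next_button.click()
    wait = WebDriverWait(driver, timeout)
    wait.until(EC.staleness_of(old_row))
    wait.until(EC.presence_of_element_located((By.XPATH, ROW_XPATH)))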
- Exploring Multiprocessing/Multithreading/Asynchronous Scraping:
- I have heard of these methods but do not know how to implement them for this task; a rough, untested sketch of the kind of thread-pool approach I have in mind is below.
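If multithreading is the way to go, this is roughly what I imagine: several independent Chrome instances, each clicking through its own slice of the 383 pages. (Untested sketch; it reuses the imports, chrome_driver_path, and base_url from my code above, the XPaths are copied from it, and each worker wastes time clicking Next just to reach its starting page, so I am not sure it actually helps.)

from concurrent.futures import ThreadPoolExecutor

ROW_XPATH = '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr'
NEXT_XPATH = '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[4]/pagination-controls/pagination-template/ul/li[10]'

def make_driver():
    # Each worker needs its own browser: WebDriver objects are not thread-safe.
    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(service=Service(chrome_driver_path), options=options)

def scrape_page_range(start_page, end_page):
    driver = make_driver()
    rows = []
    try:
        driver.get(base_url)
        WebDriverWait(driver, 90).until(
            EC.presence_of_element_located((By.XPATH, ROW_XPATH))
        )
        # Click through to this worker's first page (I don't know a way
        # to jump straight to page N on this site).
        for _ in range(start_page - 1):
            driver.find_element(By.XPATH, NEXT_XPATH).click()
            time.sleep(2)  # placeholder; an explicit wait would be better
        for page in range(start_page, end_page + 1):
            for tr in driver.find_elements(By.XPATH, ROW_XPATH):
                rows.append([td.text for td in tr.find_elements(By.TAG_NAME, 'td')])
            if page < end_page:
                driver.find_element(By.XPATH, NEXT_XPATH).click()
                time.sleep(2)
    finally:
        driver.quit()
    return rows

# Split the 383 pages across 4 workers.
ranges = [(1, 96), (97, 192), (193, 288), (289, 383)]
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(lambda r: scrape_page_range(*r), ranges))
all_data = [row for chunk in chunks for row in chunk]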
How can I make this scraping process faster and more reliable? Any suggestions or improvements would be greatly appreciated. Thank you!