I am currently trying to scrape data from a website, CCMT 2021 OR CR, which has a dynamic structure. The table is paginated; to reach the other pages you click ‘Next’ or a page number (‘2’, ‘3’, etc.). I want to extract all the data and save it to an Excel file.
Previously, I extracted data from a similar link, CCMT 2023 OR CR, using the same approach. Scraping one page (21 rows) took approximately 23 seconds, so the full run over 383 pages took roughly 2.5 hours (383 × 23 s ≈ 8,800 s). The 2021 data set is similar in size.
My Goal: I want to make the scraping process faster and more efficient.
Here is the code I used:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
# Specify the path to chromedriver if not in PATH
chrome_driver_path = 'chromedriver.exe' # Adjust the path as needed
# Initialize Chrome WebDriver
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service)
# Function to scrape data from all pages
def scrape_all_pages(base_url):
    driver.get(base_url)  # Open the base URL
    # Initial wait to allow the page to load
    time.sleep(60)  # Increase this if the page loads very slowly
    # Container to store all the data
    all_data = []
    while True:
        try:
            # Wait until the table is present
            WebDriverWait(driver, 90).until(
                EC.presence_of_element_located((By.XPATH, '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr'))
            )
            # Locate the table rows on the current page
            data_elements = driver.find_elements(By.XPATH, '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr')
            # Extract data from each row
            for row in data_elements:
                cols = row.find_elements(By.TAG_NAME, 'td')
                row_data = [col.text for col in cols]
                all_data.append(row_data)
            print(f"Scraped {len(data_elements)} rows from current page.")
            # Check if there is a "Next" button to go to the next page
            try:
                next_button = driver.find_element(By.XPATH, '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[4]/pagination-controls/pagination-template/ul/li[10]')
                if "disabled" not in next_button.get_attribute("class"):
                    next_button.click()
                    # Wait for the next page to load
                    time.sleep(20)  # Increase this if the page loads very slowly
                else:
                    break  # Break the loop if the next page button is disabled
            except Exception as e:
                print(f"Next button not found or other exception: {e}")
                break  # Break the loop if the next button is not found
        except TimeoutException as e:
            print(f"Timeout waiting for table to load: {e}")
            break  # Break the loop if waiting for the table times out
    driver.quit()  # Close the browser
    return all_data
# Base URL of the webpage
base_url = "https://admissions.nic.in/admiss/admissions/orcrjacd/105012121"
# Scrape data from all pages
print(f"Scraping data from {base_url}...")
all_data = scrape_all_pages(base_url)
# Check if any data was scraped
if all_data:
    # Convert the scraped data to a DataFrame
    df = pd.DataFrame(all_data, columns=['SNo.', 'Round', 'Institute', 'PG Program', 'Group', 'Category', 'Max GATE Score', 'Min GATE Score'])
    # Save DataFrame to an Excel file
    output_file = 'scraped_data_2021.xlsx'
    df.to_excel(output_file, index=False)
    print(f"Data has been successfully scraped and saved to {output_file}.")
else:
    print("No data was scraped.")
Attempts to Improve Efficiency:
- Headless Mode:
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(service=service, options=chrome_options)
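While staying in headless mode, one more option I am considering is blocking image downloads to cut page-load time. I believe Chrome exposes this through prefs, though I have not tested whether the table still renders with it (untested sketch):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
# Ask Chrome not to download images (2 = block). Untested assumption that
# the table on this page does not depend on images to render.
chrome_options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
driver = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)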
- Reduced Wait Times:
- Reduced the time.sleep calls to 10 seconds.
- Reduced the WebDriverWait timeout to 10 seconds.
However, the speed remained approximately the same.
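What I have not yet tried is replacing the fixed sleeps with explicit waits that return as soon as the new page has rendered. Would something like this be the right replacement for the time.sleep(20) after clicking Next? (Untested; it assumes the table rows are re-rendered, and therefore go stale, when the page changes.)

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

ROW_XPATH = '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr'

def go_to_next_page(driver, next_button, timeout=30):
    # Keep a handle to a row on the current page, click Next, then wait
    # until that row is detached from the DOM (old page gone) and the
    # new page's rows are present.
    old_row = driver.find_element(By.XPATH, ROW_XPATH)
    next_button.click()
    wait = WebDriverWait(driver, timeout)
    wait.until(EC.staleness_of(old_row))
    wait.until(EC.presence_of_element_located((By.XPATH, ROW_XPATH)))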
- Exploring Multiprocessing/Multithreading/Asynchronous Scraping:
- I have heard of these methods but do not know how to implement them for this task; a rough, untested sketch of the kind of thread-pool approach I have in mind is below.
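If multithreading is the way to go, this is roughly what I imagine: several independent Chrome instances, each clicking through its own slice of the 383 pages. (Untested sketch; it reuses the imports, chrome_driver_path, and base_url from my code above, the XPaths are copied from it, and each worker wastes time clicking Next just to reach its starting page, so I am not sure it actually helps.)

from concurrent.futures import ThreadPoolExecutor

ROW_XPATH = '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[3]/div/table/tbody/tr'
NEXT_XPATH = '//*[@id="mainContent"]/app-jac-delhi/section/form/div/div[5]/div[4]/pagination-controls/pagination-template/ul/li[10]'

def make_driver():
    # Each worker needs its own browser: WebDriver objects are not thread-safe.
    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(service=Service(chrome_driver_path), options=options)

def scrape_page_range(start_page, end_page):
    driver = make_driver()
    rows = []
    try:
        driver.get(base_url)
        WebDriverWait(driver, 90).until(
            EC.presence_of_element_located((By.XPATH, ROW_XPATH))
        )
        # Click through to this worker's first page (I don't know a way
        # to jump straight to page N on this site).
        for _ in range(start_page - 1):
            driver.find_element(By.XPATH, NEXT_XPATH).click()
            time.sleep(2)  # placeholder; an explicit wait would be better
        for page in range(start_page, end_page + 1):
            for tr in driver.find_elements(By.XPATH, ROW_XPATH):
                rows.append([td.text for td in tr.find_elements(By.TAG_NAME, 'td')])
            if page < end_page:
                driver.find_element(By.XPATH, NEXT_XPATH).click()
                time.sleep(2)
    finally:
        driver.quit()
    return rows

# Split the 383 pages across 4 workers.
ranges = [(1, 96), (97, 192), (193, 288), (289, 383)]
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(lambda r: scrape_page_range(*r), ranges))
all_data = [row for chunk in chunks for row in chunk]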
How can I make this scraping process faster and more reliable? Any suggestions or improvements would be greatly appreciated. Thank you!