I need to filter out all webpages with 0 listings on Grailed, and I have over 500k URLs to go through.
I'm using Python and Selenium. My problem is that for every new webpage the script needs to click away the cookie and user login pop-ups before it can get at the number of listings, so each webpage takes ~13 seconds to process. For 500k URLs that works out to roughly 75 days, which I don't have.
This is my first time web scraping / coding / using Python in general, so I'm probably missing a lot of obvious adjustments.
An example link is: https://www.grailed.com/designers/acne-studios/casual-pants
All 500k links are: https://www.grailed.com/designers/designer-name/category-name
There are two possible approaches I'm currently trying to figure out (rough sketches of both are further below):
- Try to block the cookie and user login pop-ups. However, I'm not sure whether this is possible without saving some sort of user profile, and if I do that I'm worried I'll get blocked by Grailed.
- Run multiple instances at the same time, preferably between 13 (~2 weeks) and 130 (~14 hours). However, I'm not sure what the implications of this are: whether it'll be costly, how to avoid getting blocked, and whether I need to use proxies for this.
Although these are the two approaches that seem obvious to me, please tell me if I’m missing something.
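To make the pop-up-blocking idea concrete, this is roughly what I had in mind: instead of waiting for the banners and clicking them, delete them with JavaScript as soon as the page loads, so no clicks (and no user profile) are needed. This is untested; the #onetrust-consent-sdk container id is a guess based on the onetrust-reject-all-handler button id I already use, and the Modal-Content class is taken from my current code.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

driver.get("https://www.grailed.com/designers/acne-studios/casual-pants")

# Delete the cookie banner and the login modal instead of clicking them.
# querySelector returns null when an element isn't there, so this is safe to run every time.
driver.execute_script("""
    const consent = document.querySelector('#onetrust-consent-sdk');
    if (consent) consent.remove();
    const modal = document.querySelector("div[class='Modal-Content']");
    if (modal) modal.remove();
""")

# Wait briefly for the feed to render; a timeout here should just mean 0 listings
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "feed-item")))
except TimeoutException:
    pass

listings = driver.find_elements(
    By.XPATH, "//div[@class='FiltersInstantSearch']//div[@class='feed-item']")
print(f"Found {len(listings)} listings")
driver.quit()

Would something like this work, or will Grailed notice that the pop-ups were never acknowledged and block me anyway?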
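For the multiple-instances idea, I was thinking of a thread pool where each worker keeps one browser open and works through its own chunk of URLs, rather than launching a fresh Chrome per link like my current script does. Also untested; NUM_WORKERS is just a placeholder and the pop-up handling is left out to keep it short.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

NUM_WORKERS = 13  # placeholder; I don't know yet how many my machine (or Grailed) will tolerate

def count_listings(urls):
    # Open one browser, visit every URL in this chunk, return the URLs that have listings
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    kept = []
    try:
        for url in urls:
            driver.get(url)
            items = driver.find_elements(
                By.XPATH, "//div[@class='FiltersInstantSearch']//div[@class='feed-item']")
            if len(items) > 1:
                kept.append(url)
    finally:
        driver.quit()
    return kept

def chunk(seq, n):
    # Split the URL list into n roughly equal pieces, one per worker
    size = (len(seq) + n - 1) // n
    return [seq[i:i + size] for i in range(0, len(seq), size)]

urls = pd.read_csv('C:/Users/rafme/Downloads/Test Brands & Categories.csv')['Links'].tolist()

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    results = pool.map(count_listings, chunk(urls, NUM_WORKERS))

filtered = [url for batch in results for url in batch]
pd.DataFrame(filtered, columns=['Link']).to_csv('filtered_categories.csv', index=False)

Since each worker reuses one driver for its whole chunk, this would also remove the per-page Chrome startup cost my current script has. My main worry is whether running this many headless browsers from one IP gets me blocked, or whether I'd need proxies first.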
My code is as follows:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
import os
import time
# Update the PATH environment variable
os.environ['PATH'] += r";C:\Users\rafme\Desktop\Selenium Drivers"
# Read the CSV file
BrandCategoryLinks = pd.read_csv('C:/Users/rafme/Downloads/Test Brands & Categories.csv')
FilteredCategoryLink = []
# Loop through each link in the DataFrame
for index, link in BrandCategoryLinks.iterrows():
    driver = None
    try:
        base_url = link['Links']

        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument("--disable-gpu")  # Disable GPU usage
        chrome_options.add_argument("--no-sandbox")  # Disable sandboxing
        chrome_options.add_argument("--disable-dev-shm-usage")  # Disable shared memory usage
        chrome_options.add_argument("--window-size=1920,1080")  # Set the window size
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")

        service = Service(r"C:\Users\rafme\Desktop\Selenium Drivers\chromedriver.exe")
        driver = webdriver.Chrome(service=service, options=chrome_options)
        driver.get(base_url)

        timeout = 60  # Increase timeout

        # Reject the cookie banner if it appears
        try:
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.ID, "onetrust-reject-all-handler")))
            reject_button = driver.find_element(By.ID, "onetrust-reject-all-handler")

            # Scroll the element into view using JavaScript
            driver.execute_script("arguments[0].scrollIntoView(true);", reject_button)
            time.sleep(2)  # Wait for the scrolling to complete

            # Click the element
            reject_button.click()
            time.sleep(1)
            reject_button.click()
            time.sleep(1)
        except (NoSuchElementException, ElementClickInterceptedException):
            pass
        except Exception as e:
            print(f"Error occurred: {e}")
            continue

        # Close the user login modal if it exists
        try:
            elem = driver.find_element(By.XPATH, "//div[@class='Modal-Content']")
            ac = ActionChains(driver)
            ac.move_to_element(elem).move_by_offset(250, 0).click().perform()  # clicking away from login window
        except NoSuchElementException:
            pass
        except Exception as e:
            print(f"Error clicking 'User Authentication' button: {e}")
            continue

        # Check listing count
        try:
            listing_count = driver.find_elements(
                By.XPATH, "//div[@class='FiltersInstantSearch']//div[@class='feed-item']")
            if len(listing_count) > 1:
                print(f"Found {len(listing_count)} listings on {base_url}")
                FilteredCategoryLink.append(base_url)
            else:
                print(f"Found {len(listing_count)} listings on {base_url}, not enough to keep.")
        except Exception as e:
            print(f"Error finding listings: {e}")
            continue
    except Exception as e:
        print(f"Error processing link {link}: {e}")
    finally:
        if driver:
            driver.quit()
# Save the filtered categories to CSV
filtered_categories = pd.DataFrame(FilteredCategoryLink, columns=['Link'])
filtered_categories.to_csv('filtered_categories.csv', index=False)
Thank you all very much for taking the time to go over my problem!