I need to filter out all webpages with 0 listings on Grailed, and I have over 500k URLs to go through.
I'm using Python and Selenium. My problem is that for every new webpage the script needs to click away the cookie and user login pop-ups before it can get at the number of listings, so each webpage takes ~13 seconds to process. For 500k URLs that works out to roughly 75 days, which I don't have.
This is my first time web scraping / coding / using Python in general, so I'm probably missing a lot of obvious adjustments.
An example link is: https://www.grailed.com/designers/acne-studios/casual-pants
All 500k links are: https://www.grailed.com/designers/designer-name/category-name
There are two possible approaches I'm currently trying to figure out (rough sketches of both are further below):
- Try to block the cookie and user login pop-ups. However, I'm not sure whether this is possible without saving some sort of user profile, and if I do that I'm worried I'll get blocked by Grailed.
- Run multiple instances at the same time, preferably between 13 (~2 weeks) and 130 (~14 hours). However, I'm not sure what the implications of this are: whether it'll be costly, how to avoid getting blocked, and whether I need to use proxies for this.
Although these are the two approaches that seem obvious to me, please tell me if I’m missing something.
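To make the pop-up-blocking idea concrete, this is roughly what I had in mind: instead of waiting for the banners and clicking them, delete them with JavaScript as soon as the page loads, so no clicks (and no user profile) are needed. This is untested; the #onetrust-consent-sdk container id is a guess based on the onetrust-reject-all-handler button id I already use, and the Modal-Content class is taken from my current code.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

driver.get("https://www.grailed.com/designers/acne-studios/casual-pants")

# Delete the cookie banner and the login modal instead of clicking them.
# querySelector returns null when an element isn't there, so this is safe to run every time.
driver.execute_script("""
    const consent = document.querySelector('#onetrust-consent-sdk');
    if (consent) consent.remove();
    const modal = document.querySelector("div[class='Modal-Content']");
    if (modal) modal.remove();
""")

# Wait briefly for the feed to render; a timeout here should just mean 0 listings
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "feed-item")))
except TimeoutException:
    pass

listings = driver.find_elements(
    By.XPATH, "//div[@class='FiltersInstantSearch']//div[@class='feed-item']")
print(f"Found {len(listings)} listings")
driver.quit()

Would something like this work, or will Grailed notice that the pop-ups were never acknowledged and block me anyway?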
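For the multiple-instances idea, I was thinking of a thread pool where each worker keeps one browser open and works through its own chunk of URLs, rather than launching a fresh Chrome per link like my current script does. Also untested; NUM_WORKERS is just a placeholder and the pop-up handling is left out to keep it short.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

NUM_WORKERS = 13  # placeholder; I don't know yet how many my machine (or Grailed) will tolerate

def count_listings(urls):
    # Open one browser, visit every URL in this chunk, return the URLs that have listings
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)
    kept = []
    try:
        for url in urls:
            driver.get(url)
            items = driver.find_elements(
                By.XPATH, "//div[@class='FiltersInstantSearch']//div[@class='feed-item']")
            if len(items) > 1:
                kept.append(url)
    finally:
        driver.quit()
    return kept

def chunk(seq, n):
    # Split the URL list into n roughly equal pieces, one per worker
    size = (len(seq) + n - 1) // n
    return [seq[i:i + size] for i in range(0, len(seq), size)]

urls = pd.read_csv('C:/Users/rafme/Downloads/Test Brands & Categories.csv')['Links'].tolist()

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    results = pool.map(count_listings, chunk(urls, NUM_WORKERS))

filtered = [url for batch in results for url in batch]
pd.DataFrame(filtered, columns=['Link']).to_csv('filtered_categories.csv', index=False)

Since each worker reuses one driver for its whole chunk, this would also remove the per-page Chrome startup cost my current script has. My main worry is whether running this many headless browsers from one IP gets me blocked, or whether I'd need proxies first.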
My code is as follows:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, ElementClickInterceptedException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
import os
import time
# Update the PATH environment variable
os.environ['PATH'] += r";C:\Users\rafme\Desktop\Selenium Drivers"
# Read the CSV file
BrandCategoryLinks = pd.read_csv('C:/Users/rafme/Downloads/Test Brands & Categories.csv')
FilteredCategoryLink = []
# Loop through each link in the DataFrame
for index, link in BrandCategoryLinks.iterrows():
    driver = None
    try:
        base_url = link['Links']

        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument("--disable-gpu")  # Disable GPU usage
        chrome_options.add_argument("--no-sandbox")  # Disable sandboxing
        chrome_options.add_argument("--disable-dev-shm-usage")  # Disable shared memory usage
        chrome_options.add_argument("--window-size=1920,1080")  # Set the window size
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36")

        service = Service(r"C:\Users\rafme\Desktop\Selenium Drivers\chromedriver.exe")
        driver = webdriver.Chrome(service=service, options=chrome_options)
        driver.get(base_url)

        timeout = 60  # Increase timeout

        # Reject the cookie banner if it appears
        try:
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.ID, "onetrust-reject-all-handler")))
            reject_button = driver.find_element(By.ID, "onetrust-reject-all-handler")

            # Scroll the element into view using JavaScript
            driver.execute_script("arguments[0].scrollIntoView(true);", reject_button)
            time.sleep(2)  # Wait for the scrolling to complete

            # Click the element
            reject_button.click()
            time.sleep(1)
            reject_button.click()
            time.sleep(1)
        except (NoSuchElementException, ElementClickInterceptedException):
            pass
        except Exception as e:
            print(f"Error occurred: {e}")
            continue

        # Close the user login modal if it exists
        try:
            elem = driver.find_element(By.XPATH, "//div[@class='Modal-Content']")
            ac = ActionChains(driver)
            ac.move_to_element(elem).move_by_offset(250, 0).click().perform()  # clicking away from login window
        except NoSuchElementException:
            pass
        except Exception as e:
            print(f"Error clicking 'User Authentication' button: {e}")
            continue

        # Check listing count
        try:
            listing_count = driver.find_elements(
                By.XPATH, "//div[@class='FiltersInstantSearch']//div[@class='feed-item']")
            if len(listing_count) > 1:
                print(f"Found {len(listing_count)} listings on {base_url}")
                FilteredCategoryLink.append(base_url)
            else:
                print(f"Found {len(listing_count)} listings on {base_url}, not enough to keep.")
        except Exception as e:
            print(f"Error finding listings: {e}")
            continue
    except Exception as e:
        print(f"Error processing link {link}: {e}")
    finally:
        if driver:
            driver.quit()
# Save the filtered categories to CSV
filtered_categories = pd.DataFrame(FilteredCategoryLink, columns=['Link'])
filtered_categories.to_csv('filtered_categories.csv', index=False)
Thank you all very much for taking the time to go over my problem!