I have two scripts. other.py looks like this:
# some stuff is done here and a list of urls is created such as:
urls = ['https://www.walmart.com/ip/Sabrina-Carpenter-Cherry-Pop-EDP-30ml-1oz/5492571361?classType=REGULAR&athbdg=L1600',
        'https://www.walmart.com/ip/Hoey-5-1-Painless-Hair-Remover-Women-Facial-Removal-Electric-Cordless-Shaver-Set-Wet-Dry-Lady-Razor-Women-Bikini-Line-Nose-Hair-Eyebrow-Arm-Leg-USB-R/647670434?classType=REGULAR']

# Then, the script runs another script called get_url.py and passes the urls to it to be processed:
subprocess.Popen(['python', 'get_url.py', str(urls)])
# It is important that this does not block the code and the rest of the code in this script can run without waiting for get_url.py to complete.
get_url.py, called above, looks like this and downloads each url passed to it:
import ast
import sys
import pandas as pd
import os
import time
from datetime import datetime
import pyautogui
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from concurrent.futures import ProcessPoolExecutor

options = Options()  # assumed: plain default Chrome options

def get_page(url):
    file_name = f"{url[:20]}_{pd.to_datetime(datetime.now()).strftime('%Y-%m-%d %H-%M-%S')}.html"
    file_path = os.path.join(os.getcwd(), 'data', 'htmls')
    path_and_name = os.path.join(file_path, file_name)
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(1)
    pyautogui.hotkey('ctrl', 's')  # open the save as window
    time.sleep(1)
    pyautogui.typewrite(path_and_name)  # enter the path and file name so the webpage is downloaded in the desired directory
    time.sleep(.5)
    pyautogui.hotkey('enter')
    time.sleep(.2)
    while True:  # wait until the download is complete, then close the driver
        files = os.listdir(file_path)
        if file_name in files:
            driver.close()
            break
        time.sleep(.1)

urls = sys.argv[1]  # getting urls from other.py
# converting the string urls to an actual list:
urls = ast.literal_eval(urls)

if __name__ == '__main__':
    # multi-processing the urls to speed up things (necessary)
    with ProcessPoolExecutor(max_workers=10) as executer:
        executer.map(get_page, urls, chunksize=1)
The function works fine as long as I open one browser. However, as soon as multiple windows are opened by the ProcessPoolExecutor, it appears that the pyautogui.typewrite part of the function loses track of the windows. This may lead to path_and_name being typed multiple times in the “save as” window, or typed incompletely, so the page is either not downloaded or downloaded with a bad name/directory. Even worse, if I click somewhere like inside my code editor while the function is running, pyautogui may type the path_and_name value in the editor where the cursor is active. Running the browser in “headless” mode so that I don’t accidentally mess with the windows does not help.
So, basically, how do I fix the above code?
This happens because pyautogui operates on the active window. If you want to run multiple windows in parallel, or use the browser while the program is running, you should use something that doesn’t interact with the GUI.
So, instead of clicking the save window, define a folder for the downloads and configure the Chrome browser to download to the folder.
# download folder
downloads = os.path.join(os.getcwd(), "data", "htmls")
# Chrome options
options = Options()
prefs = {
    "download.default_directory": downloads,
}
options.add_experimental_option("prefs", prefs)
The code above only takes effect once options is passed to the driver, as in webdriver.Chrome(options=options).
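One way to drop the GUI interaction entirely is sketched here, under assumptions that go beyond what is stated in this answer: each worker process builds its own options, opens its own driver, and writes driver.page_source to disk instead of driving the “save as” window. The names safe_file_name, build_options and save_page are hypothetical. Note that page_source is the serialized DOM, not Chrome's "complete webpage" save, so treat this as a starting point.

```python
import os
import re
from datetime import datetime


def safe_file_name(url, now=None):
    """Build a file name from a URL, replacing characters that are unsafe in file names."""
    now = now or datetime.now()
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", url)[:60].strip("_")
    return f"{stem}_{now.strftime('%Y-%m-%d %H-%M-%S')}.html"


def build_options(download_dir):
    """Chrome options with the download folder configured through prefs."""
    # Imported lazily so the pure helper above works without Selenium installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Assumption: no visible window is needed and Chrome is recent enough for this flag.
    options.add_argument("--headless=new")
    options.add_experimental_option(
        "prefs", {"download.default_directory": download_dir}
    )
    return options


def save_page(url, out_dir=None):
    """Load url in a dedicated driver and write its HTML to out_dir, with no GUI input."""
    from selenium import webdriver

    out_dir = out_dir or os.path.join(os.getcwd(), "data", "htmls")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, safe_file_name(url))
    driver = webdriver.Chrome(options=build_options(out_dir))
    try:
        driver.get(url)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(driver.page_source)
    finally:
        driver.quit()  # always release the browser, even if loading fails
    return path
```

Because nothing here depends on which window has focus, save_page could stand in for get_page in executer.map(save_page, urls, chunksize=1).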
You can also create a Service instance, which Selenium shows examples of on GitHub.
Selenium Manager is also an option for managing the browser driver.
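As a rough sketch of those two set-ups: a Service can be given an explicit chromedriver path, or created with no path so that Selenium Manager (bundled with Selenium 4.6 and later) locates a driver. The names service_kwargs, make_driver and driver_path are hypothetical, chosen only for illustration.

```python
def service_kwargs(driver_path=None):
    """Keyword arguments for Service: an explicit path, or none to defer to Selenium Manager."""
    return {"executable_path": driver_path} if driver_path else {}


def make_driver(options=None, driver_path=None):
    """Create a Chrome driver, optionally pinned to a specific chromedriver binary."""
    # Lazy imports: assumes Selenium 4 is installed when this is called.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    service = Service(**service_kwargs(driver_path))
    return webdriver.Chrome(service=service, options=options)
```

Calling make_driver(options) inside each worker gives every process its own driver and browser.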