The app is a web scraper. It will eventually run on a live website, which is why headless mode is crucial. I have been following a few guides and videos, such as this one: https://www.youtube.com/watch?v=ne3BH9-5H2o
What I eventually want is for the web app to run on a live website and let the user download a CSV containing the scraped data.
Right now it works perfectly with a regular (non-headless) browser. In a headless browser it works at first and then breaks. I am not well versed in this area: it is my first Python project, and I have tried many suggested solutions from Google and from AI chatbots without getting anywhere.
This is my output from running in a headless browser:
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:GET http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/element/f.BA2702E83329F95F938C6039779FB64F.d.C5F3064420B902B29643EBB8453A64CB.e.26/enabled {"id": "f.BA2702E83329F95F938C6039779FB64F.d.C5F3064420B902B29643EBB8453A64CB.e.26"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "GET /session/47d14aaef0c4b9c8b7af27a70da64850/element/f.BA2702E83329F95F938C6039779FB64F.d.C5F3064420B902B29643EBB8453A64CB.e.26/enabled HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/element {"using": "css selector", "value": ".VfPpkd-LgbsSe.VfPpkd-LgbsSe-OWXEXe-k8QpJ.VfPpkd-LgbsSe-OWXEXe-dgl2Hf.nCP5yc.AjY5Oe.DuMIQc.LQeN7.XWZjwc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/element HTTP/1.1" 200 126
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/element/f.BA2702E83329F95F938C6039779FB64F.d.C5F3064420B902B29643EBB8453A64CB.e.26/click {"id": "f.BA2702E83329F95F938C6039779FB64F.d.C5F3064420B902B29643EBB8453A64CB.e.26"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/element/f.BA2702E83329F95F938C6039779FB64F.d.C5F3064420B902B29643EBB8453A64CB.e.26/click HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
INFO:root:
Scraping has started. This could take a few minutes. Please do not close the browser window or click the top and move it (the script will stop if you do so).
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 830
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 830
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
INFO:root:Number of elements: 7
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 830
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 830
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync {"script": "arguments[0].scrollIntoView();", "args": [{"ELEMENT": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.68", "element-6066-11e4-a52e-4f735466cecf": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.68"}]}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 1415
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 1415
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 1415
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
INFO:root:Number of elements: 12
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 1415
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 1415
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync {"script": "arguments[0].scrollIntoView();", "args": [{"ELEMENT": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.82", "element-6066-11e4-a52e-4f735466cecf": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.82"}]}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 1415
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/elements {"using": "css selector", "value": ".hfpxzc"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/elements HTTP/1.1" 200 1415
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync {"script": "arguments[0].scrollIntoView();", "args": [{"ELEMENT": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.62", "element-6066-11e4-a52e-4f735466cecf": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.62"}]}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/element/f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.62/click {"id": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.62"}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/element/f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.62/click HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:GET http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/source {}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "GET /session/47d14aaef0c4b9c8b7af27a70da64850/source HTTP/1.1" 200 901718
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 301 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://www.taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 301 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://www.taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 301 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://www.taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 301 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://www.taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 301 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://www.taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 301 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.taksihelsinki.fi:443
DEBUG:urllib3.connectionpool:https://www.taksihelsinki.fi:443 "GET /tilaa-taksi/taksiasemat/ HTTP/1.1" 200 None
INFO:root:['Taksiasema Viiskulma', '0100 6203', 'https://taksihelsinki.fi/tilaa-taksi/taksiasemat/', '', 'Laivurinrinne 2, 00120 Helsinki']
DEBUG:selenium.webdriver.remote.remote_connection:POST http://127.0.0.1:37913/session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync {"script": "arguments[0].scrollIntoView();", "args": [{"ELEMENT": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.63", "element-6066-11e4-a52e-4f735466cecf": "f.BA2702E83329F95F938C6039779FB64F.d.5597E6B4F31B589B3519D21B88DB7FD1.e.63"}]}
DEBUG:urllib3.connectionpool:http://127.0.0.1:37913 "POST /session/47d14aaef0c4b9c8b7af27a70da64850/execute/sync HTTP/1.1" 404 853
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
ERROR:googleMapsScrapingToolweb:Exception on /scrape [POST]
Traceback (most recent call last):
  File "/home/vaahtlnirn1/.local/lib/python3.10/site-packages/flask/app.py", line 1463, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/vaahtlnirn1/.local/lib/python3.10/site-packages/flask/app.py", line 872, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/vaahtlnirn1/.local/lib/python3.10/site-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/home/vaahtlnirn1/.local/lib/python3.10/site-packages/flask/app.py", line 870, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/vaahtlnirn1/.local/lib/python3.10/site-packages/flask/app.py", line 855, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
  File "/home/vaahtlnirn1/googleMapsScrapingTool/googleMapsScrapingToolweb.py", line 160, in scrape
    file_path = scraper.scrape()
  File "/home/vaahtlnirn1/googleMapsScrapingTool/googleMapsScrapingToolweb.py", line 46, in scrape
    return self._selenium_extractor(browser)
  File "/home/vaahtlnirn1/googleMapsScrapingTool/googleMapsScrapingToolweb.py", line 76, in _selenium_extractor
    browser.execute_script("arguments[0].scrollIntoView();", element)
  File "/usr/lib/python3/dist-packages/selenium/webdriver/remote/webdriver.py", line 667, in execute_script
    return self.execute(command, {
  File "/usr/lib/python3/dist-packages/selenium/webdriver/remote/webdriver.py", line 318, in execute
    self.error_handler.check_response(response)
  File "/usr/lib/python3/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found in the current frame
(Session info: chrome-headless-shell=124.0.6367.60)
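For context, this exception generally means the element handle was obtained before the page re-rendered that part of the DOM, so the handle no longer points at a live node. One commonly suggested pattern is to look the element up again and retry the action. The following is a minimal sketch of that pattern, not code from this project: retry_on_stale, locate, act, attempts, and delay are hypothetical names, and the fallback exception class exists only so the sketch runs without Selenium installed.

```python
import time

try:
    from selenium.common.exceptions import StaleElementReferenceException
except ImportError:
    # Fallback so this sketch can run without Selenium installed.
    class StaleElementReferenceException(Exception):
        pass


def retry_on_stale(locate, act, attempts=3, delay=0.5):
    """Run act(locate()), re-locating the element if it has gone stale.

    locate: zero-argument callable that returns a fresh element handle.
    act:    callable that takes the element (for example, scrolls to it).
    """
    for attempt in range(1, attempts + 1):
        try:
            # Look the element up again on every attempt instead of
            # reusing a handle that may point at a replaced DOM node.
            return act(locate())
        except StaleElementReferenceException:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay)  # give the page time to finish re-rendering


# Hypothetical usage; the index i and the lambdas are assumptions:
# retry_on_stale(
#     lambda: browser.find_elements(By.CSS_SELECTOR, ".hfpxzc")[i],
#     lambda el: browser.execute_script("arguments[0].scrollIntoView();", el),
# )
```

Passing a locator callable rather than an element is the point of the design: each retry works on a freshly found element, so a handle invalidated by a re-render is never reused.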
This is the relevant code:
import logging
import time

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class GoogleMapsScraper:
    def __init__(self, link):
        self.link = link
        self.csv_data = []
        self.elementResults = 0  # Consecutive scrolls that loaded no new results

    def scrape(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument('--disable-gpu')
        browser = webdriver.Chrome(options=options)
        browser.maximize_window()
        browser.get(self.link)
        try:
            WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".VfPpkd-LgbsSe.VfPpkd-LgbsSe-OWXEXe-k8QpJ.VfPpkd-LgbsSe-OWXEXe-dgl2Hf.nCP5yc.AjY5Oe.DuMIQc.LQeN7.XWZjwc")))
            accept_button = browser.find_element(By.CSS_SELECTOR, ".VfPpkd-LgbsSe.VfPpkd-LgbsSe-OWXEXe-k8QpJ.VfPpkd-LgbsSe-OWXEXe-dgl2Hf.nCP5yc.AjY5Oe.DuMIQc.LQeN7.XWZjwc")
            accept_button.click()  # Click the accept button for Google cookies and terms
        except Exception as e:
            # logging uses %-style placeholders; without one, the extra argument triggers a formatting error
            logging.error("Error accepting cookies: %s", e)
        return self._selenium_extractor(browser)

    def _selenium_extractor(self, browser):
        prev_length = 0
        logging.info("\nScraping has started. This could take a few minutes. Please do not close the browser window or click the top and move it (the script will stop if you do so).")
        while len(self._get_elements(browser)) < 1000:  # This limits the number of results per page. Google seemingly has a hard limit of 120, but 1000 ensures that it runs smoothly.
            # Acquiring elements to scrape
            logging.info(f"Number of elements: {len(self._get_elements(browser))}")
            var = len(self._get_elements(browser))
            last_element = self._get_elements(browser)[-1]
            # Scrolling to the last result makes Google Maps load the next batch
            browser.execute_script("arguments[0].scrollIntoView();", last_element)
            time.sleep(2)  # Sleep allows time for page to load
            a = self._get_elements(browser)
            try:
                if len(a) == var:
                    # No new results appeared after this scroll
                    self.elementResults += 1
                    if self.elementResults > 20 or len(a) == prev_length:
                        break
                else:
                    self.elementResults = 0
                prev_length = len(a)
            except StaleElementReferenceException:
                continue