I am trying to build a proof of concept for web scraping with Selenium in Foundry. The requirement is to scrape potentially hundreds to thousands of websites. I realize this is probably not ideal, but I was given this requirement and I intend to find a way to make it work unless someone from Palantir can tell me why it won’t work or why it is a terrible idea. Beautiful Soup works fine in regular transforms, but we need something that can scrape dynamically generated content. Because that requires running a browser, I am not sure how to make Selenium work other than with containers and sidecar transforms. If there is a better way, please let me know.
My sidecar transform build is timing out, and I don’t know what the issue is or how to begin troubleshooting it. The sidecar transform runs for 20-30 minutes, then fails with the following error:
[module version: 1.1132.0]
Spark module 'ri.spark-module-manager.main.spark-module.e3afe96c-4d51-44a4-a687-1174dfba2fb4' died while job 'ri.foundry.main.job.f00598ef-e11e-43c4-82bf-ac63d057294a' was using it. (ExitReason: MODULE_UNREACHABLE)
Module exit details: Module became unreachable after registration. This likely indicates the module has died. Module became unreachable for an unknown reason.
Here is the sidecar transform, mostly just copied from the documentation with an egress policy added for the one test website we’re scraping:
from transforms.api import transform, Input, Output, configure
from transforms.sidecar import sidecar, Volume
from myproject.datasets.utils import copy_files_to_shared_directory, copy_output_files
from myproject.datasets.utils import copy_start_flag, wait_for_done_flag, copy_close_flag, launch_udf_once
from transforms.external.systems import use_external_systems, EgressPolicy, Credential
@use_external_systems(
    egress=EgressPolicy('{POLICY RID}')
)
@configure([
    'NUM_EXECUTORS_64',
    'EXECUTOR_MEMORY_LARGE', 'EXECUTOR_MEMORY_OVERHEAD_LARGE',
    'DRIVER_MEMORY_EXTRA_EXTRA_LARGE', 'DRIVER_MEMORY_OVERHEAD_EXTRA_LARGE'
])
@sidecar(image='{PACKAGE_NAME}', tag='0.3', volumes=[Volume("shared")])
@transform(
    output=Output("OUTPUT"),
)
def compute(ctx, output, egress):
    def user_defined_function(row):
        # Copy files from source to shared directory.
        # copy_files_to_shared_directory(source)
        # Send the start flag so the container knows it has all the input files
        copy_start_flag()
        # Iterate till the stop flag is written or we hit the max time limit
        wait_for_done_flag()
        # Copy out output files from the container to an output dataset
        output_fnames = [
            "start_flag",
            # "outfile.csv",
            "logfile",
            "done_flag",
        ]
        copy_output_files(output, output_fnames)
        # Write the close flag so the container knows you have extracted the data
        copy_close_flag()
        # The user defined function must return something
        return (row.ExecutionID, "success")
    # This spawns one task, which maps to one executor, and launches one "sidecar container"
    launch_udf_once(ctx, user_defined_function)
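The helpers imported from `myproject.datasets.utils` are not shown in this post. As a rough sketch only, a flag wait with an explicit time limit might look like the following. `wait_for_flag`, its parameters, and the default timeout are hypothetical names for illustration, not the documented helper. It illustrates one thing worth checking: the transform can only proceed if the file it polls for is the same file that `entrypoint.py` writes into the shared volume.

```python
import os
import time


def wait_for_flag(flag_path, timeout_s=600.0, poll_s=1.0):
    """Poll until flag_path exists; return True if seen, False on timeout.

    Hypothetical helper: the name, parameters, and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(flag_path):
            return True
        time.sleep(poll_s)
    # One final check so a flag written during the last sleep is not missed
    return os.path.exists(flag_path)
```

Returning a boolean lets the caller log or raise on a timeout instead of continuing silently, which can make a mismatched flag name easier to spot.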
Dockerfile:
FROM --platform=linux/amd64 python:3.9-buster
# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1
# please review all the latest versions here:
# https://googlechromelabs.github.io/chrome-for-testing/
ENV CHROMEDRIVER_VERSION=123.0.6312.122
### install chrome
# https://storage.googleapis.com/chrome-for-testing-public/123.0.6312.122/linux64/chrome-linux64.zip
RUN apt-get update && apt-get install -y wget && apt-get install -y zip
COPY google-chrome-stable_current_amd64.deb .
RUN apt-get install -y ./google-chrome-stable_current_amd64.deb
### install chromedriver
COPY chromedriver-linux64.zip .
RUN unzip chromedriver-linux64.zip && rm -dfr chromedriver-linux64.zip \
    && mv /chromedriver-linux64/chromedriver /usr/bin/chromedriver \
    && chmod +x /usr/bin/chromedriver
# set display port to avoid crash
ENV DISPLAY=:99
# install selenium
RUN pip install selenium==4.3.0
WORKDIR /app
COPY . /app
RUN mkdir -p /opt/palantir/sidecars/shared-volumes/shared/
RUN chown 5001 /opt/palantir/sidecars/shared-volumes/shared/
ENV SHARED_DIR=/opt/palantir/sidecars/shared-volumes/shared
USER 5001
ENTRYPOINT ["python", "entrypoint.py"]
entrypoint.py, also mostly copied from the documentation:
import os
import time
import subprocess
from datetime import datetime
def run_process():
    "Define a function for running commands and capturing stdout line by line"
    # p = subprocess.Popen(exe, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    p = subprocess.Popen(["python", "scraper.py"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, err = p.communicate()
    return (p.returncode, out, err)
    # return iter(p.stdout.readline, b"")
start_flag_fname = "/opt/palantir/sidecars/shared-volumes/shared/start_flag"
stop_flag_fname = "/opt/palantir/sidecars/shared-volumes/shared/stop_flag"
terminate_flag_fname = "/opt/palantir/sidecars/shared-volumes/shared/terminate_flag"
run_process()
# We want this to loop multiple times until close flag shows up
while not os.path.exists(terminate_flag_fname):
    print(f'{datetime.utcnow().isoformat()}: waiting for start flag')
    while not os.path.exists(start_flag_fname):
        time.sleep(1)
    print(f'{datetime.utcnow().isoformat()}: start flag detected')
    with open('/opt/palantir/sidecars/shared-volumes/shared/logfile', 'w') as logfile:
        for item in run_process():
            my_string = f'{datetime.utcnow().isoformat()}: {item}'
            print(my_string)
            logfile.write(my_string)
            logfile.flush()
    print(f'{datetime.utcnow().isoformat()}: execution finished writing output file')
    open(stop_flag_fname, 'w')
    print(f'{datetime.utcnow().isoformat()}: stop flag file written')
    # Clean up files in Foundry logic
# Final Close Flag, container dies
while not os.path.exists(terminate_flag_fname):
    time.sleep(1)
print(f'{datetime.utcnow().isoformat()}: terminate flag detected. shutting down')
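Note that `run_process()` above returns a `(returncode, out, err)` tuple, so the `for item in run_process()` loop writes those three values to the log only after the child has exited. A minimal sketch of the line-by-line capture that the commented-out `iter(p.stdout.readline, b"")` line points at could look like this. `stream_process` and its parameters are hypothetical names, and this is one possible approach rather than the documented one.

```python
import subprocess
import sys
from datetime import datetime, timezone


def stream_process(cmd, logfile_path):
    """Run cmd and write each output line to logfile_path as it arrives.

    Returns the child's exit code. stderr is merged into stdout.
    Hypothetical helper: the name and parameters are assumptions.
    """
    with open(logfile_path, "w") as logfile:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        # Iterating over the pipe yields lines as the child emits them
        for line in proc.stdout:
            stamp = datetime.now(timezone.utc).isoformat()
            logfile.write(f"{stamp}: {line}")
            logfile.flush()
        proc.stdout.close()
        return proc.wait()
```

Called as `stream_process(["python", "scraper.py"], "/opt/palantir/sidecars/shared-volumes/shared/logfile")`, this would put partial scraper output in the log even if the container dies mid-run, and the returned exit code shows whether the scraper itself failed.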
scraper.py, basic Selenium run test:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Define options for running the chromedriver
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-dev-shm-usage")
# Initialize a new chrome driver instance
driver = webdriver.Chrome(options=chrome_options)
driver.get('{WEBSITE}')
print(driver.page_source)
driver.quit()