I'm just trying to do a pretty basic scrape to PDF using Playwright in Python, but the Coursera pages are giving me a hard time.
For some reason not all of the page is rendered: I only get the basic details, without the syllabus (i.e. the “There are 5 modules in this course…” heading followed by a table), for https://www.coursera.org/learn/english-common-interactions-workplace-basic-level
Here’s my code. It should be straightforward: a very basic scrape with some human-like scrolling for good measure.
import logging
import random
from time import sleep

from playwright.sync_api import sync_playwright


def fetch_url_as_pdf(self, url: str):
    """
    Fetch URL and update the instance with PDF and metadata
    """
    with sync_playwright() as p:
        # Start the browser
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-infobars",
                "--disable-notifications",
                "--no-default-browser-check",
                "--no-first-run",
            ],
        )
        logging.info("Setup browser as %s", browser)
        # Setup the context
        context = browser.new_context(
            user_agent=(
                """Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 """
                """(KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"""
            ),
            viewport={"width": 1280, "height": 800},
            screen={"width": 1280, "height": 800},
            record_video_dir=DATA_DIR,  # DATA_DIR is defined elsewhere in my module
        )
        logging.info("Setup context as %s", context)
        # Load the page
        page = context.new_page()
        page.emulate_media(media="screen")
        response = page.goto(
            url,
            wait_until="load",
            timeout=100000,  # Increase timeout to 100 seconds
        )
        logging.info("Fetched page as %s", page)
        page.wait_for_load_state()
        # Scroll the page to the bottom
        self.scroll_page_slowly(page, scroll_pause_time=0.25)
        self._set_metadata(response)
        self.pdf = page.pdf()
        browser.close()

@staticmethod
def scroll_page_slowly(page, scroll_pause_time=0.5):
    """Scrolls down the page slowly to simulate human behavior."""
    scroll_height = page.evaluate("document.body.scrollHeight")
    current_position = 0
    while current_position < scroll_height:
        # Random step size in pixels
        scroll_lines = random.randint(50, 80)
        # scrollBy is relative to the current position, so pass only the step
        page.evaluate(f"window.scrollBy(0, {scroll_lines});")
        current_position += scroll_lines
        sleep(scroll_pause_time)
        # Re-read the height in case lazy-loaded content extended the page
        scroll_height = page.evaluate("document.body.scrollHeight")
Here’s the video resulting from record_video_dir: (video)
I’ve tried a few different waiting methods, including static sleep times and wait_until set to domcontentloaded and to load.
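One wait I have not tried yet is an explicit wait on the syllabus text itself rather than on a page-level load event. The sketch below is only an assumption about what that could look like: `wait_for_syllabus`, `SYLLABUS_PATTERN`, and `timeout_ms` are names I made up, and the regex simply mirrors the “There are 5 modules in this course” heading I expect to see. It takes an already-opened Playwright `page` and uses `get_by_text`, `wait_for`, and `inner_text`; I can't say whether it would actually make Coursera render the section in headless mode.

```python
import re

# Assumed wording of the syllabus heading, based on what the page shows in a normal browser.
SYLLABUS_PATTERN = re.compile(r"There are \d+ modules? in this course")


def wait_for_syllabus(page, timeout_ms=30_000):
    """Wait until the syllabus heading is visible and return its text.

    `page` is expected to be a Playwright sync-API Page that has already
    navigated to the course URL. A Playwright timeout error is raised if the
    heading does not become visible within `timeout_ms` milliseconds.
    """
    # Take the first element whose text matches the assumed heading pattern.
    heading = page.get_by_text(SYLLABUS_PATTERN).first
    # Block until that element is attached to the DOM and visible.
    heading.wait_for(state="visible", timeout=timeout_ms)
    return heading.inner_text()
```

The idea would be to call `wait_for_syllabus(page)` after `scroll_page_slowly(...)` and before `page.pdf()`, so the PDF is only generated once the heading is really there. If the call times out, that would at least confirm the section is never rendered in this browser session, rather than just rendered late.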