I’m using Scrapy with Playwright to load a Google Jobs search results page. Playwright is needed to load the page in a real browser and then to click on the different jobs to reveal each job’s details.
Example URL I want to extract information from: https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs
While I can get the code to open that page in a Playwright browser and parse the fields I want in an interactive Python environment, I’m not sure how to integrate Playwright into Scrapy smoothly. I have the start_requests function set up correctly, in the sense that Playwright is configured and it opens a browser to the desired page, like the URL above.
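For reference, here’s roughly how my start_requests and settings are set up (a minimal sketch assuming the scrapy-playwright plugin, which matches the response.meta["playwright_page"] usage below; the spider name is a placeholder):

import scrapy

class GoogleJobsSpider(scrapy.Spider):
    name = "google_jobs"

    # scrapy-playwright needs these (normally in settings.py)
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        url = "https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs"
        yield scrapy.Request(
            url,
            meta={
                "playwright": True,
                # exposes the page as response.meta["playwright_page"] in parse
                "playwright_include_page": True,
            },
        )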
Here’s what I have so far for the parse function:
async def parse(self, response):
    page = response.meta["playwright_page"]
    jobs = page.locator("//li")
    num_jobs = jobs.count()
    for idx in range(num_jobs):
        # For each job found, first need to click on it
        await jobs.nth(idx).click()
        # Then grab this large section of the page that has details about the job
        # In that large section, first click a couple of "More" buttons
        job_details = page.locator("#tl_ditsc")
        more_button1 = job_details.get_by_text("More job highlights")
        await more_button1.click()
        more_button2 = job_details.get_by_text("Show full description")
        await more_button2.click()
        # Then take that large section and pass it to another function for parsing
        soup = BeautifulSoup(job_details, 'html.parser')
        data = self.parse_single_jd(soup)
        ...
        yield {data here}
    return
When I try to run the above, it errors on the for idx in range(num_jobs) line with “TypeError: ‘coroutine’ object cannot be interpreted as an integer”. When running in an interactive Python shell, page.locator, jobs.count(), jobs.nth(#).click(), etc. all work. This leads me to believe that I’m misunderstanding something fundamental about the async nature of parse, which I believe is needed in order to be able to do things like click on the page (per this documentation).
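To illustrate, this is the kind of interactive snippet that works for me (sketched here with Playwright’s sync API, where jobs.count() returns a plain int rather than a coroutine, which I suspect is related to the difference):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs")
    jobs = page.locator("//li")
    num_jobs = jobs.count()  # plain int, no await, in the sync API
    jobs.nth(0).click()      # also no await
    browser.close()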
Any advice?