I want to call an async function once, before scraping starts, to fetch the URLs that my Scrapy spider will crawl. So far I have not been able to call the async function from the spider.
Here is my latest attempt:
import scrapy
import w3lib.html  # import the submodule explicitly so w3lib.html.remove_tags resolves
from playwright.async_api import async_playwright


async def get_urls() -> list[str]:
    urls = []
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto("https://somthing")
            content = await page.content()
            # (extracting the URLs from `content` into `urls` is omitted here)
            await browser.close()
    except Exception as e:
        print(f"Error in get_urls: {e}")
    return urls


class PharmachoiceSpider(scrapy.Spider):
    name = "pharmachoice"

    def start_requests(self):
        # Dummy request, used only to reach an async callback
        yield scrapy.Request("data:,", callback=self.parse_initial)

    async def parse_initial(self, response):
        urls = []
        try:
            urls = await get_urls()
        except Exception as e:
            print(f"Error when trying to call get_urls: {e}")
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    async def parse(self, response, **kwargs):
        disease = response.css('h2.elementor-heading-title::text').get()
        descriptions = response.css('section').css('p').getall()
        descriptions = [w3lib.html.remove_tags(des) for des in descriptions]
        yield {'disease': disease,
               'description': ''.join(descriptions)[:10]}
I’m getting this error:
[asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-13' coro=<Connection.run() done, defined at C:\kourosh\Programming\python\machine_learning\Retrieval Augmented Generation\mycrawler\venv\lib\site-packages\playwright\_impl\_connection.py:265> exception=NotImplementedError()>
Note that I’ve enabled the asyncio-based Twisted reactor, as described in the documentation:
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
Scrapy version 2.11.2
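One possibly relevant detail, offered as an assumption rather than a confirmed diagnosis: the traceback path shows this is running on Windows, where asyncio's selector-based event loop does not support subprocesses, and Playwright starts its driver as a subprocess. The sketch below does not use Scrapy or Playwright at all; it only checks whether a given event loop type can spawn a child process. The helper names `_spawn_noop` and `loop_supports_subprocesses` are made up for illustration.

```python
import asyncio
import sys


async def _spawn_noop() -> bool:
    """Try to start a trivial child process on the running event loop."""
    try:
        proc = await asyncio.create_subprocess_exec(sys.executable, "-c", "pass")
    except NotImplementedError:
        # Raised by event loops that have no subprocess support
        return False
    await proc.wait()
    return True


def loop_supports_subprocesses(loop_factory) -> bool:
    """Create a loop with `loop_factory` and report whether it can spawn a process."""
    loop = loop_factory()
    try:
        return loop.run_until_complete(_spawn_noop())
    finally:
        loop.close()


if __name__ == "__main__":
    # Platform default loop (ProactorEventLoop on Windows)
    print("default loop:", loop_supports_subprocesses(asyncio.new_event_loop))
    # Selector-based loop, the kind AsyncioSelectorReactor is built on
    print("selector loop:", loop_supports_subprocesses(asyncio.SelectorEventLoop))
```

If the selector loop reports False on this machine while the default loop reports True, that would be consistent with the NotImplementedError above coming from the event loop type rather than from the spider code itself.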
Any ideas on how to make a single async call at the start of the crawl?