I'm trying to scrape a website that loads content as you scroll down. The page does not add new elements; it just updates the content of the existing elements as you scroll. So I'm using Selenium in a DownloaderMiddleware to scroll, and I want to return the current page_source after every scroll, since it is different each time.
However, when I use yield HtmlResponse in the process_request method, Scrapy complains that the method should not return a generator. If I use return HtmlResponse instead, I only get the content of the last page. How can I solve this problem? Thanks a lot.
def process_request(self, request, spider):
    # `wait` is assumed to be a selenium WebDriverWait created earlier (not shown).
    for i in range(20):
        try:
            # Scroll to the bottom so the page swaps in the next batch of content.
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            logging.error(f'scroll{i+1}times')
            wait.until(EC.presence_of_element_located((By.XPATH, '//div/a[@role = "link"]')))
            time.sleep(2)
            logging.error(f'waiting scroll{i+1}times')
            logging.error(f'get page_source{i+1}times')
            self.driver.execute_script('return document.body.scrollHeight;')
        except Exception as e:
            logging.error('scroll failed: %s', e)
            return None
    # Only the final state of the page is captured and returned here.
    body = self.driver.page_source
    return HtmlResponse(body=body, url=request.url, encoding='utf-8')
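One direction that might work, sketched here under assumptions rather than tested against the real site: since process_request has to return a single object (None, a Response, or a Request) and not a generator, the page source could be captured after each scroll and collected in a list, and everything returned together at the end. The helper below is hypothetical; `collect_snapshots`, `scroll_once`, and `get_source` are illustrative names and not part of Scrapy or Selenium. It is written without either library so the idea can be checked on its own.

```python
from typing import Callable, List


def collect_snapshots(
    scroll_once: Callable[[], None],
    get_source: Callable[[], str],
    max_scrolls: int = 20,
) -> List[str]:
    """Scroll up to max_scrolls times and keep each distinct page source.

    scroll_once and get_source are hypothetical callables supplied by the
    caller; in the middleware they would wrap the Selenium driver.
    """
    snapshots: List[str] = [get_source()]  # page state before any scrolling
    for _ in range(max_scrolls):
        scroll_once()
        source = get_source()
        # Skip a snapshot that is identical to the previous one.
        if source != snapshots[-1]:
            snapshots.append(source)
    return snapshots
```

In the middleware, scroll_once would presumably wrap the execute_script call plus the wait, and get_source would be something like lambda: self.driver.page_source. The resulting list could then be stored in request.meta under a key of my choosing before returning one HtmlResponse, so that the spider callback can parse each snapshot separately.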