I’m trying to make API calls to the website to load HTML, but for some reason the callback never reaches the parse method; the spider just keeps increasing the page number up to 7 and then makes blank requests, getting nothing.
here’s my code:
class ScrollSpider(scrapy.Spider):
    name = 'scroll'
    headers = {
        # this API requires these headers:
        "Referer": "https://web-scraping.dev/testimonials",
        "X-Secret-Token": "secret123",
    }
    scroll = True

    def start_requests(self):
        page = 1
        while self.scroll:
            yield scrapy.Request('https://web-scraping.dev/api/testimonials?page={page}', headers=self.headers, callback=self.parse)
            page += 1

    def parse(self, response):
        if response.status != 200:
            if response.json()['detail']['0']['loc']['type'] == 'value_error.missing':
                raise ValueError("API returned an error - something is missing?", response.json())
            self.scroll = False
        else:
            testimonials = response.css('div.testimonial')
            for testimonial in testimonials:
                yield {
                    'user_name': testimonial.css('identicon-svg::attr("username")').get(),
                    'user_photo': testimonial.css('svg::attr("src")').get(),
                    'testimonial': testimonial.css('p::text').get(),
                    'rating': len(testimonial.css('span svg').getall())
                }
and here’s the log output:
PS C:\Users\maaik\OneDrive\Рабочий стол\python practice\practicing webscraping\practice> scrapy crawl scroll -o scroll.csv
2024-07-15 10:38:24 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: practice)
2024-07-15 10:38:24 [scrapy.utils.log] INFO: Versions: lxml 5.2.2.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.12.3 (tags/v3.12.3:f6650f9, Apr 9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-11-10.0.22631-SP0
2024-07-15 10:38:24 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-15 10:38:24 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-15 10:38:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-15 10:38:24 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-15 10:38:24 [scrapy.extensions.telnet] INFO: Telnet Password: c031a1767f980dd1
2024-07-15 10:38:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2024-07-15 10:38:24 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'practice',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'practice.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['practice.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-15 10:38:25 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-15 10:38:25 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-15 10:38:25 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-15 10:38:25 [scrapy.core.engine] INFO: Spider opened
2024-07-15 10:38:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-15 10:38:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-15 10:38:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://web-scraping.dev/api/testimonials?page=%7Bpage%7D> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2024-07-15 10:38:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://web-scraping.dev/robots.txt> (referer: None)
2024-07-15 10:38:26 [scrapy.core.engine] DEBUG: Crawled (422) <GET https://web-scraping.dev/api/testimonials?page=%7Bpage%7D> (referer: https://web-scraping.dev/testimonials)
2024-07-15 10:38:26 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <422 https://web-scraping.dev/api/testimonials?page=%7Bpage%7D>: HTTP status code is not handled or not allowed
2024-07-15 10:39:25 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2024-07-15 10:40:25 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-15 10:41:25 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-15 10:42:25 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-15 10:43:25 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-15 10:44:25 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-15 10:45:25 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
As far as I understand, it somehow tries to request the nonexistent 7th page, even though I wrote the parse method so that the while loop stops when the request status isn’t 200, and the earlier requests weren’t even made.
In your log, you can see:
[scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://web-scraping.dev/api/testimonials?page=%7Bpage%7D>
Maybe that’s the reason? You construct the URL on this line:
yield scrapy.Request('https://web-scraping.dev/api/testimonials?page={page}', headers=self.headers, callback=self.parse)
Shouldn’t that URL be an f-string, since it contains a {page} placeholder?
Because the string is not marked as an f-string, the value of page is never substituted into the placeholder, so every request uses the same literal URL and you get duplicates.
Fix:
yield scrapy.Request(f'https://web-scraping.dev/api/testimonials?page={page}', headers=self.headers, callback=self.parse)
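To make the difference concrete, here is a quick sketch you can run in a plain Python shell (no Scrapy needed). Without the f prefix the braces are sent literally, which is why your log shows page=%7Bpage%7D ({ and } URL-encoded) on every request and the dupefilter drops everything after the first one:

page = 2
plain = 'https://web-scraping.dev/api/testimonials?page={page}'
formatted = f'https://web-scraping.dev/api/testimonials?page={page}'
print(plain)      # https://web-scraping.dev/api/testimonials?page={page}  <- identical on every loop iteration
print(formatted)  # https://web-scraping.dev/api/testimonials?page=2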
Your issue can be solved by restructuring the start_requests method and handling the pagination logic properly. You don’t need a while loop in start_requests here: as you can see with this API, a page that exists returns a 200 response, while requesting a page beyond the available range returns an error status with a JSON detail body.
So when the page number goes past the limit, crawling stops automatically; you can achieve this by handling pagination in the parse method instead of in start_requests.
My solution below handles the pagination in the parse method:
import scrapy

class ScrollSpider(scrapy.Spider):
    name = 'scroll'
    start_urls = ['https://web-scraping.dev/api/testimonials']
    headers = {
        # this API requires these headers:
        "Referer": "https://web-scraping.dev/testimonials",
        "X-Secret-Token": "secret123",
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        testimonials = response.css('div.testimonial')
        for testimonial in testimonials:
            yield {
                'user_name': testimonial.css('identicon-svg::attr("username")').get(),
                'user_photo': testimonial.css('svg::attr("src")').get(),
                'testimonial': testimonial.css('p::text').get(),
                'rating': len(testimonial.css('span svg').getall())
            }
        if response.status == 200:
            page_number = response.meta.get('page_number', 1)
            next_page = page_number + 1
            url = f'https://web-scraping.dev/api/testimonials?page={next_page}'
            yield scrapy.Request(url=url, callback=self.parse, meta={'page_number': next_page}, headers=self.headers)
        else:
            # you can print error according to your requirements
            pass
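One caveat: with default settings, Scrapy’s HttpErrorMiddleware drops non-2xx responses before they reach parse, so the else branch above never actually runs; the crawl simply stops because no further request is yielded. If you also want to see the API’s JSON error and log when pagination ends, a minimal sketch of an alternative parse method (replacing the one in the spider above and using Scrapy’s handle_httpstatus_all meta key to let error responses through) could look like this:

    def parse(self, response):
        # Non-2xx responses only reach this method because the request was made
        # with meta={'handle_httpstatus_all': True}.
        if response.status != 200:
            self.logger.info("Stopping pagination: %s returned %s - %s",
                             response.url, response.status, response.text)
            return

        for testimonial in response.css('div.testimonial'):
            yield {
                'user_name': testimonial.css('identicon-svg::attr("username")').get(),
                'user_photo': testimonial.css('svg::attr("src")').get(),
                'testimonial': testimonial.css('p::text').get(),
                'rating': len(testimonial.css('span svg').getall())
            }

        next_page = response.meta.get('page_number', 1) + 1
        yield scrapy.Request(
            f'https://web-scraping.dev/api/testimonials?page={next_page}',
            callback=self.parse,
            headers=self.headers,
            meta={'page_number': next_page, 'handle_httpstatus_all': True},
        )

With this variant the spider keeps requesting pages until the API returns an error status, at which point the log line shows the JSON detail and the crawl finishes on its own.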