In the code below,
<code>len(self.crawler.engine.slot.scheduler)</code>
always returns 0, while
<code>self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued']</code>
keeps increasing: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
I expected the opposite: the queue should be large before crawling starts and shrink as URLs are crawled, so the value should be high before each request and lower afterwards.
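For reference, the same counters can also be read through the public <code>crawler.stats</code> API instead of the private <code>_stats</code> dict (a minimal sketch of a helper method for the spider; <code>queue_snapshot</code> is a name I made up, and I am assuming the <code>scheduler/dequeued</code> key counts requests leaving the queue, so pending = enqueued - dequeued):
<code>def queue_snapshot(self):
    # Read the scheduler counters via the public stats API
    # instead of the private _stats dict.
    stats = self.crawler.stats
    enqueued = stats.get_value("scheduler/enqueued", 0)
    dequeued = stats.get_value("scheduler/dequeued", 0)
    # Assumption: requests still waiting in the scheduler = enqueued - dequeued.
    return enqueued, dequeued, enqueued - dequeued
</code>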
Also, uncommenting this code shows the same trend of an increasing queue size:
<code>if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
</code>
Note: I have set CONCURRENT_REQUESTS = 1 in settings.py.
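For completeness, the relevant line (a sketch; the rest of my settings.py is the project-template default):
<code># settings.py -- allow only one in-flight request at a time
CONCURRENT_REQUESTS = 1
</code>
The full spider: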
<code>import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/",
    ]

    def parse(self, response):
        print(f"\n before {self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued']} \n\n")
        print(f"\n before2 {len(self.crawler.engine.slot.scheduler)}")  # don't know why it always returns zero
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        print(f"\n After {self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued']} \n\n")
        print(f"\n after2 {len(self.crawler.engine.slot.scheduler)}")  # don't know why it always returns zero
</code>
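I run it with a minimal driver script (a sketch; <code>scrapy crawl quotes_spider</code> inside a project behaves the same, and the module name <code>quotes_spider</code> is just an assumption about where the class above lives):
<code>from scrapy.crawler import CrawlerProcess

from quotes_spider import QuotesSpider  # hypothetical module holding the spider above

# Run the spider with the single non-default setting from the question.
process = CrawlerProcess(settings={"CONCURRENT_REQUESTS": 1})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes
</code>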
This is the original question (I could not comment there because of low reputation): How to get the number of requests in queue in scrapy?

The Scrapy code is copied from: https://docs.scrapy.org/en/latest/intro/tutorial.html