I’m using Scrapy in a Jupyter notebook to scrape the YellowPages website and am running into a strange error.
My code scrapes the YellowPages list view for a search on ‘auto’ across a variety of zip codes. While it correctly scrapes almost all of the business listings (around 500), there are 2 that I cannot get my Spider to scrape.
The 2 businesses are Roger’s Services and Northeastern Bus Rebuilders (see the screenshot below for how they appear on the website). I’ve inspected the site’s HTML, and the structure of the divs containing the information I’m scraping doesn’t seem markedly different from any of the other divs that were scraped without trouble. The page I’m having issues with can be found here. Note that all businesses on this page and on subsequent pages are scraped correctly; only these 2, on this page specifically, seem not to be ‘seen’ by the Spider.
I tried scraping the information by first grabbing, by ID, a div that contains the divs and elements I want, but to no avail.
I’ve tried both a CSS selector and an XPath expression; see my code below for my attempts. I was able to successfully get the results container div of the business directly above Roger’s Services by accessing its div.results lid-XXXX element with both CSS and XPath, but when I put in the Roger’s Services ID, I get nothing.
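As an extra sanity check (a quick sketch that assumes it runs inside parse(), with 5671823 being the lid value from my debugging code below), I want to test whether the ID even appears in the raw, unparsed HTML, which should tell me whether the listing is missing from the response entirely or is present but not matched by the selector:

    # Does the ID appear anywhere in the raw HTML string?
    # True here, combined with an empty selector result, would suggest the markup
    # around that listing is malformed and is being dropped/moved by the parser;
    # False would suggest the listing is injected by JavaScript in the browser.
    print('lid-5671823' in response.text)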
Any advice on why this might be happening? I’m genuinely at a loss, because the HTML looks identical between business listings and only these two are causing problems.
Please let me know what you might do to solve it! Thank you!
This is my parse code, with print statements where I tried debugging:
def parse(self, response):
    ### DEBUGGING SECTION BEGIN ###
    div_pointer = response.css('div#lid-5671823')
    # div_pointer = response.xpath('//div[contains(@id, "5671823")]')
    div_content = div_pointer.get()
    if div_content is not None:
        print('The div containing Roger\'s Services:')
        print(div_content)
        print('Trying to get the stuff inside:')
        div_inside = div_pointer.css('div.info-section.info-primary')
        print(div_inside)
    ### DEBUGGING SECTION END ###

    containing_divs = response.css('div.info-section.info-primary')  # works fine for everything else
    for containing_div in containing_divs:
        business_name = containing_div.css('a.business-name span::text').get()
        if business_name is None:
            # Paid listings are coded slightly differently in the HTML, so handle them here:
            business_name = containing_div.css('a.business-name::text').get()
            href = containing_div.css('a.business-name::attr(href)').get()
            end_of_href = href[href.rfind('-') + 1:]  # everything after the last '-': the ypid, followed by ?lid=...
            ypid = end_of_href[:end_of_href.rfind('?lid')]  # substring to keep only the ypid
        else:
            href = containing_div.css('a.business-name::attr(href)').get()
            ypid = href[href.rfind('-') + 1:]  # the ypid is the end of the href, so take everything after the last '-'
        categories = containing_div.css('div.categories a::text').extract()
        if ypid.rfind('?lid') != -1:
            # Clean the ypid, in case the href of a non-paid listing also ends in ?lid=...
            end_of_href = href[href.rfind('-') + 1:]
            ypid = end_of_href[:end_of_href.rfind('?lid')]
        yield {
            'business_name': business_name,
            'ypid': ypid,
            'categories': categories,
        }
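For completeness, this is roughly how the selectors can be tested outside the notebook with scrapy shell (the quoted URL is just a placeholder for the page linked above):

    $ scrapy shell "<listing page URL>"
    >>> response.css('div#lid-5671823').get()
    >>> response.xpath('//div[contains(@id, "5671823")]').get()
    >>> 'lid-5671823' in response.text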