I am using icrawler in Python to scrape images from the web. I have a list of search terms, download_waitlist = ["cat","dog","car","motorbike","snoop dogg"], and I want to download an image for each one using icrawler.
There are two issues with the package. First, it sometimes ends the process without having downloaded anything, so there are no images in my download folder. Second, the images end up under the wrong labels: I often have way too many cat images that get labeled as “dog” or “car”. Are there any workarounds, or is there anything I am doing wrong? Here is my downloader code:
import os

import pygame
from icrawler.builtin import GoogleImageCrawler

download_waitlist = ["cat","dog","car","motorbike","snoop dogg"]
AllImages = {}  # maps each query to its loaded image (assumed to be defined earlier in the full script)
google_crawler = GoogleImageCrawler(storage={"root_dir": "Downloads"}, feeder_threads=3, parser_threads=3, downloader_threads=3)

def getfilepath(offset, imgformat, dotslash):  # returns the file path for a given image number, zero-padded to six digits
    if len(str(offset)) > 5:
        gotnumber = str(offset)
    else:
        gotnumber = str(0)
        for i in range(5 - len(str(offset))):
            gotnumber = gotnumber + str(0)
        gotnumber = gotnumber + str(offset)
    gotnumber = "Downloads/" + gotnumber + "." + imgformat
    if dotslash:
        gotnumber = "./" + gotnumber
    return gotnumber

for iteration, query in enumerate(download_waitlist):  # download every query in the list
    search = query
    offset = iteration * 10  # make sure images don't overwrite each other
    google_crawler.crawl(keyword=search, max_num=2, file_idx_offset=offset)  # search for 2 images
    if os.path.isfile(getfilepath(offset + 1, "jpg", True)):  # load the image using pygame
        loaded = pygame.image.load(getfilepath(offset + 1, "jpg", False))
    elif os.path.isfile(getfilepath(offset + 1, "png", False)):
        loaded = pygame.image.load(getfilepath(offset + 1, "png", False))
    AllImages[search] = loaded  # add the image to my dictionary, labeled by its query
    print("downloaded " + str(iteration + 1) + "/" + str(len(download_waitlist)) + " images")
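A note on the loop above: when neither the .jpg nor the .png file exists after a crawl, `loaded` is not reassigned, so it still holds the previous query's image (or raises a NameError on the very first iteration). That would store, for example, the cat image under "dog", which may account for at least part of the mislabeling. Below is a minimal sketch of a lookup that returns None in that case. It assumes the six-digit zero-padded file names that getfilepath produces and that only .jpg and .png need checking; `find_downloaded` is a hypothetical helper, not part of icrawler.

```python
import os


def find_downloaded(index, root_dir="Downloads", extensions=("jpg", "png")):
    """Return the path of the file saved for `index`, or None if nothing was saved.

    Assumes file names are the index zero-padded to six digits,
    matching what getfilepath produces.
    """
    stem = str(index).zfill(6)  # e.g. 11 -> "000011"
    for ext in extensions:
        path = os.path.join(root_dir, stem + "." + ext)
        if os.path.isfile(path):
            return path
    return None  # nothing was downloaded for this index
```

In the loop, one could call `find_downloaded(offset + 1)` and only load and store the image when the result is not None, so a failed crawl leaves the query missing from AllImages instead of silently reusing the previous image.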
I also tried the fork at https://github.com/Patty-OFurniture/icrawler.git, which has been modified to fix reliability issues, but I couldn’t get it to work in my environment.