I’ve been trying to retrieve BibTeX citations from article/paper titles using the scholarly library in Python (specifically, the scholarly.search_pubs function). However, after roughly 20–60 queries in a 24-hour period, Google Scholar starts rate limiting me. The resulting error message (provided in both image and text form):
Error Message in image form
MaxTriesExceededException: Cannot Fetch from Google Scholar.
To start, this is my BibTeX retrieval code (also provided as an image); earlier in the code I pip install scholarly and pybtex:
BibTeX retrieval code
from scholarly import scholarly

def get_bibtex_citation(paper_title):
    # Search for the publication by title
    search_query = scholarly.search_pubs(paper_title)
    try:
        # Get the first result from the search query
        publication = next(search_query)
        # Fill in the details of the publication
        filled_publication = scholarly.fill(publication)
        # Get the BibTeX citation for the filled publication
        bibtex_citation = scholarly.bibtex(filled_publication)
        return bibtex_citation
    except StopIteration:
        return "No publication found with the given title."

paper_title = ""  # <----=----{Input Paper Name Into Here}----=----
bibtex_citation = get_bibtex_citation(paper_title)
print(bibtex_citation)
It works (for the most part; I still have to fix an unrelated bug, but I digress) until the rate limit kicks in.
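Since the quota seems to be per query, one way to stretch it is to cache results locally so the same title is never sent to Google Scholar twice. A minimal sketch (the cache file name and the fetch callback are my own choices, not part of scholarly):

```python
import json
import os

CACHE_FILE = "bibtex_cache.json"  # assumption: any writable path works

def cached_citation(paper_title, fetch, cache_file=CACHE_FILE):
    """Return a cached BibTeX entry, calling fetch(title) only on a cache miss."""
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    if paper_title not in cache:
        cache[paper_title] = fetch(paper_title)  # e.g. get_bibtex_citation
        with open(cache_file, "w") as f:
            json.dump(cache, f)
    return cache[paper_title]
```

Called as cached_citation(title, get_bibtex_citation), this only spends quota on titles that haven’t been looked up before.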
After checking scholarly’s documentation, I found that I should theoretically be able to use a proxy to avoid Google Scholar’s rate limits. So I tried some of scholarly’s proxy tools (code provided as both image and text):
Proxy Code
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.FreeProxies()  # returns a success flag; worth checking before use_proxy
scholarly.use_proxy(pg)
However, it didn’t work at all; all it did was add about 20 seconds to each query before failing with the same error message.
After searching a bit on some random forums, I saw suggestions to use Selenium instead of proxies, because free proxies don’t work that well: too many people share them, so they’re slow and often blocked. I don’t care much about the time, and I’d happily use proxies even if they added a reasonable delay per query, as long as they prevented the rate limiting, but I digress.
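For what it’s worth, the “overloaded free proxies” problem can be worked around generically by trying a list of proxies in order and moving on when one fails. This is a hedged sketch, not scholarly’s API; fetch is a placeholder for whatever performs the actual Scholar query:

```python
import time

def try_with_proxies(fetch, proxies, delay=1.0):
    """Call fetch(proxy) with each proxy in turn, returning the first
    result that doesn't raise; raises RuntimeError if every proxy fails.
    fetch is a placeholder for the code that performs the Scholar query."""
    last_error = None
    for proxy in proxies:
        try:
            return fetch(proxy)
        except Exception as e:  # free proxies fail often; keep trying the next one
            last_error = e
            time.sleep(delay)  # brief pause before the next attempt
    raise RuntimeError(f"all {len(proxies)} proxies failed") from last_error
```

This doesn’t make any individual free proxy more reliable, but it does mean one dead proxy doesn’t sink the whole query.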
After that, I watched two videos on Selenium with Python.
(This video: https://www.youtube.com/watch?v=Xjv1sY630Uc)
(And this video: https://www.youtube.com/watch?v=b5jt2bhSeXs)
I was able to set up Selenium after installing Chrome Canary and the respective ChromeDriver version for Canary (windows x64).
I have some simple code here that shows me messing around with selenium (in both image and text):
Just starting to use selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

PATH = ""  # <--- path to chromedriver in here
driver = webdriver.Chrome(service=Service(PATH))  # Selenium 4 takes a Service object

driver.get("")  # <--- website in here
print(driver.title)

search = driver.find_element(by=By.CLASS_NAME, value="element to look for")  # <--- put class name of element to look for in here
search.send_keys("test")
search.send_keys(Keys.RETURN)
time.sleep(1)
driver.quit()
However, even after watching these tutorials, I haven’t been able to write Selenium code that gets me closer to my goal. My current idea is to use Selenium to set up an interface where I can solve the CAPTCHAs for my scholarly program myself, so I don’t get rate limited. If anyone could give me some guidance or help, I would appreciate it!
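On the manual-CAPTCHA idea: one shape that might work is a loop that loads the page, checks whether it looks like a CAPTCHA, and blocks until a human solves it in the visible browser window. A sketch under stated assumptions: fetch_with_manual_captcha and looks_like_captcha are my own names, and the CAPTCHA-detection heuristic would need tuning against real Scholar pages:

```python
def fetch_with_manual_captcha(driver, url, looks_like_captcha, prompt=input):
    """Load url; while the page looks like a CAPTCHA, block until the user
    confirms they solved it in the visible browser, then request it again."""
    driver.get(url)
    while looks_like_captcha(driver.page_source):
        prompt("Solve the CAPTCHA in the browser window, then press Enter... ")
        driver.get(url)  # re-request once the human has solved the challenge
    return driver.page_source
```

With Selenium this might be called as fetch_with_manual_captcha(webdriver.Chrome(...), url, lambda html: "captcha" in html.lower()); the key point is that the script pauses instead of failing when a challenge appears.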