I made a post a couple of weeks ago asking a similar question, but it was closed for not being specific enough.
I’m currently trying to scrape 500 names from Google Scholar and collect information from them using the scholar package in R. However, after around 50 names, I get a message that says “Warning: Response code 429. Google is rate limiting you for making too many requests too quickly.”. I’m trying to figure out a way to get around this.
I’ve tried setting a timer after each iteration is done, but this doesn’t work regardless of how long the timer is. I still get timed out after around 50-60 people have information scraped. My new plan was to write in an if statement that would stop the loop if a “429” error message was read and wait 15 minutes, then continue after 15 minutes. From my testing, 15 minutes seems to be about how long Google times you out for.
I initially thought my code was working, but after looking through my results, I don’t believe it is. For example, the R console has the error code much more than I would expect (over 100 times). There’s also many people whose names should have been scraped, but I get NA values for them. I’m expecting to get some NAs since some people may not have a Google Scholar profile.
Does my code do what I think it should be? Is there a better way to do this? Note that “info” is a data frame that has two columns (first_name, last_name) which will be updated with more information as data is scraped.
scholar_ids <- character(nrow(info))
last_successful_iteration <- 0 # Initialize
for (i in seq_len(nrow(info))) {
# Get the Google Scholar ID for the current person
id <- NULL
while (is.null(id)) { # Retry until successful or error other than 429
tryCatch({
id <- get_scholar_id(last_name = info$last_name[i],
first_name = info$first_name[i],
affiliation = "Davis")
last_successful_iteration <- i # Update last successful iteration
}, error = function(err) {
cat("Error message:", err$message) # Print the error message
if (grepl("429", err$message)) { # Check if the error contains 429
cat("Timeout. Pausing for 15 minutes.")
Sys.sleep(900) # Pause for 15 minutes (900 seconds)
# Restart the loop from the last successful iteration
i <- last_successful_iteration
} else {
stop(err) # If it's not error 429, stop and display the error
}
})
}
# Store the ID in the vector
scholar_ids[i] <- id
sleep_time <- runif(n = 1, min = 10, max = 12)
Sys.sleep(sleep_time)
}