I’ve created a loop to scrape NBA regular season data. The loop cycles through all the regular season months over a set of years. My code keeps returning the error “Error in open.connection(x, "rb") : HTTP error 429.” even though the webpage exists online and is publicly accessible.
I’ve created a “try” variable to handle the cases where no NBA games were played in one of the months in my list. The loop should skip those months and move on to the next one. The loop logic itself seems fine to me. I do notice that my memory usage report shows upwards of 95% of memory in use while the loop is executing. Could this be a potential issue I need to address in order to run my loop and build a table of NBA regular season data for my analysis period? Any help is greatly appreciated.
library(rvest)
library(dplyr)
mnths = c("october","november","december","january","february","march","april","may")
#list of months to cycle through
yrs = seq(2003,2017)
#list of years to cycle through
url_base = "https://www.basketball-reference.com/leagues/NBA_"
#beginning of webpage URL
#https://www.basketball-reference.com/leagues/NBA_2003_games-october.html
#above is the final webpage formatted for the 1st run of the loop as an example
i = 1
j = 1
while(i<=length(yrs)){
#begin loop to cycle through each year
while(j <= length(mnths)){
#begin subloop to cycle through each month in a year
webpage = paste0(url_base, yrs[i], "_games-", mnths[j], ".html")
#string variable of webpage with specific month and year in loop
page = try(read_html(webpage), silent = TRUE)
#try variable holding the parsed page, or an object of class "try-error" if the request failed
if(inherits(page, "try-error")){
#the request failed (for example, no games were played that month), so this month is skipped
#inherits() tests the class of the object; comparing it to the string "try-error" with == does not
}else if(exists("tb")){
tbx = as.data.frame(page %>% html_nodes("table") %>% html_table())
#table created from the page already downloaded above, so each page is requested only once
tb = rbind(tb,tbx)
#table holding all data from all runs of loop
}else{
tb = as.data.frame(page %>% html_nodes("table") %>% html_table())
#table that is created that all new tables will be merged into
#this else statement is only used on the first successful run of the loop
}
j = j + 1
#j is incremented by 1 to move to the next month in the "mnths" list
rm(page)
#removing try variable from memory
}
#end subloop to cycle through each month in a year
j = 1
#j reset to 1 so that the next year starts at the first month in the "mnths" list
i = i + 1
#i is incremented by 1 to move to the next year in the "yrs" list
}
#end loop to cycle through each year
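HTTP status 429 means "Too Many Requests", so one thing that might help is pausing between requests and retrying after a failure. Below is a minimal sketch under that assumption, not a confirmed fix. The helper names build_url and fetch_with_retry are hypothetical, and the retry count and wait times are placeholder values, not limits documented by the site.

```r
url_base = "https://www.basketball-reference.com/leagues/NBA_"

# hypothetical helper: build the schedule URL for one season and month
build_url = function(year, month){
  paste0(url_base, year, "_games-", month, ".html")
}

# hypothetical helper: call fetch(url) and retry with a growing pause on failure
# fetch defaults to rvest::read_html; max_tries and wait (seconds) are assumed values
fetch_with_retry = function(url, fetch = rvest::read_html, max_tries = 3, wait = 5){
  for(attempt in seq_len(max_tries)){
    result = try(fetch(url), silent = TRUE)
    if(!inherits(result, "try-error")){
      # the request succeeded, so hand back the parsed page
      return(result)
    }
    if(attempt < max_tries){
      # wait longer after each failed attempt before trying again
      Sys.sleep(wait * attempt)
    }
  }
  # every attempt failed; the caller can treat NULL as "skip this month"
  NULL
}
```

Inside the month loop this would be called as page = fetch_with_retry(build_url(yrs[i], mnths[j])), followed by a check with is.null(page). One caveat: months in which no games were played also fail, so they would be retried too, which slows the loop down unless max_tries is kept small.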