I can’t get the navigate() method of the client object of rsDriver to work inside a function.
My sandbox is the Quotes to scrape website: http://quotes.toscrape.com/
My libraries and global variables are:
library(RSelenium)
library(httr)
rD <- rsDriver(port = 4442L, browser="firefox", chromever = NULL)
remDr <- rD[["client"]]
quotes <- list()
authors <- list()
I launch my script with:
get_quotes(start_url)
Here’s my main function, get_quotes(), which takes a URL as its parameter:
start_url <- "https://quotes.toscrape.com/"
remDr$navigate(start_url)
get_quotes <- function(page_url) {
  print(paste("Processing ", page_url))
  # current page
  new_quotes <- lapply(get_quotes_elements(page_url), get_quote)
  quotes <<- append(quotes, new_quotes)
  # Find next page
  next_quotes_page_url <- get_next_quotes_page_url()
  if (!is.null(next_quotes_page_url)) {
    print(paste("next page url = ", next_quotes_page_url))
    remDr$navigate(next_quotes_page_url)
    get_quotes(next_quotes_page_url)
  }
}
I use an lapply() loop to extract the quotes: for each HTML element of class .quote found on the page by the function get_quotes_elements()
get_quotes_elements <- function(page_url) {
  remDr$navigate(page_url)
  quotes_elements <- remDr$findElements("css selector", ".quote")
  return(quotes_elements)
}
, I execute the get_quote() function:
get_quote <- function(quote_element) {
  quote <- list()
  quote_text <- quote_element$findChildElement("css selector", ".text")$getElementText()
  quote_author <- quote_element$findChildElement("css selector", ".author")$getElementText()
  quote_tags_elements <- quote_element$findChildElements("css selector", ".tag")
  quote_tags <- list(unlist(lapply(quote_tags_elements, function(quote_tag) {
    quote_tag$getElementText()
  })))
  quote['author'] <- quote_author
  quote['quote'] <- quote_text
  quote['tags'] <- quote_tags
  author_page_url <- quote_element$findChildElement("css selector", "a[href*='/author/']")$getElementAttribute("href")
  print(paste(" author url found :", author_page_url))
  remDr$navigate(author_page_url)
  author <- get_author(author_page_url)
  return(quote)
}
Then I look for further pages of quotes using get_next_quotes_page_url()
get_next_quotes_page_url <- function() {
  next_button_element <- remDr$findElements("css selector", ".next a")
  next_quotes_page_url <- NULL
  if (length(next_button_element) > 0) {
    next_quotes_page_url <- unlist(lapply(next_button_element, function(next_page) {
      next_page$getElementAttribute("href")
    }))
  }
  return(next_quotes_page_url)
}
, and if there is one, I call the navigate() method, then run get_quotes() again, and so on for as long as there is a link to a next page of quotes.
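The page-by-page recursion described above can also be expressed as a loop. The sketch below is only an illustration: next_page_of is a hypothetical lookup table standing in for the site and the browser, and crawl_pages() is a made-up name, so nothing here touches RSelenium.

```r
# Hypothetical stand-in for the site: each page URL maps to the next one.
# The last page has no entry, so looking it up yields NULL.
next_page_of <- list(
  "https://quotes.toscrape.com/"        = "https://quotes.toscrape.com/page/2/",
  "https://quotes.toscrape.com/page/2/" = "https://quotes.toscrape.com/page/3/"
)

# Hypothetical helper: keep visiting pages until get_next() returns NULL.
crawl_pages <- function(start_url, get_next) {
  visited <- character()
  url <- start_url
  while (!is.null(url)) {
    visited <- c(visited, url)  # the real script would scrape the page here
    url <- get_next(url)
  }
  visited
}

pages <- crawl_pages("https://quotes.toscrape.com/",
                     function(url) next_page_of[[url]])
print(pages)
```

In the real script, get_next would be replaced by a call that navigates and then reads the .next a link, as get_next_quotes_page_url() does.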
Everything works perfectly if I limit the script to extracting quotes.
The problem arises when, inside get_quote(), I retrieve the link to the quote author’s description page and try to extract the author’s information with the get_author() function:
get_author <- function(author_page_url) {
  print(paste(" processing scraping of : ", author_page_url))
  author_name <- remDr$findElement("css selector", ".author-title")$getElementText()
  print(paste(" author name ", author_name))
}
The navigation to the author’s page doesn’t happen, so the script raises an "Unable to locate element: .author-title" error, which makes sense since the browser is not on the right page.
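One thing that may be worth checking at this point, offered as a sketch rather than a confirmed diagnosis: in RSelenium, getElementAttribute() returns a list rather than a character string, so author_page_url in get_quote() is a list when it is handed to navigate(). The pagination code does not have this difference, because get_next_quotes_page_url() passes its result through unlist(). The snippet below uses a hand-built list as a stand-in for the real return value (the author URL is only illustrative), so it runs without a browser; the commented-out lines show where the unwrapped value could be tried with remDr.

```r
# Stand-in for what getElementAttribute("href") gives back in RSelenium:
# a list of length 1, not a character string. The URL is illustrative.
author_page_url <- list("https://quotes.toscrape.com/author/Albert-Einstein")

is.list(author_page_url)       # TRUE: this is what navigate() would receive
is.character(author_page_url)  # FALSE

# Unwrap the list to get a plain character string.
author_page_url_chr <- author_page_url[[1]]
stopifnot(is.character(author_page_url_chr), length(author_page_url_chr) == 1)

# With a live session (assumption: remDr is the client created earlier), the
# unwrapped value could then be tried in place of the list:
# remDr$navigate(author_page_url_chr)
# print(remDr$getCurrentUrl())  # check which page the browser is actually on
```

If this turns out to be the cause, note that getElementText() also returns a list, so author_name in get_author() may need the same unwrapping. Separately, and also as an assumption to verify, navigating away from the quotes page inside the lapply() loop may leave the remaining .quote elements stale.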
Thanks for your help!