I’m having trouble scraping data from https://ser-sid.org/, a database for seed traits. I’ve successfully retrieved a table of potential attributes of species along with their URLs using the following code:
library(jsonlite)
library(curl)
library(httr)
library(dplyr)
# Define the API endpoint URL
url <- "https://fyxheguykvewpdeysvoh.supabase.co/rest/v1/species_summary"
# Define the API key
api_key <- "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6ImZ5eGhlZ3V5a3BkZXlzdm9oIiwicm9sZSI6ImFub24iLCJpYXQiOjE2NDc0MTY1MzQsImV4cCI6MTk2Mjk5MjUzNH0.XhJKVijhMUidqeTbH62zQ6r8cS6j22TYAKfbbRHMTZ8"
# Define the species list
Species <- c(
  "Astragalus glycyphyllos", "Epirrita dilutata", "Ilybius chalconatus",
  "Myrmica hirsuta", "Oenanthe aquatica", "Pterapherapteryx sexalata",
  "Rhantus frontalis", "Scorpidium cossonii", "Surnia ulula", "Taraxacum tortilobum"
)
restDF <- function(content) {
  # Parse the JSON content
  json_data <- jsonlite::fromJSON(content)
  # Convert JSON to data frame
  df <- as.data.frame(json_data)
  # Return the data frame
  return(df)
}
# Initialize a list to store results
results <- list()
# Loop through each species
for (i in seq_along(Species)) {
  # Split species into genus and epithet
  genus_epithet <- strsplit(Species[i], " ")[[1]]
  genus <- genus_epithet[1]
  epithet <- genus_epithet[2]
  # Define the query parameters for each species
  params <- list(
    select = "*",
    genus = paste0("ilike.", genus, "%"),
    epithet = paste0("ilike.", epithet, "%"),
    apikey = api_key # Include the API key as a query parameter
  )
  # Build the full URL with query parameters
  full_url <- modify_url(url, query = params)
  # Make the GET request using curl_fetch_memory()
  response <- curl::curl_fetch_memory(full_url)
  # Check if the request was successful
  if (response$status_code == 200) {
    # Parse the JSON response
    json_content <- rawToChar(response$content)
    # Convert JSON content to data frame using restDF function
    df <- restDF(json_content)
    # Append data to results list
    results[[Species[i]]] <- df
  } else {
    # Report the species whose request failed
    print(paste("Error: Failed to retrieve data for", Species[i]))
  }
}
Results <- results |>
  purrr::reduce(bind_rows) |>
  dplyr::mutate(URL = paste0("https://ser-sid.org/species/", id))
This produces the following data frame:
genus | epithet | id | has_germination | has_oil | has_protein | has_dispersal | has_seed_weights | has_storage_behaviour | has_morphology | URL |
---|---|---|---|---|---|---|---|---|---|---|
Astragalus | glycyphyllos | e7043715-6324-415e-83f8-02d282f7b5f8 | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | https://ser-sid.org/species/e7043715-6324-415e-83f8-02d282f7b5f8 |
Oenanthe | aquatica | bb851d35-2b67-48d1-8b0b-bda2bbd05f42 | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE | FALSE | https://ser-sid.org/species/bb851d35-2b67-48d1-8b0b-bda2bbd05f42 |
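For reference, the per-species query construction from the loop above can be isolated into a small helper, which makes it easier to inspect the filters being sent. This is only a minimal sketch: the function name `build_species_query` and the placeholder key `"my-anon-key"` are hypothetical, and it assumes each name is a two-word binomial as in my species list.

```r
# Hypothetical helper (not part of any package): builds the same filter list
# that the loop passes to modify_url() for a single species name.
build_species_query <- function(species_name, api_key) {
  # Split "Genus epithet" on whitespace; only the first two words are used
  parts <- strsplit(trimws(species_name), "\\s+")[[1]]
  if (length(parts) < 2) {
    stop("Expected a binomial name such as 'Genus epithet': ", species_name)
  }
  list(
    select  = "*",
    genus   = paste0("ilike.", parts[1], "%"),
    epithet = paste0("ilike.", parts[2], "%"),
    apikey  = api_key
  )
}

# "my-anon-key" is a placeholder, not a real key
params <- build_species_query("Astragalus glycyphyllos", api_key = "my-anon-key")
str(params)
```

The resulting list can be passed to `modify_url(url, query = params)` exactly as in the loop.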
Now, I’m trying to get the detailed trait values from the URLs. For example, for seed weight, the data is styled with the CSS class .text-white. However, when I attempt to scrape this data using rvest, I get an empty string.
Here is the code I’m using:
library(rvest)
URL <- Results$URL[1]
Data <- read_html(URL) |>
  rvest::html_elements(".font-medium.text-white") |>
  rvest::html_text()
# Or
Data <- read_html(URL) |>
  rvest::html_elements(".tracking-tight, .font-medium.text-white") |>
  rvest::html_text()
Both approaches return an empty string. I’ve tried various CSS selectors with html_elements but to no avail. Could anyone help me figure out what I’m doing wrong or suggest a better way to scrape the required data?
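To narrow the problem down, one check is to count how many nodes each selector matches in the HTML that `read_html()` actually receives. The sketch below does this on an offline stand-in page so that it runs without network access. The helper name `count_matches` is hypothetical and the markup is made up for illustration, not copied from ser-sid.org.

```r
library(rvest)

# Hypothetical helper: number of nodes in `page` matched by each CSS selector
count_matches <- function(page, selectors) {
  vapply(
    selectors,
    function(css) length(rvest::html_elements(page, css)),
    integer(1)
  )
}

# Offline stand-in for a species page (made-up markup, not the real site)
page <- rvest::minimal_html('
  <div id="root"></div>
  <p class="font-medium text-white">placeholder</p>
')

# Returns a named integer vector with one count per selector
count_matches(page, c(".font-medium.text-white", ".tracking-tight", "#root"))
```

Running the same helper on `read_html(URL)` would show whether my selectors match anything at all. A count of zero for every selector would suggest the elements are not present in the HTML the server returns (for example, if the page fills them in with JavaScript after loading), although I have not confirmed that this is the cause.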
Thank you in advance for your help!