I have this website: https://medicaldictionary.lib.umich.edu/
I am trying to take the information from this website and make it into a table, e.g.:
term defn
1 24-hour coverage When many health plans or programs are combined into one plan that can cover someone for the whole day.
2 abatement Remove or cover a dangerous material, such as lead or mercury, from water, paint, or air, or other places.
3 abdomen, abdominal Your belly or tummy area.
4 ability To able to or can do.
5 ablation To remove or destroy a body part or tissue. This can be done in many ways, including surgery, radiation, or drugs.
I scanned the source code for this website and identified the tags, and then tried to write the following:
library(rvest)
library(dplyr)
library(tidyr)
scrape_website <- function(url) {
  webpage <- read_html(url)

  col1_content <- webpage %>%
    html_elements(".row.flex-start.cardHeader") %>%
    html_text2() %>%
    trimws()
  h2_content <- webpage %>%
    html_elements("h2") %>%
    html_text2() %>%
    trimws()
  col1 <- unique(c(col1_content, h2_content))

  iconset_content <- webpage %>%
    html_elements(".iconset") %>%
    html_text2() %>%
    trimws()
  def_content <- webpage %>%
    html_elements(".definitionParagraph") %>%
    html_text2() %>%
    trimws()
  col2 <- unique(c(iconset_content, def_content))

  max_length <- max(length(col1), length(col2))
  col1 <- c(col1, rep(NA, max_length - length(col1)))
  col2 <- c(col2, rep(NA, max_length - length(col2)))

  result_df <- data.frame(
    Header = col1,
    Content = col2,
    stringsAsFactors = FALSE
  )
  return(result_df)
}
When I call it:
url <- "https://medicaldictionary.lib.umich.edu/"
result_table <- scrape_website(url)
I get an essentially empty result (one row, no terms or definitions):
Header Content
1 Application by the University of Michigan Library <NA>
Is there a way to fix this?
Everything is available in the JSON file at https://medicaldictionary.lib.umich.edu/data.json. You can read it directly from there:
out <- jsonlite::fromJSON("https://medicaldictionary.lib.umich.edu/data.json")
out$definitions
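To get from out$definitions to the two-column table in the question, something like the following may work. It is only a sketch: it assumes out$definitions comes back as a data frame with columns named `term` and `definition`, which is a guess about the JSON layout and not something confirmed here, so check `str(out$definitions)` first and adjust the names. A small inline sample (hypothetical, built from the rows shown in the question) stands in for the real file so the snippet runs offline.

```r
library(jsonlite)

# Hypothetical sample standing in for data.json; the field names
# "term" and "definition" are assumptions, not confirmed from the real file.
sample_json <- '{
  "definitions": [
    {"term": "abatement", "definition": "Remove or cover a dangerous material, such as lead or mercury, from water, paint, or air, or other places."},
    {"term": "abdomen, abdominal", "definition": "Your belly or tummy area."}
  ]
}'

# For the real data, pass the data.json URL instead of sample_json.
out <- fromJSON(sample_json)
defs <- out$definitions  # an array of objects is simplified to a data frame

# Keep only the two columns wanted in the final table.
result_table <- data.frame(
  term = defs$term,
  defn = defs$definition,
  stringsAsFactors = FALSE
)
print(result_table)
```

If the real columns have different names, only the two `defs$...` references need to change.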
How did I know about the json file?
I have done a fair bit of scraping before, so I know that this kind of data is usually already available from the backend. Inspect the page with your browser's developer tools and look at the incoming network requests to find it.
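A quick way to tell from R that you are in this situation is to count how many nodes each selector matches in the HTML that read_html() actually receives. The sketch below uses a made-up page shell (an assumption, not the real site's markup) built with rvest's minimal_html(); with the real site you would pass read_html(url) instead. Zero matches for the content selectors, alongside a match for the footer h2, suggests the definitions are loaded by a script after the page arrives.

```r
library(rvest)

# Hypothetical stand-in for a page whose content is filled in by JavaScript;
# the real site's markup may differ.
shell <- minimal_html('
  <div id="app"></div>
  <h2>Application by the University of Michigan Library</h2>
  <script src="app.js"></script>
')

# Selectors taken from the question's scraping attempt.
selectors <- c(".definitionParagraph", ".iconset", "h2")

# Number of nodes each selector matches in the static HTML.
counts <- vapply(
  selectors,
  function(s) length(html_elements(shell, s)),
  integer(1)
)
print(counts)
```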