This is a follow-up to my earlier question (How to create data frame from rvest scraped website, preserving nested structure of data) and to the answer by @stefan, which works perfectly for that case.
But what if there are extra layers of nesting?
library(rvest)
library(dplyr, warn.conflicts = FALSE)
books <- minimal_html('
<div class="entry">
<div class="collection">Collection 1</div>
<div class="book">
<div class="booktitle">Book 1</div>
<div class="year">1999</div>
<div class="author">
<div class="name">Author 1</div>
<div class="city">Austin</div>
</div>
<div class="author">
<div class="name">Author 2</div>
<div class="city">Dallas</div>
</div>
<div class="author">
<div class="name">Author 3</div>
<div class="city">Memphis</div>
</div>
</div>
<div class="book">
<div class="booktitle">Book 2</div>
<div class="year">2022</div>
<div class="author">
<div class="name">Author 4</div>
<div class="city">Houston</div>
</div>
</div>
</div>
<div class="entry">
<div class="collection">Collection 2</div>
<div class="book">
<div class="booktitle">Book 3</div>
<div class="year">1845</div>
<div class="author">
<div class="name">Author 5</div>
<div class="city">Phoenix</div>
</div>
<div class="author">
<div class="name">Author 6</div>
<div class="city">Dayton</div>
</div>
<div class="author">
<div class="name">Author 7</div>
<div class="city">Philadelphia</div>
</div>
</div>
</div>')
As before, I would like the result to be at the author level, but each author should have name and city on the same row. There is also an extra outer layer, collection. All authors in a collection should have that collection's number. So there should be seven rows, and Author 7 should have these values: Collection 2, Book 3, 1845, Author 7, and Philadelphia.
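To make the target concrete, here is the desired result written out by hand from the HTML above. This is only a sketch of the shape I am after: the column names (collection, title, year, name, city) and keeping year as character (as html_text2() would return it) are my own assumptions for illustration, not output from any scraping code.

```r
library(tibble)

# Hand-built target: one row per author, with the collection, book title,
# and year repeated for every author of that book.
# Column names and the character type of `year` are illustrative assumptions.
expected <- tribble(
  ~collection,    ~title,   ~year,  ~name,      ~city,
  "Collection 1", "Book 1", "1999", "Author 1", "Austin",
  "Collection 1", "Book 1", "1999", "Author 2", "Dallas",
  "Collection 1", "Book 1", "1999", "Author 3", "Memphis",
  "Collection 1", "Book 2", "2022", "Author 4", "Houston",
  "Collection 2", "Book 3", "1845", "Author 5", "Phoenix",
  "Collection 2", "Book 3", "1845", "Author 6", "Dayton",
  "Collection 2", "Book 3", "1845", "Author 7", "Philadelphia"
)

expected
```

Comparing a scraped result against a table like this (for example with all.equal()) would be one way to check that every author ends up with the right collection and book.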
How can I extend this code from the prior answer to get my desired result?
data0 <- books |>
  html_elements(".book") |>
  # One tibble per book; title and year are recycled across that book's authors
  lapply(\(x) {
    tibble(
      title = x |> html_element(".booktitle") |> html_text2(),
      year = x |> html_element(".year") |> html_text2(),
      authors = x |> html_elements(".author") |> html_text2()
    )
  }) |>
  bind_rows()