I am having trouble with unnest_wider (from tidyr).
I have a nested XML document that I am trying to convert into a data frame/tibble. I followed the workflow presented here, which suggests turning the XML nodeset into R lists.
My XML is OAI/Dublin Core formatted, and several of its elements share the same name (“subject.other”, for example). Simplified, my doc.xml looks like this:
<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
<record>
<header>
<identifier>id_01</identifier>
<datestamp>2024-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject n°1</dc:subject.other>
<dc:subject.other>Great subject n°2</dc:subject.other>
<dc:subject.other>Great subject n°3</dc:subject.other>
<dc:subject.other>Great subject n°4</dc:subject.other>
<dc:subject.other>Great subject n°5</dc:subject.other>
<dc:subject.other>Great subject n°6</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
</ListRecords>
What I tried
The code that I ran is the following:
library(xml2)      # read_xml()
library(tidyr)     # unnest_wider()
library(magrittr)  # %>%

# doc.xml is turned into a list
doc_list <- xmlconvert::xml_to_list(read_xml("doc.xml"))
# the list becomes a tibble
df <- tibble::enframe(doc_list)
# unnesting the column "value", where we find the listed elements contained in <header> and <metadata> in the XML
final_df <- df %>%
  unnest_wider(value, names_repair = "universal")
Expectations…
What I would like my final_df to look like in the end is something like this:
structure(list(
identifier = "id_01",
publisher = "Fake Editions",
subject.other_1 = "Great subject n°1",
subject.other_2 = "Great subject n°2",
subject.other_3 = "Great subject n°3"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
…reality
But what I get is:
structure(list(
identifier = "id_01",
publisher = "Fake Editions",
subject.other_1 = "Great subject n°1",
subject.other_2 = "Great subject n°1",
subject.other_3 = "Great subject n°1"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))
As you can see, the values of the different “subject.other” elements are lost and replaced by the value of the first one (“Great subject n°1”). I tried changing the names_repair option, but it didn’t change anything.
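To make the target shape concrete, here is a minimal sketch that produces it by querying the nodes directly with xml2 instead of going through the list conversion. This is not the workflow I followed, only an assumed alternative: record_to_row is a hypothetical helper name, the XML is an inline copy of my simplified document (three subjects kept for brevity), and dplyr::bind_rows is used only to stack the rows.

```r
library(xml2)

# Inline copy of the simplified doc.xml (three subjects kept for brevity)
doc <- read_xml('<?xml version="1.0" encoding="utf-8"?>
<ListRecords>
<record>
<header>
<identifier>id_01</identifier>
<datestamp>2024-05</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
<dc:publisher>Fake Editions</dc:publisher>
<dc:subject.other>Great subject n°1</dc:subject.other>
<dc:subject.other>Great subject n°2</dc:subject.other>
<dc:subject.other>Great subject n°3</dc:subject.other>
<dc:title>Random title</dc:title>
</oai_dc:dc>
</metadata>
</record>
</ListRecords>')

# Hypothetical helper: turn one <record> node into a one-row tibble.
record_to_row <- function(rec) {
  # Collect every repeated element, then number the names ourselves
  subjects <- xml_text(xml_find_all(rec, ".//dc:subject.other"))
  names(subjects) <- sprintf("subject.other_%d", seq_along(subjects))
  tibble::as_tibble_row(c(
    identifier = xml_text(xml_find_first(rec, ".//identifier")),
    publisher  = xml_text(xml_find_first(rec, ".//dc:publisher")),
    subjects
  ))
}

# One row per record; records with fewer subjects get NA in the extra columns
records  <- xml_find_all(doc, ".//record")
final_df <- dplyr::bind_rows(lapply(records, record_to_row))
```

With the sample above this should give a single row with the columns identifier, publisher and subject.other_1 to subject.other_3, each subject keeping its own value. I would still prefer to get there with unnest_wider if that is possible.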
Do you see any way to make this work? I have tried everything I could think of to get this XML into a data frame/tibble, and I am losing hope!
Thank you very much!
(I can provide more code/details if needed. Sorry, I am not used to asking questions on Stack Overflow.)