I have R code that runs correctly but takes a very long time, and I want to know whether there is a way to make it more efficient (a different kind of query, converting the data first, ...).
I have a data table, and for each row I need to search an .xtf document for the object whose ID is contained in the table, then extract attributes from that object. For a table with 14’000 entries and an XTF file of 16’000 KB, this takes 2 hours to run. The rest of my code runs fast.
Here is my code:
# Find XTF
xtf_file <- list.files(path = input_dir, pattern = "\\.xtf$", full.names = TRUE)
# Read XTF (as XML)
xml_data <- read_xml(xtf_file)
# Apply the function to all TIDs (the iteration argument drives a progress bar, which I cut out here)
a_attributes_df <- data.frame(`SAA_PAA_from_xtf` = unlist(
  mapply(function(Tid, iteration) a_query_xtf_for_all(Tid, iteration), data_all$Tid, seq_along(data_all$Tid))
))
# The function
a_query_xtf_for_all <- function(Tid, iteration) {
  # XPath-Request based on TID
  xpath_query <- paste0("//*[@TID='", Tid, "']")
  # Find node in XML
  node <- xml2::xml_find_first(xml_data, xpath_query)
  # Initialise
  saa_paa <- NA
  if (!is.null(node) && xml2::xml_length(node) > 0) {
    # Extract attribute
    funktion_hierarchisch_node <- xml2::xml_find_first(node, ".//d1:FunktionHierarchisch", ns)
    if (!is.null(funktion_hierarchisch_node)) {
      funktion_hierarchisch <- xml2::xml_text(funktion_hierarchisch_node)
      # Check if SAA / PAA contained
      if (grepl("SAA", funktion_hierarchisch, ignore.case = TRUE)) {
        saa_paa <- "SAA"
      } else if (grepl("PAA", funktion_hierarchisch, ignore.case = TRUE)) {
        saa_paa <- "PAA"
      }
    }
  }
  return(c(`SAA/PAA` = saa_paa))
}
The XML Document looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<TRANSFER xmlns="http://www.interlis.ch/INTERLIS2.3">
<HEADERSECTION VERSION="2.3" SENDER="UNKNOWN">
...
</HEADERSECTION>
<DATASECTION>
<VSADSSMINI_2020_LV95.VSADSSMini BID="BASKET1">
<VSADSSMINI_2020_LV95.VSADSSMini.Knoten TID="ch221714gDfKhSAa">
<Baujahr>2024</Baujahr>
<Bezeichnung>V1</Bezeichnung>
<FunktionHierarchisch>SAA</FunktionHierarchisch>
</VSADSSMINI_2020_LV95.VSADSSMini.Knoten>
...
</VSADSSMINI_2020_LV95.VSADSSMini>
</DATASECTION>
</TRANSFER>
data_all contains the column “Tid” with values like: ch19yfb23ZpJZdD4
Your help is much appreciated!
Robin Müller is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Consider creating a data frame from the XML, then merging it with data_all. This approach avoids the row-by-row iteration in R, where every row triggers a separate XPath search with // over the whole document.
nodes <- xml_find_all(xml_data, "//d1:VSADSSMINI_2020_LV95.VSADSSMini.Knoten")
# \(n) is the anonymous-function shorthand (R >= 4.1); use function(n) on older versions
xtf_df <- data.frame(
  Tid = sapply(nodes, \(n) xml_text(xml_find_first(n, "@TID"))),
  SAA_PAA = sapply(nodes, \(n) xml_text(xml_find_first(n, "d1:FunktionHierarchisch")))
)
match_Tids <- merge(data_all, xtf_df, by = "Tid")
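If you want to keep every row of data_all in its original order, with NA where a TID is not in the file, a fully vectorised variant may be worth trying. The following is only a sketch under assumptions: the inline XML is a tiny stand-in for your file, the second TID's "PAA.Beispiel" value and the "not_in_file" TID are made up for illustration, and I have not timed it on a 16’000 KB file. It uses xml_attr() and xml_find_first() on the whole nodeset at once, then match() as the lookup.

```r
library(xml2)

# Tiny stand-in for the XTF file; structure copied from the question,
# the second object's value "PAA.Beispiel" is invented for this sketch.
xtf_text <- '<TRANSFER xmlns="http://www.interlis.ch/INTERLIS2.3">
  <DATASECTION>
    <VSADSSMINI_2020_LV95.VSADSSMini BID="BASKET1">
      <VSADSSMINI_2020_LV95.VSADSSMini.Knoten TID="ch221714gDfKhSAa">
        <FunktionHierarchisch>SAA</FunktionHierarchisch>
      </VSADSSMINI_2020_LV95.VSADSSMini.Knoten>
      <VSADSSMINI_2020_LV95.VSADSSMini.Knoten TID="ch19yfb23ZpJZdD4">
        <FunktionHierarchisch>PAA.Beispiel</FunktionHierarchisch>
      </VSADSSMINI_2020_LV95.VSADSSMini.Knoten>
    </VSADSSMINI_2020_LV95.VSADSSMini>
  </DATASECTION>
</TRANSFER>'

xml_data <- read_xml(xtf_text)
ns <- xml_ns(xml_data)  # the default namespace is exposed under the prefix "d1"

# One pass over the document: every element that carries a TID attribute
nodes <- xml_find_all(xml_data, "//*[@TID]")
tid <- xml_attr(nodes, "TID")

# xml_find_first() on a nodeset returns one result per node;
# nodes without the child yield NA after xml_text()
funktion <- xml_text(xml_find_first(nodes, ".//d1:FunktionHierarchisch", ns))

# Same SAA / PAA classification as in the question, applied to the whole vector
saa_paa <- ifelse(grepl("SAA", funktion, ignore.case = TRUE), "SAA",
           ifelse(grepl("PAA", funktion, ignore.case = TRUE), "PAA", NA_character_))

# Hypothetical data_all with one TID that is not in the file
data_all <- data.frame(Tid = c("ch19yfb23ZpJZdD4", "not_in_file", "ch221714gDfKhSAa"))

# match() keeps the row order of data_all and gives NA for missing TIDs
data_all$SAA_PAA_from_xtf <- saa_paa[match(data_all$Tid, tid)]
```

Unlike merge() with its defaults, which drops unmatched rows and sorts by the key, this leaves data_all as it was and only adds a column. With merge() you would get similar behaviour for unmatched rows by passing all.x = TRUE.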