I have two files in R (f1, f2) that have names of countries (each country only appears once in each file). I want to perform a join on both files using an inner join, and then the rows that don’t match, I would use a fuzzy join to approximately match them.
Here is my approach: First I standardize all names (e.g. make all letters capital, remove spaces, commas, hyphens, apostrophes etc) – then I do an inner join to get exact matches. Then, on the names that don’t match, I use a fuzzy match metric and join if the threshold is less than some constant:
library(dplyr)
library(fuzzyjoin)
library(stringdist)
library(stringr)
set.seed(123)
f1 <- data.frame(
country = c("United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Japan", "Australia", "Brazil"),
value1 = runif(10, 1, 100), value2 = runif(10, 1, 100), value3 = runif(10, 1, 100)
)
f2 <- data.frame(
country = c("United States of America", "United Kingdom", "French Republic", "Germany", "Italian Republic", "Kingdom of Spain", "Canada", "Japan", "Commonwealth of Australia", "Federative Republic of Brazil"),
value2 = runif(10, 1, 100)
)
standardize_name <- function(name) {
name %>%
str_to_upper() %>%
str_replace_all("[[:punct:]]", "") %>%
str_replace_all("\s+", "")
}
f1 <- f1 %>%
mutate(country_std = standardize_name(country))
f2 <- f2 %>%
mutate(country_std = standardize_name(country))
inner_joined <- inner_join(f1, f2, by = "country")
unmatched_f1 <- anti_join(f1, f2, by = "country")
unmatched_f2 <- anti_join(f2, f1, by = "country")
fuzzy_joined <- stringdist_join(unmatched_f1, unmatched_f2,
by = "country",
mode = "left",
method = "lv",
max_dist = 5,
distance_col = "distance")
final_result <- bind_rows(inner_joined, fuzzy_joined)
However, in reality, f2 is not a data frame, but a shapefile (https://www.naturalearthdata.com/downloads/110m-cultural-vectors/ : Admin 0 – Countries). I am not sure if fuzzy matches can be used when one file is a shapefile and the other is data frame.
Does anyone know if this is possible? The final result should be a sf object for plotting, have all columns from f1, and should have all rows from f2.
Cheers
heartofdarkness is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
2