I’m working on summarising a column of data about the village from which participants hail. The information arrived as character variables, which I’ve changed into factors.
There are about ~30 different villages represented. Some are just mispellings – eg. Chicago and Chicgao.
I wanted to summarise this info into factors levels A, B, C, D and Other, where A is village A and all mispellings of village A, and Other is everything else that doesn’t fit into A-D. (The number of mispellings are limited, thankfully.)
I have no idea how to do this other than manually recoding a lot of data points.
df %>%
mutate(x= recode(x, "c('A', 'misspeltA')='A'; c('E', 'F', 'G' ... 'Z') = 'Other'))
I’ve also tried
df$village <- factor(df$village,
levels =c("A","B","C","mispeltA","D","E","F"),
labels =c("A","B","C","A","D","Other","Other"))
Is there a faster way?
2
forcats::fct_collapse()
might help a bit:
library(dplyr)
set.seed(42)
df_ <- tibble(x = c("misspeltA", LETTERS[1:9]))
df_ |>
mutate(
x_collapsed = forcats::fct_collapse(x,
A = c('A', 'misspeltA'),
B = c('B'),
C = c('C'),
D = c('D'),
other_level = "Other"
)
)
#> # A tibble: 10 × 2
#> x x_collapsed
#> <chr> <fct>
#> 1 misspeltA A
#> 2 A A
#> 3 B B
#> 4 C C
#> 5 D D
#> 6 E Other
#> 7 F Other
#> 8 G Other
#> 9 H Other
#> 10 I Other
You can also build a named list of levels that can be spliced (!!!
) and passed to fct_collapse()
; for example wrangle villages vector so it would be suitable as a starting point (perhaps sort unique values alphabetically or by number of occurrences), create a named list and use dput()
output as a base for manual edit:
sort(df_$x) |>
unique() |>
purrr::set_names() |>
as.list() |>
dput()
#> list(A = "A", B = "B", C = "C", D = "D", E = "E", F = "F", G = "G",
#> H = "H", I = "I", misspeltA = "misspeltA")
lvl <-
list(
A = c("A", "misspeltA"),
B = "B", C = "C", D = "D"
)
df_ |>
mutate(x_collapsed = forcats::fct_collapse(x, !!!lvl, other_level = "Other"))
#> # A tibble: 10 × 2
#> x x_collapsed
#> <chr> <fct>
#> 1 misspeltA A
#> 2 A A
#> 3 B B
#> 4 C C
#> 5 D D
#> 6 E Other
#> 7 F Other
#> 8 G Other
#> 9 H Other
#> 10 I Other
An automatic approach combining similar information into a new grp using agrep
.
If this is your data
cities <- c("Chicgo", "Atalanta", "Chicago", "Chicgao", "Atlanta", "Atlnata",
"Boston", "Botson")
Regrouping with agrep
sapply(cities, (x)
paste(agrep(x, cities, max.distance=3), collapse=""))
Chicgo Atalanta Chicago Chicgao Atlanta Atlnata Boston Botson
"134" "256" "134" "134" "256" "256" "78" "78"
as a data.frame
data.frame(cities, grp =
sapply(cities, (x)
paste(agrep(x, cities, max.distance=3), collapse="")), row.names=NULL)
cities grp
1 Chicgo 134
2 Atalanta 256
3 Chicago 134
4 Chicgao 134
5 Atlanta 256
6 Atlnata 256
7 Boston 78
8 Botson 78
Note that you can play around with max.distance
, taking higher values if more dissimilarities should be combined (with the trade off that some false combinations may occur).