I have a dataset with German special characters (ä, ö, ü, ß) and other special characters (brackets (), /, etc.):
df <- data.frame(
id = 1:8,
date = seq.Date(as.Date("2024-12-01"), as.Date("2024-12-08"), "day"),
group = rep(LETTERS[1:2], 4),
text1 = c("Österreichische Botschaft", "Außenministerium", "Bundesheer", "Präsident",
"Bundesregierung / Bundeskanzler", "Parteien", "Justiz",
"Erwerbstätig (unselbständig)")
)
I have another dataset containing the original strings and the replacement strings:
df_replace <- data.frame(
original = c("Österreichische Botschaft", "Außenministerium", "Präsident",
"Bundesregierung / Bundeskanzler", "Parteien", "Justiz",
"Erwerbstätig (unselbständig)"),
replacement = c("Austrian embassy", "Foreign ministry", "President",
"Government / Chancellor", "Parties", "Judiciary", "Working (employed)")
)
I now want to replace all the strings in df using the df_replace dataset.
Using stringi::stri_replace_all_fixed works fine as long as the number of rows in the df and the df_replace datasets is the same:
df |> mutate_all(\(x) stringi::stri_replace_all_fixed(x, df_replace$original,
df_replace$replacement))
But this is not the case for my df.
Hence, I receive the following warning:
Caused by warning in "stringi::stri_replace_all_fixed()": ! longer object length is not a multiple of shorter object length
In addition, this solution only works if the terms are in the same order. For example, if I move the first pair (“Österreichische Botschaft”, “Austrian embassy”) to the last position, no replacements are made anymore.
Using the following code did not work either:
df |> mutate_all(\(x) stringr::str_replace_all(x, df_replace$original,
df_replace$replacement))
With stringr::str_replace_all I run into problems because of the special characters (especially the brackets).
Is there another way to solve this problem?
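To illustrate the bracket problem: stringr treats patterns as regular expressions by default, so "(" and ")" are read as a group rather than as literal characters. The following is only a minimal sketch of that effect, using base R's sub() as a stand-in for the stringr call; the variable names x, regex_result and fixed_result are made up for the illustration.

```r
x <- "Erwerbstätig (unselbständig)"

# Regex interpretation: "(" and ")" form a group, so the pattern effectively
# looks for "Erwerbstätig unselbständig" (no brackets) and finds no match.
regex_result <- sub("Erwerbstätig (unselbständig)", "Working (employed)", x)

# Literal interpretation: fixed = TRUE matches the brackets as plain characters.
fixed_result <- sub("Erwerbstätig (unselbständig)", "Working (employed)", x,
                    fixed = TRUE)
```

Here regex_result is still the unchanged German string, while fixed_result holds the replacement.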
An approach using match
transform(df, repl = df_replace$replacement[match(text1, df_replace$original)])
output
  id       date group                           text1                    repl
1  1 2024-12-01     A       Österreichische Botschaft        Austrian embassy
2  2 2024-12-02     B                Außenministerium        Foreign ministry
3  3 2024-12-03     A                      Bundesheer                    <NA>
4  4 2024-12-04     B                       Präsident               President
5  5 2024-12-05     A Bundesregierung / Bundeskanzler Government / Chancellor
6  6 2024-12-06     B                        Parteien                 Parties
7  7 2024-12-07     A                          Justiz               Judiciary
8  8 2024-12-08     B    Erwerbstätig (unselbständig)      Working (employed)
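Note that "Bundesheer" has no entry in df_replace, so match returns NA for that row. If the goal is to overwrite text1 and keep unmatched strings as they are, a possible variation (a sketch under that assumption, with the data shortened to the relevant columns and idx as an illustrative helper name) is:

```r
# Data from the question, reduced to the columns needed here
df <- data.frame(
  id = 1:8,
  text1 = c("Österreichische Botschaft", "Außenministerium", "Bundesheer",
            "Präsident", "Bundesregierung / Bundeskanzler", "Parteien",
            "Justiz", "Erwerbstätig (unselbständig)")
)
df_replace <- data.frame(
  original = c("Österreichische Botschaft", "Außenministerium", "Präsident",
               "Bundesregierung / Bundeskanzler", "Parteien", "Justiz",
               "Erwerbstätig (unselbständig)"),
  replacement = c("Austrian embassy", "Foreign ministry", "President",
                  "Government / Chancellor", "Parties", "Judiciary",
                  "Working (employed)")
)

# Position of each text1 value in the lookup table; NA where there is no entry
idx <- match(df$text1, df_replace$original)

# Use the replacement where a match exists, otherwise keep the original string
df$text1 <- ifelse(is.na(idx), df$text1, df_replace$replacement[idx])
```

Because match compares whole strings exactly, neither the brackets nor the order or number of rows in df_replace matters.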
You may simply do a left join:
dplyr::left_join(df, df_replace, by=c("text1" = "original"))
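The join adds a replacement column that is NA for strings without an entry in df_replace. If text1 itself should be overwritten, one way to finish the job might look like the sketch below. It assumes dplyr is installed, uses a shortened three-row version of the data, and the name result is only illustrative.

```r
library(dplyr)

# Shortened example data (assumption: same structure as in the question)
df <- data.frame(
  id = 1:3,
  text1 = c("Außenministerium", "Bundesheer", "Erwerbstätig (unselbständig)")
)
df_replace <- data.frame(
  original = c("Erwerbstätig (unselbständig)", "Außenministerium"),
  replacement = c("Working (employed)", "Foreign ministry")
)

result <- df |>
  left_join(df_replace, by = c("text1" = "original")) |>
  # Take the replacement where the join found one, else keep the original
  mutate(text1 = coalesce(replacement, text1)) |>
  select(-replacement)
```

The lookup rows are deliberately in a different order than in df, since the join matches by value rather than by position.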