I need to breakdown the string values contained in a column. Some cells do not need any split at all. Others might required one, two, or more.
I would also like the split values to be stored in concurrent columns.
For instance, if my initial dataframe looks like this
df2 <- data.frame(district= 1000:1003,
party= c("PartyGreen", "Coalition1(PartyRed-PartyBlue)", "PartyRed", "Coal.(PartyBlue-PartyOrange-VelvetParty)"))
I would like to look like this:
df.neat.2 <- data.frame(district= 1000:1003,
party= c("PartyGreen", "Coalition1(PartyRed-PartyBlue)", "PartyRed", "Coal.(PartyBlue-PartyOrange-VelvetParty)"),
party1= c("PartyGreen", "PartyRed", "PartyRed", "PartyBlue"),
party2= c(NA, "PartyBlue", NA, "PartyOrange"),
party3= c(NA, NA, NA, "VelvetParty"))
Note how some cells contain a single value that does not need to be split. Also note that the string split takes place inside a parenthesis ()
and is split by a dash.
library(tidyr)
library(dplyr)
library(stringr)
df2 |>
mutate(parties = str_remove_all(party, ".*\(|\).*"),
parties = str_split(parties, fixed("-"))) |>
unnest_wider(parties, names_sep = "_")
# district party parties_1 parties_2 parties_3
# <int> <chr> <chr> <chr> <chr>
# 1 1000 PartyGreen PartyGreen NA NA
# 2 1001 Coalition1(PartyRed-PartyBlue) PartyRed PartyBlue NA
# 3 1002 PartyRed PartyRed NA NA
# 4 1003 Coal.(PartyBlue-PartyOrange-VelvetParty) PartyBlue PartyOrange VelvetParty
If you wanted the column names to be party_1
, party_2
, etc. instead of parties_1
, parties_2
, etc. you could pipe this output to:
rename_with(~ str_replace(., fixed("parties"), "party"), starts_with("parties"))
Using stringr
, dplyr
, and tidyr
:
df2 |>
separate_wider_delim(party, delim = fixed('-'), names_sep = '', too_few = 'align_start') |>
mutate(across(
starts_with('party'),
(s) str_remove(s, "^.*?\(|\)"))
)
0
Using just base R, we may gsub
unnecessary stuff away, strsplit
at "-"
and cbind
to the data.
> df2 |> cbind(
+ (s <- strsplit(gsub('.*\(|\).*', '', x), '-')) |>
+ lapply(`length<-`, max(lengths(s))) |>
+ do.call(what='rbind.data.frame') |>
+ setNames(paste0('party', seq_len(max(lengths(s)))))
+ )
district party party1 party2 party3
1 1000 PartyGreen PartyGreen <NA> <NA>
2 1001 Coalition1(PartyRed-PartyBlue) PartyRed PartyBlue <NA>
3 1002 PartyRed PartyRed <NA> <NA>
4 1003 Coal.(PartyBlue-PartyOrange-VelvetParty) PartyBlue PartyOrange VelvetParty