I’m cleaning data from a qualtrics survey that includes a few questions where the answer choices were randomized (e.g. you rate five tv shows, and the show order is randomized for each respondent). The answers show up as separate columns for each: e.g. tv_show_1, tv_show_2 with the rows being individual respondents’ ratings (excellent, good, bad etc.), and then there is a column that tells me the display order for each individual survey respondent, with tv shows separated using vertical bars like this|.
I’ve never worked with anything like this and am a bit stumped. I’d love some help figuring out how to write R code or a function that can take the display order column and use it to categorize the answers given in the prior five columns (or whatever else you can think of that would be better than that). Thank you!
I haven’t tried anything yet! Just glancing at the data
Here is an example of what the data look like (this is made up for the example!):
respondents <- c("Respondent 1", "Respondent 2", "Respondent 3")
tv_show_1 <- c("excellent", "good", "bad")
tv_show_2 <- c("good", "bad", "neutral")
tv_show_3 <- c("neutral", "good", "neutral")
tv_show_DO <- c("Friends|Seinfeld|Full House", "Seinfeld|Friends|Full House", "Seinfeld|Full House|Friends")
df <- data.frame(respondents, tv_show_1, tv_show_2, tv_show_3, tv_show_DO)
print(df)
user25563153 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
2
Using by
, you could split the respondents, strsplit
the show variable at the "|"
and along the derived alphabetical order
sort the unlist
ed ratings. Give the sort
ed names of the TV shows.
> res <- by(df, df$respondents, (x) {
+ shw <- el(strsplit(x$tv_show_DO, '\|'))
+ tv <- unlist(x[grep('^tv_show_[1-3]', names(x))])[order(shw)] |>
+ setNames(sort(shw)) |> t() |> as.data.frame()
+ cbind(respondents=x[, 1], tv)
+ }) |> do.call(what='rbind')
> res
respondents Friends Full House Seinfeld
Respondent 1 Respondent 1 excellent neutral good
Respondent 2 Respondent 2 bad good good
Respondent 3 Respondent 3 neutral neutral bad
Data:
> dput(df)
structure(list(respondents = c("Respondent 1", "Respondent 2",
"Respondent 3"), tv_show_1 = c("excellent", "good", "bad"), tv_show_2 = c("good",
"bad", "neutral"), tv_show_3 = c("neutral", "good", "neutral"
), tv_show_DO = c("Friends|Seinfeld|Full House", "Seinfeld|Friends|Full House",
"Seinfeld|Full House|Friends")), class = "data.frame", row.names = c(NA,
-3L))
You have a combination of data encoded by column name (e.g. tv_show_1
) and data encoded by position (e.g. a|b|c
). There are many ways to approach this, but for further analysis I’d suggest putting the data into tidy format where the position encoding is made explicit in it’s own column. We can do this separately for the two types of data and then combine them.
First, I collect the rating columns and pivot them longer.
library(tidyverse)
df_ratings <- df |>
select(-tv_show_DO) |>
pivot_longer(-respondents)
We can do something similar for the delimited data, assigning each row a name that matches the pattern for the other data.
df_shows <- df |>
select(respondents, tv_show_DO) |>
separate_longer_delim(tv_show_DO, delim = "|") |>
mutate(name = paste0("tv_show_", row_number()), .by = respondents)
And finally combine:
df_shows |>
left_join(df_ratings)
This gives us a nice table combining all the data:
respondents tv_show_DO name value
1 Respondent 1 Friends tv_show_1 excellent
2 Respondent 1 Seinfeld tv_show_2 good
3 Respondent 1 Full House tv_show_3 neutral
4 Respondent 2 Seinfeld tv_show_1 good
5 Respondent 2 Friends tv_show_2 bad
6 Respondent 2 Full House tv_show_3 good
7 Respondent 3 Seinfeld tv_show_1 bad
8 Respondent 3 Full House tv_show_2 neutral
9 Respondent 3 Friends tv_show_3 neutral