I am a researcher and R novice working with a large dataset consisting of many small excel files to be read together. I have imported these into R using the read_excel functions and have them all in a large table, but am running into issues trying to format the data appropriately for analysis.
As a small description of the dataset, I have different subject IDs who were tested on two different days and who were exposed to different terms in different conditions. Basically, I have a variable “Term”, a variable “Condition”, and a variable “Origin”. The origin variable is a result of using rbindlist(idcol = “Origin”, fill=T), so each is a (long) filepath, which I have redacted below except for the identifying information: the first part “SUBXX” represents my subject number and the second part “SOX” represents the day (either 1 or 2) they were tested on. See a small example dataset below:
df <- data.frame(Origin = c("C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx"))
df$Term <- c("Owl", "Dog", "Rat", "Fox", "Cat", "Cow", "Dog", "Bug", "Cow",
"Mouse", "Bat", "Cat")
df$Condition <- c("M", "L", "L", "L", "M", "M", "L", "L", "M", "M", "M", "M")
> df
Origin(shortened) Term Condition
1 SUB01_S01 Owl M
2 SUB01_S01 Dog L
3 SUB01_S01 Rat L
4 SUB01_S02 Fox L
5 SUB01_S02 Cat M
6 SUB01_SO2 Cow M
7 SUB02_S01 Dog L
8 SUB02_S01 Bug L
9 SUB02_S01 Cow M
10 SUB02_S02 Mouse M
11 SUB02_S02 Bat M
12 SUB02_S02 Cat M
Lastly, I have a list of all possible terms:
termList <- c(“Cat”, “Dog”, “Cow”, “Rat”, “Fox”, “Bug”, “Owl”, “Bat”, “Mouse”, “Bear”)
What I want to do is to 1) order the dataframe so that the terms appear in the same order as the termList, and 2) add the terms that do not appear for each participant, noting their Day as 0 and their condition as “U”. Additionally, I want to replace the “origin” column with two separate columns, one containing participant ID and the other containing day.
Desired result:
ParNum Term Condition Day
1 1 Cat M 2
2 1 Dog L 1
3 1 Cow M 2
4 1 Rat L 1
5 1 Fox L 2
6 1 Bug U 0
7 1 Owl M 1
8 1 Bat U 0
9 1 Mouse U 0
10 1 Bear U 0
11 1 Cat M 2
12 1 Dog L 1
13 1 Cow M 1
14 1 Rat U 0
15 1 Fox U 0
16 1 Bug L 1
17 1 Owl U 0
18 1 Bat M 2
19 1 Mouse M 2
20 1 Bear U 0
I am not a CompSci person so I usually build my way up from little problems to the larger ones. Starting small, I tried to use R’s inbuilt apply() functions, as well as setdiff, to find the concepts which don’t appear for each SubID. The following code:
df%>%
group_by(Origin) %>%
tapply(setdiff(termList, df$Term))
only returned a single 1, which is confusing. Shouldn’t setdiff() return a character variable (i.e. whatever term is missing?) Trying the other options lapply() and sapply() both returned the message “object ‘Bear’ of mode ‘function’ was not found”.
I also attempted a for loop, again by starting small and just trying to find the missing terms for each SubId. The following:
mismatch <- character()
for (i in df$Origin) {
mismatch <- setdiff(termList, tbl$origin)
}
Returned
[1] "Cat" "Dog" "Cow" "Rat" "Fox" "Bug" "Owl" "Bat" "Mouse"
[10] "Bear"
But I was expecting a subset of terms for each SubID. Could anyone give any advice?
EDIT: I used the solution proposed by Edward below, namely:
#3 replace the "origin" column with two separate columns, one
# containing participant ID and the other containing day.
separate_wider_position(df, Origin,
widths=c(69, ParNum=2, 1, Day=2, 9)) |>
mutate(Term=factor(Term, levels=termList),
Day=as.numeric(Day)) |>
#2 add the terms that do not appear for each participant, noting
# their Day as 0 and their condition as "U".
complete(ParNum, Term, fill = list(Day=0, Condition="U")) |>
#1 order the dataframe so that the terms appear in the same order
# as the termList,
arrange(ParNum, Term)
Which works. However, there is one other problem I forgot to mention: in my full dataset, each concept appears twice in the spreadsheet (same condition each time). So the sorted list using the above method doubles any concept which isn’t in condition “U”, like so:
ParNum Term Day Condition
<chr> <fct> <dbl> <chr>
1 1 Cat 0 U
2 1 Dog 1 L
3 1 Dog 1 L
4 1 Cow 0 U
5 1 Rat 1 L
6 1 Rat 1 L
7 1 Fox 2 L
8 1 Fox 2 L
9 1 Bug 0 U
10 1 Owl 1 M
11 1 Owl 1 M
12 1 Bat 0 U
13 1 Mouse 0 U
14 1 Bear 0 U
There is no reason for me to retain these doubles so I’d just like to get rid of them. Is such a thing possible?
Maranta is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
3
First, you use strsplit
to seperate id and day,
> df <- strsplit(df$Origin, 'D(?=\d$)', perl=TRUE) |>
+ do.call(what='rbind.data.frame') |>
+ setNames(c('id', 'day')) |>
+ cbind(df[-1])
> df
id day Term Condition
1 ID1 1 Owl M
2 ID1 1 Dog L
3 ID1 1 Rat L
4 ID1 2 Fox L
5 ID2 2 Cat M
6 ID2 2 Cow M
7 ID2 1 Dog L
8 ID2 1 Bug L
9 ID2 1 Cow M
10 ID2 2 Mouse M
11 ID2 2 Bat M
12 ID2 2 Cat M
Then create a data frame of all combinations using expand.grid
.
> (termdf <- expand.grid(Term=termList, day=unique(df$day), id=unique(df$id)))
Term day id
1 Cat 1 ID1
2 Dog 1 ID1
3 Cow 1 ID1
4 Rat 1 ID1
5 Fox 1 ID1
6 Bug 1 ID1
7 Owl 1 ID1
8 Bat 1 ID1
9 Mouse 1 ID1
10 Bear 1 ID1
11 Cat 2 ID1
12 Dog 2 ID1
13 Cow 2 ID1
...
and merge
it to df
. If you really need to sort it by the order of termList
, you can sort_by
on a respective list, but this sorts unusually non-alphabetic and I wouldn’t recommend it.
> merge(df, termdf, all=TRUE) |>
+ sort_by(~list(id, day, as.integer(factor(Term, termList)))) |>
+ `rownames<-`(NULL)
id day Term Condition
1 ID1 1 Cat <NA>
2 ID1 1 Dog L
3 ID1 1 Cow <NA>
4 ID1 1 Rat L
5 ID1 1 Fox <NA>
6 ID1 1 Bug <NA>
7 ID1 1 Owl M
8 ID1 1 Bat <NA>
9 ID1 1 Mouse <NA>
10 ID1 1 Bear <NA>
11 ID1 2 Cat <NA>
12 ID1 2 Dog <NA>
13 ID1 2 Cow <NA>
14 ID1 2 Rat <NA>
15 ID1 2 Fox L
16 ID1 2 Bug <NA>
17 ID1 2 Owl <NA>
18 ID1 2 Bat <NA>
19 ID1 2 Mouse <NA>
20 ID1 2 Bear <NA>
21 ID2 1 Cat <NA>
22 ID2 1 Dog L
23 ID2 1 Cow M
24 ID2 1 Rat <NA>
25 ID2 1 Fox <NA>
26 ID2 1 Bug L
27 ID2 1 Owl <NA>
28 ID2 1 Bat <NA>
29 ID2 1 Mouse <NA>
30 ID2 1 Bear <NA>
31 ID2 2 Cat M
32 ID2 2 Cat M
33 ID2 2 Dog <NA>
34 ID2 2 Cow M
35 ID2 2 Rat <NA>
36 ID2 2 Fox <NA>
37 ID2 2 Bug <NA>
38 ID2 2 Owl <NA>
39 ID2 2 Bat M
40 ID2 2 Mouse M
41 ID2 2 Bear <NA>
As you can see, this results in a balanced panel rather than the outcome you describe, the exact meaning of which I am not sure of.
Data:
> dput(df)
structure(list(Origin = c("ID1D1", "ID1D1", "ID1D1", "ID1D2",
"ID2D2", "ID2D2", "ID2D1", "ID2D1", "ID2D1", "ID2D2", "ID2D2",
"ID2D2"), Term = c("Owl", "Dog", "Rat", "Fox", "Cat", "Cow",
"Dog", "Bug", "Cow", "Mouse", "Bat", "Cat"), Condition = c("M",
"L", "L", "L", "M", "M", "L", "L", "M", "M", "M", "M")), row.names = c(NA,
-12L), class = "data.frame")
> termList <- c("Cat", "Dog", "Cow", "Rat", "Fox", "Bug", "Owl", "Bat", "Mouse", "Bear")
3
library(tidyverse)
#3 replace the "origin" column with two separate columns, one containing participant ID and the other containing day.
separate_wider_position(df, Origin,
widths=c(ParNum=3, 1, Day=1)) |>
mutate(Term=factor(Term, levels=termList),
Day=as.numeric(Day)) |>
#2 add the terms that do not appear for each participant, noting their Day as 0 and their condition as "U".
complete(ParNum, Term, fill = list(Day=0, Condition="U")) |>
#1 order the dataframe so that the terms appear in the same order as the termList,
arrange(ParNum, Term)
# A tibble: 22 × 4
ParNum Term Day Condition
<chr> <fct> <dbl> <chr>
1 ID1 Cat 0 U
2 ID1 Dog 1 L
3 ID1 Cow 0 U
4 ID1 Rat 1 L
5 ID1 Fox 2 L
6 ID1 Bug 0 U
7 ID1 Owl 1 M
8 ID1 Bat 0 U
9 ID1 Mouse 0 U
10 ID1 Bear 0 U
6