Thiết kế website giá rẻ

Question

I am a researcher and R novice working with a large dataset consisting of many small excel files to be read together. I have imported these into R using the read_excel functions and have them all in a large table, but am running into issues trying to format the data appropriately for analysis.

As a small description of the dataset, I have different subject IDs who were tested on two different days and who were exposed to different terms in different conditions. Basically, I have a variable “Term”, a variable “Condition”, and a variable “Origin”. The origin variable is a result of using rbindlist(idcol = “Origin”, fill=T), so each is a (long) filepath, which I have redacted below except for the identifying information: the first part “SUBXX” represents my subject number and the second part “SOX” represents the day (either 1 or 2) they were tested on. See a small example dataset below:

df <- data.frame(Origin = c("C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",                   "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",                   "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx",                       "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx", "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx"))
df$Term <- c("Owl", "Dog", "Rat", "Fox", "Cat", "Cow", "Dog", "Bug", "Cow",
             "Mouse", "Bat", "Cat")
df$Condition <- c("M", "L", "L", "L", "M", "M", "L", "L", "M", "M", "M", "M")
> df
   Origin(shortened) Term Condition
1  SUB01_S01          Owl     M
2  SUB01_S01          Dog     L
3  SUB01_S01          Rat     L
4  SUB01_S02          Fox     L
5  SUB01_S02          Cat     M
6  SUB01_SO2          Cow     M
7  SUB02_S01          Dog     L
8  SUB02_S01          Bug     L
9  SUB02_S01          Cow     M
10 SUB02_S02          Mouse   M
11 SUB02_S02          Bat     M
12 SUB02_S02          Cat     M

Lastly, I have a list of all possible terms:
termList <- c(“Cat”, “Dog”, “Cow”, “Rat”, “Fox”, “Bug”, “Owl”, “Bat”, “Mouse”, “Bear”)

What I want to do is to 1) order the dataframe so that the terms appear in the same order as the termList, and 2) add the terms that do not appear for each participant, noting their Day as 0 and their condition as “U”. Additionally, I want to replace the “origin” column with two separate columns, one containing participant ID and the other containing day.

Desired result:

 ParNum Term Condition   Day
1   1   Cat         M      2
2   1   Dog         L      1
3   1   Cow         M      2
4   1   Rat         L      1
5   1   Fox         L      2
6   1   Bug         U      0
7   1   Owl         M      1
8   1   Bat         U      0 
9   1   Mouse       U      0
10  1   Bear        U      0
11  1   Cat         M      2
12  1   Dog         L      1
13  1   Cow         M      1
14  1   Rat         U      0
15  1   Fox         U      0
16  1   Bug         L      1
17  1   Owl         U      0
18  1   Bat         M      2
19  1   Mouse       M      2
20  1   Bear        U      0

I am not a CompSci person so I usually build my way up from little problems to the larger ones. Starting small, I tried to use R’s inbuilt apply() functions, as well as setdiff, to find the concepts which don’t appear for each SubID. The following code:

df%>%
  group_by(Origin) %>%
  tapply(setdiff(termList, df$Term))

only returned a single 1, which is confusing. Shouldn’t setdiff() return a character variable (i.e. whatever term is missing?) Trying the other options lapply() and sapply() both returned the message “object ‘Bear’ of mode ‘function’ was not found”.

I also attempted a for loop, again by starting small and just trying to find the missing terms for each SubId. The following:

mismatch <- character()
for (i in df$Origin) {
  mismatch <- setdiff(termList, tbl$origin)
}

Returned

[1] "Cat"   "Dog"   "Cow"   "Rat"   "Fox"   "Bug"   "Owl"   "Bat"   "Mouse"
[10] "Bear"

But I was expecting a subset of terms for each SubID. Could anyone give any advice?

EDIT: I used the solution proposed by Edward below, namely:

#3 replace the "origin" column with two separate columns, one 
# containing participant ID and the other containing day.
separate_wider_position(df, Origin, 
                        widths=c(69, ParNum=2, 1, Day=2, 9)) |>
   mutate(Term=factor(Term, levels=termList),
          Day=as.numeric(Day)) |>

#2 add the terms that do not appear for each participant, noting 
# their Day as 0 and their condition as "U". 
  complete(ParNum, Term, fill = list(Day=0, Condition="U")) |>

#1 order the dataframe so that the terms appear in the same order 
# as the termList, 
  arrange(ParNum, Term)

Which works. However, there is one other problem I forgot to mention: in my full dataset, each concept appears twice in the spreadsheet (same condition each time). So the sorted list using the above method doubles any concept which isn’t in condition “U”, like so:

   ParNum Term    Day Condition
   <chr>  <fct> <dbl> <chr>    
 1 1    Cat       0 U        
 2 1    Dog       1 L
 3 1    Dog       1 L       
 4 1    Cow       0 U        
 5 1    Rat       1 L
 6 1    Rat       1 L        
 7 1    Fox       2 L  
 8 1    Fox       2 L      
 9 1    Bug       0 U        
10 1    Owl       1 M 
11 1    Owl       1 M       
12 1    Bat       0 U        
13 1    Mouse     0 U        
14 1    Bear      0 U

There is no reason for me to retain these doubles so I’d just like to get rid of them. Is such a thing possible?

Thiết kế website giá rẻ

Danh mục

Sorting columns and adding new variables using setdiff in R