I have a dataset like this:
df <- structure(list(study_id = structure(c("P005", "P005", "P005",
"P008", "P008", "P008", "P021", "P021", "P021", "P028", "P028",
"P028", "P032", "P032", "P032", "P036", "P036", "P036", "P037",
"P037", "P037", "P049", "P049", "P049", "P053", "P053", "P053",
"P069", "P069", "P069", "P079", "P079", "P079", "P089", "P089",
"P089", "P093", "P093", "P093", "P096", "P096", "P096", "P104",
"P104", "P104", "P105", "P105", "P105"), label = "ISMART Study ID", format.stata = "%9s"),
phase = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), levels = c("Baseline", "Midterm",
"Final"), class = "factor"), selfeff1 = structure(c(3L, 3L,
3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, NA, 3L,
3L, 3L, 3L, 3L, 3L, NA, 3L, 3L, 3L, 3L, 3L, 3L, 3L, NA, 3L,
2L), levels = c("Not confident", "Somewhat confident", "Very confident"
), class = "factor"), selfeff3 = structure(c(3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 3L,
3L, 3L, NA, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, NA, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L
), levels = c("Not confident", "Somewhat confident", "Very confident"
), class = "factor")), class = "data.frame", row.names = c(NA,
-48L))
This is a long-format dataset: each study_id has three rows, holding the Baseline, Midterm and Final values. I want to impute the missing values with the carry-forward/carry-back method, and because these are repeated measures I want to apply the following rules:
- If baseline is missing but midterm is present: carry back (i.e., replace baseline with midterm).
- If midterm is missing but final is present: carry back (i.e., replace midterm with final).
- If final is missing but midterm is present: carry forward (i.e., replace final with midterm).
- If both baseline and final are missing: carry midterm forward and back (i.e., replace both with midterm).
I tried to write a function for this because my real dataset has selfeff1 to selfeff13. The code is:
impute_values <- function(x, phase) {
# Carryback: Replace baseline with midterm if baseline is missing but midterm is available
if (phase == "Baseline" & is.na(x) & phase == "Midterm" & !is.na(x)) {
x <- na.locf(x)
}
# Carryback: Replace midterm with final if midterm is missing but final is available
# Carryforward: Replace final with midterm if final is missing but midterm is available
else if (phase == "Midterm" & is.na(x) & phase == "Final" & !is.na(x[3])) {
x <- na.locf(x)
} else if (phase == "Midterm" & !is.na(x) & phase == "Final" & is.na(x[3])) {
x <- na.locf(x, option="nocb")
}
# For the case where both baseline and final are missing but midterm is available,
# we can simply carry forward the missing values from midterm
else if (phase == "Baseline" & is.na(x) & phase == "Final" & is.na(x) &
phase == "Midterm" & !is.na(x)) {
x <- na.locf(x)
}
return(x)
}
But when I test this function on one variable, say selfeff1, with this code:
df2 <- df %>%
  mutate(selfeff1 = impute_values(selfeff1, phase))
summary(is.na(df2$selfeff1))
I get this error:
Error in if (...) NULL : the condition has length > 1
Could someone show me how to fix it so that it works for my case?
You could use a function prepl that pastes the is.na structure into a binary pattern, e.g. "001" for study_id P037 in selfeff3. You can then apply the replacement logic for each case in each selfeff* column using grepl inside by (which you can think of as a combination of split and lapply), followed by unsplit. This makes it clear at a glance what is happening, and it can be extended as required.
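> prepl <- \(x) {
+   p <- paste(+is.na(x), collapse='')  # e.g. "100" = only baseline missing
+   if (grepl('101', p)) {         # baseline and final missing: fill both from midterm
+     x[c(1, 3)] <- x[2]           # checked first, because '10.' also matches "101"
+   } else if (grepl('10.', p)) {  # baseline missing, midterm present: carry back
+     x[1] <- x[2]
+   } else if (grepl('.10', p)) {  # midterm missing, final present: carry back
+     x[2] <- x[3]
+   } else if (grepl('.01', p)) {  # final missing, midterm present: carry forward
+     x[3] <- x[2]
+   }
+   x
+ }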
> icl <- grep('^selfeff\\d+$', names(df))
> df[icl] <- lapply(df[icl], \(x) by(x, df$study_id, prepl) |> unsplit(df$study_id))
> df
study_id phase selfeff1 selfeff3
1 P005 Baseline Very confident Very confident
2 P005 Midterm Very confident Very confident
3 P005 Final Very confident Very confident
4 P008 Baseline Very confident Very confident
5 P008 Midterm Very confident Very confident
6 P008 Final Very confident Very confident
7 P021 Baseline Somewhat confident Very confident
8 P021 Midterm Very confident Very confident
9 P021 Final Very confident Very confident
10 P028 Baseline Somewhat confident Somewhat confident
11 P028 Midterm Very confident Very confident
12 P028 Final Very confident Very confident
13 P032 Baseline Very confident Somewhat confident
14 P032 Midterm Very confident Very confident
15 P032 Final Very confident Somewhat confident
16 P036 Baseline Very confident Very confident
17 P036 Midterm Very confident Very confident
18 P036 Final Very confident Very confident
19 P037 Baseline Very confident Very confident
20 P037 Midterm Very confident Very confident
21 P037 Final Very confident Very confident
22 P049 Baseline Very confident Very confident
23 P049 Midterm Somewhat confident Somewhat confident
24 P049 Final Very confident Very confident
25 P053 Baseline Very confident Somewhat confident
26 P053 Midterm Very confident Very confident
27 P053 Final Very confident Very confident
28 P069 Baseline Very confident Very confident
29 P069 Midterm Very confident Very confident
30 P069 Final Very confident Very confident
31 P079 Baseline Very confident Very confident
32 P079 Midterm Very confident Very confident
33 P079 Final Very confident Very confident
34 P089 Baseline Very confident Very confident
35 P089 Midterm Very confident Very confident
36 P089 Final Very confident Very confident
37 P093 Baseline Very confident Very confident
38 P093 Midterm Very confident Very confident
39 P093 Final Very confident Very confident
40 P096 Baseline Very confident Very confident
41 P096 Midterm Very confident Very confident
42 P096 Final Very confident Very confident
43 P104 Baseline Very confident Very confident
44 P104 Midterm Very confident Very confident
45 P104 Final Very confident Very confident
46 P105 Baseline Very confident Very confident
47 P105 Midterm Very confident Very confident
48 P105 Final Somewhat confident Somewhat confident
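On the error itself: if() needs a single TRUE or FALSE, but inside mutate() the function in the question receives whole columns, so a test such as phase == "Baseline" & is.na(x) yields one logical per row. In addition, a single row can never be both "Baseline" and "Midterm", so those combined tests could not succeed even row by row. Working on the three values of one study_id at a time, and reducing the missingness to one string, avoids both problems. A minimal base R illustration, where the toy vectors are made up for the example and not taken from the data:

```r
# Toy stand-ins for one participant's phase labels and answers (hypothetical values).
phase <- factor(c("Baseline", "Midterm", "Final"),
                levels = c("Baseline", "Midterm", "Final"))
x <- c(NA, "Very confident", "Very confident")

# An element-wise comparison returns one logical per row, not a single value;
# this is what if() rejects.
cond <- phase == "Baseline" & is.na(x)
length(cond)
#> [1] 3

# No row is both Baseline and Midterm, so this combined test is never TRUE.
any(phase == "Baseline" & phase == "Midterm")
#> [1] FALSE

# The missingness pattern is a single string, so it is safe to use in if().
pattern <- paste(+is.na(x), collapse = "")
pattern
#> [1] "100"
```

Here "100" means that only the first value (baseline) is missing.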
There may be specific reasons why you want to use a loop with your actual data; however, for your example an approach based on vec_fill_missing() may be more practical and straightforward. Note that the call below fills only selfeff1, so selfeff3 keeps its NA values in the output:
library(dplyr)
library(vctrs)
df <- structure(list(study_id = structure(c("P005", "P005", "P005",
"P008", "P008", "P008", "P021", "P021", "P021", "P028", "P028",
"P028", "P032", "P032", "P032", "P036", "P036", "P036", "P037",
"P037", "P037", "P049", "P049", "P049", "P053", "P053", "P053",
"P069", "P069", "P069", "P079", "P079", "P079", "P089", "P089",
"P089", "P093", "P093", "P093", "P096", "P096", "P096", "P104",
"P104", "P104", "P105", "P105", "P105"), label = "ISMART Study ID", format.stata = "%9s"),
phase = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), levels = c("Baseline", "Midterm",
"Final"), class = "factor"), selfeff1 = structure(c(3L, 3L,
3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, NA, 3L,
3L, 3L, 3L, 3L, 3L, NA, 3L, 3L, 3L, 3L, 3L, 3L, 3L, NA, 3L,
2L), levels = c("Not confident", "Somewhat confident", "Very confident"
), class = "factor"), selfeff3 = structure(c(3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 3L,
3L, 3L, NA, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, NA, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L
), levels = c("Not confident", "Somewhat confident", "Very confident"
), class = "factor")), class = "data.frame", row.names = c(NA,
-48L))
df2 <- df %>%
mutate(selfeff1 = vec_fill_missing(selfeff1, direction = "updown"), .by = study_id)
df2
#> study_id phase selfeff1 selfeff3
#> 1 P005 Baseline Very confident Very confident
#> 2 P005 Midterm Very confident Very confident
#> 3 P005 Final Very confident Very confident
#> 4 P008 Baseline Very confident Very confident
#> 5 P008 Midterm Very confident Very confident
#> 6 P008 Final Very confident Very confident
#> 7 P021 Baseline Somewhat confident Very confident
#> 8 P021 Midterm Very confident Very confident
#> 9 P021 Final Very confident Very confident
#> 10 P028 Baseline Somewhat confident Somewhat confident
#> 11 P028 Midterm Very confident Very confident
#> 12 P028 Final Very confident Very confident
#> 13 P032 Baseline Very confident Somewhat confident
#> 14 P032 Midterm Very confident Very confident
#> 15 P032 Final Very confident Somewhat confident
#> 16 P036 Baseline Very confident Very confident
#> 17 P036 Midterm Very confident Very confident
#> 18 P036 Final Very confident Very confident
#> 19 P037 Baseline Very confident Very confident
#> 20 P037 Midterm Very confident Very confident
#> 21 P037 Final Very confident <NA>
#> 22 P049 Baseline Very confident Very confident
#> 23 P049 Midterm Somewhat confident Somewhat confident
#> 24 P049 Final Very confident Very confident
#> 25 P053 Baseline Very confident Somewhat confident
#> 26 P053 Midterm Very confident Very confident
#> 27 P053 Final Very confident Very confident
#> 28 P069 Baseline Very confident Very confident
#> 29 P069 Midterm Very confident Very confident
#> 30 P069 Final Very confident Very confident
#> 31 P079 Baseline Very confident <NA>
#> 32 P079 Midterm Very confident Very confident
#> 33 P079 Final Very confident Very confident
#> 34 P089 Baseline Very confident Very confident
#> 35 P089 Midterm Very confident Very confident
#> 36 P089 Final Very confident Very confident
#> 37 P093 Baseline Very confident Very confident
#> 38 P093 Midterm Very confident Very confident
#> 39 P093 Final Very confident Very confident
#> 40 P096 Baseline Very confident Very confident
#> 41 P096 Midterm Very confident Very confident
#> 42 P096 Final Very confident Very confident
#> 43 P104 Baseline Very confident Very confident
#> 44 P104 Midterm Very confident Very confident
#> 45 P104 Final Very confident Very confident
#> 46 P105 Baseline Very confident Very confident
#> 47 P105 Midterm Very confident Very confident
#> 48 P105 Final Somewhat confident Somewhat confident
Created on 2024-04-24 with reprex v2.1.0
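To cover every selfeff column at once (the question mentions selfeff1 to selfeff13), the same call can probably be wrapped in across(). The sketch below is one way to do that, not a tested solution for the real data: the data frame toy and its shortened values are invented for illustration, and it assumes that the rows are already ordered Baseline, Midterm, Final within each study_id and that .by is available (dplyr 1.1.0 or later).

```r
library(dplyr)
library(vctrs)

# Small made-up data frame in the same long layout (hypothetical values).
toy <- data.frame(
  study_id = rep(c("A", "B", "C"), each = 3),
  phase    = rep(c("Baseline", "Midterm", "Final"), times = 3),
  selfeff1 = c(NA, "Very", "Very", "Some", NA, "Very", "Very", "Some", NA),
  selfeff3 = c(NA, "Some", NA, "Very", "Very", "Very", "Some", "Some", "Very")
)

# "updown" fills upward first (carry back), then downward (carry forward),
# separately within each study_id.
toy_filled <- toy %>%
  mutate(
    across(starts_with("selfeff"),
           ~ vec_fill_missing(.x, direction = "updown")),
    .by = study_id
  )

# Count the remaining missing values per column.
colSums(is.na(toy_filled[c("selfeff1", "selfeff3")]))
#> selfeff1 selfeff3 
#>        0        0
```

Because the upward pass runs first, a missing midterm takes the final value rather than the baseline value, which matches the rules in the question.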