I am working with some data and would like to create a column in which the value in each row depends on the value in the previous row. A for loop was my first thought, but the data has over 6 million rows, and the loop took over 1 hour to complete.
I am looking for an alternative to a for loop that would accomplish this. The data is formatted in such a way that I don’t believe dplyr's lag() would accomplish what I need. Say I have the following data:
df = structure(list(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), y = c(0,
1, 2, 3, 0, 1, 2, 3, 4, 5), z = c(5, NA, NA, NA, 6, NA, NA, NA,
3, 2)), class = "data.frame", row.names = c(NA, -10L))
When df$z is not NA, I would like a new column, df$aa, to simply return the value in df$z. In cases where df$z is NA, I want df$aa to take the last non-NA value of df$z.
Here is the for loop I developed. It works fine with small amounts of data, but as mentioned, it was far too slow with 6 million rows.
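for(i in 1:nrow(df)){
  if(!is.na(df$z[i])){
    # z is present: copy it into aa
    df$aa[i] = df$z[i]
  } else{
    # z is NA: reuse the aa value from the previous row (assumes z[1] is not NA)
    df$aa[i] = df$aa[i-1]
  }
}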
This loop produces the desired output. Your input is greatly appreciated!
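For comparison, the rule described above is a "last observation carried forward" fill, which can be written without a loop. The following is only a minimal base R sketch, not a benchmarked solution: fill_forward is a hypothetical helper name introduced here for illustration, and it assumes that rows before the first non-NA value should stay NA.

```r
# Same example data as in the question.
df <- data.frame(
  x = 1:10,
  y = c(0, 1, 2, 3, 0, 1, 2, 3, 4, 5),
  z = c(5, NA, NA, NA, 6, NA, NA, NA, 3, 2)
)

# fill_forward is a hypothetical helper, not a function from any package.
fill_forward <- function(v) {
  filled <- !is.na(v)
  # cumsum(filled) counts the non-NA values seen so far, which is the
  # position of the most recent non-NA value within v[filled].
  idx <- cumsum(filled)
  # Rows before the first non-NA value have a count of 0; leave them NA.
  idx[idx == 0] <- NA
  v[filled][idx]
}

df$aa <- fill_forward(df$z)
# df$aa is now 5 5 5 5 6 6 6 6 3 2
```

Because this works on whole vectors rather than modifying the data frame one row at a time, it should scale far better than the loop, though that has not been measured on the 6-million-row data. Packaged versions of the same operation also exist, for example tidyr's fill(), zoo's na.locf(), and data.table's nafill() with type = "locf".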