I have a dataframe in R where I need to shift the values up in a particular column using a numeric value. The numeric value which is used as input to shift the values up are associated with a grouping variable and are often different from group to group. Here is an example of a mock dataframe
mock data frame
df<-data.frame(ID=c("a","a","a","a","a","b","b","b","b","b","c","c","c","c","c"),
var_description = c("description 1", "description 2","description 3","description 4","description 5",
"description 1", "description 2","description 3","description 4","description 5",
"description 1", "description 2","description 3","description 4","description 5"),
var_value = c(NA,NA,NA,"desc 1 val","desc 2 val",
NA,"desc 1 val","desc 2 val","desc 3 val","desc 4 val",
NA,NA,NA,NA,"desc 1 val"),
shift_val = c(3,3,3,3,3,
1,1,1,1,1,
4,4,4,4,4))
Here is the current working solution:-
#getting unique vals
unique_id = unique(df$ID)
#initialising dataframe to collect results
df_results<-data.frame()
#function to shift values up
shift <- function(x,n){
c(x[-(seq(n))],rep(NA,n))
}
#for loop
for (i in unique_id) {
#filtering by each ID
one_id<-df%>%filter(ID==i)
#getting value to shift values by
Move_by_val = unique(one_id$shift_val)
#shifting
one_id$new_var_value<-shift(one_id$var_value, Move_by_val)
#binding one_id onto df_results
df_results<-rbind(df_results,one_id)
}
head(df_results,10)
# ID var_description var_value shift_val new_var_value
# 1 a description 1 <NA> 3 desc 1 val
# 2 a description 2 <NA> 3 desc 2 val
# 3 a description 3 <NA> 3 <NA>
# 4 a description 4 desc 1 val 3 <NA>
# 5 a description 5 desc 2 val 3 <NA>
# 6 b description 1 <NA> 1 desc 1 val
# 7 b description 2 desc 1 val 1 desc 2 val
# 8 b description 3 desc 2 val 1 desc 3 val
# 9 b description 4 desc 3 val 1 desc 4 val
# 10 b description 5 desc 4 val 1 <NA>
Using the shift_val
which is associated with each unique df$ID
, I am able to shift the var_value
up, where the value matches the var_description
.
The current solution I am using works but when applied to a much larger dataframe (such as one that is ~ >100k rows), it becomes slow and inefficient.
Can someone recommend an alternative solution which gives me the same output as the one I have but has potential to be much more efficient? perhaps a data.table
or purrr
solution might be best, but I am not well versed in either. Any help is appreciated!