I have a dataframe that indicates, for each ID, its neighbouring IDs. A second dataframe contains a count value for each unique ID.
The goal is to create a variable that, for each ID, holds the sum of the neighbouring IDs' counts: if neighbour V1 has a count of 10, V2 has 40 and V3 has 120, the desired output for that ID is 170.
The script below does this, but it is very slow. Since my actual dataset is huge (both long and wide), I would like to speed up this transformation. Is there a better way?
row_sum <- function(row) {
  # look up the count belonging to each neighbour ID, then sum them
  row_values <- unlist(lapply(row, function(ID) df_counts$count[df_counts$ID == ID]))
  sum(row_values, na.rm = TRUE)
}
set.seed(42) ## for sake of reproducibility
#mockup data to reproduce problem
df1 <- data.frame(ID=sort(sample(1:100)), V1_id=sample(1:100),
                  V2_id=sample(1:100), V3_id=sample(1:100))
df_counts <- data.frame(ID=sort(sample(1:100)),
                        count=runif(100, min=0, max=999))
# drop the ID column so only the neighbour columns are summed
sum_of_counts <- as.data.frame(apply(df1[-1], 1, row_sum))
Desired output:
ID | sum_of_counts |
---|---|
1 | 920 |
2 | 940 |
If I understand you correctly, you want to add up the values of df_counts corresponding to the `V*_id`s in df1 and attach them as a new column.
> transform(df1, foo=rowSums(sapply(df1[-1], \(i) df_counts[i, 'count'])))
ID V1_id V2_id V3_id foo
1 1 1 28 3 2832.9074
2 2 13 79 33 1452.9103
3 3 78 93 93 374.8441
4 4 5 57 84 1851.1584
5 5 73 84 9 2133.3638
6 6 55 89 54 1358.5333
7 7 16 80 10 1156.1185
8 8 90 88 71 1168.0793
9 9 43 61 92 959.9667
10 10 42 43 82 1076.2204
11 11 91 22 25 1292.1155
12 12 57 26 74 1705.8851
13 13 29 54 73 2165.5263
14 14 25 98 19 852.5623
15 15 63 96 79 750.5993
[...]
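Note that `df_counts[i, 'count']` indexes by row position, which only works because in the mockup data the IDs happen to be 1:100 in order. If your real IDs are not consecutive row numbers, a sketch using `match()` to map each neighbour ID to its row in df_counts should be more robust (the column name `sum_of_counts` here is just illustrative):

```r
set.seed(42)  ## same mockup data as in the question
df1 <- data.frame(ID=sort(sample(1:100)), V1_id=sample(1:100),
                  V2_id=sample(1:100), V3_id=sample(1:100))
df_counts <- data.frame(ID=sort(sample(1:100)),
                        count=runif(100, min=0, max=999))

## For each neighbour column, match() returns the row in df_counts whose
## ID equals the neighbour ID; indexing count by that gives the counts.
neighbour_counts <- sapply(df1[-1], function(col)
  df_counts$count[match(col, df_counts$ID)])

## rowSums() over the resulting matrix is vectorized, so this avoids the
## per-row lookups that make the original row_sum() approach slow.
df1$sum_of_counts <- rowSums(neighbour_counts, na.rm=TRUE)
```

Unmatched IDs become NA via `match()` and are then dropped by `na.rm=TRUE`, mirroring the behaviour of the original `row_sum()`.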