Im trying to merge two datasets that I have. They are very large files (3.8 million lines) and I cannot just simply merge these files….
the two data sets look like this
Dataset 1:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 |
---|---|---|---|---|---|---|---|---|---|---|---|
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD |
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD |
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD |
ds1 <- data.table(X1= c(chr, chr, chr), X2= c(source, source, source), X3= c(type, type, type), X4= c(min, min, min), X5= c(max, max, max), X6= c(length, length, length), X7= c(strand, strand, strand), X8= c(value, value, value), X9= c(DataA, DataA, DataA), X10= c(DataB, DataB, DataB), X11= c(DataC, DataC, DataC), X12= c(DataD, DataD, DataD))
Dataset 2:
X9 | X10 | X11 | X12 | X13 |
---|---|---|---|---|
DataA | DataB | DataC | DataD | 12345 |
DataA | DataB | DataC | DataD | 12345 |
DataA | DataB | DataC | DataD | 12345 |
ds2 <- data.table(X9= c(DataA, DataA, DataA), X10= c(DataB, DataB, DataB), X11= c(DataC, DataC, DataC), X12= c(DataD, DataD, DataD), X13= c(12345, 12345, 12345))
With the lines of code I’ve tried (merge, inner and outer joins, cbind, reduce, etc.) even after I have changed every parameter setting and specified which columns to merge by to get my desired outcome I keep getting outputs that look like this or slight variations on this (where the NAs fill different columns):
Output:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD | NA |
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD | NA |
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD | NA |
However I am needing to keep the numerical values in column X13 from dataset 2. I am wanting something that looks like this:
Desired output:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD | 12345 |
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD | 12345 |
chr | source | type | min | max | length | strand | value | DataA | DataB | DataC | DataD | 12345 |
I am new to R and am very confused and frustrated… I’ve been trying to figure this out for over a week on my own and Im still so lost. The closest I’ve been able to get was with left join that gave the exact output as above with column X13 filled with NAs. Full join was also kind of close however it doubled my table size from 3.8 million lines to 7.6 million lines with the first 3.8 million lines looking like the output above with column X13 filled with NAs and the last 3.8 million lines having columns X1-X8 filled with NAs.
left join code to get output shown above:
merged_data <- left_join(df1, df2, by = c('X9', 'X10', 'X11', 'X12'))
full join code to get output mentioned:
merged_data <- full_join(df1, df2, by = c('X9', 'X10', 'X11', 'X12'))
If anyone has any suggestions that would be great!
Thank you!!!
user25899828 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.