I have over 250 large .txt files (each approximately 1 GB) in a folder. I would like to remove duplicated rows based on two columns, id1 and id2, while being mindful of my MacBook's memory limitations.
Examples of the .txt files are as follows (each file has many more columns):
user_1<-structure(list(id1 = c(4860291, 4860291, 4860291, 1030170,
1438568, 1592420, 1702541, 1852977, 2143816, 2677860, 2677860,
2677860, 2677860, 2912792, 2939878, 3043611, 4357890, 4769884,
3225376, 3225376, 2864095), country = c("Peru", "Peru", "Peru",
"United States", "United States", "Saint Vincent and the Grenadines",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"Belgium", "Belgium", "Netherlands"), id2 = c(393550, 393550,
393550, 393529, 393529, 393529, 393529, 393529, 393529, 393529,
393529, 393529, 393529, 393529, 393529, 393529, 393529, 393529,
393530, 393530, 393533)), row.names = 2372162:2372182, class = "data.frame")
.
.
.
user_250<-structure(list(id1 = c(4860280, 4860280, 2090372, 2970838,
3469113, 1081704, 2243163, 3840895, 4159060, 4198012, 1604125,
3159686, 3159686, 3159686, 2020953, 3444818, 2346651, 3733232,
4577578, 4779099, 1832738), country = c("United States", "United States",
"India", "United States", "United States", "Australia", "United States",
"United States", "United States", "United States", "France",
"France", "France", "France", "United States", "United States",
"Australia", "United States", "United States", "United States",
"Germany"), id2 = c(674, 674, 675, 676, 679, 680, 682, 682,
682, 682, 686, 686, 686, 686, 688, 688, 691, 694, 694, 694, 695
)), row.names = 1648773:1648793, class = "data.frame")
My understanding of how to do this for a single file is:
user_1 <- read.delim("myfilepath/user_1.txt")
user_1 <- user_1[!duplicated(user_1[c("id1", "id2")]), ]
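Since memory is my main concern, I have also been wondering whether data.table would handle a single file better. A minimal sketch of what I have in mind, assuming data.table is installed and the files are tab-delimited:

library(data.table)

# fread() reads large delimited files faster and with less memory overhead than read.delim()
user_1 <- fread("myfilepath/user_1.txt")

# keep only the first row for each (id1, id2) combination
user_1 <- unique(user_1, by = c("id1", "id2"))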
How should I apply this to all 250 files above?
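The rough plan below is what I am currently considering: read and deduplicate one file at a time so that only one ~1 GB file is ever in memory, then combine the results and deduplicate once more across files. The folder path, the .txt pattern, and the output file name are placeholders for my actual setup. Is this a reasonable approach, or is there a more memory-efficient way?

library(data.table)

# assumption: all 250 files sit in one folder and share the same column layout
files <- list.files("myfilepath", pattern = "\\.txt$", full.names = TRUE)

dedup_one <- function(f) {
  dt <- fread(f)
  unique(dt, by = c("id1", "id2"))   # first row per (id1, id2) pair within this file
}

# process files one at a time, then stack the already-deduplicated pieces
combined <- rbindlist(lapply(files, dedup_one))

# second pass in case the same (id1, id2) pair appears in more than one file
combined <- unique(combined, by = c("id1", "id2"))

fwrite(combined, "myfilepath/deduplicated.txt", sep = "\t")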