I have a very large text file with tens of thousands of rows and around a hundred columns. The columns are stored by character position: characters 1-20 are column1, 21-59 are column2, etc. Each row is around 1800 characters long. My current code works the way I want for about the first 70,000 rows. However, this particular file has some random rows that do not follow the pattern, and these lines throw the program off so that it doesn't read the rest of the lines. All the regular rows start with a 12-digit identification number, while the bad rows look like this:
00 000 000 000 000 000 000 000 000 000 02 94 00 00 00 00 00 00 N
The bad rows are not all identical and they appear at random intervals. Is there an easy way I can filter out these bad lines so that the program will read all the lines in the file? Thanks!
Currently my code looks like this:
library(tidyr)  # separate() comes from tidyr

df1 <- readLines("txtfile.txt")
df2 <- read.table(text = gsub("(.{1800})", "\\1 ", df1, perl = TRUE), header = FALSE, sep = "\r")
df3 <- separate(df2, V1, c('all my column names'), sep = c(20, 62, 85, etc), remove = TRUE, convert = TRUE)
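For what it's worth, the direction I was considering (just a rough, untested sketch, assuming the 12-digit identification number at the start of each good row is enough to tell it apart from the bad ones) is to drop the non-matching lines right after readLines():

# keep only lines that begin with a 12-digit identification number;
# the bad rows (which start with "00 000 000 ...") won't match this pattern
good_rows <- grepl("^[0-9]{12}", df1)
df1 <- df1[good_rows]

But I'm not sure whether that pattern is robust enough, or whether there is a cleaner way to do this.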