I have some relatively large binary files (~2MB or ~20000 rows) which contain numeric data. The ideal form is:
.003666666 21.63934 25.85458 16.33911 -0.1533379 -0.8353634 100.0039
0.02083334 21.64326 25.85454 16.33850 -0.1534234 -0.8358188 99.97899
0.06250000 21.64577 25.85449 16.33680 -0.1536558 -0.8370742 99.99504
...
765.7708 1050.427 26.15542 -13.62440 0.4277960 5.347188 99.98061
765.8125 1050.428 26.15540 -13.62495 0.4277650 5.346800 100.0498
-1.000000 1050.429 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
but the binary makes no distinction between columns. Instead, the file header lists the number of columns there should be and their names. This, plus some other comments, makes the header variable in size.
I am trying to write code that can read all the way to the end of this file. However, because of this header I can’t just measure the file size and divide by 4 bytes per value.
I want to know how to read all the way to the end of this file when both the header and the data may vary in size. I use readBin, which can only read so many data points at once. I can chunk the data, but I do not know how to make sure that I get all the way to the end of the file no matter its size. There is a character, f, indicating the start of the data, which I find as follows:
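library(stringr)  # provides str_detect()

# Returns one less than the number of lines read before the first line containing "f"
HeaderLength <- function(filename){
  binaryfile <- file(filename, "rb")
  on.exit(close(binaryfile))  # release the connection when the function exits
  x <- -1
  repeat{
    line <- readLines(binaryfile, n = 1, skipNul = TRUE)
    # stop at the marker line, or at end of file if no marker was found
    if(length(line) == 0 || str_detect(line, "f")){
      break}
    x <- x + 1
  }
  return(x)
}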
Is there a way to measure the number of bytes before that character, to subtract off from the file size? Or a way to tell readBin to read until the end of the file, like negative numbers in readLines(n=)? Opening the file in https://hexed.it/ suggests there’s an end-of-file character I could perhaps check for, but so far, when I just set n really big, I end up with nonsensical numbers that I know can’t be in the file, suggesting readBin is looping back to the start rather than returning this character or a “not a numeric” error.
#after skipping header...
tail(readBin(binaryfile, "double", n=700000,size = 4, signed=TRUE,endian = 'little'))
[1] 1.139154e+22 4.676133e-42 0.000000e+00 0.000000e+00 0.000000e+00 1.139154e+22
#The biggest number in the left-most column should be ~765
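One possible way to get the "file size minus header" figure is to ask the connection where it currently is, using seek(), right after the header has been skipped, and subtract that from file.size(). The sketch below is only an illustration on a synthetic file: read_remaining_floats is a hypothetical helper name, the two-line header is invented, and it assumes everything after the header is 4-byte little-endian floats. The R documentation also discourages relying on seek() on Windows, so treat this as a starting point rather than a definitive answer.

```r
# Hypothetical helper: read every remaining 4-byte float from an open connection.
read_remaining_floats <- function(con, filename, size = 4L) {
  # seek(con) with no further arguments reports the current read position in bytes
  bytes_left <- file.size(filename) - seek(con)
  readBin(con, "double", n = bytes_left %/% size, size = size, endian = "little")
}

# Synthetic demo file (an assumption, not the real format):
# two header lines followed by six floats.
demo_file <- tempfile(fileext = ".bin")
out <- file(demo_file, "wb")
writeBin(charToRaw("comment line\nf\n"), out)
writeBin(c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5), out, size = 4, endian = "little")
close(out)

con <- file(demo_file, "rb")
invisible(readLines(con, n = 2))   # skip the two header lines
values <- read_remaining_floats(con, demo_file)
close(con)
print(values)
```

With the real file, the readLines and 5-byte raw skip would come first, and the helper would then be called on the same open connection.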
Sanitized test file for your use.
#code so you can open the file:
binaryfile <- file("Example_file_with_header.bin","rb")
readLines(binaryfile,n=27,skipNul = TRUE) #skip header; normally determined dynamically
readBin(binaryfile, "raw", n=5,size=1,endian = 'little') #skip some non-numeric characters
readBin(binaryfile, "double", n=7,size = 4, signed=TRUE,endian = 'little') #read first row of 7 numbers
close(binaryfile)
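On the chunking idea: readBin treats n as a maximum and returns a shorter (eventually zero-length) vector once the data runs out, so a loop can simply keep reading until an empty chunk comes back. The following is a minimal sketch under assumptions: read_all_chunks is a hypothetical helper, the demo file is synthetic with no header, and the 7-column layout is taken from the example rows above.

```r
# Hypothetical helper: read 4-byte floats chunk by chunk until nothing is left.
read_all_chunks <- function(con, chunk_n = 7000L, size = 4L) {
  pieces <- list()
  repeat {
    chunk <- readBin(con, "double", n = chunk_n, size = size, endian = "little")
    if (length(chunk) == 0L) break  # zero-length result: end of file reached
    pieces[[length(pieces) + 1L]] <- chunk
  }
  unlist(pieces)
}

# Synthetic demo: 21 floats (three rows of seven), no header.
demo_file <- tempfile(fileext = ".bin")
writeBin(as.numeric(1:21), demo_file, size = 4, endian = "little")

con <- file(demo_file, "rb")
all_values <- read_all_chunks(con, chunk_n = 8L)  # deliberately not a multiple of 7
close(con)

# Reshape the flat vector into rows of 7 columns, filling row by row.
rows <- matrix(all_values, ncol = 7, byrow = TRUE)
print(dim(rows))
```

If the real file has trailing bytes after the last complete row, the flat vector would need trimming to a multiple of 7 before the matrix call.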