In a folder (path = "D:/DataLogs/"), I have several subfolders. Inside these subfolders, I would like to retrieve all the csv files whose names start with “QCLog” and merge them (rbind) into a single data.frame (all these csv files have the same headers and structure), while adding a new first column containing the full name of each QCLog csv file.
The difficulty is that, in some subfolders, some csv files (starting with QCLog or not) are directly accessible, while others are stored inside one or more zip files (examples within the green rectangles below).
Is this feasible?
Thanks for your help.
Something like the following might work. Untested.
# function to unzip the csv files
# first get the csv filenames, then extract them to a temp directory
# return one data.frame only
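read_csv_in_zip_file <- function(filename, tmpdir) {
  csv_files <- unzip(filename, list = TRUE)[["Name"]]
  # match on the base name: entries may sit in folders inside the zip
  i <- grep("^QCLog.*\\.csv$", basename(csv_files))
  # no matching csv in this archive: return early instead of
  # calling unzip() with an empty `files` argument
  if (length(i) == 0L) return(NULL)
  fls <- unzip(filename, files = csv_files[i], exdir = tmpdir)
  df_list <- lapply(fls, read.csv)
  # tag each data.frame with the name of the csv it came from
  res <- Map(function(x, f) {
    x$filename <- basename(f)
    x
  }, df_list, fls)
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}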
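# where to put the unzipped files: a dedicated subfolder (the name is arbitrary),
# so that the final unlink() does not remove the session's own temp directory
tmp_dir <- file.path(tempdir(), "qclog_unzipped")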
path <- "D:/DataLogs"
# csv files whose names start with QCLog, plus every zip file
pattern <- "^QCLog.*\\.csv$|\\.zip$"
fls <- list.files(path = path, pattern = pattern, full.names = TRUE, recursive = TRUE)
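df_list <- lapply(fls, function(f) {
  # grepl() returns TRUE/FALSE; grep() returns integer(0) on no match,
  # which makes if() fail
  if (grepl("\\.zip$", f)) {
    read_csv_in_zip_file(f, tmp_dir)
  } else {
    res <- read.csv(f)
    res$filename <- f
    res
  }
})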
# one data.frame only
df_all <- do.call(rbind, df_list)
# final clean up
unlink(tmp_dir, recursive = TRUE)
# rm(df_list)
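If extracting to a temporary directory is not wanted, the csv entries can probably also be read straight from the archive through an `unz()` connection. This is only a sketch, not tested on your data: `read_qclog_from_zip()` is a hypothetical helper name, and building the `filename` column from the archive path plus the entry name (placed first, as the question asks) is my assumption about what "full name" should mean.

```r
# Sketch: read the matching csv entries of one zip archive without extracting it.
# read_qclog_from_zip() is a hypothetical helper, not part of the code above.
read_qclog_from_zip <- function(zipfile) {
  entries <- unzip(zipfile, list = TRUE)[["Name"]]
  # match on the base name so entries nested in folders inside the zip count too
  keep <- grepl("^QCLog.*\\.csv$", basename(entries))
  dfs <- lapply(entries[keep], function(entry) {
    # read.csv() opens the unz() connection and closes it when done
    x <- read.csv(unz(zipfile, entry))
    # assumption: "full name" = archive path + entry name, as the first column
    cbind(filename = rep(file.path(zipfile, entry), nrow(x)), x)
  })
  # NULL when the archive holds no matching csv; rbind() ignores NULL later on
  do.call(rbind, dfs)
}
```

It could stand in for `read_csv_in_zip_file(f, tmp_dir)` in the zip branch above, in which case the temporary directory and the final `unlink()` are no longer needed. `rbind()` on data.frames matches columns by name, so the different position of `filename` in the two branches should not matter.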