I am trying to process multiple tab-delimited text files in R in parallel with library(future.apply).
With 100 text files it runs fine.
With 1000 text files it gives this error:
Error in unserialize(node$con) :
MultisessionFuture (future_lapply-8) failed to receive results from cluster RichSOCKnode #8 (PID 30716 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 7 globals exported is 87.57 KiB. The three largest globals are ‘process_file’ (48.16 KiB of class ‘function’), ‘...future.elements_ii’ (26.07 KiB of class ‘list’) and ‘%>%’ (7.47 KiB of class ‘function’)
or this error:
Error: No such device or address
Each text file is about 5000 KB, so 1000 files is roughly 5 GB of raw input.
I have 256 GB RAM and a 32-core processor.
I have tried different settings for num_parts and workers in my code below, but the errors persist.
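For example, variations along these lines (a sketch, not the exact runs; parallelly::availableCores() comes from the parallelly package that future depends on):

# Sketch of the kinds of settings I varied (exact values differed per run):
plan(multisession, workers = 8)   # fewer workers than cores
num_parts <- 32                   # fewer, larger chunks

# Sanity check on what the machine reports as usable:
parallelly::availableCores()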
library(tidyverse)
library(readr)
library(future.apply)
start.time <- Sys.time()
# Define the test function
process_file <- function(file_path) {
  # Print the basename of the file being processed
  print(paste("Processing file:", basename(file_path)))
  # Read the data
  data <- suppressMessages(read_tsv(file_path, col_names = TRUE, show_col_types = FALSE))
  # Some example processing: return one small tibble per file
  tibble(
    Name  = c("Alice", "Bob", "Charlie"),
    Age   = c(25, 30, 35),
    Score = c(85, 90, 95)
  )
}
# Specify the directory containing your files
file_directory <- "J:/test"
file_paths <- list.files(file_directory, pattern = "\\.txt$", full.names = TRUE)
# Split file paths into num_parts roughly equal-sized parts
num_parts <- 128
split_file_paths <- split(file_paths, cut(seq_along(file_paths), breaks = num_parts, labels = FALSE))
# Configure future for parallel processing
plan(multisession, workers = 16) # Adjust workers according to your CPU cores
# Apply the function to each part in parallel and combine results
results <- future_lapply(split_file_paths, function(file_part) {
  part_results <- lapply(file_part, process_file)  # apply process_file to each file in this part
  bind_rows(part_results)                          # combine tibbles within this part
})
# Combine all results into one tibble
combined_result <- bind_rows(results)
# Shut down future plan
plan(sequential)
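To get more detail on why the workers die, one thing I can try is future's verbose diagnostics (a sketch; future.debug is an option documented by the future package):

# Sketch: turn on verbose logging before re-running, to see what each
# worker was doing when the connection broke.
options(future.debug = TRUE)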
Any idea what the cause could be, and how to solve it?
I could split my data into parts of 100 files each, although I suppose this is no guarantee it will work for every part (see the sketch below).
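A minimal sketch of that fallback, reusing process_file() and file_paths from above, with a tryCatch() around each file so a failed read is reported instead of lost. Note that tryCatch() inside a worker cannot help if the worker process itself is killed, so the batching is the real safety net here:

batch_size <- 100
batches <- split(file_paths, ceiling(seq_along(file_paths) / batch_size))

combined_result <- bind_rows(lapply(seq_along(batches), function(i) {
  message("Batch ", i, " of ", length(batches))
  res <- future_lapply(batches[[i]], function(fp) {
    tryCatch(process_file(fp),
             error = function(e) {
               message("Failed on ", basename(fp), ": ", conditionMessage(e))
               NULL  # NULL elements are dropped by bind_rows()
             })
  })
  bind_rows(res)
}))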
Thanks a lot!