I am trying to read large txt files into a dataframe. Each file is 10-15 GB in size, and the IO is taking a long time. I want to read multiple files in parallel and get each one into a separate dataframe.
I tried the code below:
<code>from multiprocessing.pool import ThreadPool

def read_file(file_path):
    return spark.read.csv(file_path)

pool = ThreadPool(10)
df_list = pool.starmap(read_file, [[file1, file2, file3...]])
</code>
But it gives a pickle error.
How can I do this? Is there an alternative that fits my requirement of reading multiple files in parallel into separate dataframes?
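For context, here is a minimal sketch of what I think the fixed call might look like. I am assuming `pool.map` (not `starmap`) is the right call here, since `read_file` takes a single argument, and that `ThreadPool` sidesteps pickling because it uses threads rather than processes; the paths below are placeholders for my actual files.
<code>from multiprocessing.pool import ThreadPool
from pyspark.sql import SparkSession

# In pyspark shells/notebooks `spark` is usually predefined;
# getOrCreate() reuses the existing session if so.
spark = SparkSession.builder.getOrCreate()

def read_file(file_path):
    # One Spark read job per file -> one dataframe per file.
    return spark.read.csv(file_path)

# Placeholder paths -- substitute the real 10-15 GB txt files.
paths = ["/data/file1.txt", "/data/file2.txt", "/data/file3.txt"]

# map passes one path per call; starmap would instead try to
# unpack each element as an argument tuple.
with ThreadPool(10) as pool:
    df_list = pool.map(read_file, paths)
</code>
Is this the right approach, or is there a better way to do this within Spark itself?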