I have a metadata dataframe, `df_meta`, that holds the URIs of the text files whose content I need:
| id | s3_uri |
| --- | --- |
| 1 | s3://bucket1/prefix |
| 2 | s3://bucket2/another_prefix |
| 3 | s3://bucket3/a_different_prefix |
Note that the files can be in different buckets, so I can't do a standard `spark.read.text(path)`.
I instead opted to try the following:

```python
from pyspark.sql.functions import input_file_name

# Collect every file URI from the metadata dataframe into a Python list
s3_list = (
    df_meta.select("s3_uri")
    .rdd.flatMap(lambda x: x)
    .collect()
)

# Read each file as a single row and record which file each row came from
df_want = spark.read.text(s3_list, wholetext=True).withColumn("path", input_file_name())
```
I then join the two dataframes on the `path` column.
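For completeness, the join step looks roughly like this. This is a minimal sketch, and it assumes the `s3_uri` values in `df_meta` match the strings returned by `input_file_name()` exactly; in practice the scheme may need normalising first (e.g. `s3://` vs `s3a://`):

```python
# Sketch of the join, assuming df_meta.s3_uri and input_file_name() produce
# identical strings (scheme, URL encoding, etc. may need normalising first).
df_joined = df_meta.join(
    df_want,
    df_meta["s3_uri"] == df_want["path"],
    "inner",
)
```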
This works for a few million records, but once the list grows past ~100 million files the Spark reader fails at the "Listing leaf files and directories..." step (increasing the compute configuration, i.e. memory, CPUs, and task counts, has not helped).
I realize this is a small-files problem, but moving the files into a single folder/prefix and reading that path is not an option at present.
Any input on how to make this more efficient?