I have a list of IDs that I want to find in my Parquet files. For each ID I have an idea of which files it could be present in, i.e. I have a mapping like:
ID1 -> file1, file2
ID2 -> file2, file5
ID3 -> file3, file4, and so on…
What would be the best way to do this in Spark/Scala? I have thought of a couple of approaches:
- Grouping by file, reading each file and joining it to its IDs (I could use `Future`s to parallelize this), as sketched below.
- Reading all files into a single DataFrame and then filtering for the IDs, also sketched below.
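
Roughly what I have in mind for the two options (just a rough sketch; names like `idToFiles`, the file paths, and the `id` column are placeholders for my actual schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("id-lookup").getOrCreate()
import spark.implicits._

// Placeholder mapping of ID -> candidate files
val idToFiles: Map[String, Seq[String]] = Map(
  "ID1" -> Seq("file1.parquet", "file2.parquet"),
  "ID2" -> Seq("file2.parquet", "file5.parquet")
)

// Option 1: invert the mapping to file -> IDs, read each file and keep only
// the rows for its own IDs, then union the per-file results
// (each read could be wrapped in a Future to run them concurrently)
val fileToIds: Map[String, Seq[String]] =
  idToFiles.toSeq
    .flatMap { case (id, files) => files.map(f => f -> id) }
    .groupBy { case (file, _) => file }
    .map { case (file, pairs) => file -> pairs.map(_._2) }

val perFile = fileToIds.map { case (file, ids) =>
  spark.read.parquet(file).where(col("id").isin(ids: _*))
}
val matchedOption1 = perFile.reduce(_ union _)

// Option 2: read every candidate file into one DataFrame and semi-join on the IDs
val allFiles = idToFiles.values.flatten.toSeq.distinct
val idsDf = idToFiles.keys.toSeq.toDF("id")
val allRows = spark.read.parquet(allFiles: _*)
val matchedOption2 = allRows.join(idsDf, Seq("id"), "left_semi")
```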
Which method would be best, given that there are thousands of IDs and millions of rows in each file? Or is there a better approach?