I have a task to merge set of files and each file contains words ( separated by new line ). Files contains words in sorted order. Once merged all the files, it needs to preserve the sorted order. Files could be large to load into memory and there are lot of files ( thousands ). Following is my solution that i am trying to implement. I want to see if this ok, if this scalable or any potential issue it might have.
Soution:
Process files in batch. May be 100 files batch.
- Go through first word of each file and put them in to min-heap.
- At the end extract the word from min heap. Since this preserve the order i will have word in sorted order. Write that to a file.
- Now put another word from same file which that extracted word belongs.
- Again extract from min-heap and write to a file.
This way i am writing to temporary files sorting batch by batch. Then sort those temporary files to get final file. I really needs to be resilient as much as possible and scale. I am thinking keeping a line number which i read in separate file ( like offset ) so that if the process killed and restart i can start from offset.
Viraj is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
4
- read first word from all the files and put into a sorted list, linking to the file stream they came from
- take the first word/file from the list
- add the word to the output
- take the next word from the current file and compare with the new first word on the list.
- if it’s smaller, goto 3
- else add to sorted list and goto 2
This means:
- Your sorted list is only (number of files) long. Hopefully you have enough memory
- Your loop only makes one comparison and potentially one partial loop through the list to add the word and choose a new file.
- If you have too many files to fit them all in memory, you only need to go through them all once to find the top X. Because each file is already sorted you wont need the D+ files until you have got through the ABC… words out of the first batch