I am processing a file that contains millions of records. I sort the file on the customerId column and then create dynamic batches, making sure that the records for a single customerId never fall into two different batches (this matters because all batches are processed in parallel); a sketch of this batching rule follows the sample data below.

At the moment I process the file through Apache Camel, which reads the entire file and returns it as a List. This approach fails when a larger file comes in.
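For context, here is a minimal sketch of what my current route roughly does (the endpoint URIs and the downstream bean name are placeholders, not my real ones):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.camel.builder.RouteBuilder;

public class WholeFileRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:data/inbox?noop=true")   // placeholder input endpoint
            .process(exchange -> {
                // The whole file body is materialized in memory here,
                // which is what breaks on very large files.
                String body = exchange.getIn().getBody(String.class);
                List<String> lines = Arrays.asList(body.split("\n"));
                exchange.getIn().setBody(lines);
            })
            .to("bean:batchProcessor");     // placeholder downstream step
    }
}
```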
Sample Data
customerId  address  mobile
1           mumbai   86xxx5649
2           japan    95xxx5649
3           mumbai   86xxx5649
4           japan    95xxx5649
1           china    86xxx5649
6           london   95xxx5649
2           canada   86xxx5649
8           china    95xxx5649
After sorting
customerId  address  mobile
1           mumbai   86xxx5649
1           china    86xxx5649
2           japan    95xxx5649
2           canada   86xxx5649
3           mumbai   86xxx5649
4           japan    95xxx5649
6           london   95xxx5649
8           china    95xxx5649
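To make the batching rule concrete, here is a minimal sketch of the splitting logic I am aiming for (BATCH_SIZE and the whitespace-separated layout are assumptions; customerId is the first column, and the header row is assumed to be removed). A batch boundary is only cut when the customerId changes, so one customer never ends up in two batches:

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    private static final int BATCH_SIZE = 1_000; // assumed target batch size, tunable

    // Splits the sorted records into batches, cutting a boundary only when
    // the customerId changes, so no customer is split across two batches.
    static List<List<String>> toBatches(List<String> sortedLines) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String lastId = null;
        for (String line : sortedLines) {
            String id = line.split("\\s+")[0]; // customerId is the first column
            if (lastId != null && !id.equals(lastId) && current.size() >= BATCH_SIZE) {
                batches.add(current);          // batch is full and we are at a customer boundary
                current = new ArrayList<>();
            }
            current.add(line);
            lastId = id;
        }
        if (!current.isEmpty()) {
            batches.add(current);              // flush the final partial batch
        }
        return batches;
    }
}
```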
In order to process larger files, I switched to reading the file with a BufferedReader and dividing it into segments. But with this approach, the records remaining at the end of the file never get processed.
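Roughly what my BufferedReader version looks like, assuming the file has already been sorted by customerId so that each customer's records are contiguous (the file name and SEGMENT_SIZE are placeholders). I suspect the final flush after the read loop is the part my version is missing, which would explain why the data at the end is never processed:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class SegmentedReader {
    private static final int SEGMENT_SIZE = 10_000; // assumed records per segment, tunable

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("sorted-customers.txt"))) {
            reader.readLine(); // skip the header row
            List<String> segment = new ArrayList<>();
            String line;
            String lastId = null;
            while ((line = reader.readLine()) != null) {
                String id = line.split("\\s+")[0]; // customerId is the first column
                // Cut a segment only once it is full AND the customerId changes,
                // so a single customer's records never span two segments.
                if (segment.size() >= SEGMENT_SIZE && !id.equals(lastId)) {
                    process(segment);
                    segment = new ArrayList<>();
                }
                segment.add(line);
                lastId = id;
            }
            if (!segment.isEmpty()) {
                process(segment); // flush the trailing records left after the loop
            }
        }
    }

    static void process(List<String> segment) {
        // placeholder: hand the segment off for parallel processing
        System.out.println("processing " + segment.size() + " records");
    }
}
```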