Processing 12 million records through Spark
I have a use case where I need to read a file containing 12 million records every day and process some of those records further in other services.
The file is a CSV in which the columns of each record are separated by the Ç character, and every record has a unique REG_ID that identifies it. I need to fetch only the records that are new in today's file compared to yesterday's, plus the records that existed in yesterday's file but have been updated since then. There is no date column that indicates whether a record has been updated.
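Roughly, this is what I have in mind for the daily comparison. The paths, the header option, and the Parquet output are just assumptions for the sketch; the Ç separator and the REG_ID column are the real ones from my feed:

```scala
import org.apache.spark.sql.SparkSession

object DailyDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daily-delta").getOrCreate()

    // Paths and the header option are assumptions; the Ç separator and
    // the unique REG_ID column come from the actual feed.
    def readFeed(path: String) =
      spark.read
        .option("sep", "Ç")       // single-character custom delimiter
        .option("header", "true")
        .csv(path)

    val today     = readFeed("/data/feed/today.csv")      // hypothetical path
    val yesterday = readFeed("/data/feed/yesterday.csv")  // hypothetical path

    // Rows in today's file that have no exact match in yesterday's file.
    // A brand-new REG_ID has no match at all, and an updated record differs
    // in at least one column, so both survive; unchanged rows are dropped.
    val newOrUpdated = today.except(yesterday)

    newOrUpdated.write.mode("overwrite").parquet("/data/delta/today")  // hypothetical output
    spark.stop()
  }
}
```

The `except()` drops rows that match yesterday's file exactly, so both brand-new REG_IDs and changed rows come through without needing an update-date column; an alternative would be a join on REG_ID followed by a column-by-column comparison.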
One more use case: if a record with a given REG_ID has not appeared in the file for the last 7 days, I need to fetch and process that record as well.
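For the 7-day case, one approach I'm considering is to keep a small state table with each REG_ID and the date it last appeared, refresh it from every daily file, and then pick out the IDs that haven't shown up for more than 7 days. This is only a sketch: the state-table path and the last_seen_date column are assumptions, not something that exists in my data today:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StaleRegIds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stale-reg-ids").getOrCreate()

    // Assumed state table: one row per REG_ID plus the date it last appeared
    // (last_seen_date). Path and column name are illustrative only.
    val lastSeen = spark.read.parquet("/data/state/last_seen")

    val todayIds = spark.read
      .option("sep", "Ç")
      .option("header", "true")
      .csv("/data/feed/today.csv")          // hypothetical path
      .select(col("REG_ID"))
      .distinct()
      .withColumn("seen_today", lit(true))

    // Full outer join so REG_IDs appearing for the first time enter the state
    // table, IDs seen today get today's date, and everything else keeps its
    // previous last_seen_date.
    val refreshed = lastSeen
      .join(todayIds, Seq("REG_ID"), "full_outer")
      .withColumn("last_seen_date",
        when(col("seen_today"), current_date()).otherwise(col("last_seen_date")))
      .drop("seen_today")

    // REG_IDs that have not appeared in the daily file for more than 7 days.
    val stale = refreshed.filter(datediff(current_date(), col("last_seen_date")) > 7)
    stale.show(20, truncate = false)

    // Write to a fresh path rather than overwriting the path being read
    // in the same job, then swap it in before the next run.
    refreshed.write.mode("overwrite").parquet("/data/state/last_seen_next")
    spark.stop()
  }
}
```

The stale REG_IDs would then have to be joined back to wherever the full records are stored so they can actually be processed. Is this a reasonable way to handle both use cases at this volume, or is there a better pattern in Spark for daily delta detection without an update-date column?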