I have been assigned a task to build a process that extracts values from daily files sent to our Data Lake. Two types of files arrive: JSON and AVRO, for n number of LOBs, and each LOB sits in its own folder. The task is to traverse the folder structure, read the files, derive the required insights (count of a particular attribute, aggregations such as the sum of a particular business attribute, and so on), and write them to a report CSV file. To make that concrete, below is a minimal sketch of the loop I have in mind; the lake path, file extensions, and the "amount" field are placeholders for my actual setup.
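```python
import csv
import json
from pathlib import Path

LAKE_ROOT = Path("/data/lake")      # hypothetical drop-zone mount point
REPORT = Path("daily_report.csv")

def summarize(path: Path) -> tuple[int, float]:
    # Placeholder: assumes newline-delimited JSON with an "amount" field;
    # streaming readers for large JSON arrays and AVRO are sketched further down.
    count, total = 0, 0.0
    with path.open() as f:
        for line in f:
            rec = json.loads(line)
            count += 1
            total += float(rec.get("amount", 0))
    return count, total

with REPORT.open("w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["lob", "file", "record_count", "amount_sum"])
    # one subfolder per LOB, 24 files per LOB per day
    for lob_dir in sorted(p for p in LAKE_ROOT.iterdir() if p.is_dir()):
        for path in sorted(lob_dir.glob("*.json")) + sorted(lob_dir.glob("*.avro")):
            count, total = summarize(path)
            writer.writerow([lob_dir.name, path.name, count, total])
```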
Each file is about 1.8-2.4 GB, and each LOB folder receives 24 files every day.
The main problem is that I have to do this entirely in Python, without any big data tooling, because the process is meant to be an Early Caution System that flags any discrepancies in counts/sums before the actual files are ingested by the Big Data frameworks. The server's memory is low (about 32 GB), so a 1-1.5 GB file takes about 3 minutes to process with Pandas when there are concurrent users. Instead of Pandas, I am considering record-at-a-time streaming so memory stays flat regardless of file size; a sketch is below, assuming the JSON files hold a top-level array of records and using the third-party ijson and fastavro packages ("amount" is again a placeholder field name).
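```python
# pip install ijson fastavro
import ijson
from fastavro import reader as avro_reader

def summarize_json(path: str) -> tuple[int, float]:
    count, total = 0, 0.0
    with open(path, "rb") as f:
        # ijson parses incrementally and yields one record at a time,
        # so a 2+ GB file never has to fit in memory
        for rec in ijson.items(f, "item"):
            count += 1
            total += float(rec.get("amount", 0))
    return count, total

def summarize_avro(path: str) -> tuple[int, float]:
    count, total = 0, 0.0
    with open(path, "rb") as f:
        # fastavro decodes AVRO blocks lazily, one dict per record
        for rec in avro_reader(f):
            count += 1
            total += float(rec.get("amount", 0))
    return count, total
```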
Can someone suggest tools, frameworks, or an LLD (Low Level Design) for achieving this in Python? I am okay with the process running for 4-5 hours, since the actual ingestion jobs run in the second half of the day.
The servers I use do not support Hadoop, Spark, or other big data frameworks.
I am looking for insights from people in this community who have experience building similar applications.