I am working on a project with hundreds of Excel files containing financial transactions (stocks, mutual funds, etc.) in varying formats. Each file can contain one or more tables, and the sheet and position of each table differ from broker to broker (Zerodha, Groww, etc.).
My goal is to train a machine learning model that can automatically identify and extract the relevant table(s) from any given Excel file, even if the format is unfamiliar.
I have previously built ML and DL models on image and PDF datasets. The challenge here is annotating Excel files and training a model on them.
Some example files are available here:
https://drive.google.com/drive/folders/1YixwjLg2ZskRXMI5WMjns5kD1Ujfpphd?usp=sharing
The problem is that every broker has a different format for the data it gives to users, and the formats change frequently, so I cannot write a parser for each one by hand.
With image and PDF files, we train on a number of files, and when a new format comes in, the model can still extract the required data with good precision and accuracy. The challenge is to do the same for Excel files. I am having trouble with feature extraction, as I can't find many features in Excel files.
How should I solve or approach this problem? Please go through the sample files.
I tried annotating the files by recording the start row, end row, start column, and end column of each table to mark where it is located, and I used fuzzy matching to map the different asset types such as equity, MF, F&O, or intraday.
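To make the annotation scheme concrete, here is a minimal sketch of it. It assumes 0-based, inclusive row and column indices and uses the standard library's `difflib` for the fuzzy step. The names `cell_labels`, `map_asset`, and `ASSET_CLASSES`, as well as the cutoff of 0.6, are hypothetical and only for illustration.

```python
from difflib import get_close_matches

# Hypothetical canonical asset classes; the real list may differ.
ASSET_CLASSES = ["equity", "mutual fund", "f&o", "intraday"]


def cell_labels(n_rows, n_cols, box):
    """Expand one (start_row, end_row, start_col, end_col) annotation
    into a grid of per-cell labels: 1 inside the table, 0 outside.
    Indices are assumed to be 0-based and inclusive."""
    r0, r1, c0, c1 = box
    return [
        [1 if r0 <= r <= r1 and c0 <= c <= c1 else 0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


def map_asset(raw, cutoff=0.6):
    """Fuzzy-map a raw label from a sheet to a canonical asset class.
    Returns None when nothing is similar enough."""
    matches = get_close_matches(raw.strip().lower(), ASSET_CLASSES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


labels = cell_labels(n_rows=3, n_cols=4, box=(1, 2, 1, 2))
asset = map_asset("Mutual Funds")
```

Each annotated file then yields one label per cell, which is the target the classifier is trained on.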
For feature extraction, I took the data of each cell, specifically the cell's data type. I can't tell which features are relevant, or whether Excel files offer any useful features at all.
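A minimal sketch of per-cell feature extraction follows, assuming each sheet has already been loaded into a rectangular list of rows (for example via openpyxl or pandas). Only the `type` field corresponds to what is described above. The neighbour and row/column context fields are illustrative assumptions about what could be added, not features that have been evaluated here. The names `cell_type` and `cell_features` are hypothetical.

```python
def cell_type(value):
    """Coarse data type of a single cell value."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return "empty"
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    return "text"


def cell_features(grid, r, c):
    """Features for cell (r, c) of a rectangular grid (list of equal-length rows)."""
    n_rows, n_cols = len(grid), len(grid[0])

    def type_at(i, j):
        # Cells beyond the sheet edge get their own category.
        if 0 <= i < n_rows and 0 <= j < n_cols:
            return cell_type(grid[i][j])
        return "outside"

    row_types = [cell_type(v) for v in grid[r]]
    col_types = [cell_type(grid[i][c]) for i in range(n_rows)]
    return {
        "type": type_at(r, c),  # the only feature used so far
        # Assumed context features (untested):
        "row_fill": sum(t != "empty" for t in row_types) / n_cols,
        "row_numeric": sum(t == "number" for t in row_types) / n_cols,
        "col_fill": sum(t != "empty" for t in col_types) / n_rows,
        "type_above": type_at(r - 1, c),
        "type_below": type_at(r + 1, c),
        "type_left": type_at(r, c - 1),
        "type_right": type_at(r, c + 1),
    }


grid = [
    ["Trade Report", None, None],
    [None, None, None],
    ["Symbol", "Qty", "Price"],
    ["INFY", 10, 1500.5],
]
features = cell_features(grid, 3, 1)
```

The string-valued fields would need to be encoded (for example one-hot) before being passed to a tree model.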
I used XGBoost.
The model returned almost all of the data in the sheet, not just the table data, since the data type is the only feature.