I wasn’t quite getting the answers I needed on Reddit, so I remembered my friends here at Stack Overflow and figured I’d give this a shot.
I’ve recently been picking up some machine learning concepts, and it’s been really fun. I’m working on a document processor that takes in a batch of PDFs and labels the data within them. I want to know whether I have the right approach for generating training data for a model I’ll create later.
The path to the PDF file is fed into a function that extracts the text using DocTR. The data is formatted like so:
# docTR geometry is ((x_min, y_min), (x_max, y_max)) in relative coordinates
data_list.append({
    "id": coord_key,
    "text": block_list[coord_key],
    "x1": float(block["geometry"][0][0]),
    "y1": float(block["geometry"][0][1]),
    "x2": float(block["geometry"][1][0]),
    "y2": float(block["geometry"][1][1])
})
coord_key serves as an ID for a block of text. x1, y1, x2, y2 are the coordinates of the bounding box on the document, so the position is known. text is, of course, the text in question.
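For context, here is a minimal, self-contained sketch of the extraction step. The nested dict is hand-made to mimic the shape of docTR's `result.export()` (pages → blocks → lines → words); in the real code it would come from running the OCR predictor on the PDF. The way `coord_key` is built here is my own placeholder, not necessarily how it's derived in the actual code.

```python
# Hand-made sample mimicking docTR's result.export() structure;
# in practice this comes from ocr_predictor(...)(DocumentFile.from_pdf(path)).export()
export = {
    "pages": [{
        "blocks": [{
            "geometry": ((0.1, 0.2), (0.5, 0.25)),  # ((x_min, y_min), (x_max, y_max))
            "lines": [{"words": [{"value": "123"}, {"value": "Main"}, {"value": "St"}]}],
        }]
    }]
}

data_list = []
for page in export["pages"]:
    for block in page["blocks"]:
        # Join all words of all lines into one text string per block
        text = " ".join(
            word["value"] for line in block["lines"] for word in line["words"]
        )
        (x1, y1), (x2, y2) = block["geometry"]
        # Placeholder ID scheme: the rounded coordinates themselves
        coord_key = f"{x1:.4f}_{y1:.4f}_{x2:.4f}_{y2:.4f}"
        data_list.append({
            "id": coord_key,
            "text": text,
            "x1": float(x1), "y1": float(y1),
            "x2": float(x2), "y2": float(y2),
        })

print(data_list[0]["text"])  # 123 Main St
```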
The idea is to then take this data and pass it into Label Studio, where I'll use OpenAI to help me apply the specific labels I'm looking to use (e.g. address, phoneNumber, remarks, etc.). I will of course review what it comes up with and make corrections as I sift through the information, so that the model I go on to train has the best information to learn from.
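If it helps, this is roughly how I'd shape the extracted records into Label Studio's import format, which is a JSON list of tasks, each with a "data" object. The key names under "data" ("text", "bbox", "block_id") are my own placeholders and would have to match whatever the labeling config references:

```python
import json

# Sample record in the same shape as data_list from the extraction step
data_list = [
    {"id": "b1", "text": "555-0100", "x1": 0.1, "y1": 0.2, "x2": 0.3, "y2": 0.25},
]

# One Label Studio task per extracted block; field names under "data"
# are placeholders that must line up with the labeling config.
tasks = [
    {
        "data": {
            "text": item["text"],
            "bbox": [item["x1"], item["y1"], item["x2"], item["y2"]],
            "block_id": item["id"],
        }
    }
    for item in data_list
]

payload = json.dumps(tasks, indent=2)  # write this out and import it into Label Studio
```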
The goal at this point is to generate training data so that a model written with TensorFlow can learn to do the labeling on its own.
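For the eventual TensorFlow step, I'm imagining something along these lines: a small Keras model that embeds the block text and concatenates it with the four normalized bounding-box coordinates. Everything here (layer sizes, vocabulary size, the three-label set) is a placeholder assumption, not a tuned design:

```python
import tensorflow as tf

NUM_LABELS = 3  # placeholder: address, phoneNumber, remarks

# Two inputs: the raw text of a block, and its normalized bbox coordinates
text_in = tf.keras.Input(shape=(1,), dtype=tf.string, name="text")
bbox_in = tf.keras.Input(shape=(4,), dtype=tf.float32, name="bbox")

# TextVectorization must be adapted to a corpus before use;
# a tiny placeholder corpus stands in for the real extracted texts.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20000, output_mode="int", output_sequence_length=32
)
vectorizer.adapt(["123 Main St", "555-0100", "see remarks"])

x = vectorizer(text_in)
x = tf.keras.layers.Embedding(20000, 64)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
x = tf.keras.layers.Concatenate()([x, bbox_in])
x = tf.keras.layers.Dense(64, activation="relu")(x)
out = tf.keras.layers.Dense(NUM_LABELS, activation="softmax")(x)

model = tf.keras.Model(inputs=[text_in, bbox_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```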
My question is basically whether this approach is correct, and whether there are any additional considerations I should take into account when creating training data for my model, before I go heavy on generating more of it from the documents I have to test on.
Thank you for any help!