I want to extract tables from the PDF pages of a medical bill using AWS Textract. Textract either fails to detect a partial table whose header is on the previous page, or it detects the wrong table cells (the output table has the wrong number of columns). I create a Textract client through the AWS Textract API and call the analyze_document method to fetch the tables.
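For reference, this is roughly how the tables are read from the response. It is a minimal sketch, not my exact code: it assumes the client is created with boto3, that each page is sent as image bytes, and the helper names `tables_from_blocks` and `analyze_page` are made up for illustration.

```python
from collections import defaultdict


def tables_from_blocks(blocks):
    """Rebuild every TABLE block as a list of rows (each row a list of cell texts)."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Only follow CHILD relationships (TABLE -> CELL, CELL -> WORD).
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = defaultdict(dict)  # row index -> {column index: text}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[cell["RowIndex"]][cell["ColumnIndex"]] = " ".join(words)
        n_cols = max((col for row in cells.values() for col in row), default=0)
        tables.append(
            [[cells[r].get(c, "") for c in range(1, n_cols + 1)] for r in sorted(cells)]
        )
    return tables


def analyze_page(image_bytes):
    """Hypothetical wrapper: send one page to Textract and return its tables."""
    import boto3  # assumption: boto3 is installed and AWS credentials are configured

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES"],
    )
    return tables_from_blocks(response["Blocks"])
```

Each table comes back as a plain list of rows, so the number of detected columns on a page is simply `len(table[0])`.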
Case – 1 : Failing to detect the partial table
First Page
Second Page
Here the entry "Hepatitis C Virus Ab (Rapid) Qualitative" is not detected as a table by Textract, because the table header for this entry is on the previous page and the second page contains only a partial table.
Case – 2 : Failing to detect the table correctly (columns 1-2 are combined in the output)
First Page
Second Page
Here the entries of the partial table are detected as a table, but the detected columns do not match the header columns of that table, which are on the previous page.
The table detected by Textract looks something like this:
Textract Table Output
Textract detects 8 columns here, but the original table has 9. For the first entry, "1" and "Renal Test …" should have been detected as separate cells.
Both issues happen because the context on the previous page (the column names and table header information) is lost.
Is there any way to fetch tables that are split by page breaks in a PDF?
Or is there any other way to parse this output correctly, without losing information?
I have tried using the context from the previous page while processing the pages iteratively, but since Textract also fails to identify some partial tables altogether, this does not help in every case.
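The previous-page approach I tried looks roughly like the sketch below. The function name `stitch_tables` and the heuristic are assumptions for illustration: the first table on a page is treated as a continuation of the last table on the preceding page when the column counts match, and a mismatch (as in Case 2) is only reported. It cannot recover rows that Textract never detects as a table (Case 1).

```python
def stitch_tables(pages):
    """Merge tables that continue across page breaks.

    pages: one entry per page, each a list of tables; a table is a list of rows
    and the first row of a fresh table is assumed to be its header.
    Returns (stitched_tables, warnings).
    """
    stitched, warnings = [], []
    last_page = None  # page number that last contributed rows to stitched[-1]
    for page_no, tables in enumerate(pages, start=1):
        non_empty = (table for table in tables if table)
        for idx, table in enumerate(non_empty):
            # Continuation candidate: first table on this page, and the
            # previous table ended on the immediately preceding page.
            if idx == 0 and last_page == page_no - 1:
                header = stitched[-1][0]
                if len(table[0]) == len(header):
                    # Skip a repeated header row if the page restates it.
                    rows = table[1:] if table[0] == header else table
                    stitched[-1].extend(list(row) for row in rows)
                    last_page = page_no
                    continue
                warnings.append(
                    f"page {page_no}: first table has {len(table[0])} columns, "
                    f"previous header has {len(header)}"
                )
            stitched.append([list(row) for row in table])
            last_page = page_no
    return stitched, warnings
```

For example, a 9-column table on the first page followed by an 8-column fragment on the second page ends up as two separate tables plus one warning, which is exactly where I am stuck.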