I have a set of PDF files with varying structures, and I’m trying to extract key-value pairs from them using Apache PDFBox in Java. I’ve run into difficulties because the PDFs are formatted differently. These are the variations I have to handle:
- Keys and values can be single or multiple words.
- Keys and values can be missing.
- Multiple key-value pairs on the same line, without any delimiters between them.
- Position of key-value pairs within the document is not fixed.
- A value can be either to the right of its key or directly below it.
- Value can spread over multiple lines.
- Keys and values correspond to each other (each value belongs to one key).
- The same key can be named slightly differently in different PDFs.
- A PDF can contain a table holding key-value pairs.
- One cell can hold both key and value.
- A value can be in the cell to the right of its key or in the cell below it.
- One cell can have multiple key-value pairs.
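To make the key-naming variation concrete, here is a minimal sketch of mapping label variants to one canonical key. The class `KeyAliases`, its methods, and the sample labels are all hypothetical, made up for illustration. It assumes labels are plain ASCII, since the normalization drops every character other than a-z and 0-9.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: maps label variants seen in PDFs to one canonical key. */
final class KeyAliases {
    private final Map<String, String> canonicalByAlias = new HashMap<>();

    /** Registers a canonical key together with its known label variants. */
    void register(String canonical, String... variants) {
        canonicalByAlias.put(normalize(canonical), canonical);
        for (String variant : variants) {
            canonicalByAlias.put(normalize(variant), canonical);
        }
    }

    /** Returns the canonical key for a label, or empty if the label is unknown. */
    Optional<String> canonicalOf(String label) {
        return Optional.ofNullable(canonicalByAlias.get(normalize(label)));
    }

    /** Lower-cases and replaces each run of non-alphanumeric characters with one space. */
    static String normalize(String label) {
        return label.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", " ")
                .trim();
    }
}
```

With `register("Invoice Number", "Invoice No", "Inv. No")`, labels such as `INVOICE NO.:` and `inv no` would both resolve to `Invoice Number`. This only covers variants that are listed up front; it does not handle misspellings or variants nobody has registered.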
Here’s what I tried:
- Extracted the tables, then iterated over the cells, matched the keys, and checked whether a value was present.
- Extracted the text line by line, then iterated over the lines, matched the keys, and checked whether a value was present.
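The line-based attempt can be sketched roughly as follows. This is a simplified illustration under assumptions, not the exact code used: the class `LineKeyValueMatcher`, the `match` method, and the sample keys are hypothetical. It assumes the lines have already been extracted (for example by splitting the output of PDFBox's `PDFTextStripper.getText`) and that a key always starts its line.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of line-based matching: scan extracted lines for known keys. */
final class LineKeyValueMatcher {

    /**
     * For each line that starts with a known key (ignoring case), takes the
     * rest of the line as the value, or the next line if the rest is empty.
     */
    static Map<String, String> match(List<String> lines, List<String> keys) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            for (String key : keys) {
                // Case-insensitive prefix check; false if the line is shorter than the key.
                if (!line.regionMatches(true, 0, key, 0, key.length())) {
                    continue;
                }
                // Text after the key, minus an optional ':' separator.
                String rest = line.substring(key.length()).trim();
                if (rest.startsWith(":")) {
                    rest = rest.substring(1).trim();
                }
                // Value below the key: fall back to the following line.
                if (rest.isEmpty() && i + 1 < lines.size()) {
                    rest = lines.get(i + 1).trim();
                }
                if (!rest.isEmpty()) {
                    result.putIfAbsent(key, rest); // keep the first occurrence
                }
                break; // at most one key per line in this sketch
            }
        }
        return result;
    }
}
```

Even this small version shows the structure dependence: it misses a second key-value pair on the same line, treats the next line as a value even when that line is actually another key, and reads only one line of a multi-line value.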
But everything I tried depends entirely on the structure of the PDF. I need a solution that is independent of the structure, as I have 100+ structures.
To give a general idea of what the PDF structures look like: I’m dealing with clients’ invoices. All clients use almost the same keys, but structure them differently. Some clients generate invoices directly from Tally software.
Please suggest a strategy for dealing with this problem.
Thanks