I need to develop a web application that receives documents, mainly PDFs, and extracts their text using OCR. I need to store both the original document and the OCR output so that they can be searched later. I have already solved the OCR part, but what do you recommend for storing the data and then searching it? My architecture should be based on Java, and the extracted text can be as long as a complete document.
I have been evaluating whether MongoDB, Hadoop, ElasticSearch, or even plain Lucene is the best alternative.
MongoDB has a 16 MB limit per document, and I may need to store more than that.
I am looking for an architecture suggestion.