I’m trying to extract the highlighted text from an image using Python. Currently, I’m using pytesseract to convert the image to a searchable PDF with this code:
import pytesseract
image_path = "./image/with_mark.jpg"
pdf = pytesseract.image_to_pdf_or_hocr(image_path, extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf)  # pdf type is bytes by default
This successfully converts the image to a PDF, but it doesn’t differentiate the highlighted text.
I’ve seen examples using the Solr OCR Highlighting Plugin to achieve this, but I’m specifically looking for a Python-based solution.
Is it possible to extract only the highlighted text using pytesseract? If not, are there other Python libraries or methods that could help me achieve this?
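One direction I have been considering, as a rough sketch that I have not validated on my image: get per-word bounding boxes from pytesseract.image_to_data, then keep only the words whose box is mostly filled with the highlighter color. The helper names (is_highlight, highlighted_fraction, extract_highlighted_words), the HSV thresholds (which assume a yellow highlighter), and the min_fraction cutoff are all my own guesses and would need tuning. Pillow is assumed to be installed alongside pytesseract.

```python
import colorsys


def is_highlight(r, g, b, hue_range=(0.10, 0.20), min_sat=0.35, min_val=0.6):
    """Return True if an RGB pixel (0-255 channels) looks like yellow highlighter.

    The thresholds are assumptions for a yellow marker; adjust for other colors.
    """
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return hue_range[0] <= h <= hue_range[1] and s >= min_sat and v >= min_val


def highlighted_fraction(pixels):
    """Return the share of (r, g, b) pixels that look highlighted."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(is_highlight(*p) for p in pixels) / len(pixels)


def extract_highlighted_words(image_path, min_fraction=0.4):
    """OCR the image and keep words whose bounding box is mostly highlighted."""
    # Imported lazily so the color helpers above work without OCR dependencies.
    from PIL import Image
    import pytesseract

    img = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    words = []
    for text, x, y, w, h in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if not text.strip():
            continue  # skip the empty entries image_to_data emits for blocks/lines
        box = img.crop((x, y, x + w, y + h))
        if highlighted_fraction(box.getdata()) >= min_fraction:
            words.append(text)
    return " ".join(words)
```

I would call it as extract_highlighted_words("./image/with_mark.jpg"). The per-pixel loop is pure Python and will be slow on large scans; a vectorized color mask (for example with OpenCV or NumPy) would probably be the faster route, but I am not sure whether this general approach is sound.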
The highlighted text I’m trying to extract looks like this:
Any suggestions or guidance would be greatly appreciated. Thank you!