This is the sample image that I want to convert into text.
Here is the output:
“|
| .
indicators (Bids:
S.1.4.1. valid Certificate of Registration and LJ Poy |
Professional Licensure Examination for
Teachers (LET);
S.1.4.2. Master’s degree in education or in any of :
__ the allied fields; and —
S.1.4.3. comply with other requirements of the CHED. _—!
S.2. Other qualifications such as the following are considered: |__|
“
The bold letters/characters are unwanted. I think it's because of the boxes in the image.
Can anyone help me fix this?
Here's my code:
from pathlib import Path  # required by Path(directory).glob(...) below
from PIL import Image
from progress.bar import Bar
import pytesseract # type: ignore
import uuid
import fnmatch
import os
import cv2
import numpy as np
directory = "D:/Python/Projects/Pdf_OCR/pdf_file"
files = Path(directory).glob("*.png")
rand_filename = str(uuid.uuid4())
total_imgs = fnmatch.filter(os.listdir(directory), "*.png")  # same pattern as the glob above
bar = Bar("Processing", max=len(total_imgs))
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
open_text_file = open(f"{rand_filename}.txt", "w")
for file in files:
    img = Image.open(file)
    custom_config = r"-l eng --psm 6 --oem 3"
    open_text_file.write(pytesseract.image_to_string(img, config=custom_config))
    bar.next()
open_text_file.close()  # call the method; a bare .close does nothing
bar.finish()