I am trying to convert a PDF to text using OCR, and I am using multiprocessing to reduce the processing time. Without multiprocessing, an 88-page PDF takes about 1.5 minutes to process, but with multiprocessing it takes much longer, around 10 minutes, and I do not know why.
Here is my code:
import re
import time
import concurrent.futures
from io import BytesIO

import pytesseract
from PIL import Image
from pdf2image import convert_from_bytes

def process_image(image):
    start_time = time.time()
    # Round-trip the PIL image through PNG bytes, then OCR it
    img_byte_arr = BytesIO()
    image.save(img_byte_arr, format='PNG')
    img_byte_arr = img_byte_arr.getvalue()
    text = pytesseract.image_to_string(Image.open(BytesIO(img_byte_arr)))
    print(f'Processed image in {time.time() - start_time:.2f} seconds')
    return text

async def extract_text_from_pdf(file):
    pdf_content = await file.read()

    print("Converting PDF to images...")
    start_time = time.time()
    images = convert_from_bytes(pdf_content)
    print(f"Converted PDF to {len(images)} images in {time.time() - start_time:.2f} seconds")

    print("Processing images...")
    start_time = time.time()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(process_image, images))
    print(f"Processed all images in {time.time() - start_time:.2f} seconds")

    extracted_text = "\n\n".join(results)
    cleaned_text = re.sub(r'\s+', ' ', extracted_text).strip()
    return cleaned_text
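For context, a minimal sketch of how a caller might invoke this function, assuming FastAPI and its UploadFile (the framework, route, and endpoint name are assumptions for illustration, not part of the code above):

# Illustrative caller, assuming FastAPI; the route name is a placeholder.
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/extract-text")
async def extract_text_endpoint(file: UploadFile):
    # extract_text_from_pdf awaits file.read() internally to get the PDF bytes
    text = await extract_text_from_pdf(file)
    return {"text": text}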
And here are the console logs:
Converting PDF to images...
Converted PDF to 88 images in 3.93 seconds
Processing images...
Processed image in 0.42 seconds
Processed image in 36.72 seconds
Processed image in 57.31 seconds
Processed image in 59.74 seconds
I also tried multithreading, with the same issue.
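For reference, a threaded variant of the same pipeline would only swap the executor (a sketch, assuming nothing else changes):

# Threaded variant (sketch): ThreadPoolExecutor instead of ProcessPoolExecutor
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(process_image, images))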