I am calling DocumentAI OCR batch processing from Workflows, generally quite successfully; however, I occasionally get the following error:
{
  "caughtError": {
    "message": "An error occurred during execution of a long-running operation.",
    "operation": {
      "done": true,
      "error": {
        "code": 3,
        "message": "Failed to process all documents."
      },
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.documentai.v1.BatchProcessMetadata",
        "createTime": "2024-07-18T12:43:54.456405Z",
        "individualProcessStatuses": [
          {
            "inputGcsSource": "gs://redacted_bucket_name/redacted.pdf",
            "status": {
              "code": 3,
              "message": "Invalid input document content."
            }
          }
        ],
        "state": "FAILED",
        "updateTime": "2024-07-18T12:44:25.393457Z"
      },
      "name": "projects/###/locations/us/operations/###"
    },
    "tags": [
      "OperationError"
    ]
  }
}
By breaking the PDF up page by page, I can isolate the error to the specific page that triggers it. However, I am unable to programmatically discern the exact cause of the error in order to fix it.
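For reference, the page-by-page splitting is roughly the following (a minimal sketch using pypdf; the file paths are placeholders):

from pypdf import PdfReader, PdfWriter

# Split the problem PDF into single-page files so each page can be
# submitted to DocumentAI separately and the failing page identified.
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i + 1}.pdf", "wb") as f:
        writer.write(f)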
I am able to use libraries like pypdf and pymupdf to open the PDF and successfully extract the text (if it exists). I have also tried using pymupdf to compress the PDF with various compression settings; the compression succeeds, but it does not prevent the error. On one PDF where I was encountering the issue, setting enableNativePdfParsing to false allowed it to process, but this does not resolve the error in all the cases I am seeing.
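The compression attempts look roughly like this (a sketch using pymupdf; newer releases import as pymupdf, older ones as fitz, and I varied the save options between runs):

import pymupdf  # PyMuPDF; use "import fitz" on older versions

# Re-save the PDF with compression/cleanup options; I varied the garbage
# collection level, stream deflation, and cleanup between attempts.
doc = pymupdf.open("input.pdf")
doc.save("compressed.pdf", garbage=4, deflate=True, clean=True)
doc.close()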
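To be clear about which flag I mean: enableNativePdfParsing sits under processOptions.ocrConfig in the batchProcess request. A minimal sketch using the Python client (resource names and GCS paths are placeholders; my actual calls go through the Workflows connector with the equivalent request body):

from google.cloud import documentai_v1 as documentai

# Batch request with native PDF parsing disabled; all resource names
# and GCS URIs below are placeholders.
client = documentai.DocumentProcessorServiceClient()
request = documentai.BatchProcessRequest(
    name="projects/PROJECT_ID/locations/us/processors/PROCESSOR_ID",
    input_documents=documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(
            documents=[
                documentai.GcsDocument(
                    gcs_uri="gs://my-bucket/input.pdf",
                    mime_type="application/pdf",
                )
            ]
        )
    ),
    document_output_config=documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri="gs://my-bucket/output/"
        )
    ),
    process_options=documentai.ProcessOptions(
        ocr_config=documentai.OcrConfig(enable_native_pdf_parsing=False),
    ),
)
operation = client.batch_process_documents(request=request)
operation.result()  # blocks until the batch succeeds or raises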
Is there some way I can preprocess my PDFs to ensure they are compatible with DocumentAI, or is there another way to solve this issue?