We are using the Azure DocumentExtractionSkill to extract content from PDFs. However, some PDF files do not seem to be supported by the service. Can someone provide guidance on how to extract content from those PDFs?
After a little debugging, I believe this happens because some of the PDFs contain unsupported fonts or characters.
"skills": [
  {
    "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
    "name": "Extract Text",
    "description": null,
    "context": "/document",
    "parsingMode": "default",
    "dataToExtract": "contentAndMetadata",
    "inputs": [
      {
        "name": "file_data",
        "source": "/document/file_data"
      }
    ],
    "outputs": [
      {
        "name": "content",
        "targetName": "extracted_content"
      }
    ],
    "configuration": {
      "imageAction": "generateNormalizedImages",
      "normalizedImageMaxWidth@odata.type": "#Int64",
      "normalizedImageMaxWidth": 2000,
      "normalizedImageMaxHeight@odata.type": "#Int64",
      "normalizedImageMaxHeight": 2000
    }
  }
]
This is what the index looks like after indexing; the extracted content is effectively not searchable:
[Screenshot: Indexes]
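For anyone trying to reproduce this, here is a minimal sketch of how affected documents could be flagged from the extracted text alone. It assumes the garbling shows up as private-use, unassigned, control, or replacement characters, which may not match every failure mode (text mapped to the wrong ordinary letters would not be caught). The function names and the 0.3 threshold are hypothetical, not part of any Azure API.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like extraction debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = 0
    for c in chars:
        category = unicodedata.category(c)
        # Co = private use, Cn = unassigned, Cc = control character;
        # U+FFFD is the Unicode replacement character.
        if category in ("Co", "Cn", "Cc") or c == "\ufffd":
            suspicious += 1
    return suspicious / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical heuristic: flag text whose debris ratio exceeds the threshold."""
    return garbled_ratio(text) > threshold
```

Running `looks_garbled` over the `extracted_content` field of each indexed document would give a rough list of the PDFs that need a different extraction path.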
Is there any way to extract content from this kind of PDF document? The content is in English only, and the pages are not images: the text is selectable.
Thanks,
I have also tried other skills, such as the Translation skill and the Split skill, but they didn't help.