I’m building a document processing pipeline in Azure AI Studio and need advice on the best approach.
My Setup:
- Documents stored in Azure Blob Storage
- Mostly PDFs (some text-based, some scanned)
- Many documents contain tables
- Planning to use Azure Document Intelligence
Current Issue:
I tried the built-in indexer in Azure AI Studio with Blob Storage, but when I used the indexed data with a model, it gave wrong answers or no answers at all. I suspect the indexer is not processing the scanned PDFs and tables accurately.
My Questions:
- Is Azure Document Intelligence the best choice here, especially for table extraction or scanned files?
- What’s the best way to implement this in Azure AI Studio as a job?
- Any tips for accurate table extraction, especially from scanned docs?
- How should I approach embedding generation for potentially large documents?
I’m new to Azure AI Studio and looking for guidance on best practices. Any advice or alternative suggestions are welcome!