I’m trying to use Azure AI Search to return the specific pages within a group of PDFs that match a search query. Right now I’m using the “generateNormalizedImagePerPage” image action to turn each page into an image, and then the OcrSkill to read the text from the generated images. This lets me split the content by page, but when I query the index, it returns entire PDF documents instead of just the pages that match.
I thought I could use index projections to index each page of a PDF as a separate document in the search index.
Here is what I tried. First, I created the index:
    var index = new SearchIndex(name: "myindex")
    {
        Fields =
        [
            new SearchField(name: "id", type: SearchFieldDataType.String)
                { IsSearchable = true, IsKey = true },
            new SearchField(name: "content", type: SearchFieldDataType.String)
                { IsFilterable = true, IsKey = false },
            new SearchField(name: "pagetext", type: SearchFieldDataType.String)
                { IsSearchable = true },
            new SearchField(name: "pagenumber", type: SearchFieldDataType.String)
                { IsSearchable = true }
        ]
    };
Then I created the index projections, setting the projection mode to skip indexing parent documents. I set parentKeyFieldName to “content” because this article says that the parent key field must be of type Edm.String, can’t be the key field, and must have filterable set to true.
    var mappings = new List<InputFieldMappingEntry>
    {
        new (name: "pagetext")
        {
            Source = "/document/normalized_images/*/text"
        },
        new (name: "pagenumber")
        {
            Source = "/document/normalized_images/*/pageNumber"
        }
    };

    var selectors = new List<SearchIndexerIndexProjectionSelector>
    {
        new (targetIndexName: "myindex",
             parentKeyFieldName: "content",
             sourceContext: "/document/normalized_images/*",
             mappings: mappings)
    };

    var indexProjections = new SearchIndexerIndexProjections(selectors)
    {
        Parameters = new SearchIndexerIndexProjectionsParameters
        {
            ProjectionMode = IndexProjectionMode.SkipIndexingParentDocuments
        }
    };
My problem is that I get an error when I try to create my skillset:

    One or more index projection selectors are invalid.
    Details: Index 'myindex' must contain field 'content', it must be of type Edm.String,
    cannot be the key field and it must be filterable.
This error confuses me because I thought my index met all the requirements that the article lists for the index named in targetIndexName:
- Must already have been created on the search service before the skillset containing the index projections definition is created.
- Must contain a field with the name defined in the parentKeyFieldName parameter. This field must be of type Edm.String, can’t be the key field, and must have filterable set to true.
- The key field must have searchable set to true and be defined with the keyword analyzer.
- Must have fields defined for each of the names defined in mappings, none of which can be the key field.
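
Read literally, the four requirements above would correspond to an index definition along the lines of the sketch below. It is only an illustration under stated assumptions, not a confirmed fix for the error: the endpoint and admin key are placeholders, and the field name parent_id is hypothetical (it stands in for a dedicated parent key field, so parentKeyFieldName in the selector would have to be changed to match). The sketch creates the index before any skillset is created, sets the keyword analyzer on the key field, and then reads the index back from the service so the stored field attributes can be compared against the list.

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

public static class IndexSetup
{
    public static async Task Main()
    {
        // Placeholders (assumptions): substitute your own service endpoint and admin key.
        var endpoint = new Uri("https://<your-service>.search.windows.net");
        var credential = new AzureKeyCredential("<your-admin-key>");
        var indexClient = new SearchIndexClient(endpoint, credential);

        var index = new SearchIndex("myindex")
        {
            Fields =
            {
                // Key field: searchable and defined with the keyword analyzer.
                new SearchField("id", SearchFieldDataType.String)
                {
                    IsKey = true,
                    IsSearchable = true,
                    AnalyzerName = LexicalAnalyzerName.Keyword
                },
                // Hypothetical dedicated parent key field: Edm.String, filterable, not the key.
                new SearchField("parent_id", SearchFieldDataType.String)
                {
                    IsFilterable = true
                },
                // Targets of the projection mappings; neither is the key field.
                new SearchField("pagetext", SearchFieldDataType.String)
                {
                    IsSearchable = true
                },
                new SearchField("pagenumber", SearchFieldDataType.String)
                {
                    IsSearchable = true
                }
            }
        };

        // The index must exist on the service before the skillset is created.
        await indexClient.CreateOrUpdateIndexAsync(index);

        // Read the index back and print what the service actually stored.
        SearchIndex stored = await indexClient.GetIndexAsync("myindex");
        foreach (SearchField field in stored.Fields)
        {
            Console.WriteLine(
                $"{field.Name}: type={field.Type}, key={field.IsKey}, " +
                $"filterable={field.IsFilterable}, searchable={field.IsSearchable}, " +
                $"analyzer={field.AnalyzerName}");
        }
    }
}
```

If an index named myindex already exists with different field definitions, the update may be rejected, in which case the index would likely need to be deleted and recreated. The printed attributes show whether the index on the service really matches what the code defines.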