I have an Azure AI Search service that uses blob storage as the source for the files that get indexed and searched.

The workflow is the following: a container called `product-data` holds the original files; an indexer runs over it and indexes `document_id`, `filename`, and `url`.

There is also a container for file chunks called `product-chunks`. For each document a folder is generated; its name is a hash value, so there is no link back to the original file. It stores the file's content split into chunks (JSON files). This container is indexed by the same fields plus `content`, `chunk_id`, and `file_path`, so that the AI can find the relevant text.
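For reference, here is a minimal sketch of how the chunk documents can be inspected from the index side (azure-search-documents for Python; the endpoint, index name, and key are placeholders):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholders: substitute the real service endpoint, index name, and admin key.
chunks_index = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="<product-chunks-index>",
    credential=AzureKeyCredential("<admin-key>"),
)

# Pull a few chunk documents to see which fields they actually carry.
for doc in chunks_index.search(
    search_text="*",
    select=["document_id", "filename", "chunk_id", "file_path"],
    top=3,
):
    print(doc)
```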
What I need to do is delete the chunks that represent the original file whenever that file is deleted. Right now, when I delete the file, the chunks remain, and search results keep being served from the stale content.
I've found this article https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs?tabs=portal, but it looks like it only applies when there are no related data sources.
My original idea was to implement an Azure Function triggered by Event Grid on blob add/delete operations. The trigger itself works, but the incoming events don't carry any data I could use to link the deleted blob to its chunks. Here is the delete event:
```json
{
  "Data": {},
  "Id": "1d1bc3dd-a01e-0007-1276-b9c34a06a6a9",
  "Topic": "/subscriptions/679b4fa6-ca49-.../resourceGroups/rg-euw-AI/providers/Microsoft.Storage/storageAccounts/ahopenai",
  "Subject": "/blobServices/default/containers/product-data/blobs/Requirement.pdf",
  "EventType": "Microsoft.Storage.BlobDeleted",
  "EventTime": "2024-06-08T07:34:49.3801182+00:00",
  "DataVersion": ""
}
```
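For completeness, here is a minimal sketch of the trigger side (assuming the Python v2 programming model; the function name is illustrative). The only usable piece of the event is `Subject`, which yields the original file name but says nothing about the hashed chunk folder:

```python
import azure.functions as func

app = func.FunctionApp()

@app.event_grid_trigger(arg_name="event")
def on_blob_deleted(event: func.EventGridEvent):
    if event.event_type != "Microsoft.Storage.BlobDeleted":
        return
    # Subject looks like:
    # /blobServices/default/containers/product-data/blobs/Requirement.pdf
    blob_name = event.subject.split("/blobs/", 1)[1]
    # blob_name is the original file name, but nothing in the event points
    # to the hashed folder in product-chunks that holds this file's chunks.
    print(f"Deleted blob: {blob_name}")
```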
The solution I have in mind is to add metadata such as `original_file_name` to the chunks generated from the original files (a sketch of what the cleanup could then look like follows below), but there are two obstacles for me:
- The chunks live in a folder with a random (hashed) name, so I can't simply delete that folder: I don't know its name, and it's only created after the original document is loaded.
- I haven't found a way to add metadata to the generated chunks, because so far I have no idea which component is responsible for splitting the document into chunks (still searching).
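If I do manage to stamp `original_file_name` onto each chunk as a filterable index field, the cleanup could look like the minimal sketch below. Everything named in it is an assumption on my side: the index name, the `original_file_name` filter field, `chunk_id` as the index key, and `file_path` being a container-relative blob path.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.storage.blob import ContainerClient

def delete_chunks_for(original_file_name: str) -> None:
    search = SearchClient(
        endpoint="https://<search-service>.search.windows.net",
        index_name="<product-chunks-index>",          # assumed index name
        credential=AzureKeyCredential("<admin-key>"),
    )

    # Find every chunk document tagged with the deleted file's name.
    # Assumes original_file_name exists in the index and is filterable.
    hits = list(search.search(
        search_text="*",
        filter=f"original_file_name eq '{original_file_name}'",
        select=["chunk_id", "file_path"],
    ))
    if not hits:
        return

    # Remove the chunk documents from the index (chunk_id assumed to be the key).
    search.delete_documents(documents=[{"chunk_id": h["chunk_id"]} for h in hits])

    # Remove the chunk blobs themselves; file_path is assumed to be the
    # path inside the product-chunks container, which also reveals the
    # hashed folder name.
    container = ContainerClient.from_connection_string(
        "<storage-connection-string>", container_name="product-chunks"
    )
    for h in hits:
        container.delete_blob(h["file_path"])
```

Going through the index like this would sidestep the first obstacle, since `file_path` in the returned documents reveals the hashed folder; the second obstacle (getting the metadata onto the chunks in the first place) is the part I'm still stuck on.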