I am currently working with the following setup:
Milvus version 2.3.7, pymilvus version 2.3.6.
A database in Milvus containing 4 million 768-dimensional vectors.
My challenge involves performing a vector search restricted to a large set of IDs, ranging from 10,000 to 500,000. For instance, a query may match 100,000 documents, but only 70,000 of those are available in my inventory. I need to filter the results down to the available items using this information, which I hold in the form of a bitset.
The current workflow is as follows:
Query Milvus: execute a search in Milvus to retrieve document IDs for matched vectors, without considering availability.
Post-processing: apply post-filtering with the bitset to keep only those documents whose availability bit is set to 1. This is O(n), where n is the number of matched documents.
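For reference, my current post-filter looks roughly like this (a minimal sketch; the availability bitset is modeled as an arbitrary-precision Python int, and the function name is just a placeholder):

```python
def post_filter(matched_ids, availability_bitset):
    """Keep only IDs whose availability bit is set to 1.

    availability_bitset is a Python int used as a bitset:
    bit i == 1 means the document with ID i is in stock.
    This scan is O(n) in the number of matched documents.
    """
    return [doc_id for doc_id in matched_ids
            if (availability_bitset >> doc_id) & 1]

# Example: documents 3 and 7 are available
bitset = (1 << 3) | (1 << 7)
print(post_filter([1, 3, 5, 7], bitset))  # → [3, 7]
```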
I am seeking a method or strategy that either reduces this operation to O(1) time complexity or allows me to directly fetch available items via Milvus without needing any post-filtering.
Internally, Milvus does support filtering with bitsets (filters are applied before the actual approximate nearest neighbor (ANN) search runs). Since I already have an availability bitset on hand, is there any way to pass it to Milvus?
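Since the public pymilvus API takes boolean expression strings rather than raw bitsets, one workaround I considered is translating the bitset into an `id in [...]` expression (a sketch; the field name is a placeholder, and the expression string becomes impractically large for hundreds of thousands of IDs):

```python
def bitset_to_expr(availability_bitset, field="product_id"):
    """Translate a Python-int bitset into a Milvus boolean expression.

    Bit i set means the document whose primary key equals i is available.
    Note: the resulting string grows linearly with the number of set bits,
    so this only scales to modest filter sizes.
    """
    ids = []
    i = 0
    bs = availability_bitset
    while bs:
        if bs & 1:
            ids.append(str(i))
        bs >>= 1
        i += 1
    return f"{field} in [{', '.join(ids)}]"

expr = bitset_to_expr((1 << 3) | (1 << 7))
print(expr)  # → product_id in [3, 7]
```

The resulting string would be passed as `expr=` to `collection.search(...)`, so the filter is applied before the ANN search rather than afterwards.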
I also considered another approach, where I store availability as an array field in each document, like this:
```python
items = [
    {
        "product_id": "12345",
        "product_title_vector": [0.23, 0.45, 0.87, ...],  # vector field of the required dimension
        "store_availability": [0, 2, 14, 3, 5, ...]       # up to 2000 stores
    },
    {
        "product_id": "12346",
        "product_title_vector": [0.23, 0.45, 0.87, ...],
        "store_availability": [0, 5, ...]
    },
    {
        "product_id": "12347",
        "product_title_vector": [0.23, 0.45, 0.87, ...],
        "store_availability": [0, 5, ...]
    },
    {
        "product_id": "12348",
        "product_title_vector": [0.23, 0.45, 0.87, ...],
        "store_availability": [0, 3, 5, ...]
    },
    {
        "product_id": "12349",
        "product_title_vector": [0.23, 0.45, 0.87, ...],
        "store_availability": [0, 2, 14, 3, 5, ...]
    },
]
```
To search for items available in the relevant stores, I tried using a boolean expression:
```python
bool_expr = (
    "(store_availability in [0] && store_availability[0] == 4) && "
    "(store_availability in [1] && store_availability[1] == 2) && "
    "(store_availability in [2] && store_availability[2] == 14)"
)
search_results = collection.search(
    data=query_vector_resized,
    anns_field="product_title_vector",
    param=search_params,
    limit=10,
    expr=bool_expr,
    output_fields=["product_id", "product_name", "store_availability"]
)
```
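If the expression syntax above is the problem, newer Milvus releases document `array_contains` / `array_contains_all` operators for ARRAY fields; a sketch of building such an expression follows (I have not confirmed these operators exist on 2.3.7, so this is an assumption to verify against the server version):

```python
def stores_expr(store_ids):
    """Build an expression requiring availability in all given stores.

    Uses the array_contains_all operator documented for Milvus ARRAY
    fields (may require a newer server version than 2.3.7).
    """
    ids = ", ".join(str(s) for s in store_ids)
    return f"array_contains_all(store_availability, [{ids}])"

print(stores_expr([4, 2, 14]))  # → array_contains_all(store_availability, [4, 2, 14])
```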
However, the major challenge is that inventory changes every 10 minutes, and I cannot run bulk updates of this size that often.
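To soften the update cost, I also considered upserting only the rows whose availability actually changed between two inventory snapshots (a pure-Python sketch; the snapshot format, mapping product_id to a sorted list of store IDs, is my own assumption):

```python
def changed_rows(prev, curr):
    """Return the product_ids whose store_availability changed.

    prev/curr map product_id -> sorted list of store IDs (assumed format).
    Only these rows would need to be written back every 10 minutes,
    instead of bulk-updating the whole collection.
    """
    return [pid for pid, stores in curr.items()
            if prev.get(pid) != stores]

prev = {"12345": [0, 2], "12346": [0, 5]}
curr = {"12345": [0, 2], "12346": [0, 5, 7]}
print(changed_rows(prev, curr))  # → ['12346']
```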
I appreciate any guidance or suggestions on how this can be achieved efficiently.