I have a boolean array that represents the store availability of retail products across 3,000 different stores. My schema looks like this:
from pymilvus import FieldSchema, DataType

product_id = FieldSchema(
    name="product_id",
    dtype=DataType.INT64,
    is_primary=True,
    auto_id=False
)
product_title_vector = FieldSchema(
    name="product_title_vector",
    dtype=DataType.FLOAT_VECTOR,
    dim=768
)
# one BOOL per store; index i holds the availability for store i + 1
store_availability = FieldSchema(
    name="store_availability",
    dtype=DataType.ARRAY,
    element_type=DataType.BOOL,
    max_capacity=3000
)
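For reference, the collection is assembled from these fields roughly as below (the collection name, description, and index parameters are just placeholders):

from pymilvus import CollectionSchema, Collection

schema = CollectionSchema(
    fields=[product_id, product_title_vector, store_availability],
    description="retail products with per-store availability"
)
collection = Collection(name="products", schema=schema)

# index on the title vector so the L2 / nprobe search below works
collection.create_index(
    field_name="product_title_vector",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}
)
collection.load()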
The catch: as a rule of thumb, a boolean array of size 3,000 takes about 3 KB per record (1 byte per element). I have 5 million items/records, so the store_availability field alone takes roughly 15 GB:

memory = 3,000 bytes/item * 5,000,000 items = 15 GB
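A quick back-of-the-envelope check (assuming 1 byte per BOOL element and ignoring index/metadata overhead):

num_stores = 3000
num_products = 5_000_000
bytes_per_record = num_stores * 1                 # 1 byte per BOOL element
print(bytes_per_record * num_products / 1e9)      # ~15.0 GB
print(40_000 * num_products / 1e9)                # ~200 GB if stores grow to 40,000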
I then perform a Milvus search with an expr that filters results for a given store ID:
import numpy as np

# random query vector matching the 768-dim product_title_vector field
query_vector = np.random.random((1, 768)).astype(np.float32)

store_id = 1
# array elements are 0-indexed, so store_id 1 maps to index 0
filter_expr = "store_availability[{}] == true".format(store_id - 1)
print("Filter Expression : ", filter_expr)

search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
search_results = collection.search(
    data=query_vector,
    anns_field="product_title_vector",
    expr=filter_expr,
    param=search_params,
    limit=1
)
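The hits are then read back in the usual way, e.g.:

for hits in search_results:
    for hit in hits:
        print("product_id:", hit.id, "distance:", hit.distance)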
I find 15 GB for a single metadata field quite resource-intensive. The filtering itself performs well, but what happens if the number of stores grows to 40,000 in the future?
I also tried a sparse representation, but most products are available in ~80% of the stores, so it still consumes a lot of space.
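For illustration, one sparse-style variant would be to store only the IDs of the stores where a product is available instead of the full boolean array (the field name and ARRAY_CONTAINS filter below are just a sketch). With ~80% availability that is still ~2,400 INT32 entries, i.e. ~9.6 KB per record, so it ends up larger than the 3 KB boolean array:

# sketch: keep only the IDs of stores where the product is available
available_store_ids = FieldSchema(
    name="available_store_ids",
    dtype=DataType.ARRAY,
    element_type=DataType.INT32,
    max_capacity=3000
)

# the per-store filter would then become something like:
filter_expr = "ARRAY_CONTAINS(available_store_ids, {})".format(store_id)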
Any suggestions for making this more space-efficient while still supporting a filter expr inside the search would be a great help!