I have created a collection with the following specifications:
- Milvus Version: 2.4.4
- CPU
- Number of Entities: 20 million
- Vector Field: One float vector field with a dimension of 512
- Boolean Field: Represents gender, with a 50% probability for both male and female
- Metric: COSINE
- M: 64
- efConstruction: 256
- ef: 128
- Index Type: HNSW
- I did not configure values for partition, segment, or num_shards.
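
For reference, a collection with these settings could be defined roughly as follows with pymilvus's `MilvusClient`. This is only a sketch: the collection name `faces`, the field names `id`, `embedding`, and `is_male`, and the server URI are placeholders I made up for illustration, not the names used in my actual setup.

```python
# Sketch only: "faces", "id", "embedding", "is_male" and the URI are hypothetical names.
DIM = 512
HNSW_BUILD_PARAMS = {"M": 64, "efConstruction": 256}


def create_gender_collection(uri="http://localhost:19530", name="faces"):
    """Create a collection with a 512-d float vector field and a boolean gender field."""
    # Imported lazily so the constants above can be inspected without pymilvus installed.
    from pymilvus import DataType, MilvusClient

    client = MilvusClient(uri=uri)

    # Explicit schema: primary key, vector field, boolean gender field.
    schema = MilvusClient.create_schema(auto_id=True, enable_dynamic_field=False)
    schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
    schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=DIM)
    schema.add_field(field_name="is_male", datatype=DataType.BOOL)

    # HNSW index with the COSINE metric and the build parameters listed above.
    index_params = client.prepare_index_params()
    index_params.add_index(
        field_name="embedding",
        index_type="HNSW",
        metric_type="COSINE",
        params=HNSW_BUILD_PARAMS,
    )

    # Partitions, segment settings, and num_shards are left at their defaults.
    client.create_collection(
        collection_name=name, schema=schema, index_params=index_params
    )
    return client
```

Calling `create_gender_collection()` requires a running Milvus instance at the given URI.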
In my initial benchmark, I compared Milvus's search performance against a brute-force search using NumPy's dot product and was pleased with the results.
Now I want to add a boolean field to the schema that stores the gender of each embedding vector, so that I can restrict queries by gender. For instance, I want to retrieve the 50 nearest neighbors that are male. To test this, I generated the gender values at random with a probability of 50% each, so roughly half of the collection is male and the other half is female.
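
To make the two query types concrete, here is a minimal sketch of the filtered and unfiltered searches being compared, assuming pymilvus's `MilvusClient.search`. The collection name `faces` and the field names `embedding` and `is_male` are hypothetical placeholders, and `client` is assumed to be an already connected `MilvusClient`.

```python
def build_search_kwargs(query_vector, males_only, top_k=50, ef=128):
    """Build keyword arguments for MilvusClient.search (names are hypothetical)."""
    kwargs = {
        "collection_name": "faces",    # placeholder collection name
        "data": [query_vector],        # one 512-d query vector
        "anns_field": "embedding",     # placeholder vector field name
        "limit": top_k,                # 50 nearest neighbors
        "search_params": {"metric_type": "COSINE", "params": {"ef": ef}},
        "output_fields": ["is_male"],
    }
    if males_only:
        # Scalar filter expression on the boolean gender field.
        kwargs["filter"] = "is_male == true"
    return kwargs


def search(client, query_vector, males_only):
    """Run one search; `client` is assumed to be a connected MilvusClient."""
    return client.search(**build_search_kwargs(query_vector, males_only))
```

The only difference between the two benchmarked cases is the `filter` expression; `limit`, `ef`, and the metric stay the same.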
I ran benchmarks for this scenario, and the findings are outlined below. As the plot shows, filtering by gender brought almost no speedup: in one case, for example, filtered queries were only 1.06 times faster than unfiltered ones.
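
For clarity on how a figure like "1.06 times faster" can be computed, here is a small timing sketch. It is not my exact benchmark code: `median_latency` and `speedup` are hypothetical helper names, and `run_query` stands for any zero-argument callable that issues one search.

```python
import statistics
import time


def median_latency(run_query, repeats=20, warmup=3):
    """Return the median wall-clock time in seconds of calling run_query()."""
    # Warm-up calls are executed but not measured.
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(unfiltered_seconds, filtered_seconds):
    """How many times faster the filtered query is than the unfiltered one."""
    return unfiltered_seconds / filtered_seconds
```

With this definition, a value of 1.06 means the unfiltered query took 1.06 times as long as the filtered one.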