I am using Chroma to query a collection, and I get different results for the same query. Here is the code (Python):
# note that the creation of the collection is using chroma standard settings
collection = self.chroma_client.get_collection(name=title)
result = collection.query(
    query_texts=["derivatives"],
    n_results=20,
    include=["documents", "distances"],
)
This GitHub issue explains a possible cause but doesn’t offer a solution that works in my case:
https://github.com/chroma-core/chroma/issues/860
According to the link, the cause may be that the Approximate Nearest Neighbors (ANN) search algorithm prioritises speed over accuracy, so it may not always include the closest matches in a result set that is limited in size. In my case I limit the results to 20 items. Sometimes the top result then has a distance of ~1.5 when in reality the closest result has a distance of ~0.73 (for distances, a smaller number is a closer match). As suggested in the link, raising the result limit (e.g. to 50) makes it more likely that the true closest matches are included. However, I wanted to see if anyone has found a better solution, e.g. using a different algorithm or some custom configuration. Thanks in advance.
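For reference, the over-fetch workaround described above can be sketched as a small helper: request more neighbours than needed, sort by distance, and keep only the closest `k`. The function name `query_top_k` and the `overfetch` parameter are my own illustrative names, not part of Chroma; the only Chroma call used is the same `collection.query(...)` as in the snippet above. This raises the odds of seeing the true nearest items but does not guarantee it.

```python
from typing import Any, Dict, List


def query_top_k(
    collection: Any, text: str, k: int = 20, overfetch: int = 50
) -> List[Dict[str, Any]]:
    """Request `overfetch` neighbours, then keep the `k` with the smallest distance.

    `collection` is assumed to behave like a Chroma collection, i.e. its
    `query` method returns a dict of per-query lists.
    """
    result = collection.query(
        query_texts=[text],
        n_results=max(k, overfetch),  # never ask for fewer than k
        include=["documents", "distances"],
    )
    # Results are nested per query text; one text was sent, so take index 0.
    pairs = zip(result["documents"][0], result["distances"][0])
    # Smaller distance means a closer match, so sort ascending.
    ranked = sorted(pairs, key=lambda pair: pair[1])
    return [{"document": doc, "distance": dist} for doc, dist in ranked[:k]]
```

Usage would look like `hits = query_top_k(collection, "derivatives", k=20, overfetch=50)`. Separately, Chroma versions that accept HNSW settings through collection metadata expose keys such as `hnsw:search_ef`; whether raising it helps here, and how it is set, depends on the installed version, so I have not included it in the sketch.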