I’m trying to run a vector search in Python (using LangChain) that pre-filters the documents in my MongoDB collection before the vector query runs.
Relevant parts of my models:
class Question(Document):
    content = StringField(required=True)
    ...

class Theme(Document):
    question = ReferenceField(Question, required=True)
    text = StringField(required=True)  # AKA Category title
    embedding = ListField(FloatField())
Basically, each Theme holds a reference to a Question.
I want to run a semantic search on the theme collection, pre-filtered by question, so I have set up the following index on theme:
{
  "mappings": {
    "fields": {
      "embedding": [
        {
          "dimensions": 1536,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ],
      "question": {
        "type": "token"
      }
    }
  }
}
Finally, here is my (simplified) Python code:
from langchain.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient
from bson import ObjectId
import os
# Set environment variables
os.environ['OPENAI_API_KEY'] = ""
os.environ["MONGODB_HOST"] = ""
# Connect to the MongoDB database
mongo_client = MongoClient(os.environ["MONGODB_HOST"])['text-mining-langchain']
embeddings = OpenAIEmbeddings()
# Get the collection
collection = mongo_client['theme']
# Build the pre-filter on the 'question' field (the ObjectId is converted to a string here)
question_id = ObjectId('663252400674de6854bf6594')
pre_filter_dict = {"question": str(question_id)}
vectorstore = MongoDBAtlasVectorSearch(collection, embeddings, text_key="text",
                                       embedding_key="embedding", index_name="default")
# Perform similarity search on the filtered documents
query = 'Ease of Use and Accuracy'
docs = vectorstore.similarity_search_with_score(query, k=10, pre_filter=pre_filter_dict)
docs
When I run it WITHOUT the pre_filter (i.e. with an empty pre_filter), I get results as expected.
However, when I run it WITH the pre_filter, I get zero results, even though there are themes for that question and I SHOULD get some back.
Is this a mismatch between the index mapping question as a token and my converting the ObjectId to a string for the filter? I’ve tried indexing question as objectId and as string instead, but then I get an error saying that question needs to be a token.
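To narrow down the suspected type mismatch, here is a minimal sketch (not part of my original code) that counts which Python types the question field actually has in a sample of stored documents. The helper name `field_type_counts` and its `sample_size` parameter are made up for illustration; the only assumption is a PyMongo-style collection that supports `find(...).limit(...)`. As far as I understand, mongoengine's `ReferenceField` stores a plain `ObjectId` by default, in which case a string filter value would not be expected to match, but I have not confirmed that this is the cause.

```python
from collections import Counter


def field_type_counts(collection, field, sample_size=100):
    """Count the Python type names of `field` over a sample of documents.

    `collection` is assumed to behave like a PyMongo collection:
    find(filter, projection) returns a cursor that supports limit(n).
    """
    counts = Counter()
    # Project only the field of interest to keep the query light.
    for doc in collection.find({}, {field: 1}).limit(sample_size):
        # Documents that lack the field are counted as "NoneType".
        value = doc.get(field)
        counts[type(value).__name__] += 1
    return dict(counts)
```

Calling `field_type_counts(collection, "question")` on the theme collection above would show whether the stored values are `ObjectId` or `str`, which is the type that `pre_filter_dict` would need to line up with.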
Can someone help with this?