I am working on a small RAG project and am stuck (apparently because I don't know much about it yet).
I used mxbai-embed-large for embeddings and Chroma DB as the vector store. Everything works up to this point.
Issue: When I retrieve with a similarity score threshold, it returns 0 docs. Without a threshold and without setting k, it always returns 4 docs, no matter the query.
What is it that I am doing wrong?
Here is my code:
Vector Store Creation File:
# Load Docs and then store embeddings in the Chroma DB
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
embeddings = OllamaEmbeddings(
    base_url="http://43.204.231.131:11434",
    model="mxbai-embed-large",
)
loader = PyMuPDFLoader("./data/aliceShort.pdf")
data = loader.load()
# print(len(data))
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
chunks = text_splitter.split_documents(data)
print(f"Split {len(data)} documents into {len(chunks)} chunks.")
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_langchain_db")
query = "Who is Alice?"
docs = db.similarity_search(query)
print(docs[0].page_content)
Query File:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
embeddings = OllamaEmbeddings(
    base_url="http://65.2.37.27:11434",
    model="mxbai-embed-large",
)
db = Chroma(persist_directory="./chroma_langchain_db", embedding_function=embeddings)
query_text = "Who is Alice?"
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.1},
)
docs = retriever.invoke(query_text)
print(len(docs))
I tried to solve this by changing the embedding model, but the issue remains the same.
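
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the 4 docs match the default k=4 of similarity_search, and the 0 docs may come from how distances are turned into relevance scores. As far as I understand, the LangChain Chroma wrapper uses the "l2" space unless told otherwise and maps a distance d to a score of 1 - d / sqrt(2), so large distances from unnormalized embeddings would give negative scores that never pass a threshold of 0.1. The distance values below are hypothetical, and whether the Ollama embeddings here are unnormalized is an assumption. The helper names are made up for illustration.

```python
import math


def l2_relevance(distance: float) -> float:
    """Map an L2-style distance to a relevance score as 1 - d / sqrt(2)."""
    return 1.0 - distance / math.sqrt(2)


def cosine_relevance(distance: float) -> float:
    """Map a cosine distance (0 means identical) to a score as 1 - d."""
    return 1.0 - distance


def filter_by_threshold(scored, threshold):
    """Keep only (doc, score) pairs whose score is at least the threshold."""
    return [(doc, score) for doc, score in scored if score >= threshold]


# Hypothetical distances for three chunks; real values depend on the model.
l2_distances = {"chunk_a": 210.0, "chunk_b": 245.0, "chunk_c": 300.0}
cosine_distances = {"chunk_a": 0.35, "chunk_b": 0.52, "chunk_c": 0.95}

l2_scored = [(doc, l2_relevance(d)) for doc, d in l2_distances.items()]
cos_scored = [(doc, cosine_relevance(d)) for doc, d in cosine_distances.items()]

# With large L2 distances every score is negative, so nothing passes 0.1.
print(filter_by_threshold(l2_scored, 0.1))
# With cosine distances the scores fall between 0 and 1 for these values.
print(filter_by_threshold(cos_scored, 0.1))
```

If this is what is happening, printing the raw scores with db.similarity_search_with_relevance_scores(query_text) should show negative values, and rebuilding the store with collection_metadata={"hnsw:space": "cosine"} passed to Chroma.from_documents may be worth trying.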