I have a document with three attributes: tags, location, and text.
Currently, I am indexing all of them using LangChain/pgvector/embeddings.
The results are satisfactory, but I want to know whether there is a better approach. I need to find one or more documents with a specific tag and location, while the text can vary drastically and still mean the same thing, which is why I thought of using embeddings and a vector database.
Would this also be a case for using RAG (Retrieval-Augmented Generation) to "teach" the LLM some common abbreviations it doesn't know?
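For the abbreviation problem, one lightweight alternative to full RAG is expanding known abbreviations at query time, before the query string is embedded. A minimal sketch, assuming a hand-maintained glossary (the mapping below is hypothetical):

```python
# Hypothetical glossary: replace with abbreviations from your own domain.
ABBREVIATIONS = {
    "po": "purchase order",
    "wc": "work center",
}

def expand_abbreviations(query: str) -> str:
    """Replace each known abbreviation in the query with its full form."""
    return " ".join(ABBREVIATIONS.get(word.lower(), word) for word in query.split())

print(expand_abbreviations("PO status for WC 4100"))
# -> purchase order status for work center 4100
```

The expanded query is then passed to the similarity search instead of the raw user input, so the embedding sees the full terms.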
import pandas as pd
from langchain_core.documents import Document
from langchain_postgres import PGVector
from langchain_openai.embeddings import OpenAIEmbeddings
connection = "postgresql+psycopg://langchain:langchain@localhost:5432/langchain"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
collection_name = "notas_v0"
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)
### START INDEX
# df = pd.read_csv("notes.csv")
# df = df.dropna()  # .head(10000)
# df["tags"] = df["tags"].apply(
#     lambda x: [tag.strip() for tag in x.split(",") if tag.strip()]
# )
# long_texts = df["Texto Longo"].tolist()
# wc = df["Centro Trabalho Responsável"].tolist()
# notes = df["Nota"].tolist()
# tags = df["tags"].tolist()
# # Embed only the long text; work center, note number, and tags go
# # into metadata so they can be used as exact filters at query time.
# documents = [
#     Document(page_content=text, metadata={"wc": w, "note": n, "tags": t})
#     for text, w, n, t in zip(long_texts, wc, notes, tags)
# ]
# # Index in batches of 100 to keep each insert small.
# for i in range(0, len(documents), 100):
#     vectorstore.add_documents(documents=documents[i : i + 100])
# print("Done.")
### END INDEX
### BEGIN QUERY
result = vectorstore.similarity_search_with_relevance_scores(
    "EVTD202301222707",
    filter={"note": {"$in": ["15310116"]}, "tags": {"$in": ["abcd", "xyz"]}},
    k=10,  # limit on the number of results
)
### END QUERY
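The call returns a list of (document, score) pairs, so a minimum-score cutoff can be applied afterwards. A sketch using stand-in data shaped like those pairs (the documents and the 0.7 threshold are illustrative, not real results):

```python
# Stand-in for what similarity_search_with_relevance_scores returns:
# pairs of (document, relevance score). Data here is made up.
results = [
    ({"note": "15310116", "text": "Motor failure on pump 3"}, 0.82),
    ({"note": "15310117", "text": "Routine inspection"}, 0.41),
]

MIN_SCORE = 0.7  # illustrative threshold; tune against your own data

# Keep only hits above the threshold.
kept = [doc for doc, score in results if score >= MIN_SCORE]
print(kept)
# -> [{'note': '15310116', 'text': 'Motor failure on pump 3'}]
```

This kind of post-filtering can help when loosely related documents pass the metadata filter but score poorly on the text itself.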