I’m fairly new to RAG pipelines and feel a bit lost among the endless ways of building one. My goal is to write a script that turns about 60 anatomy PDFs into a vector store, uses it to answer questions about body parts, and returns references to the PDF pages the information was taken from.
My script so far looks like this because it is the only way I have managed to make it work:
import os
import faiss
import nest_asyncio
from dotenv import load_dotenv
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.vector_stores.faiss import FaissVectorStore
nest_asyncio.apply()
load_dotenv()
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])
Settings.callback_manager = callback_manager
save_dir = "./documents/vector_store"
d = 1536  # embedding dimension; must match the output size of the embedding model
faiss_index = faiss.IndexFlatL2(d)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
if not os.path.exists(save_dir):
    # First run: embed the documents, build the index, and persist it.
    print("Saving vector store to disk ...")
    documents = SimpleDirectoryReader("./documents/test/").load_data()
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
    )
    index.storage_context.persist(persist_dir=save_dir)
else:
    # Later runs: reload the persisted FAISS store and the index built on it.
    print("Loading vector store from disk...")
    vector_store = FaissVectorStore.from_persist_dir(save_dir)
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store, persist_dir=save_dir
    )
    index = load_index_from_storage(storage_context=storage_context)
vector_query_engine = index.as_query_engine(similarity_top_k=3)
response = vector_query_engine.query(
    "What is the diaphragm and what position does it occupy in the body?"
)
print(response)
for i, node in enumerate(response.source_nodes):
    metadata = node.node.metadata
    text_chunk = node.node.text
    page_label = metadata.get("page_label", "N/A")
    file_name = metadata.get("file_name", "N/A")
    print(f"Reference nr: {i+1}, Page: {page_label}, Document: {file_name}")
    print(f"Text Chunk: {text_chunk}\n")
And this is the (beginning of the) output:
Trace: query
    |_CBEventType.QUERY -> 2.734167 seconds
      |_CBEventType.RETRIEVE -> 0.417225 seconds
        |_CBEventType.EMBEDDING -> 0.417225 seconds
      |_CBEventType.SYNTHESIZE -> 2.316942 seconds
        |_CBEventType.TEMPLATING -> 0.0 seconds
        |_CBEventType.LLM -> 2.30051 seconds
**********
A diaphragm is a dome-shaped muscle that separates the thoracic cavity from the abdominal cavity. It is positioned below the lungs and heart, and above the liver, stomach, and other abdominal organs. The diaphragm is connected to the thoracic aorta, which supplies blood to the chest wall and thoracic organs, and the inferior vena cava, which returns blood from the lower body to the heart.
Reference nr: 1, Page: 317, Document: random_pdf.pdf
Text Chunk: even during sleep, and must have a constant flow of
blood to supply oxygen and remove waste products.For this reason there are four vessels that bring bloodto the circle of Willis. From this anastomosis, severalpaired arteries (the cerebral arteries) extend into thebrain itself.
The thoracic aorta and its branches supply the
chest wall and the organs within the thoracic cavity.These vessels are listed in T able 13–1.
The abdominal aorta gives rise to arteries that sup-ply the abdominal wall and organs and to the common
iliac arteries, which continue into the legs. Notice inFig. 13–3 that the common iliac artery becomes theexternal iliac artery, which becomes the femoral artery,which becomes the popliteal artery; the same vesselhas different names based on location. These vesselsare also listed in T able 13–1 (see Box 13–3: PulseSites).
The systemic veins drain blood from organs or
parts of the body and often parallel their correspond-The Vascular System 299
Figure 13–5. Arteries and veins of the head and neck shown in right lateral view. Veins
are labeled on the left. Arteries are labeled on the right.
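To judge the relevance of references like the one above, it may help to print the retrieval score next to each chunk. The following is only a minimal sketch: `FakeNode` and `FakeNodeWithScore` are hypothetical stand-ins that mimic the `.node.metadata`, `.node.text`, and `.score` attributes used in the script, so the snippet runs without an index, and `format_reference` and `preview_chars` are illustrative names, not part of llama-index.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for a retrieved text node."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class FakeNodeWithScore:
    """Hypothetical stand-in for one entry of response.source_nodes."""
    node: FakeNode
    score: Optional[float] = None


def format_reference(i, item, preview_chars=80):
    """Return a one-entry summary: score, page, file, and a short text preview."""
    meta = item.node.metadata
    page = meta.get("page_label", "N/A")
    name = meta.get("file_name", "N/A")
    score = "n/a" if item.score is None else f"{item.score:.3f}"
    # Collapse whitespace so line breaks from the PDF do not clutter the preview.
    preview = " ".join(item.node.text.split())[:preview_chars]
    return (
        f"Reference nr: {i}, Score: {score}, Page: {page}, Document: {name}\n"
        f"  {preview}"
    )


# Made-up example data, only to show the output shape.
source_nodes = [
    FakeNodeWithScore(
        FakeNode(
            "The thoracic aorta and its branches supply the\nchest wall",
            {"page_label": "317", "file_name": "random_pdf.pdf"},
        ),
        score=0.42,
    )
]

for i, item in enumerate(source_nodes, start=1):
    print(format_reference(i, item))
```

With a real response, `response.source_nodes` would take the place of the made-up list. How the score should be read depends on the vector store; with an L2 index it may be a distance, where lower means closer.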
I have two questions:
- On a more theoretical level: I thought a RAG pipeline needed, in very simplified terms, 1) embedding of the chunks, 2) retrieval based on similarity, and 3) rephrasing of the answer by an LLM. However, this script works fairly well while apparently skipping both 1 and 3, so am I missing the point? Or does llama-index abstract away much of the implementation?
- On a practical level: how do I improve on this? The script works in the sense that it usually outputs reasonable answers, but the text in “source_nodes” is sometimes very unsatisfactory in terms of relevance.
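For reference, the three steps from the first question can be sketched without any library. This is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, `synthesize` only builds the prompt that would be sent to an LLM, and none of it reflects how llama-index implements these steps internally.

```python
import math
from collections import Counter


def embed(text):
    """Step 1 (toy): turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Embed every chunk once, keeping the chunk (and its page) alongside."""
    return [(embed(chunk["text"]), chunk) for chunk in chunks]


def retrieve(index, question, top_k=2):
    """Step 2: rank chunks by similarity to the question embedding."""
    q = embed(question)
    scored = [(cosine(q, vec), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def synthesize(question, hits):
    """Step 3 (placeholder): build the prompt an LLM would answer from."""
    context = "\n".join(chunk["text"] for _, chunk in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Made-up chunks with page numbers, standing in for parsed PDF pages.
chunks = [
    {"text": "the diaphragm separates the thoracic cavity from the abdominal cavity", "page": 1},
    {"text": "the femoral artery becomes the popliteal artery", "page": 2},
]
index = build_index(chunks)
hits = retrieve(index, "where is the diaphragm", top_k=1)
print(synthesize("where is the diaphragm", hits))
print("Pages used:", [chunk["page"] for _, chunk in hits])
```

Keeping the page number next to each chunk is what allows the references to be returned together with the answer.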
Any help/guidance or resources would be super appreciated!