Hi everyone. I am currently working on a LangChain RAG framework using Ollama, and I have a question about the chunk size in the document splitter.
I have decided to use the qwen2:72b model as both the embedding model and the LLM. Here is a snapshot of the qwen2:72b model information in Ollama.
We can see that the context window is 32768 tokens and the embedding length is 8192.
And here is my code to construct the vector database.
model_name = "qwen2:72b"
content_windows = 10000
chunk_overlap = content_windows // 100
vdbase_dir = "chroma_db_" + model_name + "-" + str(content_windows // 1000) + "k"
markdown_path = "data/output.md"
loader = UnstructuredMarkdownLoader(markdown_path)
documents = loader.load()
# Split loaded documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=content_windows, chunk_overlap=chunk_overlap)
docs = text_splitter.split_documents(documents)
# Initialize Embeddings
embed_model = OllamaEmbeddings(model = model_name)
# Create and persist a Chroma vector database from the chunked documents
vs = Chroma.from_documents(
documents=docs,
embedding=embed_model,
persist_directory = vdbase_dir, # Local mode with in-memory storage only
collection_name="rag"
)
My question is how to determine the best chunk_size here. As far as I know, chunk_size should be as large as possible, but the retrieved chunks combined with the prompt tokens and output tokens should not exceed the context window of the LLM.
Let's assume the prompt and output tokens total no more than 1000. So how large should chunk_size be: 32768 - 1000 = 31768 (we could use 30000 tokens) or 8192 - 1000 = 7192 (we could use 7000 tokens)?
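One detail I am aware of: chunk_size in RecursiveCharacterTextSplitter is measured with len() (characters) by default, not tokens, so to reason in tokens the splitter can be built from a tokenizer instead. Below is a minimal sketch of how the chunk size could be budgeted in tokens; the values of reserved_tokens and top_k are just placeholder assumptions, and cl100k_base is only a rough stand-in for qwen2's actual tokenizer.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder budget: reserve some tokens for the prompt and the answer,
# then split the remainder across the k chunks that will be retrieved.
context_limit = 32768      # assumed LLM context window
reserved_tokens = 1000     # assumed prompt + output budget
top_k = 4                  # assumed number of retrieved chunks per query
token_budget = (context_limit - reserved_tokens) // top_k

# from_tiktoken_encoder makes chunk_size count tokens rather than characters;
# cl100k_base is an approximation, not qwen2's real tokenizer.
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=token_budget,
    chunk_overlap=token_budget // 100,
)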
Another question is whether we can use different embedding models and large language models in a RAG framework. For example, could we use llama3:70b as the embedding model and qwen2:72b as the LLM, since qwen2:72b has a longer context window? What are the good practices in a RAG framework?
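For concreteness, here is a minimal sketch of what I mean by mixing the two models, assuming the langchain_community integrations; the k value is just a placeholder, and the vector database would of course have to be rebuilt with the llama3:70b embeddings first.

from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# One model embeds the chunks and, at query time, the question itself;
# it must be the same model that was used when the database was built.
embed_model = OllamaEmbeddings(model="llama3:70b")
vs = Chroma(
    persist_directory=vdbase_dir,
    embedding_function=embed_model,
    collection_name="rag",
)
retriever = vs.as_retriever(search_kwargs={"k": 4})

# A different, longer-context model generates the final answer.
llm = ChatOllama(model="qwen2:72b")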