I’m currently working on an example where I am trying to embed this document, https://www.gutenberg.org/files/1727/1727-h/1727-h.htm#chap24, using an Ollama model and the Chroma vector database. My code is as follows:
import time
from langchain_community.llms import Ollama
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# Initialize Ollama model
ollama = Ollama(
    base_url="http://localhost:11434",
    model="llama3.1",
)
# Load the Odyssey by Homer from Project Gutenberg
loader = WebBaseLoader("https://www.gutenberg.org/files/1727/1727-h/1727-h.htm")
data = loader.load()
# Split the text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
all_splits = text_splitter.split_documents(data)
# Option to use a smaller subset of the data for testing
subset_splits = all_splits[:100]  # Adjust the number for testing
# Initialize the embedding model and vector database
embeded = OllamaEmbeddings(model="nomic-embed-text")
# Adding logging to monitor progress
start_time = time.time()
print("Starting embedding process...")
vectorstore = Chroma.from_documents(documents=subset_splits, embedding=embeded)
end_time = time.time()
print(f"Embedding process completed in {end_time - start_time} seconds")
I am running it in a Jupyter notebook.
The problem I’m encountering is that the embedding process is taking an excessively long time, even for the 100-chunk subset. I started the process, and it has been running for over 2553 seconds (about 42 minutes) without completing.
Could there be any specific bottlenecks or limitations with the models I’m using?
My CPU usage stays below 20% the whole time.
- CPU: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
- RAM: 16 GB
- Graphics Card: NVIDIA GeForce RTX 3050 Laptop GPU
- Graphics Card Memory: 4 GB
- Integrated Graphics Card: Intel(R) UHD Graphics
- Integrated Graphics Card Memory: 1 GB
- Python Version: 3.12.4
- Langchain Community Version: 0.2.10
- Operating System: Microsoft Windows 10 [Version 10.0.22631.3958]
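To narrow down where the time goes, one thing I could try is embedding the chunks in small batches and timing each batch separately, instead of waiting on a single Chroma.from_documents call. This is only a minimal sketch: timed_batches and fake_embed are hypothetical helpers I made up for illustration, and fake_embed is a stand-in so the snippet runs without an Ollama server.

```python
import time
from typing import Callable, List, Sequence


def timed_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 10,
) -> List[float]:
    """Embed texts batch by batch and return the seconds each batch took."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    durations: List[float] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        t0 = time.perf_counter()
        vectors = embed_fn(batch)
        elapsed = time.perf_counter() - t0
        # Sanity check: the embedder should return one vector per input text.
        if len(vectors) != len(batch):
            raise RuntimeError("embedder returned an unexpected number of vectors")
        durations.append(elapsed)
        print(f"texts {start}-{start + len(batch) - 1}: {elapsed:.3f} s")
    return durations


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in embedder (assumption): one-dimensional "vectors"
    # based on text length, used only so the sketch runs offline.
    return [[float(len(text))] for text in batch]


durations = timed_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
```

With my actual setup I would presumably pass embeded.embed_documents as embed_fn and [d.page_content for d in subset_splits] as texts. If even the first small batch never returns, the stall is more likely in the Ollama call itself than in Chroma.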