I successfully embedded a 400-page PDF document in 1-2 hours. However, when I tried to embed a CSV file with about 40k rows and only one column, the estimated embedding time was approximately 24 hours.
Here is the code I used:
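# Imports are assumed here; the exact module paths may differ by LangChain version
from langchain_community.document_loaders import CSVLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Embedding model served locally by Ollama
embedder = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

# Load the CSV file
file_path = 'filtered_combined_info.csv'
loader = CSVLoader(
    file_path=file_path,
    encoding='utf-8',  # or 'ISO-8859-1' if utf-8 doesn't work
    autodetect_encoding=False,  # Set to True if you want to attempt autodetection
)
data = loader.load()

# Split the loaded documents into chunks of at most 500 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(data)

# Embed all chunks and persist them in a Chroma database
persist_directory = 'db'
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embedder,
    persist_directory=persist_directory,
)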
Why is the embedding process for the CSV file taking significantly longer than for the PDF file? Are there any optimizations or changes I can make to reduce the embedding time for the CSV file?
Additionally, is there anything I am doing wrong that might be causing it to take so much time?
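
To help narrow this down, here is a rough sketch for estimating how many chunks the CSV turns into and how long embedding them would take. It uses only the standard library and rests on two assumptions: that CSVLoader produces one document per row with the content formatted as "column: value" lines, and that a row longer than chunk_size is cut into roughly len / chunk_size pieces (the real RecursiveCharacterTextSplitter may split differently). The names estimate_embedding_calls and estimate_hours are hypothetical helpers, not part of LangChain.

```python
import csv
import io
import math


def estimate_embedding_calls(csv_text: str, chunk_size: int = 500) -> dict:
    """Roughly estimate rows and chunks for a one-document-per-row CSV load."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = 0
    chunks = 0
    for row in reader:
        # Assumption: each row becomes one document made of "column: value" lines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        rows += 1
        # Approximation: a long row is cut into ceil(len / chunk_size) pieces,
        # and every row yields at least one chunk.
        chunks += max(1, math.ceil(len(content) / chunk_size))
    return {"rows": rows, "chunks": chunks}


def estimate_hours(chunks: int, seconds_per_chunk: float) -> float:
    """Convert a chunk count and a measured per-chunk latency into hours."""
    return chunks * seconds_per_chunk / 3600
```

Reading filtered_combined_info.csv as text and passing it to estimate_embedding_calls would give the chunk count to compare against the PDF. For reference, 24 hours spread over about 40k rows works out to roughly 2 seconds per row, so timing a single embedding call would show whether the estimate is driven by the number of chunks or by the latency of each call.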