“I have a PDF file as my LangChain external database, which contains text content and corresponding URLs. For example, the text looks like this:
‘To solve the problem,
step 1 You can first click the button on the right corner,
step 2 then open the link: https://xxxxxx, to fill the form,
step 3 then send the form to us
‘
After using this PDF as an external database, my model tends to ignore the URLs in the PDF and only extract the text to answer my questions. I’m not sure if this is an embedding issue or a problem with the text splitter.
My model is ‘TheBloke/Llama-2-13B-chat-GGUF’ and the model basename is ‘llama-2-13b-chat.Q5_0.gguf’ (the model is in bin format).”
This is how I process the PDF file:
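# Imports assume a recent LangChain layout; older releases expose these under `langchain.*`.
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_vector_db():
    # Load every PDF in DATA_PATH (PyPDFLoader yields one Document per page).
    loader = DirectoryLoader(DATA_PATH, glob='*.pdf', loader_cls=PyPDFLoader)
    documents = loader.load()
    # Split pages into 500-character chunks with a 50-character overlap.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    texts = text_splitter.split_documents(documents)
    # Embed the chunks and save the FAISS index locally (API key removed here).
    embeddings_model = OpenAIEmbeddings(api_key="")
    db = FAISS.from_documents(texts, embeddings_model)
    db.save_local(DB_FAISS_PATH)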
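To narrow down whether the URLs are lost before embedding, one option is to check whether they are still present in the chunk text at all. The sketch below is only an illustration: `find_urls`, `url_report`, and the regular expression are hypothetical helpers of mine (not LangChain APIs), and `sample_chunks` with its `https://example.com/form` link stands in for the real chunks, which would be `[doc.page_content for doc in texts]` in the function above.

```python
import re

# Hypothetical helper pattern (not a LangChain API): loosely matches http(s) URLs
# and stops at whitespace or common trailing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s,)>\]]+")

def find_urls(text):
    """Return every http(s) URL found in a piece of text."""
    return URL_PATTERN.findall(text)

def url_report(chunk_texts):
    """Map chunk index -> URLs found in that chunk; chunks without URLs are skipped."""
    report = {}
    for index, text in enumerate(chunk_texts):
        urls = find_urls(text)
        if urls:
            report[index] = urls
    return report

# Stand-in chunks; in the real pipeline use [doc.page_content for doc in texts].
sample_chunks = [
    "step 1 You can first click the button on the right corner,",
    "step 2 then open the link: https://example.com/form, to fill the form,",
    "step 3 then send the form to us",
]
print(url_report(sample_chunks))
```

If the report comes back empty for the real chunks, the URLs probably never made it out of the PDF as text (for example, if they are stored as link annotations), so neither the splitter nor the embeddings would be the cause. If the URLs are there, the retrieval step or the prompt given to the model would be the next place to look.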