I followed this tutorial (https://youtu.be/2TJxpyO3ei4) on setting up RAG (retrieval-augmented generation) with LLMs, using a local embedding model and a local model for queries. I want my data folder to also supply documents from HTML files (or, preferably, from links). I believe this page (https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/html/) covers it, but I don’t know how to add it to my existing code that loads the documents.
Here is the code (that works correctly for PDFs, and now I want to add HTML files/links):
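```python
from langchain_community.document_loaders import PyPDFDirectoryLoader

def load_documents():
    # Load every PDF found in DATA_PATH (defined elsewhere in the script)
    document_loader = PyPDFDirectoryLoader(DATA_PATH)
    return document_loader.load()
```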
I tried swapping document_loader for a loader that reads HTML, but then the PDFs no longer loaded properly. I also don’t know how to load pages from online links.
I’m fairly certain the answer involves this: `loader = UnstructuredHTMLLoader("example_data/fake-content.html")`
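My rough guess at how the pieces might fit together is below. It is only a sketch and I have not confirmed it against my setup. The idea is to keep the PDF loader and add more loaders next to it instead of replacing it, then concatenate everything they return. The names `build_loaders`, `load_all`, the `urls` parameter, and the `"data"` folder name are my own placeholders. I am also assuming that `WebBaseLoader` is the right loader for links (it needs `beautifulsoup4`) and that `UnstructuredHTMLLoader` needs the `unstructured` package installed.

```python
from pathlib import Path

DATA_PATH = "data"  # assumed folder name; use whatever the existing script defines


def build_loaders(data_path, urls=()):
    """Return one loader per source: the PDF folder, each HTML file, and the URLs."""
    # Imported inside the function so the helpers below work without LangChain.
    from langchain_community.document_loaders import (
        PyPDFDirectoryLoader,
        UnstructuredHTMLLoader,
        WebBaseLoader,
    )

    # Keep the original PDF loader so PDFs still load as before.
    loaders = [PyPDFDirectoryLoader(data_path)]
    # One UnstructuredHTMLLoader per local .html file in the same folder.
    for html_file in sorted(Path(data_path).glob("*.html")):
        loaders.append(UnstructuredHTMLLoader(str(html_file)))
    # WebBaseLoader fetches pages over HTTP and accepts a list of URLs.
    if urls:
        loaders.append(WebBaseLoader(list(urls)))
    return loaders


def load_all(loaders):
    """Call .load() on every loader and concatenate the resulting documents."""
    documents = []
    for loader in loaders:
        documents.extend(loader.load())
    return documents


def load_documents(urls=()):
    # Same entry point as before, now covering PDFs, HTML files, and links.
    return load_all(build_loaders(DATA_PATH, urls))
```

If this is roughly right, I would call it as `load_documents(urls=["https://example.com/some-page"])` (a placeholder URL) and pass the result to the same splitting and embedding steps I already use for PDFs.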