“unstructured” and LangChain’s “HTMLHeaderTextSplitter” ignore “pre” and/or “code” HTML tags
I want to read a webpage and split it into chunks to feed a vector database in a RAG pipeline. The webpage contains Python code examples, but I cannot create chunks that include that code text because the splitters ignore it. I tried both the unstructured Python package and the HTMLHeaderTextSplitter class (from the langchain_text_splitters package), with the same result: the content of the “pre” and “code” tags is missing from the chunks.
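
To make the desired behaviour concrete, here is a minimal sketch of header-based splitting that keeps the text inside "pre" and "code" tags. It uses only the standard library's `html.parser`, not either of the two packages above, and the names `HeaderChunker` and `split_html_by_headers` are hypothetical, introduced just for this illustration. It assumes that splitting on `h1` to `h3` is enough and that code whitespace should be kept as-is.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3"}   # assumption: split only on these levels
SKIP_TAGS = {"script", "style"}    # content that should never reach a chunk


class HeaderChunker(HTMLParser):
    """Group all text, including <pre>/<code>, under the nearest preceding header."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []           # finished chunks: {"header": str, "text": str}
        self._header = ""          # header of the chunk being built
        self._header_parts = []    # text fragments of the header being read
        self._parts = []           # text fragments of the current chunk body
        self._in_header = False
        self._pre_depth = 0        # > 0 while inside <pre>
        self._skip_depth = 0       # > 0 while inside <script>/<style>

    def _flush(self):
        # Emit the current chunk if it has any non-whitespace text.
        text = "".join(self._parts).strip()
        if text:
            self.chunks.append({"header": self._header, "text": text})
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in HEADER_TAGS:
            self._flush()          # a new header closes the previous chunk
            self._header_parts = []
            self._in_header = True
        elif tag == "pre":
            self._pre_depth += 1
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in HEADER_TAGS:
            self._header = " ".join("".join(self._header_parts).split())
            self._in_header = False
        elif tag == "pre" and self._pre_depth:
            self._pre_depth -= 1
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_header:
            self._header_parts.append(data)
        elif self._pre_depth:
            self._parts.append(data)   # keep code whitespace untouched
        elif data.strip():
            self._parts.append(" ".join(data.split()) + " ")

    def close(self):
        super().close()
        self._flush()              # emit the last chunk


def split_html_by_headers(html):
    parser = HeaderChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = (
        "<h1>Intro</h1><p>Some text.</p>"
        "<pre><code>print('hi')\nx = 1</code></pre>"
        "<h2>Next</h2><p>More.</p>"
    )
    for chunk in split_html_by_headers(sample):
        print(chunk)
```

Each resulting dictionary could then be passed to the embedding step, with the header kept as metadata. This is only one possible approach; it does not explain why the two libraries drop the code text.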