I want to read a webpage and split it into chunks to feed a vector database in a RAG pipeline. The page contains Python code examples, but I cannot get chunks containing that code text: it is ignored by the splitters. I tried both the unstructured Python package and the HTMLHeaderTextSplitter class (from the langchain_text_splitters package), with the same result.
The HTML code I want to parse looks like this:
...
<h2 id="examples_1">Examples</h2>
<h3 id="create-camera">Create camera</h3>
<p>Create a camera. Setting invalid_entity_id as the parent entity will make the camera to be created under the Ego entity, as it must be</p>
<pre><code class="language-python">camera_id = workspace.create_entity( anyverse_platform.WorkspaceEntityType.Camera, "New Camera", anyverse_platform.invalid_entity_id )
</code></pre>
<hr>
<h3 id="add-resource-to-workspace">Add resource to workspace</h3>
...
The script based on the “unstructured” package that I use to split the webpage into chunks is this:
import json
import os

from unstructured.partition.html import partition_html

elements = partition_html(url=web_path)
element_dict = [el.to_dict() for el in elements]

output_path = os.path.join(output_dir_documentation, 'unstructured.json')
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(json.dumps(element_dict, indent=2))
The JSON result corresponding to the piece of HTML above:
{
  "type": "Title",
  "element_id": "e68ee04dff59551b7d1ae07a2f8a00dc",
  "text": "Examples",
  "metadata": {
    "category_depth": 1,
    "page_number": 1,
    "languages": [
      "eng"
    ],
    "parent_id": "2253b75dcb33b928dae76ea64543f053",
    "url": "https://anyverse.gitlab.io/anyversestudio/",
    "filetype": "text/html"
  }
},
{
  "type": "Title",
  "element_id": "534a8b35bbd7e5f0b6006d63efe887a9",
  "text": "Create camera",
  "metadata": {
    "category_depth": 2,
    "page_number": 1,
    "languages": [
      "eng"
    ],
    "parent_id": "e68ee04dff59551b7d1ae07a2f8a00dc",
    "url": "https://anyverse.gitlab.io/anyversestudio/",
    "filetype": "text/html"
  }
},
{
  "type": "NarrativeText",
  "element_id": "ae9df01594a27733b24d33ca212f2e66",
  "text": "Create a camera. Setting invalid_entity_id as the parent entity will make the camera to be created under the Ego entity, as it must be",
  "metadata": {
    "page_number": 1,
    "languages": [
      "eng"
    ],
    "parent_id": "534a8b35bbd7e5f0b6006d63efe887a9",
    "url": "https://anyverse.gitlab.io/anyversestudio/",
    "filetype": "text/html"
  }
},
{
  "type": "Title",
  "element_id": "ec7fe9ae1c7f315580886ee99e826f3c",
  "text": "Add resource to workspace",
  "metadata": {
    "category_depth": 2,
    "page_number": 2,
    "languages": [
      "eng"
    ],
    "parent_id": "e68ee04dff59551b7d1ae07a2f8a00dc",
    "url": "https://anyverse.gitlab.io/anyversestudio/",
    "filetype": "text/html"
  }
},
As you can see, the Python code text is missing.
I also tried langchain for this:
import os

from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[
        ("h1", "Header1"),
        ("h2", "Header2"),
        ("h3", "Header3"),
    ]
)
chunks = splitter.split_text_from_url(web_path)

for index, chunk in enumerate(chunks):
    output_path = os.path.join(output_dir_documentation, f'{index}.txt')
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(str(chunk))
The corresponding chunk for the “Create camera” header is:
page_content='Create a camera. Setting invalid_entity_id as the parent entity will make the camera to be created under the Ego entity, as it must be' metadata={'Header1': 'Scripting', 'Header2': 'Examples', 'Header3': 'Create camera'}
No text from the pre/code HTML tags appears. Of course, I checked all the generated chunks for the webpage, and no text under pre/code tags is in any of them.
What am I missing here? How can I tune partition_html and/or HTMLHeaderTextSplitter in order to get the text under pre/code HTML tags?
NOTE: I found out that using BeautifulSoup I can get the missing text from the “pre” tags, but this complicates the chunking too much, because I just need the headers (h1, h2, etc.) as the chunking condition. Chunking the non-code part first, then extracting the code part, and finally merging both somehow doesn’t seem the way to go. Specialized tools like langchain and unstructured should be able to handle these pre and/or code HTML tags.
The BeautifulSoup code:
import requests
from bs4 import BeautifulSoup

data = requests.get(web_path)
soup = BeautifulSoup(data.text, 'html.parser')
content = soup.find_all("pre")
for pre in content:
    print(pre.get_text())
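For completeness, this is roughly the kind of merge I mean: a pure-BeautifulSoup chunker keyed on the headers, sketched over a trimmed copy of the sample HTML above (not run against the real page):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML above, inlined so the sketch is self-contained
html = """
<h2 id="examples_1">Examples</h2>
<h3 id="create-camera">Create camera</h3>
<p>Create a camera.</p>
<pre><code class="language-python">camera_id = workspace.create_entity(...)</code></pre>
"""

soup = BeautifulSoup(html, "html.parser")
chunks = []
current = None
# Start a new chunk at every header; append paragraph and code text to it
for tag in soup.find_all(["h1", "h2", "h3", "p", "pre"]):
    if tag.name in ("h1", "h2", "h3"):
        if current is not None:
            chunks.append(current)
        current = {"header": tag.get_text(strip=True), "text": ""}
    elif current is not None:
        current["text"] += tag.get_text() + "\n"
if current is not None:
    chunks.append(current)
```

It works on the sample, but it duplicates what the specialized splitters already do, which is exactly what I was hoping to avoid.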