I am using BeautifulSoup4 to parse an HTML string into a structured object.
For each HTML element (e.g. soup.body.title), I want an attribute called embed (e.g. soup.body.title.embed), so I created subclasses of Tag and BeautifulSoup that add the embed attribute.
However, there is a problem. The root object is of type EmbedSoup, as intended, but soup.body is of type bs4.element.Tag instead of EmbedTag.
How can I make sure that every element in the BeautifulSoup tree is an EmbedTag rather than a bs4.element.Tag? Or is there another way to solve this problem?
from bs4 import BeautifulSoup
from bs4.element import Tag


class EmbedTag(Tag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed = None  # Initialize the embed attribute to None


class EmbedSoup(BeautifulSoup):
    def __init__(self, *args, **kwargs):
        kwargs['element_classes'] = {'tag': EmbedTag}
        super().__init__(*args, **kwargs)


# Sample input, assumed here for illustration only
html_content = "<html><body><p>Hello</p></body></html>"

# Parse the HTML with the custom BeautifulSoup class
soup = EmbedSoup(html_content, 'html.parser')

type(soup)       # --> EmbedSoup
type(soup.body)  # --> bs4.element.Tag
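
A possible explanation, offered as a sketch rather than a confirmed answer: the Beautiful Soup documentation describes element_classes as a dictionary keyed by the bs4 classes themselves (such as Tag or NavigableString), not by strings, so an entry stored under the string 'tag' would never be looked up. Assuming that is the cause, a minimal variant keyed on the Tag class might look like this. The sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


class EmbedTag(Tag):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.embed = None  # extra per-element attribute


class EmbedSoup(BeautifulSoup):
    def __init__(self, *args, **kwargs):
        # Key the mapping by the Tag class itself, not by the string 'tag'.
        kwargs['element_classes'] = {Tag: EmbedTag}
        super().__init__(*args, **kwargs)


# Hypothetical sample input
html_content = "<html><body><p>Hello</p></body></html>"
soup = EmbedSoup(html_content, 'html.parser')

print(type(soup).__name__)       # EmbedSoup
print(type(soup.body).__name__)  # EmbedTag
print(soup.body.embed)           # None
```

With this change, elements created by the parser should be instances of EmbedTag, so soup.body.embed resolves to the instance attribute instead of triggering a search for a child tag named embed. The root object is still an EmbedSoup, which does not go through EmbedTag.__init__, so it would need its own embed attribute if one is wanted there.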