By default, it seems that html.parser.HTMLParser cannot handle self-closing tags correctly unless they are terminated with /. For example, it handles <img src="asdf"/> fine, but it mishandles <img src="asdf"> by treating it as a non-self-closing tag: handle_starttag is called for it, but handle_endtag never is.
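The difference can be seen in isolation with a small sketch that records which handlers fire for each form of the tag. EventRecorder and record are names made up for this illustration. It relies on the documented default of handle_startendtag, which calls handle_starttag and then handle_endtag.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: collects (event, tag) pairs instead of printing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record(markup):
    """Parse markup and return the list of recorded handler events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# With the trailing slash, both handlers fire for the same tag.
print(record('<img src="asdf"/>'))
# Without it, only the start handler fires, so a depth counter never comes back down.
print(record('<img src="asdf">'))
```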
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        print('|' * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


html_content = """
<html>
<head>
  <title>test</title>
</head>
<body>
  <div>
    <img src="https://closed.example.com" />
    <div>1</div>
    <div>2</div>
    <img src="https://unclosed.example.com">
    <div>3</div> <!-- will be indented too far -->
    <div>4</div>
  </div>
</body>
</html>
"""

parser = MyHTMLParser()
parser.feed(html_content)
Is there a way to change this behaviour so that self-closing tags without a slash are handled correctly, or is there a workaround?
For context: I'm writing a script for an environment where I only have access to a pure Python interpreter and can only use the standard library; I cannot install any third-party packages.
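One possible workaround, sketched here under the assumption that only the HTML void elements (those that never take an end tag) need special treatment, is to skip the depth bookkeeping for them entirely. VOID_ELEMENTS, DepthParser, and outline are names invented for this sketch, not part of html.parser. The set lists the void elements from the HTML standard.

```python
from html.parser import HTMLParser

# Assumption: these are the only tags that may appear without an end tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class DepthParser(HTMLParser):
    """Hypothetical variant of the parser above that ignores void elements for depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("|" * self.depth + tag)
        # Void elements have no content, so they never open a new level.
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Also reached via handle_startendtag for <img ... />; skip it there too.
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


def outline(markup):
    """Return the indented tag outline for markup as a list of strings."""
    parser = DepthParser()
    parser.feed(markup)
    parser.close()
    return parser.lines


print("\n".join(outline('<div><img src="a"><div>3</div><img src="b"/><div>4</div></div>')))
```

Because both handlers skip void elements, the slashed and unslashed forms produce the same depth. This does not address other elements whose end tags HTML allows to be omitted, such as li or p.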