I’ve heard that parsing HTML the Cthulhu way is not a good idea. But what are the right ways to parse HTML? Or is it possible to parse it at all?
Or is it possible to parse it at all?
Some say it’s possible, and that even web browsers do it in order to display web pages.
what are the right ways to parse HTML?
Basically, you need a parser that can express the idea that an HTML element can be composed of other HTML elements.
<div>
some text
<div>
nested element!!
</div> <!--a regular expression cannot tell if this closes the first or second div-->
</div>
This cannot be done with regular expressions, because they cannot keep track of arbitrarily deep nesting. But you can do it with more general kinds of parsers.
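As a rough illustration of what a more general parser gives you, here is a minimal sketch using Python's standard-library html.parser module. The class name DivDepthTracker and the helper div_events are made up for this example. It only counts div nesting and does not build a full document tree, so treat it as a demonstration of the idea rather than a complete HTML parser.

```python
from html.parser import HTMLParser


class DivDepthTracker(HTMLParser):
    """Hypothetical helper: records the depth at which each <div> opens and closes."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open <div> elements
        self.events = []    # list of (kind, depth) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.events.append(("open", self.depth))

    def handle_endtag(self, tag):
        if tag == "div":
            # The closing tag belongs to the most recently opened div.
            self.events.append(("close", self.depth))
            self.depth -= 1


def div_events(html):
    """Feed the markup to the tracker and return the recorded events."""
    tracker = DivDepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.events


sample = """<div>
some text
<div>
nested element!!
</div> <!--this closes the second div-->
</div>"""

print(div_events(sample))
# [('open', 1), ('open', 2), ('close', 2), ('close', 1)]
```

Because the parser keeps state (here, a simple depth counter), it knows that the first closing tag belongs to the inner div. That bookkeeping is exactly what a single regular expression cannot do for arbitrary nesting.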
See https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not