I have often wondered why strict parsing was not chosen when HTML was created. For most of the Internet’s history, browsers have accepted any kind of markup and tried their best to parse it. The process degrades performance, permits people to write gibberish, and makes it difficult to discontinue obsolete features.
Is there a specific reason why HTML is not strictly parsed?
10
The reason is simple: at the time of the first graphical browsers, NCSA Mosaic and later Netscape Navigator, almost all HTML was written by hand. The browser authors (Netscape was built by ex-Mosaic folks) quickly recognized that refusing to render incorrect HTML would be held against them by the users, and voilà!
5
Because making best guesses is the right thing to do, from a browser-maker’s perspective. Consider the situation: ideally, the HTML you receive is completely correct and to spec. That’s great. But the interesting part is what happens when the HTML is not correct; since we’re dealing with input from a source we have no influence on, we have to be prepared for this. Now when that happens, what could we do? We have two options: a) fail, or b) make a best effort to recover from the error. If we fail, the user has nothing but a useless error message, and there is nothing they can do about it, because they don’t control the server. If we make a best effort, the user at least gets what we could make of the page, and often the guess is mostly right.
The only real problem with this is when you need the error messages, which is typically in a development situation – you want to make sure the HTML you generate is correct, and since “works in browser X” is not equivalent to “correct”, we can’t simply run it through a browser and see if it works: we can’t tell the difference between correct HTML and incorrect HTML that the browser has fixed for you. This is a solvable problem though; there are browser plugins that report standards violations, there’s the W3C validator, and lots of other similar tools.
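As an illustration, validation can even be scripted: the W3C’s Nu HTML Checker exposes a JSON API. A minimal sketch, assuming the third-party requests package and that the public service at validator.w3.org/nu/ accepts the request (it may rate-limit or require a descriptive User-Agent):

```python
import requests

html = '<!DOCTYPE html><title>Test</title><p align="left>oops'

resp = requests.post(
    "https://validator.w3.org/nu/?out=json",
    data=html.encode("utf-8"),
    headers={
        "Content-Type": "text/html; charset=utf-8",
        "User-Agent": "validator-sketch/0.1",  # made-up UA for the example
    },
)
for msg in resp.json().get("messages", []):
    # Each message carries a type ("error"/"info"), a location, and text.
    print(msg.get("type"), msg.get("lastLine"), msg.get("message"))
```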
4
HTML authors and authoring tools produce crappy markup. Browsers do their best with it for competitive reasons: a browser that fails to render most web pages in any reasonable way will be rejected by users, who won’t care in the least whose fault it is.
It’s rather different from what programming language implementations do. Compilers and interpreters work on code that can be assumed to be written by a programmer, whereas everyone and his brother can write HTML with minimal training, or none at all. HTML markup is code, in a sense, but it’s data rather than programming language instructions, and the (good) tradition in software is to be tolerant with data.
XHTML in principle imposes strict (XML) parsing rules, so that an XHTML document served with an XML content type will be displayed only if it is well-formed in the XML sense – otherwise, only the first error is communicated to the user. This never became popular in web authoring – almost all of the “XHTML” around is served as text/html and processed as traditional tag soup in a very liberal way, just with some new eccentricities.
8
The short of it is that HTML was based on another, non-hyperlinked markup language called SGML, often used for documentation, manuals, and the like.
From an article about the history of HTML:
Tim had mentioned that some of the early HTML documents were based on an old SGML language that CERN was already using: We have included in HTML some tags from the SGML tagset used at and once supported at CERN […] **The HTML parser will ignore tags which it does not understand, and will ignore attributes which it does not understand of CERN-SGML tags.**
[…] most of the early HTML tags were actually taken from the CERN SGMLGuid language, which itself was a variant of AAP (an early SGML language). For example, title, hn, p, ol and so on are all apparently taken from this language. The only radical change was the addition of the all-important anchor (<a>) link, without which the WWW wouldn’t have taken off.
Taking note of the part I’ve bolded: basically, they implemented a subset of the tags available in the SGML system they were familiar with, adding the new anchor <a> tag, and choosing to ignore any of the many tags they didn’t care about or wish to support for whatever reason (such as tags for bibliography lists, xmp for “example”, a “box” tag to draw a box around a block of text, etc.). So the simplest way to do that is to be forgiving of markup the parser doesn’t know and to ignore unknown markup as best as possible – regardless of whether the user typed bad markup, or an existing SGML document was converted to the new HTML format the quickest way possible, by adding some hyperlinks and leaving in whatever tags weren’t supported or implemented.
1
This is partially a historic remnant of the browser wars.
IE and Netscape were competing to take over the market and kept releasing new features that kept becoming more and more “awesome”, and each was forced to accept the pages designed for the other browser.
This means that browsers accept and ignore unknown tags silently. After the committees started getting involved… well, you have a committee designing stuff, and as a result a lot of different versions (with some ambiguously worded specs). Browsers want to support most of them, and creating a separate parser for each version would be enormous bloat, so it is (relatively) easier to use a single parser with different modes.
For another part, Netscape and IE wanted HTML to be accessible to the common man (as was the fad in those days), which meant trying to do what the author meant rather than exactly what they wrote, instead of tripping over every dangling tag.
Making the problem worse is that there are also several “tutorial” sites teaching the wrong thing and thinking they are right because what they teach works.
Ultimately this means that if you created a browser with only strict HTML parsing today, 99% of the sites out there would simply not work.
2
It was not a deliberate decision. The HTML standards didn’t explicitly state how to handle invalid HTML, so early browsers did what was simplest to implement: skip over any unrecognized syntax and ignore all problems.
But first let’s clarify what we are talking about when we call HTML parsing non-strict, because there are a few separate issues at play.
HTML supports a number of syntax shortcuts which may give the impression that the HTML parser is not very strict, but which actually follow precise and unambiguous rules. These shortcuts are inherited from SGML (which has even more of them) in order to cut down on boilerplate and make documents simpler to write by hand. For example (a parser sketch follows the list):
- Some elements are implied by context and can therefore be left out. For example, the <html>, <head> and <body> tags can be left out.
- Some elements are automatically closed. For example, the <p> element is automatically closed when another block-level element starts. The first HTML spec didn’t even have a </p> closing tag, since it was unnecessary.
- Attributes with predefined values can be shortened, e.g. instead of disabled="disabled" you can just write disabled.
- Attributes only need to be quoted if they contain non-alphanumeric characters. <input type=text> is fine.
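Here is the sketch referred to above, assuming the third-party beautifulsoup4 and html5lib packages (html5lib implements the standard HTML parsing algorithm); it shows a heavily minimized document expanding into a complete tree:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 html5lib

# A document leaning on every shortcut listed above.
shorthand = '<!DOCTYPE html><title>Hi</title><p>one<p>two<p><input type=text disabled>'

# html5lib follows the spec's parsing rules, shortcuts included.
soup = BeautifulSoup(shorthand, "html5lib")
print(soup.prettify())
# The output has explicit <html>, <head> and <body> elements, every <p>
# is closed, and the boolean attribute comes back as disabled="".
```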
Furthermore, unknown elements and attributes are ignored. This is a deliberate decision to make the language extensible in a backwards-compatible way. Early browsers implemented this by just skipping any token they didn’t recognize.
But in addition to this, there is the more controversial question of what to do with outright illegal structure like a <td> outside of a <table>, overlapping elements like <i>foo<b>bar</i>baz</b>, or mangled syntax like <p <p>. For a long time it was not specified how browsers should handle such malformed HTML. Browser makers therefore just did what was simplest to implement. The first browsers were very simple pieces of software. They didn’t have a DOM or anything like that; they basically processed tags as a linear sequence of formatting codes. HTML like <i>foo<b>bar</i>baz</b> can easily be processed in a linear way as [italic on]foo[bold on]bar[italic off]baz[bold off], even though it cannot be parsed into a valid DOM tree.
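A minimal sketch of that linear model, using only Python’s standard library (the bracketed codes are just illustrative, not any browser’s actual output):

```python
from html.parser import HTMLParser

class LinearRenderer(HTMLParser):
    """Processes tags as a stream of formatting codes; no tree is built,
    so overlapping elements pose no problem."""

    def handle_starttag(self, tag, attrs):
        print(f"[{tag} on]", end="")

    def handle_endtag(self, tag):
        # No check that this matches the last opened tag.
        print(f"[{tag} off]", end="")

    def handle_data(self, data):
        print(data, end="")

LinearRenderer().feed("<i>foo<b>bar</i>baz</b>")
print()  # -> [i on]foo[b on]bar[i off]baz[b off]
```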
Browsers were not able to validate HTML up front due to incremental rendering. It was an important feature to be able to render HTML as soon as it was received, since the internet was slow. If a browser received HTML like <h1>Hello</h2>, it would render Hello in h1 style before the </h2> end tag was received. Since the invalid end tag is only detected after the headline is rendered, it doesn’t make much sense to throw an error at this point. The simplest thing is just to ignore the unmatched end tag.
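The same stdlib parser can illustrate the incremental case: feed it the document in network-sized chunks, and the heading is already “rendered” before the bogus end tag ever arrives (a sketch, not real browser code):

```python
from html.parser import HTMLParser

class IncrementalRenderer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_data(self, data):
        # Render text immediately, in whatever style is currently open.
        style = self.open_tags[-1] if self.open_tags else "body"
        print(f"render {data!r} in {style} style")

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] != tag:
            print(f"mismatched </{tag}> - too late to complain, ignore it")
        elif self.open_tags:
            self.open_tags.pop()

r = IncrementalRenderer()
r.feed("<h1>Hello")   # first chunk: "Hello" is rendered in h1 style here
r.feed("</h2>")       # second chunk: the mismatch is detected after the fact
```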
Since unknown attributes should be ignored, the parser just skipped any unknown token in attribute position. In <p <p>, the second <p would be interpreted as an unknown attribute token and skipped. This turned out to be useful when XML-style syntax became fashionable, since you could write <br /> and the trailing slash would just be ignored by HTML parsers.
There is a persistent rumor that the “lenient” parsing of HTML was a deliberate feature in order to make it easier for beginner or non-technical authors. I believe this is an after-the-fact justification, not the real reason. The behavior of HTML parsers is much better explained by implementers always just choosing what was simplest to implement. For example, one of the most common syntax errors is mismatched quotes, like <p align="left>. In this case the parser will just scan until the next quote, regardless of how much content it has to skip. Large blocks of content, perhaps the entire rest of the document, may disappear without a trace. This is clearly not very helpful. It would be much more helpful if an unescaped > terminated the attribute. But scanning until the next matching quote was the simplest to implement, so this is what browsers did.
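You can reproduce that disappearing act with a spec-conformant parser (again assuming beautifulsoup4 and html5lib):

```python
from bs4 import BeautifulSoup

# The quote after align= is never closed, so the tokenizer scans ahead
# looking for one and consumes the entire rest of the document.
page = '<p align="left>First paragraph.</p><p>This vanishes as well.</p>'
soup = BeautifulSoup(page, "html5lib")
print(soup.body)
# Prints an (essentially) empty <body>: everything after the stray quote
# was swallowed as an attribute value and then dropped at end of input.
```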
Unfortunately, these early design decisions were hard to undo when the web took off, because it turns out that it is impossible to make parsing stricter after the fact. The problem is that if one browser introduces stricter parsing, some pages that work fine in other browsers will break in that browser, and then people will just abandon the browser which “doesn’t work”. For example, Netscape initially didn’t render a table at all if the closing </table> was missing. But Internet Explorer did render the table, which forced Netscape to change their parsing to also render it.
Early implementation decisions came back to bite the browser developers. For example, in the beginning it was simplest to just allow overlapping elements, but when the DOM was introduced, browsers had to implement complex rules for how to represent overlapping elements in a DOM tree. Different handling of invalid HTML became a source of incompatibilities between browsers. Eventually the authors of the HTML standard bit the bullet and specified in excruciating detail how to parse any form of invalid HTML. This specification is enormously complex, but at least that source of incompatibility is eliminated.
XHTML was an attempt to improve the situation by providing a strictly parsed version of HTML. But the attempt largely failed because XHTML didn’t provide any significant benefit to authors compared to HTML. More importantly, browser vendors were not enthusiastic about the effort.
But why are browser vendors not enthusiastic about a strictly parsed version of HTML? Surely it would make their work easier? The problem is that browsers would still need to be able to parse all the existing pages on the internet which are not going to be fixed any time soon. Supporting strict parsing in addition would just add a new mode to the HTML parser which is complex enough as it is, and for no significant benefit to the user.
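For contrast, this is what draconian XML-style handling looks like; a stdlib-only Python sketch where the first well-formedness error aborts the whole parse, which is exactly what XHTML-served-as-XML imposed on pages:

```python
import xml.etree.ElementTree as ET

try:
    # Overlapping elements are a fatal error under XML rules.
    ET.fromstring("<p>foo <i>bar <b>baz</i> quux</b></p>")
except ET.ParseError as err:
    print("fatal:", err)  # e.g. "mismatched tag: line 1, column ..."
```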
Well, we tried to establish a nice strict option in the 2000s, but it didn’t pan out, because people who followed “best practices” blindly blamed the browsers when their incorrect markup went to pieces in strict mode. And the browser vendors didn’t like being blamed.
They claimed it was because they wanted the web to be more accessible to non-professionals, but nobody was being stopped from using HTML 4 in its most lenient form.
That said, you can still serve HTML5 as XML if you desire strict-style parsing. IMO it can be a good way to reap the benefits of doing layout or UI work in a stricter mode before you pass it on to other people who may not want it as strict, without any real risks (barring them ripping the doctype out because they actually favor quirks mode – in 2017, the time of this edit, they should be shot). So it’s still there, basically, but do some research: I seem to recall there being some caveats that we didn’t have with XHTML, though they didn’t really impact layout work. Just don’t spread the word that it’s “the only way to do it right”, or the twits who buy into that kind of talk will dogpile the idea, blame the browsers again, and take the teeth out of the only strict alternative we have left. (2017 edit: I have no idea whether this still works – I gave up.)
http://mathiasbynens.be/notes/xhtml5
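The essential step is serving the page with the XML content type. A minimal sketch, assuming you’re happy with Python’s built-in HTTP server and a made-up port:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = (b'<?xml version="1.0" encoding="utf-8"?>\n'
        b'<!DOCTYPE html>\n'
        b'<html xmlns="http://www.w3.org/1999/xhtml">'
        b'<head><title>Strict</title></head>'
        b'<body><p>Well-formed or nothing.</p></body></html>')

class StrictHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # application/xhtml+xml makes browsers use their strict XML parser.
        self.send_header("Content-Type", "application/xhtml+xml; charset=utf-8")
        self.end_headers()
        self.wfile.write(PAGE)

HTTPServer(("localhost", 8000), StrictHandler).serve_forever()
```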