Python-based html5lib in Firefox's new HTML5 Parser
Last month, Firefox started shipping with a new HTML5 parser by default. Although it might not seem like much, it's another milestone in the journey of HTML5, since it replaces Gecko's old HTML parser. The transition was seamless, and the new parser follows the HTML5 spec's detailed definition of how documents are parsed. The work started with html5lib, a Python implementation of the WHATWG HTML5 spec (essentially a tokenizer, a parser, and a serializer), which was developed by James Graham and Anne van Kesteren.
html5lib, a Python library for parsing HTML, started in 2006 and has gone from version 0.1 to 0.9. Unlike some C libraries, it focuses less on performance and instead does a much better job of recognizing the wide variety of HTML found on the web. There is also a PHP implementation, as well as a Ruby port that hasn't been maintained for a while. The user documentation elaborates.
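As a rough illustration of how the library copes with sloppy markup, here is a minimal sketch. It assumes a current html5lib release installed from PyPI (the API of version 0.9 may differ), and the sample markup and variable names are made up for the example.

```python
import html5lib  # third-party package: pip install html5lib

# Deliberately sloppy markup: no doctype, no <html>/<head>/<body>,
# and neither the <p> elements nor the <b> element are closed.
messy = "<title>Demo</title><p>First<p>Second <b>bold"

# Build an ElementTree. namespaceHTMLElements=False keeps tag names plain
# ("p" instead of "{http://www.w3.org/1999/xhtml}p").
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("etree"),
    namespaceHTMLElements=False,
)
root = parser.parse(messy)  # the <html> element of the repaired document

# The parser has supplied the missing structure and closed the open elements.
title = root.find("head/title").text
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]

# Problems found along the way are recorded rather than raised.
error_count = len(parser.errors)

print(title, paragraphs, error_count)
```

The parser is created in its default, non-strict mode here, so malformed input produces a tree plus a list of recorded parse errors instead of an exception.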

In version 0.9 html5lib gained the following features:
- Parses valid and invalid HTML documents to a tree
- Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
- DOM to SAX converter
- Reports parse errors
- Character encoding detection
- Filtering and serializing of trees
- HTML+CSS sanitizer
- Many unit tests
(Users of the sanitizer must ensure that they serialize with quoted attribute values to avoid some known script injection holes in older browsers, including IE < 8.)
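
To see why quoting matters, consider the following standard-library sketch. It is not html5lib's serializer; `serialize_start_tag` and `serialize_unquoted` are hypothetical helpers written only to contrast the two output styles.

```python
from html import escape


def serialize_start_tag(name, attrs):
    """Serialize a start tag, always wrapping attribute values in double quotes."""
    parts = [name]
    for key, value in attrs.items():
        # escape(..., quote=True) also replaces " and ' with entities,
        # so the value cannot close the quotes or the tag early.
        parts.append('{}="{}"'.format(key, escape(value, quote=True)))
    return "<" + " ".join(parts) + ">"


def serialize_unquoted(name, attrs):
    """Unsafe variant for comparison: attribute values are written without quotes."""
    parts = [name] + ["{}={}".format(key, value) for key, value in attrs.items()]
    return "<" + " ".join(parts) + ">"


# An attribute value that is harmless as text but dangerous once unquoted.
attrs = {"alt": "x onerror=alert(1)"}

quoted = serialize_start_tag("img", attrs)
unquoted = serialize_unquoted("img", attrs)

print(quoted)    # the whole string stays inside the alt attribute
print(unquoted)  # a browser would read onerror=alert(1) as a second attribute
```

html5lib's own serializer has a `quote_attr_values` option for this purpose. Its accepted values have changed between releases, so check the documentation for the version you use.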

The best part of the library is the test suite for parsing HTML according to the HTML5 spec. Using this test suite, Mozilla developed the Validator.nu code to drive the new HTML parser in the Gecko rendering engine.

Maybe you'll find an application or build one that uses the html5lib library. The project is open to new contributors and the code is available under the MIT license.