Python-based html5lib in Firefox's new HTML5 Parser

Last month, Firefox started shipping with a new HTML5 parser enabled by default. Although it might not seem like much, it is another milestone in the journey of HTML5, since it replaces Gecko's old HTML parser. The transition was seamless, and the new parser processes documents according to the parsing rules laid out in the HTML5 spec. It started with html5lib, a Python implementation of the WHATWG HTML5 spec (essentially a tokenizer, a parser, and a serializer), which was developed by James Graham and Anne van Kesteren.

html5lib, a Python library for parsing HTML, started in 2006 and has gone from version 0.1 to 0.9. Unlike some C libraries, it focuses less on performance and more on correctly handling the wide variety of HTML found on the web. There is also a PHP implementation, as well as a Ruby port that hasn't been maintained for a while. The user documentation elaborates:

In version 0.9 html5lib gained the following features:

  • Parses valid and invalid HTML documents to a tree
  • Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
  • DOM to SAX converter
  • Reports parse errors
  • Character encoding detection
  • Filtering and serializing of trees
  • HTML+CSS sanitizer
  • Many unit tests

(Users of the sanitizer must ensure that they serialize with quoted attribute values to avoid some known script injection holes in older browsers, including IE < 8.)
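
To make a few of the features above concrete, here is a minimal sketch of parsing invalid markup to an ElementTree, inspecting the reported parse errors, and serializing with quoted attribute values. It assumes a recent html5lib release (1.x, installed with pip), where `quote_attr_values` takes the string "always"; older releases such as 0.9 may expose a different option. The sample markup and variable names are made up for illustration.

```python
import html5lib  # third-party package: pip install html5lib

# Deliberately broken markup: unquoted attribute, unclosed <b> and <p>,
# and no doctype. (Hypothetical sample input.)
broken = "<p class=intro>Hello <b>world<p>Second paragraph"

# Build a parser that produces a plain ElementTree without XHTML namespaces.
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("etree"),
    namespaceHTMLElements=False,
)
tree = parser.parse(broken)  # root <html> element

# The parser recovers from the errors and still yields two paragraphs.
paragraphs = tree.findall(".//p")
print(len(paragraphs), "paragraphs found")

# Parse errors are collected rather than raised (unless strict=True is set).
for position, code, details in parser.errors:
    print(position, code)

# Serialize back to HTML, forcing quotes around every attribute value,
# as the sanitizer note above recommends.
output = html5lib.serialize(tree, tree="etree", quote_attr_values="always")
print(output)
```

Passing `strict=True` to `HTMLParser` would instead raise an exception on the first parse error, which can be handy in tests but defeats the purpose of lenient parsing for real-world pages.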

The best part of the library is its test suite for parsing HTML according to the HTML5 spec. Using this test suite, Mozilla developed the Validator.nu code that drives the new HTML parser in the Gecko rendering engine.

Maybe you'll find an application or build one that uses the html5lib library.  The project is open to new contributors and the code is available under the MIT license.

