
Python-based html5lib in Firefox's new HTML5 Parser


Last month, Firefox started shipping with a new HTML5 parser enabled by default. Although it might not seem like much, it's another milestone in the journey of HTML5, since it replaces Gecko's old HTML parser. The transition was seamless, and the new parser follows the parsing rules that the HTML5 spec defines in detail. It all started with html5lib, a Python implementation of the WHATWG HTML5 spec (essentially a tokenizer, a parser, and a serializer), which was developed by James Graham and Anne van Kesteren.

html5lib, a Python library for parsing HTML, started in 2006 and has gone from version 0.1 to 0.9. Unlike some C libraries, it focuses less on performance and more on correctly recognizing the wide variety of HTML found on the web. There is also a PHP implementation, as well as a Ruby port that hasn't been maintained for a while. The user documentation elaborates on what the library offers.

In version 0.9 html5lib gained the following features:

  • Parses valid and invalid HTML documents to a tree
  • Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
  • DOM to SAX converter
  • Reports parse errors
  • Character encoding detection
  • Filtering and serializing of trees
  • HTML+CSS sanitizer
  • Many unit tests

(Users of the sanitizer must serialize with quoted attribute values to avoid some known script injection holes in older browsers, including IE < 8.)
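
To make a few of these features concrete, here is a minimal sketch of parsing invalid HTML to a tree, reading the reported parse errors, and serializing with quoted attribute values. It assumes html5lib is installed and uses the call names from current html5lib releases (`html5lib.parse`, `html5lib.HTMLParser`, `html5lib.getTreeBuilder`, `html5lib.serialize`, and the `quote_attr_values="always"` option), which may differ from what version 0.9 exposed. The helper names `parse_html`, `list_parse_errors`, and `serialize_quoted` are hypothetical and exist only for this illustration.

```python
import html5lib


def parse_html(markup):
    """Parse valid or invalid HTML into an ElementTree <html> element."""
    # namespaceHTMLElements=False keeps tag names plain ("p", not "{namespace}p").
    return html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)


def list_parse_errors(markup):
    """Return the error codes the parser recorded for this markup."""
    parser = html5lib.HTMLParser(
        tree=html5lib.getTreeBuilder("etree"), namespaceHTMLElements=False
    )
    parser.parse(markup)
    # Each entry in parser.errors is a (position, error_code, details) tuple.
    return [code for _position, code, _details in parser.errors]


def serialize_quoted(tree):
    """Serialize a tree back to HTML, quoting every attribute value."""
    # Assumption: "always" is the option value in current releases.
    return html5lib.serialize(tree, tree="etree", quote_attr_values="always")


if __name__ == "__main__":
    broken = "<p>Unclosed <b>tags<a href=x>link</a>"
    root = parse_html(broken)
    print(root.tag)
    print(list_parse_errors(broken))
    print(serialize_quoted(root))
```

Passing `treebuilder="dom"` instead of `"etree"` should yield a minidom document, matching the output formats listed above.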

The best part of the library is its test suite for parsing HTML according to the HTML5 spec. Using this test suite, Mozilla developed the Validator.nu code that drives the new HTML parser in the Gecko rendering engine.

Maybe you'll find an application or build one that uses the html5lib library.  The project is open to new contributors and the code is available under the MIT license.
