html5lib, a Python library for parsing HTML, started in 2006 and has gone from version 0.1 to 0.9. Rather than chasing raw performance, as some C libraries do, it concentrates on correctly handling the wide variety of HTML found on the web. There is also a PHP implementation, and a Ruby port that hasn't been maintained for a while. The user documentation elaborates.
As of version 0.9, html5lib offers the following features:
- Parses valid and invalid HTML documents to a tree
- Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup (deprecated) and custom simpletree output formats
- DOM to SAX converter
- Reports parse errors
- Character encoding detection
- Filtering and serializing of trees
- HTML+CSS sanitizer
- Many unit tests
(Users of the sanitizer must serialize with quoted attribute values to avoid some known script injection holes in older browsers, including IE < 8.)
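As a rough illustration of the parsing, error reporting, and serializing features listed above, here is a minimal sketch. It assumes a recent html5lib 1.x release rather than 0.9, so some names may differ in older versions. In particular, the `namespaceHTMLElements` argument and the string value `"always"` for `quote_attr_values` are taken from the 1.x API, and the sample markup is made up for the example.

```python
import html5lib

# Deliberately invalid markup: no doctype, unclosed tags, unquoted attribute.
markup = "<p>Hello <b>world<a href=x>link"

# Build a plain ElementTree tree. namespaceHTMLElements=False keeps tag names
# un-namespaced ("p" rather than "{http://www.w3.org/1999/xhtml}p").
parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("etree"),
    namespaceHTMLElements=False,
)
document = parser.parse(markup)

# The parser recovers from the errors and still builds an html/head/body tree.
paragraph = document.find("body/p")
print(paragraph.text)

# Each recorded parse error is a (position, error code, details) tuple.
for position, code, details in parser.errors:
    print(position, code)

# Serialize the tree back to HTML with every attribute value quoted.
html = html5lib.serialize(document, tree="etree", quote_attr_values="always")
print(html)
```

Asking the serializer to quote every attribute value follows the advice in the note above. The serializer's default settings may leave simple values such as `x` unquoted.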
The best part of the library is the test suite for parsing HTML according to the HTML5 spec. Using this test suite, Mozilla developed the Validator.nu code to drive the new HTML parser in the Gecko rendering engine.
Perhaps you'll find an application that uses html5lib, or build one yourself. The project is open to new contributors and the code is available under the MIT license.