html,parser,jdk,swing,jtidy

How do you parse HTML in Java?

The Open Source HTML Parsers in Java page is useful in listing the HTML parsers that are out there. But it doesn't give much of a clue about which are the "best" in a given situation. In other words, how should one decide which HTML parser to use? And, doesn't the proliferation of HTML parsers out there imply that there is something wrong with the JDK's own HTML parser, javax.swing.text.html.HTMLEditorKit.Parser?

All things being equal, shouldn't one prefer to use a utility provided by the JDK over one provided by a third party library? (For this reason, I'm assuming that all things are not equal in this case.) I've been parsing HTML using the JDK's HTML parser, based on the approach described in Parsing HTML with Swing, although that's an article written in 2003, so it may be dated. The author of that article points to this weakness of the Swing HTML parser: "The biggest downside to this HTML parser is that it is not thread safe (thread safety has always been a problem with Swing components). This HTML processor is no different. I have used the Swing parser in heavily threaded environments, and it has resulted in a crash—eventually. If you want to use this HTML processor in a heavily threaded environment, you need to take steps to ensure that only one thread uses it at a time."
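For readers who have not used it, here is a minimal sketch of the callback-style approach with the JDK's parser, assuming the goal is simply to collect link targets. The class name `SwingLinkExtractor`, the method `extractLinks`, and the shared `PARSER_LOCK` are hypothetical names chosen for illustration. The lock is only one possible way to follow the quoted advice that a single thread should use the parser at a time, and it covers only calls made through this class.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values from <a> tags using the JDK's Swing HTML parser.
final class SwingLinkExtractor {

    // Assumption: one global lock, so only one thread parses at a time via this class.
    private static final Object PARSER_LOCK = new Object();

    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser reports each tag to this callback; no document tree is built here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        synchronized (PARSER_LOCK) {
            try (Reader reader = new StringReader(html)) {
                // 'true' tells the parser to ignore any charset declared in the document.
                new ParserDelegator().parse(reader, callback, true);
            } catch (IOException e) {
                // Not expected for an in-memory reader; rethrown unchecked to keep the sketch short.
                throw new UncheckedIOException(e);
            }
        }
        return links;
    }
}
```

Calling `SwingLinkExtractor.extractLinks(pageSource)` returns the `href` values in document order. A coarse lock like this trades throughput for safety; whether per-thread parser instances would be enough instead is something I have not verified.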

Is that the only weakness here? (By the way, on the positive side, the author writes: "I have used this parser with a number of programs that I have written, and I have found it to be very useful. It is particularly helpful for handling improperly formatted HTML, which can trip up some HTML parsers.") I guess the other HTML parsers offer additional features, such as transformation in addition to parsing. They probably also allow walking the DOM, rather than inspecting tags one at a time through callbacks, as the Swing HTML parser does. I have used JTidy before, but I didn't find that its benefits outweighed the inconvenience of depending on a third-party library.

Anyone care to share their experiences with these utilities?
