Parsing HTML: Selecting the Right Library (Part 1)

Consider the many libraries out there for your HTML parsing needs. This series starts by looking at the popular HTML parsers made for Java.

HTML is a markup language with a simple structure, so it would be quite easy to build a parser for it with a parser generator. In fact, if you choose a popular parser generator such as ANTLR, you may not even need to write the grammar yourself, because ready-made HTML grammars are already available.

HTML is so popular that there is an even better option: using a library. A library is easier to use and usually provides more features, such as a way to create an HTML document or to navigate the parsed document easily. For example, it usually comes with CSS/jQuery-like selectors to find nodes according to their position in the hierarchy.

The goal of this series is to help you find the right library to process HTML. Whatever language you are using, whether Java, C#, Python, or JavaScript, we have you covered.

We are not going to cover libraries for more specific tasks, such as article extraction or web scraping (Goose, for example). They typically have restricted uses, while this article focuses on generic libraries to process HTML.

Java

Let’s start with the Java libraries to process HTML.

Lagarto and Jerry

Jodd is set of Java micro frameworks, tools and utilities

Among the many Jodd components available are Lagarto, an HTML parser, and Jerry, defined as jQuery in Java. Other components do related jobs: for instance, CSSelly is a parser for CSS selector strings that powers Jerry, and StripHtml reduces the size of HTML documents.

Lagarto works more like a traditional parser than a typical library. You implement a visitor, and the parser calls the proper method each time it encounters a tag or a piece of text. The interface is simple, and Lagarto itself is quite basic: it just does parsing. Even building the DOM tree is done by an extension, aptly called DOMBuilder.
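
To make the visitor style concrete, here is a minimal sketch in plain Java. It is not Lagarto's API: the names MarkupVisitor, TinyTagScanner, scan, and events are hypothetical, and the scanner is deliberately naive (it ignores attributes, comments, scripts, and self-closing syntax). It only illustrates the idea that the parser walks the input and calls back into your code for each tag and each piece of text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: this is NOT Lagarto's API, only an
// illustration of the visitor style of parsing.
interface MarkupVisitor {
    void tag(String name, boolean closing);
    void text(CharSequence text);
}

class TinyTagScanner {
    // Walks the input once and calls the visitor for each tag and each text run.
    // Deliberately naive: no handling of comments, scripts, or '>' inside quotes.
    static void scan(String html, MarkupVisitor visitor) {
        int pos = 0;
        while (pos < html.length()) {
            int open = html.indexOf('<', pos);
            if (open < 0) {                      // no more tags: the rest is text
                visitor.text(html.substring(pos));
                return;
            }
            if (open > pos) {                    // text before the next tag
                visitor.text(html.substring(pos, open));
            }
            int close = html.indexOf('>', open);
            if (close < 0) {                     // unterminated tag: treat as text
                visitor.text(html.substring(open));
                return;
            }
            String inside = html.substring(open + 1, close).trim();
            boolean closing = inside.startsWith("/");
            if (closing) {
                inside = inside.substring(1).trim();
            }
            // the tag name ends at the first whitespace; attributes are ignored
            String name = inside.split("\\s+", 2)[0];
            visitor.tag(name.toLowerCase(), closing);
            pos = close + 1;
        }
    }

    // Example visitor: records every callback in a flat event log.
    static List<String> events(String html) {
        List<String> log = new ArrayList<>();
        scan(html, new MarkupVisitor() {
            public void tag(String name, boolean closing) {
                log.add((closing ? "close:" : "open:") + name);
            }
            public void text(CharSequence text) {
                log.add("text:" + text);
            }
        });
        return log;
    }
}
```

Calling TinyTagScanner.events on a small fragment yields one entry per callback, which shows why this style suits streaming or custom processing, and also why building a tree on top of it is a separate step.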

While Lagarto could be very useful for advanced parsing tasks, usually you will want to use Jerry. Jerry tries to stay as close as possible to jQuery, but only to its static and HTML manipulation parts. It does not implement animations or Ajax calls. Behind the scenes, Jerry uses Lagarto and CSSelly, but it is much easier to use. Also, you are probably already familiar with jQuery.

Jerry's documentation is good and includes a few examples, such as the following one:

// from the documentation 
public class ChangeGooglePage
{
    public static void main(String[] args) throws IOException
    {
        // download the page super-efficiently
        File file = new File(SystemUtil.getTempDir(), "google.html");
        NetUtil.downloadFile("http://google.com", file);

        // create Jerry, i.e. document context
        Jerry doc = Jerry.jerry(FileUtil.readString(file));

        // remove div for toolbar
        doc.$("div#mngb").detach();
        // replace logo with html content
        doc.$("div#lga").html("<b>Google</b>");

        // produce clean html...
        String newHtml = doc.html();
        // ...and save it to file system
        FileUtil.writeString(
            new File(SystemUtil.getTempDir(), "google2.html"),
            newHtml);
    }
}


HTMLCleaner

HTMLCleaner is a parser that is mainly designed to be a cleaner of HTML for further processing. As the documentation explains it:

HtmlCleaner is an open source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring some order to the tags, attributes and ordinary text. For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create the Document Object Model. However, you can provide custom tag and rule sets for tag filtering and balancing.

This explanation also reveals that the project is old, given that broken HTML has been a much less prominent problem in the last few years than it was before. However, HTMLCleaner is still updated and maintained. The disadvantage of using it is that the interface shows its age and can be clunky when you need to manipulate HTML.

The advantage is that it works well even on old HTML documents. It can also write documents as XML or as pretty HTML (i.e., with the correct indentation). If you need JDOM and a product that supports XPath, or you simply like XML, look no further.

The documentation offers a few examples and API documentation, but nothing more. The following example comes from it.

HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "http://www.themoscowtimes.com/";
 
TagNode node = cleaner.clean(new URL(siteUrl));
 
// traverse whole DOM and update images to absolute URLs
node.traverse(new TagNodeVisitor() {
    public boolean visit(TagNode tagNode, HtmlNode htmlNode) {
        if (htmlNode instanceof TagNode) {
            TagNode tag = (TagNode) htmlNode;
            String tagName = tag.getName();
            if ("img".equals(tagName)) {
                String src = tag.getAttributeByName("src");
                if (src != null) {
                    tag.setAttribute("src", Utils.fullUrl(siteUrl, src));
                }
            }
        } else if (htmlNode instanceof CommentNode) {
            CommentNode comment = ((CommentNode) htmlNode); 
            comment.getContent().append(" -- By HtmlCleaner");
        }
        // tells visitor to continue traversing the DOM tree
        return true;
    }
});
 
SimpleHtmlSerializer serializer = 
    new SimpleHtmlSerializer(cleaner.getProperties());
serializer.writeToFile(node, "c:/temp/themoscowtimes.html");
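
The XML output is what makes XPath attractive here: once a document is well-formed, standard XML tooling can query it. The sketch below uses only the JDK's built-in XML and XPath classes, not HtmlCleaner itself. The class name XPathOnCleanedHtml, the method imageSources, and the sample input are hypothetical, and the input is assumed to be already well-formed, as a cleaner's XML output would be.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: queries well-formed markup with the JDK's XPath support.
class XPathOnCleanedHtml {
    // Returns the src attribute of every img element in a well-formed document.
    static List<String> imageSources(String wellFormedXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // select the src attribute nodes of all img elements, at any depth
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//img/@src", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // malformed input ends up here; real HTML needs cleaning first
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><img src=\"a.png\"/>"
                + "<p>text <img src=\"b.png\"/></p></body></html>";
        System.out.println(imageSources(xml));
    }
}
```

Passing typical real-world HTML straight to this method would usually fail with a parse error, which is exactly the gap a cleaning step is meant to close.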


Jsoup

jsoup is a Java library for working with real-world HTML

Jsoup is a library with a long history, but a modern attitude:

  • It can handle old and bad HTML, but it is also equipped for HTML5
  • It has powerful support for manipulation, with support for CSS selectors, DOM Traversal and easy addition or removal of HTML
  • It can clean HTML, both to protect against XSS attacks and in the sense that it improves structure and formatting

There is little more to say about jsoup, because it does everything you need from an HTML parser and even more (e.g., cleaning HTML documents). Code written with it can be very concise.

In this example, it fetches an HTML document directly from a URL and selects a few links (print and trim are helper methods not shown here). On line 9 of the snippet, the print call inside the loop, you can also see a nice option: the abs: prefix returns the absolute URL even if the href attribute references a relative one. This works because the document's base URI is set implicitly when you fetch the URL with the connect method.

Document doc = Jsoup.connect("http://en.wikipedia.org/")
               .userAgent("Mozilla")
               .get();

Elements newsHeadlines = doc.select("#mp-itn b a");

print("\nLinks: (%d)", newsHeadlines.size());
for (Element link : newsHeadlines) {
   print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
}
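
To show what turning a relative href into an absolute URL involves, here is a minimal sketch that uses only the JDK's java.net.URI. It is not jsoup's implementation, whose resolution may differ in detail. The class name AbsoluteLinks, the method absolute, and the example.com addresses are hypothetical.

```java
import java.net.URI;

// Hypothetical helper: resolves an href against the URL the page came from.
class AbsoluteLinks {
    // baseUri is the address of the fetched document; href is the raw attribute value.
    // An href that is already absolute is returned unchanged.
    static String absolute(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b.html";
        System.out.println(absolute(base, "c.html"));
        System.out.println(absolute(base, "/wiki/Java"));
        System.out.println(absolute(base, "http://other.org/x"));
    }
}
```

The key input is the base URI: without knowing where the document came from, a relative link cannot be resolved, which is why a document parsed from a plain string needs the base supplied explicitly.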


The documentation lacks a tutorial, but it provides a cookbook that essentially fulfills the same function, as well as an API reference. There is also an online interactive demo that shows how jsoup parses an HTML document.

Conclusion

That's all for now! Next up, we'll dive into the popular C# libraries that are out there, compare them and their functionalities, and see when you should consider using them.

Topics: web dev, html, java parser, libraries, tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
