Behind-the-Scenes Secrets of Jsoup: Introduction

Jsoup is probably the most popular HTML library in the Java community.

by Nathanael Yang · Jan. 15, 19 · Presentation

Jsoup is probably the most popular library in the Java community for “working with real-world HTML”. I have been using it for web crawling since version 1.7.3 (the latest release is 1.11.3), and I was a little surprised to find so little introduction to, or analysis of, its source code and implementation.

Since I will use jsoup as an example in the OOP course (SE500) that I teach at Olivet Institute of Technology starting in January 2019, I have tried to summarize the most important ideas behind the scenes, so that at least I know what I will be talking about. This series was inspired by jsoup-learning, and I reused some diagrams from it. Many thanks to Yihua Huang for digging into this beautiful library.

This is the first part of a series on jsoup. Here I give a brief introduction to its features and general code structure. Later articles analyze the DOM parser and the CSS selector implementation, along with some interesting tips and tricks.

Jsoup was developed by Jonathan Hedley, a Senior Manager of Software Development at Amazon. According to the change log, the initial beta was released on Jan 31, 2010, so it has been around for about nine years! He still maintains the code base regularly, though not as actively as before, perhaps because, as he put it, “jsoup is in general, stable release”.

I use the latest version, 1.12.1-SNAPSHOT, for this series. It is a Maven project with no external dependencies apart from junit, gson, and jetty, which are used only for unit and integration tests. According to cloc, there are 68 Java source files under src\main\java, with 12015 lines of code, 4177 lines of comments, and 1991 blank lines. As befits a library with good test coverage, there are 46 Java source files under src\test\java, with 7911 lines of code, 350 lines of comments, and 1672 blank lines.

The ratio of test code to production code is 65.8 percent (7911 / 12015 lines), which is pretty good. As for test coverage, the IntelliJ IDEA code coverage runner produces the report below, which is also impressive.

Package              Class, %          Method, %          Line, %
org.jsoup            98% (229/233)     87% (1269/1457)    83% (6135/7317)
org.jsoup.helper     100% (11/11)      79% (138/174)      83% (692/824)
org.jsoup.internal   100% (3/3)        100% (27/27)       95% (141/147)
org.jsoup.nodes      96% (31/32)       87% (356/407)      88% (1286/1455)
org.jsoup.parser     100% (114/114)    91% (490/535)      79% (2924/3699)
org.jsoup.safety     100% (9/9)        100% (44/44)       95% (286/300)
org.jsoup.select     100% (58/58)      81% (193/237)      92% (772/836)
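
The percentages quoted above are plain ratios of the raw counts, so they are easy to re-derive. The following is only a small sketch of that arithmetic; the class and method names (RatioSketch, percent) are made up for illustration and are not part of jsoup, cloc, or IntelliJ IDEA.

```java
import java.util.Locale;

public class RatioSketch {
    // Returns part / whole expressed as a percentage.
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        // Lines of test code vs. lines of production code, as counted by cloc.
        System.out.printf(Locale.ROOT, "test/production ratio: %.1f%%%n",
                percent(7911, 12015));
        // Covered lines vs. total lines for org.jsoup.parser, from the coverage report.
        System.out.printf(Locale.ROOT, "org.jsoup.parser line coverage: %.1f%%%n",
                percent(2924, 3699));
    }
}
```

The first line printed should match the 65.8 percent figure quoted in the text.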

You may notice that the package org.jsoup.examples is missing here. Since it only serves as a showcase, there is no real need to write tests against it, so I excluded it. In my opinion, it might be better to move it out of the production code into a separate project, with more examples covering the most frequent scenarios.

For a widely used library that aims at “working with real-world HTML”, the higher the test coverage, the better. I’ve heard that Evan You, the author of Vue.js, even achieved 100 percent unit test coverage! 669 test cases keep jsoup in good shape, but they still cannot prevent new issues from appearing. The real world is always a crazy world; go through the test cases and you will believe me. For example, what on earth are <p =a>One<a <p>Something</p>Else and <div id=1<p id='2'? What should the correct parsing result be? Can you figure it out in five seconds?
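
If you would rather see the answer than guess it, a few lines are enough to feed such snippets to the parser and print what comes out. This is only a sketch: it assumes jsoup is on the classpath, the class and method names (RoughHtmlSketch, normalize) are made up for illustration, and the exact output may differ between jsoup versions, so none is shown here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RoughHtmlSketch {
    // Parses a (possibly malformed) snippet as a full document and
    // returns the markup jsoup built for the body.
    static String normalize(String roughHtml) {
        Document doc = Jsoup.parse(roughHtml);
        return doc.body().html();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p =a>One<a <p>Something</p>Else",
            "<div id=1<p id='2'"
        };
        for (String sample : samples) {
            System.out.println("input : " + sample);
            System.out.println("output: " + normalize(sample));
            System.out.println();
        }
    }
}
```

Comparing the printed output across two jsoup versions is also a cheap way to spot parsing changes before an upgrade.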

Even though jsoup is well covered by unit tests and maintains a high-quality implementation, you still need to be very careful when you upgrade to a new version. This should be self-evident: the more test cases of your own you have, the easier your life will be. I remember very clearly that in December 2016 some of my unit tests suddenly failed after I upgraded from 1.8.3 to 1.10.1 without changing anything else. Since my software was used by thousands of clients, I immediately reported an issue on GitHub and rolled back to 1.8.3 for a while. I can’t imagine what would have happened had I simply upgraded: some bugs appear only in certain circumstances and are not easy to reproduce in normal functional tests.

The test coverage report also gives you an overall picture of the code structure. While jsoup provides convenient methods for submitting HTTP requests and reading responses, the most important parts are in the packages org.jsoup.parser and org.jsoup.select. I will cover them in the second and third articles of this series. The five lines of code below cover the most frequently used scenario and are clean, easy, and simple, which shows that jsoup has done quite a good job on the user experience of its API.

Document doc = Jsoup.connect("https://en.wikipedia.org").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
    log("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
}
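
The snippet above needs network access and a log helper that is not shown. As a rough offline counterpart, the sketch below runs the same selector against an in-memory string. It assumes jsoup is on the classpath; the HTML is hypothetical markup that only imitates the structure the selector expects, and the class and method names (OfflineSelectSketch, headlines) are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class OfflineSelectSketch {
    // Hypothetical markup: two links inside <b> and one outside, all under #mp-itn.
    static final String HTML =
        "<div id=\"mp-itn\"><ul>"
      + "<li><b><a href=\"/wiki/First\" title=\"First story\">First</a></b>"
      + " and <a href=\"/wiki/Other\" title=\"Other\">other</a></li>"
      + "<li><b><a href=\"/wiki/Second\" title=\"Second story\">Second</a></b></li>"
      + "</ul></div>";

    // Parses the markup and returns "title -> absolute URL" for every match.
    static List<String> headlines(String html, String baseUri) {
        // The base URI lets absUrl() resolve relative href values.
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select("#mp-itn b a");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            result.add(link.attr("title") + " -> " + link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : headlines(HTML, "https://en.wikipedia.org/")) {
            System.out.println(line);
        }
    }
}
```

Only the two links wrapped in <b> match the selector; the third link is a descendant of #mp-itn but not of a <b> element, so it is skipped.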


Jsoup also provides a website where you can play around with its selectors: Try jsoup lets you explore the features of jsoup without writing a single line of code. It is also designed for issue reporting: you can save the input, parameters, and output so that those who want to help you can get to the point much faster. For example, here is a session that retrieves all the post titles of Old Young Boys Club: http://try.jsoup.org/~iZFlhTQQAZnkXoPSCKrG4OisFcg. The upgrade issue I mentioned above also has a corresponding session link, http://try.jsoup.org/~NOfOU7vXHAaHWhDnHv5qBIPtE1M, which is still available today.

Once you are familiar with the features of jsoup, it’s time to go to the source code to understand the mechanism. I suggest you run the class org.jsoup.examples.Wikipedia and debug the five lines of code above to see what actually happens, step by step. Beware, it’s a long journey. If you get stuck, you can also go over the test cases to understand what kinds of problems the code solves, and think about how you would solve them. It’s also helpful to fork the repository and submit some pull requests; you can either pick an issue and try to fix it or make some minor enhancements. In fact, I just submitted Pull Requests #1157 and #1158 yesterday, plus two issues: #1156 and #1159. I hope Jonathan will accept my PRs and consider fixing these issues.

PS: There was once a Japanese samurai who submitted Pull Request #564 on Apr 27, 2015, and had it approved and merged on Nov 19, 2017. He recorded this unforgettable experience on Twitter.

Stay tuned!

You can check out the rest of the series on my blog; there are five finished articles already!

Notice: Based on feedback in the comments, and to avoid confusion, I am listing all five articles here first, since it is not easy to publish all five on DZone in a short time.

Behind-the-Scenes Secrets of Jsoup I: Introduction 

Behind-the-Scenes Secrets of Jsoup II: Traverse A Tree Before It Was Built

Behind-the-Scenes Secrets of Jsoup III: The Tree and The State Machine 

Behind-the-Scenes Secrets of Jsoup IV: CSS Selector 

Behind-the-Scenes Secrets of Jsoup V: Tips & Tricks of Optimization

Jsoup unit test code style

Published at DZone with permission of Nathanael Yang. See the original article here.
