
HtmlUnit Example for Html Parsing


In continuation of my earlier blog, HtmlUnit vs JSoup, in this blog I will show you how to write a simple web scraping sample using HtmlUnit. This example parses HTML and turns unstructured web data into a structured format.

In this simple example, we will connect to Wikipedia and get the list of all the movies and their Wikipedia source links. The page looks as below:

HtmlUnit: Screen awards movie list

As always, let us start with a Maven dependency entry in our pom.xml to include HtmlUnit, as below:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.11</version>
</dependency>
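
Before writing the test, here is a minimal sketch of creating a WebClient and fetching the page. The class name is hypothetical, and disabling JavaScript and CSS processing is my own assumption to keep the scrape fast and quiet; it is not required by HtmlUnit.

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class WikipediaFetchSketch {
    public static void main(String[] args) throws Exception {
        final WebClient webClient = new WebClient();
        // Optional (assumption): the page is static, so skip JavaScript and CSS
        // processing to speed things up and avoid noisy warnings.
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);

        final HtmlPage page = webClient.getPage("http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film");
        System.out.println(page.getTitleText());

        webClient.closeAllWindows();
    }
}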

Again, we will start with a simple JUnit test case, as below:

@Test
public void testBestMovieList() throws FailingHttpStatusCodeException, MalformedURLException, IOException {

    final WebClient webClient = new WebClient();
    final HtmlPage startPage = webClient.getPage("http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film");

    // The row index is parameterized with the ":?:" placeholder; splitting on ":"
    // gives the XPath fragments before and after the row number.
    String source = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@href";
    String[] sourceArr = source.split(":");

    String title = "/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[:?:]/td[2]/i/a/@title";
    String[] titleArr = title.split(":");

    // Row 2: the first movie in the table.
    String titleData = titleArr[0] + 2 + titleArr[2];
    String sourceData = sourceArr[0] + 2 + sourceArr[2];
    List<DomNode> titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
    assertTrue(titleNodes.size() > 0);
    List<DomNode> sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
    assertTrue(sourceNodes.size() > 0);
    assertEquals("Hum Aapke Hain Kaun", titleNodes.get(0).getNodeValue());
    assertEquals("/wiki/Hum_Aapke_Hain_Kaun", sourceNodes.get(0).getNodeValue());

    // Row 3: the second movie in the table.
    titleData = titleArr[0] + 3 + titleArr[2];
    sourceData = sourceArr[0] + 3 + sourceArr[2];
    titleNodes = (List<DomNode>) startPage.getByXPath(titleData);
    assertTrue(titleNodes.size() > 0);
    sourceNodes = (List<DomNode>) startPage.getByXPath(sourceData);
    assertTrue(sourceNodes.size() > 0);
    assertEquals("Dilwale Dulhaniya Le Jayenge", titleNodes.get(0).getNodeValue());
    assertEquals("/wiki/Dilwale_Dulhaniya_Le_Jayenge", sourceNodes.get(0).getNodeValue());
}

If you notice, I am accessing the page http://en.wikipedia.org/wiki/Screen_Award_for_Best_Film, which looks like the diagram above. We get the first and second movies on the page, assert them with JUnit, and the test succeeds. Also notice that I am using XPaths to access the elements, such as /html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr[2]/td[2]/i/a/@title. The way I extract the XPath is with Firebug, as described in my HtmlUnit vs JSoup blog.
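
Since the two XPaths differ only in the row index, the same idea extends to every row of the table. Below is a minimal sketch that reuses the startPage from the test above and selects all the movie links in one query (it needs an extra import of com.gargoylesoftware.htmlunit.html.HtmlAnchor); the table index and row structure are assumptions based on how the page looked at the time of writing.

List<?> anchors = startPage.getByXPath("/html/body/div[3]/div[3]/div[4]/table[2]/tbody/tr/td[2]/i/a");
for (Object node : anchors) {
    HtmlAnchor anchor = (HtmlAnchor) node;
    // getAttribute("title") and getHrefAttribute() yield the same values as the
    // @title and @href XPaths asserted in the test.
    System.out.println(anchor.getAttribute("title") + " -> " + anchor.getHrefAttribute());
}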

I hope this blog helped you.


Topics: java, frameworks, tips and tricks

Published at DZone with permission of Krishna Prasad, DZone MVB. See the original article here.
