Parsing HTML: Selecting the Right Library (Part 2)
With Java out of the way, now we compare two C# libraries designed for parsing HTML, considering their strengths and weaknesses.
Last time, we looked over the various HTML parsers you can consider when working with Java. This time, we'll examine a couple of popular C# libraries, along with their features, benefits, and drawbacks when processing HTML.
C#
AngleSharp
The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications.
AngleSharp is, quite simply, the default choice whenever you need a modern HTML parser for a C# project. In fact, it does not just parse HTML5, but also its most-used companions: CSS and SVG. There is also an extension that integrates scripting into the parsing of HTML documents, supporting both C# and JavaScript (the latter based on Jint). That means that you can parse HTML documents after they have been modified by JavaScript, whether that is the JavaScript included in the page or a script you add yourself.
AngleSharp fully supports modern conventions for easy manipulation, like CSS selectors and jQuery-like constructs. But it is also well-integrated with the .NET world, with support for LINQ for DOM elements. The author mentions that it may evolve into something more than a parser, but for the moment, it can do simple things like submitting forms.
The following example from the documentation shows a few features of AngleSharp.
// Namespaces as in AngleSharp 0.9.x, the API this example uses;
// later releases reorganized them and renamed Parse to ParseDocument.
using System;
using System.Linq;
using AngleSharp.Extensions;
using AngleSharp.Parser.Html;

var parser = new HtmlParser();
var document = parser.Parse("<ul><li>First item<li>Second item<li class='blue'>Third item!<li class='blue red'>Last item!</ul>");
// Select the blue list items with LINQ
var blueListItemsLinq = document.All.Where(m => m.LocalName == "li" && m.ClassList.Contains("blue"));
// Or directly with CSS selectors
var blueListItemsCssSelector = document.QuerySelectorAll("li.blue");
Console.WriteLine("Comparing both ways ...");
Console.WriteLine();
Console.WriteLine("LINQ:");
foreach (var item in blueListItemsLinq)
    Console.WriteLine(item.Text());
Console.WriteLine();
Console.WriteLine("CSS:");
foreach (var item in blueListItemsCssSelector)
    Console.WriteLine(item.Text());
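Since AngleSharp builds a standard DOM, the selected elements can also be modified and serialized back to markup. The following is only a minimal sketch of that idea, not an example from the AngleSharp documentation. It assumes the same 0.9.x-era API as the example above (HtmlParser.Parse), and the markup, the class name ManipulationSketch, and the "highlight" class are made up for illustration.

```csharp
// Sketch only: assumes the AngleSharp 0.9.x API (HtmlParser.Parse);
// newer releases use a different namespace and ParseDocument instead.
using System;
using AngleSharp.Parser.Html;

class ManipulationSketch
{
    static void Main()
    {
        var parser = new HtmlParser();
        // Hypothetical markup, used only for this illustration
        var document = parser.Parse("<ul><li class='blue'>First<li>Second</ul>");

        // QuerySelector returns the first match, or null if nothing matches
        var first = document.QuerySelector("li.blue");
        if (first != null)
        {
            // ClassList and TextContent follow the standard DOM API
            first.ClassList.Add("highlight");
            first.TextContent = "First (edited)";
        }

        // Serialize the modified list back to markup
        Console.WriteLine(document.QuerySelector("ul").OuterHtml);
    }
}
```

Checking for null before touching the element is a deliberate choice here: selectors that match nothing are common when scraping pages you do not control.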
The documentation may contain all the information you need, but it certainly could use better organization. For the most part, it is delivered within the GitHub project, but there are also tutorials on CodeProject by the author of the library.
HtmlAgilityPack
HtmlAgilityPack was once considered the default choice for HTML parsing with C#, although some say that was due to the lack of better alternatives, because the quality of the code was low. In any case, it had been essentially abandoned for a few years, until it was recently revived by ZZZ Projects.
In terms of features and quality, it is quite lacking, at least compared to AngleSharp. Support for CSS selectors, necessary for modern HTML parsing, and support for .NET Standard, necessary for modern C# projects, are on the roadmap. The same roadmap also plans a cleanup of the code.
If you need things like XPath, HtmlAgilityPack should be your best choice. In other cases, I do not think it is the best choice right now, unless you are already using it. That is especially true since there is no documentation. That being said, the new maintainer and the prospect of better features are a good reason to keep using it if you are already a user.
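// Requires the HtmlAgilityPack NuGet package
using System.Linq;
using HtmlAgilityPack;

// Load an HTML document
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);
// Get value with XPath (SelectNodes returns null if nothing matches)
var value = doc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;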
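One caveat with the XPath example above: SelectNodes returns null rather than an empty collection when nothing matches, so the call to First() throws if the page has no matching input. The following is a minimal sketch of a more defensive variant. It parses a hypothetical markup string instead of loading a URL, and the class name XPathSketch and the fallback value are assumptions made for illustration.

```csharp
// Sketch only: requires the HtmlAgilityPack NuGet package.
using System;
using HtmlAgilityPack;

class XPathSketch
{
    static void Main()
    {
        // Hypothetical markup, parsed from a string instead of a URL
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td><input value='42'></td></tr></table>");

        // SelectSingleNode returns null when nothing matches
        var input = doc.DocumentNode.SelectSingleNode("//td/input");

        // GetAttributeValue falls back to the supplied default
        // when the attribute is missing
        var value = input?.GetAttributeValue("value", "") ?? "";
        Console.WriteLine(value);
    }
}
```

Using SelectSingleNode with a null check avoids both the null collection and the exception that First() would raise on a page without a match.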
Conclusion
Now that you've seen the heavy hitters in the C# world of parsing HTML, next time, we'll take a look at the options out there for Python.
Published at DZone with permission of Gabriele Tomassetti, DZone MVB. See the original article here.