A Universal Document Scraper in Scala


As part of a project I’m working on, I needed to get documents from state institutions. Instead of writing code specific to each site, I decided to try creating a “universal” document scraper. It can be found as a separate module within the main project: https://github.com/Glamdring/state-alerts/. The project is written in Scala and can be used in any JVM project (provided you add a Scala JAR dependency). It is meant for scraping documents rather than arbitrary data. It could probably be extended to do the latter, but for now I’d like it to stay (state-)document and open-data oriented, rather than become a tool for commercial scraping (which is often frowned upon).

It is now in a more or less stable form (I’ve already deployed the application and it works properly), so I’ll just share a short description of the functionality. The point is to be able to specify scraping via configuration only. The class used to configure individual scraping instances is ExtractorDescriptor. There you specify a number of things:

  • The target URL, HTTP method, and body parameters (in the case of POST). You can put a {x} placeholder in the URL, which will be used for paging.
  • The type of document (PDF, doc, or HTML) and the type of the scraping workflow — in other words, how the document is reached on the target page. There are four options, depending on whether there’s a separate details page, whether there’s only a table, and where the link to the document is located.
  • XPath expressions for the elements that contain the metadata and the links to the documents. A different expression is used depending on where the information is located: in a table or on a separate details page.
  • The date format, for the date of the document. A regex can be used if the date cannot be located precisely by XPath.
  • Simple “heuristics.” For example, if you know the URL structure of the document you are looking for, there’s no need to locate it via XPath.
  • Other configuration, like JavaScript requirements, whether scraping should fail on error, and so on.
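To make the list above more concrete, here is a minimal sketch of what such a configuration could look like. Note that the field names below are illustrative stand-ins, not the actual API of the project’s ExtractorDescriptor class — consult the repository for the real names:

```scala
// Hypothetical model of the configuration options described above.
// The real ExtractorDescriptor in state-alerts has its own field names.
case class ScrapeConfig(
  targetUrl: String,                 // may contain a {x} placeholder for paging
  httpMethod: String = "GET",
  bodyParams: Map[String, String] = Map.empty, // used for POST requests
  documentType: String = "pdf",      // pdf, doc, or html
  workflow: String = "table-link",   // how the document link is reached
  tableRowXPath: Option[String] = None,     // metadata located in a table
  detailsPageXPath: Option[String] = None,  // metadata on a separate details page
  dateFormat: String = "dd.MM.yyyy",
  dateRegex: Option[String] = None,  // fallback when XPath can't pinpoint the date
  documentUrlHeuristic: Option[String] = None, // known URL structure shortcut
  javascriptRequired: Boolean = false,
  failOnError: Boolean = false
)

val config = ScrapeConfig(
  targetUrl = "http://example.gov/documents?page={x}",
  tableRowXPath = Some("//table[@id='docs']//tr"),
  dateRegex = Some("""\d{2}\.\d{2}\.\d{4}""")
)
```

Defaults cover the common case (GET, a table-based workflow), so a typical configuration only needs the URL, one XPath expression, and a date hint.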

When you have an ExtractorDescriptor instance ready (for Java apps, you can use the builder to create one), you can create a new Extractor(descriptor) and then (usually from a scheduled job) call extractor.extractDocuments(since).

The result is a list of documents (there are two methods: one returns a Scala list and one returns a Java list).
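The described usage can be sketched as follows. The classes here are simplified stand-ins (the real Extractor takes an ExtractorDescriptor and does the actual HTTP fetching and parsing); the point is to show the call pattern and the Scala/Java list duality:

```scala
import java.util.Date
import scala.jdk.CollectionConverters._

// Simplified stand-in for the library's document result type.
case class Document(title: String, url: String, date: Date)

// Simplified stand-in: the real Extractor is constructed from an
// ExtractorDescriptor and fetches documents from the configured site.
class Extractor(descriptor: String) {
  def extractDocuments(since: Date): List[Document] =
    List(Document("Sample decree", "http://example.gov/doc/1.pdf", new Date()))

  // Java-friendly variant, mirroring the two return types the article mentions.
  def extractDocumentsJava(since: Date): java.util.List[Document] =
    extractDocuments(since).asJava
}

val extractor = new Extractor("descriptor-placeholder")
val docs = extractor.extractDocuments(new Date(0L)) // all documents since epoch
```

From a scheduled job, you would pass the timestamp of the previous run as `since`, so each invocation only picks up newly published documents.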

The library depends on HtmlUnit, NekoHtml, Scala, xml-apis, and some others, visible in the POM. It doesn’t support multiple parsers. It also doesn’t handle distributed running of scraping tasks; that you should handle yourself. No JAR release or Maven dependency is published yet; if you need it, it has to be checked out and built. I hope it is useful, though — if not as code, then at least as an approach to getting data from web pages programmatically.

See more at: http://techblog.bozho.net/?p=1215



Published at DZone with permission of

