Don't Build a Crawler (If You Can Avoid It!)

On Tuesday, I spoke at the Data Science London meetup about football data, and I started out by covering some lessons I’ve learned about building data sets for personal use when open data isn’t available.

When that’s the case, you often end up scraping HTML pages to extract the data you’re interested in and then storing it in files or, if you want to be fancier, in a database.

Ideally, we want to spend our time playing with the data rather than gathering it, so we want to keep this stage to a minimum, which we can do by following these rules:

Don’t Build a Crawler

One of the most tempting things to do is build a crawler, which starts on the home page and then follows some or all of the links it comes across, downloading those pages as it goes.

This is incredibly time-consuming, and yet this was the approach I took when scraping an internal staffing application to model ThoughtWorks consultants/projects in Neo4j about 18 months ago.

Ashok wanted to get the same data a few months later, and instead of building a crawler, spent a bit of time understanding the URI structure of the pages he wanted, and then built up a list of pages to download.

It took him just a few minutes to build a script that would get the data, whereas I spent many hours using the crawler-based approach.

If there is no discernible URI structure, or if you want to get every single page, then the crawler approach might make sense, but I try to avoid it as a first option.
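For example, if the pages we want are numbered sequentially, we can generate the whole list with a one-liner. This is a minimal sketch against the made-up site used below; the page count and URI pattern are assumptions you’d replace with whatever structure you discover:

$ for i in $(seq 1 60); do echo "https://www.some-made-up-place.com/page${i}.html"; done > uris.txt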

Download the Files

The second thing I learned is that running WebDriver, Nokogiri, or Enlive against live web pages and storing only the parts of the page we’re interested in is suboptimal.

We pay the network cost every time we run the script, and at the beginning of a data gathering exercise, we won’t know exactly what data we need, so we’re bound to have to run it multiple times until we get it right.

It’s much quicker to download the files to disk and work on them locally.
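Once the pages are sitting in a local directory (say pages/), the extraction step can be re-run as often as we like without touching the network. As a minimal sketch, assuming the data we want lives in a table with a (made-up) ‘results’ class, we could pull it out with xmllint from libxml2:

$ xmllint --html --xpath '//table[@class="results"]//td/text()' pages/page1.html 2>/dev/null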

Use Wget

After I’d spent a lot of time writing different tools to download the ThoughtWorks data set, Ashok asked me why I wasn’t just using Wget instead.

I couldn’t think of a good reason, so now I favor building up a list of URIs and letting Wget take care of downloading them for us. For example:

$ head -n 5 uris.txt
https://www.some-made-up-place.com/page1.html
https://www.some-made-up-place.com/page2.html
https://www.some-made-up-place.com/page3.html
https://www.some-made-up-place.com/page4.html
https://www.some-made-up-place.com/page5.html
 
$ cat uris.txt | time xargs wget
...
Total wall clock time: 3.7s
Downloaded: 60 files, 625K in 0.7s (870 KB/s)
        3.73 real         0.03 user         0.09 sys
If we need to speed things up, we can always use the ‘-P’ flag of xargs, together with ‘-n1’ so that each wget invocation handles a single URI and up to ten of them run in parallel:

$ cat uris.txt | time xargs -n1 -P10 wget
        1.65 real         0.20 user         0.21 sys
It pays to be sensible when using tools like this and, of course, to read the site’s terms and conditions to check what they say about downloading copies of pages for personal use.

Given that you can get the pages using a web browser anyway, it’s generally fine, but it makes sense not to bombard the site with requests for every single page and instead just focus on the data in which you’re interested.
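If we want to be extra polite, Wget also has flags for pacing requests: ‘--wait’ pauses between retrievals, ‘--random-wait’ varies that pause, and ‘--limit-rate’ caps the bandwidth, while ‘-i’ reads the URIs straight from our file. A gentler version of the download above might look like this:

$ wget --wait=2 --random-wait --limit-rate=200k -i uris.txt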

Published at DZone with permission of Mark Needham, DZone MVB.
