
Don't Build a Crawler (If You Can Avoid It!)


On Tuesday, I spoke at the Data Science London meetup about football data, and I started out by covering some lessons I’ve learned about building data sets for personal use when open data isn’t available.

When that’s the case, you often end up scraping HTML pages to extract the data you’re interested in and then storing it in files or, if you want to be fancier, in a database.

Ideally, we want to spend our time playing with the data rather than gathering it, so we want to keep this stage to a minimum, which we can do by following these rules:

Don’t Build a Crawler

One of the most tempting things to do is build a crawler, which starts on the home page and then follows some or all of the links it comes across, downloading those pages as it goes.

This is incredibly time-consuming, and yet it was the approach I took when scraping an internal staffing application to model ThoughtWorks consultants/projects in Neo4j about 18 months ago.

Ashok wanted to get the same data a few months later, and instead of building a crawler, spent a bit of time understanding the URI structure of the pages he wanted, and then built up a list of pages to download.

It took him just a few minutes to build a script that would get the data, whereas I spent many hours using the crawler-based approach.
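If the URI structure is that regular, a short shell loop is often all it takes to build the list. As a rough sketch, assuming a numeric page1.html, page2.html, ... pattern like the made-up site used later in this post:

# generate one URI per line and save them for later downloading
for i in $(seq 1 60); do
  echo "https://www.some-made-up-place.com/page${i}.html"
done > uris.txt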

If there is no discernible URI structure, or if you want to get every single page, then the crawler approach might make sense, but I try to avoid it as a first option.

Download the Files

The second thing I learned is that running WebDriver, Nokogiri, or Enlive against live web pages and then only storing the parts of the page we’re interested in is suboptimal.

We pay the network cost every time we run the script, and at the beginning of a data gathering exercise, we won’t know exactly what data we need, so we’re bound to have to run it multiple times until we get it right.

It’s much quicker to download the files to disk and work on them locally.
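As a rough illustration, once the pages are sitting on disk (however we fetched them), we can re-run the extraction step as many times as we like without paying the network cost again. Here’s a minimal sketch, assuming xmllint is installed and that the page titles happen to be the bit we care about:

# extract the <title> of every downloaded page; adjust the XPath
# once we know which data we actually need
for f in *.html; do
  xmllint --html --xpath '//title/text()' "$f" 2>/dev/null
  echo
done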

Use Wget

After I’d spent a lot of time writing different tools to download the ThoughtWorks data set, Ashok asked me why I wasn’t using Wget instead.

I couldn’t think of a good reason, so now I favor building up a list of URIs and letting Wget take care of downloading them for us. For example:

$ head -n 5 uris.txt
https://www.some-made-up-place.com/page1.html
https://www.some-made-up-place.com/page2.html
https://www.some-made-up-place.com/page3.html
https://www.some-made-up-place.com/page4.html
https://www.some-made-up-place.com/page5.html
 
$ cat uris.txt | time xargs wget
...
Total wall clock time: 3.7s
Downloaded: 60 files, 625K in 0.7s (870 KB/s)
        3.73 real         0.03 user         0.09 sys
If we need to speed things up, we can always use the ‘-P’ flag of xargs to do so:

cat uris.txt | time xargs -n1 -P10 wget
        1.65 real         0.20 user         0.21 sys
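As an aside, if we don’t need the parallelism, Wget can read the list of URIs directly rather than going through xargs:

wget -i uris.txt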
It pays to be reasonably sensible when using tools like this and, of course, to read the terms and conditions of the site to check what they say about downloading copies of pages for personal use.

Given that you can get the pages using a web browser anyway, it’s generally fine, but it makes sense not to bombard the site with requests for every single page and instead just focus on the data in which you’re interested.
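If we do end up downloading a larger set of pages, Wget’s throttling flags are a simple way of being polite about it. For example, something along these lines spaces the requests out rather than firing them all off at once:

# leave a short pause between requests instead of hammering the site
wget --wait=1 --random-wait -i uris.txt

That keeps the load on the site low while still letting us grab everything we need in one pass.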

