On Tuesday, I spoke at the Data Science London meetup about football data, and I started out by covering some lessons I’ve learned about building data sets for personal use when open data isn’t available.
When that’s the case, you often end up scraping HTML pages to extract the data you’re interested in, and then storing it in files, or in a database if you want to be fancier.
Ideally, we want to spend our time playing with the data rather than gathering it, so we want to keep this stage to a minimum, which we can do by following these rules:
Don’t Build a Crawler
One of the most tempting things to do is build a crawler, which starts on the home page and then follows some or all of the links it comes across, downloading those pages as it goes.
This is incredibly time-consuming, and yet this was the approach I took when scraping an internal staffing application to model ThoughtWorks consultants/projects in neo4j about 18 months ago.
Ashok wanted to get the same data a few months later, and instead of building a crawler, spent a bit of time understanding the URI structure of the pages he wanted, and then built up a list of pages to download.
It took him just a few minutes to build a script that would get the data, whereas I spent many hours using the crawler-based approach.
If there is no discernible URI structure, or if you want to get every single page, then the crawler approach might make sense, but I try to avoid it as a first option.
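When the URI structure is predictable, building the list of pages is often just a few lines of shell. A minimal sketch, using the same made-up domain as the examples below:

```shell
# Generate a list of page URIs from a known pattern instead of crawling.
# The domain and path are made-up placeholders.
for i in $(seq 1 5); do
  echo "https://www.some-made-up-place.com/page${i}.html"
done > uris.txt

head -n 2 uris.txt
```

That list can then be handed straight to a downloader, which is exactly the approach described next.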
Download the Files
We pay the network cost every time we run the script, and at the beginning of a data gathering exercise, we won’t know exactly what data we need, so we’re bound to have to run it multiple times until we get it right.
It’s much quicker to download the files to disk and work on them locally.
Having spent a lot of time writing different tools to download the ThoughtWorks data set, Ashok asked me why I wasn’t using Wget instead.
I couldn’t think of a good reason, so now I favor building up a list of URIs and letting Wget take care of downloading them for us. For example:
```shell
$ head -n 5 uris.txt
https://www.some-made-up-place.com/page1.html
https://www.some-made-up-place.com/page2.html
https://www.some-made-up-place.com/page3.html
https://www.some-made-up-place.com/page4.html
https://www.some-made-up-place.com/page5.html

$ cat uris.txt | time xargs wget
...
Total wall clock time: 3.7s
Downloaded: 60 files, 625K in 0.7s (870 KB/s)
        3.73 real         0.03 user         0.09 sys
```

If we need to speed things up, we can always use the ‘-P’ flag of xargs to do so:
```shell
$ cat uris.txt | time xargs -n1 -P10 wget
        1.65 real         0.20 user         0.21 sys
```

It pays to be reasonably sensible when using tools like this, and of course, read the terms and conditions of the site to check what they have to say about downloading copies of pages for personal use.
Given that you can get the pages using a web browser anyway, it’s generally fine, but it makes sense not to bombard the site with requests for every single page and instead just focus on the data in which you’re interested.
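One way to stay on the polite side is to throttle the requests. A sketch using xargs, with echo standing in for wget so it can run without hitting a real site (the URIs are made up):

```shell
# A sketch of throttled downloading: -P2 caps concurrency at two workers,
# and the sleep spaces out requests. Swap 'echo' for 'wget' for real use.
printf 'https://www.some-made-up-place.com/page%d.html\n' 1 2 3 > uris.txt
xargs -n1 -P2 sh -c 'sleep 0.2; echo "fetched $0"' < uris.txt > fetch.log
cat fetch.log
```

wget itself also has a --wait flag that inserts a pause between retrievals, which achieves the same thing without the xargs scaffolding.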