Having worked on a few data related applications over the last ten months or so Ashok and I were recently discussing some of the things that we’ve learnt.
One of the things he pointed out is that it’s very helpful to separate the different stages of a data work flow into their own applications/scripts.
I decided to try out this idea with some football data that I’m currently trying to model and I ended up with the following stages:
The stages do the following:
- Find – Finds web pages which have the data we need and writes those URLs of those to a text file.
- Download – Reads in the URLs and downloads the contents to the file system.
- Extract – Reads in the web pages from the file system and using CSS selectors extracts appropriate data and saves JSON files to disk.
- Import – Reads in the JSON files and creates nodes/relationships in neo4j.
It’s reasonably similar to micro services except instead of using HTTP as the protocol between each part we use text files as the interface between different scripts.
In fact it’s more like a variation of Unix pipelining as described in The Art of Unix Programming except we store the results of each stage of the pipeline instead of piping them directly into the next one.
If following the Unix way isn’t enough of a reason to split up the problem like this there are a couple of other reasons why this approach is useful:
- We end up tweaking some parts more than others therefore it’s good if we don’t have to run all the steps each time we make a change e.g. I find that I spend much more time in the extract & import stages than in the other two stages. Once I’ve got the script for getting all the data written it doesn’t seem to change that substantially.
- We can choose the appropriate technology to do each of the jobs. In this case I find that any data processing is much easier to do in Ruby but the data import is significantly quicker if you use the Java API.
- We can easily make changes to the work flow if we find a better way of doing things.
That third advantage became clear to me on Saturday when I realised that waiting 3 minutes for the import stage to run each time was becoming quite frustrating.
All node/relationship creation was happening via the REST interface from a Ruby script since that was the easiest way to get started.
That batch importer takes CSV files of nodes and edges as its input so I needed to add another stage to the work flow if I wanted to use it:
I spent a few hours working on the ‘Extract to CSV’ stage and then replaced the initial ‘Import’ script with a call to the batch importer.
It now takes 1.3 seconds to go through the last two stages instead of 3 minutes for the old import stage.
Since all I added was another script that took a text file as input and created text files as output it was really easy to make this change to the work flow.
I’m not sure how well this scales if you’re dealing with massive amounts of data but you can always split the data up into multiple files if the size becomes unmanageable.