Data Science: Don't Filter Data Prematurely

Last year I wrote a post describing how I’d gone about getting data for my ThoughtWorks graph. In retrospect, one mistake in my approach was that I filtered the data too early.

My workflow looked like this:

  • Scrape internal application using web driver and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j

The problem with the first step is that I was trying to determine up front what data was useful. As a result, I ended up running the scraping application multiple times whenever I realised I didn’t have all the data I wanted.

Since it took a couple of hours to run each time, it was tremendously frustrating, but it took me a while to realise how flawed my approach was.

For some reason I kept tweaking the scraper just to get a little bit more data each time!

It wasn’t until Ashok and I were doing some similar work and had to extract data from an existing database that I realised the filtering didn’t need to be done so early in the process.

We weren’t sure exactly what data we needed, but on this occasion we extracted everything around the area we were working in and worked out at a later stage how we could actually use it.

Given that it’s relatively cheap to store the data I think this approach makes sense more often than not – we can always delete the data if we realise it’s not useful to us at a later stage.

It especially makes sense if it’s difficult to get more data, either because it’s time consuming or because we need someone else to give us access to it and they are time constrained.

If I could rework that workflow, it’d now be split into three steps:

  • Scrape internal application using web driver and save pages as HTML documents
  • Parse HTML documents and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j
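The first two of these steps could be sketched roughly as follows. This is only an illustration, not the code I actually ran: the `fetch` callable stands in for whatever the web driver does to return a page's HTML, and the function names, file names and extracted fields (page title and links) are all hypothetical. The point is that step 1 saves everything untouched, so step 2 can be changed and re-run in seconds without scraping again.

```python
import json
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Iterable


def save_raw_pages(urls: Iterable[str], fetch: Callable[[str], str], raw_dir: Path) -> None:
    """Step 1: store each page exactly as fetched, with no filtering.

    `fetch` is a hypothetical stand-in for the web driver call.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    for index, url in enumerate(urls):
        (raw_dir / f"page_{index}.html").write_text(fetch(url), encoding="utf-8")


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_to_json(raw_dir: Path, json_dir: Path) -> None:
    """Step 2: decide what is useful, reading only the saved HTML.

    Changing what counts as "useful" means editing and re-running this
    function, not the slow scrape in step 1.
    """
    json_dir.mkdir(parents=True, exist_ok=True)
    for html_file in sorted(raw_dir.glob("*.html")):
        parser = TitleAndLinks()
        parser.feed(html_file.read_text(encoding="utf-8"))
        record = {"title": parser.title.strip(), "links": parser.links}
        (json_dir / f"{html_file.stem}.json").write_text(
            json.dumps(record), encoding="utf-8"
        )
```

The third step, loading the JSON files into neo4j, stays the same as before, since it only ever reads the JSON output.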

I think my experiences tie in reasonably closely with those I heard about at Strata Conf London, but of course I may well be wrong, so if anyone has other points of view I’d love to hear them.


Published at DZone with permission of
