
Data Science: Don't Filter Data Prematurely


Last year I wrote a post describing how I’d gone about getting data for my ThoughtWorks graph, and in retrospect, one mistake in my approach was that I filtered the data too early.

My workflow looked like this:

  • Scrape internal application using web driver and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j

The problem with the first step is that I was trying to determine up front which data was useful, and as a result I ended up running the scraping application multiple times when I realised I didn’t have all the data I wanted.

Since each run took a couple of hours, this was tremendously frustrating, but it took me a while to realise how flawed my approach was.

For some reason I kept tweaking the scraper just to get a little bit more data each time!

It wasn’t until Ashok and I were doing some similar work and had to extract data from an existing database that I realised the filtering didn’t need to be done so early in the process.

We weren’t sure exactly what data we needed, but on this occasion we extracted everything around the area we were working in and looked at how we could actually use it at a later stage.

Given that it’s relatively cheap to store the data, I think this approach makes sense more often than not – we can always delete the data later if we realise it isn’t useful.

It especially makes sense when it’s difficult to get more data, either because doing so is time consuming or because we need access from someone else whose time is constrained.

If I could rework that workflow, it’d now be split into three steps:

  • Scrape internal application using web driver and save pages as HTML documents
  • Parse HTML documents and save useful data to JSON files
  • Parse JSON files and load nodes/relationships into neo4j
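The three steps above can be sketched in Python. Everything here is hypothetical – the page structure, the `person`/`project` CSS classes, and the Cypher schema are made-up stand-ins, since the original internal application isn’t described in detail – but the shape of the pipeline is the point: step 1 saves raw HTML verbatim with no filtering, and only steps 2 and 3 decide what’s useful.

```python
import json
from html.parser import HTMLParser

# Step 1 (sketched): with a web driver you'd save each page verbatim, e.g.
#   Path("raw/page-1.html").write_text(driver.page_source)
# deferring any decision about which fields matter until later.


class PersonProjectParser(HTMLParser):
    """Step 2: pull text out of hypothetical <li class="person"> and
    <li class="project"> elements from the saved HTML documents."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.people, self.projects = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        if self.current == "person":
            self.people.append(data.strip())
        elif self.current == "project":
            self.projects.append(data.strip())
        self.current = None


def html_to_json(html: str) -> dict:
    """Parse a saved HTML document into the useful-data JSON record."""
    parser = PersonProjectParser()
    parser.feed(html)
    return {"people": parser.people, "projects": parser.projects}


def json_to_cypher(record: dict) -> list:
    """Step 3: turn a JSON record into Cypher CREATE statements for neo4j."""
    statements = [f"CREATE (:Person {{name: '{p}'}})" for p in record["people"]]
    statements += [f"CREATE (:Project {{name: '{p}'}})" for p in record["projects"]]
    return statements


if __name__ == "__main__":
    page = '<ul><li class="person">Mark</li><li class="project">Graph</li></ul>'
    record = html_to_json(page)
    print(json.dumps(record))
    for stmt in json_to_cypher(record):
        print(stmt)
```

Because step 2 works off files on disk rather than the live application, changing your mind about which fields are useful only means re-running the cheap parsing step, not the hours-long scrape.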

I think my experiences tie in reasonably closely with those I heard about at Strata Conf London, but of course I may well be wrong, so if anyone has other points of view I’d love to hear them.



Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

