
Do You Trust the Insights You Get? Implications for Big Data & Analytics

It's hard to know which insights to trust when you don't know exactly where they came from. Here's how we learned to trust our data as we built a brand-new SaaS.


Do you trust the insights you get from your Big Data analytics? What factors should you consider when deciding whether you can trust the results? In this article I consider the most important questions. Can you answer yes or no to the questions I pose?

Trust is such a powerful, emotive word. Soon after I joined Informatica to run Research & Development for Data Quality, in one of my first conversations with my new boss, Ivan Chong, I asked: when talking to customers, what are the key messages? In his answer, he mentioned the word trust. What a way to sum up the Data Profiling and Data Quality features in a single word: customers can trust their data more.

In my role as a database migration guru at Oracle, I did not use the word trust, but I believe I tried to get across the same sentiment. I talked about equivalence in what the Oracle Migration Workbench did. If you had 10 tables and 1 million rows in total, then after migrating to Oracle you should have the same 10 tables with equivalent data types and the same 1 million rows present. I also distinguished between invasive and noninvasive changes. Oracle's partitioning capabilities were noninvasive because they were transparent to an application or user. Replacing a standard view with a materialized view could be considered an invasive change because it affects the timeliness of the results returned. This may be fine for applications and users, but it should be a deliberate decision.
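The equivalence idea above can be sketched as a small check that compares table names, column types, and row counts before and after a migration. This is only an illustration under stated assumptions: it uses Python's built-in SQLite module for both sides so it can run anywhere, it is not how the Oracle Migration Workbench works, and the function names `table_summary` and `migration_differences` are hypothetical.

```python
import sqlite3


def table_summary(conn):
    """Return {table_name: (column_name_and_type_list, row_count)}."""
    summary = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        columns = [
            (col[1], col[2])
            for col in conn.execute(f'PRAGMA table_info("{name}")')
        ]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        summary[name] = (columns, row_count)
    return summary


def migration_differences(source, target):
    """List every table whose presence, column types, or row count differs."""
    src, tgt = table_summary(source), table_summary(target)
    differences = []
    for name in sorted(set(src) | set(tgt)):
        if name not in tgt:
            differences.append(f"{name}: missing in target")
        elif name not in src:
            differences.append(f"{name}: unexpected in target")
        elif src[name] != tgt[name]:
            differences.append(f"{name}: source {src[name]} != target {tgt[name]}")
    return differences
```

An empty list from `migration_differences` means the two sides agree on these three measures. It says nothing about whether individual values match, which would need a row-level comparison.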

At Singularities, as we bring our SaaS-based solution for business users to market, trust features large in the decisions we are making. These are the questions we are asking ourselves:

Do I Trust My Hosting Vendor?

As we will deliver our solution via SaaS, we must pose this question. If I look at the current market leader, Amazon Web Services (AWS), I see a section on their website that addresses security. Looking through their feature list, and at the recent Forrester Wave™: Public Cloud Platform Service Providers’ Security, Q4 2014, where they are shown as the outright leader, I am feeling confident I can answer yes to that question.

Do I Trust My Platform?

To get the scalability and openness we want from a platform, we chose Hadoop. Can I trust Hadoop? To answer that question, you must consider which Hadoop distribution you will select, how you will configure it, and whether the hosting vendor supports the one you selected. In one of the popular Hadoop distributions, Cloudera, I am delighted to find:

I think I can answer yes to this question.

Do I Trust My Raw Data?

How can I trust my raw data? Well, you cannot trust it 100%, but you can take steps to build your confidence in it. If I put my Data Wrangler hat on: give me the raw data as it comes from the entity generating it. Unlearn your classical data warehouse techniques; don’t model it and pick out what IT thinks is most interesting. Please let me do that. Given how scalable Hadoop is as a platform, I will often keep the data in its original raw format until a provable insight is found. I can then optimize the data pipeline. Please don’t filter or roll up the data, because I might be interested in finding patterns over time (time series analysis) or the needle in the haystack. With a powerful platform to crunch the data, let me find those valuable golden needles. Use the data profiling capabilities of your chosen data wrangling tools to uncover data quality issues and drive quality improvements into the data set. Using this approach, you can get comfortable saying yes.
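To make the profiling step concrete, here is a minimal sketch of the kind of summary a data wrangling tool produces for one column of raw data. It is an assumption-laden illustration in plain Python, not the output of any particular product; the function name `profile` and the sample `country` column are hypothetical.

```python
from collections import Counter


def profile(rows, column):
    """Summarize one column of raw records without filtering or cleaning it."""
    values = [row.get(column) for row in rows]
    # Treat None and the empty string as missing; everything else is kept as-is.
    present = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(3),
    }


raw_rows = [
    {"country": "IE"},
    {"country": "ie"},   # same country, inconsistent casing
    {"country": ""},     # missing value
    {"country": "IE"},
]
print(profile(raw_rows, "country"))
```

Here the profile surfaces two typical quality issues, a missing value and inconsistent casing, while leaving the raw data untouched.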

Do I Trust My Analytics?

Which analytical functions am I using, and why did I choose them? What parameters did I use? With a machine learning algorithm, what training set did I use? Is it very different from the data set I am now using? From our own perspective, Singularities is a well-founded platform that learns, stores, and interacts with comprehensive and precise models of individual people and entities, so they can be used in applications that stimulate, predict, diagnose, and explore recommendations to influence their behaviors under diverse scenarios. Singularities models can be agents that perform sophisticated operations in autonomous systems. Singularities is based on a powerful mathematical theory of information modeling. It uses equations of variables and information to represent entities and their belief states and behaviors. I can answer yes to this question for Singularities.
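One simple way to ask whether the training set differs from the data you are now scoring is to compare a feature's mean in both, measured in training standard deviations. The sketch below is one rough check among many and is not how Singularities works; the function name `mean_shift` and the threshold of 3 are assumptions chosen for illustration.

```python
import statistics


def mean_shift(training, current):
    """Distance between the two means, in training standard deviations."""
    spread = statistics.stdev(training)
    if spread == 0:
        raise ValueError("training values have no spread to compare against")
    return abs(statistics.mean(current) - statistics.mean(training)) / spread


training_ages = [10, 12, 11, 13, 9]
current_ages = [20, 22, 21]

# Assumed rule of thumb: a shift above 3 deserves a closer look.
if mean_shift(training_ages, current_ages) > 3:
    print("current data looks very different from the training set")
```

A large shift does not prove the model is wrong, but it is a prompt to revisit the training set before trusting the results.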

Do I Trust What I See?

What visualization tool is displaying the insights gained from your trusted analytics? Is it accessing those insights live, or is it using a caching mechanism? For the visualizations you selected for your dashboard, are you doing any pre-filtering that will not be obvious in the displayed results?
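Both concerns, cached results and hidden pre-filtering, can be made visible by attaching caveats to each dashboard tile. The following is a minimal sketch under assumed inputs: the function name `tile_caveats` and its parameters are hypothetical and do not correspond to any specific visualization tool.

```python
from datetime import datetime, timedelta, timezone


def tile_caveats(computed_at, now, max_age, rows_total, rows_shown):
    """Return human-readable warnings to display next to a dashboard tile."""
    caveats = []
    age = now - computed_at
    if age > max_age:
        # The tile is served from a cache older than the agreed tolerance.
        caveats.append(f"cached result is {age} old")
    if rows_shown < rows_total:
        hidden = rows_total - rows_shown
        caveats.append(f"{hidden} of {rows_total} rows filtered out before display")
    return caveats


now = datetime(2015, 1, 1, 12, 0, tzinfo=timezone.utc)
print(tile_caveats(
    computed_at=now - timedelta(hours=3),
    now=now,
    max_age=timedelta(hours=1),
    rows_total=100,
    rows_shown=80,
))
```

Showing these caveats alongside the chart lets the viewer decide how much weight to give what they see.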

This article sets out the major questions to consider when deciding whether to trust those compelling visualizations you are seeing. I don’t think trust should be assumed; it should be earned. I have pointed out five questions you should be able to answer yes to. We are interested in your thoughts on trust and the implications for Big Data and analytics. You can email me at: donal.daly@singularities.com.


data wrangling,hadoop,big data,analytics,visualization
