It’s been nine months since I first learned about Hortonworks DataFlow (HDF), which is powered by Apache NiFi, Kafka and Storm. Back then, I immediately saw the productivity benefits that the Apache NiFi component of HDF would have brought to my previous work analyzing mobile subscriber usage patterns. Real-time control over dataflows delivers unprecedented operational efficiency: you can move data into systems faster, more easily and more securely, leaving more time for data analysis instead of spending the bulk of your energy on the mundane formatting, cleaning, parsing, extracting, preparing, moving, FTP-ing and copying needed just to get the right data into the right system before analysis can even begin.
Contextual Data is Critical
Beyond that, though, is the incredibly empowering (and time-saving) real-time data provenance capability of Apache NiFi, which dynamically shows you a visual map of the lineage of the data you are working with. Once you start any type of analysis, you often discover that the inputs may not be quite what you expected. People say numbers lie, and it’s true. Contextual metadata is critical to determining whether input data is “good” or “bad.” The ability to immediately trace through the dataflow to find the “source of truth” is vital for accurate analysis and insight. For example, the term “five degrees” can have different implications based on context and definition; “five degrees Fahrenheit” and “five degrees Celsius” imply quite different things. In California, a temperature reading of five degrees Celsius likely implies a normal winter evening, but in Canada the same reading implies spring or autumn weather. If it were five degrees Fahrenheit, in one location it would be plausible and in the other it would be abnormal. In this example, the time and location of the data provide context for what is likely normal versus what is outside the norm. Analysis without context is meaningless, because context provides vital indicators of how the source data should or should not be used within an analytical framework.
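The “five degrees” point can be made concrete in a few lines of code. This is a minimal sketch, not anything from NiFi itself: the function names and the winter temperature ranges are illustrative assumptions invented for the example, not real climatology.

```python
# Hypothetical sketch: why contextual metadata (units, location) changes
# the meaning of a raw reading like "five degrees".

def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit reading to Celsius."""
    return (f - 32) * 5.0 / 9.0

# Rough winter temperature ranges in degrees Celsius -- assumed values
# for illustration only.
TYPICAL_WINTER_RANGE_C = {
    "California": (5, 20),
    "Canada": (-25, 0),
}

def is_typical(reading, unit, location):
    """Return True if the reading falls inside the location's assumed winter range."""
    celsius = reading if unit == "C" else fahrenheit_to_celsius(reading)
    low, high = TYPICAL_WINTER_RANGE_C[location]
    return low <= celsius <= high

# The same number, "five degrees", reads very differently with context:
print(is_typical(5, "C", "California"))  # True: a normal winter evening
print(is_typical(5, "F", "California"))  # False: 5 °F is -15 °C, abnormal there
print(is_typical(5, "F", "Canada"))      # True: -15 °C is unremarkable in winter
```

Without the unit and location attached to the reading, none of these judgments are possible, which is exactly the gap that provenance and contextual metadata close.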
When it comes to big data analytics, real-time provenance enables the immediate optimization of the algorithms necessary to catapult businesses forward. Apache NiFi dynamically generates a visual chain of custody for each and every piece of data flowing through the system. An end user can point and click on any piece of data and see, in real time, how that data is traversing its path from source to destination.
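The same lineage that the UI shows on a point-and-click is also reachable programmatically through NiFi’s REST API, which exposes a provenance query endpoint (`POST /nifi-api/provenance`). The sketch below builds a query scoped to a single flow file; the base URL and the example UUID are assumptions, and the exact payload fields can vary between NiFi versions, so treat it as an illustration rather than a drop-in client.

```python
import json
import urllib.request

# Assumed local NiFi instance; adjust host/port for a real deployment.
NIFI_URL = "http://localhost:8080/nifi-api"

def build_provenance_query(flowfile_uuid, max_results=100):
    """Build the JSON body for a provenance query scoped to one flow file."""
    return {
        "provenance": {
            "request": {
                "maxResults": max_results,
                "searchTerms": {"FlowFileUUID": flowfile_uuid},
            }
        }
    }

def submit_query(flowfile_uuid):
    """Submit the query. NiFi answers asynchronously: the caller then polls
    GET /provenance/{id} until the query reports finished, and reads the
    lineage events (CREATE, ROUTE, SEND, ...) from the result."""
    body = json.dumps(build_provenance_query(flowfile_uuid)).encode()
    req = urllib.request.Request(
        f"{NIFI_URL}/provenance",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example UUID is a placeholder, not a real flow file.
query = build_provenance_query("11111111-2222-3333-4444-555555555555")
print(json.dumps(query, indent=2))
```

Walking the returned events from destination back to source is what reconstructs the chain of custody the article describes.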
Contextual information from the data provenance capability of Apache NiFi is key to Prescient, a risk management company that pulls data from 48,000+ sources for its traveler-safety mobile and dashboard applications. Prescient uses Hortonworks Connected Data Platforms (HDF and HDP), along with EMC Isilon storage, to perform big data analysis and address risks for employees of corporations and government agencies who travel all over the world. These risks include, but are not limited to, business reputation, liability costs and, perhaps most importantly, personal safety.
In this EMC World interview, Mike Bishop, Managing Director and Chief Systems Architect at Prescient, explains how the company uses technology to keep people safe. Prescient pulls information from its 48,000+ sources to determine which physical, health and environmental factors are most relevant to the business continuity and personal safety of specific travelers.
In the interview, Mike explains how Prescient uses the data provenance capability of Apache NiFi to optimize and tune its algorithms and to identify which sources are relevant and which are not:
“Apache NiFi allows us to really cull our sources so we can limit feeds that may not be paying off, but also use those same data points to corroborate and validate the veracity of information, which is very important.”
A Comprehensive Solution
Components of the Prescient solution include EMC Isilon and XtremIO, Hortonworks Hadoop and Apache NiFi, along with SAP HANA, MongoDB, and geospatial frameworks such as QGIS and ArcGIS. To learn more, please follow these links: