Apache NiFi, Not From Scratch
An introduction to implementing Apache NiFi, a flow-based data processing technology that analyzes your data in motion.
If you haven't heard about it yet, Apache NiFi is a recent addition to the list of big data technologies that Hortonworks is helping to develop in the open source community. Whereas Hadoop is a data-at-rest and data processing platform, NiFi is specifically a data-in-motion technology that uses a flow-based processing paradigm. If you'd like more history and background on NiFi, take a look at the official overview.
At first glance, NiFi will look very different to developers with different backgrounds. Java programmers think it's one thing; ETL developers think something else. This article uses the example of preparing our population health solution to scale from just a few hundred to millions of individual lives to highlight the following topics:
- NiFi from a Java developer's perspective
- NiFi from an ETL developer's perspective
- Rewriting your existing solution from scratch using NiFi
- Understanding the places where NiFi really helps
- Adapting your existing Java project to run in NiFi without starting from scratch
Is NiFi Really All That?
There are two different ways that most developers are going to come at NiFi:
- From a traditional software programming background with some data processing experience, but mostly application-focused development, or
- From a data warehousing background where extract, transform, and load (ETL) tools are the norm.
Adventurous and open-minded developers will look at the NiFi tutorials on how to stream the Twitter gardenhose into HDFS and simply be excited to learn something new. Unless they're already facing big data scalability challenges, skeptics from the application development camp might see NiFi as a point-and-click waste of time where configuration files and dynamic code will be much more flexible.
NiFi for ETL
The skeptics from the ETL camp might scoff at NiFi and write it off as big data folks trying to recreate the ETL wheel. In all of these cases, the project managers associated with these developers probably see the potential for a huge hit to productivity as developers want (or are told) to rewrite existing code using NiFi.
But rest assured: it doesn't have to be that way. Instead of starting from scratch with NiFi as the foundation, it's worthwhile to invert the problem and see if your project can't simply refactor itself with NiFi in mind. At Amitech Solutions and Big Cloud Analytics, we have been working with the real-world scenario of wearable fitness devices — something more relevant in healthcare than the canned Twitter tutorial.
Fitness device manufacturers are making the data from their devices available to partners using web-based APIs. The model for each of these vendors tends to be fairly similar and consistent with many consumer application APIs: authenticate yourself as a user, then query the API for data using that session key. In the healthcare industry, startups are looking for ways to use data from wearable devices to help inform services such as population health management.
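To make that pattern concrete, here is a minimal Java sketch of the authenticate-then-query handshake these vendor APIs tend to follow. The endpoint URLs, parameter names, and token header below are made up for illustration; every vendor's real API differs in the details.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    public class VendorApiSketch {
        public static void main(String[] args) throws Exception {
            // Step 1: authenticate as a user and receive a session token (hypothetical endpoint).
            HttpURLConnection auth = (HttpURLConnection)
                    new URL("https://api.example-vendor.com/login").openConnection();
            auth.setRequestMethod("POST");
            auth.setDoOutput(true);
            auth.getOutputStream().write("username=jdoe&password=secret".getBytes(StandardCharsets.UTF_8));
            String token = readAll(auth.getInputStream()).trim(); // assume the body is just the token

            // Step 2: query the API for data, passing the session token in a header (hypothetical header).
            HttpURLConnection query = (HttpURLConnection)
                    new URL("https://api.example-vendor.com/activity?start=2016-07-01&end=2016-07-02").openConnection();
            query.setRequestProperty("Authorization", "Bearer " + token);
            System.out.println(readAll(query.getInputStream()));
        }

        private static String readAll(InputStream in) {
            try (Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
                return s.hasNext() ? s.next() : "";
            }
        }
    }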
In our particular case at Amitech, we've been working with a legacy code base that was developed during the early startup days of the company. It's worked fine for just a few customers, covering a few thousand individuals. As the early pilots have wrapped up and the business case has become clearer to other prospective customers, the solution has needed to scale to hundreds of thousands and millions of individual fitness devices. The project has, all of a sudden, moved into the realm of big data that needs to be ingested from a huge and growing set of devices in all their many formats.
NiFi From Scratch
As the lead architect responsible for planning to adapt the solution for scalability, I felt that NiFi, Storm, Spark Streaming, or a similar technology would be a sensible part of the solution, so I put on my traditional data integration/ETL mindset and took a look at the available NiFi processors.
The list is exhaustive and very different from the kinds of components available in traditional ETL tools. At the time of this post, there are more than 135 different processors listed. Of course, the list includes a complete set of HTTP- and REST-related processors that I could use to communicate with the fitness device vendors' APIs. So, I wired together a simple series of processors that would take username and password as input, authenticate against an API to retrieve an authorization token, add that token to the HTTP header, and then query the API for the data set that I wanted.
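For readers who want to picture that chain, here is a rough sketch of how such a flow can be assembled from stock NiFi processors. The article doesn't name the exact processors we used, so treat this as one plausible arrangement rather than our actual flow:

    GenerateFlowFile     (carry the username, password, and date-range attributes)
      -> InvokeHTTP      (POST to the vendor's login endpoint)
      -> EvaluateJsonPath (pull the session token out of the response)
      -> UpdateAttribute  (build the authorization header value from the token)
      -> InvokeHTTP       (GET the activity data with the token attached)
      -> PutHDFS / PutFile (land the response somewhere downstream)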
Yeah! A simple, real-world use of NiFi that isn't about Twitter! A working and practical application is always something to celebrate, but it occurred to me that with this approach, we'd need to rebuild the existing Java libraries already written for each type of device using a similar kind of approach. Not hard, but certainly a hit to the project timeline and a risk to data integrity as we made the transition to the new model.
NiFi Not From Scratch
Typically, data warehousing and ETL tool vendors haven't recommended that we write our own custom components. After all, the target market for ETL tools is a space where the tools are specifically marketed as reducing the need for "error prone and time consuming" manual coding. When I ran across this tutorial on writing your own NiFi processor, it occurred to me that NiFi is the exact opposite. It's both open source and designed for extensibility from the ground up. I found it quite reasonable to write a custom NiFi processor that leverages our existing code base.
The existing code is a Java program with separate classes for each device vendor, all with the same interface to abstract the nuances of each vendor from the main data export program. This interface follows a traditional paradigm: login, query, query, query, logout. Given that my input to NiFi above takes in simple username, password, and query criteria arguments, it seems trivial to create a NiFi processor class that adapts the existing code into the NiFi API.
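The article doesn't show the vendor interface itself, but based on the calls that appear in the processor code below, a minimal sketch of that abstraction might look something like this (the method names come from the snippet; everything else is assumed):

    // Minimal sketch of the vendor abstraction, inferred from the processor code below.
    // The real classes almost certainly carry more state and error handling.
    public abstract class AbstractDevice {

        // Authenticate against the vendor API; returns true when a session was established.
        public abstract boolean login(String username, String password);

        // Query the vendor API for activity data in the given date range.
        public abstract void queryVendor(String startDate, String endDate);

        // Return the most recently queried data, serialized as a string (e.g. JSON).
        public abstract String getDataAsString();

        // Close the session with the vendor API.
        public abstract void logout();
    }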
We added a few dependencies and a builder to the Maven POM file, and Maven generates the NAR file that needs to be deployed into NiFi. After a quick restart, the new processor shows up in the list of available processors, and the new flow is much simpler than the earlier series of HTTP and attribute-parsing processors. Here's a slightly abbreviated version of the actual code (in reality, it's all of 70 lines of code):
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    final ProcessorLog log = this.getLogger();
    final AtomicReference<String> value = new AtomicReference<>();
    boolean success = false;
    FlowFile flowFile = session.get();
    try {
        AbstractDevice vendor = null;
        String v = context.getProperty(DEVICE).evaluateAttributeExpressions(flowFile).getValue();
        switch (v) {
            // … instantiate the specific device class depending on the flow file attribute
            default:
                log.error("Invalid device vendor type: " + v);
                throw new ProcessException("Unable to determine vendor type: " + v);
        }
        // … get the various other attributes we need to call the API
        // Here's where we actually query the vendor API.
        if (vendor.login(userProp, passProp)) {
            vendor.queryVendor(startDate, endDate);
            value.set(vendor.getDataAsString());
            if (!vendor.getDataAsString().contentEquals("")) {
                success = true;
            }
        }
        // Write the vendor's response into the flow file content.
        flowFile = session.write(flowFile, new OutputStreamCallback() {
            public void process(OutputStream out) throws IOException {
                out.write(value.get().getBytes());
            }
        });
        // Route the flow file to the processor's SUCCESS or FAILURE relationship.
        if (success) {
            session.transfer(flowFile, SUCCESS);
        } else {
            session.transfer(flowFile, FAILURE);
        }
    } catch (Exception e) {
        // … error handling omitted in this abbreviated listing
    }
}
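For reference, the Maven changes mentioned above amount to very little. Here is a minimal sketch of the relevant pom.xml pieces, assuming the standard nifi-nar-maven-plugin; the version numbers are illustrative for the NiFi releases of that era, so adjust them to your own:

    <!-- Sketch of the pom.xml additions for building a NAR; versions are illustrative. -->
    <packaging>nar</packaging>

    <dependencies>
        <dependency>
            <groupId>org.apache.nifi</groupId>
            <artifactId>nifi-api</artifactId>
            <version>0.7.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.nifi</groupId>
            <artifactId>nifi-processor-utils</artifactId>
            <version>0.7.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.nifi</groupId>
                <artifactId>nifi-nar-maven-plugin</artifactId>
                <version>1.1.0</version>
                <extensions>true</extensions>
            </plugin>
        </plugins>
    </build>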
How Does NiFi Add Value?
So, it seems that in the case of this solution, it will be fairly reasonable to adapt existing code to run within the NiFi framework without introducing a lot of the time and risk of rewriting the core business logic with a new tool. What, then, are the benefits of embedding this existing process into NiFi? The data flow in the traditional program (a rough code sketch follows the list) was to:
- Query the operational database for a list of individuals to process
- For each individual:
  - Log in to the vendor API
  - Query the vendor API for data
  - Parse the data into a normalized format
  - Save to the RDBMS
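A rough sketch of that serial loop, with every class and method name invented here purely for illustration:

    // Hypothetical sketch of the original serial export loop; all names are invented.
    import java.util.List;

    public class SerialExportJob {

        interface Individual { String username(); String password(); }
        interface OperationalDb { List<Individual> findIndividualsToProcess(); }
        interface VendorClient {            // stands in for the per-vendor classes behind the common interface
            boolean login(String user, String pass);
            String queryAndNormalize(String startDate, String endDate);
            void logout();
        }
        interface Warehouse { void save(Individual person, String normalizedRecord); }

        public void run(OperationalDb db, VendorClient vendor, Warehouse rdbms, String start, String end) {
            // One pass over every individual, strictly in sequence — the bottleneck described above.
            // (The real program also picks the right vendor class per individual.)
            for (Individual person : db.findIndividualsToProcess()) {
                if (vendor.login(person.username(), person.password())) {
                    rdbms.save(person, vendor.queryAndNormalize(start, end));
                    vendor.logout();
                }
            }
        }
    }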
This serial process works fine for a few hundred or a few thousand users; the processing takes under an hour. With a million users, though, the process can't be run serially in any reasonable timeframe. It would take several days to get through one day's worth of processing. The logical response is to find some way of parallelizing the process.
We could set up and manage multiple instances of the Java program on multiple servers. In that model, though, there are a lot of new risks involved in managing the infrastructure and deployment of code. Not that those are insurmountable, but they are exactly the kind of thing the NiFi framework accommodates easily. NiFi also gives us robust data provenance without any additional programming.
The data provenance log keeps track of every FlowFile (a combination of data and attributes) and each of the transformations that happen to that FlowFile along the way.
Anyone who's ever been involved in the operational support and debugging of a data integration and data processing application can see the strength of these data provenance features.
Next Steps
Another thing that we plan to do with this project is migrate the backend data store from its current RDBMS to something more flexible and easier to scale, like HBase or MongoDB. As it turns out, integrating the existing business logic into NiFi will make this process significantly easier. Instead of having to rip into the existing Java program and add new classes for writing to the new data store, we can simply route and store the same data within NiFi. NiFi already has the processors for doing this in a distributed and scalable way.
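As a rough illustration, switching or adding a data store becomes a matter of connecting the custom processor's success relationship to one of the stock NiFi processors for those stores. The custom processor name below is a stand-in; PutHBaseJSON and PutMongo are standard NiFi processors that accept JSON flow file content:

    DeviceVendorProcessor --success--> PutHBaseJSON   (write each record to HBase, no Java changes)
    DeviceVendorProcessor --success--> PutMongo       (insert each record as a MongoDB document, no Java changes)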
Final Thoughts
For anyone who has an existing application that needs to scale or is costing too much to scale, take a serious look at simply wrapping your current business logic into a custom Apache NiFi processor.
- Get beyond the Twitter garden hose example.
- Writing a custom processor for NiFi is relatively simple.
- Other processors in NiFi make it easy to adapt inputs and outputs to your existing code.
- Data provenance in NiFi is an incredibly valuable feature for support and troubleshooting.
- If you create something broadly useful, ask about contributing it to the NiFi community!
NiFi is enabling our population health management solution to quickly scale to track and help improve the health of millions of individuals across the globe. Imagine what you could do with NiFi for your business and your industry…
About the author: Paul Boal is the big data practice lead at Amitech Solutions. At StampedeCon in St. Louis on July 26-28, 2016, he will be presenting more details on the use of NiFi and Hadoop to manage and analyze data from wearable fitness devices in a population health management solution with Big Cloud Analytics.