We just concluded our highly attended 7-part Data-In-Motion webinar series. The final installment was a very informative session on how Apache NiFi, Kafka, and Storm work together. Slides and Q&A below.
Hortonworks Community Article by Bryan Bende: Integrating Apache NiFi and Apache Kafka
Q&A from Webinar
Can we get the slides for today’s presentation?
Yes, the slides are posted to the Hortonworks SlideShare channel. All slides in the 7-part HDF webinar series on Harnessing Data in Motion are posted here.
NiFi, Kafka, & Storm Questions
Do NiFi and Kafka overlap in functionality?
This is a very common question, and the two are actually complementary. A Kafka broker provides very low latency, especially when you have a large number of consumers pulling from the same topic. However, Kafka is not designed to solve dataflow challenges such as data prioritization and enrichment. Furthermore, Kafka prefers smaller messages, in the KB to MB range, while NiFi handles messages of arbitrary size, up to GB per file or more. NiFi complements Kafka by solving the dataflow problems that Kafka does not address.
In this truck data example, do we need to write custom code in Kafka/Storm, or is everything managed within NiFi components?
In this example the only code that was written was the Storm topology to calculate the average speed over a window. The Storm topology made use of the provided KafkaSpout and KafkaBolt, and only required implementing two other bolts to parse the data and calculate the average. The data flow from source to Kafka was managed by MiNiFi and NiFi, and dataflow from Kafka to the dashboard was managed by NiFi.
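To make the "average speed over a window" idea concrete, here is a minimal Python sketch of that calculation. It is not the Storm topology from the webinar, which used Storm bolts together with the provided KafkaSpout and KafkaBolt. The record format ("truck_id,speed"), the class name, and the count-based window are all assumptions chosen for illustration; the webinar did not specify them.

```python
from collections import defaultdict, deque


def parse_event(line):
    """Parse a hypothetical 'truck_id,speed' record (format is an assumption)."""
    truck_id, speed = line.strip().split(",")
    return truck_id, float(speed)


class WindowedAverageSpeed:
    """Keeps the last `window_size` speed readings per truck and averages them."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        # One bounded deque per truck; old readings fall off automatically.
        self._readings = defaultdict(lambda: deque(maxlen=window_size))

    def add(self, truck_id, speed):
        """Record a reading and return the current windowed average."""
        window = self._readings[truck_id]
        window.append(speed)
        return sum(window) / len(window)
```

In a real topology, the parsing step and the averaging step would roughly correspond to the two custom bolts mentioned above, with Kafka supplying the input and receiving the output.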
Why Storm and not Spark for this example?
Storm is the stream processing platform packaged with HDF, and this example was based on HDF for the overall architecture. A similar approach could be taken with Spark, or other stream processing platforms.
Isn’t PutKafka compatible with (and recommended for) Kafka 0.9 and Kafka 0.10, since with PublishKafka “there are cases where the publisher can get into an indefinite stuck state”?
It is recommended to use the processor that is built against the Kafka client matching the broker being used. This means using PutKafka with a 0.8 broker, PublishKafka with a 0.9 broker, and PublishKafka_0_10 with a 0.10 broker.
Does NiFi have a backend to store data for a dashboard?
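The version-to-processor rule above can be written down as a small lookup. This is only an illustrative Python helper, not part of NiFi; the function name is hypothetical, and it covers only the three broker versions named in the answer.

```python
def kafka_publish_processor(broker_version):
    """Return the NiFi publish processor recommended for a Kafka broker version.

    Hypothetical helper: encodes only the 0.8 / 0.9 / 0.10 guidance given above.
    """
    # Compare on major.minor only, so "0.9.0.1" is treated as 0.9.
    major_minor = tuple(int(part) for part in broker_version.split(".")[:2])
    recommended = {
        (0, 8): "PutKafka",
        (0, 9): "PublishKafka",
        (0, 10): "PublishKafka_0_10",
    }
    if major_minor not in recommended:
        raise ValueError("no recommendation for broker version " + broker_version)
    return recommended[major_minor]
```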
No, NiFi has internal repositories used to power the data flow, but these are not meant to build applications against. NiFi can be used to ingest data into many different tools that can be used to build dashboards. In this example, NiFi was ingesting data into Solr with a Banana dashboard.
NiFi Questions (More NiFi Q&A from Intro to Hortonworks Dataflow here)
Can NiFi connect to external sources like Twitter?
Absolutely. NiFi has a very extensible framework, allowing developers and users to add a data source connector quite easily. In the previous release, NiFi 1.0, we had 170+ processors bundled with the application by default, including the Twitter processor. Moving forward, new processors and extensions can be expected in every release.
Does NiFi have any connectors with any RDBMS database?
Yes, you can use different processors bundled in NiFi to interact with an RDBMS in different ways. For example, “ExecuteSQL” allows you to issue a SQL SELECT statement to a configured JDBC connection to retrieve rows from a database; “QueryDatabaseTable” allows you to incrementally fetch from a DB table; and “GenerateTableFetch” allows you to not only incrementally fetch the records, but also fetch against source table partitions. For more details regarding different processors: https://nifi.apache.org/docs.html
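The idea behind an incremental fetch can be sketched in a few lines of Python: remember the largest value seen in some increasing column, and on the next run select only rows above it. This is a conceptual illustration using an in-memory SQLite database, not how the NiFi processors are implemented; the table name, column names, and function name are assumptions made up for the example.

```python
import sqlite3


def fetch_new_rows(conn, last_max_id):
    """Return rows with id greater than last_max_id, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, speed FROM truck_events WHERE id > ? ORDER BY id",
        (last_max_id,),
    ).fetchall()
    # If nothing new arrived, keep the previous high-water mark.
    new_max = rows[-1][0] if rows else last_max_id
    return rows, new_max


# Hypothetical source table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE truck_events (id INTEGER PRIMARY KEY, speed REAL)")
conn.executemany(
    "INSERT INTO truck_events (id, speed) VALUES (?, ?)",
    [(1, 60.0), (2, 65.5)],
)

first_batch, state = fetch_new_rows(conn, 0)       # picks up rows 1 and 2
conn.execute("INSERT INTO truck_events (id, speed) VALUES (?, ?)", (3, 70.0))
second_batch, state = fetch_new_rows(conn, state)  # picks up only row 3
```

The saved high-water mark is what makes the second call skip rows that were already retrieved.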
While configuring a processor, what is the language of syntax or formula used?
NiFi has a concept called expression language which is supported on a per-property basis, meaning the developer of a processor can choose whether a property supports expression language. NiFi’s expression language is documented here: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
Is there a programming language that Apache NiFi supports?
NiFi is implemented in the Java programming language and allows extensions (processors, controller services, and reporting tasks) to be implemented in Java. In addition, NiFi supports processors that execute scripts written in Groovy, Jython, and several other popular scripting languages.
Do the attributes get added to content (actual data) when data is pulled by NiFi?
You can certainly add attributes to your FlowFiles at any time; that is the whole point of separating metadata from the actual data. Essentially, one FlowFile represents an object or a message moving through NiFi. Each FlowFile contains a piece of content, which is the actual bytes. You can extract attributes from the content and store them in memory, then operate against those attributes in memory without touching your content. Doing so saves a lot of I/O overhead, making the whole flow management process extremely efficient.
Any plans to add versioning to the NiFi docs on the Apache site? Currently, I can only find docs for 1.0.0, but 0.7.1 is the stable version, right?
Great idea. We have filed a JIRA in Apache land to capture this thought: https://issues.apache.org/jira/browse/NIFI-3005. We definitely plan to add versioning to the NiFi docs as soon as we can.
I am personally a big supporter of Apache NiFi, but I would like to know: are the many processors available in the Hortonworks DataFlow version of NiFi also available in Apache NiFi, and will Apache NiFi still be actively developed with new features?
HDF releases are, and will always be, based on Apache NiFi releases. For any new NiFi features added in HDF, Apache equivalents can absolutely be expected.
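The separation between attributes and content can be pictured with a small Python sketch. This is a toy model, not NiFi's actual FlowFile API; the class name, the attribute keys, the "truck_id,speed" content format, and the speed limit are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFileSketch:
    """Toy stand-in for a FlowFile: raw content plus a map of attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def extract_attributes(flow_file):
    """Read the content once and copy the fields of interest into attributes."""
    truck_id, speed = flow_file.content.decode("utf-8").strip().split(",")
    flow_file.attributes["truck.id"] = truck_id
    flow_file.attributes["truck.speed"] = speed  # kept as a string
    return flow_file


def route(flow_file, speed_limit=65.0):
    """Make a routing decision from attributes only; the content is not read."""
    speed = float(flow_file.attributes["truck.speed"])
    return "speeding" if speed > speed_limit else "normal"
```

After the one-time extraction, later steps such as route() work entirely from the in-memory attributes, and the content bytes stay unchanged.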
MiNiFi Questions (More info from MiNiFi Webinar, Slides, and Q&A here)
I was unable to find documentation on Apache MiNiFi, can you please point to the documentation of Apache MiNiFi?
The MiNiFi documentation is available here: https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi.
Any plans to make generating flows for MiNiFi a more streamlined process?
Yes, we do have plans to develop a centralized command and control console for MiNiFi, which will enable streamlined management of your dataflow from end to end (from the source where the data’s journey starts all the way back to the core data center). Please refer to the following feature proposal in Apache for more details: https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi+Command+and+Control