Introduction to Real Big Data Engineering


Learn about the provenance, ease of use, flexibility, data sources, data sinks, edge computing, deep learning integration, and streaming of Apache NiFi.


A real big data engineer doesn't get bogged down in too much coding; they write code only when functionality or performance demands it. Don't pick the orchestration or pipelining system du jour just because one or two startups (with massive IT staffs) love it or wrote it. Pick one that lets you do most of your work without coding, is fully open source, has tutorials and articles, and has a healthy community. If you don't see that yet, stay away. Sorry, Kafka Streams, Airflow, and 20 others, where I had to write the same streaming framework that already exists in five other places, only to end up with some kind of vendor lock-in or step on the "not invented here" landmine.

I recommend Apache NiFi. It has 10+ years of engineering behind it, is 100% open source and Apache-licensed, and is supported by multiple companies. It has hundreds of articles, a large community, and hundreds of enterprises using it, from banking, utilities, and retail to education, healthcare, government, and everywhere else.

Sure, sometimes you need to roll some Python, Go, Java, or Scala code, and SQL is in every flow I build.

If you need to move data from one, ten, 100, 1,000, or 10,000 sources, do some processing, maybe run some Apache Spark, join with some Apache Hive data, and land it in one of 70 popular clouds, message queues, storage systems, Hadoop, SQL, or NoSQL stores, there is only one tool.

If you need to extend the processing with Streaming Analytics Manager, Apache Storm, Apache Spark Streaming, Apache Flink, or some other tool du jour, push your messages to Kafka and your schemas to a Schema Registry and continue the processing.

Apache NiFi is the only tool with full provenance that tells you the what, where, why, and how of your data: where it came from, how it changed, where it went, and what went wrong upstream or downstream.
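To make the idea of provenance concrete, here is a minimal sketch in plain Python. It is not NiFi's API; NiFi records this kind of history for you without any code. The `ProvenanceEvent` and `TrackedRecord` classes, the `receive`/`transform`/`send` helpers, and the source and sink names are all hypothetical, chosen only to show a record carrying its own lineage through a small pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class ProvenanceEvent:
    """One entry in a record's lineage (hypothetical structure)."""
    event_type: str   # e.g. "RECEIVE", "CONTENT_MODIFIED", "SEND"
    component: str    # which step produced the event
    details: str      # free-text description of what happened
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class TrackedRecord:
    """A payload that carries its own history with it."""
    content: str
    lineage: List[ProvenanceEvent] = field(default_factory=list)

    def log(self, event_type: str, component: str, details: str) -> None:
        self.lineage.append(ProvenanceEvent(event_type, component, details))


def receive(content: str, source: str) -> TrackedRecord:
    # Where the data came from.
    record = TrackedRecord(content)
    record.log("RECEIVE", "receive", f"from {source}")
    return record


def transform(record: TrackedRecord, name: str, fn: Callable[[str], str]) -> TrackedRecord:
    # How the data changed: keep both the old and the new content.
    before = record.content
    record.content = fn(before)
    record.log("CONTENT_MODIFIED", name, f"{before!r} -> {record.content!r}")
    return record


def send(record: TrackedRecord, sink: str) -> TrackedRecord:
    # Where the data went.
    record.log("SEND", "send", f"to {sink}")
    return record


record = send(transform(receive("  hello ", "sensor-1"), "strip", str.strip), "kafka://events")
for event in record.lineage:
    print(event.event_type, event.component, event.details)
```

Reading the lineage list back answers the provenance questions for a single record; a real system would also persist and index these events so they can be searched after the fact.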

With the versioning added in Apache NiFi 1.5, it supports not only live production coding but also standard, version-control-based Agile development.

You can integrate edge deep learning, add sentiment analysis and NLP, handle specialty data formats like healthcare's HL7, and grab all the tables in a database automagically.
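As a rough illustration of what grabbing every table involves, the sketch below uses Python's built-in `sqlite3` module against an in-memory database. This is not how NiFi does it, since in NiFi you would configure processors rather than write code. The `dump_all_tables` helper and the sample `customers` and `orders` tables are made up for the example.

```python
import sqlite3
from typing import Dict, List, Tuple


def dump_all_tables(conn: sqlite3.Connection) -> Dict[str, List[Tuple]]:
    """Discover every user table and return its rows, keyed by table name."""
    cursor = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    tables = [row[0] for row in cursor.fetchall()]
    result = {}
    for table in tables:
        # Names come from the catalog, not user input; quote them anyway.
        quoted = '"' + table.replace('"', '""') + '"'
        result[table] = conn.execute(f"SELECT * FROM {quoted}").fetchall()
    return result


# Build a throwaway database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.5);
    """
)

snapshot = dump_all_tables(conn)
for name, rows in snapshot.items():
    print(name, rows)
```

The point is that table discovery is a catalog query followed by one fetch per table, so no table names need to be hard-coded. Other databases expose the same information through their own catalogs.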



I can list a hundred more real-world use cases with articles, code, tutorials, and corporations doing this live today.

Take a peek at my list from last year.

If you have an alternative that offers the same provenance, ease of use, flexibility, data sources, data sinks, edge computing, deep learning integration, and streaming, please post it in the comments.

