The Current Data and Event Ecosystem
In this article, we’ll look at today’s data and event ecosystem by examining some tools for each component of the data pipeline.
Join the DZone community and get the full member experience.Join For Free
Nowadays, everyone is talking about building data platforms (or data pipelines) to answer specific business questions. While information and intelligence have always been critical to business, the sheer volume, velocity, and complexity of such data has exploded.
Everything from applications, machinery, infrastructure, clothing, smartphones, and even automotive electronics collect information. With more than 13 billion devices and systems connected in 2018, and projected to grow to 70 billion by 2020, understanding this ecosystem is essential to staying competitive.
In many cases, it’s often the data (combined with the platform) that is the product. In this post, we’ll get a grasp on today’s data and event ecosystem by looking at some of the tools that others are using for each component of the data pipeline.
We’ve broken the data pipeline down into four sections: Ingestion, Transport, Storage and Management, and Processing and Visualizing. We'll look at open source as well as some commercially available solutions in the ecosystem.
But, before we start, let’s take a bird’s-eye view of the pipeline.
The Data Ingestion Ecosystem
At the beginning of any data pipeline, data ingestion involves procuring events from sources (applications, IoT devices, web and server logs, and data file uploads) and transporting them into a data store for further processing.
Data ingestion can be continuous, asynchronous, batched, in real time, or some combination thereof. There are many data ingestion technologies that can take raw data from disparate sources and upload them to a single source of truth.
Embulk is a parallel bulk data uploader built around a core and a series of community-contributed input and output plugins that supports bulk data transfer between various data stores, databases, NoSQL stores, and cloud services.
Embulk supports a number of now-standard features of data ingestion, such as guessing input file formats, parallel and distributed execution, all-or-nothing transaction control, and resuming after an upload stalls.
With over 2 million downloads, StreamSets Data Collector is a popular, “low-latency ingest infrastructure tool that lets you create continuous data ingest pipelines with a drag and drop UI.” Licensed under Apache 2.0 open source, StreamSets is a good way to set up data ingestion graphically, with minimal code and configuration.
Fluentd, a “Unified Logging Layer,” is an open source, streaming data collector, that decouples data sources from backend systems. A favorite of Yukihiro Matsumoto, creator of Ruby, Fluentd also consists of a community-maintained core combined with input and output plugins, like Embulk. FluentBit is the version maintained for embedded systems.
Apache Sqoop, Flume, and Spark
Apache Sqoop is a tool for transporting bulk data between Apache Hadoop and structured datastores like relational databases. By offloading certain tasks (such as extract, transform, load operations) onto Hadoop, it can make data warehouses more efficient.
Apache Flume is “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.”
Based on streaming data flows, and geared toward Hadoop, Flume acts as a buffer between data producers and consumers — centralized data stores — when incoming data velocity exceeds the write capacity of the stores. Flume is distributed, scalable, and fault-tolerant.
As a component of Apache Spark, Spark Streaming combines streaming with batch and interactive queries. Spark Streaming can read data from HDFS, Kafka, Twitter, and ZeroMQ, and uses Zookeeper and HDFS for high availability ingestion.
Analytics data is collected when an event code is fired, and SDKs are generally available for any number of ingestion, storage, and management tools in most major programming languages.
The Data Transport Ecosystem
Data transport overlaps somewhat with data ingestion, but “ingestion” revolves around getting data extracted from one system and into another, while “transport” concerns getting data from any location to any other.
Message brokers are a key component in data transport; their raison d’etre is to translate a message from a sender’s protocol to that of a receiver, and possibly transform messages prior to moving them.
Apache Kafka is a high-throughput distributed messaging system for consistent, fault-tolerant, and durable message collection and delivery. Kafka producers publish streams of records or topics to which consumers subscribe. These streams of records are stored and processed as they occur.
Kafka is typically used for a few broad classes of applications:
Real-time streaming data pipelines between systems or applications.
Real-time streaming applications that transform streams of data.
- Real-time streaming applications that react to streams of data.
Compared to earlier, simpler messaging systems like ZeroMQ or RabbitMQ, Kafka generally has better throughput, integrated partitioning, and fault tolerance, making it excellent for large-scale message handling.
Kafka’s use has expanded to include everything from commit logs, to website activity tracking, to stream processing.
Part of the Kafka family, Kafka Connect is a good alternative for data ingestion and export tasks. It is a framework with a number of available connectors to interact with systems and services ranging from change data capture from popular databases to MQTT and, for example, Twitter.
Amazon’s equivalent is Amazon Kinesis, a real-time data processing platform offered on Amazon Web Services. As a fully managed solution, it can handle widely varying amounts of ingest data (without worrying about scaling); it ingests, buffers, and processes streaming data in real-time.
Data Storage and Data Management Ecosystem
No one talks about Big Data or its ecosystem without including Apache Hadoop and Apache Spark. Hadoop is a framework that can process large data sets across clusters; Spark is “a unified analytics engine for large scale data processing.”
Both are widely adopted, often used together, and have strong community support with open source and commercial versions available. However, as both are early evolutionary steps in big data, they come with their unique problems.
For example, with Hadoop, aside from the well-known talent gap, users have found that the MapReduce programming paradigm isn’t a good match for all problems. These include the typically iterative tasks of a data scientist’s exploratory work.
And Spark, though it can be much faster than Hadoop (with in-memory processing), and supports SQL queries (taking the Hadoop/Spark stack comfortably out of the data engineer’s domain into that of analysts, data scientists and even managers), both technologies require infamously complicated configuration chops.
Plus, if you’ve ever used Hadoop and Spark together, you’re probably well aware of the “small files problem” — Hadoop File System (HDFS) generally works better with a small number of large files rather than vice versa.
Nonetheless, pipelines have emerged with other data stores and management methods; some established, some new. Let’s look at a few:
PostgreSQL is an open source, object-relational database management system emphasizing extensibility and standards compliance that has been around so long, it’s become a standby for companies ranging from manufacturing to IoT.
Redis is a superfast variant of the NoSQL database known as a key-value store. As such, it’s an extremely simple database that stores only key-value pairs and serves search results by retrieving the value associated with a known key.
Redis’s speed and simplicity make it well-suited for embedded databases, session caches, or queues. In fact, it’s often used in conjunction with message brokers, or as a message broker itself.
If you’re working with large, active data sets, and need to tweak the tradeoff between consistency, availability, and partition tolerance, then Apache Cassandra may be your solution. Because data is distributed across nodes, when one node — or even an entire data center — goes down, the data remains preserved in other nodes (depending on the consistency level setting).
As a wide column store, Cassandra is schema-agnostic and stores data in column families resulting in a multi-dimensional key-value store. Technically schema-free and “NoSQL,” Cassandra uses a SQL variant called CQL for data definition and manipulation, making administration easy for RDBMS experts.
The rapid instrumentation of the physical world due to IoT and data collecting applications has led to an explosion of time-stamped data. Time series databases serve this evolving niche, and among them, InfluxDB is emerging as a major player. InfluxDB, like others, can handle complex logic or business rules atop massive — and fast-growing — data sets, and InfluxDB adds the advantage of a range of ingestion methods, as well as the ability to append tags to different data points.
Based on Lucene, Elasticsearch is a distributed document and full-text indexing solution that supports complex data analytics in real time. In architectures where collection and processing of a range of data formats is necessary (read: most of them), Elasticsearch may be used alongside other solutions like Apache Kafka, PostgreSQL, Redis, and InfluxDB.
When you want to develop insights and reach conclusions to support your hypotheses, you’re in the domain of data scientists. Data visualization tools and dashboards also support managers, marketers, and even end consumers, but there are simply too many such tools, with too many areas of specialty, to possibly cover in this article.
When time-series data needs to be plotted to a graph and visualized — to monitor system performance, say, or how a particular variable or group of variables has performed over time — then a solution like Grafana might be just the ticket. Although originally built for performance and system monitoring, it now directly supports more than 40 data sources and 16 apps.
Where to Learn More
Here are a few resources where you can learn more about data pipelines and related technologies:
On Data Ingestion
On Message Brokers and Data Transport
On Data Storage and Management
What is a key-value store? What is a time-series database?
On Data Visualization
Thanks for reading!
This blog was originally published at aiven.io.
Published at DZone with permission of John Hammink, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.