How to Build a Big Data Analytics Pipeline
A detailed overview of the essential components of a big data analytics pipeline, focusing on Hadoop, Spark, MongoDB, and visualization services like Tableau.
In the era of the Internet of Things, with huge volumes of data becoming available at incredible velocity, the need for an efficient analytics system has never been more relevant. This data also arrives in a variety of formats from a variety of sources, such as sensors, logs, and structured data from an RDBMS. The need of the hour is an efficient analytics pipeline that can derive value from data and help businesses. This article explores how to create such a pipeline with relevant technologies.
In the past few years, the generation of new data has increased drastically. More applications are being built, and they are generating more data at an incredible rate. Then there is the Internet of Things, which has truly brought us into the data age. Organizations today hold huge amounts of data and, at the same time, need to derive value from it. Given the huge volume and the incredible rate at which data is collected, an efficient analytics system is needed to process this data and provide value in real time.
Earlier, data storage was costly, and there was no technology that could process the data efficiently. Now storage has become cheaper, and the technology to process big data is a reality. The first significant step toward processing Big Data came in 2003, when Google published its GFS (Google File System) and MapReduce papers. Doug Cutting started writing a Big Data system based on these concepts, and the result, Hadoop, is now a popular Big Data system.
An efficient analytic system should have the capability to retain the data, process the data, derive insights, and provide information in an acceptable time frame (so that it is not too late for the business to respond), and be flexible enough to process a variety of use cases.
Furthermore, the cost of developing such a solution should not be prohibitive. In this article, we present a strategy for developing an analytics system capable of handling the huge amounts of data coming from Internet of Things devices.
Any such system is built around an analytics pipeline.
Components of the Analytic Pipeline
The major constituents of an analytic pipeline are as follows:
The messaging system.
Distribution of messages to various nodes for further processing.
Analytic processing, to derive inferences from the data. This includes applying machine learning to the data.
Data storage system for storing results and related information.
Interfaces for the consumption of results, e.g. visualization tools, alerts, etc.
Critical Parameters of the System
The analytics system should:
Handle huge volumes and a variety of data, i.e. Big Data. The system should be able to process millions of messages as the number of devices increases.
Show low latency: Have a good, near-real-time response time. Many of these use cases require a fast response so that the impacted entity can be notified of an impending event or failure.
Be scalable: Scale on various parameters, such as the number of devices (hundreds of thousands), messages (millions per second), and storage (on the order of terabytes).
Be diverse: Serve a variety of use cases, including new and unknown ones. As the industry and use cases change or evolve, the system should provide a way to fine-tune for them.
Be flexible: Be flexible enough to accommodate new use cases, and be able to incorporate predictive analytics.
Be economical: The system should be cost-effective, so that the benefits of building it are not neutralized by its cost.
An efficient analytics system must have certain critical capabilities in order to respond to business needs, and the technology platform should not be prohibitive in terms of cost or usage. The features we are looking for are:
Handling high volumes of data – using a big data framework like Hadoop to retain data.
Real-time data processing – a streaming solution like Kafka coupled with Spark Streaming would be a good option.
Predictive learning – various machine learning algorithms are available in Spark's MLlib library or the Apache Mahout library.
Storing the results and data – a NoSQL system like MongoDB is a good choice because it provides the flexibility of storing JSON data in a schema-less fashion. The pipeline we are building will carry machine-generated data, so MongoDB is a useful candidate.
Reporting the results – for a user interface, a tool like Tableau or QlikView could be useful; open-source choices include JasperReports and BIRT. A mature user interface covers historical reporting, drill-down information, etc.
Alerts – a service like Twilio can be used to deliver text messages; sending alerts through email is also an option.
The following diagram represents the analytic pipeline within the IoT landscape:
The following technologies are preferred for building the analytics pipeline. In addition to addressing business needs, the choice of technology is influenced by two parameters: usage (adoption) and cost of entry. Hadoop is the most widely used Big Data framework, but Spark has recently been gaining usage and popularity. Its ability to seamlessly integrate various aspects of Big Data solutions, such as streaming and building predictive models, is making Spark a popular choice. In view of their acceptance across organizations, these technologies are the natural choice.
Hadoop Distributed File System (HDFS)
Visualization tools such as Tableau, QlikView, D3.js, etc.
Some of these technologies are now available as cloud offerings from various providers, such as Microsoft, IBM, and Amazon. These offerings have their own benefits, such as trying out a solution quickly or building PoCs. However, things are still evolving, and the choice of technology may be restricted by a platform offering. While we can expect these offerings to mature over time, for today's needs it is worth doing some research and opting for an in-house system, which provides more control and flexibility in building an analytics pipeline.
Detailed Description of the Analytic Pipeline
The Messaging System
From the analytics system's perspective, Apache Kafka can be treated as the entry point. Apache Kafka is a high-throughput, distributed, publish-subscribe messaging system. It is suitable in a big data scenario because it can scale when required while providing a simple subscriber-based mechanism. Because we require processing to happen in real time, Kafka coupled with Spark Streaming lets data be processed as it arrives. Spark, incidentally, is a fast and scalable solution because it employs an in-memory architecture, which is considerably faster than Hadoop's MapReduce architecture.
Kafka provides two roles: producer and consumer. The API writes data through a producer, and a Spark listener subscribed to Kafka receives the data as a stream. With this mechanism, Kafka handles the high-volume, high-frequency data, and Spark Streaming distributes the processing load across the nodes of the Hadoop cluster. Growth in data volume can then be handled by adding more nodes when required.
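As a minimal sketch of the producer side, the snippet below serializes a sensor reading as JSON before publishing it to Kafka. The field names (device_id, metric, value, ts), the topic name sensor-readings, and the use of the kafka-python client are all illustrative assumptions, not part of the original design.

```python
import json
import time

def encode_reading(device_id, metric, value, ts=None):
    # Serialize one sensor reading as a JSON payload suitable for Kafka.
    # Field names here are hypothetical, chosen for illustration only.
    payload = {
        "device_id": device_id,
        "metric": metric,
        "value": value,
        "ts": ts if ts is not None else time.time(),
    }
    return json.dumps(payload).encode("utf-8")

if __name__ == "__main__":
    # Publishing would require a running broker and the kafka-python package:
    #   from kafka import KafkaProducer
    #   producer = KafkaProducer(bootstrap_servers="localhost:9092")
    #   producer.send("sensor-readings", encode_reading("pump-7", "vibration", 4.2))
    print(encode_reading("pump-7", "vibration", 4.2, ts=0))
```

Keeping the serialization logic separate from the producer call makes it easy to unit test the payload format without a broker.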
Another advantage of using Kafka is mapping use cases to queues. The designer can separate queues by use case, keeping the processing logic for each consumer to the required minimum: there is no need to write code to handle messages that will never arrive on a given queue.
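The queue-per-use-case idea can be sketched as a simple routing table; the use-case and topic names below are hypothetical, and a real deployment would choose its own naming convention.

```python
# Hypothetical mapping of use cases to dedicated Kafka topics (queues),
# so each consumer only handles the message types it actually expects.
TOPIC_BY_USE_CASE = {
    "plant_maintenance": "iot.vibration",
    "cold_chain": "iot.temperature",
    "fraud_detection": "finance.transactions",
}

def topic_for(use_case):
    # Return the topic a message should be published to; failing fast on
    # an unknown use case keeps misrouted messages out of every queue.
    try:
        return TOPIC_BY_USE_CASE[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case}")
```

A producer would call topic_for() once per message, so adding a new use case only means adding a table entry and a new consumer.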
Once data is available at the messaging system, we need a mechanism so that data arriving at high volume and high velocity can be processed efficiently enough to meet business needs. This can be achieved with the streaming APIs of the Big Data ecosystem. Spark Streaming can be used here to ensure that received messages are spread across the cluster and processed efficiently. Another notable advantage is that the processing time window can be configured as needed: we may want to process data every 30 seconds or every 5 minutes, and this can be made use case dependent rather than system dependent. This is a powerful option in the hands of the designer.
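In Spark Streaming the batch interval is set when creating the context (e.g. StreamingContext(sc, 30) for 30-second micro-batches). To show the windowing idea without requiring a Spark cluster, here is a plain-Python sketch that groups timestamped readings into fixed-size windows, analogous to what a configurable batch interval gives you:

```python
from collections import defaultdict

def batch_by_window(readings, window_seconds):
    # Group (timestamp, value) readings into fixed-size time windows,
    # mimicking the effect of a configurable micro-batch interval.
    windows = defaultdict(list)
    for ts, value in readings:
        windows[int(ts // window_seconds)].append(value)
    return dict(windows)
```

For example, with a 30-second window, readings at t=0, t=10, and t=35 land in two batches: the first two in window 0, the last in window 1.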
The data arriving at the various nodes may not conform to the required parameters. If a failure happens while processing such a message, the message can be written to log files that can be analyzed later.
This is the stage where the processing of data is actually done. Based on the properties of the data (its metadata), the appropriate analytical model is applied. For example, if the program is listening to financial messages, it knows it needs to apply a fraud detection mechanism, which can be implemented as a predictive model. Let's assume we have trained a K-means model that flags suspected cases of fraud. Once this model is created, its parameters are fed into the system beforehand.
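A common way to use K-means for fraud flagging is to treat points far from every learned cluster center as suspect. The sketch below assumes the centroids and distance threshold come from a model trained offline (e.g. with Spark MLlib) and are supplied directly; the values are purely illustrative.

```python
import math

def nearest_centroid_distance(point, centroids):
    # Distance from a point to its nearest learned cluster center.
    return min(math.dist(point, c) for c in centroids)

def flag_suspect(point, centroids, threshold):
    # Flag a transaction as suspect if it lies farther than `threshold`
    # from every cluster center, i.e. it doesn't resemble any known
    # pattern of normal behavior. Centroids/threshold would be produced
    # by an offline K-means training job.
    return nearest_centroid_distance(point, centroids) > threshold
```

With centroids at (0, 0) and (10, 10) and a threshold of 2.0, a point near a center such as (0.5, 0.2) is accepted, while (5, 5) sits far from both and is flagged.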
Flexibility in handling various data formats comes from employing the JSON format and extracting the requisite information from the available data. For example, if our machine learning model predicts on two parameters, say pred1 and pred2, the program at the Spark Streaming level can read only the required variables and pass them to the model. When the model changes, the program adjusts the variables at runtime, providing flexibility. Format independence from devices is achieved at the data-sink level, where the program can translate a text or CSV-based payload to JSON if required. This ensures a wrong format is caught at an early stage rather than failing the program later. It also provides some basic security and flexibility, as the message format is not exposed at the device level.
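The variable-extraction step might look like the following sketch: the list of required variables comes from the current model's metadata, extra fields are ignored, and a missing field fails early instead of inside the model. The pred1/pred2 names follow the example above; the function itself is an assumption about how this could be wired.

```python
import json

def extract_features(message_json, required_vars):
    # Pull only the variables the current model needs from an incoming
    # JSON message. Unknown extra fields are ignored; a missing required
    # field raises early rather than failing inside the model.
    record = json.loads(message_json)
    missing = [v for v in required_vars if v not in record]
    if missing:
        raise KeyError(f"message is missing required fields: {missing}")
    return [record[v] for v in required_vars]
```

When the model changes, only the required_vars list (read from model metadata) changes; the extraction code stays the same.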
Now, when a message is processed, its parameters are read, the variables are extracted, the appropriate model is loaded, and the data is fed into the model. Based on the results, further (also configurable) action can be taken at the next layer.
The chosen technology here is Spark's MLlib library. Popular machine learning algorithms such as decision trees, random forests, and K-means are already available and can be used to build various models. Furthermore, MLlib is continuously evolving, so we can expect it to become more mature with time. Not only can predictive models be used here; a rule-based mechanism can also be developed for monitoring purposes.
After the analytic processing is done, the results need to be handled. Based on the output, they can be sent to the user in real time as an alert, or stored in a data store for later viewing. A NoSQL data store is suitable here because of the volume and velocity involved; since the data is kept in JSON format, MongoDB is a suitable choice. Real-time alerts can be configured and programmed here to send out text messages, using a service like Twilio. With this data store as a source, a number of interfaces can be developed for end users, such as reporting with Tableau or viewing on mobile devices.
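The store-or-alert decision can be sketched as a small routing function. The score field, the 0.9 threshold, and the returned action names are illustrative assumptions; the commented calls show where pymongo and Twilio would plug in.

```python
def route_result(result, alert_threshold=0.9):
    # Decide what happens to a processed result: everything is persisted
    # for later viewing, and high-scoring anomalies additionally trigger
    # a real-time alert. Returns the actions so the caller can dispatch:
    #   "store" -> e.g. collection.insert_one(result) with pymongo
    #   "alert" -> e.g. an SMS sent via Twilio's REST API
    actions = ["store"]
    if result.get("score", 0.0) >= alert_threshold:
        actions.append("alert")
    return actions
```

Keeping the decision separate from the I/O calls means the threshold logic can be tested (and reconfigured per use case) without a database or SMS gateway.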
Security, Reliability, and Scalability
When designing an analytics system, all of the above factors should be considered. The choice of technologies like Hadoop, Spark, and Kafka addresses these aspects. Kerberos-based security can be configured on the Hadoop cluster, thus securing the system. Other components such as Kafka and Spark run on the Hadoop cluster, so they are also covered by Hadoop's security features. As these tools are designed for big data processing, data replication and reliability are provided by the infrastructure, enabling the engineers to focus on building the business proposition. For example, if the volume of data increases, we can add more nodes to the cluster. The underlying storage mechanism ensures that the load is evenly distributed, and the distributed computing framework ensures that every node is utilized. These technologies also have fail-safe mechanisms, so when a node fails, the system ensures that the computation is resubmitted.
This system can be utilized in many scenarios, some of which can be seen below:
Plant maintenance: Consider a manufacturing unit, where a number of machines and parts work together to create the end product. These machines have sensors that report the vibrations of moving parts, and we know that if the vibrations reach a certain threshold, a part is likely to fail. In this situation, the advisable course is to temporarily shut down the machine, repair the part, and resume operation.
In this scenario, the system must capture data coming from a large number of devices, each of which may generate data every second, process every data point, and send alerts based on configured parameters. Now suppose a new device is introduced that reports a temperature parameter: this data point also needs to be processed along with the others, and the system should be able to raise an alert if need be.
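A minimal sketch of such a threshold check is shown below. The metric names, threshold values, and reading fields are hypothetical; the point is that supporting the new temperature parameter only means adding a table entry, not new plumbing.

```python
# Hypothetical per-metric thresholds; adding a new device parameter such
# as temperature only requires adding an entry here.
THRESHOLDS = {"vibration": 7.5, "temperature": 90.0}

def check_reading(reading):
    # Return an alert message if the reading breaches its threshold,
    # otherwise None. Metrics without a configured threshold are ignored.
    limit = THRESHOLDS.get(reading["metric"])
    if limit is not None and reading["value"] > limit:
        return (f"ALERT: {reading['device_id']} "
                f"{reading['metric']}={reading['value']} exceeds {limit}")
    return None
```

In the pipeline, this check would run inside the stream-processing stage, with any non-None result handed to the alerting service.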
The advantage of such a system is that it can prevent an impending failure that could be very costly. Also, with this kind of system in place, the investigation to find the faulty part is eliminated, along with much of the repair time.
Building a system like this means handling data with volume, velocity, and variety. The next critical step is applying wisdom and intelligence as the users learn more about the system they are monitoring.
Food industry: Maintaining a cold-storage supply chain for perishable items. Here, a number of vehicles can broadcast their temperature levels, and the system can analyze this data and notify appropriately.
Detecting fraud: The system can be used to detect fraudulent-looking transactions, and with a near-real-time response, that information can be used to prevent the fraud from happening, as discussed in the analytic processing section above.
Suitability for Businesses
Analytics pipelines, when deployed, can bring several benefits to organizations. Because of their generic nature, they can detect financial fraud in real time as well as warn of the impending failure of a device, helping to eliminate revenue loss from fraudulent transactions and to minimize plant downtime.
This pipeline can be applied in a number of domains, from healthcare to travel. It can filter anomalous data points in medical measurements, or surface the most travelled destinations. It provides one very important dimension to businesses: optimization.
Going forward, an analytics pipeline will be a pressing need for organizations that want to derive value from data. Building such a system is a complex task, as it requires flexibility on an unprecedented scale, not only to handle the volume of data, but also its variety at high velocity.
Thankfully, with the availability of technology, it is no longer an alien concept, but a reality. The flexibility can be further extended to integrate with other cloud-based machine learning systems, such as Azure ML, Yhat, etc. Integrating these services with the pipeline will make the system more usable and more versatile.