This post is part of a weekly blog series about Geo-Distributed Analytics.
In recent years, research and industry-driven technology enhancements have focused on solving problems related to large-scale data processing for tightly interconnected systems. Today, the base assumption is that one can move all relevant data into a central repository, often called a data lake. The technology ranges from NewSQL, NoSQL, scale-out machine learning, CEP engines, and in-memory analytics engines to all sorts of data stores (key/value, document, columnar databases…). In the upcoming age of IoT, we are facing new challenges:
More and more data is generated in geographically distributed locations. Devices may be connected via low-bandwidth and/or volatile connections (e.g., an oil rig in a remote location via a satellite link), or there may simply be too many devices to transfer all relevant raw data to a centralized analytical platform within the required time window. Consequently, both storage and deep analytical capabilities must move closer to the edge, i.e., to the data producers. In addition, data readings produced locally might have to stay in the country of origin due to legal compliance rules, whereas analytical model parameters might be moved to a centralized location. While the infrastructure challenges are well understood, existing database and big data technologies need to be rethought to address these new challenges.
Figure 1: ParStream Edge Analytic Appliances are located right next to where data is generated. A master node is used to distribute queries and compute final result sets.
ParStream Geo-Distributed Analytics (GDA) follows this edge-analytics approach by locating ParStream nodes close to where the data is produced, allowing for high-volume data acquisition and advanced analytics in near real time. ParStream nodes located on an oil rig, for example, might analyze incoming sensor data for anomalies using complex algorithms, use the results to trigger actions in other local applications or systems, and also send an alert to a monitoring service.
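To make the oil rig example more concrete, here is a minimal sketch of anomaly detection running on an edge node. It is not ParStream's algorithm: it swaps in a simple rolling z-score check, and the class name, window size, and threshold are all hypothetical choices made for illustration.

```python
from collections import deque
from statistics import mean, stdev
from typing import Deque, List


class LocalAnomalyDetector:
    """Hypothetical edge-side detector using a rolling z-score (an assumption)."""

    def __init__(self, window: int = 5, z_limit: float = 3.0) -> None:
        self.history: Deque[float] = deque(maxlen=window)
        self.z_limit = z_limit
        self.alerts: List[float] = []  # stands in for "send an alert"

    def ingest(self, value: float) -> bool:
        """Check one reading against the recent window, then store it."""
        is_anomaly = False
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.z_limit:
                is_anomaly = True
                self.alerts.append(value)
        self.history.append(value)
        return is_anomaly


# Illustrative sensor readings only; not real measurements.
detector = LocalAnomalyDetector()
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1]
flags = [detector.ingest(r) for r in readings]
```

Only the alert would need to leave the rig in this sketch; the raw readings stay on the local node.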
At the same time, these ParStream nodes are also part of a loosely coupled GDA setup, accepting data analysis requests from a central location. Such a request could be issued by an analyst looking for the root cause of an alarm displayed by a monitoring application. It could also come from a data scientist testing a new analytical approach on detailed data without having to move that data into a centralized platform. ParStream just transfers the queries to the local servers, leveraging the computing performance of ParStream close to the source of the data.
Figure 2: Edge analytics delivers real-time insights by minimizing network traffic.
Only the results (aggregates, model parameters, etc.) are reported back from each node in the geo-distributed cluster to the master server and from there back to the user.
As shown above in Figure 2, in a traditional, centralized analytics approach, whether on premises or in the cloud, data from decentralized systems like cell phone towers has to be moved to a centralized analytical data store prior to any analysis. This means moving millions or billions of data rows via the network to a central system.
With GDA, as shown in Figure 2 above, data is stored in individual ParStream satellite nodes as close as possible to where the data is generated. Rather than moving raw data across a (low-bandwidth) network, the ParStream master node sends queries to each satellite node for local processing. The master node is also responsible for collecting the results from each satellite node and for combining them into a final result set. It is important to note that ParStream edge analytics means more than aggregation and filtering. It ranges from simple calculations to estimating the parameters of a statistical model using statistical languages like R, or applying a statistical model to the data stored locally.
Following this edge-analytics approach, very few rows are actually transmitted via the potentially low-bandwidth network to the master node, which minimizes latency and network resource usage. When satellite nodes fail to report data, ParStream will continue to return a result based on all the data that was received, while also noting which nodes failed to report. This is a significantly different approach compared to centralized, cluster-based analytic platforms, which return either a result based on the full input data set or no result at all. It is important to note that geo-distributed analytics extends the current architecture (cloud, centralized MPP cluster, federation) and is not meant to replace it.
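The query flow described above can be sketched as a simple scatter-gather. This is only an illustration under stated assumptions, not ParStream's API: the function names, the `Partial` type, and the sample readings are hypothetical, and the "network" is just a local function call. The point is that each satellite returns a small, mergeable partial aggregate instead of raw rows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Partial:
    """Mergeable partial aggregate returned by one satellite node."""
    count: int
    total: float


def local_query(rows: List[float], threshold: float) -> Partial:
    """Runs on a satellite node: filter and aggregate next to the data."""
    selected = [r for r in rows if r > threshold]
    return Partial(count=len(selected), total=sum(selected))


def master_query(nodes: Dict[str, List[float]], threshold: float) -> float:
    """Runs on the master: send the query out, then combine the partials."""
    partials = [local_query(rows, threshold) for rows in nodes.values()]
    count = sum(p.count for p in partials)
    total = sum(p.total for p in partials)
    return total / count if count else float("nan")


# Hypothetical per-node sensor readings (illustrative values only).
nodes = {
    "rig_a": [10.0, 20.0, 30.0],
    "rig_b": [40.0, 5.0],
}
mean_above_15 = master_query(nodes, threshold=15.0)
```

Returning a count and a sum rather than a per-node mean is a deliberate choice in this sketch: partial means cannot be combined correctly without the counts, whereas counts and sums can simply be added up on the master.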
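The partial-result behavior can be illustrated with a small sketch. It assumes a hypothetical `gather` function on the master and models each satellite as a callable that either returns a row count or raises `ConnectionError`; none of these names come from ParStream itself.

```python
from typing import Callable, Dict, List, Tuple


def gather(queries: Dict[str, Callable[[], int]]) -> Tuple[int, List[str]]:
    """Combine results from reachable nodes and record the ones that failed."""
    total = 0
    failed: List[str] = []
    for name, run in queries.items():
        try:
            total += run()
        except ConnectionError:
            # Keep going: the result is based on the nodes that did report.
            failed.append(name)
    return total, failed


def unreachable() -> int:
    """Simulates a satellite node whose link is down."""
    raise ConnectionError("satellite link down")


# Hypothetical nodes and counts, for illustration only.
result, failed_nodes = gather({
    "tower_1": lambda: 120,
    "tower_2": unreachable,
    "tower_3": lambda: 80,
})
```

Here the caller receives both the combined result from the two reachable nodes and the name of the node that did not report, so it can decide whether the partial answer is good enough.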
...next week: We will discuss a sample use case scenario and explore the value of Geo-Distributed Analytics.