Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Solving Architectural Dilemmas to Create Actionable Insights

DZone's Guide to

Solving Architectural Dilemmas to Create Actionable Insights

A big data expert discusses the Benefits of a converged real-time data pipeline platform that breaks away from the traditional tiered approach.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

The organizational trend of evolution rather than revolution is currently at its peak regarding digital transformation projects. Enterprises have adopted a convergence path called DevOps. Instead of using the classic tiered structure that groups teams according to discipline, DevOps teams integrate workers from different departments in order to enhance communication and collaboration. This paradigm promotes a faster project development lifecycle because it eliminates the interdependencies that exist when functions such as software development and IT operations are completely separate functions. 

While this closed-loop paradigm is being implemented at the organizational level, it is still lagging in the supporting infrastructure. Complex integration workflows and slow data pipelines are still the leading architecture choices. This is due primarily to the power of inertia, which preserves old SOA concepts and slows the adoption of technologies that simplify integration and provide much faster data pipelines.

A unified data and analytics platform is needed to provide extreme data processing, fast data ingestion, and advanced analytics. InsightEdge is a converged platform that unifies microservices, analytics, and processing in an event-driven architecture that reduces TCO while increasing performance.

The Four A's of Big Data 

As Ventana Research Analyst David Menninger suggests, organizations must shift and evolve their focus from the three V’s of big data- Volume, Variety, and Velocity - to the four A’s: Analytics, Awareness, Anticipation, and Action.

Analytics is the ability to derive value from the billions of rows of data, which opens the door to the other three A’s. Awareness allows the organization to have situational or contextual awareness of the current event stream (such as with NLP). Anticipation is the ability to predict, foresee and prevent unwanted scenarios.

The final A is Action, or Actionable Insights, which means leveraging the first three A’s in real time to take preemptive measures that positively impact the business. This new paradigm and methodology - using actionable insights to assess operational behavior and opportunities - should be the clear choice for decision makers when they assess where to invest their corporate budget. The four A’s can reduce customer churn, optimize network routing, increase profits by facilitating adaptive pricing and demand forecasting, and even save lives by using predictive analytics to avoid calamities like critical equipment failures.

When it comes to building the architecture for your business intelligence application, the four A’s can be challenging to achieve. Over-provisioned and complex architectures that are based on the traditional organizational business intelligence ETL flow result in slow workloads and costly infrastructure.

Slow by Design, not Computation

Traditional enterprise architectures, including the big data Lambda/Kappa architecture, are based on comfortable, familiar Service Oriented Architecture (SOA) concepts that utilize a stack of products, where each component has a specific usage or specialty.

This lego block concept, where each component is agile and can potentially be replaced or upgraded, sounds great in theory. But deploying, managing, and monitoring these components in an architecture while expecting high performance through the workflow may not be realistic. The result can actually be the opposite due to the number of moving parts, bottlenecks, and other single points of failure.

Classic pipeline examples are a combination of message brokers, compute layers, storage layers, and analytical processing layers that exist within yet additional sets of tools, which must ultimately work together in a complex ecosystem. This agile SOA-based architecture is a double-edged sword, which can help on some levels but also be a serious hindrance.

Aside from the complexity factor, this methodology has additional downsides:

  1. High availability must be maintained within every component and throughout the entire workflow for enterprise production environments.
  2. There is a constant need for advanced knowledge about a varied, ever-growing stack of products.
  3. The learning curve for infrastructure management and monitoring is steep.
  4. Complex integration leads to costly over-provisioning.
  5. Increasing the number of components in the architecture can lead to reduced performance when executing the flow or getting insights in real time.

Using multiple solutions with polyglot persistence requires performing Extract-Transform-Load (ETL) operations using external, non-distributed classic ETL tools that are nearly obsolete. Aside from paying for extra software licenses, and possibly consultants with specialized ETL knowledge (ETL/Data Architects and Implementers), you also inject ETL complexities, bugs, limitations and faults into your system. Using this methodology can ultimately hurt performance due to network latency, serialization, synchronous dependencies and other context switches.

How can you overcome these challenges and start focusing on the four A’s instead of the V’s? Begin by reducing the number of moving parts and simplifying the workflow process.

The Benefits of a Unified Insights Platform

At GigaSpaces, we examined what makes the data pipeline so complex, which parts can be simplified, and how to make everything faster - much faster.

On its own, Apache Spark is limited to loading data from the data store, performing transformations, and persisting the transformed data back to the data store. When embedded in the InsightEdge Platform, Apache Spark can make changes to the data directly on the data grid, which reduces the need for multiple data transformations and prevents excessive data shuffling.

The only external tool needed is Kafka (or another message broker) to create a unified pipeline. This one additional component, together with the InsightEdge Platform, provides all the parts you need to derive actionable insights from your data in a single software platform.

Most of the traditional polyglot persistence methodologies, which include key/value stores, document databases, and even in-memory databases, support building a custom workflow using a variety of data store structures. However, classic multi-tier architectures require a storage layer that has multiple “tables” (main store and delta stores) along with, multiple layers for hot, warm, cold, and archive/history data. These tiers are built on top of yet more tiers for durable storage and sit underneath an additional management and query tier. 

The In-Memory Data Grid tier of the InsightEdge Platform can ingest and store multi-tiered datasets, which eliminates most of the layers required in traditional table-based database paradigm. The data grid can also utilize advanced off-heap persistence that leverages its unique MemoryXtend feature, which enables scaling out into the multi-tiered range at relatively low cost.

In addition to reducing the number of tiers in both the general architecture and within the In-Memory Data Grid, GigaSpaces has successfully eliminated delta tables because the grid is the live operational store. Unlike traditional databases, the grid can handle massive workloads and processing tasks, ultimately pushing big-data store (e.g. Hadoop) asynchronous replication to the background to put the multi-petabytes in cold storage. With SQL-99 support and the full Spark API, InsightEdge’s management and analytical query tier is essentially a part of the grid, leveraging shared RDDs, DataFrames, and DataSets on the live transactional data.

Simplifying the Workflow: Out-of-the-Box High Availability

Fast data, as seen on the big data curve, is the ability to handle velocity-bound streaming from an eclectic collection of data sources. The three top requirements for these hybrid transactional/analytical processing (HTAP) intensive scenarios are:

  1. A closed-loop analytics pipeline that includes data ingestion, insights, and recommended actions at sub-second latency.
  2. The convergence of myriad data types, especially in IoT and Omni-Channel environments.
  3. Ability to handle and correlate between real-time and historical data in the same pipeline.

A Closed-Loop Analytics Pipeline

The last mile in every big-data project is the ability to improve business based on analytical results - the coveted actionable insights. Most big-data projects are so focused on building and managing the data lake that they don’t get to this final, most important step.

The first challenge in creating a unified platform is addressing the issue of bi-directional integration between the transactional and analytical data stores. Integrating these two tiers usually means having two different storage layers, unified by a query layer/engine that can pull data from both, and then run the blend and batch on a third layer.

Unifying all these layers results in a single strong, consistent transactional layer that acts as the source for all processing, querying, and storage of analytical models.

Convergence of Multiple Data Types

Another challenge is the ability to merge different data types, such as POJO, PONO, document (JSON, XML) as well as other structured data types, with semi and unstructured data types.

In addition to being able to store, process, and query these vastly different data types, the platform must be able to run analytical computations on the source using Spark APIs (RDDs, DataFrames, DataSets, and SparkSQL) as the same data is being processed in real time.

In standard big data implementations, there is a separation between the transactional processing performed by the applications and the analytical storage and stack of products. This segregation of duties - or misalignment between the two stacks - creates the classic BI (Business Intelligence) "Rear-View Mirror Architecture" paradox. In this paradox, the analytics infrastructure focuses on data accumulation and retrospective analysis rather than on actionable insights. The result is a collection of analytical insights that are of limited or even minimal value by the time the analysis is complete.

To overcome this problem, the InsightEdge Platform contains out-of-the-box mapping between the Spark API and all the different data types that can be ingested. With the whole spectrum of the Spark API available immediately, you can easily and quickly create fast data analytics on top of every data type.


RDD API

To load data from the data grid, use SparkContext.gridRdd[R]. The type parameter R is a Data Grid model class. For example:

val products = sc.gridRdd[Product]()

After the Spark RDD is created, you can perform any generic Spark actions or transformations on it, such as:

val products = sc.gridRdd[Product]()
println(products.count())
println(products.map(_.quantity).sum())

To save a Spark RDD to the data grid, use the RDD.saveToGridmethod. It assumes that type parameter T of your RDD[T] is a Space class, otherwise, an exception is thrown at runtime. For example:

val products = sc.gridRdd[Product]()

After the Spark RDD is created, you can perform any generic Spark actions or transformations on it, such as:

val products = sc.gridRdd[Product]()
println(products.count())
println(products.map(_.quantity).sum())

To save a Spark RDD to the data grid, use the RDD.saveToGrid method. It assumes that type parameter T of your RDD[T] is a Space class, otherwise, an exception is thrown at runtime. For example:

val rdd = sc.parallelize(1 to 1000).map(i => Product(i, "Description of product " + i, Random.nextInt(10), Random.nextBoolean()))
rdd.saveToGrid()

To query a subset of data from the data grid, use the SparkContext.gridSql[R](sqlQuery, args) method. The type parameter R is a Data Grid model class, the sqlQuery parameter is a native Datagrid SQL query, and args are arguments for the SQL query. For example, to load only products with a quantity of more than 10:

val products = sc.gridSql[Product]("quantity > 10")
// We use parameters to ease development and maintenance
val products = sc.gridSql[Product]("quantity > ? and featuredProduct = ?", Seq(10, true))

Data Frames API

Spark RDDs are stored in the data grid as collections of objects of a certain type. You can create a Data Frame for the required type using the following syntax:

val spark: SparkSession // An existing SparkSession.
import org.insightedge.spark.implicits.all._
val df = spark.read.grid[Person]
df.show()// Displays the content of the DataFrame to stdout

Dataset API

val spark: SparkSession // An existing SparkSession.
val ds: Dataset[Person] = spark.read.grid[Person].as[Person]
ds.show()// Displays the content of the Dataset to stdout

// We use the dot notation to access individual fields
// and count how many people are below 60 years old
val below60 = ds.filter( p => p.age < 60).count()

Predicate Pushdown

One reason why the InsightEdge Platform is so powerful is its ability to use the Spark/Grid API to "push down" predicates to the data grid, leveraging the grid's indexes and aggregation power transparently to the user. The workload is delegated behind the scenes between the data grid and Spark - tasks are sent to the grid to take advantage of its indexes, filtering, and superior aggregation capabilities.

Blending Real-Time and Historical Data in the Same Pipeline

The final challenge is the need to unify and scale both real-time and historical data. Existing pipelines are built using the polyglot persistence model, which is fine in theory. However, this model has serious limitations regarding performance and management. The InsightEdge answer to this challenge is performing lazy loading and customized prefetching of all the hot and warm data to the platform so that the data is immediately available for batch queries and correlation with the incoming stream. Correlation matching and complex queries can be run very easily because there is no data shuffling back and forth between multiple storage layers and the analytical tier. The result of this approach is a live, operational data lake that doesn’t risk turning into a data swamp.

Real-Time Data Lake and Machine Learning

Operationalizing a data lake is a fundamental methodology shift that is needed to prevent your data from becoming stale, obsolete, and, finally, an enterprise data swamp. Simply building a storage layer without utilizing it correctly for both real-time and batch processing is the reason why, according to Forbes, 76% of data-driven enterprises are stuck somewhere between data accumulation and being solely reactive to the collected data.

When examining the analytics or machine learning value chain, the business impact increases exponentially with the ability to extrapolate meaningful data and finally create actionable insights.

This significant change in business analytics can be seen when the infrastructure is mature enough to allow an organization to move from (step 1) being reactive to (step 2) being predictive, and ultimately (step 3) making proactive decisions. To better understand the value of this paradigm shift, let’s examine a simple use case; a machine such as a train, which has brakes that will wear out over time. The goal is to proactively keep the train running at an optimal level, without allowing the brakes to fail in real-time.

The first step in achieving the above goal is to run analytics in real-time in order to detect anomalies or abnormal behavior as the train is moving. When applying this idea to a business organization, metrics can be collected and analyzed in real-time (also implementing classic algorithms and machine learning).

The second step is a little more complicated. The goal is not only to react when an anomaly is discovered but also to predict potential equipment or machine failure ahead of time. Being able to predict maintenance costs and plan accordingly lowers the total cost of ownership (TCO) by preventing a chain of failures that can be costly to repair. While the above example addresses only TCO, this capability can have life-saving applications for organizations in safety scenarios, such as predictive maintenance on train brakes, elevator systems, and aviation safety.

The third step is basically the "final frontier" in business analytics. We leverage the results of the previous two steps (descriptive and predictive) and apply mathematical and computational sciences to create a statistical decision tree. Action can then be taken based on this decision tree to achieve optimal results. To keep your train running at peak safety, this decision may involve purchasing a specific type of brake pad that best suits the use pattern that has been identified.

Looking at a business organization, this type of data analysis can guide leaders in making much more effective decisions, which are based on strong, current statistical forecasting instead of outdated historical analysis.

Visualizing the Data

Data scientists and project managers often require quick insight into data to understand the business impact and don’t want to waste valuable time consulting their corporate IT team. InsightEdge provides multiple ways of visualizing the data stored in the XAP in-memory data grid. 

In addition to the built-in Apache Zeppelin notebook, InsightEdge contains the powerful In-Grid SQL Query feature, which includes a SQL-99 compatible JDBC driver.  This provides a means of integrating external visualization tools, such as Tableau. These tools can be connected to the grid via a third-party ODBC-JDBC gateway, in order to access the data stored in the XAP in-memory data grid and present it in a visual format.

Conclusion

The ability to process and analyze data in a simple, fast and transactional platform is no longer optional; it is necessary in order to handle the ever-growing data workloads, leverage new deep learning frameworks like Intel’s BigDL, and lower your organization’s TCO. While technologies for managing big data are constantly and rapidly evolving, the organizational methodologies for processing data are lagging behind.

GigaSpaces’ InsightEdge Platform breaks away from the traditional tiered approach and provides a simpler, faster workflow by consolidating the In-Memory ingestion and processing tier together with the analytical tier in a tightly coupled microservices architecture. 



References and Further Reading

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
htap ,big data ,data lakes ,data swamp ,data analytics

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}