Techniques for Moving Computation to Where the Data Lives
Techniques for Moving Computation to Where the Data Lives
Building in-memory data structure stores helps achieve operational intelligence.
Join the DZone community and get the full member experience.Join For Free
Discover Tarantool's unique features which include powerful stored procedures, SQL support, smart cache, and the speed of 1 million ACID transactions on a single CPU core!
In-memory computing now enables an exciting new capability called "operational intelligence" (OI) that brings many of the benefits of business intelligence to live, real-time systems. Operational intelligence uncovers immediate, actionable insights within live systems. Whereas business intelligence offers guidance from the data warehouse, OI acts on the front lines — generating personalized, contextual insights from fast-changing data and capturing opportunities in milliseconds — before the moment is lost.
The key obstacle to delivering operational intelligence is data motion. In today's distributed systems, the network has become the weak link. Managing terabytes of fast-changing data in a cluster of multi-core systems can easily saturate the network, creating bottlenecks that drive up latency and limit throughput and scalability. The secret to avoiding network bottlenecks is to move computation to where the data lives. This enables in-memory computing systems to maintain fast response times while scaling to handle huge workloads.
Moving From an In-Memory Data Grid to a Data Structure Store
Over the last several years, in-memory data grids (IMDGs) have evolved from distributed, in-memory data stores to data-parallel supercomputers hosted on clusters of commodity servers. IMDGs host collections of unstructured key-value pairs which can track changes to fast-changing data using straightforward “CRUD” (create/read/update/delete) APIs invoked by external clients. Integrating computation within an IMDG avoids the need to move data out of the IMDG for updating or analysis, thereby dramatically reducing data motion. It also enables an IMDG to take full advantage of the multi-core servers on which it is hosted.
The simplest approach to integrating computation into an IMDG is to implement predefined data structures within the IMDG. This extends the APIs beyond simple data access to higher level operations performed within the IMDG. Migrating computation for these data structures into the IMDG reduces data motion, improves performance, and offloads grid clients. This approach was pioneered by Redis, which supports several data structures, including hashed sets, sorted sets, and lists; Redis describes itself as a "data structure store."
For example, Redis lets clients embed a list of strings within a single object, and provides APIs for pushing or popping values from either end of the list. A subset of the list APIs is shown in the table to the right. Instead of reading the entire list from the grid to perform these operations, a client only needs to invoke the list APIs and supply the associated parameters. Not only does this reduce data motion to/from the client, it also eliminates the overhead of serializing the object's contents for storage within the IMDG. Lastly, application design is simplified because the details of implementing this data structure are handled by the IMDG.
The Next Step: Adding Extensibility to Data Structures
The obvious next step in migrating computation to an IMDG is to let users extend predefined data structures or create new ones. Redis makes this easy by providing Lua scripting that runs within the grid servers (and soon Redis will add loadable modules). Lua scripts can define new client APIs and access Redis commands within their implementation. An example of a trivial Lua script (excerpted from here) is shown to the right. It shows how a server-side script can accept parameters from the client and access an object stored on the local server.
While server-side scripting adds extensibility to an IMDG's predefined data structures, it has limitations and introduces vulnerabilities. The scripting language must be compatible with the IMDG’s server-side execution environment so that grid servers can run scripts. Scripts typically are limited to accessing objects located on the same grid server since the complexity of allowing distributed access is high. They must run quickly and reliably to avoid disrupting or crashing the grid service process which hosts them. They also have unprotected access to memory within the process, so they can see and possibly corrupt unrelated data.
Using a Worker Process for Extensibility
To address the limitations of server-side scripting, IMDGs can use a separate worker process to host computations that extend grid-based data structures. On each grid server, this worker process runs as a companion to the grid service process which holds the grid's objects, manages data accesses, and orchestrates server-side computations. User-defined code can be downloaded to the worker processes on all grid hosts and invoked by client applications to perform computations on specific objects. The IMDG's set of worker processes together can be called an "invocation grid" (IG), as illustrated in the following diagram which shows a three-server IMDG with grid service processes directing computation requests to IG workers:
The power of this approach lies in both its generality and security. Because worker processes can use an arbitrary runtime environment (such as a JVM for Java or a .NET runtime for C#), IMDGs can support many different languages for implementing user-defined code. For example, the Redis list data structure could be implemented in C#, as illustrated by the following code snippet:
The IMDG invokes the appropriate method by supplying an API (called SingleObjectInvoke below) which specifies the key of the object in the namespace, which holds the list data structure, selects the method, passes parameters to the invocation, and returns a result to the client:
This IMDG directs this method invocation to the grid server which hosts the object, signals the worker process to execute the method, and supplies the object’s data and invocation parameters to the worker process for use by the method. The IMDG also is responsible for managing the worker processes and prestaging the application code. Note that all of these details can be hidden from the application’s methods, which access the object’s data structure as if it were locally hosted.
The use of worker processes has other advantages. Since user-defined methods need not follow the limitations of a server-side script, they are not confined to accessing local objects and can call the IMDG's APIs to access objects distributed throughout the grid. Also, these methods run in a separate process, which is isolated from the grid service, keeping them isolated from IMDG data structures and unable to interfere with grid operations.
There is a price to be paid for this added flexibility and security. Instead of allowing server-side scripts to directly access deserialized data structures stored in the grid service, worker processes must retrieve objects across an inter-process boundary within the grid server. This requires that data structures be stored as serialized data, transmitted to the worker process, and then deserialized for access. Also, method invocations require extra steps to communicate with the worker process to start up, pass parameters, and return results. Despite this added overhead, the performance gains from moving computations from the client into the grid are significant.
The Final Frontier: Data-Parallel Computation
The infrastructure provided by an invocation grid makes it easy to extend its use to data-parallel computations, which harness the IMDG's full power to migrate computation to where the data lives. Instead of simply invoking a method on a specified object within the IMDG, the invocation grid can execute this method in parallel on a collection of objects hosted across all grid servers. A second method can be defined to merge the results and return it to the invoking client. With parallel computation, a single method invocation from the client can trigger the analysis of a large collection of data, providing a simple mechanism to harness the IMDG's scalable computing power without creating a network bottleneck. Here is an example of a data-parallel method invocation which illustrates its simplicity; the IMDG's API, called ParallelObjectInvoke below, lets the application specify a namespace of objects to be analyzed with an Eval method and the results merged with a Merge method:
For example, consider a financial services calculation that analyzes a collection of portfolios in various market sectors based on incoming market prices changes. This calculation could run in parallel on an IMDG to quickly determine which portfolios need to be reviewed by a trader, with analysis summaries merged and returned to the user. The IMDG’s ability to perform this analysis within the grid eliminates the network bottleneck of sending the portfolios to an external client application for analysis. Because the IMDG evenly distributes the portfolios across all grid servers, the parallel computation is automatically load-balanced within the grid. This results in scalable throughput that keeps computation time as fast as possible even as the workload grows.
This parallel computing technique can be used to create data-parallel data structures within the IMDG that are capable of performing useful computations. For example, the IMDG can host a distributed Java concurrent map which provides standard Java APIs to access a large collection of key-value pairs stored within the IMDG. Like Spark, this concurrent map can implement MapReduce or other data-parallel operators that are made available to client applications. Together, these capabilities allow an IMDG to store fast-changing data with high availability while simultaneously providing data-parallel analysis of the data set.
Ultimately, operational intelligence relies on the ability to quickly analyze fast-changing data to provide immediate feedback to a live system. By moving computation into an IMDG, large volumes of fast-changing data can both be hosted in memory and analyzed in place — where the data lives. This avoids data motion, which kills performance, and leverages the full power of an IMDG to perform computations with minimum latency. While building predefined data structures into grid servers provides an important first step in migrating computation into an IMDG, the use of worker processes hosted by a separate invocation grid add the flexibility needed to implement complex, user-defined methods and ultimately perform data-parallel computations. These techniques enable IMDGs to provide a compelling platform for delivering operational intelligence.
Opinions expressed by DZone contributors are their own.