Compute Grids vs. Data Grids
Compute Grids allow you to take a computation, optionally split it into multiple parts, and execute them on different grid nodes in parallel. The obvious benefit is that your computation runs faster, as it can now use resources from all grid nodes in parallel. One of the most common design patterns for parallel execution is MapReduce. However, Compute Grids are useful even if you don't need to split your computation - they help you improve the overall scalability and fault-tolerance of your system by offloading your computations onto the most available nodes. Some of the "must have" Compute Grid features are:
- Automatic Deployment - allows for automatic deployment of classes and resources onto the grid without any extra steps from the user. This feature alone provides one of the largest productivity boosts in distributed systems. Users can usually just execute a task from one grid node, and as task execution spreads through the grid, all required classes and resources are automatically deployed as well.
- Topology Resolution - allows you to provision nodes based on any node characteristic or user-specific configuration. For example, you can decide to include only Linux nodes for execution, or only a certain group of nodes within a certain time window. You should also be able to choose all nodes with CPU load under, say, 50% that have more than 2GB of available heap memory.
- Collision Resolution - allows users to control which jobs get executed, which jobs get rejected, how many jobs can be executed in parallel, order of overall execution, etc.
- Load Balancing - allows you to properly balance your system load within the grid. The range of load balancing policies usually varies between products. Some of the most common ones are Round Robin, Random, and Adaptive. More advanced vendors also provide Affinity Load Balancing, where grid jobs always end up on the same node based on the job's affinity key. This policy works well with Data Grids, described below.
- Fail-over - grid jobs should automatically fail over onto other nodes in case of a node crash or some other job failure.
- Checkpoints - long-running jobs should be able to periodically store their intermediate state. This is useful for fail-over, when a failed job should be able to pick up its execution from the latest checkpoint rather than start from scratch.
- Grid Events - a querying mechanism for all grid events is essential. Any grid node should be able to query all events that happened on remote grid nodes during grid task execution.
- Node Metrics - a good compute grid solution should be able to provide dynamic grid metrics for all grid nodes. Metrics should include vital node statistics, from CPU load to average job execution time. This is especially useful for load balancing, when the system or user needs to pick the least loaded node for execution.
- Pluggability - in order to blend into any environment, a good compute grid should have well-thought-out pluggability points. For example, if running on top of JBoss, a compute grid should fully reuse JBoss communication and discovery protocols.
- Data Grid Integration - it is important that Compute Grids are able to natively integrate with Data Grids, as quite often businesses will need both computational and data features working within the same application.
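To make a few of the features above more concrete, here is a minimal Java sketch of topology resolution, round-robin load balancing, and a split/execute/reduce computation. It is not the API of any product: `ComputeGridSketch`, `Node`, `resolveTopology`, `pickRoundRobin`, and `countLetters` are hypothetical names, the node metrics are made-up values, and a local thread pool stands in for real grid nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Illustrative only: every name here is hypothetical, not a real grid API.
final class ComputeGridSketch {

    /** Hypothetical node descriptor; a real grid would fill it from node metrics. */
    static final class Node {
        final String name;
        final String os;
        final double cpuLoad;   // 0.0 .. 1.0
        final long freeHeapMb;

        Node(String name, String os, double cpuLoad, long freeHeapMb) {
            this.name = name;
            this.os = os;
            this.cpuLoad = cpuLoad;
            this.freeHeapMb = freeHeapMb;
        }
    }

    /** Topology resolution: keep only the nodes that match the filter. */
    static List<Node> resolveTopology(List<Node> all, Predicate<Node> filter) {
        List<Node> eligible = new ArrayList<>();
        for (Node n : all) {
            if (filter.test(n)) eligible.add(n);
        }
        return eligible;
    }

    /** Round-robin load balancing: job i is assigned to node i mod n. */
    static Node pickRoundRobin(List<Node> nodes, int jobIndex) {
        return nodes.get(jobIndex % nodes.size());
    }

    /** Split: one job per word. Execute: in parallel. Reduce: sum the word lengths. */
    static int countLetters(String phrase, List<Node> nodes) {
        if (nodes.isEmpty()) throw new IllegalArgumentException("no eligible nodes");
        String[] words = phrase.trim().split("\\s+");
        // One worker thread stands in for each eligible grid node; the pool's
        // own scheduling replaces a real balancer in this sketch.
        ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String word : words) {
                Callable<Integer> job = () -> word.length();
                results.add(pool.submit(job));
            }
            int total = 0;
            for (Future<Integer> r : results) total += r.get(); // reduce step
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("job failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Node> all = Arrays.asList(
            new Node("linux-1", "Linux", 0.30, 4096),
            new Node("linux-2", "Linux", 0.80, 4096),
            new Node("win-1", "Windows", 0.10, 8192),
            new Node("linux-3", "Linux", 0.20, 3072));
        // Linux nodes with CPU load under 50% and more than 2GB of free heap.
        List<Node> eligible = resolveTopology(all,
            n -> n.os.equals("Linux") && n.cpuLoad < 0.5 && n.freeHeapMb > 2048);
        for (int i = 0; i < 4; i++) {
            System.out.println("job " + i + " -> " + pickRoundRobin(eligible, i).name);
        }
        System.out.println(countLetters("compute grids split work", eligible));
    }
}
```

With the sample nodes above, only `linux-1` and `linux-3` pass the filter, so jobs alternate between those two. A real compute grid would add the parts this sketch leaves out, such as class deployment, fail-over, and checkpoints.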
Data Grids allow you to distribute your data across the grid. Most of us are used to the term Distributed Cache rather than Data Grid (Data Grid does sound more savvy, though). The main goal of a Data Grid is to serve as much data as possible from memory on every grid node and to ensure data coherency. Some of the important Data Grid features include:
- Data Replication - all data is fully replicated to all nodes in the grid. This strategy consumes the most resources, but it is the most effective solution for Read-Mostly scenarios, as data is available everywhere for immediate access.
- Data Invalidation - in this scenario, nodes load data on demand. Whenever data changes on one of the nodes, the same data on all other nodes is purged (invalidated). It is then loaded on demand the next time it is accessed.
- Distributed Transactions - transactions are required to ensure Data Coherency. Cache updates must work just like database updates - whenever an update fails, the whole transaction must be rolled back. Most Data Grids support various Transaction Policies, such as Read Committed, Write Committed, Serializable, etc.
- Data Backups - useful for fail-over. Some Data Grid products provide the ability to assign backup nodes for the data. This way, whenever a node crashes, the data is immediately available from another node.
- Data Affinity/Partitioning - data affinity allows you to split/partition your whole data set into multiple subsets and assign every subset to a grid node. In the purest form, data is not replicated between nodes at all, and every node is responsible only for its own subset of data. However, various Data Grid products may provide different flavors of Data Affinity, such as replication only to backup nodes.
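As a rough illustration of Data Affinity combined with Data Backups, the sketch below maps each key to one primary node and one backup node, so an entry lives on exactly two nodes instead of being replicated everywhere. This is an assumption-laden toy, not any vendor's implementation: `AffinitySketch` and its methods are hypothetical, each node is simulated with an in-memory map, the primary is chosen by hashing the key, and "the next node in the ring" is just one possible backup policy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not a real data grid API.
final class AffinitySketch {
    // One in-memory map per simulated grid node.
    private final List<Map<String, String>> nodes = new ArrayList<>();

    AffinitySketch(int nodeCount) {
        if (nodeCount < 2) {
            throw new IllegalArgumentException("need a primary and a backup node");
        }
        for (int i = 0; i < nodeCount; i++) nodes.add(new HashMap<>());
    }

    /** Affinity: the key alone decides which node owns the entry. */
    int primaryFor(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    /** Backup: the next node in the ring (an assumed policy). */
    int backupFor(String key) {
        return (primaryFor(key) + 1) % nodes.size();
    }

    /** Writes go to the primary and its backup only, not to every node. */
    void put(String key, String value) {
        nodes.get(primaryFor(key)).put(key, value);
        nodes.get(backupFor(key)).put(key, value);
    }

    /** Reads from the primary, or from the backup if the primary crashed (-1 = no crash). */
    String get(String key, int crashedNode) {
        int owner = primaryFor(key);
        if (owner == crashedNode) owner = backupFor(key);
        return nodes.get(owner).get(key);
    }

    /** Number of nodes that currently hold a copy of the key. */
    int copies(String key) {
        int count = 0;
        for (Map<String, String> store : nodes) {
            if (store.containsKey(key)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        AffinitySketch grid = new AffinitySketch(4);
        grid.put("order-42", "shipped");
        int primary = grid.primaryFor("order-42");
        System.out.println("primary node: " + primary);
        System.out.println("copies in grid: " + grid.copies("order-42"));
        System.out.println("read after primary crash: " + grid.get("order-42", primary));
    }
}
```

The same `primaryFor` mapping could also route a grid job to the node that owns its data, which is the idea behind the Affinity Load Balancing policy mentioned earlier.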
Data Affinity is one of the more advanced features and is not provided by every vendor. To my knowledge, according to product websites, among commercial vendors Oracle Coherence and GemStone have it (there may be others). In the Professional Open Source space, you can take a look at the combination of GridGain with Affinity Load Balancing and JBossCache.