Collocation: The First Rule of Distributed Programming

DZone 's Guide to

Collocation: The First Rule of Distributed Programming

· Java Zone ·
Free Resource

I am pleased to announce that we just recently released GridGain 3.0.4. The last couple of releases have been focused, among other things, around convenient and effective collocation of computations and data, and also grouping of data that is usually accessed together on the same nodes. Sending computations exactly to the nodes where the accessed data is residing is one of the key components in achieving better scalability. Without collocation, nodes fetch various data from other nodes for brief periods of time, just to perform often a quick computation and discard it almost immediately thereafter. This creates unnecessary data traffic, a.k.a. data noise, and can at times bring a server to its knees.

In my previous blog post I showed how to collocate computations and data using direct API via GridCache.mapKeyToNode(..) method. We have also added analogous methods on Grid API to provide capability of finding data affinity on the nodes that do not cache any data themselves. In our latest 3.0.4 release we have also added a very convenient way to provide collocation via @GridCacheAffinityMapped annotation.

Say you have 2 types of objects, Person and Company. Multiple persons can work for the same company. This means that you generally may wish to access Person objects together with the Company for which they work. To do that in a scalable fashion, you may wish to ensure that all people working for the same company are cached on the same node. This way you can send computations to that node and access multiple people from the same company locally. Here is how it can be done in GridGain.

    public class PersonKey {  
// Person ID used to identify a person.
private String personId;

// Company ID which will be used for data affinity.
private String companyId;
// Instantiate person keys with same company ID.
Object personKey1 = new PersonKey("myPersonId1", "myCompanyId");
Object personKey2 = new PersonKey("myPersonId2", "myCompanyId");

// Both, the company and the person objects will be cached on the same node.
cache.put("myCompanyId", new Company(..));
cache.put(personKey1, new Person(..));
cache.put(personKey2, new Person(..));
Now, if you want to perform a computation which involves multiple people working for the same company, all you have to do is send a grid job to the node where those people are cached. Here is how you would send a computation to the node which caches all people for the company with ID "myCompanyId".

Now, when you properly collocate all your data within your data grid and then route your computations to the nodes where your data is cached, all cache operations become LOCAL, hence achieving best performance and scalability without any data noise. Kind of goes inline with the first rule of distributed programming, which is DO NOT DISTRIBUTE.


From http://gridgain.blogspot.com/2011/01/collocation-first-rule-of-distributed.html


Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}