Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Aggregations in Distributed In-Memory Grids

DZone's Guide to

Aggregations in Distributed In-Memory Grids

· Cloud Zone
Free Resource

Site24x7 - Full stack It Infrastructure Monitoring from the cloud. Sign up for free trial.

[This article was written by Ron Zavner.]

Today we see more and more applications that are no longer built on relational databases but now built on distributed environments. This occurs because they need to be scalable, highly available and they also need to be able to provide high throughput and low latency that cannot be achieved with older relational databases. Today, distributed environments and in memory data grids are more advanced than they were a few years ago but still more complex than relational databases.

Since distributed data grids store the data in a distributed manner, creating a distributed database, there are some operations that are not so intuitive, for instance join queries and aggregations. Let’s say we want to fetch an employee object together with its department object. “In a database, this would be easily accomplished with a simple query. However, with a distributed in memory data grid we don’t even know that the employee and the department object would be on the same node (unless of course we route them together which is not always the best practice).

For aggregations it is even more difficult – let’s say we would like to fetch the average, minimum and maximum salaries of all employees. In SQLit would be as simple as:

Select avg(salary), min(salary), max(salary) from employees.

We could try to make it even more complicated, for example, the average salary for each department:

Select avg(salary) from employees group by department_id

Or only from departments that have a salary higher than X:

Select avg(salary) from employees group by department_id having avg(salary) > X

How can we execute such tasks in distributed data grids? The data is partitioned across the nodes. One way to achieve that would be a map reduce class. The map function will run on each one of the nodes, calculate the average salary of the employees on that node only and return the result back to the reducer. The reducer runs on the client and then aggregates all the results it got from the different node. This way is very efficient since the actual business logic is running on the server side (help reducing the latency) and this way we return only the aggregated data from each node to the client (this is much less data). The disadvantage with the map reduce is that it is not as intuitive as SQL query. We need to create the class with the business logic for operations so we can easily describe with simple API or SQL queries.

Ideally, we could just write a code similar to the following:

query = new SQLQuery(Person.class, “”);
groupByResult = groupBy(gigaSpace, query, new GroupByAggregator()
.groupBy(“department”)
.selectAverage(“salary”));
Or the more complicated query:
groupByResult = groupBy(gigaSpace, query, new GroupByAggregator()
.groupBy(“department”)
.selectAverage(“salary”)
.selectCount()
.having(new GroupByFilter() {
@Override
public boolean process(GroupByValue groupByValue) {
return groupByValue.getDouble(“avg(salary)”) > 18000;
}
}));

Overall, if we want to run an operation, such as an aggregation, we need to overcome the non-intuitive limitation of a distributed data grid.

Site24x7 - Full stack It Infrastructure Monitoring from the cloud. Sign up for free trial.

Topics:

Published at DZone with permission of Nati Shalom, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}