How to Execute Distributed MapReduce on Java Over Data Stored in Redis
What is MapReduce and why is it helpful?
Join the DZone community and get the full member experience.
Join For FreeMapReduce is a framework that programmers today can use to write applications that are able to process massive quantities of data using a modern distributed data processing approach. This processing approach is very popular among organizations today. As it allows the processing of data in parallel on large commodity hardware clusters, using MapReduce
can significantly speed up data processing. In this post, we will take a look at how to execute MapReduce
over data stored in Redis using Redisson - Redis based In-Memory Data Grid for Java.
What Is MapReduce?
MapReduce
is a program model for distributed computing that could be implemented in Java. The algorithm contains two key tasks, which are known as Map
and Reduce
.
The purpose of the Map
task is to convert a dataset into another dataset where elements are broken down into key/value pairs known as tuples. TheReduce
task combines those data tuples into a small set of tuples, using the output from a map as its input.
The distributed computing means splitting a task into several separate processes, which can then be carried out in parallel on large commodity hardware clusters. Once MapReduce
has broken down the various elements of the large dataset into tuples and then further reduced them into a smaller set, the data that remains can be processed in parallel, which can significantly speed up the processing that needs to be carried out on the data.
When Do You Need to Process Redis Data With MapReduce?
There are many situations in which it is helpful to use MapReduce
to process Redis data. In general, what they all have in common is that the amount of data that you need to process is very large.
As a simple example, you can consider a situation in which you have monthly energy consumption figures for a very large number of organizations. Now imagine that you need to process this data to generate results such as the year of maximum usage, year of minimum usage, and so on, for each organization. While writing an algorithm to perform this kind of processing is not difficult for an experienced programmer, many such algorithms will take a long time to execute if they have to run over vast amounts of data.
As a solution to the problem of long processing times, you can use MapReduce
to reduce the overall size of the dataset and, therefore, make it much faster to process. This reduction in processing time could be very important for many organizations, as it frees up the hardware so that it is available to use for other computing tasks.
There are many more examples of situations in which using distributed MapReduce
on data that is stored in Redis using Redisson can be an extremely helpful thing to do. For example, using MapReduce
can be particularly useful if you need to quickly, reliably, and accurately calculate a word count for a very large file or collection of files.
Examples of Executing Distributed MapReduce on Java Over Data Stored in Redis
Here is an example of how to use MapReduce
to create an efficient algorithm that produces an accurate word count. This might seem like a very simple task, but using MapReduce
is very important to cut down the processing time for very large blocks of text or large collections of files.
Take a look at the following code to see how this algorithm uses MapReduce
provided by Redisson to take text data and process it to reliably produce an accurate word count.
Step 1
Create a Redisson config:
// from JSON
Config config = Config.fromJSON(...)
// from YAML
Config config = Config.fromYAML(...)
// or dynamically
Config config = new Config();
…
Step 2
Create a Redisson instance:
RedissonClient redisson = Redisson.create(config);
Step 3
Define the Mapper
object. This is applied to each Map
entry and split value by space to separate words:
public class WordMapper implements RMapper<String, String, String, Integer> {
@Override
public void map(String key, String value, RCollector<String, Integer> collector) {
String[] words = value.split("[^a-zA-Z]");
for (String word : words) {
collector.emit(word, 1);
}
}
}
}
Step 4
Define theReducer
object. This calculates the total sum for each word.
public class WordReducer implements RReducer<String, Integer> {
@Override
public Integer reduce(String reducedKey, Iterator<Integer> iter) {
int sum = 0;
while (iter.hasNext()) {
Integer i = (Integer) iter.next();
sum += i;
}
return sum;
}
}
Step 5
Define the Collator
object (optional). This counts the total amount of words.
public class WordCollator implements RCollator<String, Integer, Integer> {
@Override
public Integer collate(Map<String, Integer> resultMap) {
int result = 0;
for (Integer count : resultMap.values()) {
result += count;
}
return result;
}
}
Step 6
Here is how to run it all together:
RMap<String, String> map = redisson.getMap("wordsMap");
map.put("line1", "Alice was beginning to get very tired");
map.put("line2", "of sitting by her sister on the bank and");
map.put("line3", "of having nothing to do once or twice she");
map.put("line4", "had peeped into the book her sister was reading");
map.put("line5", "but it had no pictures or conversations in it");
map.put("line6", "and what is the use of a book");
map.put("line7", "thought Alice without pictures or conversation");
RMapReduce<String, String, String, Integer> mapReduce
= map.<String, Integer>mapReduce()
.mapper(new WordMapper())
.reducer(new WordReducer());
// count occurrences of words
Map<String, Integer> mapToNumber = mapReduce.execute();
// count total words amount
Integer totalWordsAmount = mapReduce.execute(new WordCollator());
MapReduce
is also available for collection type objects, including Set
, SetCache
, List
, SortedSet
, ScoredSortedSet
, Queue
, BlockingQueue
, Deque,
, BlockingDeque
, PriorityQueue
, and PriorityDeque
.
How to Use Redisson to Execute MapReduce on Java Over Data Stored in Redis
Redisson is a state-of-the-art Redis client that opens up near infinite possibilities for programming and data processing using Java. A wide range of companies, from the biggest enterprises to the smallest startups, use Redisson to empower their Java applications with Redis.
As a highly sophisticated Redis client, Redisson provides a distributed implementation of services, objects, collections, locks, and synchronizers. It supports a range of Redis configurations, including a single, cluster, sentinel, or master and slave configuration.
Using MapReduce
is an excellent option if you are already using Redisson to store large amounts of data in Redis. Redisson provides a Java-based MapReduce
programming model that can make it easy to process a very large amount of data stored in Redis.
Opinions expressed by DZone contributors are their own.
Comments