How to Execute Distributed MapReduce on Java Over Data Stored in Redis

What is MapReduce and why is it helpful?

Nikita Koksharov

Nov. 21, 18 · Tutorial


MapReduce is a framework for writing applications that process massive quantities of data with a distributed approach, and it is widely used by organizations today. Because it processes data in parallel on large clusters of commodity hardware, MapReduce can significantly speed up data processing. In this post, we will look at how to execute MapReduce over data stored in Redis using Redisson, a Redis-based in-memory data grid for Java.

What Is MapReduce?

MapReduce is a programming model for distributed computing that can be implemented in Java. The algorithm consists of two key tasks, known as Map and Reduce.

The Map task converts a dataset into another dataset in which the elements are broken down into key/value pairs known as tuples. The Reduce task takes the output of the Map task as its input and combines those tuples into a smaller set of tuples.

Distributed computing means splitting a task into several separate processes, which can then be carried out in parallel on large commodity hardware clusters. Because Map tasks work on independent pieces of the dataset and Reduce tasks work on independent keys, both phases can run in parallel, which can significantly speed up the processing that needs to be carried out on the data.
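To make the two phases concrete, here is a minimal single-process sketch in plain Java. It is not how Redisson or any real framework implements MapReduce, and nothing in it runs in parallel; the class `MapReducePhases` and its `run` method are hypothetical names used only for illustration. The example groups integers by parity and sums each group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BiFunction;

// Hypothetical in-memory illustration of the Map -> group -> Reduce flow.
// A real framework would distribute the map and reduce calls across nodes.
class MapReducePhases {

    static <I, K, V> Map<K, V> run(List<I> inputs,
                                   BiConsumer<I, BiConsumer<K, V>> mapper,
                                   BiFunction<K, List<V>, V> reducer) {
        // Map phase: each input emits zero or more (key, value) tuples,
        // which are grouped by key as they arrive.
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (I input : inputs) {
            mapper.accept(input, (key, value) ->
                    grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        // Reduce phase: the values of each key are combined into one value.
        Map<K, V> result = new LinkedHashMap<>();
        grouped.forEach((key, values) -> result.put(key, reducer.apply(key, values)));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> sums = MapReducePhases.<Integer, String, Integer>run(
                List.of(1, 2, 3, 4, 5),
                (n, emit) -> emit.accept(n % 2 == 0 ? "even" : "odd", n),
                (key, values) -> values.stream().mapToInt(Integer::intValue).sum());
        System.out.println(sums); // {odd=9, even=6}
    }
}
```

The mapper never sees other inputs and the reducer never sees other keys, which is the property that lets a distributed implementation spread both phases over many machines.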

When Do You Need to Process Redis Data With MapReduce?

There are many situations in which it is helpful to use MapReduce to process Redis data. In general, what they all have in common is that the amount of data that you need to process is very large.

As a simple example, you can consider a situation in which you have monthly energy consumption figures for a very large number of organizations. Now imagine that you need to process this data to generate results such as the year of maximum usage, year of minimum usage, and so on, for each organization. While writing an algorithm to perform this kind of processing is not difficult for an experienced programmer, many such algorithms will take a long time to execute if they have to run over vast amounts of data.

As a solution to the problem of long processing times, you can use MapReduce to split the work across many nodes and condense the dataset into a much smaller set of results. This reduction in processing time can be very important for many organizations, as it frees up the hardware for other computing tasks.
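The energy consumption scenario can be sketched as two reduce steps in plain Java: sum the monthly figures per organization and year, then keep the year with the largest total for each organization. This is only a local illustration under assumed names; the `Reading` class, its fields, and the sample organizations are hypothetical and do not come from any real dataset or from Redisson.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EnergyUsageSketch {

    // Hypothetical layout: one monthly reading for one organization.
    static final class Reading {
        final String organization;
        final int year;
        final int month;
        final double kWh;

        Reading(String organization, int year, int month, double kWh) {
            this.organization = organization;
            this.year = year;
            this.month = month;
            this.kWh = kWh;
        }
    }

    // Returns, per organization, the year with the highest total consumption.
    static Map<String, Integer> yearOfMaxUsage(List<Reading> readings) {
        // Step 1: key every reading by (organization, year) and sum the kWh.
        Map<String, Map<Integer, Double>> totals = new TreeMap<>();
        for (Reading r : readings) {
            totals.computeIfAbsent(r.organization, k -> new TreeMap<>())
                  .merge(r.year, r.kWh, Double::sum);
        }
        // Step 2: per organization, keep the year with the largest total.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Double>> e : totals.entrySet()) {
            int bestYear = Collections.max(
                    e.getValue().entrySet(), Map.Entry.comparingByValue()).getKey();
            result.put(e.getKey(), bestYear);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Reading> readings = List.of(
                new Reading("Acme", 2016, 1, 100.0),
                new Reading("Acme", 2016, 2, 120.0),
                new Reading("Acme", 2017, 1, 90.0),
                new Reading("Acme", 2017, 2, 80.0),
                new Reading("Globex", 2016, 1, 50.0),
                new Reading("Globex", 2017, 1, 70.0),
                new Reading("Globex", 2017, 2, 10.0));
        System.out.println(yearOfMaxUsage(readings)); // {Acme=2016, Globex=2017}
    }
}
```

Year of minimum usage follows the same pattern with `Collections.min` in the second step.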

There are many more situations in which running distributed MapReduce with Redisson over data stored in Redis is helpful. For example, MapReduce is particularly useful if you need to quickly and reliably calculate a word count for a very large file or collection of files.

Examples of Executing Distributed MapReduce on Java Over Data Stored in Redis

Here is an example of how to use MapReduce to produce a word count. This might seem like a very simple task, but MapReduce can cut the processing time considerably for very large blocks of text or large collections of files.

Take a look at the following code to see how this algorithm uses MapReduce provided by Redisson to take text data and process it to reliably produce an accurate word count.

Step 1

Create a Redisson config:

// from JSON
Config config = Config.fromJSON(...);

// from YAML
Config config = Config.fromYAML(...);

// or dynamically
Config config = new Config();
…

Step 2

Create a Redisson instance:

RedissonClient redisson = Redisson.create(config);

Step 3

Define the Mapper object. It is applied to each map entry and splits the value on non-letter characters to separate the words:

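import org.redisson.api.mapreduce.RCollector;
import org.redisson.api.mapreduce.RMapper;

public class WordMapper implements RMapper<String, String, String, Integer> {

    @Override
    public void map(String key, String value, RCollector<String, Integer> collector) {
        // split the entry value on every non-letter character
        String[] words = value.split("[^a-zA-Z]");
        for (String word : words) {
            // adjacent delimiters produce empty strings; do not count them
            if (!word.isEmpty()) {
                collector.emit(word, 1);
            }
        }
    }

}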
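One detail of the tokenizing step is worth checking locally before running a job: `String.split` with a single-character pattern such as `[^a-zA-Z]` returns empty strings wherever two delimiters are adjacent, and those would be counted as a "word" if emitted. The following plain-Java sketch shows the effect and one way to filter it; `TokenizeSketch` and `words` are hypothetical names, not part of Redisson.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenizeSketch {

    // Splits on runs of non-letter characters and drops empty tokens.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.split("[^a-zA-Z]+")) {
            // a leading delimiter still yields one empty first token
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Alice, was  tired.";
        // Single-character pattern: empty tokens appear between adjacent delimiters.
        System.out.println(Arrays.toString(text.split("[^a-zA-Z]"))); // [Alice, , was, , tired]
        // Filtered version: only real words remain.
        System.out.println(words(text)); // [Alice, was, tired]
    }
}
```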

Step 4

Define the Reducer object. This calculates the total sum for each word.

import java.util.Iterator;
import org.redisson.api.mapreduce.RReducer;

public class WordReducer implements RReducer<String, Integer> {

    @Override
    public Integer reduce(String reducedKey, Iterator<Integer> iter) {
        // add up all the values emitted for this word
        int sum = 0;
        while (iter.hasNext()) {
            sum += iter.next();
        }
        return sum;
    }
}

Step 5

Define the Collator object (optional). This counts the total number of words.

import java.util.Map;
import org.redisson.api.mapreduce.RCollator;

public class WordCollator implements RCollator<String, Integer, Integer> {

    @Override
    public Integer collate(Map<String, Integer> resultMap) {
        // sum the per-word counts produced by the reducer
        int result = 0;
        for (Integer count : resultMap.values()) {
            result += count;
        }
        return result;
    }
}

Step 6

Here is how to run it all together:

    RMap<String, String> map = redisson.getMap("wordsMap");
    map.put("line1", "Alice was beginning to get very tired");
    map.put("line2", "of sitting by her sister on the bank and");
    map.put("line3", "of having nothing to do once or twice she");
    map.put("line4", "had peeped into the book her sister was reading");
    map.put("line5", "but it had no pictures or conversations in it");
    map.put("line6", "and what is the use of a book");
    map.put("line7", "thought Alice without pictures or conversation");

    RMapReduce<String, String, String, Integer> mapReduce
             = map.<String, Integer>mapReduce()
                  .mapper(new WordMapper())
                  .reducer(new WordReducer());

    // count occurrences of words
    Map<String, Integer> mapToNumber = mapReduce.execute();

    // count the total number of words
    Integer totalWordsAmount = mapReduce.execute(new WordCollator());
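As a sanity check for the job above, the same counts can be computed locally without Redis. The sketch below is plain Java over the same seven lines and is only meant to show what values to expect; it assumes the mapper splits on non-letter characters as in Step 3, and `WordCountLocalCheck` is a hypothetical class name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountLocalCheck {

    // The same sample lines that are stored in the Redis map.
    static final List<String> LINES = List.of(
            "Alice was beginning to get very tired",
            "of sitting by her sister on the bank and",
            "of having nothing to do once or twice she",
            "had peeped into the book her sister was reading",
            "but it had no pictures or conversations in it",
            "and what is the use of a book",
            "thought Alice without pictures or conversation");

    // Local equivalent of the mapper plus reducer: occurrences per word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("[^a-zA-Z]")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Local equivalent of the collator: total number of words.
    static int totalWords(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(LINES);
        System.out.println(counts.get("Alice")); // 2
        System.out.println(counts.get("of"));    // 3
        System.out.println(totalWords(counts));  // 57
    }
}
```

The counts are case-sensitive, so "Alice" and "alice" would be separate keys unless the mapper normalizes case.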

MapReduce is also available for collection type objects, including Set, SetCache, List, SortedSet, ScoredSortedSet, Queue, BlockingQueue, Deque, BlockingDeque, PriorityQueue, and PriorityDeque.

How to Use Redisson to Execute MapReduce on Java Over Data Stored in Redis

Redisson is a state-of-the-art Redis client that opens up near infinite possibilities for programming and data processing using Java. A wide range of companies, from the biggest enterprises to the smallest startups, use Redisson to empower their Java applications with Redis.

As a Redis client, Redisson provides distributed implementations of services, objects, collections, locks, and synchronizers. It supports a range of Redis configurations, including single-node, cluster, sentinel, and master/slave setups.

Using MapReduce is an excellent option if you are already using Redisson to store large amounts of data in Redis. Redisson provides a Java-based MapReduce programming model that can make it easy to process a very large amount of data stored in Redis.
