Calculating a Co-Occurrence Matrix with Hadoop

By Bill Bejeck · January 14, 2013

This post continues our series on implementing the MapReduce algorithms found in the Data-Intensive Text Processing with MapReduce book. This time we will be creating a word co-occurrence matrix from a corpus of text. Previous posts in this series are:

  1. Working Through Data-Intensive Text Processing with MapReduce
  2. Working Through Data-Intensive Text Processing with MapReduce – Local Aggregation Part II

A co-occurrence matrix tracks, for a given event and a certain window of time or space, which other events tend to occur alongside it. For the purposes of this post, our “events” are the individual words found in the text, and we will track which other words occur within our “window”, a range of positions relative to the target word. For example, consider the phrase “The quick brown fox jumped over the lazy dog”. With a window value of 2, the co-occurrences for the word “jumped” would be [brown, fox, over, the]. The same idea can be applied to any domain where we want to know which other events tend to happen whenever a particular event occurs. To build our text co-occurrence matrix, we will be implementing the Pairs and Stripes algorithms found in chapter 3 of Data-Intensive Text Processing with MapReduce. The body of text used to create our co-occurrence matrix is the collective works of William Shakespeare.
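
To make the difference between the two approaches concrete, here is a rough illustration (not actual serialized output) of the intermediate key-value data each mapper will emit for the word “jumped” in the example above:

Pairs:   (jumped, brown) -> 1,  (jumped, fox) -> 1,  (jumped, over) -> 1,  (jumped, the) -> 1
Stripes: jumped -> {brown: 1, fox: 1, over: 1, the: 1}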

Pairs

Implementing the pairs approach is straightforward. For each line passed to the map function, we split on whitespace to create a String array. We then construct two loops: the outer loop iterates over each word in the array, and the inner loop iterates over the “neighbors” of the current word, with the number of iterations dictated by the size of our “window”. At the bottom of each iteration of the inner loop, we emit a WordPair object (consisting of the current word on the left and the neighbor word on the right) as the key, and a count of one as the value. Here is the code for the Pairs implementation:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairsOccurrenceMapper extends Mapper<LongWritable, Text, WordPair, IntWritable> {
    private WordPair wordPair = new WordPair();
    private IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length > 1) {
            for (int i = 0; i < tokens.length; i++) {
                wordPair.setWord(tokens[i]);
                // Clamp the window to the bounds of the token array
                int start = (i - neighbors < 0) ? 0 : i - neighbors;
                int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                for (int j = start; j <= end; j++) {
                    if (j == i) continue;
                    wordPair.setNeighbor(tokens[j]);
                    // Emit (word, neighbor) -> 1 for each co-occurrence event
                    context.write(wordPair, ONE);
                }
            }
        }
    }
}
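
The WordPair class used as the key above is a custom Writable; the full implementation ships with the source code linked in the Resources section. As a minimal sketch, assuming only the setWord and setNeighbor methods used by the mapper (everything else here is an assumption, not the blog's actual code), such a key class might look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical sketch: a (word, neighbor) pair that can be serialized,
// compared, and hashed so Hadoop can shuffle and sort it as a key.
public class WordPair implements WritableComparable<WordPair> {
    private Text word = new Text();
    private Text neighbor = new Text();

    public void setWord(String word) { this.word.set(word); }
    public void setNeighbor(String neighbor) { this.neighbor.set(neighbor); }

    @Override
    public void write(DataOutput out) throws IOException {
        word.write(out);
        neighbor.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word.readFields(in);
        neighbor.readFields(in);
    }

    @Override
    public int compareTo(WordPair other) {
        int cmp = word.compareTo(other.word);
        return (cmp != 0) ? cmp : neighbor.compareTo(other.neighbor);
    }

    @Override
    public int hashCode() {
        return word.hashCode() * 163 + neighbor.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WordPair)) return false;
        WordPair other = (WordPair) o;
        return word.equals(other.word) && neighbor.equals(other.neighbor);
    }

    @Override
    public String toString() {
        return "{word=[" + word + "] neighbor=[" + neighbor + "]}";
    }
}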

The Reducer for the Pairs implementation simply sums all of the counts for a given WordPair key:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsReducer extends Reducer<WordPair, IntWritable, WordPair, IntWritable> {
    private IntWritable totalCount = new IntWritable();

    @Override
    protected void reduce(WordPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the partial counts emitted for this WordPair
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        totalCount.set(count);
        context.write(key, totalCount);
    }
}

Stripes

Implementing the stripes approach to co-occurrence is equally straightforward. The overall structure is the same, but all of the “neighbor” words are collected in a MapWritable (Hadoop’s Writable map implementation), with the neighbor word as the key and an integer count as the value. When all of the values have been collected for a given word (at the bottom of the outer loop), the word and its map are emitted. Here is the code for our Stripes implementation:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StripesOccurrenceMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    private MapWritable occurrenceMap = new MapWritable();
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int neighbors = context.getConfiguration().getInt("neighbors", 2);
        String[] tokens = value.toString().split("\\s+");
        if (tokens.length > 1) {
            for (int i = 0; i < tokens.length; i++) {
                word.set(tokens[i]);
                occurrenceMap.clear();

                // Clamp the window to the bounds of the token array
                int start = (i - neighbors < 0) ? 0 : i - neighbors;
                int end = (i + neighbors >= tokens.length) ? tokens.length - 1 : i + neighbors;
                for (int j = start; j <= end; j++) {
                    if (j == i) continue;
                    Text neighbor = new Text(tokens[j]);
                    if (occurrenceMap.containsKey(neighbor)) {
                        IntWritable count = (IntWritable) occurrenceMap.get(neighbor);
                        count.set(count.get() + 1);
                    } else {
                        occurrenceMap.put(neighbor, new IntWritable(1));
                    }
                }
                // Emit the word and its map of neighbor counts for this position
                context.write(word, occurrenceMap);
            }
        }
    }
}

The Reducer for the Stripes approach is a little more involved because we need to iterate over a collection of maps and then, for each map, iterate over all of its entries:

import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    private MapWritable incrementingMap = new MapWritable();

    @Override
    protected void reduce(Text key, Iterable<MapWritable> values, Context context) throws IOException, InterruptedException {
        incrementingMap.clear();
        // Merge every partial map emitted for this word into a single map
        for (MapWritable value : values) {
            addAll(value);
        }
        context.write(key, incrementingMap);
    }

    private void addAll(MapWritable mapWritable) {
        Set<Writable> keys = mapWritable.keySet();
        for (Writable key : keys) {
            IntWritable fromCount = (IntWritable) mapWritable.get(key);
            if (incrementingMap.containsKey(key)) {
                IntWritable count = (IntWritable) incrementingMap.get(key);
                count.set(count.get() + fromCount.get());
            } else {
                incrementingMap.put(key, fromCount);
            }
        }
    }
}
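
Both mappers read the window size from the job Configuration under the "neighbors" key, so the driver needs to set it before submitting the job. The sketch below is a minimal, hypothetical driver for the Pairs job (the class name, paths, and wiring are illustrative, not the blog's actual driver, which is in the linked source code); the Stripes job would be wired the same way with StripesOccurrenceMapper, StripesReducer, and Text/MapWritable output types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairsOccurrenceJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The window size read by the mappers via context.getConfiguration()
        conf.setInt("neighbors", 2);

        Job job = Job.getInstance(conf, "pairs co-occurrence");
        job.setJarByClass(PairsOccurrenceJob.class);
        job.setMapperClass(PairsOccurrenceMapper.class);
        // The sum in PairsReducer is commutative and associative, so the same
        // class can double as the combiner (see the Conclusion below)
        job.setCombinerClass(PairsReducer.class);
        job.setReducerClass(PairsReducer.class);
        job.setOutputKeyClass(WordPair.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}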

Conclusion

Comparing the two approaches, we can see that the Pairs algorithm generates many more key-value pairs than the Stripes algorithm, and that the Pairs algorithm captures each individual co-occurrence event while the Stripes algorithm collects all co-occurrences for a given event. Both implementations would benefit from using a Combiner: because both reduce functions are commutative and associative, each job can simply reuse its Reducer as the Combiner (as shown in the driver sketch above). As stated before, creating a co-occurrence matrix has applicability to fields beyond text processing, and these are useful MapReduce algorithms to have in one's arsenal. Thanks for your time.

Resources

  • Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
  • Hadoop: The Definitive Guide by Tom White
  • Source Code and Tests from blog
  • Hadoop API
  • MRUnit for unit testing Apache Hadoop map reduce jobs


Published at DZone with permission of Bill Bejeck, DZone MVB.
