java,architecture,mapreduce

# Map/Reduce to recommend people connection

Once common feature in Social Network site is to recommend people
connection. e.g. "People you may know" from Linkedin. The basic idea
is very simple; if person A and person B doesn't know each other but
they have a lot of common friends, then the system should recommend
person B to person A and vice versa.

From a graph theory perspective, for each person who is 2-degree reachable from person A, we count how many distinct paths (with 2 connecting edges) exist between this person and person A. Rank this list in terms the number of paths and show the top 10 persons that person A should connect with.

We should how we can use Map/Reduce to compute this top-10 connection list for every person. The problem can be stated as: For every person X, we determine a list of person X1, X2 ... X10 which is the top 10 persons that person X has common friends with.

The social network graph is generally very sparse. Here we assume the input records is an adjacency list sorted by name.

We use two rounds of Map/Reduce job to compute the top-10 list

The purpose of this MR job is to compute the number of distinct path between all pairs of people who is 2 degree separated from each other.

The purpose of this MR job is to rank the connections for every person by the number of distinct path between them.

From a graph theory perspective, for each person who is 2-degree reachable from person A, we count how many distinct paths (with 2 connecting edges) exist between this person and person A. Rank this list in terms the number of paths and show the top 10 persons that person A should connect with.

We should how we can use Map/Reduce to compute this top-10 connection list for every person. The problem can be stated as: For every person X, we determine a list of person X1, X2 ... X10 which is the top 10 persons that person X has common friends with.

The social network graph is generally very sparse. Here we assume the input records is an adjacency list sorted by name.

"ricky" => ["jay", "peter", "phyllis"]

"peter" => ["dave", "jack", "ricky", "susan"]

We use two rounds of Map/Reduce job to compute the top-10 list

## First Round MR Job

The purpose of this MR job is to compute the number of distinct path between all pairs of people who is 2 degree separated from each other.

- In Map(), we do a cartesian product for all pairs of friends (since these friends may be connected in 2-dgrees). We also need to eliminate the pairs if they already have a direct connection. Therefore, the The Map() function should also emit pairs of direct connected persons. We need to order the key space such that all keys with the same pair of people with go to the same reducer. On the other hand, we need the pair of direct connection come before the pairs of 2 degree of separations.
- In Reduce(), we know all the key pairs reaching the same reducer will be sorted. So the direct connect pair will come before the 2-degree pairs. So the reducer just need to check if the first pair is a direct connected one and if so skip the rest.

Input record ... person -> connection_list

e.g. "ricky" => ["jay", "john", "mitch", "peter"]

also the connection list is sorted by alphabetical order

def map(person, connection_list)

# Compute a cartesian product using nested loops

for each friend1 in connection_list

# Eliminate all 2-degree pairs if they already

# have a one-degree connection

emit([person, friend1, 0])

for each friend2 > friend1 in connection_list

emit([friend1, friend2, 1], 1)

def partition(key)

#use the first two elements of the key to choose a reducer

return super.partition([key[0], key[1]])

def reduce(person_pair, frequency_list)

# Check if this is a new pair

if @current_pair != [person_pair[0], person_pair[1]]

@current_pair = [person_pair[0], person_pair[1]]

# Skip all subsequent pairs if these two person

# already know each other

@skip = true if person_pair[2] == 0

if !skip

path_count = 0

for each count in frequency_list

path_count += count

emit(person_pair, path_count)

Output record ... person_pair => path_count

e.g. ["jay", "john"] => 5

## Second Round MR Job

The purpose of this MR job is to rank the connections for every person by the number of distinct path between them.

- In Map(), we rearrange the input records so it will be sorted before reaching the reducer
- In Reduce(), all the connections from the person is sorted, we just need to aggregate the top 10 to a list and then write the list out.

Input record = Output record of round 1

def map(person_pair, path_count)

emit([person_pair[0], path_count], person_pair[1])

def partition(key)

#use the first element of the key to choose a reducer

return super.partition(key[0])

def reduce(connection_count_pair, candidate_list)

# Check if this is a new person

if @current_person != connection_count_pair[0]

emit(@current_person, @top_ten)

@top_ten = []

@current_person = connection_count_pair[0]

#Pick the top ten candidates to connect with

if @top_ten.size < 10

for each candidate in candidate_list

@top_ten.append([candidate, connection_count_pair[1]])

break if @pick_count > 10

Output record ... person -> candidate_count_list

e.g. "ricky" => [["jay", 5], ["peter", 3] ...]

Published at DZone with permission of {{ articles[0].authors[0].realName }}, DZone MVB. (source)

Opinions expressed by DZone contributors are their own.

{{ nComments() }}