My Entry for the HCIR Challenge

A tweet from RiparianData caught my eye the other day:

I built getvouched.com around this idea of “expert and expertise discovery,” using skill-based vouching adjusted by the distance from searcher to target as a way to determine rank. So I dug in and found out that Human-Computer Information Retrieval (HCIR) combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities.

The HCIR challenge for this year’s symposium includes “hiring,” “assembling a conference program,” and “finding people to deliver patent research or expert testimony,” as summarized by Patrick Durusau.

I was late to the party (as the deadline to get access to the Mendeley data had passed) but William Gunn and Daniel Tunkelang were kind enough to grant me access.

I got the data via Dropbox. It is mostly tab-separated data, with one exception: a JSON dump of publications.

I needed to import this into Neo4j, so I followed the examples from Batch Importer Part 2 and Batch Importer Part 3 to do some ETL, but first I needed to load the data into PostgreSQL so I could match up the two formats. I’ve outlined how I did this on the HCIR GitHub repo.
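To give a feel for what “matching up the two formats” means, here is a minimal sketch in plain Python rather than PostgreSQL. It is not the process from the repo. The column names (publication_id, author_name), the JSON fields (id, title), and the one-record-per-line JSON layout are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical samples: the real column and field names in the data set may differ.
TSV_SAMPLE = "publication_id\tauthor_name\n1\tAuthor A\n1\tAuthor B\n2\tAuthor C\n"
JSON_SAMPLE = '{"id": 1, "title": "First paper"}\n{"id": 2, "title": "Second paper"}\n'


def load_titles(json_lines):
    """Map publication id (as a string) to title, assuming one JSON object per line."""
    titles = {}
    for line in json_lines.splitlines():
        if line.strip():
            record = json.loads(line)
            titles[str(record["id"])] = record["title"]
    return titles


def join_authors_to_titles(tsv_text, titles):
    """Pair each tab-separated author row with the title of its publication."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["author_name"], titles[row["publication_id"]]) for row in reader]


rows = join_authors_to_titles(TSV_SAMPLE, load_titles(JSON_SAMPLE))
for author, title in rows:
    print(author, "->", title)
```

A relational database does the same join with far less memory pressure on a large dump, which is presumably why loading into PostgreSQL first was the practical choice.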

What I ended up with was this graph:

publication -[:by_discipline]->      discipline 
publication -[:by_country]->         country
publication -[:by_academic_status]-> academic_status
publication -[:authored_by]->        author
publication -[:published_in]->       journal
author      -[:has_profile]->        profile
profile     -[:interested_in]->      discipline      
profile     -[:member_of]->          group
profile     -[:knows]->              profile
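
If it helps to picture the model outside of Neo4j, the relationships above can be sketched as typed edges indexed in both directions. This is only an illustration: the node names are placeholders, not records from the data set, and build_index is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical toy data: (source, relationship type, target).
EDGES = [
    ("pub1", "authored_by", "author_a"),
    ("pub1", "authored_by", "author_b"),
    ("pub1", "published_in", "journal_x"),
    ("author_a", "has_profile", "profile_a"),
    ("profile_a", "knows", "profile_b"),
]


def build_index(edges):
    """Index edges by (node, relationship type) for outgoing and incoming lookups."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for source, rel_type, target in edges:
        outgoing[(source, rel_type)].append(target)
        incoming[(target, rel_type)].append(source)
    return outgoing, incoming


outgoing, incoming = build_index(EDGES)

# publication -[:authored_by]-> author
print(outgoing[("pub1", "authored_by")])      # ['author_a', 'author_b']
# author <-[:authored_by]- publication
print(incoming[("author_a", "authored_by")])  # ['pub1']
```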

I also created a “vertices” full text index and an “edges” full text index to make my life easier. Just to make sure it imported OK, I tested with:

START authors = node:vertices('type:author')
RETURN authors.name
LIMIT 3;

Looking good:

==> +------------------+
==> | authors.name     |
==> +------------------+
==> | "Dominik Papies" |
==> | "Felix Eggers"   |
==> | "Nils Wlömert"   |
==> +------------------+

I wonder who the most prolific author is?

START authors = node:vertices('type:author') 
MATCH authors <-[:authored_by]- publication
RETURN authors.name, count(publication) AS cnt
ORDER BY cnt DESC
LIMIT 5;

“Timothy E Hewett” has authored the most publications in our sample data set.

==> +--------------------------+
==> | authors.name       | cnt |
==> +--------------------------+
==> | "Timothy E Hewett" | 339 |
==> | "Gregory D Myer"   | 226 |
==> | "Kevin R Ford"     | 202 |
==> | "Felix Gugerli"    | 144 |
==> | "K Darowicki"      | 143 |
==> +--------------------------+

I wonder how many co-authors he has?

START author = node:vertices('type:author AND name:"Timothy E Hewett"') 
MATCH author <-[:authored_by]- publication -[:authored_by]-> co_authors
RETURN count(DISTINCT co_authors);

That’s a ton of co-authors.

==> +----------------------------+
==> | count(DISTINCT co_authors) |
==> +----------------------------+
==> | 280                        |
==> +----------------------------+
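
For anyone new to this kind of traversal, the co-author count boils down to “collect everyone who shares a publication with the author, then count the distinct names.” A minimal Python sketch of that idea, using made-up data and a hypothetical co_authors helper:

```python
# Hypothetical toy data: publication id -> list of author names.
AUTHORS_BY_PUBLICATION = {
    "pub1": ["Author A", "Author B", "Author C"],
    "pub2": ["Author A", "Author B"],
    "pub3": ["Author D"],
}


def co_authors(author, authors_by_publication):
    """Return the distinct people who share at least one publication with `author`."""
    found = set()
    for authors in authors_by_publication.values():
        if author in authors:
            found.update(name for name in authors if name != author)
    return found


# Author B appears on two shared publications but is counted once,
# which is the effect of count(DISTINCT ...) in the query.
print(len(co_authors("Author A", AUTHORS_BY_PUBLICATION)))  # 2
```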

Let’s pick one author from above and focus in on them.

START me = node:vertices('name:"Felix Eggers"')
RETURN me;

Looks like we have him as an author, and we have his profile as well.

==> +-------------------------------------------------------------------+
==> | me                                                                |
==> +-------------------------------------------------------------------+
==> | Node[17]{name:"Felix Eggers",type:"author",node_id:"17"}          |
==> | Node[400573]{name:"Felix Eggers",type:"profile",node_id:"400573"} |
==> +-------------------------------------------------------------------+

So let’s say that Felix is trying to hire someone like him or assemble a conference program on a research topic he is interested in. We can try to find people who are like Felix in a number of different ways:

By Contacts:

We can start with the simple case: who does Felix know?

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5; 

5 out of the 7 authors Felix knows:

==> +-------------------+
==> | profiles.name     |
==> +-------------------+
==> | "Jens Hogreve"    |
==> | "Mathias Lin"     |
==> | "Fabian Eggers"   |
==> | "Tillmann Wagner" |
==> | "Andreas Neus"    |
==> +-------------------+

Felix doesn’t know a whole lot of other authors, so let’s expand his network one more level.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> () -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5;

5 out of the 16 contacts his contacts know:

==> +------------------------+
==> | profiles.name          |
==> +------------------------+
==> | "Victor Henning"       |
==> | "Jens Hogreve"         |
==> | "Charles Hofacker"     |
==> | "Stephanie Feiereisen" |
==> | "Alexander Stich"      |
==> +------------------------+

Members of the same groups:

Let’s see what research groups Felix is in, and who else is in those groups.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT group.name)
ORDER BY COUNT(*) DESC
LIMIT 5; 

We find Jeremy and Michael are in the same group as Felix.

==> +-----------------------------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT group.name)                          |
==> +-----------------------------------------------------------------------------+
==> | "Jeremy Chen"       | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> | "Michael Waltinger" | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> +-----------------------------------------------------------------------------+
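
The same “who shares a group with me” idea can be sketched with set intersections. The data and the shared_groups helper below are hypothetical, and the ranking by number of shared groups is my reading of what the ORDER BY is meant to achieve.

```python
# Hypothetical toy data: profile name -> set of group names.
GROUPS_BY_PROFILE = {
    "Me": {"Group 1", "Group 2"},
    "P1": {"Group 1"},
    "P2": {"Group 1", "Group 2"},
    "P3": {"Group 3"},
}


def shared_groups(me, groups_by_profile):
    """Return {other profile: groups shared with `me`} for profiles with any overlap."""
    mine = groups_by_profile[me]
    result = {}
    for profile, groups in groups_by_profile.items():
        if profile == me:
            continue
        common = mine & groups
        if common:
            result[profile] = common
    return result


shared = shared_groups("Me", GROUPS_BY_PROFILE)
# Most shared groups first, then by name so the order is stable.
ranked = sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))
for profile, groups in ranked:
    print(profile, sorted(groups))
```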

Are they in any other groups that can help us expand Felix’s network?

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members 
         -[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5; 

Some of those folks are in a ton of groups, so let’s just count them so it will be easier to display.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members 
         -[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COUNT( DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5;

==> +-----------------------------------------------------------+
==> | other_profiles.name   | COUNT( DISTINCT other_group.name) |
==> +-----------------------------------------------------------+
==> | "ABDUL SALAM YUSSIF"  | 12                                |
==> | "Nicholas Overton"    | 9                                 |
==> | "Ashley Cooke"        | 7                                 |
==> | "Joe Reevy"           | 6                                 |
==> | "Moeez Khademhoseiny" | 6                                 |
==> +-----------------------------------------------------------+

Co-Authors:

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- publication -[:authored_by]-> co_authors 
RETURN DISTINCT co_authors.name; 

These folks co-authored a publication with Felix, so they presumably like working together and share similar research interests.

==> +--------------------------+
==> | co_authors.name          |
==> +--------------------------+
==> | "Victor Henning"         |
==> | "Thorsten Hennig-Thurau" |
==> | "Dominik Papies"         |
==> | "Fabian Eggers"          |
==> | "Henrik Sattler"         |
==> | "Mark B Houston"         |
==> | "Nils Wlömert"           |
==> +--------------------------+

That’s not a ton of people, so let’s try his 2nd level co-author network:

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors 
         <-[:authored_by]- their_publications -[:authored_by]-> their_co_authors
WHERE me <> their_co_authors 
  AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> their_co_authors)
RETURN DISTINCT their_co_authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5; 

We are excluding Felix and his co-authors from the result. I found 18, but here are the top 5:

==> +-----------------------------+
==> | their_co_authors.name | cnt |
==> +-----------------------------+
==> | "Jan Reichelt"        | 27  |
==> | "Jason J Hoyt"        | 21  |
==> | "James Hammerton"     | 15  |
==> | "Kris Jack"           | 15  |
==> | "Dan Harvey"          | 15  |
==> +-----------------------------+
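
To make the 2nd level logic concrete, here is a rough Python sketch of the same walk: my publications, their other authors, those authors’ other publications, and finally the people on those. It uses made-up names, the second_level_co_authors helper is hypothetical, and it drops every direct co-author up front, which is a simplification of the NOT(...) clause in the query.

```python
from collections import Counter

# Hypothetical toy data: publication id -> list of author names.
AUTHORS_BY_PUBLICATION = {
    "pub1": ["Me", "Co1"],
    "pub2": ["Co1", "Far1", "Far2"],
    "pub3": ["Co1", "Far1"],
    "pub4": ["Me", "Co2"],
    "pub5": ["Co2", "Co1"],
}


def second_level_co_authors(me, authors_by_publication):
    """Count paths from `me` to co-authors of co-authors, excluding direct co-authors."""
    my_pubs = {pub for pub, authors in authors_by_publication.items() if me in authors}
    direct = {name for pub in my_pubs for name in authors_by_publication[pub] if name != me}
    counts = Counter()
    for my_pub in my_pubs:
        for co_author in authors_by_publication[my_pub]:
            if co_author == me:
                continue
            for their_pub, authors in authors_by_publication.items():
                # Skip the publication we arrived through and ones the co-author is not on.
                if their_pub == my_pub or co_author not in authors:
                    continue
                for candidate in authors:
                    if candidate in (me, co_author) or candidate in direct:
                        continue
                    counts[candidate] += 1
    return counts


counts = second_level_co_authors("Me", AUTHORS_BY_PUBLICATION)
print(counts.most_common(5))  # [('Far1', 2), ('Far2', 1)]
```

Far1 scores 2 because there are two distinct paths to them (through pub2 and through pub3), which is what cnt measures in the query.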

In the same Journal:

We can also take a look at authors who appeared in the same journal as Felix, since journals are usually topic-specific and curated for high-quality content.

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:published_in]-> journal 
         <-[:published_in]- other_publications -[:authored_by]-> authors 
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5; 

I found 35 authors who were published in the same journals, but here are the top 5:

==> +--------------------------------+
==> | authors.name             | cnt |
==> +--------------------------------+
==> | "Thorsten Hennig-Thurau" | 7   |
==> | "Victor Henning"         | 4   |
==> | "Henrik Sattler"         | 4   |
==> | "Tillmann Wagner"        | 4   |
==> | "Richard J Lutz"         | 2   |
==> +--------------------------------+
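
One thing worth spelling out is that cnt here counts paths, not publications. If I have two papers in a journal and someone else has two papers in that journal, there are four paths between us. A small Python sketch with invented data shows the effect. The same_journal_authors helper is hypothetical, and unlike the query it filters out the starting author, which is an extra choice on my part.

```python
from collections import Counter

# Hypothetical toy data: publication id -> (journal, authors).
PUBLICATIONS = {
    "pub1": ("Journal X", ["Me"]),
    "pub2": ("Journal X", ["Me", "Co1"]),
    "pub3": ("Journal X", ["Other1"]),
    "pub4": ("Journal X", ["Other1", "Other2"]),
    "pub5": ("Journal Y", ["Other3"]),
}


def same_journal_authors(me, publications):
    """Count paths me <- my_pub -> journal <- other_pub -> author."""
    counts = Counter()
    for my_pub, (my_journal, my_authors) in publications.items():
        if me not in my_authors:
            continue
        for other_pub, (journal, authors) in publications.items():
            if other_pub == my_pub or journal != my_journal:
                continue
            for author in authors:
                if author != me:  # extra filter, not present in the query
                    counts[author] += 1
    return counts


journal_counts = same_journal_authors("Me", PUBLICATIONS)
print(journal_counts.most_common())
```

Other1 ends up with 4 (two of my papers times two of theirs), even though they only have two publications in the journal.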

We can go to a 2nd level here by using his co-authors:

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors 
         <-[:authored_by]- their_publications -[:published_in]-> journal 
         <-[:published_in]- other_publications -[:authored_by]-> authors 
WHERE me <> authors 
  AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> authors)
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5; 

That query returns 19.5k authors. Here are the top 5:

==> +-------------------------+
==> | authors.name      | cnt |
==> +-------------------------+
==> | "Thomas Cochrane" | 444 |
==> | "Amanda Peters"   | 360 |
==> | "David Jones"     | 336 |
==> | "DJ Riddell"      | 312 |
==> | "J Lavoué"        | 300 |
==> +-------------------------+

This list represents authors who have appeared in the same journals as his co-authors, ordered by the number of paths that exist to them.

Interested in the same Disciplines:

We can actually go multiple ways here.

From his profile we can go to disciplines and find other profiles who are into the same disciplines.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:interested_in]-> disciplines <-[:interested_in]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT disciplines.name)
ORDER BY COUNT(*) DESC
LIMIT 5; 

That’s going to return a ton of people who are also into Business Administration. Here are 5 of them:

==> +----------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT disciplines.name) |
==> +----------------------------------------------------------+
==> | "John Smith"        | ["Business Administration"]        |
==> | "Andreas Müller"    | ["Business Administration"]        |
==> | "abc abc"           | ["Business Administration"]        |
==> | "abc def"           | ["Business Administration"]        |
==> | "Luis Farinha"      | ["Business Administration"]        |
==> +----------------------------------------------------------+

Since we know Felix is interested in Business Administration, we can also go from disciplines to publications, to other authors who may not have a profile in the system.

START me = node:vertices('type:discipline AND name:"Business Administration"')
MATCH me -[:by_discipline]- publications -[:authored_by]- author
RETURN author.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;

==> +---------------------------+
==> | author.name         | cnt |
==> +---------------------------+
==> | "Null Mancas Matei" | 11  |
==> | "Joanne Dyer"       | 8   |
==> | "Nicholas J Turro"  | 8   |
==> | "Steffen Jockusch"  | 4   |
==> | "Angel A Martí"     | 4   |
==> +---------------------------+

Anyway, that was just a bit of exploring the data with Neo4j and Cypher. I’ll try to build a website that makes use of these queries before the August 31st deadline. Leave a comment if you have any ideas or want to help.