
Exploring LinkedIn in Neo4j


Originally authored by Rik Van Bruggen

Ever since I started working for Neo, we have been trying to give our audience as many powerful examples as possible of places where graph databases really shine. And one of the obvious places has always been social networks. That’s why I’ve written a post about Facebook, and why many other graphistas have been looking at Facebook and others to explain what could be done.

But while Facebook is probably the best-known social network, the one I use professionally is LinkedIn. Some call it the creepiest network, but the fact of the matter is that this professional network is, and has always been, a very useful way to get and stay in contact with other people from other organizations. And guess what: LinkedIn does some fantastic stuff with its own, custom-developed graphs. One of these things is InMaps - a fantastic visualization and color-coded analysis of your professional network. In fact, that’s where this blogpost got its inspiration.


An Interactive InMap

The thing is, the InMap above is a “static” picture of your network. You can’t really *do* anything with it. You can’t browse through it. You can’t query it. So there began my quest for a way to get the data out of the InMap, and into Neo4j. I expected it to take days or weeks - but from the first Google search to publishing this post, it took literally just 2-3 hours of work. It’s dead easy.

Well, I should qualify that. Of course you have to have some place to start; I needed something that would let me get the data from LinkedIn into a format that I could import into Neo4j. So after five minutes of Googling, I came across this Dataiku blog post by Thomas Cabrol, who had written a couple of simple Python scripts to get me going.

Step 1: Access LinkedIn API

Thomas’ scripts run against the LinkedIn developer API. This API requires you to authenticate, and therefore you actually need to register an application (in this case the python scripts) in your configuration. Easy to do: Just go to LinkedIn's developer site and register an app.
The important thing here is that this process will generate a number of OAuth keys, tokens and secrets that you will need to insert into your scripts. Easy peasy - but you need to do so before proceeding.

Thomas’ work was based on three scripts (one to get the OAuth credentials, one to extract the data from LinkedIn, and one to clean up some duplicates), but for the purpose of this blog post (and because of some changes in LinkedIn’s OAuth policies), you really only need one script. Therefore, I forked Thomas’ script and put my version over here. Note that you will need to insert your own credentials and name for this script to make sense. ProTip: Make sure that the modules the script imports, oauth2 and urlparse, are available - otherwise the Python script won’t work.

Once you have the script, just open up a Terminal, run the python script (python linkedin-query.py) - and then wait a couple of minutes. If all goes well, it will generate a linked.csv file that holds all the connections (name pairs) that you need - in your 1st degree network. That means:
  • connections from yourself to people in your network
  • connections from people in your network, to other people in your network.
Exactly what InMaps shows. In my particular case, that meant 1516 nodes (I admit, I am a proficient user of LinkedIn ;)), and 5345 connections/relationships. 
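As a quick sanity check on the generated file, the node and relationship counts can be recomputed from the name pairs. The sketch below is not part of the original scripts. It assumes the CSV holds two comma-separated names per line (the actual delimiter and layout of linked.csv may differ), and the sample rows are made up.

```python
import csv
import io


def summarize_pairs(rows):
    """Count distinct people and distinct undirected connections in name pairs."""
    people = set()
    connections = set()
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        a, b = row[0].strip(), row[1].strip()
        if not a or not b or a == b:
            continue  # skip empty names and self-links
        people.update((a, b))
        # frozenset makes (a, b) and (b, a) count as the same connection
        connections.add(frozenset((a, b)))
    return len(people), len(connections)


# Made-up sample standing in for linked.csv (assumed layout: name,name)
sample = io.StringIO(
    "Alice,Bob\n"
    "Bob,Alice\n"
    "Alice,Carol\n"
    "Bob,Carol\n"
)
node_count, relationship_count = summarize_pairs(csv.reader(sample))
print(node_count, relationship_count)
```

To run it against a real export, pass `csv.reader(open("linked.csv", newline=""))` instead of the in-memory sample, adjusting the delimiter if needed.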
Important note: LinkedIn actually limits the number of API calls that you can make to its servers. You can read more about it on LinkedIn's developer pages. For my network, it meant that I could only run the Python script once per day. So beware!

Step 2: The Easy Part - Importing the .csv into Neo4j

Now all we need to do is get the CSV file into Neo4j - and as some of you probably know, there’s more than one way to do that. Whichever way you choose, you will always need a simple data model, and in this case it could not be simpler: nodes with a name property, linked to each other by CONNECTED_TO relationships.
Because this dataset is still quite small, I chose the spreadsheet way to import the data. All I had to do was import the csv into a spreadsheet, dedupe some of the nodes, and create the cypher statements to inject the data into Neo4j.
Make sure to configure the Neo4j auto-index for the name property for this to work (set node_auto_indexing=true and node_keys_indexable=name in conf/neo4j.properties).
The resulting statements look like this:
create ({name:'Rik Van Bruggen'});
To create the nodes. And:
start 
       n=node:node_auto_index(name='Tareq Abedrabbo'), 
       m=node:node_auto_index(name='Yuri Bukhan') 
create unique n-[:CONNECTED_TO]->m;
to create the relationships. Again: easy.
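For larger networks, the spreadsheet step could be replaced by a small script. The following is a minimal sketch, not the method used for this post: it assumes the CSV contains two comma-separated names per line, and it assumes that your Cypher version accepts a backslash-escaped single quote inside string literals. The helper names and the sample rows are hypothetical.

```python
import csv
import io


def cypher_quote(value):
    """Wrap a string in single quotes, escaping backslashes and single quotes.

    Assumption: the target Cypher version accepts backslash escapes.
    """
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_statements(rows):
    """Return (node_statements, relationship_statements) for name pairs."""
    names, seen_names = [], set()   # distinct names, in first-seen order
    pairs, seen_pairs = [], set()   # distinct undirected pairs
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        a, b = row[0].strip(), row[1].strip()
        if not a or not b or a == b:
            continue  # skip empty names and self-links
        for name in (a, b):
            if name not in seen_names:
                seen_names.add(name)
                names.append(name)
        key = frozenset((a, b))  # (a, b) and (b, a) are the same connection
        if key not in seen_pairs:
            seen_pairs.add(key)
            pairs.append((a, b))
    nodes = ["create ({name:%s});" % cypher_quote(n) for n in names]
    rels = [
        "start n=node:node_auto_index(name=%s), "
        "m=node:node_auto_index(name=%s) "
        "create unique n-[:CONNECTED_TO]->m;" % (cypher_quote(a), cypher_quote(b))
        for a, b in pairs
    ]
    return nodes, rels


# Made-up sample rows standing in for linked.csv
sample = io.StringIO(
    "Alice Example,Bob O'Neil\n"
    "Bob O'Neil,Alice Example\n"
    "Bob O'Neil,Carol Example\n"
)
nodes, rels = build_statements(csv.reader(sample))
print("\n".join(nodes + rels))
```

The node statements are printed before the relationship statements because the relationship statements look nodes up through the auto-index, so the nodes have to exist first.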

Start the Neo4j server. Take the spreadsheet, copy/paste the Cypher queries into the console connected to the server (bin/neo4j-shell) and you are all set. If you want to skip all that, you could also just download the imported Neo4j graph database and take a look at my professional network - but to be honest it’s a lot more fun if you do it with your own network.

You can then just browse to the webadmin console, and there we have our interactive InMap equivalent. No color codes (yet) - but plenty of interactivity to go around.


Step 3: Querying the Interactive InMap

The nice thing about having this dataset in Neo4j now is, of course, that we can interact with it. I personally found it very nice and cool - as is almost always the case with graphs - to “take a walk on the data”. Just grab a node in the webadmin, select some of its connections, and interactively browse from node to node - and find out stuff about your LinkedIn network that you may not have known before.

And then, of course, you can also “programmatically” interact with the data - and Cypher is the prime tool for that. Here are a couple of queries that I made up, but I am sure that you can make up some more.

// Find Shortest paths between two contacts

START
	n=node:node_auto_index(name='Emil Eifrem'),
	m=node:node_auto_index(name='Steven Noels')
MATCH
	p = AllShortestPaths(n-[*]-m)
RETURN p;
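To make clear what the query above computes, here is a rough Python equivalent on an in-memory graph: a breadth-first search that records every predecessor lying on a shortest path and then walks back from the target. This is only an illustrative sketch, not how Neo4j implements allShortestPaths; the adjacency dictionary and the single-letter names are made up.

```python
from collections import deque


def all_shortest_paths(adjacency, start, goal):
    """Return every shortest path from start to goal in an undirected graph."""
    if start == goal:
        return [[start]]
    distance = {start: 0}
    parents = {start: []}  # predecessors of each node on some shortest path
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break  # all shorter-distance nodes are processed by now
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif distance[neighbour] == distance[node] + 1:
                parents[neighbour].append(node)  # another equally short route
    if goal not in parents:
        return []  # no connection between the two people

    paths = []

    def walk(node, suffix):
        # Walk back from goal to start, building each path right to left
        if node == start:
            paths.append([start] + suffix)
            return
        for parent in parents[node]:
            walk(parent, [node] + suffix)

    walk(goal, [])
    return paths


# Made-up network: A reaches D through either B or C
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
paths = all_shortest_paths(adjacency, "A", "D")
print(paths)
```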

// Find all the relationships between two "first degree" contacts of RVB

START
	rik=node:node_auto_index(name="Rik Van Bruggen")
MATCH
	rik-[:CONNECTED_TO]-firstdegree-[r:CONNECTED_TO]-otherfirstdegree-[:CONNECTED_TO]-rik
RETURN distinct firstdegree.name, r, otherfirstdegree.name;

// Find all the shared contacts between Rik and Lars

START
	rik=node:node_auto_index(name="Rik Van Bruggen"),
	lars=node:node_auto_index(name="Lars Nordwall")
MATCH
	rik-[:CONNECTED_TO]-rixsharedcontacts-[:CONNECTED_TO]-lars
RETURN distinct rixsharedcontacts.name as SharedContacts;
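Conceptually, the shared-contacts query is a set intersection: the people connected to both endpoints. A minimal Python sketch of the same idea, working on plain name pairs rather than on Neo4j, could look like this. The function names and the sample pairs are hypothetical.

```python
def build_adjacency(pairs):
    """Map each person to the set of people they are connected to."""
    adjacency = {}
    for a, b in pairs:
        # Connections are treated as undirected, like -[:CONNECTED_TO]- in the query
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shared_contacts(adjacency, first, second):
    """Return the sorted names connected to both first and second."""
    common = adjacency.get(first, set()) & adjacency.get(second, set())
    return sorted(common - {first, second})


# Made-up name pairs standing in for the imported network
pairs = [
    ("Alice", "Bob"),
    ("Alice", "Carol"),
    ("Bob", "Carol"),
    ("Bob", "Dave"),
    ("Alice", "Dave"),
    ("Carol", "Erin"),
]
adjacency = build_adjacency(pairs)
print(shared_contacts(adjacency, "Alice", "Bob"))
```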

If you're into it, you can also put your own interactive visualization in front of the Neo4j graph database. Something like Max de Marzi's Neovigator would be interesting.

If you have any questions regarding use-cases for Neo4j or how to use Neo4j in your project, don't hesitate to contact me.


Published at DZone with permission of Andreas Kollegger, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
