Curator's Note: The content of this article was written by Rik Van Bruggen on the Neo4j blog.
A
bunch of different graphistas have pointed out to me in recent months
that there is something funny about Graphs and email. Specifically,
about graphs and email analysis. From my work in previous years at
security companies, I know that Email Forensics is actually big
business. Figuring out who emails whom, about what topics, with what
frequency, at what times - is important. Especially when the proverbial
sh*t hits the fan and fraud comes to light - like in the Enron case.
How do I get insight into email traffic? How do I know what was
communicated to who? And how do I get that insight, without spending a
true fortune?
So a couple of days ago I came across an
article or
two that features
Gmail Meter. This is of course not one of these powerful enterprise-ready forensics tools, but it is interesting. Written by
Romain Vialard,
it provides you with a really straightforward way to get all kinds of
stats and data about your use of Gmail. It does so using a
Google Apps Script,
that is available to anyone using Google Docs’ spreadsheet
functionality. In this blog post, we’ll take a look at how we can
actually use Romain’s output, and generate a
neo4j
database that will allow us to visually and graph-ically explore the
email traffic in our gmail inboxes - without doing doing any coding of
course. Because I don’t know how to do that - but I do do spreadsheets,
as you know by now.
Using Gmail Meter to create the dataset
The
first thing you need to do to get going is to get Gmail Meter
installed. To do that, you just create a Google Doc Spreadsheet, and
insert Gmail Meter from the Script Gallery.
You
will need to give it permission to analyse your mailbox (and
potentially remove that when you want to uninstall it), but that’s easy.
More instructions can be found over here
should you need that. But really it’s dead easy. Once you have given it
permission that, the Apps Scripts starts churning away at your mailbox.
The end result is a Google spreadsheet with two tabs:
the first
tab (Sheet1) contains information about which email addresses you have
been exchanging with, and how many emails you have been exchanging with
them (sending, and receiving)
the second tab (Sheet2) contains more information about the conversations, number of words, etc.
Now all we need to do is create a neo4j database based on this data - and that too is really easy.
Importing the Gmail Meter data into Neo4j
For the graph import that I will illustrate here, we will only be using the first sheet. Basically just getting to grips with
the people that we are emailing to
the people that we are receiving emails from
the frequency that we are emailing to/from these contacts
There’s definitely more data here - but let’s start with this.
The
way I have done it - which is probably not the only way to do it, but
still - is to add two sheets to the workbook coming from Gmail Meter.
“Graph”: to convert the data from GmailMeter’s first worksheet into nodes and relationships
“Cypher”: to generate the Cypher statements that we can use to generate the Neo4j database and start playing around .
I have shared the worksheet over here - so please take a look and customize it whereever necessary. You should end up with something like this sheet:
Note
that - as mentioned above - in the Graph sheet we have, additional to
the nodes and relationships, also added the number of emails to the
“EMAIL” relationship as a property/weight. This will come in handy later
when visualising the email traffic. In a larger graph you could
actually use these weights for pathfinding algorithms as well - in case
you would want to find out the volume of email traffic between two
persons at different places in the graph, through other people.
You see that I am using the exact same techniques of my spreadsheet import blog post
to generate the Cypher statements required. All I need to do after that
is to put all statements into the “Cypher” worksheet, wrap it with a
transaction - and copy paste that into the neo4j shell of my empty neo4j
database. That will execute the queries, insert all nodes,
relationships and properties, and voila - we have our Gmail Meter Graph!
Exploring the Gmail Meter Graph
Once
we have generated the database, we can explore it in all the ways that
we can with Neo4j’s traditional visualisation/querying tools. First
thing I did of course is to look at it with Webadmin.
That gives you some ideas already, but things get a lot more
interesting when you can visualise the weights (= numbers of emails
between two persons).
To
do that, I plugged another visualisation tool on top of the database to
get a feel for the weight of certain relationships. Our friends at
Linkurio.us actually have some very neat (and above all: simple AND
powerful) ways to do this - as you can see below it immediately gave me
an idea of where the traffic is coming from and going to.

Obviously we can explore the network, and also query it with Cypher for interesting relationships:
START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail-[email:EMAIL]->otherperson
RETURN
id(otherperson), email.number, otherperson.name;
START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail<-[email:EMAIL]-otherperson
WHERE
email.number > 4
RETURN
id(otherperson), email.number, otherperson.name;
START
mymail=node:node_auto_index(name="myaddress@gmail.com")
MATCH
mymail<-[email:EMAIL]-otherperson,
mymail-[email2:EMAIL]->otherperson
WHERE
email.number > 2
AND
email2.number > 2
RETURN
id(otherperson), email.number as From, email2.number as To,
I
hope you understand that - because we are only looking at data from one
and only one mailbox, the dataset’s power is quite limited. But I am
hoping you get the point that the graph exploration of this dataset is
great. I did a little experiment where I actually put my professional
email data and my personal email data (both use Gmail) into one neo4j
database - and that was really interesting.
For those of you wanting more detailed info on this topic, I would encourage you to take a look at the wonderful
Graph Databases book, that has a specific chapter about email analysis.
Hoping this was useful - it sure was a learning experience (again) for me.
Comments