Extracting Social Network Graphs from DNC Emails

Data analysis doesn't need big expensive tools. Simple utilities and a little creativity can extract interesting relationships from datasets too. Here, find out how you could do this yourself, using the recent DNC email leaks as a sample set.

Recently, WikiLeaks released a data set of about 23 thousand emails sent within the Democratic National Committee. The emails reportedly demonstrate that the DNC was actively trying to prevent Bernie Sanders from becoming the Democratic candidate in the general election. I am interested in which people have a lot of influence, so I decided to take a closer look at the data.

Yesterday I crawled the data set and processed it, extracting two graphs in the Konect format. Since I am not sure whether I am legally allowed to publish the processed data sets, I will only link to the source code so you can generate the data sets yourself. If you don't know how to run the code but need the information, drop me a mail. I also hope that Jérôme Kunegis will analyze the networks and include them in Konect.

First, We Have the Temporal Graph

This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another, together with a timestamp of when it was sent. If a person puts n recipients in CC, n edges are added to the graph.

rpickhardt$ wc -l temporalGraph.tsv
39338 temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org DavisM@dnc.org 1 17 May 2016 19:51:22
ShapiroA@dnc.org KaplanJ@dnc.org 1 4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com EMail-Vetting_D@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com LykinsT@dnc.org 1 13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com ReifE@dnc.org 1 13 May 2016 21:27:16

The format is: sender TAB receiver TAB 1 TAB date
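
To illustrate how a single email turns into several lines of this format, here is a minimal sketch. It uses Python's standard email module rather than the library the processing code actually builds on, and it treats every address in the To and Cc headers as a recipient, which is an assumption. The function name message_to_edges and the sample message are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def message_to_edges(raw_message):
    """Return one 'sender TAB receiver TAB 1 TAB date' line per recipient."""
    msg = message_from_string(raw_message)
    # getaddresses yields (display name, address) pairs; keep only the address.
    sender = getaddresses(msg.get_all("From", []))[0][1]
    # Assumption: both To and Cc addresses count as recipients.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    sent = parsedate_to_datetime(msg["Date"])
    # Day without leading zero, as in the sample rows ("4 May 2016 06:58:23").
    date = f"{sent.day} {sent:%b %Y %H:%M:%S}"
    return ["\t".join((sender, address, "1", date))
            for _, address in recipients if address]


# Made-up message for illustration only.
sample_message = (
    "From: Alice <alice@example.org>\n"
    "To: bob@example.org, Carol <carol@example.org>\n"
    "Cc: dave@example.org\n"
    "Date: Tue, 17 May 2016 19:51:22 -0400\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

for edge in message_to_edges(sample_message):
    print(edge)
```

With three recipients in total, the sample message produces three edges that share the same sender and timestamp.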

The data is currently not sorted by the fourth column (the date), but this can easily be done. An email network is directed and can have multi-edges, since the same sender can write to the same receiver more than once.
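
Sorting by the date column could look roughly like the following sketch. It assumes the four fields are really tab-separated and that every date matches the pattern seen in the sample rows; the helper names parse_edge and sort_edges_by_time are made up for this example.

```python
from datetime import datetime

# Assumed from the sample rows, e.g. "17 May 2016 19:51:22".
DATE_FORMAT = "%d %b %Y %H:%M:%S"


def parse_edge(line):
    """Split one 'sender TAB receiver TAB 1 TAB date' line into typed fields."""
    sender, receiver, weight, date = line.rstrip("\n").split("\t")
    return sender, receiver, int(weight), datetime.strptime(date, DATE_FORMAT)


def sort_edges_by_time(lines):
    """Return the parsed edges ordered by timestamp, skipping blank lines."""
    edges = [parse_edge(line) for line in lines if line.strip()]
    return sorted(edges, key=lambda edge: edge[3])


# Three of the rows shown above, with explicit tabs.
sample = [
    "GardeM@dnc.org\tDavisM@dnc.org\t1\t17 May 2016 19:51:22",
    "ShapiroA@dnc.org\tKaplanJ@dnc.org\t1\t4 May 2016 06:58:23",
    "JacquelynLopez@perkinscoie.com\tLykinsT@dnc.org\t1\t13 May 2016 21:27:16",
]

for sender, receiver, weight, when in sort_edges_by_time(sample):
    print(sender, receiver, weight, when.isoformat(sep=" "), sep="\t")
```

For the full file, the same function can be fed the lines of temporalGraph.tsv. Multi-edges are kept as separate rows because nothing is aggregated.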

Second, We Have the Weighted Co-recipient Network

Looking at the data, I discovered that many emails have more than one recipient. I thought it would be nice to look at how often two people occur together in the recipient list of an email, since this can reveal a lot about the social network structure of the DNC.

rpickhardt$ wc -l weightedCCGraph.tsv
20864 weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org	MirandaL@dnc.org	848
MirandaL@dnc.org	PaustenbachM@dnc.org	848
WalkerE@dnc.org	PaustenbachM@dnc.org	624
PaustenbachM@dnc.org	WalkerE@dnc.org	624
WalkerE@dnc.org	MirandaL@dnc.org	596

The format is: recipient1 TAB recipient2 TAB count, where count is the number of emails in which recipient1 and recipient2 both appear in the recipient list.
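
The counting behind such a co-recipient network can be sketched as follows. This is an illustration and not the published processing code. It assumes each email is already reduced to its list of recipient addresses, and it emits both directions of every pair because the sample output lists each pair twice. The function name co_recipient_counts and the example.org addresses are invented.

```python
from collections import Counter
from itertools import permutations


def co_recipient_counts(recipient_lists):
    """Count, for every ordered pair, the emails that list both addresses."""
    counts = Counter()
    for recipients in recipient_lists:
        # Deduplicate so an address repeated in one email is counted once.
        unique = sorted(set(recipients))
        # permutations yields (a, b) and (b, a), matching the two rows per pair.
        for first, second in permutations(unique, 2):
            counts[(first, second)] += 1
    return counts


# Toy data: one recipient list per email.
emails = [
    ["a@example.org", "b@example.org", "c@example.org"],
    ["a@example.org", "b@example.org"],
    ["c@example.org"],
]

for (first, second), count in co_recipient_counts(emails).most_common():
    print(first, second, count, sep="\t")
```

An email with a single recipient contributes no pair, and an email with k distinct recipients contributes k * (k - 1) ordered pairs.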

Simple Statistics

The emails involve:

  • 1226 senders
  • 1384 recipients
  • 2030 people overall

The top 7 senders are:

MirandaL@dnc.org	1482
ComerS@dnc.org	1449
ParrishD@dnc.org	750
DNCPress@dnc.org	745
PaustenbachM@dnc.org	608
KaplanJ@dnc.org	600
ManriquezP@dnc.org	567

And the top 7 receivers are:

MirandaL@dnc.org	2951
Comm_D@dnc.org	2439
ComerS@dnc.org	1841
PaustenbachM@dnc.org	1550
KaplanJ@dnc.org	1457
WalkerE@dnc.org	1110
kaplanj@dnc.org	987

Note that KaplanJ@dnc.org and kaplanj@dnc.org appear as separate entries, so addresses that differ only in case are counted separately. Still, at first glance, the data looks pretty natural. In the following, I provide a diagram showing the rank frequency plot of senders and receivers. One can see that some people are far more active than others. Also, the recipient curve lies above the sender curve, which makes sense since every mail has exactly one sender but one or more recipients.
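
The numbers behind a rank frequency plot can be computed along these lines. This is only a sketch: it assumes each email is available as a pair of a sender and a recipient list, and that a sender is counted once per email while each recipient is counted once per email it receives. The function name rank_frequency and the sample data are made up.

```python
from collections import Counter


def rank_frequency(emails):
    """Return (sender counts, recipient counts), each sorted from rank 1 down."""
    sent, received = Counter(), Counter()
    for sender, recipients in emails:
        sent[sender] += 1                 # one sender per email
        received.update(set(recipients))  # one count per distinct recipient

    def by_rank(counter):
        # Position i in the result holds the frequency of rank i + 1.
        return sorted(counter.values(), reverse=True)

    return by_rank(sent), by_rank(received)


# Toy data: (sender, recipients) per email.
emails = [
    ("a@example.org", ["b@example.org", "c@example.org"]),
    ("a@example.org", ["b@example.org"]),
    ("b@example.org", ["a@example.org", "c@example.org", "d@example.org"]),
]

sender_curve, recipient_curve = rank_frequency(emails)
print("senders:   ", sender_curve)
print("recipients:", recipient_curve)
```

Plotting each list against its index plus one, typically on log-log axes, gives the two curves. Lowercasing the addresses first would merge entries that differ only in case.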

Also, you can see the rank versus co-occurrence count diagram of the co-occurrence network. When the ranks are above 2000, the standard picture of the network structure changes a little bit. I have no plausible explanation for this; maybe it is because the data dump is not complete. Still, the data looks pretty natural to me, so further investigation might make sense.

Code

The crawler code is a two-liner: just some wget and sleep magic.

The Python code for processing the emails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on GitHub.

Topics:
data, analysis

Published at DZone with permission of René Pickhardt, DZone MVB. See the original article here.
