Extracting Social Network Graphs from DNC Emails
Data analysis doesn't need big expensive tools. Simple utilities and a little creativity can extract interesting relationships from datasets too. Here, find out how you could do this yourself, using the recent DNC email leaks as a sample set.
Recently, a data set of about 23 thousand emails sent within the Democratic National Committee was released on Wikileaks. The emails supposedly demonstrate how the DNC actively tried to prevent Bernie Sanders from becoming the Democratic candidate for the general election. I am interested in which people have a lot of influence, so I decided to take a closer look at the data.
Yesterday I crawled the data set and processed it, extracting two graphs in the Konect format. Since I am not sure whether I am legally allowed to publish the processed data sets, I will only link to the source code so you can generate the data sets yourself. If you don't know how to run the code but need the information, drop me a mail. I also hope that Jérôme Kunegis will analyze the networks and include them in Konect.
First, We Have the Temporal Graph
This graph consists of 39338 edges. There is a directed edge for each email sent from one person to another, together with a timestamp of when it was sent. If a person puts n recipients in CC, n edges are added to the graph.
rpickhardt$ wc -l temporalGraph.tsv
39338 temporalGraph.tsv
rpickhardt$ head -5 temporalGraph.tsv
GardeM@dnc.org	DavisM@dnc.org	1	17 May 2016 19:51:22
ShapiroA@dnc.org	KaplanJ@dnc.org	1	4 May 2016 06:58:23
JacquelynLopez@perkinscoie.com	EMail-Vetting_D@dnc.org	1	13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com	LykinsT@dnc.org	1	13 May 2016 21:27:16
JacquelynLopez@perkinscoie.com	ReifE@dnc.org	1	13 May 2016 21:27:16
Clearly, the format is: sender TAB receiver TAB 1 TAB date
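To make the edge construction concrete, here is a minimal sketch using only Python's standard `email` package. It is not the code from my repository: the function name `edges_from_message` is made up for illustration, and treating both the To and Cc headers as recipients is an assumption.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def edges_from_message(raw_message):
    """Return one (sender, recipient, 1, date) tuple per recipient.

    Hypothetical helper: To and Cc are both treated as recipients,
    which is an assumption, not necessarily what the real code does.
    """
    msg = message_from_string(raw_message)
    # getaddresses turns header values into (display name, address) pairs.
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    if not senders or msg["Date"] is None:
        return []
    try:
        sent = parsedate_to_datetime(msg["Date"])
    except (TypeError, ValueError):
        return []  # skip mails with an unparsable Date header
    # Same shape as the fourth column above, e.g. "4 May 2016 06:58:23".
    date = f"{sent.day} {sent:%b %Y %H:%M:%S}"
    sender = senders[0][1]
    return [(sender, address, 1, date) for _, address in recipients if address]
```

Writing each tuple as one tab-joined line would give a file of the shape shown above.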
The data is currently not sorted by the fourth column but this can easily be done. Clearly, an email network is directed and can have multi-edges.
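Sorting by the fourth column could look like the following sketch. The helper names and the output path are hypothetical; the date format string is inferred from the sample lines above.

```python
from datetime import datetime

# Inferred from sample lines such as "17 May 2016 19:51:22".
DATE_FORMAT = "%d %b %Y %H:%M:%S"


def sort_edges_by_date(rows):
    """Sort [sender, receiver, weight, date] rows chronologically."""
    return sorted(rows, key=lambda row: datetime.strptime(row[3], DATE_FORMAT))


def sort_tsv(in_path, out_path):
    """Read a tab-separated edge file and write a date-sorted copy."""
    with open(in_path, encoding="utf-8") as src:
        rows = [line.rstrip("\n").split("\t") for line in src if line.strip()]
    with open(out_path, "w", encoding="utf-8") as dst:
        for row in sort_edges_by_date(rows):
            dst.write("\t".join(row) + "\n")
```

Because Python's sort is stable, edges that share a timestamp (one mail with several recipients) keep their original relative order.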
Second, We Have the Weighted Co-recipient Network
Looking at the data, I discovered that many emails have more than one recipient, so I thought it would be nice to see how often two people occur together in the recipient list of an email. This can reveal a lot about the social network structure of the DNC.
rpickhardt$ wc -l weightedCCGraph.tsv
20864 weightedCCGraph.tsv
rpickhardt$ head -5 weightedCCGraph.tsv
PaustenbachM@dnc.org	MirandaL@dnc.org	848
MirandaL@dnc.org	PaustenbachM@dnc.org	848
WalkerE@dnc.org	PaustenbachM@dnc.org	624
PaustenbachM@dnc.org	WalkerE@dnc.org	624
WalkerE@dnc.org	MirandaL@dnc.org	596
Clearly, the format is: recipient1 TAB recipient2 TAB count, where count is the number of emails in which recipient1 and recipient2 appear together in the recipient list.
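The counting itself can be sketched in a few lines. This is an illustration, not the code from my repository: `co_recipient_counts` is a hypothetical name, and emitting both orderings of every pair is inferred from the sample output, where each pair shows up in both directions with the same count.

```python
from collections import Counter
from itertools import permutations


def co_recipient_counts(recipient_lists):
    """Count, for every ordered pair, the mails in which both are recipients.

    recipient_lists holds one list of recipient addresses per email.
    """
    counts = Counter()
    for recipients in recipient_lists:
        # De-duplicate so that an address listed twice in one mail counts once.
        unique = sorted(set(recipients))
        # permutations yields (a, b) and (b, a), making the network symmetric.
        for pair in permutations(unique, 2):
            counts[pair] += 1
    return counts
```

Calling `counts.most_common()` returns the pairs ordered by weight, which matches the ordering of the head output above.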
The emails involve
- 1226 different senders
- 1384 different recipients
- 2030 people
in total. The top 7 senders are:
MirandaL@dnc.org	1482
ComerS@dnc.org	1449
ParrishD@dnc.org	750
DNCPress@dnc.org	745
PaustenbachM@dnc.org	608
KaplanJ@dnc.org	600
ManriquezP@dnc.org	567
And the top 7 receivers are:
MirandaL@dnc.org	2951
Comm_D@dnc.org	2439
ComerS@dnc.org	1841
PaustenbachM@dnc.org	1550
KaplanJ@dnc.org	1457
WalkerE@dnc.org	1110
firstname.lastname@example.org
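Statistics of this kind can be derived from the edge list with a `Counter`. The sketch below uses a hypothetical function name and counts edges, not emails, so a mail with three recipients adds three to its sender's total; that may differ from how the numbers above were produced.

```python
from collections import Counter


def sender_receiver_stats(edges, n=7):
    """Summarise an edge list of (sender, receiver, ...) tuples.

    Assumption: every edge counts once, for both endpoints.
    """
    senders = Counter(edge[0] for edge in edges)
    receivers = Counter(edge[1] for edge in edges)
    return {
        "num_senders": len(senders),
        "num_receivers": len(receivers),
        # People are the union, since many addresses both send and receive.
        "num_people": len(set(senders) | set(receivers)),
        "top_senders": senders.most_common(n),
        "top_receivers": receivers.most_common(n),
    }
```

The union explains why the number of people is smaller than the number of senders plus the number of recipients.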
Still, at first glance, the data looks pretty natural. In the following, I provide a diagram showing the rank-frequency plot of senders and receivers. One can see that some people are far more active than others. Also, the recipient curve lies above the sender curve, which makes sense since every mail has exactly one sender but at least one recipient.
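The points of a rank-frequency plot are easy to compute. A minimal sketch, assuming the counts are held in a dict or `Counter` that maps each address to its number of mails (the function name is hypothetical):

```python
def rank_frequency(counts):
    """Return (rank, frequency) pairs, with rank 1 for the most active address."""
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))
```

Drawing the pairs for senders and for receivers on shared log-log axes with any plotting library gives a diagram of the kind described here.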
Also, you can see the rank versus co-occurrence count diagram of the co-occurrence network. When the ranks are above 2000, the picture deviates a little from the standard network structure. I have no plausible explanation for this; maybe it is because the data dump is not complete. Still, the data looks pretty natural to me, so further investigation might make sense.
The crawler code is a two-liner: just some wget and sleep magic.
The Python code for processing the emails builds upon the Python email library by Alain Spineux, which is released under the LGPL license. My code on top is released under GPLv3 and can be found on GitHub.
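For readers who prefer to stay in Python, the same fetch-then-sleep idea can be sketched as below. This is a substitute for the wget two-liner, not a copy of it: the function names, the `.eml` file naming, and the `url_template` parameter are all hypothetical, and the actual download URL pattern has to be supplied by the caller.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def crawl(email_ids, url_template, out_dir, delay_seconds=1.0, fetch=fetch_bytes):
    """Save one file per email id, pausing between requests to be polite.

    url_template is a caller-supplied string containing "{id}".
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for email_id in email_ids:
        target = out / f"{email_id}.eml"
        target.write_bytes(fetch(url_template.format(id=email_id)))
        saved.append(target)
        time.sleep(delay_seconds)  # the "sleep magic"
    return saved
```

Passing `fetch` as a parameter makes it possible to try the loop with a stub before sending any real requests.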
Published at DZone with permission of René Pickhardt, DZone MVB. See the original article here.