Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Analyzing the BuzzFeed TrumpWorld Dataset With Neo4j

DZone's Guide to

Analyzing the BuzzFeed TrumpWorld Dataset With Neo4j

Using Neo4j, you can pick apart the TrumpWorld dataset to help better understand the connection. This seriously interesting read will help make data mining great again!

· Database Zone ·
Free Resource

RavenDB vs MongoDB: Which is Better? This White Paper compares the two leading NoSQL Document Databases on 9 features to find out which is the best solution for your next project.  

I came across this tweet by Sanchez Castro, which I am more than happy to support.

As part of the Buzzfeed article Help Us Map TrumpWorld, the four investigative journalists, John Templon, Alex Campbell, Anthony Cormier, and Jeremy Singer-Vine asked the public to help them map and analyze the data that they investigated, confirmed, and published surrounding the business connections of Donald J. Trump:

Now we are asking the public to use our data to find connections we may have missed, and to give us context we don’t currently understand. We hope you will help us — and the public — learn more about TrumpWorld and how this unprecedented array of businesses might affect public policy.

As you probably know, Neo4j has been used in investigative journalism before. Most notably in the collaborative work of the International Consortium of Investigative Journalists (ICIJ) on the Panama Papers investigations, but also by other investigative organizations and journalists. To support this kind of work, we launched the Data Journalism Accelerator Program last year to help investigative data journalists better leverage graph database technology.

Image title

So, this below is our small contribution toward mapping and analyzing the first part of the TrumpWorld dataset. Much more connected data analysis is possible, which I’ll outline at the end of the article.

Digging Into the TrumpWorld Dataset

The Buzzfeed article points to a Google Spreadsheet with collected, verified and researched data of the connections of organizations and people related to Trump with each group listed on a separate tab.

The data is, fortunately, available as a publicly readable document, so we can use our old trick of downloading it as CSV/TSV and using the download URL with LOAD CSV to import the dataset into Neo4j.

If you want to, download and start Neo4j (it only takes a minute). Then you can run these queries in your local Neo4j instance, or have a look at the interactive variant of this post as a Graph Gist (our data science notebooks for graphs). All example queries in this blog post use the Cypher graph query language.

Image title

Let’s have a look at the data first. Here is a sample:

WITH
'https://docs.google.com/spreadsheets/...' AS url
LOAD CSV WITH HEADERS FROM url AS row FIELDTERMINATOR '\t'
RETURN row.`Organization A`,row.`Organization B`,row.Connection, row.`Source(s)`
LIMIT 5


row.`Organization A` row.`Organization B` row.Connection row.`Source(s)`
4 SHADOW TREE LANE MEMBER CORP. 4 SHADOW TREE LANE LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
40 WALL DEVELOPMENT ASSOCIATES LLC 40 WALL STREET LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
40 WALL STREET LLC 40 WALL STREET COMMERCIAL LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
40 WALL STREET MEMBER CORP. 40 WALL STREET LLC Ownership https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html
401 MEZZ VENTURE LLC 401 NORTH WABASH VENTURE LLC Ownership

https://www.documentcloud.org/documents/2838696-Trump-2016-Financial-Disclosure.html


You can see that it’s quite straightforward: two columns for the node-entities (i.e., organizations), one each for the type of connection, along with the source of the information which represent the relationships.

As the connection information is quite verbose and not regular enough to serve as relationship-type, we either have to simplify it, as is demonstrated in this more detailed TrumpWorld graph.

Or we can use one generic relationship-type per mapping, which we will do here for the sake of simplicity:

  • (:Organization)-[:CONNECTED_TO]->(:Organization)
  • (:Person)-[:INVOLVED_WITH]->(:Organization)
  • (:Person)-[:RELATED_TO]->(:Person)

We store the “connection” and “source” information on each relationship.

Image title

With this simple model at hand, we can now go ahead and import the data into our graph, following the same basic pattern:

  1. We get the data from the CSV file
  2. Create or find (MERGE) the two nodes
  3. Connect them with a relationship

Data Import

Data Constraints

As we don’t want duplicate entries for organizations and people, we set up two constraints:

CREATE CONSTRAINT ON (o:Organization) ASSERT o.name IS UNIQUE;
CREATE CONSTRAINT ON (p:Person) ASSERT p.name IS UNIQUE;


Connect Organizations With Other Organizations From the First Tab

WITH 'https://docs.google.com/spreadsheets/...' AS url
LOAD CSV WITH HEADERS FROM url AS row FIELDTERMINATOR '\t'
MERGE (o1:Organization {name:row.`Organization A`})
MERGE (o2:Organization {name:row.`Organization B`})
CREATE (o1)-[con:CONNECTED_TO]->(o2)
SET con.connection=row.Connection, con.source=row.`Source(s)`


Connect People With Organizations From the Second Tab

WITH 'https://docs.google.com/spreadsheets/...' AS url
LOAD CSV WITH HEADERS FROM url AS row FIELDTERMINATOR '\t'
MERGE (p:Person {name:row.Person})
MERGE (o:Organization {name:row.Organization})
CREATE (p)-[con:INVOLVED_WITH]->(o)
SET con.connection=row.Connection, con.source=row.`Source(s)`


Connect People With Other People From the Third Tab

WITH 'https://docs.google.com/spreadsheets/...' AS url
LOAD CSV WITH HEADERS FROM url AS row FIELDTERMINATOR '\t'
MERGE (p1:Person {name:row.`Person A`})
MERGE (p2:Person {name:row.`Person B`})
CREATE (p2)-[con:RELATED_TO]->(p1)
SET con.connection=row.Connection, con.source=row.`Source(s)`


Import Results

So, in total, we imported 1,514 nodes and 1,857 relationships.

And this is the full TrumpWorld graph in all its “beauty,” after querying for all data (raise the view limits in the sidebar config), go to fullscreen and then zoom out.

Image title

Running Some Simple Graph Database Queries

Most-Connected Organizations

What are the organizations that have the most connections within the TrumpWorld dataset, and which are those mainly?

MATCH (o:Organization)-[r]-()
RETURN o.name, count(*), collect(distinct type(r)) AS types
ORDER BY count(*) DESC
LIMIT 5


╒═════════════════════════════════════╤═══════════╤═══════════════════════════════════╕
│"o.name"                          │"count(*)"│"types"                         │
╞═════════════════════════════════════╪═══════════╪═══════════════════════════════════╡
│"THRIVE CAPITAL"                  │86        │["CONNECTED_TO","INVOLVED_WITH"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"40 WALL STREET LLC"              │41        │["INVOLVED_WITH","CONNECTED_TO"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"DJT HOLDINGS LLC"                │36        │["INVOLVED_WITH","CONNECTED_TO"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"KUSHNER COMPANIES"               │32        │["INVOLVED_WITH","CONNECTED_TO"]│
├─────────────────────────────────────┼───────────┼───────────────────────────────────┤
│"TRUMP HOTELS & CASINO RESORTS"   │28        │["INVOLVED_WITH","CONNECTED_TO"]│
└─────────────────────────────────────┴───────────┴───────────────────────────────────┘


The Second-Degree Kushner Network

MATCH network = (:Person {name:"JARED KUSHNER"})-[*..2]-()
RETURN network


Image title


Which Banks Are Involved With These Organizations and People?

MATCH (bank:Organization)--(other)
WHERE bank.name contains "BANK"
RETURN *


Image title



Donald Trump’s Connections to Vladimir Putin

And finally, let’s look at the shortest paths between Vladimir Putin and Donald Trump.

MATCH (vp:Person {name:"VLADIMIR PUTIN"}),(dt:Person {name:"DONALD J. TRUMP"})
MATCH path = allShortestPaths( (vp)-[*]-(dt) )
RETURN path


Image title


As expected, the shortest path between Putin and Trump officially goes through Rex Tillerson. But as you can also see, there is a missing direct connection between Trump and Tillerson, the current nominee for U.S. Secretary of State.

As we’ll see next, the other nominees are there, so this is one bit of missing information that I reported back to Buzzfeed.

Other Nominees of Trump’s Cabinet

MATCH (p:Person)-[con:RELATED_TO]->()
WHERE con.connection CONTAINS "Nominee"
RETURN p.name, con.connection
ORDER BY split(p.name," ")[-1] DESC LIMIT 5


p.name con.connection
RYAN ZINKE Nominee for Secretary of Interior
JEFF SESSIONS Nominee for Attorney General
WILBUR ROSS Nominee for Secretary of Commerce
ANDY PUZDER Nominee for Secretary of Labor
SCOTT PRUITT Nominee for Administrator of the Environmental Protection Administration (EPA)


You see that Rex Tillerson should be the second in this list, sorted reverse by last name.

Nominee Influence

And finally we can look at the combined influence that these nominees exert within the TrumpWorld graph – at least from what is in the data we have here:

MATCH (o:Organization)<-[:INVOLVED_WITH]-(p:Person)-[con:RELATED_TO]->()
WHERE con.connection CONTAINS "Nominee"
RETURN o,p


We see that these 18 individuals are involved with 299 organizations that they either work with, own or control, are indebted to or are otherwise involved with Donald Trump.

Image title


Fun Graph Query

To conclude, I want to demonstrate what graph databases are quite good at – quickly finding transitive relationships within data. First, I looked for the longest shortest paths in our graph, checking a cross product of roughly 17,000 pairs of people. This finished in little over three seconds on my laptop.

match (p1:Person), (p2:Person) where id(p1) < id(p2)
match p = shortestPath((p1)-[*]-(p2))
return p1,p2,length(p) order by length(p) desc limit 20


╒══════════════════════════════╤═══════════════════════════╤═════════════╕
│"p1"                        │"p2"                    │"length(p)" │
╞══════════════════════════════╪═══════════════════════════╪═════════════╡
│{"name":"VICTOR HEGGELMAN"} │{"name":"ZHUO RAN"}      │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"VICTOR HEGGELMAN"} │{"name":"CHRIS CHRISTIE"}│10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"TIM MOL"}          │{"name":"ZHUO RAN"}      │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"ROMA DOWNEY"}      │{"name":"ZHUO RAN"}      │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"CHRIS CHRISTIE"}   │{"name":"ROMA DOWNEY"}   │10         │
├──────────────────────────────┼───────────────────────────┼─────────────┤
│{"name":"LEV LEVIEV"}       │{"name":"JOSE GARCIA"}   │9          │
└──────────────────────────────┴───────────────────────────┴─────────────┘


If we pick two of those – Lev Leviev and Roma Downey – who have probably little to do with each other in real life except her wearing diamonds, we can see who connects them in our graph.

MATCH (p1:Person {name:"LEV LEVIEV"}), (p2:Person {name:"ROMA DOWNEY"})
MATCH path = allShortestPaths((p1)-[*]-(p2))
RETURN path


Image title

Extending the TrumpWorld Graph

There are many ways this graph can be extended, and I'll work on some of them in the coming weeks. Here are some ideas:

  • Combine with related data from littlesis.org, "the involuntary registry of the 1%"
  • Add FEC campaign financing information.
  • Add foundation detail data from citizenaudit.org.
  • Add people involved with the election campaign.
  • Add second-degree organizations related to the other people in the graph, not just Donald J. Trump.
  • Add more detailed information on the organizations and their setup (location, leadership, deals), e.g. by querying APIs like OpenCorporates or by integrating with the data from Aleph the OCCRP data portal.

Our partner Linkurious was inspired by the first part of my work to demonstrate how you'd visualize this data with their software.

Get comfortable using NoSQL in a free, self-directed learning course provided by RavenDB. Learn to create fully-functional real-world programs on NoSQL Databases. Register today.

Topics:
cypher ,neo4j ,graph database ,database ,tutorial ,data mining

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}