Over a million developers have joined DZone.

Analyzing the Panama Papers With Neo4j: Data Models, Queries, and More

In this post, we look at the graph data model used by the International Consortium of Investigative Journalists (ICIJ) and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.

Build fast, scale big with MongoDB Atlas, a hosted service for the leading NoSQL database. Try it now! Brought to you in partnership with MongoDB.

As the world has seen, the International Consortium of Investigative Journalists (ICIJ) has exposed highly connected networks of offshore tax structures used by the world’s richest elites.

These structures were uncovered from leaked financial documents and were analyzed by the journalists. They extracted the metadata of documents using Apache Solr and Tika, then connected all the information together using the leaked databases, creating a graph of nodes and edges in Neo4j and made it accessible using Linkurious’ visualization application.

In this post, we look at the graph data model used by the ICIJ and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.

Image title

The Steps Involved in the Document Analysis

  1. Acquire documents
  2. Classify documents
    1. Scan/OCR
    2. Extract document metadata
  3. Whiteboard domain
    1. Determine entities and their relationships
    2. Determine potential entity and relationship properties
    3. Determine sources for those entities and their properties
  4. Work out analyzers, rules, parsers, and named entity recognition for documents
  5. Parse and store document metadata and document and entity relationships
    1. Parse by author, named entities, dates, sources, and classification
  6. Infer entity relationships
  7. Compute similarities, transitive cover, and triangles
  8. Analyze data using graph queries and visualizations

Image title

Finding triads in the graph can show inferred connection. Here Bob has an inferred connection to CompanyB through CompanyA.

From Documents to Graph

A simple model of the organizational domain of business inter-relationships in a holding is simple and similar to the model you use in business registries, a common use case for Neo4j. As a minimum you have:
  • Clients
  • Companies
  • Addresses
  • Officers (both natural people and companies)
With these relationships:
  • (:Officer)-[:is officer of]->(:Company)
    • With these classifications:
      • protector
      • beneficiary, shareholder, director
      • beneficiary
      • shareholder
  • (:Officier)-[:registered address]->(:Address)
  • (:Client)-[:registered]->(:Company)
  • (:Officer)-[:has similar name and address]->(:Address)
With these relationships: All these entities have a lot of properties, like document numbers, share amounts, start- and end-dates of involvements, addresses, citizenship, and much more. Two entities of the same name can have very different amounts of information attached to them, though this depends on the relevant information that was extracted from the sources, e.g., some officers have only a name, others have a full record with more than 15 attributes.

Those have specific relationships like a person is the "officer of" a company. This is a basic domain that you can populate from documents about a tax-haven shell company holding, a.k.a. the #PanamaPapers.

Initially, you classify the raw documents by types and subtypes (like contract or invitation). Then you attach as much direct and indirect metadata as you can, either directly from the document types (like the senders and receivers of an email or parties of a contract). Inferred metadata is gained from the content of the documents. There are techniques like natural language processing, named entity recognition or plain text search for well-known terms like distinctive names or roles.

The first step to build your graph model is to extract those named entities from the documents and their metadata. This includes companies, persons, and addresses. These entities become nodes in the graph. For example, from a company registration document, we can extract the company and officer entities.

Some relationships can be directly inferred from the documents. In the previous example, we would model the officer as directly connected to the company:

Other relationships can be inferred by analyzing email records. If we see several emails between a person and a company we can infer that the person is a client of that company:
We can use similar logic to create relationships between entities that share the same address, have family ties or business relationships or that regularly communicate.
    • Direct metadata -> entities -> relationships to documents
      • author, receivers, account-holder, attached to, mentioned, co-located
      • Turn plain entities / names into full records using registries and profile documents
    • Inferred metadata and information from other sources -> Relationships between entities
      • Related to people or organizations from the direct metadata
      • Same addresses / organizations
      • Find peer groups / rings within fraudulent activities
      • Family ties, business relationships
      • Part of the communication chain

Image title

The graph data model used by the ICIJ

Issues with the ICIJ Data Model

There are some modeling and data quality issues with the ICIJ data model.

The ICIJ data contains a lot of duplicates, but only a few of which are connected by a “has similar name or address” relationship, mostly those can be inferred by first and last part of a name together with addresses and family ties. It would also be beneficial for the data model to actually merge those duplicates, then certain duplicate relationships could also be merged.

In the ICIJ data model, shareholder information like number of shares, issue dates, etc. is stored with the “Officer” where the officer can be shareholder in any number of Companies. It would be better to store that shareholder information on the “is officer of – Shareholder” relationship.

Some of the Boolean properties could be represented as labels, eg. “citizenship=yes” could be a Person label.

How Could You Extend the Basic Graph Model Used by the ICIJ?

The domain model used by the ICIJ is really basic, just containing four types of entities (Officer, Client, Company, Address) and four relationships between them. It is more or less a static view on the organizational relationships but doesn’t include interactions or activities. Looking at the source documents and the other activities outlined in the report, there are many more things which can enrich this graph model to make it more expressive.

We can model the original documents and their metadata and the relationships to people. Part of those relationships are also inferred relationships from being part of conversations or being mentioned or the subject of documents. Other interesting relationships are aliases and interpretations of entities that were used during the analysis, which allows other journalists to reproduce the original thought processes.

Also, the sources for additional information like business registries, watch-lists, census records, or other journalistic databases can be added. Human relationships like family or business ties can be created explicitly as well as implicit relationships that infer that the actors are part of the same fraudulent group or ring.

Another aspect that is missing is the activities and the money flow. Examples of activities are opening/closing of accounts, creation or merger of companies, filing records for those companies, or assigning responsibilities. For the money flow, we could track banks, accounts, and intermediaries used with the monetary transactions mentioned, so you can get an overview of the amounts transferred and the patterns of transfers. Those patterns can then be applied to extract additional fraudulent money flows from other transaction systems.

Graph data is very flexible and malleable, as soon as you have a single connection point, you can integrate new sources of data and start finding additional patterns and relationships that you couldn’t trace before.

    • New Entities:
      • Documents: E-Mail, PDF, Contract, DB-Record, …
      • Money Flow: Accounts / Banks / Intermediaries
    • New Relationships
      • Family / business ties
      • Conversations
      • Peer Groups / Rings
      • Similar Roles
      • Mentions / Topic-Of
      • Money Flow

Let’s Look at a Concrete Example

Let’s look at the family of the Azerbaijan’s President Ilham Aliyev who was already the topic of a GraphGist by Linkurious in the past. We see his wife, two daughters, and son depicted in the graphic below.

Image title

Quoting the ICIJ “The Power Players” Publication (emphasis for names added):

The family of Azerbaijan President Ilham Aliyev leads a charmed, glamorous life, thanks in part to financial interests in almost every sector of the economy. His wife, Mehriban, comes from the privileged and powerful Pashayev family that owns banks, insurance and construction companies, a television station and a line of cosmetics. She has led the Heydar Aliyev Foundation, Azerbaijan’s pre-eminent charity behind the construction of schools, hospitals and the country’s major sports complex. Their eldest daughter, Leyla, editor of Baku magazine, and her sister, Arzu, have financial stakes in a firm that won rights to mine for gold in the western village of Chovdar and Azerfon, the country’s largest mobile phone business. Arzu is also a significant shareholder in SW Holding, which controls nearly every operation related to Azerbaijan Airlines (“Azal”), from meals to airport taxis. Both sisters and brother Heydar own property in Dubai valued at roughly $75 million in 2010; Heydar is the legal owner of nine luxury mansions in Dubai purchased for some $44 million.
We took the data from the ICIJ visualization and converted the 2d graph visualization into graph patterns in the Cypher query language. If you squint, you can still see the same structure as in the visualization. We only compressed the “is officer of – Beneficiary, Shareholder, Director” to IOO_BSD and prefixed the other “is officer of” relationships with IOO.

We didn’t add shares, citizenship, reg-numbers, or addresses that were properties of the entities or relationships. You can see them when clicking on the elements of the embedded original visualization.

Cypher Statement to Set Up the Visualized Entities and Relationships

(leyla: Officer {name:"Leyla Aliyeva"})-[:IOO_BSD]->(ufu:Company {name:"UF Universe Foundation"}),
(mehriban: Officer {name:"Mehriban Aliyeva"})-[:IOO_PROTECTOR]->(ufu),
(arzu: Officer {name:"Arzu Aliyeva"})-[:IOO_BSD]->(ufu),
(mossack_uk: Client {name:"Mossack Fonseca & Co (UK)"})-[:REGISTERED]->(ufu),
(mossack_uk)-[:REGISTERED]->(fm_mgmt: Company {name:"FM Management Holding Group S.A."}),

(leyla)-[:IOO_BSD]->(kingsview:Company {name:"Kingsview Developents Limited"}),
(leyla2: Officer {name:"Leyla Ilham Qizi Aliyeva"}),
(leyla3: Officer {name:"LEYLA ILHAM QIZI ALIYEVA"})-[:HAS_SIMILIAR_NAME]->(leyla),
(leyla2)-[:IOO_BENEFICIARY]->(exaltation:Company {name:"Exaltation Limited"}),
(arzu2:Officer {name:"Arzu Ilham Qizi Aliyeva"})-[:IOO_BENEFICIARY]->(exaltation),
(arzu2)-[:HAS_SIMILIAR_NAME]->(arzu3:Officer {name:"ARZU ILHAM QIZI ALIYEVA"}),

(redgold:Company {name:"Redgold Estates Ltd"}),
(:Officer {name:"WILLY & MEYRS S.A."})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"LONDEX RESOURCES S.A."})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"FAGATE MINING CORPORATION"})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"GLOBEX INTERNATIONAL LLP"})-[:IOO_SHAREHOLDER]->(redgold),
(:Client {name:"Associated Trustees"})-[:REGISTERED]->(redgold)
Image title

Interesting Queries

Family Ties via Last Name

MATCH (o:Officer) 
WHERE toLower(o.name) CONTAINS "aliyev"
Image title

Family Involvements

MATCH (o:Officer) WHERE toLower(o.name) CONTAINS "aliyev"
MATCH (o)-[r]-(c:Company)
RETURN o,r,c
Image title

Who Are the Officers of a Company and Their Roles

MATCH (c:Company)-[r]-(o:Officer) WHERE c.name = "Exaltation Limited"
Image title

Show Joint Company Involvements of Family Members

MATCH (o1:Officer)-[r1]->(c:Company)<-[r2]-(o2:Officer)
WITH o1.name AS first, o2.name AS second, count(*) AS c, 
     collect({ name: c.name, kind1: type(r1), kind2:type(r2)}) AS involvements
WHERE c > 1 AND first < second
RETURN first, second, involvements, c
Joint Company Involvement of Family Members in the Azerbaijan Data

Resolve Duplicate Entities

MATCH (o:Officer) 
RETURN toLower(split(o.name," ")[0]), collect(o.name) as names, count(*) as count
Resolving Duplicate Entities in the Azerbaijan Data

Resolve Duplicate Entities by First and Last Part of the Name

MATCH (o:Officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0] + " " + name_parts[-1] as name,  collect(o.name) AS names, count(*) AS count
WHERE count > 1
RETURN name, names, count
Resolve Duplicate Data Entities by First and Last Part of Name

Transitive Path from Mossack to the Officers in that Example

MATCH path=(:Client {name: "Mossack Fonseca & Co (UK)"})-[*]-(o:Officer)
WHERE none(r in rels WHERE type(r) = "HAS_SIMILIAR_NAME")
RETURN [n in nodes(path) | n.name] as hops, length(path)
The Transitive Path between Mossack Fonseca and Company Officers in the Panama Papers Data

Shortest Path Between Two People

MATCH (a:Officer {name: "Mehriban Aliyeva"})
MATCH (b:Officer {name: "Arzu Aliyeva"}) 
MATCH p=shortestPath((a)-[*]-(b))
Finding a Shortest Path in Neo4j in the Azerbaijan Data

Further Work – Extension of the Model

Merge Duplicates

Create a person node and connect all officers to that single person. Reuse our statement from the duplicate detection.
MATCH (o:Officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0]+ " " + name_parts[-1] AS name, collect(o) AS officers

// originally natural people have a “citizenship” property
WHERE name CONTAINS "aliyev"

CREATE (p:Person { name:name })
FOREACH (o IN officers | CREATE (o)-[:IDENTITY]->(p))

Introduce Family Ties between Those People

CREATE (ilham:Person {name:"ilham aliyev"})
CREATE (heydar:Person {name:"heydar aliyev"})
WITH ilham, heydar
MATCH (mehriban:Person {name:"mehriban aliyeva"})

MATCH (leyla:Person {name:"leyla aliyeva"})
MATCH (arzu:Person {name:"arzu aliyeva"})

FOREACH (child in [leyla,arzu,heydar] | CREATE (ilham)-[:CHILD_OF]->(child) CREATE (mehriban)-[:CHILD_OF]->(child))
CREATE (leyla)-[:SIBLING_OF]->(arzu)
CREATE (leyla)-[:SIBLING_OF]->(heydar)
CREATE (arzu)-[:SIBLING_OF]->(heydar)
CREATE (ilham)-[:MARRIED_TO]->(mehriban)

Show the Family

MATCH (p:Person) RETURN p
The Aliyev Family in the Azerbaijan Data

Family Ties to Companies

MATCH (p:Person) WHERE p.name CONTAINS "aliyev"
OPTIONAL MATCH (c:Company)<--(o:Officer)-[:IDENTITY]-(p) 
RETURN c,o,p
Family Ties to Companies in the Azerbaijan Data


You can explore the example dataset yourself in this interactive graph model document (called a GraphGist). You can find many more for various use-cases and industries on our GraphGist portal.

Related Information

Now it's easier than ever to get started with MongoDB, the database that allows startups and enterprises alike to rapidly build planet-scale apps. Introducing MongoDB Atlas, the official hosted service for the database on AWS. Try it now! Brought to you in partnership with MongoDB.


Published at DZone with permission of William Lyon. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}