Using the Elastic Graph on Panama Papers Analysis
Mark Harwood shows how to use the Elastic Graph to analyze connections in data, even when it means sorting through the complexity of the Panama Papers.
The new Elastic Graph capabilities allow you to analyse connections in data. Whether it is chasing down the tangled web of offshore financial arrangements in the Panama Papers or a high-level overview of click behaviour on a busy ecommerce website, Graph technology helps bring these relationships into focus.
The Graph capability is bundled as part of the commercial X-Pack plugins for the Elastic Stack and includes a Kibana app and a new Elasticsearch API. In this first Graph blog we'll take a brief look at what the combination of the Kibana app and the API can offer.
Forensic Analysis: Panama Papers
The release of financial and legal records from the offshore law firm Mossack Fonseca is one of the most explosive news stories of 2016. The records reveal that many politicians, members of royalty, the rich and their families are exploiting networks of shell companies established in secretive offshore tax regimes. Journalists and financial institutions are now intently focused on this data, but unraveling the connections can be both difficult and time-consuming. The Kibana Graph app makes this process simple for anyone:
Above we see the companies and individuals connected to Vladimir Putin's close friend, Sergei Roldugin. This picture was built up from a few simple steps:
Selecting the Datasource
Initially we select "panama" from our list of indices and then select one or more fields whose values we want to show in the diagram. Each field can be given an icon and colour for the "vertices" that will appear in the diagram.
Running a Search
Now we can run a regular free-text search to match documents containing the name of Putin's friend, "Roldugin."
The terms found in the matching documents are shown as a network — each line representing one or more documents that connect a pair of terms.
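Behind the scenes, the Kibana app turns the field selection and the free-text search into a request to the Graph API. The following is a minimal sketch of what such a request body might look like, not the exact request the app sends. The field names (`person_id`, `company_id`, `address_id`) are hypothetical, and the top-level keys (`query`, `vertices`, `connections`) follow the Graph explore API as we understand it, so check the documentation for your version, including the endpoint path, which has changed between releases.

```python
import json


def build_explore_request(search_text, vertex_fields, size=5):
    """Build a Graph explore request body seeded by a free-text search.

    search_text   -- the text typed into the search box, e.g. "Roldugin"
    vertex_fields -- the fields whose terms should appear as vertices
    size          -- maximum number of terms to return per field
    """
    return {
        # A regular Elasticsearch query selects the seed documents.
        "query": {"query_string": {"query": search_text}},
        # Terms from these fields become the first vertices in the diagram.
        "vertices": [{"field": f, "size": size} for f in vertex_fields],
        # Terms connected to those first vertices are pulled in as well.
        "connections": {
            "vertices": [{"field": f, "size": size} for f in vertex_fields]
        },
    }


# Hypothetical field names; the real index mapping may differ.
body = build_explore_request(
    "Roldugin", ["person_id", "company_id", "address_id"]
)
print(json.dumps(body, indent=2))
```

The body is only built and printed here; sending it to a cluster is left out so the sketch runs without a server.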
The journalists at the ICIJ who are curating the data have tried to give each real-world entity (person/company/address) a unique ID that is attached to every document that references them.
Unfortunately, the names and addresses of people can be awkward to match. The journalists correctly identified three documents that are connected to Person entity 12180773, but we can see two other people with similar names who have been assigned different identity codes. Equally, there are two addresses that look similar but have been assigned different identity codes. In future blog posts, we will talk about using the Graph API for automated entity resolution. For now, let's fix this manually with the grouping tool.
Using the advanced mode tools, we can select vertices and then click the group button to merge them. This gives us a cleaner picture.
If we wanted, we could further group already grouped items, for example by merging people with multiple identities into single vertices and then merging those into company vertices.
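Conceptually, grouping replaces several vertices with one and re-points their lines at the merged vertex. The sketch below illustrates that idea on a plain edge list; it is not how the Kibana app is implemented. Apart from Person entity 12180773, all identifiers and document counts are made up for illustration.

```python
def group_vertices(edges, members, group_label):
    """Merge the vertices in `members` into a single vertex.

    edges       -- iterable of (vertex_a, vertex_b, doc_count) tuples
    members     -- vertices to collapse into one
    group_label -- name of the merged vertex
    Returns a sorted list of merged (vertex_a, vertex_b, doc_count) tuples.
    """
    members = set(members)
    merged = {}
    for a, b, doc_count in edges:
        a = group_label if a in members else a
        b = group_label if b in members else b
        if a == b:
            continue  # a line between two grouped vertices disappears
        key = tuple(sorted((a, b)))
        # Summing is a simplification: a document that supports two of the
        # original lines would be counted twice here.
        merged[key] = merged.get(key, 0) + doc_count
    return sorted((a, b, n) for (a, b), n in merged.items())


# Hypothetical edges; only the ID 12180773 comes from the article.
edges = [
    ("person:12180773", "company:A", 3),
    ("person:X1", "company:A", 1),
    ("person:X2", "company:B", 1),
    ("person:X1", "person:12180773", 1),
]
grouped = group_vertices(
    edges, ["person:12180773", "person:X1", "person:X2"], "person:GROUP"
)
print(grouped)
```

Because the result is again a plain edge list, the same function can be applied to already grouped vertices, which mirrors the nested grouping described above.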
Now what if we wanted to see what else was connected with these entities? We can continue to explore the connections in the data using the "+" button on the toolbar to pull in other related entities.
We can expand the picture further by pressing "+" repeatedly, and use selections to focus on expanding only certain areas of the graph. The undo and redo buttons are important for backing out of any uninteresting results. Additionally, the delete and blacklist buttons control which vertices are currently visible and which can return in later results. Snippets of example documents behind selected vertices can also be shown.
Wisdom of Crowds
The Panama Papers are an example of a detailed "forensic" type investigation where each single document may represent a highly important connection.
However, where the Elastic Graph technology can really shine is in its ability to summarise mass user behaviour such as the data found in click logs. Common phrases used to describe this form of analysis are mining "collective intelligence" or "wisdom of crowds." In these scenarios, each document by itself is not particularly interesting, but the emergent patterns from many user behaviours are. They can be used to drive recommendations such as "people who bought product X also tend to buy product Y." Here we need to avoid the one-off documents that make spurious connections, and equally avoid overly obvious associations such as "people who bought product X also bought milk" (most people buy milk). With this in mind, the default settings that control graph exploration are tuned for identifying only the most significant associations.
Let's look at a recommendation use case using the LastFM dataset. If we build a user-centric index, we have a single document per listener which contains an array of the music artists they like. Let's query this index for the people who like "Chopin" to see what else they like.
The classical artists returned are obviously strongly related — clicking on a line between two vertices shows us just how many listeners share these interests. Nearly half of all Mendelssohn listeners also like Chopin.
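To make the user-centric index and the numbers behind a line concrete, here is a small sketch on toy data. The listener IDs and events are invented and are not taken from the LastFM dataset; the helper names are hypothetical too. It folds raw (listener, artist) events into one document per listener and then computes the share of one artist's listeners who also like another.

```python
from collections import defaultdict

# Hypothetical raw events: (listener_id, artist)
events = [
    ("u1", "Chopin"), ("u1", "Mendelssohn"),
    ("u2", "Chopin"), ("u2", "Radiohead"),
    ("u3", "Mendelssohn"),
    ("u4", "Chopin"), ("u4", "Mendelssohn"), ("u4", "Radiohead"),
]


def to_user_docs(events):
    """Fold events into one document per listener with an array of artists."""
    likes = defaultdict(set)
    for listener, artist in events:
        likes[listener].add(artist)
    return [
        {"listener": listener, "artists": sorted(artists)}
        for listener, artists in sorted(likes.items())
    ]


def share_also_liking(docs, artist, other):
    """Fraction of the listeners of `artist` who also like `other`."""
    base = [d for d in docs if artist in d["artists"]]
    if not base:
        return 0.0
    both = sum(1 for d in base if other in d["artists"])
    return both / len(base)


docs = to_user_docs(events)
share = share_also_liking(docs, "Mendelssohn", "Chopin")
print(f"{share:.0%} of Mendelssohn listeners also like Chopin")
```

In the real index, the count of documents containing both terms is the kind of figure shown when clicking a line between two vertices.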
The Graph API has identified only the significant associations. This is an important distinction from other graph exploration technologies.
Popular != Significant
Let's see what happens if we open the settings tab and deliberately turn off the feature that looks for only significant connections.
With the "significant links" checkbox unchecked let's re-run the query for Chopin listeners — the results are very different.
Notice that the (globally) popular artists such as Radiohead and Coldplay have now crept into the results. Among the 5,721 Chopin fans, 1,843 like the Beatles. That's certainly a popular choice, but it is what we call "commonly common", like people who buy milk. When the switch for "significant links" is turned on, we tune out noise and focus on signal: what we call the "uncommonly common". For those from an information retrieval background, the TF-IDF algorithm that has powered search engines for years is based on these same principles.
By reusing these relevance ranking techniques, we can stay "on topic" when exploring connections in data. This is an important distinction from graph databases, which have no concept of relevance ranking and are typically forced to employ a strategy of just deleting popular items (see the problem of "supernodes").
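The difference between "popular" and "significant" can be shown with a little arithmetic. The sketch below compares how common an artist is among Chopin fans (the foreground) with how common they are among all listeners (the background). The score is a simple lift-style measure chosen for illustration; the scoring Graph actually uses may differ. Only the figures 5,721 and 1,843 come from the text above; every other number is hypothetical.

```python
def significance(fg_count, fg_total, bg_count, bg_total):
    """Score how over-represented a term is in the foreground set.

    Computes (fg% - bg%) * (fg% / bg%), and returns 0.0 when the term is
    no more common in the foreground than in the background.
    """
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if bg == 0 or fg <= bg:
        return 0.0
    return (fg - bg) * (fg / bg)


CHOPIN_FANS = 5721            # from the article
BEATLES_AMONG_CHOPIN = 1843   # from the article

# Everything below is hypothetical, for illustration only.
ALL_LISTENERS = 100_000
BEATLES_OVERALL = 30_000
MENDELSSOHN_AMONG_CHOPIN = 900
MENDELSSOHN_OVERALL = 2_000

beatles_score = significance(
    BEATLES_AMONG_CHOPIN, CHOPIN_FANS, BEATLES_OVERALL, ALL_LISTENERS
)
mendelssohn_score = significance(
    MENDELSSOHN_AMONG_CHOPIN, CHOPIN_FANS, MENDELSSOHN_OVERALL, ALL_LISTENERS
)
print(f"Beatles:     {beatles_score:.3f}")
print(f"Mendelssohn: {mendelssohn_score:.3f}")
```

Under these assumed numbers, the Beatles have more fans among Chopin listeners in absolute terms, yet score far lower because they are almost as common everywhere else.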
Note: When performing detailed forensic work such as exploring the Panama Papers, it can help to keep "significant links" turned on to try to avoid the super-popular companies, but the "certainty" setting should be lowered from its default value of three to one. For wisdom-of-crowds scenarios, we want at least three documents to assert a link before we trust it, whereas in forensics every document is potentially interesting.
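The two modes can be thought of as two sets of exploration settings. The sketch below assumes that the "significant links" checkbox corresponds to the `use_significance` control in the Graph API and that "certainty" corresponds to `min_doc_count` on each vertex field. That mapping is our reading of the API rather than something stated in this post, and the function and field names are hypothetical.

```python
# Certainty values taken from the note above: 1 for forensics, 3 by default.
CERTAINTY_BY_MODE = {"forensics": 1, "crowds": 3}


def exploration_settings(mode, vertex_fields):
    """Return a request-body fragment for the given usage mode.

    mode          -- "forensics" or "crowds"
    vertex_fields -- fields whose terms should appear as vertices
    """
    if mode not in CERTAINTY_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    min_docs = CERTAINTY_BY_MODE[mode]
    return {
        # Keep significance on in both modes to tune out super-popular terms.
        "controls": {"use_significance": True},
        # Assumed mapping: "certainty" in the app -> min_doc_count in the API.
        "vertices": [
            {"field": f, "min_doc_count": min_docs} for f in vertex_fields
        ],
    }


# Hypothetical field names.
forensic = exploration_settings("forensics", ["person_id", "company_id"])
crowds = exploration_settings("crowds", ["artists"])
print(forensic)
print(crowds)
```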
Hopefully this blog has provided a quick taste of the two main usage modes for Graph:
- Forensics: Every document is potentially interesting. Search is "zoomed in" on individual records and actors and no stone goes unturned.
- Wisdom of crowds: Zoomed-out "big picture" overviews of mass behaviour. There are too many noisy connections to be drawn so the focus is on summarising only the most significant connections in the data.
In future blog posts, we'll look in more depth at specific use cases and get to grips with the Elasticsearch API used behind the scenes by the Kibana app.
Our blog URL is: https://www.elastic.co/blog
Published at DZone with permission of Mark Harwood , DZone MVB. See the original article here.