Making Your Data Intelligent
Back in 1965, Ted Nelson coined the term “hypertext.” In his vision, documents would have paths going in both directions, connecting them all together. On the web that Tim Berners-Lee built, we only got half of that vision: what we know as links today travel in just one direction. It wasn’t until search engines started mapping those links into a graph that the web began to make sense.
A row of data stored in a relational database has an even worse story. To understand how it is connected, you must tell it exactly which tables to join and how to join them. It doesn’t even understand the concept of links; those are reserved for the auxiliary join tables. The data stored in a relational database requires your SQL expertise to deliver any value. But what happens when we move that data into a graph database? What happens when your rows become nodes and your join tables become real relationships connecting them? All of a sudden, your data knows exactly what it is connected to and how.
Your data becomes intelligent, so your queries don’t have to be. You can now simply ask, “Are these two things connected in some way?” and your data will tell you. What about indirectly connected: three, four, five jumps away? It knows. It’s the same data you’ve had all along, but now it can start to reveal its secrets. These are the kinds of questions you may never have thought to ask, but you will. While some enterprises are still working on data lakes with varying levels of success, leading organizations are looking over the horizon at the next thing: the knowledge base.
Yes, knowledge bases. Just like Bumblebee from the original Transformers, somebody came back in a DeLorean from 1985 to make them cool again. But this time, they are bigger, they are more powerful, and they may yet survive the AI winter that killed their earlier incarnations. Only this time, it’s not just about building expert systems. It is about understanding the fabric and structure of enterprise data: understanding the entire business, end to end, and turning that into knowledge. What are the ways these things over here connect to those things over there? For example, of the more than 1,000 press releases issued today alone, which ones should you read because they are most likely to affect your investment portfolio? Not simply because a press release mentions one of your investments, but because its customers, partners, suppliers, competitors, subsidiaries, etc. are mentioned two and three levels away?
How would it know? Is it a fortune teller or an oracle? No; there is no magic in computer science. So, how exactly do graph databases make data intelligent? The trick is in how the data is stored. The following description is a little abstracted from the underlying mechanics, but the principles are the same: You have an array of nodes, and you have an array of relationships. The relationships point to the node they came from and the node they go to, just like a join table would in a relational database, except they don’t use foreign keys that have to be found in indexes. Instead, they use pointers.
The nodes each have sets of relationships grouped by type and direction (incoming or outgoing). Let’s take the example of a social network. Node 1 is a user. It has 150 “friends” relationships, 40 “likes” relationships, 80 “posts” relationships, etc. We want to get the emails of their 150 friends. Starting at node 1, we go to the set holding the 150 outgoing friend relationships, jump to each one by following a pointer, then from each of these, follow a pointer to the node on the other side of the relationship. Then, for each one, we grab their email address.
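The traversal above can be sketched in a few lines of Python. This is a toy model, not any particular database's storage engine: the class and property names are illustrative, but the principle is the same, with relationship records holding direct references to their start and end nodes, grouped on each node by type and direction, so no index lookup is ever needed.

```python
class Relationship:
    """A relationship record holding direct pointers to both endpoints."""
    def __init__(self, start, end):
        self.start = start  # pointer to the origin node
        self.end = end      # pointer to the target node

class Node:
    def __init__(self, props):
        self.props = props
        # relationships grouped by (type, direction) -> list of records
        self.rels = {}

    def add_rel(self, rel_type, other):
        rel = Relationship(self, other)
        self.rels.setdefault((rel_type, "out"), []).append(rel)
        other.rels.setdefault((rel_type, "in"), []).append(rel)

def friend_emails(user):
    # Follow each outgoing "friends" relationship pointer, then the
    # pointer to the node on the other side, and grab its email.
    return [rel.end.props["email"]
            for rel in user.rels.get(("friends", "out"), [])]

alice = Node({"email": "alice@example.com"})
bob = Node({"email": "bob@example.com"})
carol = Node({"email": "carol@example.com"})
alice.add_rel("friends", bob)
alice.add_rel("friends", carol)
print(friend_emails(alice))  # ['bob@example.com', 'carol@example.com']
```

Note that nothing in `friend_emails` depends on how many other users exist; the work is proportional only to the relationships actually followed.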
It’s called “index-free adjacency,” but it really means that traversals follow pointers instead of looking up keys. This friends query will take the same amount of time regardless of whether there are one million or one billion users in the database. The sets of relationships grouped by type and direction also help us ignore connections we don’t want. Instead of a single “posts” relationship type, imagine one per day: POSTS_2018_12_31, POSTS_2019_01_01, etc. If we want to see what our friends posted on any particular day, we can follow just that day’s relationship type, which means our query scales no matter how many posts our friends accumulate over time, because we only touch one day’s worth of data.
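Here is a minimal sketch of that per-day relationship-type trick, using a plain dictionary of outgoing relationship lists (the structure and names are illustrative, not a real database API). Because the query only reads the list stored under one dated type, friends with years of posting history cost no more to query than friends with one day's worth.

```python
from collections import defaultdict

class Node:
    def __init__(self, name):
        self.name = name
        self.out = defaultdict(list)  # relationship type -> target nodes

    def connect(self, rel_type, other):
        self.out[rel_type].append(other)

def friends_posts_on(user, day):
    rel_type = f"POSTS_{day}"  # e.g. "POSTS_2019_01_01"
    posts = []
    for friend in user.out["FRIENDS"]:
        # Only that day's relationship set is ever touched.
        posts.extend(post.name for post in friend.out[rel_type])
    return posts

alice, bob = Node("alice"), Node("bob")
alice.connect("FRIENDS", bob)
bob.connect("POSTS_2019_01_01", Node("post-1"))
bob.connect("POSTS_2018_12_31", Node("post-2"))
print(friends_posts_on(alice, "2019_01_01"))  # ['post-1']
```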
You can play the same trick in relational databases by using partitioning or by having one join table between users and posts per day, but that’s like teaching an old dog new tricks; with graph databases, you get this for free. Another thing to note is that following the 150 “friends” relationships is (almost) the same as following a single “parent” relationship 150 times in 150 different nodes. The cost of a query is determined by how many relationships are traversed to find the answer, regardless of what or where those relationships take us and regardless of the overall size or complexity of the data.
In many of today’s enterprises, data is kept in silos, even though customer data is connected to products, processes, supply chain, manufacturing, staff, legislation, and all the underlying systems that support the entire organization (and, in some cases, partner, supplier, and competitor data, too). How would new legislation affecting your suppliers impact your business? It’s hard to know as long as the data remains separate. Newer master data management projects are using graph databases because they are perfectly suited to handle this kind of project.
As data grows and structure changes, new relationship types and new types of objects can be added on the fly. A key concept in graph databases is that schema is optional or fluid. Nodes of the same type may have different properties and relationships, and they don’t store null values. While relational databases leave you with wide tables full of nulls, graph databases keep only what exists.
Speaking of existence, when building recommendation systems, we look for relationships that don’t exist but should: you should buy this product or watch this movie. We know this because people whose behavior is similar to yours have bought that product or watched that movie. In fraud detection systems, we look for relationships that should not exist: your insurance claim processor just so happens to be the cousin of the husband of the person making a dubious claim. Recommendations and fraud are two sides of the same coin, and both are addressed very well by a database that makes relationships first-class citizens.
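The recommendation side of that coin can be sketched as a two-hop traversal: follow "likes" out from you, back to people with overlapping taste, then out again to things they liked that you haven't yet. This is a toy collaborative filter over hypothetical data, not a production recommender.

```python
from collections import Counter

# user -> set of products they like (illustrative data)
likes = {
    "you": {"p1", "p2"},
    "ann": {"p1", "p2", "p3"},
    "ben": {"p2", "p4"},
}

def recommend(user):
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)      # how similar their behavior is
        for product in theirs - mine:     # relationships you lack, but might want
            scores[product] += overlap
    return [product for product, _ in scores.most_common()]

print(recommend("you"))  # ['p3', 'p4'] -- ann's stronger overlap ranks p3 first
```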
It’s not only relationships, though. Having data in a graph makes pulling graph metrics a snap. How many nodes are connected to this node? How many of those are connected to each other? What is the sum of the weights of those relationships? What is the value of the strongest relationship? Which node is the most central by betweenness, closeness, harmonic centrality, etc.? What are these metrics two or three hops out? What are people doing with these metrics? They are feeding them into machine learning models, where the structure of the subgraph around a node may be just as important as its properties, if not more so.
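The simpler of those metrics are easy to compute directly from an adjacency structure. The sketch below uses a hand-built weighted graph (the data is illustrative); centralities like betweenness, closeness, or harmonic centrality would in practice come from a graph algorithms library rather than being rolled by hand.

```python
from itertools import combinations

# adjacency: node -> {neighbor: relationship weight}
graph = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 0.5},
    "c": {"a": 2.0, "b": 0.5, "d": 3.0},
    "d": {"c": 3.0},
}

def metrics(g, node):
    nbrs = g[node]
    # how many of this node's neighbors are connected to each other
    linked_pairs = sum(1 for u, v in combinations(nbrs, 2) if v in g[u])
    return {
        "degree": len(nbrs),                 # nodes connected to this node
        "neighbor_links": linked_pairs,      # interconnections among them
        "weight_sum": sum(nbrs.values()),    # total relationship weight
        "strongest": max(nbrs.values()),     # strongest relationship
    }

print(metrics(graph, "c"))
# {'degree': 3, 'neighbor_links': 1, 'weight_sum': 5.5, 'strongest': 3.0}
```

A feature vector of exactly these numbers, computed per node, is the kind of structural signal fed into the machine learning models mentioned above.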
This is key and bears repeating: The shape of the data can be just as important as the values in the data. We may not know how old a user is, and they may have friends of all ages, but if many of their friends aged 35-40 are all friends with each other, chances are the user is in that same range. The financial transfers may seem normal until you realize that the recipients are all connected somehow through deep and tangled corporate ownership links. Graph databases let you clearly see these relationships that were hiding in your data all along. As your data becomes intelligent, so do you.