Moving Toward Smarter Data: Graph Databases and Machine Learning
A brief overview of database solutions, an introduction to using machine learning and graph databases, and real-world use cases for putting context back into your data.
When we used to think about data, it was most often in regard to where data was going to be stored and how we would manage it. Yes, files worked for a while, but when manipulating data became an important business priority across industries, the “file” solution didn’t work so well anymore. To meet these increasing demands, applications were designed and developed, addressing data storage and manipulation needs simultaneously — thus, the “database” was born.
Today, data is viewed quite differently. Beyond data manipulation, organizations are focusing on mining their data for more visibility into and a deeper understanding of the intelligence within that data. Utilizing the insights acquired from their data to help make informed business decisions is a key priority for business leaders and a major concern in the development, evolution, and adoption of database solutions.
A new term has emerged in the industry: digital assets. The data the world has been obsessing over is now considered one of the most valuable assets of a business, and its significance is clearer than ever. We have all heard of the major players in data accumulation, the companies whose largest asset is the data they have at their disposal: Google, Facebook, Amazon, Netflix, and others. This has a lot to do with the amount of data being created.
There are numerous (if not endless) ways to organize your data when it comes to database solutions, but the choice ultimately comes down to SQL, NoSQL, and graph databases.
- SQL – Offers implicit relationships between tables. For example, a constraint can be created by using an id column in one table as a reference (a foreign key) to a row in another table.
- NoSQL – Can have both implicit and explicit relationships, but it lacks the structure of an SQL schema.
- Graph – Has an explicit structure around relationships between nodes of data, and it can also have implicit relationships using ids to index into other nodes or other databases. Graph databases contain more structural relationships than relational databases.
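The implicit relationship described for SQL can be seen in a small sketch. It uses Python's built-in sqlite3 module; the person and friendship tables and the names in them are invented for illustration and do not come from any particular product. The friendship exists only as a pair of ids, so the database has to join the tables to recover who is friends with whom.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE friendship (
        person_id INTEGER NOT NULL REFERENCES person(id),
        friend_id INTEGER NOT NULL REFERENCES person(id)
    )
    """
)
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob")],
)
conn.execute("INSERT INTO friendship (person_id, friend_id) VALUES (1, 2)")

# The relationship is implicit: it only becomes visible through a join on ids.
friend_names = [
    row[0]
    for row in conn.execute(
        "SELECT f.name FROM friendship AS fs "
        "JOIN person AS f ON f.id = fs.friend_id "
        "WHERE fs.person_id = ?",
        (1,),
    )
]

# The constraint rejects a friendship that points at a person who does not exist.
try:
    conn.execute("INSERT INTO friendship (person_id, friend_id) VALUES (1, 99)")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

print(friend_names, constraint_enforced)
```

In a graph database the same friendship would be stored as an edge in its own right rather than reconstructed from matching ids at query time.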
In mid-2018, a Forbes article cited a collection of stats about global data accumulation, one of which stated that 2.5 quintillion (2.5E18) bytes of data are created every day (original calculation presented in Data Never Sleeps 5.0). Obviously, this is too much data for humans to sift through, make sense of, and use to create products and services. Machine learning has risen to this challenge, helping the major companies noted above, and many others, create digital assets from their data.
This success has driven scientists and engineers to seek more ambitious and effective ways to use big data, such as funneling large datasets into machine learning algorithms whose output is a deeper understanding of what lies in the data.
These insights result in increasingly better datasets to further train on. Organizations can then apply that intelligence to produce higher quality products and services. Simply put, machine learning is increasing the value of companies’ digital assets by enabling them to extract actionable intelligence from their big data. Over the past few years, graph databases have aided this approach.
What is a Graph Database?
According to Neo4j, a graph database is:
A database designed to treat the relationships between data as equally important to the data itself. It is intended to hold data without constricting it to a pre-defined model. Instead, the data is stored like we first draw it out — showing how each individual entity connects with or is related to others.
A graph database consists of three components: nodes, edges, and properties. Nodes and edges carry labels that describe, in general terms, what each node or edge represents. Visually, nodes and edges can be represented as shown below:
Edges can be thought of as the relationship between nodes. Properties can be associated with nodes and edges. For example, if we have a node that represents a person and an edge describing a friendship, the graph could look like what’s shown below:
In this example, the nodes in a graph have a label called Person and have a name property. The edge (i.e., relationship) that connects the nodes has a label called isFriend and has a bestFriend property. You can query the database by selecting nodes or edges based on their labels and/or their properties.
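The Person and isFriend example can also be written out in code. What follows is a minimal in-memory sketch, not the API of any graph database: the Node and Edge classes, the find_nodes and find_edges helpers, and the names Alice, Bob, and Carol are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int  # node_id of the start node
    target: int  # node_id of the end node
    label: str
    properties: dict = field(default_factory=dict)


# Hypothetical data: three Person nodes and two isFriend edges.
nodes = [
    Node(1, "Person", {"name": "Alice"}),
    Node(2, "Person", {"name": "Bob"}),
    Node(3, "Person", {"name": "Carol"}),
]
edges = [
    Edge(1, 2, "isFriend", {"bestFriend": True}),
    Edge(1, 3, "isFriend", {"bestFriend": False}),
]


def find_nodes(label, **props):
    """Select nodes by label and, optionally, by exact property values."""
    return [
        n for n in nodes
        if n.label == label
        and all(n.properties.get(k) == v for k, v in props.items())
    ]


def find_edges(label, **props):
    """Select edges by label and, optionally, by exact property values."""
    return [
        e for e in edges
        if e.label == label
        and all(e.properties.get(k) == v for k, v in props.items())
    ]


# Query by edge label and property, then resolve the node names.
names_by_id = {n.node_id: n.properties["name"] for n in nodes}
best_friend_pairs = [
    (names_by_id[e.source], names_by_id[e.target])
    for e in find_edges("isFriend", bestFriend=True)
]
print(best_friend_pairs)
```

A real graph database would offer a query language for the same selection; the point here is only that labels and properties are what the query filters on.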
Key Strengths of Graph Databases
- Handles highly connected data proficiently – Graph databases store relationships as direct data. The relationship isn’t inferred and can be directly interacted with. This makes representing highly connected data easy to do and easy to work with.
- Executes complex queries quickly – Because the relationships between data are stored natively, complex queries are executed very quickly. For context, similar queries in an SQL database would involve complex join statements, which can be very CPU-intensive and time-consuming.
- Represents data intuitively – Graph databases represent data more intuitively and naturally, as opposed to the more traditional tabular representation used by SQL databases. The connections between the data domain and other domains can be expressed and represented directly.
- Explores data at greater depths – Because of the intuitive representation of data, exploring it is easier and more natural to do. Most data domains can be directly represented in a graph. This allows a wide range of people to participate in data exploration, not just DBAs and other data science professionals.
- Brings context to data – As graph databases capture relationships between data, new insights can be observed and further understood, and in turn, trigger innovative ideas for adding nodes to the graph — or even an entirely new graph.
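The contrast between joins and stored relationships in the list above can be sketched with a friends-of-friends lookup. This is an illustration in plain Python under made-up data, not a benchmark and not how any specific database is implemented: the join version compares every pair of rows, while the traversal version only follows the relationships attached to the nodes it visits.

```python
from collections import defaultdict

# Hypothetical directed friendships: (person, friend).
friendships = [
    ("Alice", "Bob"),
    ("Bob", "Carol"),
    ("Bob", "Dave"),
    ("Carol", "Erin"),
]


def fof_by_join(rows, person):
    """Tabular style: self-join the friendship rows on friend == person."""
    return {
        b2
        for (a1, b1) in rows
        for (a2, b2) in rows
        if a1 == person and b1 == a2 and b2 != person
    }


# Graph style: each node keeps direct references to its neighbours.
adjacency = defaultdict(set)
for a, b in friendships:
    adjacency[a].add(b)


def fof_by_traversal(graph, person):
    """Follow stored relationships two hops out from the starting node."""
    result = set()
    for friend in graph.get(person, ()):
        result |= graph.get(friend, set())
    result.discard(person)
    return result


print(fof_by_join(friendships, "Alice"))
print(fof_by_traversal(adjacency, "Alice"))
```

Both functions return the same people; they differ in how much data has to be examined to get there.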
Why Data Matters to Machine Learning
All machine learning relies on data. Generally speaking, the more data that you can provide your model, the better the model. Your ML model also needs high-quality data that is related to the problem you aim to solve, so in addition to volume, data quality matters as well. Finding relationships within your data and exposing them in your model’s training data can greatly improve its predictive accuracy.
Put candidly, high-quality data creates high-quality training features, producing a high-quality model that can more accurately generalize to unseen data. As a result, understanding and explaining your ML model’s output and behavior is much easier.
The State of Machine Learning & Big Data Shortcomings
As stated above, there is a tremendous amount of data being created every day. Data science professionals are using this data for training, and the advances they have made in machine learning are truly amazing. However, there is concern in the industry — this feeling that more can be done with the data that is currently captured. Below are some statements sourced from a ZDNet 2018 article that echo what many data professionals are saying today.
Google ponders the shortcomings of machine learning
Scientists of AI at Google’s Google Brain and DeepMind units acknowledge machine learning is falling short of human cognition, and they propose that using models of networks might be a way to find relations between things that allow computers to generalize more broadly about the world.
“The idea is that graph networks are bigger than any one machine-learning approach. Graphs bring an ability to generalize about structure that the individual neural nets don’t have.”
“A benefit of the graphs would also appear to be that they’re potentially more ‘sample efficient,’ meaning, they don’t require as much raw data as strict neural net approaches.”
“Where do the graphs come from that graph networks operate over?”
It is clear that relationships and graphs are being viewed as the next advancement in machine learning. When tabular data is moved into a graph and relationships are added, a context is created that provides a new frame through which to view the data.
Graph Databases Bring Context Back to Data
Relationships are the strongest predictors of behavior. This has been found in many studies and is articulated quite well in the following quote from Dr. James Fowler:
“Increasingly we’re learning that you can make better predictions about people by getting all the information from their friends and their friends’ friends than you can from the information you have about the person themselves.”
Extrapolating this more generally, you could say:
You can make better predictions utilizing relationships within the data than you can from just the data alone.
An excellent example of this is that the most powerful predictor of whether someone is going to take up smoking or not is whether or not they have friends that already smoke.
Adding relational context to your data will bring forth this predictive power (it was always there, just hidden below the surface). If your data is stored in tabular form, converting it into a graph database requires you to think about your data domain in new ways: you must define the nodes, the relationships between the nodes, and the properties associated with each node. There are nearly infinite ways to do this. In fact, the process can become an iterative loop. You convert your tabular data to nodes, relationships, and properties, and studying the resulting context reveals something novel about your data. That discovery can then lead to an entirely new conversion of the tabular data to a graph, or to a greatly expanded version of the existing graph.
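One pass of this conversion can be sketched in a few lines. The purchase table, the Customer and Product labels, and the PURCHASED relationship below are all hypothetical; they represent just one of the many possible ways to carve a table into nodes, relationships, and properties.

```python
# Hypothetical tabular data: one row per purchase.
rows = [
    {"customer": "Alice", "city": "Lyon", "product": "Laptop", "quantity": 1},
    {"customer": "Alice", "city": "Lyon", "product": "Mouse", "quantity": 2},
    {"customer": "Bob", "city": "Oslo", "product": "Mouse", "quantity": 1},
]


def table_to_graph(table):
    """Turn purchase rows into labelled nodes and PURCHASED relationships."""
    nodes = {}  # (label, key) -> node properties
    edges = []  # (source, relationship label, target, edge properties)
    for row in table:
        customer = ("Customer", row["customer"])
        product = ("Product", row["product"])
        # Columns describing an entity become node properties.
        nodes.setdefault(customer, {"name": row["customer"], "city": row["city"]})
        nodes.setdefault(product, {"name": row["product"]})
        # Columns describing the interaction become relationship properties.
        edges.append((customer, "PURCHASED", product, {"quantity": row["quantity"]}))
    return nodes, edges


nodes, edges = table_to_graph(rows)

# A connection that was spread across rows is now explicit:
# both customers are linked through the same Product node.
buyers_of_mouse = sorted(
    src[1] for src, _label, dst, _props in edges if dst == ("Product", "Mouse")
)
print(len(nodes), len(edges), buyers_of_mouse)
```

A second iteration might, for example, promote city from a property to a node of its own, which would link customers who live in the same place.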
Graph databases bring a certain context to data that allows for new machine learning training features to be created from the data you already have. This improves the value of the training data and as a result, produces a model that makes more accurate predictions.
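As a small illustration of a training feature derived from relationships, the sketch below computes, for each person, the share of their friends who smoke, echoing the smoking example above. The friendship data, the smoking flags, and the feature name smoking_friend_ratio are invented for the example.

```python
# Hypothetical friendship graph and per-person attribute.
friends = {
    "Ann": ["Ben", "Cat"],
    "Ben": ["Ann"],
    "Cat": ["Ann", "Dan"],
    "Dan": ["Cat"],
}
smokes = {"Ann": False, "Ben": True, "Cat": True, "Dan": False}


def smoking_friend_ratio(person):
    """Fraction of a person's friends who smoke (0.0 when they have no friends)."""
    neighbours = friends.get(person, [])
    if not neighbours:
        return 0.0
    return sum(smokes[f] for f in neighbours) / len(neighbours)


# Each training row now carries a feature that no single table row contained.
training_rows = [
    {
        "person": person,
        "smokes": smokes[person],
        "smoking_friend_ratio": smoking_friend_ratio(person),
    }
    for person in friends
]
for row in training_rows:
    print(row)
```

The smokes column alone says nothing about a person's surroundings; the added ratio is the relational context that a model can train on.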
Graph Databases in Machine Learning
Data plays a significant role in machine learning, and formatting it in ways that a machine learning algorithm can train on is imperative. Data pipelines were created to address this. A data pipeline is a process through which raw data is extracted from the database (or other data sources), is transformed, and is then loaded into a form that a machine learning algorithm can train and test on. A typical machine learning data pipeline looks like the following:
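The extract, transform, and load steps described above can be sketched as a chain of small functions. The records, field names, and the 75/25 split are assumptions chosen for the example; a real pipeline would read from a database or other data source and would usually rely on a dedicated framework.

```python
import random


def extract():
    """Pull raw records from a source (hard-coded here; normally a query)."""
    return [
        {"age": "34", "income": "52000", "churned": "no"},
        {"age": "51", "income": "88000", "churned": "yes"},
        {"age": "27", "income": "39000", "churned": "no"},
        {"age": "45", "income": "61000", "churned": "yes"},
    ]


def transform(raw_rows):
    """Cast strings to numbers and encode the label as 0 or 1."""
    return [
        (
            [float(r["age"]), float(r["income"]) / 1000.0],
            1 if r["churned"] == "yes" else 0,
        )
        for r in raw_rows
    ]


def load(examples, test_fraction=0.25, seed=0):
    """Shuffle reproducibly and split into training and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_set, test_set = load(transform(extract()))
print(len(train_set), len(test_set))
```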
How Graph Databases Fit Into Your Data Pipeline
Graph databases belong in the ‘Load Data’ part of the pipeline, and they aren’t going to change how you are currently training your model. In fact, if you are using Apache Spark in the Load Data stage, it already supports moving tabular data into non-persistent graphs.
Extracting Connected Features
Connected features are those features that are inherent in the topology of the graph. For example, how many edges (i.e., relationships) to other nodes does a specific node have? If many nodes are close together in the graph, a community of nodes may exist there. Some nodes will be part of that community while others may not. If a specific node has many outgoing relationships, that node’s influence on other nodes could be higher, given the right domain and context.
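Two of the connected features mentioned above, the number of outgoing relationships and community membership, can be computed with a short sketch. The edge list is made up, and connected components are used here as a deliberately crude stand-in for community detection; dedicated community algorithms are considerably more refined.

```python
from collections import defaultdict, deque

# Hypothetical directed edges between nodes.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]

out_degree = defaultdict(int)
neighbours = defaultdict(set)  # undirected view, used for grouping
for source, target in edges:
    out_degree[source] += 1
    neighbours[source].add(target)
    neighbours[target].add(source)


def connected_groups(graph):
    """Group nodes that can reach each other, using breadth-first search."""
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for other in graph[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        groups.append(group)
    return groups


groups = connected_groups(neighbours)
community_of = {node: idx for idx, group in enumerate(groups) for node in group}

# One feature row per node, ready to be joined onto other training features.
features = {
    node: {
        "out_degree": out_degree.get(node, 0),
        "community": community_of[node],
        "community_size": len(groups[community_of[node]]),
    }
    for node in neighbours
}
print(features)
```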
Like other features extracted from the data and used for training and testing, connected features can be extracted with a custom query based on an understanding of the problem space. However, because these patterns can be generalized across all graphs, unsupervised algorithms have been created that extract key information about the topology of your graph data, which can then be used as features for training your model.
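The article does not name a specific algorithm, so as an assumption the sketch below uses PageRank, one well-known algorithm that scores nodes purely from graph topology. It is a simplified power-iteration version with a conventional damping factor of 0.85 and a made-up edge list, intended only to show how a topology score can become a per-node feature.

```python
def pagerank(edge_list, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a directed edge list."""
    nodes = sorted({node for edge in edge_list for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edge_list:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical graph: A and B point to C, while C and D point to each other.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "C")]
scores = pagerank(edges)
most_influential = max(scores, key=scores.get)
print(most_influential, scores)
```

Each score can be attached to its node as one more training feature, alongside whatever features come from the node's own properties.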
Attaining the benefits of putting your data into a graph database, such as connected features, requires that your problem domain fit into a graph structure. Keep in mind, though, that not all problem domains will fit.
Where Graph Databases Show the Most Promise
With graph databases providing such intelligent insights and understanding of data, it’s no surprise that they are used across industries including finance and insurance services (e.g., for fraud detection), pharmaceuticals (e.g., for drug discovery), and cybersecurity (e.g., for network security).
As the amount of data produced daily continues to increase, the need for data science to process and make sense of that data, and use the information held within it, continues to grow as well. Graph databases are proving to be a key part of that process. And because graph databases hold both data and the relationships between that data, allowing for easy and intuitive querying, they will continue to help industries gain deeper insights into their ever-accumulating cache of data to better serve their customers.