KGCNs: Machine Learning Over Knowledge Graphs With TensorFlow
Learn about machine learning over knowledge graphs with TensorFlow.
This project introduces a novel model: the Knowledge Graph Convolutional Network (KGCN), which is free to use from the GitHub repo under Apache licensing. It's written in Python and available to install via pip from PyPI.
A KGCN can be used to create vector representations (embeddings) of any labeled set of Grakn Things via supervised learning.
- A KGCN can be trained directly for the classification or regression of Things stored in Grakn.
- Future work will include building embeddings via unsupervised learning.
What Is It Used For?
Often, data doesn't fit well into a tabular format. There are many benefits to storing complex and interrelated data in a knowledge graph, not least that the context of each datapoint can be stored in full.
However, many existing machine learning techniques rely upon the existence of an input vector for each example. Creating such a vector to represent a node in a knowledge graph is non-trivial.
In order to make use of the wealth of existing ideas, tools, and pipelines in machine learning, we need a method of building these vectors. In this way, we can leverage contextual information from a knowledge graph for machine learning.
This is what a KGCN can achieve. Given an example node in a knowledge graph, it can examine the nodes in the vicinity of that example: its context. Based on this context, it can determine a vector representation, an embedding, for that example.
There are two broad learning tasks a KGCN is suitable for:
- Supervised learning from a knowledge graph for prediction, e.g. multi-class classification (implemented), regression, link prediction
- Unsupervised creation of Knowledge Graph Embeddings, e.g. for clustering and node comparison tasks
In order to build a useful representation, a KGCN needs to perform some learning. To do that, it needs a function to optimize. Revisiting the broad tasks we can perform, we have different cases to configure the learning:
- In the supervised case, we can optimize for the exact task we want to perform. In this case, embeddings are interim tensors in a learning pipeline.
- To build unsupervised embeddings as the output, we optimize to minimize some similarity metrics across the graph.
The ideology behind this project is described here, and in a video of the presentation. The principles of the implementation are based on GraphSAGE, from the Stanford SNAP group, heavily adapted to work over a knowledge graph. Instead of working on a typical property graph, a KGCN learns from contextual data stored in a typed hypergraph, Grakn. Additionally, it learns from facts deduced by Grakn's automated logical reasoner. From this point on, some understanding of Grakn's docs is assumed.
Now we introduce the key components and how they interact.
A KGCN is responsible for deriving embeddings for a set of Things (and thereby directly learning to classify them). We start by querying Grakn to find a set of labeled examples. Following that, we gather data about the context of each example Thing. We do this by considering their neighbors and their neighbors' neighbors, recursively, up to K hops away.
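The recursive context gathering described above can be sketched roughly as follows. This is a minimal illustration over a toy in-memory adjacency dictionary, not kglib's implementation: `TOY_GRAPH`, `gather_context`, `depth`, and the sample sizes are all hypothetical names and values, and the real system queries Grakn for each neighborhood rather than reading a dictionary.

```python
import random

# Hypothetical toy graph: each Thing id maps to the ids of its neighbours.
TOY_GRAPH = {
    "person1": ["employment1", "marriage1"],
    "employment1": ["person1", "company1"],
    "marriage1": ["person1", "person2"],
    "company1": ["employment1"],
    "person2": ["marriage1"],
}


def gather_context(graph, thing, sample_sizes, rng):
    """Recursively sample neighbours of `thing`, one sample size per hop.

    Returns a nested dict: {"thing": id, "neighbours": [sub-contexts]}.
    """
    if not sample_sizes:
        return {"thing": thing, "neighbours": []}
    neighbours = graph.get(thing, [])
    k = min(sample_sizes[0], len(neighbours))
    sampled = rng.sample(neighbours, k)
    return {
        "thing": thing,
        "neighbours": [
            gather_context(graph, n, sample_sizes[1:], rng) for n in sampled
        ],
    }


def depth(context):
    """Number of hops actually reached below this context node."""
    if not context["neighbours"]:
        return 0
    return 1 + max(depth(c) for c in context["neighbours"])


rng = random.Random(0)
# K = 2 hops: sample up to 2 neighbours at hop 1, then up to 1 each at hop 2.
context = gather_context(TOY_GRAPH, "person1", sample_sizes=[2, 1], rng=rng)
print(context["thing"], depth(context))
```

Sampling a fixed number of neighbours per hop keeps the amount of data bounded as K grows. Note that this toy version may walk back to the Thing it came from; whether and how to exclude that is a design choice this sketch leaves open.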
We retrieve the data concerning this neighborhood from Grakn (diagram above). This information includes the type hierarchy, roles, and attribute values of each neighboring Thing encountered and any inferred neighbors (represented above by dotted lines). This data is compiled into arrays to be ingested by a neural network.
Via operations Aggregate and Combine, a single vector representation is built for a Thing. This process can be chained recursively over K hops of neighboring Things. This builds a representation for a Thing of interest that contains information extracted from a wide context.
In supervised learning, these embeddings are directly optimized to perform the task at hand. For multi-class classification, this is achieved by passing the embeddings to a single subsequent dense layer and determining loss via softmax cross entropy (against the example Things' labels); then, optimizing to minimize that loss.
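To make the classification step concrete, here is a small pure-Python sketch of a dense layer followed by softmax cross entropy. It only illustrates the arithmetic; the actual project builds this in TensorFlow, and the embedding, weights, and function names below are hypothetical values chosen for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, biases):
    """Fully connected layer without activation: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def softmax_cross_entropy(logits, label_index):
    """Cross-entropy loss of softmax(logits) against one integer class label."""
    return -math.log(softmax(logits)[label_index])


# Hypothetical 2-dimensional embedding and a 3-class dense layer.
embedding = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

logits = dense(embedding, weights, biases)
loss = softmax_cross_entropy(logits, label_index=0)
print(logits, loss)
```

The loss is smallest when the largest logit belongs to the labeled class, so minimizing it pushes the embeddings and the dense layer toward correct predictions.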
A KGCN object brings together a number of sub-components: a Context Builder, Neighbour Finder, Encoder, and an Embedder.
The input pipeline components are less interesting, so we'll skip to the fun stuff. You can read about the rest in the KGCN readme.
To create embeddings, we build a network in TensorFlow that successively aggregates and combines features from the K hops until a "summary" representation remains — an embedding (diagram below).
To create the pipeline, the Embedder chains Aggregate and Combine operations for the K hops of neighbors considered. For example, in the 2-hop case, this means Aggregate-Combine-Aggregate-Combine.
The diagram above shows how this chaining works in the case of supervised classification.
The Embedder is responsible for chaining the sub-components Aggregator and Combiner, explained below.
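The chaining can be sketched as a recursion that collapses the deepest hop first. This is an illustrative toy rather than the project's TensorFlow code: `elementwise_max` and `average` are parameter-free stand-ins for the learned Aggregate and Combine operations, and the nested `context` dictionary is a hypothetical data layout (the real pipeline works on arrays).

```python
def elementwise_max(vectors):
    """Stand-in Aggregate: order-agnostic max over equal-length vectors."""
    return [max(values) for values in zip(*vectors)]


def average(a, b):
    """Stand-in Combine: merge a Thing's vector with its aggregated neighbours."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def embed(context, aggregate, combine):
    """Collapse a nested context into one vector, deepest hop first."""
    if not context["neighbours"]:
        return context["features"]
    neighbour_vectors = [
        embed(c, aggregate, combine) for c in context["neighbours"]
    ]
    aggregated = aggregate(neighbour_vectors)
    return combine(context["features"], aggregated)


# Hypothetical 2-hop context: an example Thing, two neighbours, and their neighbours.
context = {
    "features": [0.0, 0.0],
    "neighbours": [
        {"features": [1.0, 0.0], "neighbours": [
            {"features": [3.0, 1.0], "neighbours": []},
            {"features": [1.0, 5.0], "neighbours": []},
        ]},
        {"features": [0.0, 2.0], "neighbours": [
            {"features": [2.0, 2.0], "neighbours": []},
        ]},
    ],
}

embedding = embed(context, elementwise_max, average)
print(embedding)  # [1.0, 1.25]
```

Hop-2 vectors are first aggregated and combined into each hop-1 neighbour, and those results are then aggregated and combined into the example Thing, which gives the Aggregate-Combine-Aggregate-Combine order for two hops.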
An Aggregator (pictured below) takes in a vector representation of a sub-sample of a Thing's neighbors. It produces one vector that is representative of all of those inputs. It must do this in a way that is order agnostic since the neighbors are unordered. To achieve this, we use one densely connected layer and maxpool the outputs (maxpool is order-agnostic).
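A minimal pure-Python sketch of this idea follows: one shared dense layer applied to every neighbour, then a max over each output dimension. The ReLU activation, the weights, and the sizes are assumptions made for the example and are not taken from the project's code.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def dense(vector, weights, biases):
    """Fully connected layer with ReLU: one output per weight row."""
    return [
        relu(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, biases)
    ]


def aggregate(neighbour_vectors, weights, biases):
    """Apply the same dense layer to every neighbour, then max-pool per dimension."""
    transformed = [dense(v, weights, biases) for v in neighbour_vectors]
    return [max(column) for column in zip(*transformed)]


# Hypothetical sizes: 3 neighbours with 2 features each, aggregated to 3 dimensions.
neighbours = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
biases = [0.0, 0.0, 0.5]

aggregated = aggregate(neighbours, weights, biases)

# Shuffling the neighbours leaves the result unchanged.
shuffled = neighbours[:]
random.Random(1).shuffle(shuffled)
assert aggregate(shuffled, weights, biases) == aggregated

print(aggregated)  # [1.0, 2.0, 2.5]
```

Because the maximum of a set of numbers does not depend on the order in which they are visited, the output is the same for any ordering of the neighbours.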
Once we have Aggregated the neighbors of a Thing into a single vector representation, we need to combine this with the vector representation of that Thing itself. A Combiner achieves this by concatenating the two vectors and reducing the dimensionality using a single densely connected layer.
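The Combiner step can be illustrated in a few lines of pure Python. As before, this is a sketch under assumptions: the vectors, the weight matrix, and the omission of an activation function are choices made for the example, not details from the project.

```python
def dense(vector, weights, biases):
    """Fully connected layer without activation: one output per weight row."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def combine(thing_vector, aggregated_vector, weights, biases):
    """Concatenate the two vectors, then reduce dimensionality with a dense layer."""
    concatenated = thing_vector + aggregated_vector
    return dense(concatenated, weights, biases)


thing_vector = [1.0, 0.0]      # 2 features of the Thing itself
aggregated = [1.0, 2.0, 2.5]   # 3 aggregated neighbour features

# 2 x 5 weight matrix: maps the 5 concatenated values to 2 output dimensions.
weights = [
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
]
biases = [0.0, 0.0]

combined = combine(thing_vector, aggregated, weights, biases)
print(combined)  # [2.0, 4.5]
```

The dense layer is what brings the concatenated vector back down to a fixed size, so the output of one Combine step can feed the next Aggregate step.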
Supervised KGCN Classifier
A Supervised KGCN Classifier is responsible for orchestrating the actual learning. It takes in a KGCN instance, and as for any learner making use of a KGCN, it provides:
- Methods for train/evaluation/prediction
- A pipeline from embedding tensors to predictions
- A loss function that takes in predictions and labels
- An optimizer
- The backpropagation training loop
It must be the class that provides these behaviors since a KGCN is not coupled to any particular learning task. This class, therefore, provides all of the specializations required for a supervised learning framework.
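The responsibilities listed above can be sketched with a toy learner. This is not the SupervisedKGCNClassifier: `ToySupervisedClassifier` is a hypothetical class that treats embeddings as fixed inputs and trains only a single dense layer by hand-written gradient descent, whereas the real classifier also trains the KGCN that produces the embeddings and relies on TensorFlow for backpropagation.

```python
import math


class ToySupervisedClassifier:
    """Hypothetical stand-in: a dense layer trained with softmax cross entropy."""

    def __init__(self, embedding_size, num_classes, learning_rate=0.5):
        self.weights = [[0.0] * embedding_size for _ in range(num_classes)]
        self.biases = [0.0] * num_classes
        self.learning_rate = learning_rate

    def _probabilities(self, embedding):
        """Pipeline from an embedding to class probabilities."""
        logits = [
            sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(self.weights, self.biases)
        ]
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def loss(self, embeddings, labels):
        """Mean softmax cross-entropy loss over the examples."""
        return sum(
            -math.log(self._probabilities(e)[y])
            for e, y in zip(embeddings, labels)
        ) / len(labels)

    def train(self, embeddings, labels, steps=100):
        """Plain per-example gradient descent on the dense layer."""
        for _ in range(steps):
            for embedding, label in zip(embeddings, labels):
                probs = self._probabilities(embedding)
                for c, p in enumerate(probs):
                    # d(loss)/d(logit_c) is p_c - 1 for the true class, else p_c.
                    grad = p - (1.0 if c == label else 0.0)
                    self.biases[c] -= self.learning_rate * grad
                    for j, x in enumerate(embedding):
                        self.weights[c][j] -= self.learning_rate * grad * x

    def predict(self, embedding):
        """Index of the most probable class."""
        probs = self._probabilities(embedding)
        return probs.index(max(probs))


# Hypothetical, linearly separable embeddings for two classes.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

classifier = ToySupervisedClassifier(embedding_size=2, num_classes=2)
loss_before = classifier.loss(embeddings, labels)
classifier.train(embeddings, labels)
loss_after = classifier.loss(embeddings, labels)
predictions = [classifier.predict(e) for e in embeddings]
print(loss_before, loss_after, predictions)
```

Keeping the loss, optimizer, and training loop in a separate class is what lets the embedding model stay independent of any one learning task.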
Below is a slightly simplified UML activity diagram of the program flow.
Build With KGCNs
import kglib.kgcn.core.model as model
import kglib.kgcn.learn.classify as classify
import tensorflow as tf
import grakn

# Names such as training_keyspace, neighbour_sample_sizes, training_things, and
# training_labels are placeholders to define for your own knowledge graph.
URI = "localhost:48555"

# Connect to Grakn and open a transaction on the training keyspace
client = grakn.Grakn(uri=URI)
session = client.session(keyspace=training_keyspace)
transaction = session.transaction(grakn.TxType.WRITE)

# Build the KGCN that produces embeddings
kgcn = model.KGCN(neighbour_sample_sizes,
                  features_size,
                  example_things_features_size,
                  aggregated_size,
                  embedding_size,
                  transaction,
                  batch_size)

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)

# Wrap the KGCN in a supervised classifier
classifier = classify.SupervisedKGCNClassifier(kgcn,
                                               optimizer,
                                               num_classes,
                                               log_dir,
                                               max_training_steps=max_training_steps)

# Gather the training data and train
training_feed_dict = classifier.get_feed_dict(session,
                                              training_things,
                                              labels=training_labels)

classifier.train(training_feed_dict)

transaction.close()
session.close()
This will get you on your way to building a multi-class classifier for your own knowledge graph! There's also an example in the repository with real data that should fill in any gaps the template usage misses.
If you like what we're up to and you use/are interested in KGCNs, there are several things you can do:
- Submit an issue for any problems you encounter installing or using KGCNs
- Star the repo if you're inclined to help us raise the profile of this work :)
- Ask questions, propose ideas, or have a conversation with us on the Grakn Community Slack channel
This post has been written for the third kglib pre-release, using Grakn commit 20750ca0a46b4bc252ad81edccdfd8d8b7c46caa, and may subsequently fall out of line with the repo. Check there for the latest!
Published at DZone with permission of James Fletcher, DZone MVB. See the original article here.