Modeling Data With Hypergraphs
Modeling Data With Hypergraphs
Curious as to how GRAKN.AI represents its knowledge graphs? Why, hypergraphs, of course! Learn about these types of graphs and how they're leveraged by GRAKN.AI.
Join the DZone community and get the full member experience.Join For Free
Hypergraphs generalize the common notion of graphs by relaxing the definition of edges. An edge in a graph is simply a pair of vertices. Instead, a hyperedge in a hypergraph is a set of vertices. Such sets of vertices can be further structured, following some additional restrictions involved in different possible definitions of hypergraphs.
The hypergraph data model (HDM) that we have developed and proposed as the formal foundation of GRAKN.AI is based on a specific notion of hypergraphs, the structure of which can be derived from three basic premises:
- A hypergraph consists of a non-empty set of vertices and a set of hyperedges.
- A hyperedge is a finite set of vertices (distinguishable by specific roles they play in that hyperedge).
- A hyperedge is also a vertex itself and can be connected by other hyperedges.
For instance, the figure below depicts one such hypergraph consisting of four vertices (bob, alice, m-1, and df-1), two of which happen to be also hyperedges:
- m-1, describing a binary marriage relationship between bob and alice playing the roles of husband and wife, respectively.
- df-1, describing a ternary divorce_filing relationship involving three roleplayers in the roles of certified marriage, petitioner, and respondent.
In other words, a hyperedge can be simply seen as a collection of roleplayer pairs of arbitrary cardinality. The roleplayers in such hyperedges can be simple vertices, such as bob or alice, but also other hyperedges, as is the case with m-1 playing the role of certified marriage in df-1. In terms of data modeling, hypergraphs of this structure offer a very natural and straightforward data representation formalism, closely aligned with popular conceptual modeling frameworks such as entity-relationship diagrams. The basic modeling proviso in HDM is this: entities are simple vertices, while relationships of any arity are modeled uniformly as hyperedges.
While several alternative generalizations of hypergraphs are known and have been studied, the definition adopted here has been recognized and received particular attention in the context of knowledge representation, general AI, and graph- and object-oriented databases. The appeal of this formulation stems predominantly from the way it naturally generalizes over a number of information modeling structures ubiquitous in data engineering and AI, including first-order relations, graphs, nesting of information patterns, meta-modelling, and others. This characteristic turns out to be invaluable also in the context of representing and reasoning over knowledge graphs.
Hypergraphs vs. Relations vs. Directed Graphs
What, then, is the core value of using hypergraphs over other prominent data representation structures such as the relational model (underlying SQL databases) or variants of the directed graph model (underlying graph databases and RDF triple stores)?
Interestingly, HDM fits naturally within this landscape. While being essentially equally expressive as those, it endorses a somewhat differently balanced data modeling approach and thus offers a largely missing link between these existing alternatives. In the following paragraphs, we explain these key differentiating aspects and the benefits of modeling data using HDM.
Let's consider a typical representation of the same data set as used in the previous example in SQL tables (or, formally speaking, relations).
The correspondence to the hypergraph above is evident. Simple vertices in the hypergraph (bob, alice) correspond to the records of the “entity” tables (Man, Woman), while the hyperedges (m-1, df-1) directly reflect the contents of each record from the “relationship” tables (Marriage, Divorce_filing).
Thus, the structuring of information characteristic and critical to the SQL-style modeling where relevant pieces of data are grouped together in records is faithfully preserved in hypergraphs. In general, the hypergraph data model can seamlessly accommodate any relational information.
Yet, the connected nature of this information (i.e. the fact that the presence of foreign keys in the records indicates the existence of actual relationships between entities) is made much more prominent in the hypergraph representation. This is a critical difference not only on the conceptual level when just exploring the data but also in a deeper, technical sense when expressing and executing complex traversal queries, which are well-known bottlenecks of SQL databases.
Furthermore, the strictness of the SQL schema makes it very hard to cater for irregular or incomplete data instances (i.e. an odd instance of polygamous marriage showing up in the data set) or an instance with missing information about one of the spouses. Similarly, it is cumbersome to revise the structuring of data once the SQL database has been defined and populated. As has been often argued by graph data advocates, this shortcoming is naturally alleviated in much more flexible graph-based models, in which aspect hypergraphs are no different. All three hyperedges (husband:bob, wife:alice), (husband:mark), and (husband:jacob, wife:gloria, wife:gertrude) can naturally coexist as instances of the marriage relation in the same hypergraph.
Finally, the native graph-oriented structure of hypergraphs also makes them an obvious ground for applying advanced graph computing techniques such as shortest path finding or network analysis, which we will explore in depth in future posts.
Is, then, a hypergraph anything more than just a graph? Arguably, the hypergraph depicted in the beginning of this post could be seen as a plainly labeled directed graph. That’s actually a very desirable characteristic. Every hypergraph can simply be mapped to the corresponding directed graph.
For instance, in the RDF model, we could represent our marriage example as the following RDF graph, where entity and relation types are also explicitly encoded as RDF resources, in the typical RDF style.
Virtually the same mapping could be applied to achieve a direct reduction of hypergraphs to the property graph model. Because of this close relationship to directed graphs, HDM can be naturally implemented over any graph-based data storage such as increasingly popular graph databases or RDF triple stores. In fact, GRAKN.AI is mounted on top of TinkerPop, an open-source interoperability layer that exposes a uniform graph data model (property graph) over any TinkerPop-compliant data management system.
This allows us to greatly reduce the cost of developing a robust and mature system and grants a good level of vendor-agnosticism in terms of the choice of the underlying storage platform.
The central differentiator between hypergraphs and directed graphs, however, is the introduction of hyperedges as first-class modeling constructs in HDM. Firstly, hyperedges have a significant impact on the data modeling practice and possibilities when compared to directed graph models. In principle, it is possible one could arrive immediately at the RDF graph depicted above when modeling our example dataset, thus achieving the same clean and uniform representation without explicitly employing HDM.
In a real-life scenario, however, when the complete conceptual model is not fully foreseen at the outset of the process and the data is injected to the graph gradually over time, the actual outcome may be considerably different. In fact, RDF practitioners should not be surprised to see, as the result of this exercise, the RDF graph below instead.
To start with, the binary marriage relation would typically be represented using a directed edge involving the married_to predicate, following the standard “good practice” recommended by popular RDF tutorials. Later on, however, once the married couple files for divorce and the graph must account for this new fact, the RDF data modeler faces two problems:
- Modeling ternary divorce_filing relationship has to follow a different route than that employed in case of binary relations. Instead of asserting a link between two entities, one needs to use a dedicated ontology pattern for n-ary relations, which would typically mimic the hyperedge structure — one resource representing the relation object connected with a set of outgoing links towards respective role-players.
- There’s no simple way to connect the original binary marriage relationship with other domain elements. In this respect, RDF offers a notorious triple reification mechanism, which allows for introducing a dedicated resource (of type rdf:Statement) representing the entire triple and linked to all its constitutive components via predicates rdf:subject, rdf:predicate, and rdf:object. RDF reification has attracted a lot of criticism in the past, essentially by being a crude tool immediately damaging the transparency of the model and its usability on the query level.
The second benefit that hyperedges can bring to the data management table, as compared to the binary directed edges, is the potential improvements to query planning and query optimization mechanisms. Arguably, the data grouped together in the same structural “containers,” precisely such as hyperedges or SQL records, is also often retrieved in similar groupings by users and applications. By acknowledging the structure of these collections in advance of querying, the information retrieval process can be more optimally planned and executed.
The hypergraph data model underpinning the knowledge representation system implemented in GRAKN.AI presents a novel alternative in the data modeling space, providing a viable middle-ground between the relational and directed graph-based models — the best of both worlds. The key points supporting this view, which we have touched upon in this post, can be summarized as follows.
Benefits Over the Relational Model
- Native graph-oriented structure:
Relationships (connections) in data are first-class citizens.
Graph-oriented data modeling frameworks (i.e. ER diagrams) can be more easily applied and linked to the actual data model.
Graph-oriented computation techniques can be efficiently applied (i.e. path queries, network analysis).
- The flexibility of data modeling is not impeded by restrictive relational schemas (i.e. multiple values of the same role can be involved in the same hyperedge without an a priori modeling decision).
Benefits Over the Directed Graph Model
- The natural mechanism of grouping relevant pieces of information in a relational style, which is to a large extent lost in directed graphs.
- Uniform handling of all n-ary relationships, as opposed to directed graphs where n-ary relations for n>2 require a radical change in the modeling approach (so-called n-ary relation patterns) compared to the case of n=2.
- A natural way of expressing higher-order information (relations between relations, nesting of information, etc.), which in directed graphs requires dedicated modeling techniques (so-called reification).
Published at DZone with permission of Szymon Klarman , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.