A Quick Guide to Prevail in the Graph Database Arena
The criteria to meet a differentiation strategy for graph databases include multi-model engines, many-to-many relationships, and forethought regarding query languages.
There are endless discussions in the database arena about which DBMS is best suited for operational or data warehousing analytics, which is the most efficient for online transaction processing, and which is suitable for semantic integration. Graph databases have recently grown in popularity, especially in the enterprise space, which adds to the headaches of vendors trying to differentiate themselves from the competition and of clients who are uncertain how to embrace this database technology.
Definition of Graph Databases
Recently, Bloor published a report about Graph and RDF Databases. The author, Philip Howard, claims that “the difference between a true graph product and a triple store is that the former supports index free adjacency (which means you can traverse a graph without needing an index) and the latter doesn’t”. In contrast, Claudius Weinberger, CEO of ArangoDB, argues that this is not a fundamental criterion of graph databases. In a post titled “Index Free Adjacency or Hybrid Indexes for Graph Databases,” he proposes that the definition of a graph database remains:
A database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data independent of the way the data is stored internally
A graph database is a database that uses a graph topology, i.e., vertices and edges, to manage information at the conceptual level, independent of the logical and physical implementation of the graph data structure.
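The two positions quoted above can be made concrete with a minimal sketch. It contrasts a structure where each vertex holds direct references to its neighbours (index-free adjacency) with one where edges sit in a single collection and a separate index is consulted on every hop. The class names `Vertex` and `IndexedEdgeStore` are hypothetical and do not describe how any particular product stores its data.

```python
from collections import defaultdict


class Vertex:
    """Index-free adjacency: a vertex holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to neighbouring Vertex objects

    def link(self, other):
        self.out.append(other)


class IndexedEdgeStore:
    """Index-based alternative: edges live in one list, found through an index."""

    def __init__(self):
        self.edges = []                     # (source, target) pairs
        self.by_source = defaultdict(list)  # index: source name -> edge positions

    def link(self, source, target):
        self.by_source[source].append(len(self.edges))
        self.edges.append((source, target))

    def neighbours(self, source):
        # Every hop goes through the index before it reaches the edges.
        return [self.edges[i][1] for i in self.by_source.get(source, [])]


# Traversal by following references stored on the vertex itself.
alice, bob, carol = Vertex("alice"), Vertex("bob"), Vertex("carol")
alice.link(bob)
alice.link(carol)
adjacent = [v.name for v in alice.out]

# The same traversal answered through an index lookup.
store = IndexedEdgeStore()
store.link("alice", "bob")
store.link("alice", "carol")
indexed = store.neighbours("alice")
```

Both structures answer the same conceptual question, which is the point of the storage-independent definition: the graph model is visible to the user either way, and only the traversal cost profile differs.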
In another recently published paper by Bloor, “All about graphs: a primer,” the author discusses the Graph data model and highlights the representational differences of a many-to-many relationship, including those of bipartite, hypergraph, and associative graphs. He observes that
unlike other new database approaches, graphs cannot easily be subsumed by the leading relational database vendors because the architectural constraints of graphs do not fit easily within the relational paradigm.
He mentions that the two main variants on entity relationships are labeled property graphs and subject-predicate-object triples. In practice, although the idea of relationships (associations) between entities is at the heart of Peter Chen’s Entity-Relationship model (see the linked illustrations, Fig. 2 and Fig. 3), there are subtle dissimilarities in how it is implemented across graph databases. In a series of hands-on posts on associative data modeling, I attempted to cut through the information glut on this topic with a thorough examination of graph data models.
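As a rough illustration of how the two variants differ, the sketch below stores one small fact as a labeled property graph and then flattens it into subject-predicate-object triples. The identifiers (`p1`, `c1`, `OWNS`, `since`) and the helper `to_triples` are made up for this example, and the reification step is only one of several ways a triple store might attach properties to a relationship.

```python
# Labeled property graph: nodes and edges carry a label plus key/value properties.
nodes = {
    "p1": {"label": "Person", "props": {"name": "Alice"}},
    "c1": {"label": "Car", "props": {"model": "Roadster"}},
}
edges = [
    {"src": "p1", "dst": "c1", "label": "OWNS", "props": {"since": 2015}},
]


def to_triples(nodes, edges):
    """Flatten the property graph into (subject, predicate, object) triples."""
    triples = []
    for node_id, node in nodes.items():
        triples.append((node_id, "type", node["label"]))
        for key, value in node["props"].items():
            triples.append((node_id, key, value))
    for i, edge in enumerate(edges):
        triples.append((edge["src"], edge["label"], edge["dst"]))
        if edge["props"]:
            # A plain triple has no slot for edge properties, so the statement
            # is given its own identifier and described by extra triples.
            edge_id = f"e{i}"
            triples.append((edge_id, "subject", edge["src"]))
            triples.append((edge_id, "predicate", edge["label"]))
            triples.append((edge_id, "object", edge["dst"]))
            for key, value in edge["props"].items():
                triples.append((edge_id, key, value))
    return triples


triples = to_triples(nodes, edges)
```

The property on the relationship is where the two models diverge most visibly: the property graph keeps it on the edge, while the triple form needs additional statements about the statement.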
Multi-Model Database Engine
The graph engine and the type of data model are critical factors for any graph database, so it is no surprise that many vendors have started marketing their DBMS as multi-model. We have long experience with two such products, OrientDB and InterSystems Caché. The former supports Graph, Document, Key/Value, and Object models, whereas the latter is an object database with relational access, integrated support for JSON documents, and a multidimensional key-value storage mechanism that can be easily extended to cover a Graph data model. Generally speaking, we have reasons to believe that multi-model DBMSs will dominate the database market. Currently, OrientDB is a leading player among graph databases, and InterSystems Caché is one of the best operational DBMSs, according to the Magic Quadrant report.
Physical vs. Logical Perspective
A multi-model database is not only flexible in its logical schema; it also has a unified storage architecture. Although the developer should rarely need access to the physical implementation details of the storage engine, an API for direct use of the engine is desirable for many reasons. Most importantly, this kind of architecture allows someone to build a customized database management system. In theory, the ANSI/SPARC three-level architecture (external, conceptual/logical, and physical) is an effort to keep these three perspectives relatively independent of each other, but in practice, the front end of a DBMS is most often strongly dependent on the back-end storage data model.
A loose coupling can be achieved with associative/multidimensional arrays. Whatever their physical implementation (hash tables or trees), this abstract data type lets you model all four NoSQL database types: Key/Value, Tabular/Columnar, Document, and Graph. We are of the opinion that associative/multidimensional arrays will eventually prevail in the world of databases. There is already strong competition over their best physical implementation, and sparse, column-family stores have proven to be very popular (HBase, Hypertable, BigTable, InterSystems Caché).
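A minimal sketch of this idea, assuming Python dictionaries as the associative array: the same abstract data type, nested to different depths, is used for each of the four NoSQL shapes. The sample keys and the helper `get_path` are hypothetical and serve only to show that one subscript-walking access routine is enough for all four.

```python
# Key/Value: one subscript.
kv = {"user:1": "Alice"}

# Tabular/Columnar: row key -> column family -> column -> value (sparse).
columnar = {
    "row1": {"info": {"name": "Alice"}, "stats": {"visits": 3}},
    "row2": {"info": {"name": "Bob"}},  # no "stats" family: nothing is stored
}

# Document: arbitrarily nested fields under a document key.
document = {"user:1": {"name": "Alice", "car": {"model": "Roadster"}}}

# Graph: vertex -> edge label -> set of target vertices.
graph = {
    "alice": {"OWNS": {"roadster"}, "KNOWS": {"bob"}},
    "bob": {"KNOWS": {"alice"}},
}


def get_path(store, *keys, default=None):
    """Walk a nested associative array by successive subscripts."""
    current = store
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

For instance, `get_path(columnar, "row1", "stats", "visits")` and `get_path(graph, "alice", "KNOWS")` go through exactly the same code path, which is the loose coupling the paragraph above argues for: the logical model changes while the access mechanism stays put.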
Other properties are crucial for operational database management systems, such as ACID transactions, a distributed data architecture, and scalability. Whether we are talking about multi-model or single-model graph databases, there is a tendency to use them for online transaction processing, so these properties are worth having. And again, in terms of architectural design, there is always the problem of how to achieve a loose coupling between the physical structures of a database and the application logic.
This brings us to the question of what kind of logical/conceptual data model architecture to use. Our R3DM/S3DM framework is based on the powerful theory of the semiotic triangle. We use numerical vectors (signs) to encode the abstract things in our mind (signified) to which the sign refers, e.g. Person, name, Car, model. We associate these with data containers, the forms that the sign takes for the storage of data values (signifier), i.e. primitive data types (see also Signified and Signifier). This trilateral principle of our framework permits a uniform treatment of the semantics, syntax, and storage of information based on a symbolic representation. In this way, we define a fundamental, atomic information resource (AIR) unit. Those units, in turn, can be easily shaped to form any tabular, hierarchical, or graph data structure in a unified way. For example, study this R3DM hypergraph representation of QlikView's associative model. Data granularity can also be deeply connected to the definition of a fundamental unit of processing.
Based on this single primitive construct as a building block, (AIR), we have implemented seven type systems for an upper-level management of any DBMS. These are:
We characterize Datasets, Domain Models (schemas), Entities, Attributes, etc., as information resources; values are information realizations; and the AIR units that represent everything are called information representations, or simply references. Our current implementation phase has been completed on top of OrientDB, and a forthcoming article will present the R3DM/S3DM architecture in detail. In the past, the Freebase collaborative knowledge graph had a type system that was built on primitive constructs.
Yet another decisive criterion in databases is the query language. For the RDF directed, labeled graph data format, and for RDF stores such as OpenLink Virtuoso, AllegroGraph, and Ontotext GraphDB, the SPARQL query language is the standard way to retrieve data. In contrast, the query languages of property graph databases vary a lot. Some offer SQL-like languages, such as those of OrientDB and ArangoDB; Neo4j uses its own declarative graph query language, Cypher; and there is also the open-source Gremlin graph programming language.
We have developed a functional RESTful API that can serve as a prototype for a uniform, universal data language. Commands and their parameters can be simplified and made more efficient if we take into account the hierarchical relationship of Server, Database, Class, Property, and Record containers. There are five sets of commands for getting, updating, deleting, adding, and linking information. The current implementation is built with the Wolfram Language, and we will give more details in a forthcoming article where we analyze the R3DM/S3DM architecture.
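The article does not show the API itself, so the following is only a sketch of how five command sets might exploit a container hierarchy. It is written in Python rather than the Wolfram Language, and the class `Store`, its method signatures, and the sample data are all assumptions made for illustration, not the author's implementation. One `Store` instance stands in for the server, holding database -> class -> record -> property.

```python
class Store:
    """Hypothetical in-memory stand-in for a server and its nested containers."""

    def __init__(self):
        self.data = {}  # database -> class -> record id -> {property: value}

    def add(self, database, cls, record_id, **properties):
        records = self.data.setdefault(database, {}).setdefault(cls, {})
        if record_id in records:
            raise KeyError(f"record {record_id!r} already exists")
        records[record_id] = dict(properties)

    def get(self, database, cls=None, record_id=None, prop=None):
        # The deeper the path into the hierarchy, the narrower the result.
        node = self.data[database]
        for key in (cls, record_id, prop):
            if key is None:
                break
            node = node[key]
        return node

    def update(self, database, cls, record_id, **properties):
        self.data[database][cls][record_id].update(properties)

    def delete(self, database, cls, record_id):
        del self.data[database][cls][record_id]

    def link(self, database, source, label, target):
        # source and target are (class, record id) pairs in the same database.
        src_cls, src_id = source
        tgt_cls, tgt_id = target
        self.data[database][tgt_cls][tgt_id]  # raises KeyError if target is missing
        links = self.data[database][src_cls][src_id].setdefault(label, [])
        links.append(target)


store = Store()
store.add("demo", "Person", 1, name="Alice")
store.add("demo", "Car", 1, model="Roadster")
store.link("demo", ("Person", 1), "owns", ("Car", 1))
store.update("demo", "Person", 1, name="Alice B.")
```

Because every command takes the same leading path arguments, a single `get` covers a whole database, a class, a record, or one property, which is the kind of simplification the hierarchy is meant to offer.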
Last but not least, there is an emerging need for databases that can function as both analytic and operational systems. In particular, the modern data warehouse should unify all of a client’s transactional databases and integrate external data sources that enable data cleansing, validation, and enhancement. Moreover, for quick and smart business analytics, the interface should be both user-friendly and functionally powerful. We are aware of one player in this market segment whose technology has features similar to our R3DM/S3DM framework. This is why we devoted one of our articles to describing QlikView’s unique, award-winning, in-memory associative technology.
Make no mistake, relational databases are the past of computer database technology. Graph databases are the present and the future. This quick review on what we considered important criteria for graph database-related technology products might leave the reader in more perplexity than satisfaction. This is our perspective, and we wanted to share some of our knowledge with experts and chief technology people in this field so that we could discuss the matter in more
Published at DZone with permission of Athanassios I. Hatzis, PhD, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.