Got a Minute? Check Out AnzoGraph's Record-Shattering Trillion Triple Benchmark Results
AnzoGraph completed the load and query of over one trillion triples in just under 2 hours.
Imagine how much information is contained in one trillion facts. That's roughly equal to…
6 months' worth of all Google searches worldwide
133 facts for each of the 7 billion people on earth
156 facts about each device connected to the internet
Recently, the team here at Cambridge Semantics tested AnzoGraph — our native, massively parallel processing, distributed graph OLAP database — by running a load and query of one trillion triples in accordance with the Lehigh University Benchmark (LUBM). Simply put, a triple is a fact expressed as a subject, predicate, and object.
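To make the idea of a triple concrete, it can be modeled as a three-field record. This is only an illustrative sketch: the `Triple` type, its method, and the example.org IRIs are assumptions made for this example, not part of AnzoGraph or LUBM.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: subject, predicate, object (names are illustrative)."""
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        # Serialize as one N-Triples-style line, writing every term as an IRI.
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


# Hypothetical fact: "student1 takes course42".
fact = Triple(
    "http://example.org/student1",
    "http://example.org/takesCourse",
    "http://example.org/course42",
)
line = fact.to_ntriples()
print(line)
```

A trillion-triple dataset is, conceptually, a trillion such records spread across the cluster.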
AnzoGraph completed the load and query of over one trillion triples in just under 2 hours — over 100 times faster than the previous LUBM benchmark of 220 hours set by Oracle.
A summary of the AnzoGraph LUBM benchmark test and results follows; the complete Trillion Triples Benchmarking white paper is also available online.
The Lehigh University Benchmark (LUBM) provides a standardized and systematic way of evaluating Semantic Web Knowledge Base systems. LUBM has been utilized by software and platform vendors to test the scalability and efficiency of graph databases.
LUBM test data are synthetically generated instance data over an ontology provided by LUBM. The data are random yet repeatable and can be scaled to an arbitrary size. LUBM also offers fourteen (14) test queries over the data, as well as a robust set of performance metrics.
The test data was generated in compressed Turtle format, a W3C standard format for data storage and interchange. The data was generated using the Parallel Data Generator.
Our AnzoGraph LUBM system under test consisted of a Google Cloud Platform cluster of 200 server instances of the n1-highmem-32 machine type. Each server provides 32 vCPUs, which correspond to 32 Intel hyperthreads on 16 hardware cores. The processors were Intel® Xeon® CPU E5-26XX @ 2.30GHz.
Each node was configured with a 100GB persistent SSD drive to hold the generated data, and had 208GB of memory available to the Ubuntu Linux 14.04 operating system and AnzoGraph.
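The cluster's aggregate resources follow directly from the per-node figures. The short calculation below uses only the numbers stated above; the variable names are chosen for illustration.

```python
# Per-node figures as stated in the article.
NODES = 200
VCPUS_PER_NODE = 32
CORES_PER_NODE = 16
MEMORY_GB_PER_NODE = 208
SSD_GB_PER_NODE = 100

# Cluster-wide totals.
total_vcpus = NODES * VCPUS_PER_NODE
total_cores = NODES * CORES_PER_NODE
total_memory_gb = NODES * MEMORY_GB_PER_NODE
total_ssd_gb = NODES * SSD_GB_PER_NODE

print(f"vCPUs: {total_vcpus}, hardware cores: {total_cores}")
print(f"memory: {total_memory_gb} GB, SSD: {total_ssd_gb} GB")
```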
Loaded triples. The data was loaded from the SSD drives’ compressed Turtle files in a parallel manner and automatically distributed across the cluster’s memory, indexed and pruned of duplicates, leaving just the “asserted triples”. The data was loaded as 22 logical graphs, corresponding to a data lake of 22 major sources.
The SPARQL statement:
load <dir:/place/on/machine.ttl.gz> into <dst>
was executed for each of the 22 logical graphs.
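As a rough sketch, the per-graph load step could be scripted by generating one statement per logical graph. The directory layout and graph IRIs below are hypothetical placeholders, and the statement text simply follows the template quoted above (assuming it begins with the `load` keyword); nothing here is taken from the actual benchmark scripts.

```python
GRAPH_COUNT = 22  # number of logical graphs, from the article


def load_statement(source_dir: str, graph_iri: str) -> str:
    # Fill in the quoted template: load <dir:...> into <dst>
    return f"load <dir:{source_dir}> into <{graph_iri}>"


# Hypothetical source directories and destination graph names.
statements = [
    load_statement(
        f"/data/lubm/source{i:02d}",
        f"http://example.org/graph/source{i:02d}",
    )
    for i in range(1, GRAPH_COUNT + 1)
]

for statement in statements[:2]:
    print(statement)
```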
587.3 billion triples were loaded in 00:29:24 (hh:mm:ss); an effective load of 332.8 million triples per second.
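The load rate can be reproduced from the triple count and elapsed time. With the rounded inputs given here the result comes out at about 332.9 million triples per second, which agrees with the quoted figure to within rounding. The helper name is an assumption for this example.

```python
def triples_per_second(triples: float, hours: int, minutes: int, seconds: int) -> float:
    """Average throughput over an elapsed time given as hh:mm:ss."""
    elapsed_seconds = hours * 3600 + minutes * 60 + seconds
    return triples / elapsed_seconds


# 587.3 billion triples loaded in 00:29:24 (1,764 seconds).
load_rate = triples_per_second(587.3e9, 0, 29, 24)
print(f"{load_rate / 1e6:.1f} million triples per second")
```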
Inferred triples were then generated for each of the 22 logical graphs using the GQE SPARQL-extension statement:
create inferences from <src> to <dst>
The inference (reasoning) rules and behavior are specified by the Web Ontology Language (OWL) triples present in the datasets.
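To give a feel for what inference produces, here is a minimal forward-chaining sketch of a single rule: if an instance has a type, and that type is a subclass of another class, the instance also has the broader type. It is a toy illustration under stated assumptions, not AnzoGraph's reasoner; OWL covers many more rules, and the class and instance names are hypothetical rather than taken from the LUBM ontology.

```python
from typing import Set, Tuple

TripleT = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(asserted: Set[TripleT]) -> Set[TripleT]:
    """Return the type triples implied by the subclass rule but not asserted."""
    known = set(asserted)
    changed = True
    while changed:  # repeat until no new triple is produced
        changed = False
        subclass_pairs = [(s, o) for s, p, o in known if p == SUBCLASS_OF]
        for s, p, o in list(known):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                new_triple = (s, RDF_TYPE, sup)
                if o == sub and new_triple not in known:
                    known.add(new_triple)
                    changed = True
    return known - asserted


# Hypothetical ontology and instance data.
asserted = {
    ("ex:GraduateStudent", SUBCLASS_OF, "ex:Student"),
    ("ex:Student", SUBCLASS_OF, "ex:Person"),
    ("ex:alice", RDF_TYPE, "ex:GraduateStudent"),
}
inferred = infer_types(asserted)
for triple in sorted(inferred):
    print(triple)
```

One asserted type here yields two inferred triples, which hints at how inference can add a triple count comparable to the asserted data.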
477.8 billion triples were created by inference in 1:16:14 (hh:mm:ss), an effective inference rate of roughly 104.5 million triples per second. Combined, the loaded (asserted) and inferred triples totaled over one trillion triples.
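Working from the stated triple counts and the inference time, the inference rate and the combined total can be checked directly. The variable names are illustrative.

```python
asserted_triples = 587.3e9   # loaded triples, from the article
inferred_triples = 477.8e9   # triples created by inference, from the article

# 1:16:14 expressed in seconds.
inference_seconds = 1 * 3600 + 16 * 60 + 14

inference_rate = inferred_triples / inference_seconds
total_triples = asserted_triples + inferred_triples

print(f"inference rate: {inference_rate / 1e6:.1f} million triples per second")
print(f"total triples: {total_triples / 1e12:.4f} trillion")
```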
Finally, all fourteen LUBM queries took just under 14 minutes to run, for a total elapsed load, inference, and query time of just under 2 hours: over 100x faster than the previous LUBM record.
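The end-to-end figure can be sanity-checked by summing the three phases. Since the query time is only given as "just under 14 minutes", the sketch below treats 14 minutes as an upper bound, so the total is an upper bound and the speedup a lower bound.

```python
from datetime import timedelta

load_time = timedelta(minutes=29, seconds=24)
inference_time = timedelta(hours=1, minutes=16, seconds=14)
query_time_upper_bound = timedelta(minutes=14)  # "just under 14 minutes"

total_upper_bound = load_time + inference_time + query_time_upper_bound

previous_record = timedelta(hours=220)  # previous LUBM result cited in the article
speedup_lower_bound = previous_record / total_upper_bound

print(f"total elapsed (upper bound): {total_upper_bound}")
print(f"speedup (lower bound): {speedup_lower_bound:.0f}x")
```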
This outcome not only establishes AnzoGraph as a hyperfast OLAP graph database but also validates that semantically described, graph-based OLAP can very effectively address mainstream informational challenges requiring new levels of data integration, exploration, discovery, and analytics, well beyond what is possible using relational database technologies.
As modern data diversity and volumes continue to grow, the cost and inflexibility of the traditional relational data warehouse approach are making it increasingly impractical, and in many cases unsustainable, for the scale of analysis and decision support requests that businesses now bring to IT.
In contrast, graph databases — notably AnzoGraph — enable multiple data sources to be combined and queried in very flexible ways, even though the data wasn't originally collected with any particular integration or question in mind.
AnzoGraph combines the best of labeled property graphs and RDF SPARQL semantic query languages into a single hybrid graph database, purpose-built for analytics, which makes it well suited for data analysts, enterprise architects, and application developers.
If you’re looking to build and execute heavy-duty data warehouse analytics, including aggregations, graph algorithms, and inferencing, at data scales beyond the reach of traditional relational database technologies, AnzoGraph will get you there…super fast.