
Using GRAKN.AI to Reason Over an R Dataset


Get familiar with GRAKN.AI to see how it can be used to help reason over a simple dataset, showing just how easy and powerful such a tool is.


In this article, I will introduce an open-source knowledge graph platform called GRAKN.AI. I’m going to use it to load a simple dataset and show how to calculate basic statistics such as maximum and mean values. A good question at this point would be: As a data scientist, surely there are easier ways for me to make such simple calculations? The answer: Yes, there are! But I’ve chosen this familiar example to introduce the knowledge graph paradigm, the strength of which comes into play for large amounts of highly interconnected data. To keep the example simple, I’ve removed accidental complexity by using a familiar dataset in a new way.

This article will be useful for data scientists interested in using a new approach to modeling complex and/or big datasets. You don’t need any experience with GRAKN.AI to understand it because I’ll explain the key concepts as I go along. So let’s get started!

What Is GRAKN.AI?

GRAKN.AI is an open-source distributed knowledge base with a reasoning query language called Graql (not to be confused with GraphQL) that enables you to query for explicitly stored data and implicitly derived information. It is built using graph computing (Apache TinkerPop), which allows you to traverse links to discover how remote parts of a domain relate to each other. Various graph-computing techniques and algorithms can be applied, such as shortest path computations or network analysis, which add additional intelligence over the stored data.

Some of the potential applications include semantic search, automated fraud detection, intelligent chatbots, advanced drug discovery, dynamic risk analysis, content-based recommendation engines, and knowledge management systems.

The Data

This example uses a dataset that will be familiar to students of R: the mtcars (Motor Trend Car Road Tests) data. The data was extracted from the 1974 Motor Trend U.S. magazine and comprises fuel consumption and 10 other aspects of automobile design and performance for 32 automobiles (1973-74 models). I took a CSV file of the mtcars data and added two new columns to indicate the car maker’s name and the region in which the car was made (Europe, Japan, or North America). This file, and everything else for this example, can be found on GitHub and is also included in the examples folder of the GRAKN.AI distribution.

When working with GRAKN.AI, a key step is the definition of an ontology, which allows you to model the data. We’ve published a number of articles about the GRAKN.AI ontology, including a recent blog post, but to keep it simple, I’d suggest you think of it as something like a class definition in C++ or Java. An ontology specifies the relevant concepts and their meaningful associations. The ontology has four types of concepts to model the domain: entity, relation, role, and resource.

  • entity: Objects or things in the domain — for example, car, carmaker.

  • relation: Relationships between different domain instances — for example, manufactured, which is typically a relationship between two instances of entity types (car and carmaker), playing roles of made and maker, respectively.

  • role: Roles involved in specific relationships — for example, made, maker.

  • resource: Attributes associated with domain instances — for example, model. Resources consist of primitive types and values.

With GRAKN.AI’s declarative query language, Graql, I have represented the mtcars dataset using the following ontology, stored in the ontology.gql file, although many other variations are possible:

insert

# Entities

vehicle sub entity
 is-abstract;

car sub vehicle
 is-abstract
 has model
 has mpg
 has cyl
 has disp
 has hp
 has wt
 has gear
 has carb
 plays made;

automatic-car sub car;
manual-car sub car;

carmaker sub entity
 is-abstract
 has maker-name
 plays maker;

japanese-maker sub carmaker;
american-maker sub carmaker;
european-maker sub carmaker;

# Resources

model sub resource datatype string;
maker-name sub resource datatype string;
mpg sub resource datatype double;
cyl sub resource datatype long;
disp sub resource datatype double;
hp sub resource datatype long;
wt sub resource datatype double;
gear sub resource datatype long;
carb sub resource datatype long;

# Roles and Relations

manufactured sub relation
 relates maker
 relates made;

maker sub role;
made sub role;

There are two main entities, vehicle and carmaker, and I’ve used inheritance (that’s the sub keyword) to set up a hierarchy. A manual-car (or automatic-car) is a subtype of car, which is a subtype of vehicle. Likewise, japanese-maker, american-maker, and european-maker are all subtypes of carmaker. The entities have some resources, such as numerical values to represent fuel consumption, horsepower, etc. for the cars, and a string value to represent the name of each carmaker. There is a single relation (manufactured) within the data, between the car and carmaker entities, where the car plays the made role and the carmaker plays the maker role.
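
For readers who find the class-definition analogy helpful, the hierarchy described above can be sketched as ordinary Python classes. This is only an analogy and not Grakn code: the class names, the Manufactured helper, and the maker name "Mazda" are illustrative assumptions, and Python does nothing here to enforce the is-abstract constraint.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    """Corresponds to `vehicle sub entity` (abstract in the ontology)."""


@dataclass
class Car(Vehicle):
    """Corresponds to `car sub vehicle`; the fields mirror its resources."""
    model: str
    mpg: float
    cyl: int
    disp: float
    hp: int
    wt: float
    gear: int
    carb: int


class ManualCar(Car):
    pass


class AutomaticCar(Car):
    pass


@dataclass
class Carmaker:
    """Corresponds to `carmaker sub entity`."""
    maker_name: str


class JapaneseMaker(Carmaker):
    pass


class AmericanMaker(Carmaker):
    pass


class EuropeanMaker(Carmaker):
    pass


@dataclass
class Manufactured:
    """Corresponds to the `manufactured` relation and its two roles."""
    maker: Carmaker
    made: Car


# One car from mtcars and an assumed maker name, linked by the relation.
mazda = JapaneseMaker("Mazda")
rx4 = ManualCar("Mazda RX4", 21.0, 6, 160.0, 110, 2.62, 4, 4)
link = Manufactured(maker=mazda, made=rx4)

print(isinstance(rx4, Vehicle))  # True: a manual-car is also a car and a vehicle
```

The point of the analogy is that a query for a supertype (for example, every car) also matches instances of its subtypes, just as isinstance does here.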

The first step in running the example is to load this ontology into a graph. Having installed GRAKN.AI, you start the engine and load ontology.gql by typing the following into a terminal window:

<relative-path-to-Grakn>/bin/grakn.sh start
<relative-path-to-Grakn>/bin/graql.sh -f ./ontology.gql

The next step is to load the mtcars data into the graph. I have munged it into a single data file (data.gql) for easy loading. However, GRAKN.AI does allow you to import CSV (as well as TSV, SQL, JSON, and OWL data), so it is perfectly possible to pull it in directly from the CSV file. The readme file in the GitHub repository gives further information.

To load the mtcars data:

<relative-path-to-Grakn>/bin/graql.sh -b ./data.gql

Now you can take a look at the dataset in the Grakn visualizer by pointing your browser to http://localhost:4567/. You can submit queries to check the data or explore it using the Types dropdown menu.

Color key for the graph shown in the visualizer: blue for European manufacturers, red for Japanese manufacturers, purple for American manufacturers, green for automatic cars, and yellow for manual cars.

Here are some sample queries to try. Lines starting with # are comments, so there is no need to type them:

# Cars where the model name contains "Merc" (7 cars)
match $x has model contains "Merc";

# Cars with more than 4 gears (should be 5 cars)
match $x has gear > 4;

# Japanese-made cars that are manual (should be 5 cars)
match $x isa manual-car; $y isa japanese-maker; (made: $x, maker:$y);

# European cars that are automatic (should all be Mercedes)
match $x isa automatic-car; $y isa european-maker; (made: $x, maker:$y);

Aggregate

The Graql aggregate keyword is the workhorse for statistics. Switch views using the left-hand navigation pane from Graph to Console to submit some queries. Here are some example aggregate queries to try:

# Count of all cars (32)
match $x isa car; aggregate count;

# Count American car makers (6)
match $x isa american-maker; aggregate count;

# Maximum MPG for an automatic car (24.4)
match $x isa automatic-car, has mpg $a; aggregate max $a;

# Minimum HP for all cars (52)
match $x isa car, has hp $hp; aggregate min $hp;

# Mean MPG for manual and automatic cars (24.39, 17.15)
match $x isa manual-car has mpg $mpg; aggregate mean $mpg;
match $x isa automatic-car has mpg $mpg; aggregate mean $mpg;

# Median number of cylinders (all Mercedes cars) (6)
match $x has model contains "Merc", has cyl $c; aggregate median $c;

# Maximum number of carburetors (all Chrysler cars) (4)
match $x has model contains "Chry", has carb $c; aggregate max $c;

# Minimum number of gears (all cars) (3)
match $x isa car, has gear $g; aggregate min $g;
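
As a sanity check on the figures quoted in the comments above, the same statistics can be recomputed outside Grakn. The sketch below is plain Python and uses values transcribed from R's built-in mtcars data (only the columns needed here), with the standard mtcars encoding am = 0 for automatic and am = 1 for manual. The variable names are my own, and the article's modified CSV may differ in detail.

```python
from statistics import mean, median

# Columns: model, mpg, cyl, hp, am (0 = automatic, 1 = manual), gear, carb
MTCARS = [
    ("Mazda RX4", 21.0, 6, 110, 1, 4, 4),
    ("Mazda RX4 Wag", 21.0, 6, 110, 1, 4, 4),
    ("Datsun 710", 22.8, 4, 93, 1, 4, 1),
    ("Hornet 4 Drive", 21.4, 6, 110, 0, 3, 1),
    ("Hornet Sportabout", 18.7, 8, 175, 0, 3, 2),
    ("Valiant", 18.1, 6, 105, 0, 3, 1),
    ("Duster 360", 14.3, 8, 245, 0, 3, 4),
    ("Merc 240D", 24.4, 4, 62, 0, 4, 2),
    ("Merc 230", 22.8, 4, 95, 0, 4, 2),
    ("Merc 280", 19.2, 6, 123, 0, 4, 4),
    ("Merc 280C", 17.8, 6, 123, 0, 4, 4),
    ("Merc 450SE", 16.4, 8, 180, 0, 3, 3),
    ("Merc 450SL", 17.3, 8, 180, 0, 3, 3),
    ("Merc 450SLC", 15.2, 8, 180, 0, 3, 3),
    ("Cadillac Fleetwood", 10.4, 8, 205, 0, 3, 4),
    ("Lincoln Continental", 10.4, 8, 215, 0, 3, 4),
    ("Chrysler Imperial", 14.7, 8, 230, 0, 3, 4),
    ("Fiat 128", 32.4, 4, 66, 1, 4, 1),
    ("Honda Civic", 30.4, 4, 52, 1, 4, 2),
    ("Toyota Corolla", 33.9, 4, 65, 1, 4, 1),
    ("Toyota Corona", 21.5, 4, 97, 0, 3, 1),
    ("Dodge Challenger", 15.5, 8, 150, 0, 3, 2),
    ("AMC Javelin", 15.2, 8, 150, 0, 3, 2),
    ("Camaro Z28", 13.3, 8, 245, 0, 3, 4),
    ("Pontiac Firebird", 19.2, 8, 175, 0, 3, 2),
    ("Fiat X1-9", 27.3, 4, 66, 1, 4, 1),
    ("Porsche 914-2", 26.0, 4, 91, 1, 5, 2),
    ("Lotus Europa", 30.4, 4, 113, 1, 5, 2),
    ("Ford Pantera L", 15.8, 8, 264, 1, 5, 4),
    ("Ferrari Dino", 19.7, 6, 175, 1, 5, 6),
    ("Maserati Bora", 15.0, 8, 335, 1, 5, 8),
    ("Volvo 142E", 21.4, 4, 109, 1, 4, 2),
]

FIELDS = ("model", "mpg", "cyl", "hp", "am", "gear", "carb")
cars = [dict(zip(FIELDS, row)) for row in MTCARS]

# Each list plays the part of a `match` clause; each statistic that of `aggregate`.
manual = [c for c in cars if c["am"] == 1]
automatic = [c for c in cars if c["am"] == 0]
mercs = [c for c in cars if "Merc" in c["model"]]
chryslers = [c for c in cars if "Chry" in c["model"]]

results = {
    "count of cars": len(cars),
    "max mpg (automatic)": max(c["mpg"] for c in automatic),
    "min hp (all cars)": min(c["hp"] for c in cars),
    "mean mpg (manual)": round(mean(c["mpg"] for c in manual), 2),
    "mean mpg (automatic)": round(mean(c["mpg"] for c in automatic), 2),
    "median cyl (Merc)": median(c["cyl"] for c in mercs),
    "max carb (Chry)": max(c["carb"] for c in chryslers),
    "min gear (all cars)": min(c["gear"] for c in cars),
}

for label, value in results.items():
    print(f"{label}: {value}")
```

The filter-then-summarize shape is the same as in the Graql queries: the match part selects concepts, and the aggregate part reduces the selected values to a single number.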

Compute

Graql also provides compute queries that can be used to determine values such as mean, minimum, and maximum. These can be submitted using the Graph view on the Visualizer. For example, type each of the following into the form and submit:

# Number of automatic (19) and manual cars (13)
compute count in automatic-car; 
compute count in manual-car;

# Number of Japanese car makers (4)
compute count in japanese-maker;

# Median number of cylinders (all cars) (6)
compute median of cyl;

# Minimum number of gears (all cars) (3)
compute min of gear;

# Maximum number of carburetors (all cars) (8)
compute max of carb;

# Mean MPG for an automatic car (17.15)
compute mean of mpg in automatic-car;

# Mean MPG for a manual car (24.39)
compute mean of mpg in manual-car;

When to Use Aggregate and When to Use Compute

Graql’s aggregate queries are computationally light and run single-threaded on a single machine. They are also more flexible than the equivalent compute query (for example, you can use an aggregate query to filter results by resource).

match $x isa car has model contains "Merc"; aggregate count; # 7

There are times when compute queries are more powerful. They are computationally intensive and can run in parallel on a cluster, so are good for big data and can be used to calculate results very fast. However, you can’t filter the results by resource in the same way as you can for an aggregate query.
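
To give a feel for why a statistic such as a mean suits parallel execution, here is a minimal Python sketch in which each worker reduces its own chunk of values to a (sum, count) pair and the pairs are then combined. It is not a description of how Grakn implements compute: the function names, the chunking scheme, and the use of a local thread pool in place of a cluster are all assumptions made for illustration. The values are the mpg figures for the 19 automatic cars in mtcars.

```python
from concurrent.futures import ThreadPoolExecutor

# mpg for the 19 automatic cars in mtcars
AUTOMATIC_MPG = [
    21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3,
    15.2, 10.4, 10.4, 14.7, 21.5, 15.5, 15.2, 13.3, 19.2,
]


def partial_sum(chunk):
    """Reduce one chunk to a (sum, count) pair, independently of the others."""
    return sum(chunk), len(chunk)


def parallel_mean(values, workers=4):
    """Split the values into chunks, reduce each chunk, then combine the pairs."""
    size = -(-len(values) // workers)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


print(round(parallel_mean(AUTOMATIC_MPG), 2))  # 17.15
```

Because the partial results combine with simple addition, the chunks could just as well live on different machines, which is the property that makes this kind of calculation a good fit for big data.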

You can perform much more with compute than I have illustrated in this example — for example, you can calculate the shortest path between two nodes in the graph, and look at clusters within the data. However, mtcars isn’t a great example for those features, since there aren’t many connections within such a simple dataset. The bonus of using a knowledge graph is that it has a flexible structure: the ontology can be extended and revised as new data is added. So if we found additional data, for example, about dealers offering these cars for sale, links to fan websites, photos, or reviews, we could add those in and make compute queries to uncover new information.

Reasoning Using Graql

Speaking of new information, it’s time to talk about inference, which can be used to find implicit information from the data. For example, given the following statements:

(If) grass is not an animal.
(If) vegetarians only eat things which are not animals.
(If) sheep only eat grass.

It is possible to infer the following:

(Then) sheep are vegetarians.

The initial statements can be seen as a set of premises. If all the premises are met we can infer a new fact (that sheep are vegetarians). If we hypothesize that sheep are vegetarians, then the whole example can be expressed with a particular two-block structure: If some premises are met, then a given hypothesis is true.

This is how reasoning in Graql works. It checks whether a set of Graql statements can be verified and, if they can, makes an inference from the second block of statements. The first set of statements (the IF part or, if you prefer, the antecedent) is called the left-hand side (LHS). The second part (also known as the consequent), not surprisingly, is the right-hand side (RHS). Using Graql, both sides of the rule are enclosed in curly braces and preceded by, respectively, the keywords lhs and rhs.

At the bottom of the ontology.gql file, you’ll see the Graql for reasoning over the dataset. The car entity has two extra resources, economical and powerful, which are strings that Grakn’s reasoner sets to either TRUE or FALSE. Four rules set them: a car is economical if its MPG is greater than or equal to 19.0, and powerful if its horsepower is greater than or equal to 147. Otherwise, the corresponding resource is set to FALSE.

# Reasoning

car
 has economical
 has powerful;

economical sub resource datatype string;
powerful sub resource datatype string;

$car-economy-true isa inference-rule
 lhs
 {$c isa car has mpg >= 19.0;}
 rhs
 {$c has economical "TRUE";};

$car-economy-false isa inference-rule
 lhs
 {$c isa car has mpg < 19.0;}
 rhs
 {$c has economical "FALSE";};

$car-powerful-true isa inference-rule
 lhs
 {$c isa car has hp >= 147;}
 rhs
 {$c has powerful "TRUE";};

$car-powerful-false isa inference-rule
 lhs
 {$c isa car has hp < 147;}
 rhs
 {$c has powerful "FALSE";};

By default, inference is switched off, and the only information you can query Grakn about is what was directly loaded from the data. Open the Graql shell by typing the following in the terminal window:

../../bin/graql.sh -n

The -n flag turns on inference, so you can make the following query in the shell:

>>>match $x has model $y, has powerful "TRUE" has economical "TRUE";
$x id "106584" isa manual-car; $y val "Ferrari Dino" isa model;
$x id "254120" isa automatic-car; $y val "Pontiac Firebird" isa model;

I found the results quite surprising. If I want to buy an economical yet powerful car, it seems I will need to save up for a Ferrari Dino or Pontiac Firebird! Or maybe I should add some extra data to the graph that takes purchase price into account, and use reasoning to find the sweet spot between purchase price, running cost, and power.
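
The if/then structure of the four rules can be mimicked in a few lines of Python. This is a rough sketch of the idea only, not of how Grakn's reasoner works: the RULES table and the infer function are hypothetical, and the five cars are a small subset of mtcars chosen to exercise every rule.

```python
# A handful of cars from mtcars (model, mpg, hp), enough to trigger each rule.
CARS = [
    {"model": "Ferrari Dino", "mpg": 19.7, "hp": 175},
    {"model": "Pontiac Firebird", "mpg": 19.2, "hp": 175},
    {"model": "Maserati Bora", "mpg": 15.0, "hp": 335},
    {"model": "Honda Civic", "mpg": 30.4, "hp": 52},
    {"model": "Hornet Sportabout", "mpg": 18.7, "hp": 175},
]

# Each rule: (name, lhs test, resource set by the rhs, value set by the rhs).
RULES = [
    ("car-economy-true", lambda c: c["mpg"] >= 19.0, "economical", "TRUE"),
    ("car-economy-false", lambda c: c["mpg"] < 19.0, "economical", "FALSE"),
    ("car-powerful-true", lambda c: c["hp"] >= 147, "powerful", "TRUE"),
    ("car-powerful-false", lambda c: c["hp"] < 147, "powerful", "FALSE"),
]


def infer(car):
    """Return a copy of the car with every resource whose lhs test holds."""
    derived = dict(car)
    for _name, lhs, resource, value in RULES:
        if lhs(car):
            derived[resource] = value
    return derived


inferred = [infer(car) for car in CARS]

# The counterpart of the shell query: powerful and economical at once.
both = [
    c["model"]
    for c in inferred
    if c["powerful"] == "TRUE" and c["economical"] == "TRUE"
]
print(both)  # ['Ferrari Dino', 'Pontiac Firebird']
```

The stored data never contains the economical or powerful values; they exist only as a consequence of the rules, which is the sense in which the information is implicit.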

This is a rather trivial illustration of reasoning, again somewhat hampered by the simplistic nature of the example. There is a complete example in our documentation that examines implicit family relationships.

Where Next?

If you haven’t already, I recommend that you review the documentation about aggregate queries and compute queries, since there is more to compute than just statistical analysis. There is also an example of using Graql analytics on the genealogy dataset available here.

This example was based on CSV data migrated into Grakn. Having read it, you may want to further study our documentation about CSV migration and Graql templating.

If you want to read another guide to getting started with GRAKN.AI, we have one on our website to get you up and running with Graql, and a tutorial to get started with Java. We do also have bindings for R, Python, and Haskell, although these are currently incomplete.

The code for GRAKN.AI is available on GitHub, and there is a thriving developer community that offers support over Slack and discussion forums. Check out GRAKN.AI for more information.


Topics:
grakn.ai ,r ,dataset ,big data ,graph ,tutorial

Published at DZone with permission of Jo Stichbury, DZone MVB. See the original article here.
