The Microsoft Concept Graph in Neo4j

Dive into importing the Microsoft Concept Graph into Neo4j and explore what that graph can tell us, particularly in the realm of context and specificity.

What does the study of concepts (or categories, depending on your field of study) tell us about the human mind?

A result of the Probase research project, the Microsoft Concept Graph harnesses billions of web pages and search logs to build a huge graph of relations between words (like “apple”) and their concepts (like “fruit” or “hardware company”). Using this data, the team at Microsoft hopes to build better search engines, spell-checkers, recommendation engines, taxonomies, and more.

This blog post walks through how we can harness Neo4j to delve into the Single Instance Conceptualization dataset published in the first release of the Microsoft Concept Graph in late 2016. Specifically, it covers importing the data into Neo4j using neo4j-import and using Cypher to determine when an “apple” means a dessert instead of a particular company. I encourage you to read more in the excellent papers by the Microsoft team, listed at the end of this post.

Concepts embody our knowledge of the kinds of things there are in the world. Tying our past experiences to our present interactions with the environment, they enable us to recognize and understand new objects and events.
– Gregory Murphy, The Big Book of Concepts

In the dataset, Instances are English noun phrases (NPs) and Concepts are the mental buckets or categories an NP may belong to. For example, instances of the concept snake include the words “boa,” “python,” and “viper,” which are also instances of the concepts artist (p=0.128), language (p=0.557), and car (p=0.107), respectively.

Download and Import: The V1 Release

Download link for the Microsoft Concept Graph.

This first release, called Single Instance Conceptualization, provides the core Is_A data mined from billions of web pages. It contains 5,376,526 unique concepts, 12,501,527 unique instances, and 85,101,174 Is_A relations.

The data is in a single tab-separated file, 330MB zipped and 1.2GB uncompressed, which we can import with neo4j-import (so make sure you’re using the .tar version of Neo4j).

The data in the file is organized according to Concept, Instance and Probability, like so:

state california 18062
supplement msm glucosamine sulfate 15942

Important: Note that the probability is out of 10^4.
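
To make the scaling concrete, here is a minimal sketch that rescales the third column to a decimal. It assumes tab-separated input in the Concept, Instance, Probability layout described above; `scale_probability` is a hypothetical helper name, and the two input rows are made up in that layout using values that appear later in this post.

```shell
# Hypothetical helper: rescale the third tab-separated column
# from "out of 10^4" to a decimal with four places.
scale_probability() {
  awk -F '\t' '{ printf "%s\t%s\t%.4f\n", $1, $2, $3 / 10000 }'
}

# Two illustrative rows: concept <TAB> instance <TAB> raw value
printf 'fruit\tapple\t6315\nfood\tpie\t256\n' | scale_probability
# prints:
# fruit	apple	0.6315
# food	pie	0.0256
```

The Cypher queries later in this post do the same division with tofloat(r.probability)/10000.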

This relatively simple graph can be represented like so:

(Instance)-[:IS_A]->(Concept)

A single Instance can have IS_A relationships to several Concepts.

# a quick peek at the data
head -n 10 data-concept-instance-relations.txt

free rich company datum	size	33222
free rich company datum	revenue	33185
supplement	msm glucosamine sulfate	15942
issue	stress pain depression sickness	11110

# extract concepts (this can take a few seconds)
echo "name:ID(Concept)" > concepts.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 1 | sort | uniq >> concepts.txt

# extract instances (this can take a few seconds)
echo "name:ID(Instance)" > instances.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 2 | sort | uniq >> instances.txt

# create the header row for the relationships import
echo $':END_ID(Concept)\t:START_ID(Instance)\tprobability' > is_a.hdr

# import into Neo4j
$NEO4J_HOME/bin/neo4j-import --into concepts.db --id-type string \
--delimiter TAB --bad-tolerance 100000 --skip-duplicate-nodes true \
--skip-bad-relationships true --nodes:Concept concepts.txt \
--nodes:Instance instances.txt \
--relationships:IS_A is_a.hdr,data-concept-instance-relations.txt


IMPORT DONE in 1m 27s 888ms.
  17878053 nodes
  33377320 relationships
  51255373 properties
Peak memory usage: 410.36 MB

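# Add two Constraints/Indexes (one per label, on the name property)
echo $'
CREATE CONSTRAINT ON (c:Concept) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (i:Instance) ASSERT i.name IS UNIQUE;
' | $NEO4J_HOME/bin/neo4j-shell -path concepts.db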
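
Before committing to the full 1.2GB file, the node-extraction steps from the listing above can be rehearsed on a tiny sample. The sketch below is only an illustration: the three rows in sample.txt are made up (using values from the tables later in this post), and it relies on cut's default delimiter being a tab.

```shell
# Work in a scratch directory so the real files are untouched.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Three made-up rows: concept <TAB> instance <TAB> raw value
printf 'fruit\tapple\t6315\ncompany\tapple\t4353\nfood\tpie\t256\n' > sample.txt

# Same header-plus-unique-values pipeline as above, applied to the sample.
echo "name:ID(Concept)" > concepts.txt
cut -f 1 sample.txt | sort | uniq >> concepts.txt

echo "name:ID(Instance)" > instances.txt
cut -f 2 sample.txt | sort | uniq >> instances.txt

cat concepts.txt instances.txt
# prints:
# name:ID(Concept)
# company
# food
# fruit
# name:ID(Instance)
# apple
# pie
```

Note that "apple" appears only once in instances.txt even though it occurs in two rows, which is what lets neo4j-import treat the name as a node ID.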

Now that you’ve created the concepts.db graph, you can move it to $NEO4J_HOME/data/databases and update $NEO4J_HOME/conf/neo4j.conf to mount concepts.db:

# The name of the database to mount
dbms.active_database=concepts.db

You should now be able to start the Neo4j Browser and see the Concept Graph.

Let’s Explore the Concept Graph

How is the word “apple” represented in the concept space?

MATCH (i:Instance {name:"apple"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, toFloat(r.probability)/10000
              AS `is a(n)`, c.name AS Concept
ORDER BY `is a(n)` DESC
LIMIT 10


| Instance | is a(n) | Concept        |
| "apple"  | 0.6315  | "fruit"        |
| "apple"  | 0.4353  | "company"      |
| "apple"  | 0.1152  | "food"         |
| "apple"  | 0.0764  | "brand"        |
| "apple"  | 0.0750  | "fresh fruit"  |
| "apple"  | 0.0568  | "fruit tree"   |
| "apple"  | 0.0483  | "crop"         |
| "apple"  | 0.0280  | "corporation"  |
| "apple"  | 0.0279  | "manufacturer" |
| "apple"  | 0.0257  | "firm"         |
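
As a rough cross-check outside Neo4j, the same lookup can be sketched against the raw tab-separated file. This is an illustration rather than part of the original workflow: `concepts_for` is a hypothetical helper, and the sample rows below are made up in the file's layout from values shown in this post.

```shell
# concepts_for FILE INSTANCE
# Hypothetical helper mirroring the query above: list the concepts of one
# instance, rescaled to a decimal and sorted from highest to lowest.
concepts_for() {
  awk -F '\t' -v inst="$2" \
    '$2 == inst { printf "%s\t%.4f\t%s\n", $2, $3 / 10000, $1 }' "$1" |
    sort -t "$(printf '\t')" -k2,2nr |
    head -n 10
}

# Made-up sample: concept <TAB> instance <TAB> raw value
sample="$(mktemp)"
printf 'fruit\tapple\t6315\nfood\tapple\t1152\ncompany\tapple\t4353\nfood\tpie\t256\n' > "$sample"

concepts_for "$sample" apple
# prints:
# apple	0.6315	fruit
# apple	0.4353	company
# apple	0.1152	food
```

Against the real download, the first argument would be data-concept-instance-relations.txt; expect a full scan of the file, since there is no index to help.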

How is the word “pie” represented in the concept space?

MATCH (i:Instance {name:"pie"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, toFloat(r.probability)/10000
              AS `is a(n)`, c.name AS Concept
ORDER BY `is a(n)` DESC
LIMIT 10


| Instance | is a(n) | Concept       |
| "pie"    | 0.0256  | "food"        |
| "pie"    | 0.0245  | "dessert"     |
| "pie"    | 0.0208  | "baked goods" |
| "pie"    | 0.018   | "bakery item" |
| "pie"    | 0.0105  | "baked good"  |
| "pie"    | 0.0097  | "item"        |
| "pie"    | 0.0087  | "product"     |
| "pie"    | 0.0054  | "food item"   |
| "pie"    | 0.0041  | "sweet"       |
| "pie"    | 0.0041  | "dish"        |
10 rows
9321 ms

Adding some context: What Concepts represent both an apple and a pie?

We want to be very sure we’re talking about apple in the sense of the food, not Apple in the sense of the company.

MATCH (a:Instance {name:"apple"})-[r1:IS_A]->(c:Concept)<-[r2:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
RETURN c.name AS Concept, toFloat(r1.probability)*toFloat(r2.probability)*10^-8 AS prob
ORDER BY prob DESC
LIMIT 10


| Concept     | prob                  |
| "food"      | 0.00294912            |
| "item"      | 2.4056000000000001E-4 |
| "product"   | 1.5747E-4             |
| "fruit"     | 6.315E-5              |
| "snack"     | 3.959E-5              |
| "food item" | 3.51E-5               |
| "dessert"   | 3.43E-5               |
| "name"      | 7.92E-6               |
| "dish"      | 4.92E-6               |
| "case"      | 3.92E-6               |

10 rows
15 ms
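
The prob column is simply the product of the two raw IS_A values, rescaled by 10^-8 (10^-4 for each side). As a quick check of the "food" row, the sketch below multiplies the raw values behind 0.1152 (apple) and 0.0256 (pie) from the earlier tables; `joint_prob` is a hypothetical helper name.

```shell
# joint_prob A B
# Hypothetical helper: multiply two raw values and rescale by 10^-8.
joint_prob() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.8f\n", a * b / 100000000 }'
}

joint_prob 1152 256   # apple -> food, pie -> food
# prints 0.00294912
```

The result matches the first row of the table above.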

Adding some context: What instances are similar to both apples and pies?

We can go even further and look at the other instances of those shared concepts, aggregating by instance instead of relying only on the values stored on the IS_A relationships. This lets us deduce that things that are both apples and pies are bread-like, fruit-based cakes.

MATCH (a:Instance {name:"apple"})-[:IS_A]->(c:Concept)<-[:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
MATCH (c)<-[:IS_A]-(o:Instance) WHERE o <> a AND o <> b
WITH o, count(*) AS freq
RETURN o.name AS Instance, freq
ORDER BY freq DESC
LIMIT 10;

| Instance    | freq |
| "bread"     | 115  |
| "fruit"     | 113  |
| "cake"      | 110  |
| "cookie"    | 109  |
| "chocolate" | 102  |
| "cheese"    | 99   |
| "vegetable" | 93   |
| "egg"       | 93   |
| "banana"    | 91   |
| "fish"      | 91   |
10 rows
4900 ms
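
The counting logic behind freq can be sketched without a database. The helper below is a hypothetical illustration, not the query Neo4j runs: it finds the concepts that contain both given instances, then counts how many of those shared concepts each other instance belongs to. The sample rows are made up.

```shell
# shared_concept_counts FILE A B
# Hypothetical helper over a concept<TAB>instance<TAB>value file.
shared_concept_counts() {
  awk -F '\t' -v a="$2" -v b="$3" '
    $2 == a { has_a[$1] = 1 }                 # concepts containing A
    $2 == b { has_b[$1] = 1 }                 # concepts containing B
    { concept[NR] = $1; instance[NR] = $2 }   # remember every row
    END {
      for (i = 1; i <= NR; i++) {
        c = concept[i]; o = instance[i]
        # count other instances that sit in a concept shared by A and B
        if ((c in has_a) && (c in has_b) && o != a && o != b) freq[o]++
      }
      for (o in freq) print freq[o] "\t" o
    }' "$1" | sort -k1,1nr -k2,2
}

# Made-up sample: "food" and "dessert" contain both apple and pie.
sample="$(mktemp)"
printf 'food\tapple\t1\nfood\tpie\t1\nfood\tbread\t1\nfood\tcake\t1\n' > "$sample"
printf 'dessert\tapple\t1\ndessert\tpie\t1\ndessert\tcake\t1\n' >> "$sample"
printf 'company\tapple\t1\ncompany\tgoogle\t1\n' >> "$sample"

shared_concept_counts "$sample" apple pie
# prints:
# 2	cake
# 1	bread
```

"google" never appears because "company" is not shared with pie, which is the same filtering the two-sided MATCH pattern performs.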


Although the Microsoft Concept Graph is currently a bit sparser than other concept graphs online, the research that created it is a valuable addition to the study of taxonomy and language.


  • Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.
  • Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Zhu, Probase: A Probabilistic Taxonomy for Text Understanding, in ACM International Conference on Management of Data (SIGMOD), May 2012.

Published at DZone with permission of Cristina Escalante, DZone MVB. See the original article here.
