The Microsoft Concept Graph in Neo4j

Dive into importing the Microsoft Concept Graph into Neo4j and explore what that graph can tell us, particularly in the realm of context and specificity.

What does the study of concepts (or categories, depending on your field of study) tell us about the human mind?

A result of the Probase research project, the Microsoft Concept Graph harnesses billions of web pages and search logs to build a huge graph of relations between words (like “apple”) and their concepts (like “fruit” or “hardware company”). Using this data, the team at Microsoft hopes to build better search engines, spell-checkers, recommendation engines, taxonomies, and more.

This blog post walks through how we can harness Neo4j to delve into the Single Instance Conceptualization dataset provided by the first release of the Microsoft Concept Graph in late 2016. Specifically, it covers importing the data into Neo4j using neo4j-import and using Cypher to determine when an “apple” means a dessert instead of a particular company. I encourage you to read more in the excellent papers by the Microsoft team listed under References.

Concepts embody our knowledge of the kinds of things there are in the world. Tying our past experiences to our present interactions with the environment, they enable us to recognize and understand new objects and events.
– Gregory Murphy, The Big Book of Concepts

In the dataset, Instances are English noun phrases (NPs) and Concepts are the mental buckets or categories an NP may belong to. For example, instances of the concept snake include the words “boa,” “python,” and “viper,” which are also instances of the concepts artist (p=0.128), language (p=0.557), and car (p=0.107), respectively.

Download and Import: The V1 Release

Download link for the Microsoft Concept Graph.

This first release, called Single Instance Conceptualization, provides the core Is_A data mined from billions of web pages. It contains 5,376,526 unique concepts, 12,501,527 unique instances, and 85,101,174 Is_A relations.

The data is in a single tab-separated file, 330MB zipped and 1.2GB uncompressed, which we can import with neo4j-import (so make sure you’re using the .tar version of Neo4j).

The data in the file is organized according to Concept, Instance and Probability, like so:

state california 18062
supplement msm glucosamine sulfate 15942

Important: Note that the probability is out of 10^4.
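
As a minimal sketch of that scaling, assuming the three tab-separated columns described above, the Python below parses one line and divides the raw value by 10^4. The function name `parse_relation` is hypothetical and is not part of the dataset or of Neo4j.

```python
def parse_relation(line):
    """Split one tab-separated line into (concept, instance, probability)."""
    concept, instance, raw = line.rstrip("\n").split("\t")
    # The article treats the third column as a probability out of 10^4.
    return concept, instance, int(raw) / 10_000


print(parse_relation("fruit\tapple\t6315"))
```

The Cypher queries later in the post apply the same division at query time with `tofloat(r.probability)/10000`.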

This relatively simple graph can be represented like so:

  • (Instance {name: "apple"}) -[IS_A {probability: 6315}]-> (Concept {name: "fruit"})
  • (Instance {name: "apple"}) -[IS_A {probability: 4353}]-> (Concept {name: "company"})

# a quick peek at the data
head -n 10 data-concept-instance-relations.txt

factor	age	35167
free rich company datum	size	33222
free rich company datum	revenue	33185
state	california	18062
supplement	msm glucosamine sulfate	15942
factor	gender	14230
factor	temperature	13660
metal	copper	11142
issue	stress pain depression sickness	11110
variable	age	9375

# extract concepts (this can take a few seconds)
echo "name:ID(Concept)" > concepts.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 1 | sort | uniq >> concepts.txt

# extract instances (this can take a few seconds)
echo "name:ID(Instance)" > instances.txt
cat data-concept-instance-relations.txt | cut -d $'\t' -f 2 | sort | uniq >> instances.txt

# create the header row for the relationships import
echo $':END_ID(Concept)\t:START_ID(Instance)\tprobability' > is_a.hdr

# import into Neo4j
$NEO4J_HOME/bin/neo4j-import --into concepts.db --id-type string \
  --delimiter TAB --bad-tolerance 100000 --skip-duplicate-nodes true \
  --skip-bad-relationships true --nodes:Concept concepts.txt \
  --nodes:Instance instances.txt \
  --relationships:IS_A is_a.hdr,data-concept-instance-relations.txt

...

IMPORT DONE in 1m 27s 888ms.
Imported:
  17878053 nodes
  33377320 relationships
  51255373 properties
Peak memory usage: 410.36 MB

# Add two Constraints/Indexes
echo $'
CREATE CONSTRAINT ON (i:Instance) ASSERT i.name IS UNIQUE;
CREATE CONSTRAINT ON (c:Concept)  ASSERT c.name IS UNIQUE;' \
  | $NEO4J_HOME/bin/neo4j-shell -path concepts.db
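
For readers who prefer to see the extraction step outside the shell, here is a rough Python equivalent of the `cut | sort | uniq` pipelines above. It is only a sketch: `unique_column` is a hypothetical helper, and it sorts by Python's default string order, which may differ from the locale-dependent order that `sort` produces.

```python
def unique_column(lines, index):
    """Return the sorted, de-duplicated values of one tab-separated column."""
    return sorted({line.rstrip("\n").split("\t")[index] for line in lines})


# Three rows taken from the data preview above.
sample = [
    "state\tcalifornia\t18062",
    "factor\tage\t35167",
    "factor\tgender\t14230",
]
concepts = unique_column(sample, 0)   # first column of the file: concepts
instances = unique_column(sample, 1)  # second column of the file: instances
print(concepts)
print(instances)
```

On the full 1.2GB file the set of unique values has to fit in memory, so the shell pipeline remains the more practical choice for the real import.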


Now that you’ve created the concepts.db graph, you can move it to $NEO4J_HOME/data/databases and update $NEO4J_HOME/conf/neo4j.conf to mount concepts.db:

# The name of the database to mount
dbms.active_database=concepts.db


You should now be able to start the Neo4j Browser and see the Concept Graph.

Let’s Explore the Concept Graph

How is the word “apple” represented in the concept space?

MATCH (i:Instance {name:"apple"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, tofloat(r.probability)/10000 
              AS `is a(n)`, c.name AS Concept
ORDER BY r.probability DESC
LIMIT 10;

https://www.dropbox.com/s/t971i04ej991xpk/apple_graph.svg?dl=0

+-------------------------------------+
| Instance | is a(n) | Concept        |
+-------------------------------------+
| "apple"  | 0.6315  | "fruit"        |
| "apple"  | 0.4353  | "company"      |
| "apple"  | 0.1152  | "food"         |
| "apple"  | 0.0764  | "brand"        |
| "apple"  | 0.075   | "fresh fruit"  |
| "apple"  | 0.0568  | "fruit tree"   |
| "apple"  | 0.0483  | "crop"         |
| "apple"  | 0.028   | "corporation"  |
| "apple"  | 0.0279  | "manufacturer" |
| "apple"  | 0.0257  | "firm"         |
+-------------------------------------+


How is the word “pie” represented in the concept space?

MATCH (i:Instance {name:"pie"})-[r:IS_A]->(c:Concept)
RETURN i.name AS Instance, tofloat(r.probability)/10000 
              AS `is a(n)`, c.name AS Concept
ORDER BY r.probability DESC
LIMIT 10;

https://www.dropbox.com/s/7vmnsk1avl9j1b4/pie_graph.svg?dl=0

+------------------------------------+
| Instance | is a(n) | Concept       |
+------------------------------------+
| "pie"    | 0.0256  | "food"        |
| "pie"    | 0.0245  | "dessert"     |
| "pie"    | 0.0208  | "baked goods" |
| "pie"    | 0.018   | "bakery item" |
| "pie"    | 0.0105  | "baked good"  |
| "pie"    | 0.0097  | "item"        |
| "pie"    | 0.0087  | "product"     |
| "pie"    | 0.0054  | "food item"   |
| "pie"    | 0.0041  | "sweet"       |
| "pie"    | 0.0041  | "dish"        |
+------------------------------------+
10 rows
9321 ms


Adding some context: What Concepts represent both an apple and a pie?

We want to be very sure we’re talking about apple in the sense of the food, not Apple in the sense of the company.

MATCH (a:Instance {name:"apple"})-[r1:IS_A]->(c:Concept)<-[r2:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
RETURN c.name AS Concept, tofloat(r1.probability)*tofloat(r2.probability)*10^-8 AS prob
ORDER BY prob DESC
LIMIT 10;

https://www.dropbox.com/s/s27h2uh16k4mwyw/apple_pie.svg?dl=0

+-------------------------------------+
| Concept     | prob                  |
+-------------------------------------+
| "food"      | 0.00294912            |
| "item"      | 2.4056000000000001E-4 |
| "product"   | 1.5747E-4             |
| "fruit"     | 6.315E-5              |
| "snack"     | 3.959E-5              |
| "food item" | 3.51E-5               |
| "dessert"   | 3.43E-5               |
| "name"      | 7.92E-6               |
| "dish"      | 4.92E-6               |
| "case"      | 3.92E-6               |
+-------------------------------------+

10 rows
15 ms
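
The `prob` column multiplies the two IS_A values and rescales by 10^-8, which is the same as converting each raw value to a probability (dividing by 10^4) and then multiplying. Here is a small Python check of that arithmetic, using the raw values 1152 (apple, food) and 256 (pie, food) implied by the earlier result tables. The name `joint_score` is hypothetical.

```python
import math


def joint_score(raw_a, raw_b):
    """Product of two IS_A values, each stored as an integer out of 10^4."""
    return (raw_a / 10_000) * (raw_b / 10_000)


food = joint_score(1152, 256)
# Compare against the "food" row above, allowing for floating-point rounding.
print(math.isclose(food, 0.00294912))
```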


Adding some context: What instances are similar to both apples and pies?

We can go even further and look at the other instances of those shared concepts, aggregating by instance instead of relying only on the probabilities stored on the IS_A relationships. This lets us deduce that things that are both apples and pies are bread-like, fruit-based cakes.

MATCH (a:Instance {name:"apple"})-[:IS_A]->(c:Concept)<-[:IS_A]-(b:Instance {name:"pie"})
USING INDEX a:Instance(name)
USING INDEX b:Instance(name)
MATCH (c)<-[:IS_A]-(o:Instance) WHERE o <> a and o <> b
WITH o, count(*) AS freq
ORDER BY freq DESC LIMIT 10
RETURN o.name AS Instance, freq;

+--------------------+
| Instance    | freq |
+--------------------+
| "bread"     | 115  |
| "fruit"     | 113  |
| "cake"      | 110  |
| "cookie"    | 109  |
| "chocolate" | 102  |
| "cheese"    | 99   |
| "vegetable" | 93   |
| "egg"       | 93   |
| "banana"    | 91   |
| "fish"      | 91   |
+--------------------+
10 rows
4900 ms
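
To make the aggregation concrete, the sketch below reproduces the same counting logic in plain Python over a tiny, made-up set of IS_A pairs. The data and the function name `similar_instances` are illustrative assumptions, not taken from the Concept Graph. For every concept shared by the two seed instances, it counts each other instance attached to that concept, which is what `count(*) AS freq` computes in the query above.

```python
from collections import Counter, defaultdict


def similar_instances(is_a_pairs, seed_a, seed_b, limit=10):
    """Rank instances by how many concepts they share with both seeds."""
    concepts_of = defaultdict(set)   # instance -> concepts
    instances_of = defaultdict(set)  # concept -> instances
    for instance, concept in is_a_pairs:
        concepts_of[instance].add(concept)
        instances_of[concept].add(instance)

    # Concepts that both seeds belong to (the middle node of the Cypher pattern).
    shared = concepts_of[seed_a] & concepts_of[seed_b]
    freq = Counter()
    for concept in shared:
        for other in instances_of[concept]:
            if other not in (seed_a, seed_b):
                freq[other] += 1
    return freq.most_common(limit)


# Toy data, invented for illustration only.
pairs = [
    ("apple", "food"), ("apple", "snack"), ("apple", "company"),
    ("pie", "food"), ("pie", "snack"), ("pie", "dessert"),
    ("bread", "food"), ("bread", "snack"),
    ("cake", "food"), ("cake", "dessert"),
]
print(similar_instances(pairs, "apple", "pie"))
```

In this toy graph, "bread" shares both common concepts (food and snack) while "cake" shares only one, so "bread" ranks first.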


Conclusion

Although the Microsoft Concept Graph is currently a bit sparser than other concept graphs online, the research that created it is a valuable addition to the study of taxonomy and language.

References

  • Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.
  • Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Zhu, Probase: A Probabilistic Taxonomy for Text Understanding, in ACM International Conference on Management of Data (SIGMOD), May 2012.


Published at DZone with permission of Cristina Escalante, DZone MVB. See the original article here.
