Over a million developers have joined DZone.

Exploring a Graph Representation of Wikipedia Using Gremlin

· Database Zone

Learn NoSQL for free with hands-on sample code, example queries, tutorials, and more.  Brought to you in partnership with Couchbase.

A graph is a data structure composed of vertices and edges. These terms are synonymous with the following:

  • vertices, dots, nodes, things, points
  • edges, lines, links, relations, arcs

From vertices and edges, numerous artificial and real-world systems can be modeled and processed. The purpose of this post is to explore a graph representation of Wikipedia persisted in the graph database Neo4j and processed using the graph traversal language Gremlin. Please note that a parse-friendly representation of Wikipedia was provided by DBPedia and the present analysis and presentation is provided by a collaborative effort between Aurelius and the Intelligent Systems Laboratory at PARC (a Xerox company).

On the Graph Nature of Wikipedia

There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

Using Gremlin, some descriptive graph statistics of this representation are calculated. Along with vertex and edge counts, the in- and out-degree distributions of the graph are provided below. These distribution calculations are agnostic to the edge labels.

gremlin> g = new Neo4jGraph('/data/dbpedia')
gremlin> g.V.count()
gremlin> g.E.count()
gremlin> outDegrees = [:]; inDegrees = [:]
gremlin> g.V.transform{it.out.count()}.groupCount(outDegrees).iterate()
gremlin> g.V.transform{it.in.count()}.groupCount(inDegrees).iterate()
gremlin> f = new File('out-degrees.txt')
gremlin> outDegrees.each{k,v -> f.append(k + '\t' + v + '\n')}
gremlin> f = new File('in-degrees.txt')
gremlin> inDegrees.each{k,v -> f.append(k + '\t' + v + '\n')}

The files out-degrees.txt and in-degrees.txt generated by Gremlin are loaded into R Statistics and plotted.

R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing

 out.degrees <- read.table('out-degrees.txt', sep='\t')
 in.degrees <- read.table('in-degrees.txt', sep='\t')
 plot(out.degrees, log='xy', main='Wikipedia Out-Degree Distribution', xlab='Out Degree', ylab='Frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
 plot(in.degrees, log='xy', main='Wikipedia In-Degree Distribution', xlab='In Degree', ylab='Frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)


On the Referential Nature of Ideas

One lazy Sunday afternoon, Gremlin walks into his library. The smell of leather and mahogany overcomes his senses. His eyes scan over the shelves filled with books about every imaginable topic. On one inspired day many moons ago, Gremlin printed out each Wikipedia article, bound it, and stored it. On this particular occasion, Gremlin decides to review one of his favorite subjects: graph theory. Off the shelf, Gremlin grabs the book respectfully entitled “Graph Theory.”

gremlin> v = g.idx(T.v)[[uri:'http://dbpedia.org/resource/Graph_theory']].next()
gremlin> v.map()
name=graph theory

Gremlin cracks open the cover of Graph Theory. To his amazement, he notices that the words in the book reference other books on his shelf! [line 1] He wonders, how many such references exist? [line 7]

gremlin> v.out('href').name[0..4]
computer science
graph (mathematics)
vertex (graph theory)
graph of a function
gremlin> v.out('href').count()

One reference at a time, Gremlin pulls the respective book from its shelf and looks at its pages. Unbelievably, these books also make reference to other books! [line 1] How many references do the references of Graph Theory have? [line 8] Dumbfounded by the circuitous nature of ideas being defined in terms of other ideas, Gremlin comes to realize the inherent structure of the graph of knowledge — its self-referential nature [line 9].

gremlin> v.out('href').out('href').name[0..4]
mathematics (disambiguation)
math (disambiguation)
greek language
gremlin> v.out('href').out('href').count()   
gremlin> v.out('href').dedup.out('href').out('href').retain([v]).paths{it.name}[0..4]
[graph theory, mathematics, logic, graph theory]
[graph theory, mathematics, history of mathematics, graph theory]
[graph theory, mathematics, operations research, graph theory]
[graph theory, mathematics, paul erdős, graph theory]
[graph theory, mathematics, theoretical computer science, graph theory]

There is only so much time a little green Gremlin gets to live. He wonders what he would be expert in if he continues to read book after book as dictated by the references that they maintain to one another.

gremlin> v.in('href').name[0..4]
semiotics of the structure
sones graphdb
trapezoid graph
alpha centrality
block graph

On the Recurrent Nature of Thought

For hours upon hours and days upon days, Gremlin goes from book to book. Sometimes coming back to reading a previous book over again. From the vantage point of a long exposure film, his form looks like a blur amongst his collection. However, this blur is not evenly distributed across his library. In fact, he is more likely reading some books over others. Those books that are most central in the hyperlink structure are those most likely to be encountered on Gremlin’s random walk. In his ledger, Gremlin decides to record the potential that he will be reading any one book at any one time. Initially all pages are equally likely. [line 1] He begins his journey at Graph Theory. [line 2] He looks up how much “potential” exists for Graph Theory. [line 3] He then retrieves all the outgoing neighbors of Graph Theory. [line 4] He determines the total number of neighbors. [line 5] To each neighbor, he distributes Graph Theory’s potential equally amongst them. [lines 6--8] Gremlin then jumps to Graph Theory’s referred to books and continues the process over and over and over again for 100,000 steps. Starting from the Graph Theory vertex, Gremlin diffuses himself like a wave over the structure that surrounds.

gremlin> m = [:].withDefault{1}
gremlin> v.transform{
  rank = m[it.name]; 
  neighbors = it.out('href').toList();
  degree = neighbors.size();
  neighbors.each {
    m[it.name] = m[it.name] + (rank/degree);

Gremlin looks over the top 10 ranked entries in his ledger. This is where his diffusion shows the most recurrence — it articulates which book he is most likely to be reading at any one moment in the next 100,000 book reads.

gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
graph theory=6.3579261869065686025097269665928272601625649213
graph (mathematics)=3.8056391897097907459393869524402476473212892533
computer science=3.2189950001323052861187247709061777261390151631
real number=2.942860316400352223267769366106131293745859000
number theory=2.7296994022747017139496107107489077766999403538
set theory=2.5682887400468326671894663603540368070624776875

On the Taxonomical Nature of Categories

 Gremlin’s library is not just shelves with books. His library also has a card catalog in it. The catalog indexes books by category. Moreover, each book has a reference to its subject category. For instance, the subject category of Graph Theory is, well…’graph theory.’ The catalog is also structured such that the categories make reference to one another. For instance, ‘combinatorics’ is a broader category than ‘graph theory.’ For the fun of it, Gremlin decides to move between the books and the categories.

gremlin> v.out('subject').name
graph theory
gremlin> v.out('subject').out('broader').name

In order to see the layers in a breadth-first manner, Gremlin exposes the category “rings” emanating from ‘graph theory.’

gremlin> x = [];
gremlin> v.out('subject').out('broader').gather{t = it.unique(); x.add(t*.name); t}.scatter.loop(3){it.loops < 7}.iterate()
gremlin> x
[combinatorics, mathematical relations]
[subdivisions of mathematics, discrete mathematics, mathematics, predicate logic]
[mathematics, subfields by academic discipline, subdivisions of mathematics, main topic classifications, abstraction, formal sciences, structure, scientific disciplines, systems of formal logic]
[main topic classifications, abstraction, formal sciences, structure, scientific disciplines, categories by topic, academic disciplines, mathematics, subfields by academic discipline, articles, thought, innovation, problem solving, creativity, analogy, concepts, logic, interdisciplinary fields, formalism (deductive), matter, dimension, form, science, mathematical logic, formal systems]
[articles, thought, structure, innovation, problem solving, creativity, analogy, concepts, academic disciplines, logic, interdisciplinary fields, scientific disciplines, formalism (deductive), matter, dimension, form, science, subfields by academic discipline, categories by parameter, academia, categories by topic, main topic classifications, abstraction, formal sciences, contents, mind, cognition, mental processes, mental content, technological change, intelligence, critical thinking, human skills, human behavior, action, quality, fundamental categories, branches of philosophy, axiology, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, subdivisions of mathematics, discrete mathematics, mathematical logic, metalogic, formal theories, logical syntax]
[contents, mind, cognition, mental processes, mental content, matter, concepts, dimension, form, technological change, intelligence, creativity, innovation, critical thinking, human skills, human behavior, action, quality, thought, fundamental categories, academia, categories by topic, abstraction, branches of philosophy, interdisciplinary fields, axiology, science, academic disciplines, subfields by academic discipline, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, main topic classifications, categories, education, categories by parameter, articles, structure, problem solving, analogy, logic, scientific disciplines, formalism (deductive), cognitive science, concepts in metaphysics, philosophy of mind, consciousness, epistemology, psychology, technology, economic development, technological change, and growth, neurology, neuroscience, memory, learning, educational psychology, philosophical logic, evaluation, life skills, skills, humans, behavior, phenomena, determinism, free will, motion, management, philosophy by field, value, deduction, scientific theories, philosophical theories, philosophy of mathematics, physics, concepts by field, physical cosmology, astrophysics, astronomical dynamical systems, reality, everything, life, metaphysics, philosophy of science, topological spaces, differential geometry, geometric topology, measurement, information, perception, concepts in epistemology, mathematics, subdivisions of mathematics, discrete mathematics, metaphilosophy, metatheory, theories, formal languages, formal sciences, syntax]

These rings are useful for seeing what is touched at each step of the walk. However, in order to see the inverted tree structure, Gremlin nests his traversal so that the categories are grouped under their children. Gremlin records the subsumption hierarchy as categories broaden one another.

gremlin> v.out('subject').out('broader').groupBy{it.name}{
mathematical relations=[{mathematics=[main topic classifications, abstraction, formal sciences, structure, scientific disciplines], predicate logic=[systems of formal logic]}]
combinatorics=[{discrete mathematics=[subdivisions of mathematics], subdivisions of mathematics=[mathematics, subfields by academic discipline]}]

Gremlin’s library is an adventure through the fascinating world of interconnected ideas.


Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, S., Cyganiak, R., Hellmann, S., “DBpedia – A Crystallization Point for the Web of Data,” Journal of Web Semantics, 7(3), 2009.

Rodriguez, M.A., Pepe, A., Shinavier, J., “The Dilated Triple,” Emergent Web Intelligence: Advanced Semantic Technologies, Advanced Information and Knowledge Processing series, 3–16, 2010.

Newman, M.E.J., “The Structure and Function of Complex Networks“, SIAM Review, 45, 167–256, 2003.

Bollen, J., Van de Sompel, H., Hagberg, A., Bettencourt, L.M.A, Chute, R., Rodriguez, M.A., Balakireva, L.L., “Clickstream Data Yields High-Resolution Maps of Science,” PLoS One, Public Library of Science, 4(3), e4803, 2009.

The Getting Started with NoSQL Guide will get you hands-on with NoSQL in minutes with no coding needed. Brought to you in partnership with Couchbase.


Published at DZone with permission of Marko Rodriguez, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}