Exploring a Graph Representation of Wikipedia Using Gremlin
Join the DZone community and get the full member experience.
Join For Freea graph is a data structure composed of vertices and edges. these terms are synonymous with the following:
- vertices, dots, nodes, things, points
- edges, lines, links, relations, arcs
from vertices and edges, numerous artificial and real-world systems can be modeled and processed. the purpose of this post is to explore a graph representation of wikipedia persisted in the graph database neo4j and processed using the graph traversal language gremlin . please note that a parse-friendly representation of wikipedia was provided by dbpedia and the present analysis and presentation is provided by a collaborative effort between aurelius and the intelligent systems laboratory at parc (a xerox company).
on the graph nature of wikipedia
there are numerous ways in which wikipedia can be represented as a graph. the articles and the href hyperlinks between them is one way. this type of graph is known a single-relational graph because all the edges have the same meaning — a hyperlink. a more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph . for the purpose of this post, a middle ground representation is used. the vertices are wikipedia articles and wikipedia categories . the edges are hyperlinks between articles as well as taxonomical relations amongst the categories.
using gremlin, some descriptive graph statistics of this representation are calculated. along with vertex and edge counts, the in- and out-degree distributions of the graph are provided below. these distribution calculations are agnostic to the edge labels.
gremlin> g = new neo4jgraph('/data/dbpedia') neo4jgraph[/data/dbpedia] gremlin> g.v.count() 30962172 gremlin> g.e.count() 191767951 gremlin> outdegrees = [:]; indegrees = [:] gremlin> g.v.transform{it.out.count()}.groupcount(outdegrees).iterate() gremlin> g.v.transform{it.in.count()}.groupcount(indegrees).iterate() gremlin> f = new file('out-degrees.txt') gremlin> outdegrees.each{k,v -> f.append(k + '\t' + v + '\n')} gremlin> f = new file('in-degrees.txt') gremlin> indegrees.each{k,v -> f.append(k + '\t' + v + '\n')}
the files out-degrees.txt and in-degrees.txt generated by gremlin are loaded into r statistics and plotted.
r version 2.13.1 (2011-07-08) copyright (c) 2011 the r foundation for statistical computing out.degrees <- read.table('out-degrees.txt', sep='\t') in.degrees <- read.table('in-degrees.txt', sep='\t') plot(out.degrees, log='xy', main='wikipedia out-degree distribution', xlab='out degree', ylab='frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5) plot(in.degrees, log='xy', main='wikipedia in-degree distribution', xlab='in degree', ylab='frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
on the referential nature of ideas
one lazy sunday afternoon, gremlin walks into his library. the smell of
leather and mahogany
overcomes his senses. his eyes scan over the shelves filled with books
about every imaginable topic. on one inspired day many moons ago,
gremlin printed out each wikipedia article, bound it, and stored it. on
this particular occasion, gremlin decides to review one of his favorite
subjects: graph theory. off the shelf, gremlin grabs the book
respectfully entitled “
graph theory
.”
gremlin> v = g.idx(t.v)[[uri:'http://dbpedia.org/resource/graph_theory']].next() v[26634] gremlin> v.map() name=graph theory uri=http://dbpedia.org/resource/graph_theory
gremlin cracks open the cover of graph theory. to his amazement, he notices that the words in the book reference other books on his shelf! [line 1] he wonders, how many such references exist? [line 7]
gremlin> v.out('href').name[0..4] mathematics computer science graph (mathematics) vertex (graph theory) graph of a function gremlin> v.out('href').count() 240
one reference at a time, gremlin pulls the respective book from its
shelf and looks at its pages. unbelievably, these books also make
reference to other books! [line 1] how many references do the references
of graph theory have? [line 8] dumbfounded by the circuitous nature of
ideas being defined in terms of other ideas, gremlin comes to realize
the inherent structure of the graph of knowledge — its self-referential
nature [line 9].
gremlin> v.out('href').out('href').name[0..4] mathematics (disambiguation) math (disambiguation) euclid greek language quantity gremlin> v.out('href').out('href').count() 21182 gremlin> v.out('href').dedup.out('href').out('href').retain([v]).paths{it.name}[0..4] [graph theory, mathematics, logic, graph theory] [graph theory, mathematics, history of mathematics, graph theory] [graph theory, mathematics, operations research, graph theory] [graph theory, mathematics, paul erdős, graph theory] [graph theory, mathematics, theoretical computer science, graph theory]
there is only so much time a little green gremlin gets to live. he wonders what he would be expert in if he continues to read book after book as dictated by the references that they maintain to one another.
gremlin> v.in('href').name[0..4] semiotics of the structure sones graphdb trapezoid graph alpha centrality block graph
on the recurrent nature of thought
for hours upon hours and days upon days, gremlin goes from book to
book. sometimes coming back to reading a previous book over again. from
the vantage point of a
long exposure film
,
his form looks like a blur amongst his collection. however, this blur
is not evenly distributed across his library. in fact, he is more likely
reading some books over others. those books that are most central in
the hyperlink structure are those most likely to be encountered on
gremlin’s
random walk
.
in his ledger, gremlin decides to record the potential that he will be
reading any one book at any one time. initially all pages are equally
likely. [line 1] he begins his journey at graph theory. [line 2] he
looks up how much “potential” exists for graph theory. [line 3] he then
retrieves all the outgoing neighbors of graph theory. [line 4] he
determines the total number of neighbors. [line 5] to each neighbor, he
distributes graph theory’s potential equally amongst them. [lines 6--8]
gremlin then jumps to graph theory’s referred to books and continues the
process over and over and over again for 100,000 steps. starting from
the graph theory vertex, gremlin
diffuses
himself like a
wave
over the structure that surrounds.
gremlin> m = [:].withdefault{1} gremlin> v.transform{ rank = m[it.name]; neighbors = it.out('href').tolist(); degree = neighbors.size(); neighbors.each { m[it.name] = m[it.name] + (rank/degree); } neighbors; }.scatter.range(0,100000).loop(3){true}.iterate() null
gremlin looks over the top 10 ranked entries in his ledger. this is where his diffusion shows the most recurrence — it articulates which book he is most likely to be reading at any one moment in the next 100,000 book reads.
gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] mathematics=8.2325984930913507177010807338237871640327644591 graph theory=6.3579261869065686025097269665928272601625649213 graph (mathematics)=3.8056391897097907459393869524402476473212892533 computer science=3.2189950001323052861187247709061777261390151631 real number=2.942860316400352223267769366106131293745859000 mathematician=2.751409475433875926984889918863144276736916 number theory=2.7296994022747017139496107107489077766999403538 combinatorics=2.6983273156966484352429094611889032326648909856 topology=2.6822253369205954921236082679389161746453858919 set theory=2.5682887400468326671894663603540368070624776875
on the taxonomical nature of categories
gremlin’s library is not just shelves with books. his library also has a card catalog in it. the catalog indexes books by category. moreover, each book has a reference to its subject category. for instance, the subject category of graph theory is, well…’graph theory.’ the catalog is also structured such that the categories make reference to one another. for instance, ‘combinatorics’ is a broader category than ‘graph theory.’ for the fun of it, gremlin decides to move between the books and the categories.
gremlin> v.out('subject').name graph theory gremlin> v.out('subject').out('broader').name combinatorics mathematics
in order to see the layers in a breadth-first manner , gremlin exposes the category “rings” emanating from ‘graph theory.’
gremlin> x = []; gremlin> v.out('subject').out('broader').gather{t = it.unique(); x.add(t*.name); t}.scatter.loop(3){it.loops < 7}.iterate() null gremlin> x [combinatorics, mathematical relations] [subdivisions of mathematics, discrete mathematics, mathematics, predicate logic] [mathematics, subfields by academic discipline, subdivisions of mathematics, main topic classifications, abstraction, formal sciences, structure, scientific disciplines, systems of formal logic] [main topic classifications, abstraction, formal sciences, structure, scientific disciplines, categories by topic, academic disciplines, mathematics, subfields by academic discipline, articles, thought, innovation, problem solving, creativity, analogy, concepts, logic, interdisciplinary fields, formalism (deductive), matter, dimension, form, science, mathematical logic, formal systems] [articles, thought, structure, innovation, problem solving, creativity, analogy, concepts, academic disciplines, logic, interdisciplinary fields, scientific disciplines, formalism (deductive), matter, dimension, form, science, subfields by academic discipline, categories by parameter, academia, categories by topic, main topic classifications, abstraction, formal sciences, contents, mind, cognition, mental processes, mental content, technological change, intelligence, critical thinking, human skills, human behavior, action, quality, fundamental categories, branches of philosophy, axiology, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, subdivisions of mathematics, discrete mathematics, mathematical logic, metalogic, formal theories, logical syntax] [contents, mind, cognition, mental processes, mental content, matter, concepts, dimension, form, technological change, intelligence, creativity, innovation, critical thinking, human skills, human behavior, action, quality, thought, fundamental categories, academia, categories by topic, abstraction, branches of philosophy, interdisciplinary fields, axiology, science, academic disciplines, subfields by academic discipline, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, main topic classifications, categories, education, categories by parameter, articles, structure, problem solving, analogy, logic, scientific disciplines, formalism (deductive), cognitive science, concepts in metaphysics, philosophy of mind, consciousness, epistemology, psychology, technology, economic development, technological change, and growth, neurology, neuroscience, memory, learning, educational psychology, philosophical logic, evaluation, life skills, skills, humans, behavior, phenomena, determinism, free will, motion, management, philosophy by field, value, deduction, scientific theories, philosophical theories, philosophy of mathematics, physics, concepts by field, physical cosmology, astrophysics, astronomical dynamical systems, reality, everything, life, metaphysics, philosophy of science, topological spaces, differential geometry, geometric topology, measurement, information, perception, concepts in epistemology, mathematics, subdivisions of mathematics, discrete mathematics, metaphilosophy, metatheory, theories, formal languages, formal sciences, syntax]
these rings are useful for seeing what is touched at each step of the walk. however, in order to see the inverted tree structure, gremlin nests his traversal so that the categories are grouped under their children . gremlin records the subsumption hierarchy as categories broaden one another.
gremlin> v.out('subject').out('broader').groupby{it.name}{ it.out('broader').groupby{it.name}{ it.out('broader').name }.cap.next() }.cap.next() mathematical relations=[{mathematics=[main topic classifications, abstraction, formal sciences, structure, scientific disciplines], predicate logic=[systems of formal logic]}] combinatorics=[{discrete mathematics=[subdivisions of mathematics], subdivisions of mathematics=[mathematics, subfields by academic discipline]}]
gremlin’s library is an adventure through the fascinating world of interconnected ideas.
references
bizer, c., lehmann, j., kobilarov, g., auer, s., becker, s., cyganiak, r., hellmann, s., “ dbpedia – a crystallization point for the web of data ,” journal of web semantics, 7(3), 2009.
rodriguez, m.a., pepe, a., shinavier, j., “ the dilated triple ,” emergent web intelligence: advanced semantic technologies, advanced information and knowledge processing series, 3–16, 2010.
newman, m.e.j., “ the structure and function of complex networks “, siam review, 45, 167–256, 2003.
bollen, j., van de sompel, h., hagberg, a., bettencourt, l.m.a, chute, r., rodriguez, m.a., balakireva, l.l., “ clickstream data yields high-resolution maps of science ,” plos one, public library of science, 4(3), e4803, 2009.
Published at DZone with permission of Marko Rodriguez, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments