DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations

Exploring a Graph Representation of Wikipedia Using Gremlin

Marko Rodriguez user avatar by
Marko Rodriguez
·
Mar. 10, 12 · Interview
Like (0)
Save
Tweet
Share
6.21K Views

Join the DZone community and get the full member experience.

Join For Free

a graph is a data structure composed of vertices and edges. these terms are synonymous with the following:

  • vertices, dots, nodes, things, points
  • edges, lines, links, relations, arcs

from vertices and edges, numerous artificial and real-world systems can be modeled and processed. the purpose of this post is to explore a graph representation of wikipedia persisted in the graph database neo4j and processed using the graph traversal language gremlin . please note that a parse-friendly representation of wikipedia was provided by dbpedia and the present analysis and presentation is provided by a collaborative effort between aurelius and the intelligent systems laboratory at parc (a xerox company).

on the graph nature of wikipedia

there are numerous ways in which wikipedia can be represented as a graph. the articles and the href hyperlinks between them is one way. this type of graph is known a single-relational graph because all the edges have the same meaning — a hyperlink. a more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph . for the purpose of this post, a middle ground representation is used. the vertices are wikipedia articles and wikipedia categories . the edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

using gremlin, some descriptive graph statistics of this representation are calculated. along with vertex and edge counts, the in- and out-degree distributions of the graph are provided below. these distribution calculations are agnostic to the edge labels.

gremlin> g = new neo4jgraph('/data/dbpedia')
neo4jgraph[/data/dbpedia]
gremlin> g.v.count()
30962172
gremlin> g.e.count()
191767951 
gremlin> outdegrees = [:]; indegrees = [:]
gremlin> g.v.transform{it.out.count()}.groupcount(outdegrees).iterate()
gremlin> g.v.transform{it.in.count()}.groupcount(indegrees).iterate()
gremlin> f = new file('out-degrees.txt')
gremlin> outdegrees.each{k,v -> f.append(k + '\t' + v + '\n')}
gremlin> f = new file('in-degrees.txt')
gremlin> indegrees.each{k,v -> f.append(k + '\t' + v + '\n')}

the files out-degrees.txt and in-degrees.txt generated by gremlin are loaded into r statistics and plotted.

r version 2.13.1 (2011-07-08)
copyright (c) 2011 the r foundation for statistical computing

 out.degrees <- read.table('out-degrees.txt', sep='\t')
 in.degrees <- read.table('in-degrees.txt', sep='\t')
 plot(out.degrees, log='xy', main='wikipedia out-degree distribution', xlab='out degree', ylab='frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)
 plot(in.degrees, log='xy', main='wikipedia in-degree distribution', xlab='in degree', ylab='frequency', cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

on the referential nature of ideas

one lazy sunday afternoon, gremlin walks into his library. the smell of leather and mahogany overcomes his senses. his eyes scan over the shelves filled with books about every imaginable topic. on one inspired day many moons ago, gremlin printed out each wikipedia article, bound it, and stored it. on this particular occasion, gremlin decides to review one of his favorite subjects: graph theory. off the shelf, gremlin grabs the book respectfully entitled “ graph theory .”

gremlin> v = g.idx(t.v)[[uri:'http://dbpedia.org/resource/graph_theory']].next()
v[26634]
gremlin> v.map()
name=graph theory
uri=http://dbpedia.org/resource/graph_theory

gremlin cracks open the cover of graph theory. to his amazement, he notices that the words in the book reference other books on his shelf! [line 1] he wonders, how many such references exist? [line 7]

gremlin> v.out('href').name[0..4]
mathematics
computer science
graph (mathematics)
vertex (graph theory)
graph of a function
gremlin> v.out('href').count()
240

one reference at a time, gremlin pulls the respective book from its shelf and looks at its pages. unbelievably, these books also make reference to other books! [line 1] how many references do the references of graph theory have? [line 8] dumbfounded by the circuitous nature of ideas being defined in terms of other ideas, gremlin comes to realize the inherent structure of the graph of knowledge — its self-referential nature [line 9].

gremlin> v.out('href').out('href').name[0..4]
mathematics (disambiguation)
math (disambiguation)
euclid
greek language
quantity
gremlin> v.out('href').out('href').count()   
21182
gremlin> v.out('href').dedup.out('href').out('href').retain([v]).paths{it.name}[0..4]
[graph theory, mathematics, logic, graph theory]
[graph theory, mathematics, history of mathematics, graph theory]
[graph theory, mathematics, operations research, graph theory]
[graph theory, mathematics, paul erdős, graph theory]
[graph theory, mathematics, theoretical computer science, graph theory]

there is only so much time a little green gremlin gets to live. he wonders what he would be expert in if he continues to read book after book as dictated by the references that they maintain to one another.

gremlin> v.in('href').name[0..4]
semiotics of the structure
sones graphdb
trapezoid graph
alpha centrality
block graph

on the recurrent nature of thought

for hours upon hours and days upon days, gremlin goes from book to book. sometimes coming back to reading a previous book over again. from the vantage point of a long exposure film , his form looks like a blur amongst his collection. however, this blur is not evenly distributed across his library. in fact, he is more likely reading some books over others. those books that are most central in the hyperlink structure are those most likely to be encountered on gremlin’s random walk . in his ledger, gremlin decides to record the potential that he will be reading any one book at any one time. initially all pages are equally likely. [line 1] he begins his journey at graph theory. [line 2] he looks up how much “potential” exists for graph theory. [line 3] he then retrieves all the outgoing neighbors of graph theory. [line 4] he determines the total number of neighbors. [line 5] to each neighbor, he distributes graph theory’s potential equally amongst them. [lines 6--8] gremlin then jumps to graph theory’s referred to books and continues the process over and over and over again for 100,000 steps. starting from the graph theory vertex, gremlin diffuses himself like a wave over the structure that surrounds.

gremlin> m = [:].withdefault{1}
gremlin> v.transform{
  rank = m[it.name]; 
  neighbors = it.out('href').tolist();
  degree = neighbors.size();
  neighbors.each {
    m[it.name] = m[it.name] + (rank/degree);
  }
  neighbors;
}.scatter.range(0,100000).loop(3){true}.iterate()
null

gremlin looks over the top 10 ranked entries in his ledger. this is where his diffusion shows the most recurrence — it articulates which book he is most likely to be reading at any one moment in the next 100,000 book reads.

gremlin> m.sort{a,b -> b.value <=> a.value}[0..9]
mathematics=8.2325984930913507177010807338237871640327644591
graph theory=6.3579261869065686025097269665928272601625649213
graph (mathematics)=3.8056391897097907459393869524402476473212892533
computer science=3.2189950001323052861187247709061777261390151631
real number=2.942860316400352223267769366106131293745859000
mathematician=2.751409475433875926984889918863144276736916
number theory=2.7296994022747017139496107107489077766999403538
combinatorics=2.6983273156966484352429094611889032326648909856
topology=2.6822253369205954921236082679389161746453858919
set theory=2.5682887400468326671894663603540368070624776875

on the taxonomical nature of categories

gremlin’s library is not just shelves with books. his library also has a card catalog in it. the catalog indexes books by category. moreover, each book has a reference to its subject category. for instance, the subject category of graph theory is, well…’graph theory.’ the catalog is also structured such that the categories make reference to one another. for instance, ‘combinatorics’ is a broader category than ‘graph theory.’ for the fun of it, gremlin decides to move between the books and the categories.

gremlin> v.out('subject').name
graph theory
gremlin> v.out('subject').out('broader').name
combinatorics
mathematics

in order to see the layers in a breadth-first manner , gremlin exposes the category “rings” emanating from ‘graph theory.’

gremlin> x = [];
gremlin> v.out('subject').out('broader').gather{t = it.unique(); x.add(t*.name); t}.scatter.loop(3){it.loops < 7}.iterate()
null
gremlin> x
[combinatorics, mathematical relations]
[subdivisions of mathematics, discrete mathematics, mathematics, predicate logic]
[mathematics, subfields by academic discipline, subdivisions of mathematics, main topic classifications, abstraction, formal sciences, structure, scientific disciplines, systems of formal logic]
[main topic classifications, abstraction, formal sciences, structure, scientific disciplines, categories by topic, academic disciplines, mathematics, subfields by academic discipline, articles, thought, innovation, problem solving, creativity, analogy, concepts, logic, interdisciplinary fields, formalism (deductive), matter, dimension, form, science, mathematical logic, formal systems]
[articles, thought, structure, innovation, problem solving, creativity, analogy, concepts, academic disciplines, logic, interdisciplinary fields, scientific disciplines, formalism (deductive), matter, dimension, form, science, subfields by academic discipline, categories by parameter, academia, categories by topic, main topic classifications, abstraction, formal sciences, contents, mind, cognition, mental processes, mental content, technological change, intelligence, critical thinking, human skills, human behavior, action, quality, fundamental categories, branches of philosophy, axiology, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, subdivisions of mathematics, discrete mathematics, mathematical logic, metalogic, formal theories, logical syntax]
[contents, mind, cognition, mental processes, mental content, matter, concepts, dimension, form, technological change, intelligence, creativity, innovation, critical thinking, human skills, human behavior, action, quality, thought, fundamental categories, academia, categories by topic, abstraction, branches of philosophy, interdisciplinary fields, axiology, science, academic disciplines, subfields by academic discipline, theories of deduction, formalism, fundamental physics concepts, universe, nature, ontology, manifolds, physical quantities, knowledge, main topic classifications, categories, education, categories by parameter, articles, structure, problem solving, analogy, logic, scientific disciplines, formalism (deductive), cognitive science, concepts in metaphysics, philosophy of mind, consciousness, epistemology, psychology, technology, economic development, technological change, and growth, neurology, neuroscience, memory, learning, educational psychology, philosophical logic, evaluation, life skills, skills, humans, behavior, phenomena, determinism, free will, motion, management, philosophy by field, value, deduction, scientific theories, philosophical theories, philosophy of mathematics, physics, concepts by field, physical cosmology, astrophysics, astronomical dynamical systems, reality, everything, life, metaphysics, philosophy of science, topological spaces, differential geometry, geometric topology, measurement, information, perception, concepts in epistemology, mathematics, subdivisions of mathematics, discrete mathematics, metaphilosophy, metatheory, theories, formal languages, formal sciences, syntax]

these rings are useful for seeing what is touched at each step of the walk. however, in order to see the inverted tree structure, gremlin nests his traversal so that the categories are grouped under their children . gremlin records the subsumption hierarchy as categories broaden one another.

gremlin> v.out('subject').out('broader').groupby{it.name}{
  it.out('broader').groupby{it.name}{
    it.out('broader').name
  }.cap.next()
}.cap.next()
mathematical relations=[{mathematics=[main topic classifications, abstraction, formal sciences, structure, scientific disciplines], predicate logic=[systems of formal logic]}]
combinatorics=[{discrete mathematics=[subdivisions of mathematics], subdivisions of mathematics=[mathematics, subfields by academic discipline]}]

gremlin’s library is an adventure through the fascinating world of interconnected ideas.

references

bizer, c., lehmann, j., kobilarov, g., auer, s., becker, s., cyganiak, r., hellmann, s., “ dbpedia – a crystallization point for the web of data ,” journal of web semantics, 7(3), 2009.

rodriguez, m.a., pepe, a., shinavier, j., “ the dilated triple ,” emergent web intelligence: advanced semantic technologies, advanced information and knowledge processing series, 3–16, 2010.

newman, m.e.j., “ the structure and function of complex networks “, siam review, 45, 167–256, 2003.

bollen, j., van de sompel, h., hagberg, a., bettencourt, l.m.a, chute, r., rodriguez, m.a., balakireva, l.l., “ clickstream data yields high-resolution maps of science ,” plos one, public library of science, 4(3), e4803, 2009.

Graph (Unix) Gremlin (query language) Book

Published at DZone with permission of Marko Rodriguez, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Spring Boot vs Eclipse MicroProfile: Resident Set Size (RSS) and Time to First Request (TFR) Comparative
  • The Path From APIs to Containers
  • Building a Real-Time App With Spring Boot, Cassandra, Pulsar, React, and Hilla
  • gRPC on the Client Side

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: