Davy Suvee

Architect at Datablend

Heist-Op-Den-Berg, BE

Joined May 2008

About

Davy Suvee is the founder of Datablend. He is currently working as an IT Lead/Software Architect in the Research and Development division of a large pharmaceutical company. Required to work with big and unstructured scientific data sets, Davy gathered hands-on expertise and insights into best practices for Big Data and NoSQL. Through Datablend, Davy aims to share his practical experience within a broader IT environment.

Stats

Reputation: 3
Pageviews: 198.0K
Articles: 3
Comments: 1

Articles

Back To The Future with Datomic
At the beginning of March, Rich Hickey and his team released Datomic. Datomic is a novel distributed database system designed to enable scalable, flexible, and intelligent applications running on next-generation cloud architectures. Its launch was surrounded by quite a bit of buzz and skepticism, mainly related to its rather disruptive architectural proposal. Instead of trying to recapitulate the various pros and cons of that architectural approach, I will focus on the other innovations it introduces, namely its powerful data model (based upon the concept of datoms) and its expressive query language (based upon Datalog). The remainder of this article describes how to store facts and query them through Datalog expressions and rules. Additionally, I will show how Datomic introduces an explicit notion of time, which allows queries to be executed against both previous and future states of the database. As an example, I will use a very simple data model that describes genealogical information. As always, the complete source code can be found on the Datablend public GitHub repository.

1. The Datomic data model

Datomic stores facts (i.e. your data points) as datoms. A datom represents the addition (or retraction) of a relation between an entity, an attribute, a value, and a transaction. The datom concept is closely related to an RDF triple, where each triple is a statement about a particular resource in the form of a subject-predicate-object expression. Datomic adds the notion of time by explicitly tagging each datom with a transaction identifier (i.e. the exact point in time at which the fact was persisted into the Datomic database). This allows Datomic to promote data immutability: updates do not change your existing facts; they merely create new datoms that are tagged with a more recent transaction. Hence, the system keeps track of all facts, forever.
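To make the datom idea concrete, here is a minimal, hypothetical Java sketch (my own illustration, not Datomic's actual implementation): datoms are immutable (entity, attribute, value, transaction) tuples, an "update" only appends a new datom, and the current value of an attribute is simply the value carried by the datom with the highest transaction number.

```java
import java.util.*;

// A minimal, hypothetical sketch of the datom idea -- not Datomic's real API.
public class DatomSketch {
    // An immutable fact: entity, attribute, value, transaction id.
    record Datom(long entity, String attribute, Object value, long tx) {}

    private final List<Datom> datoms = new ArrayList<>();
    private long nextTx = 0;

    // "Updates" never modify existing datoms; they only append new ones.
    public void assertFact(long entity, String attribute, Object value) {
        datoms.add(new Datom(entity, attribute, value, nextTx++));
    }

    // The current value is the one carried by the most recent transaction.
    public Optional<Object> current(long entity, String attribute) {
        return datoms.stream()
                .filter(d -> d.entity() == entity && d.attribute().equals(attribute))
                .max(Comparator.comparingLong(Datom::tx))
                .map(Datom::value);
    }

    public static void main(String[] args) {
        DatomSketch db = new DatomSketch();
        db.assertFact(1, ":person/name", "Dave Smith");
        db.assertFact(1, ":person/name", "David Smith"); // a later correction
        System.out.println(db.current(1, ":person/name").get()); // prints David Smith
        System.out.println(db.datoms.size()); // prints 2: both facts are retained
    }
}
```

The key point the sketch makes: the older fact is never overwritten, which is exactly what enables the time-based queries discussed later in the article.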
Datomic does not enforce an explicit entity schema; it's up to you to decide which attributes to store for a particular entity. Attributes are part of the Datomic meta model, which specifies the characteristics (i.e. attributes) of the attributes themselves. Our genealogical example data model stores information about persons and their ancestors. For this, we require two attributes: name and parent. An attribute is basically an entity itself, expressed in terms of built-in system attributes such as cardinality, value type, and attribute description.

```java
// Open a connection to an in-memory database
String uri = "datomic:mem://test";
Peer.createDatabase(uri);
Connection conn = Peer.connect(uri);

// Declare the attribute schema
List<Object> tx = new ArrayList<>();
tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"),
                ":db/ident", ":person/name",
                ":db/valueType", ":db.type/string",
                ":db/cardinality", ":db.cardinality/one",
                ":db/doc", "A person's name",
                ":db.install/_attribute", ":db.part/db"));
tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"),
                ":db/ident", ":person/parent",
                ":db/valueType", ":db.type/ref",
                ":db/cardinality", ":db.cardinality/many",
                ":db/doc", "A person's parent",
                ":db.install/_attribute", ":db.part/db"));

// Store it
conn.transact(tx).get();
```

All entities in a Datomic database have an internal key, called the entity id. In our case, we generate a temporary id through the tempid utility method. All entities are stored within a specific database partition that groups logically related entities together. Attribute definitions need to reside in the :db.part/db partition, a dedicated system partition used exclusively for storing system entities and schema definitions. :person/name is a single-valued attribute of value type string. :person/parent is a multi-valued attribute of value type ref; the value of a reference attribute points to (the id of) another entity stored within the Datomic database.
Once our attribute schema is persisted, we can start populating the database with concrete person entities.

```java
// Define person entities
List<Object> tx = new ArrayList<>();
Object edmond = Peer.tempid(":db.part/user");
tx.add(Util.map(":db/id", edmond, ":person/name", "Edmond Suvee"));
Object gilbert = Peer.tempid(":db.part/user");
tx.add(Util.map(":db/id", gilbert, ":person/name", "Gilbert Suvee", ":person/parent", edmond));
Object davy = Peer.tempid(":db.part/user");
tx.add(Util.map(":db/id", davy, ":person/name", "Davy Suvee", ":person/parent", gilbert));

// Store them
conn.transact(tx).get();
```

We create three concrete persons: myself, my dad Gilbert Suvee, and my grandfather Edmond Suvee. As with the attribute definitions, we again employ the tempid utility method to retrieve temporary ids for the newly created entities. This time, however, we store the persons within the :db.part/user database partition, the default partition for application entities. Each person is given a name (via the :person/name attribute) and a parent (via the :person/parent attribute). When the transact method is called, each entity is translated into a set of individual datoms that together describe the entity. Once persisted, Datomic ensures that temporary ids are replaced with their final counterparts.

2. The Datomic query language

Datomic's query model is an extended form of Datalog. Datalog is a deductive query system that will feel quite familiar to people with SPARQL and/or Prolog experience. The declarative query language uses pattern matching to find all combinations of values (i.e. facts) that satisfy a particular set of conditions expressed as clauses.
Let's have a look at a few example queries:

```java
// Find all persons
System.out.println(Peer.q("[:find ?name " +
                          " :where [?person :person/name ?name]]", conn.db()));

// Find the parents of all persons
System.out.println(Peer.q("[:find ?name ?parentname " +
                          " :where [?person :person/name ?name] " +
                          "        [?person :person/parent ?parent] " +
                          "        [?parent :person/name ?parentname]]", conn.db()));

// Find the grandparents of all persons
System.out.println(Peer.q("[:find ?name ?grandparentname " +
                          " :where [?person :person/name ?name] " +
                          "        [?person :person/parent ?parent] " +
                          "        [?parent :person/parent ?grandparent] " +
                          "        [?grandparent :person/name ?grandparentname]]", conn.db()));
```

We consider entities to be persons if they have a :person/name attribute. The :where part of the first query, which finds all persons stored in the Datomic database, specifies the following "conditional" clause: [?person :person/name ?name]. ?person and ?name are variables that act as placeholders. The Datalog query engine retrieves all facts (i.e. datoms) that match this clause. The :find part of the query specifies the values that should be returned as the result of the query.

Result query 1: [["Davy Suvee"], ["Edmond Suvee"], ["Gilbert Suvee"]]

The second and third queries retrieve the parents and grandparents of all persons stored in the Datomic database. These queries specify multiple clauses that are solved through unification: when a variable name is used more than once, it must represent the same value in every clause in order to satisfy the total set of clauses. As expected, only Davy Suvee is identified as having a grandparent, as the facts needed to satisfy this query are not available for either Gilbert Suvee or Edmond Suvee.
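The unification step can be illustrated with a small, self-contained sketch (hypothetical, and much simpler than Datomic's actual engine): given a set of parent facts, joining the clause [?person :person/parent ?parent] with [?parent :person/parent ?grandparent] on the shared ?parent variable yields the grandparent pairs.

```java
import java.util.*;

// A hypothetical illustration of clause unification, independent of Datomic:
// two parent clauses are joined on the shared ?parent variable.
public class UnifySketch {
    public static Set<List<String>> grandparents(Map<String, String> parentOf) {
        Set<List<String>> result = new HashSet<>();
        for (Map.Entry<String, String> e : parentOf.entrySet()) {
            String person = e.getKey();
            String parent = e.getValue();               // binds ?parent ...
            String grandparent = parentOf.get(parent);  // ... reused in the second clause
            if (grandparent != null) {
                result.add(List.of(person, grandparent));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = Map.of(
                "Davy Suvee", "Gilbert Suvee",
                "Gilbert Suvee", "Edmond Suvee");
        // Only Davy Suvee has a known grandparent
        System.out.println(grandparents(parentOf));
    }
}
```

Bindings for ?parent that cannot be extended to a ?grandparent binding simply drop out of the result, which is why Gilbert and Edmond do not appear.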
Result query 2: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]]
Result query 3: [["Davy Suvee" "Edmond Suvee"]]

If several queries require this "grandparent" notion, one can define a reusable rule that encapsulates the required clauses. Rules can be flexibly combined with clauses (and other rules) in the :where part of a query. Our third query can be rewritten using the following rule and clauses:

```java
String grandparentrule =
    "[ [ (grandparent ?person ?grandparent) " +
    "    [?person :person/parent ?parent] " +
    "    [?parent :person/parent ?grandparent] ] ]";

System.out.println(Peer.q("[:find ?name ?grandparentname " +
                          " :in $ % " +
                          " :where [?person :person/name ?name] " +
                          "        (grandparent ?person ?grandparent) " +
                          "        [?grandparent :person/name ?grandparentname]]",
                          conn.db(), grandparentrule));
```

Rules can also be used to write recursive queries. Imagine the ancestor relationship: it's impossible to predict how many parent levels one needs to go up in order to retrieve all ancestors of a person. As Datomic rules support recursion, a rule can call itself within its own definition. As with recursion in other languages, recursive rules are built up out of a simple base case and a set of clauses that reduce all other cases toward that base case.

```java
String ancestorrule =
    "[ [ (ancestor ?person ?ancestor) " +
    "    [?person :person/parent ?ancestor] ] " +
    "  [ (ancestor ?person ?ancestor) " +
    "    [?person :person/parent ?parent] " +
    "    (ancestor ?parent ?ancestor) ] ]";

System.out.println(Peer.q("[:find ?name ?ancestorname " +
                          " :in $ % " +
                          " :where [?person :person/name ?name] " +
                          "        (ancestor ?person ?ancestor) " +
                          "        [?ancestor :person/name ?ancestorname]]",
                          conn.db(), ancestorrule));
```

Result query 4: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]]

3. Back To The Future I

As already mentioned in section 1, Datomic does not perform in-place updates.
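The recursive ancestor rule behaves like a transitive closure over the parent relation. A hypothetical plain-Java analogue, with the same base case and recursive case:

```java
import java.util.*;

// A hypothetical plain-Java analogue of the recursive ancestor rule:
// base case  -> a parent is an ancestor;
// recursion  -> an ancestor of my parent is also my ancestor.
public class AncestorSketch {
    public static Set<String> ancestors(Map<String, String> parentOf, String person) {
        Set<String> result = new LinkedHashSet<>();
        String parent = parentOf.get(person);
        if (parent != null) {
            result.add(parent);                         // base case
            result.addAll(ancestors(parentOf, parent)); // recursive case
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = Map.of(
                "Davy Suvee", "Gilbert Suvee",
                "Gilbert Suvee", "Edmond Suvee");
        System.out.println(ancestors(parentOf, "Davy Suvee"));
        // prints [Gilbert Suvee, Edmond Suvee]
    }
}
```

As in the Datalog rule, the recursion terminates when the base case no longer applies, i.e. when a person without a recorded parent is reached.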
Instead, all facts are stored and tagged with a transaction, such that the most up-to-date value of a particular entity attribute can be retrieved. By doing so, Datomic allows you to travel back in time and run queries against previous states of the database. Using the asOf method, one can retrieve a version of the database that contains only the facts that were part of the database at that particular moment in time. Using a checkpoint that predates the storage of my own person entity results in parent-query results that no longer contain anything related to myself.

```java
System.out.println(Peer.q("[:find ?name ?parentname " +
                          " :where [?person :person/name ?name] " +
                          "        [?person :person/parent ?parent] " +
                          "        [?parent :person/name ?parentname]]",
                          conn.db().asOf(getCheckPoint(checkpoint))));
```

Result query 2: [["Gilbert Suvee" "Edmond Suvee"]]

4. Back To The Future II

Datomic also allows you to predict the future. Well, sort of. Similar to the asOf method, one can use the with method to retrieve a version of the database extended with a list of not-yet-transacted datoms. This allows you to run queries against future states of the database and observe the implications if these new facts were to be added.

```java
List<Object> tx = new ArrayList<>();
tx.add(Util.map(":db/id", Peer.tempid(":db.part/user"),
                ":person/name", "FutureChild Suvee",
                ":person/parent", Peer.q("[:find ?person :where [?person :person/name \"Davy Suvee\"]]",
                                         conn.db()).iterator().next().get(0)));

System.out.println(Peer.q("[:find ?name ?ancestorname " +
                          " :in $ % " +
                          " :where [?person :person/name ?name] " +
                          "        (ancestor ?person ?ancestor) " +
                          "        [?ancestor :person/name ?ancestorname]]",
                          conn.db().with(tx), ancestorrule));
```

Result query 4: [["FutureChild Suvee" "Edmond Suvee"], ["FutureChild Suvee" "Gilbert Suvee"], ["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"], ["FutureChild Suvee" "Davy Suvee"]]
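Because facts are immutable and transaction-tagged, "time travel" is conceptually just filtering: a past state of the database is the subset of datoms whose transaction id does not exceed the checkpoint. A hypothetical sketch (again my own illustration, not Datomic's implementation):

```java
import java.util.*;

// A hypothetical illustration of asOf-style time travel over immutable facts:
// a past state is simply the datoms whose transaction id is <= the checkpoint.
public class AsOfSketch {
    record Datom(long entity, String attribute, Object value, long tx) {}

    public static List<Datom> asOf(List<Datom> datoms, long checkpoint) {
        return datoms.stream().filter(d -> d.tx() <= checkpoint).toList();
    }

    public static void main(String[] args) {
        List<Datom> datoms = List.of(
                new Datom(1, ":person/name", "Edmond Suvee", 100),
                new Datom(2, ":person/name", "Gilbert Suvee", 100),
                new Datom(3, ":person/name", "Davy Suvee", 200));
        // Query "as of" transaction 100: the third person does not exist yet
        System.out.println(asOf(datoms, 100).size()); // prints 2
    }
}
```

The with method is the mirror image: instead of filtering datoms out, it appends not-yet-transacted datoms before evaluating the query.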
5. Conclusion

The use of datoms and Datalog allows you to express simple, yet powerful queries. This article introduces only a fraction of the features offered by Datomic. To get better acquainted with the various Datomic gotchas, I implemented the Tinkerpop Blueprints API on top of Datomic. By doing so, you basically get a distributed, temporal graph database, which is, as far as I know, unique within the graph database ecosystem. The source code of this Blueprints implementation can currently be found on the Datablend public GitHub repository and will soon be merged into the Tinkerpop project.
April 14, 2012
· 17,500 Views
Circos: An Amazing Tool for Visualizing Big Data
Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights is where things really start to get interesting. Lately, I've been exploring various options for visualizing (directed) graphs, including Circos. Circos is an amazing software package that visualizes your data through a circular layout. Although it was originally designed for displaying genomic data, it can create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go.

The figure below illustrates the core concept behind Circos. The table's columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments. The ribbons themselves are proportional in width to the value in the cell. When visualizing a directed graph, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table.

In my case, I want to better understand the flow of visitors to and within the Datablend site and blog: where do visitors come from (direct, referral, search, ...) and how do they navigate between pages? The rest of this article details how to 1) retrieve the raw visit information through the Google Analytics API, 2) persist this information as a graph in Neo4j, and 3) query and preprocess this data for visualization through Circos. As always, the complete source code can be found on the Datablend public GitHub repository.

1. Retrieving your Google Analytics data

Let's start by retrieving the raw Google Analytics data.
The Google Analytics Data API provides access to all dimensions and metrics that can be queried through the web application. In my case, I'm interested in retrieving the previous page path for each page view. If a visitor enters through a page outside of the Datablend website, the previous page path is marked as (entrance); otherwise, it contains the internal path. We will use Google's Java Data API to connect and retrieve this information. We are particularly interested in the pagePath, pageTitle, previousPagePath, and medium dimensions, while our metric of choice is the number of pageviews. After setting the date range, the feed of entries that satisfy these criteria can be retrieved. For ease of use, we transform this data into a domain entity and filter/clean it accordingly. If a visit originates from outside the Datablend website, we store the specific medium (direct, referral, search, ...) as the previous path.

```java
// Authenticate
analyticsService = new AnalyticsService(Configuration.SERVICE);
analyticsService.setUserCredentials(Configuration.CLIENT_USERNAME, Configuration.CLIENT_PASS);

// Create query
DataQuery query = new DataQuery(new URL(Configuration.DATA_URL));
query.setIds(Configuration.TABLE_ID);
query.setDimensions("ga:medium,ga:previousPagePath,ga:pagePath,ga:pageTitle");
query.setMetrics("ga:pageviews");
query.setStartDate(dateString);
query.setEndDate(dateString);

// Execute
DataFeed feed = analyticsService.getFeed(createQueryUrl(date), DataFeed.class);

// Iterate and clean
for (DataEntry entry : feed.getEntries()) {
    String pagePath = entry.stringValueOf("ga:pagePath");
    String pageTitle = entry.stringValueOf("ga:pageTitle");
    String previousPagePath = entry.stringValueOf("ga:previousPagePath");
    String medium = entry.stringValueOf("ga:medium");
    Long views = entry.longValueOf("ga:pageviews");
    // Check whether the filter criteria are satisfied
    if (filter(pagePath) && filter(previousPagePath)
            && (!clean(previousPagePath).equals(clean(pagePath)))) {
        Navigation navigation = new Navigation(clean(previousPagePath), clean(pagePath), pageTitle, date, views);
        if (navigation.getSource().equals("(entrance)")) {
            // In case of an entrance, save its medium instead
            navigation.setSource(medium);
        }
        navigations.add(navigation);
    }
}
```

2. Storing navigational data as a directed graph in Neo4j

The set of site navigations can easily be stored as a directed graph in the Neo4j graph database. Nodes are site paths (or mediums), while relationships are the navigations themselves. We start by retrieving the navigations for a particular date range and retrieve (or lazily create) the nodes representing the source and target paths (or mediums). Next, we de-normalize the pageviews metric (for instance, 6 individual relationships will be created for 6 page views). Although this de-normalization step is not strictly required, I did it to make sure that the degree of my nodes is correct should I perform other types of calculations. For each individual navigation relationship, we also store the date of the visit.

```java
// Retrieve navigations for a particular date
List<Navigation> navigations = retrieval.getNavigations(date);

// Save them in the graph database
Transaction tx = graphDb.beginTx();

// Iterate and create
for (Navigation nav : navigations) {
    Node source = getPath(nav.getSource());
    Node target = getPath(nav.getTarget());
    if (!target.hasProperty("title")) {
        target.setProperty("title", nav.getTargetTitle());
    }
    for (long i = 0; i < nav.getAmount(); i++) {
        // Duplicate relationships to de-normalize the pageview count
        Relationship transition = source.createRelationshipTo(target, Relationships.NAVIGATION);
        transition.setProperty("date", date.getTime()); // save time as long
    }
}

// Commit
tx.success();
tx.finish();
```

3. Creating the Circos tabular data format

The Circos tabular data format is quite easy to construct. It's basically a tab-delimited file with row and column headers. A cell is interpreted as a value that flows from the row entity to the column entity.
We will use the Neo4j Cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period. Doing so allows us to create historical visualizations of our navigations and observe how visit-flow behavior changes over time.

```java
// Access the graph database
graphDb = new EmbeddedGraphDatabase("var/analytics");
engine = new ExecutionEngine(graphDb);

// Parameterize the date-range Cypher query
Map<String, Object> params = new HashMap<>();
params.put("fromdate", from.getTime());
params.put("todate", to.getTime());

// Execute the query
ExecutionResult result = engine.execute("start sourcepath=node:index(\"path:*\") " +
        "match sourcepath-[r]->targetpath " +
        "where r.date >= {fromdate} and r.date <= {todate} " +
        "return sourcepath,targetpath", params);
```

Next, we create the tab-delimited file itself. We iterate through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we build the two-dimensional array by normalizing (i.e. summing) the number of navigations between the source and target paths. At the end, we filter this occurrence matrix on the minimum number of required navigations. This ensures that we only create segments for paths that are relevant in the total population. As a final step, we print the occurrence matrix as a tab-delimited file. For each path, we use a shorthand, as the Circos renderer seems to have problems with long string identifiers.
```java
// Retrieve the results
Iterator<Map<String, Object>> it = result.javaIterator();
List<Navigation> navigations = new ArrayList<>();
Map<String, String> titles = new HashMap<>();
Set<String> paths = new HashSet<>();

// Iterate the results
while (it.hasNext()) {
    Map<String, Object> record = it.next();
    String source = (String) ((Node) record.get("sourcepath")).getProperty("path");
    String target = (String) ((Node) record.get("targetpath")).getProperty("path");
    String targetTitle = (String) ((Node) record.get("targetpath")).getProperty("title");
    // Reuse the Navigation object as a temporary holder
    navigations.add(new Navigation(source, target, targetTitle, new Date(), 1));
    paths.add(source);
    paths.add(target);
    if (!titles.containsKey(target)) {
        titles.put(target, targetTitle);
    }
}

// Retrieve the various paths
List<String> pathIds = Arrays.asList(paths.toArray(new String[]{}));

// Create the matrix that holds the info
int[][] occurrences = new int[pathIds.size()][pathIds.size()];

// Iterate through all the navigations and update accordingly
for (Navigation navigation : navigations) {
    int sourceIndex = pathIds.indexOf(navigation.getSource());
    int targetIndex = pathIds.indexOf(navigation.getTarget());
    occurrences[sourceIndex][targetIndex] = occurrences[sourceIndex][targetIndex] + 1;
}

// Matrix built, filter on threshold
for (int i = 0; i < occurrences.length; i++) {
    for (int j = 0; j < occurrences.length; j++) {
        if (occurrences[i][j] < threshold) {
            occurrences[i][j] = 0;
        }
    }
}

// Print
printCircosData(pathIds, titles, occurrences);
```

The text below is a sample of the output generated by the printCircosData method. It first prints the legend (matching shorthands with actual paths), then the tab-delimited Circos table.
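The matrix-building and printing steps can also be sketched in isolation. The following self-contained version (a simplified, hypothetical reworking of the code above, with made-up inputs) counts source-to-target navigations, zeroes out cells below the threshold, and emits a tab-delimited table:

```java
import java.util.*;

// A self-contained sketch of the occurrence-matrix approach:
// count source->target navigations, apply a threshold, print a Circos table.
public class CircosMatrixSketch {
    // navigations: list of {source, target} pairs
    public static String toCircosTable(List<String[]> navigations, int threshold) {
        // Collect the distinct paths in a stable (sorted) order
        SortedSet<String> pathSet = new TreeSet<>();
        for (String[] nav : navigations) {
            pathSet.add(nav[0]);
            pathSet.add(nav[1]);
        }
        List<String> paths = new ArrayList<>(pathSet);

        // Sum the number of navigations per (source, target) cell
        int[][] occurrences = new int[paths.size()][paths.size()];
        for (String[] nav : navigations) {
            occurrences[paths.indexOf(nav[0])][paths.indexOf(nav[1])]++;
        }

        // Zero out cells below the threshold while printing tab-delimited rows
        StringBuilder sb = new StringBuilder("data");
        for (String p : paths) sb.append('\t').append(p);
        sb.append('\n');
        for (int i = 0; i < paths.size(); i++) {
            sb.append(paths.get(i));
            for (int j = 0; j < paths.size(); j++) {
                sb.append('\t').append(occurrences[i][j] < threshold ? 0 : occurrences[i][j]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> navs = List.of(
                new String[]{"referral", "/blog"},
                new String[]{"referral", "/blog"},
                new String[]{"/blog", "/about"});
        // The single /blog -> /about navigation falls below the threshold of 2
        System.out.print(toCircosTable(navs, 2));
    }
}
```

The thresholding is what keeps the circular layout readable: paths with negligible traffic would otherwise clutter the circle with hair-thin ribbons.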
link0 - /?p=411/wp-admin - Storing and querying RDF data in Neo4j through Sail - Datablend
link1 - /?p=1146 - Visualizing RDF Schema inferencing through Neo4j, Tinkerpop, Sail and Gephi - Datablend
link2 - /?p=164 - Big Data / concise articles - Datablend
link3 - referral - null
link4 - /?p=1400 - The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework - Datablend
...

data	l0	l1	l2	l3	l4	...
l0	0	0	0	0	0
l1	0	0	0	0	0
l2	0	0	0	0	0
l3	0	594	0	0	197
l4	0	0	0	0	0

4. Use the Circos power

Although Circos can be installed on your local computer, we will use its online version to create the visualization of our data. Upload your tab-delimited file and wait just a few seconds before enjoying the beautiful rendering of your site's navigation information. At a glance, we can already see that the l3 segment (i.e. the referrals) is significantly larger (almost 6000 navigations) than the other segments. The outer three rings visualize the total number of navigations leaving and entering a particular path. In the case of referrals, no navigations have this path as a target (indicated by the empty middle ring); its total segment count (inner ring) is built up entirely out of navigations that have a referral as their source. The l6 segment appears to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my "The joy of algorithms and NoSQL: a MongoDB example" article. Most of its traffic is received through referrals, while a decent amount is also generated through direct (l17 segment) and search (l27 segment) traffic. The l15 segment (my blog's main page) is the only path that receives an almost equal amount of incoming and outgoing traffic. With just a few tweaks to the Circos input data, we can easily focus on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of two separate colors.

5. Conclusions

In the era of Big Data, visualizations are becoming crucial, as they enable us to mine our large data sets for patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.
March 13, 2012
· 35,644 Views · 2 Likes
RDF data in Neo4J - the Tinkerpop story
My previous blog post discussed the use of Neo4J as an RDF triple store. Michael Hunger, however, informed me that the neo-rdf-sail component is no longer under active development and advised me to have a look at Tinkerpop's Sail implementation. As mentioned in my previous blog post, I recently got asked to implement a storage and querying platform for biological RDF (Resource Description Framework) data. Traditional RDF stores are not really an option, as my solution should also provide the ability to calculate shortest paths between random subjects. Calculating shortest paths is, however, one of the strong selling points of Graph Databases, and of Neo4J in particular. Unfortunately, the neo-rdf-sail component, which suits my requirements perfectly, is no longer under active development. Tinkerpop's Sail implementation, however, fills the void with an even better alternative!

1. What is Tinkerpop?

Tinkerpop is an open source project that provides an entire stack of technologies within the Graph Database space. At the core of this stack is the Blueprints framework. Blueprints can be considered the JDBC of Graph Databases: by providing a collection of generic interfaces, it allows you to develop graph-based applications without introducing explicit dependencies on concrete Graph Database implementations. Additionally, Blueprints provides concrete bindings for the Neo4J, OrientDB, and Dex Graph Databases. On top of Blueprints, the Tinkerpop team developed an entire range of graph technologies, including Gremlin, a powerful domain-specific language designed for traversing graphs. Hence, once a Blueprints binding is available for a particular Graph Database, an entire range of technologies can be leveraged.

2. Tinkerpop and Sail

Last time, I talked about exposing a Neo4J Graph Database (containing RDF triples) through the Sail interface, which is part of the openrdf.org project.
By doing so, we can reuse an entire range of RDF utilities (parsers and query evaluators) that are part of the openrdf.org project. The Blueprints framework provides us with a similar ability: each Graph Database binding that implements the Tinkerpop TransactionalGraph and IndexableGraph interfaces can be exposed as a GraphSail, which is Tinkerpop's implementation of the Sail interface. Once you have your Sail available, storing and querying RDF is analogous to the piece of code shown in my previous blog article.

```java
// Create the sail graph database
graph = new MyNeo4jGraph("var/flights", 100000);
graph.setTransactionMode(TransactionalGraph.Mode.MANUAL);
sail = new GraphSail(graph);

// Initialize the sail store
sail.initialize();

// Get the sail repository connection
connection = new SailRepository(sail).getConnection();

// Import the data
connection.add(getResource("sneeair.rdf"), null, RDFFormat.RDFXML);

// Execute SPARQL query
TupleQuery durationquery = connection.prepareTupleQuery(QueryLanguage.SPARQL,
        "PREFIX io: " +
        "PREFIX fl: " +
        "SELECT ?number ?departure ?destination " +
        "WHERE { " +
        "?flight io:flight ?number . " +
        "?flight fl:flightFromCityName ?departure . " +
        "?flight fl:flightToCityName ?destination . " +
        "?flight io:duration \"1:35\" . " +
        "}");
TupleQueryResult result = durationquery.evaluate();
```

The first two lines of code require some more clarification. A TransactionalGraph can be run in MANUAL or AUTOMATIC transaction mode. In AUTOMATIC mode, transactions are basically ignored, in the sense that each item that gets created is immediately persisted in the underlying Graph Database. Although this fits my needs, AUTOMATIC mode is extremely slow in the case of Neo4J because of the continuous IO access. MANUAL mode, on the other hand, is very fast: a new transaction is created the moment the import of the RDF data file starts, and it is only committed to the Neo4J data store once all RDF triples are parsed and created.
Unfortunately, MANUAL mode does not scale either in my specific situation: as some of my RDF data files contain over 50 million RDF triples, they cannot fit in memory (i.e. Java heap space errors). Requiring fast imports, I extended the default Neo4J Blueprints binding to support intermediate commits. I based my implementation on Neo4J's best practices for big transactions. The idea is rather simple: you specify the maximum number of items that may be kept in memory before they are committed to the Neo4J data store. Once this number is reached, the current transaction is committed and a new one is automatically started. Simple, but very effective!

```java
public class MyNeo4jGraph extends Neo4jGraph {

    private long numberOfItems = 0;
    private long maxNumberOfItems = 1;

    public MyNeo4jGraph(final String directory, long maxNumberOfItems) {
        super(directory, null);
        this.maxNumberOfItems = maxNumberOfItems;
    }

    public MyNeo4jGraph(final String directory, final Map<String, String> configuration, long maxNumberOfItems) {
        super(directory, configuration);
        this.maxNumberOfItems = maxNumberOfItems;
    }

    public Vertex addVertex(final Object id) {
        Vertex vertex = super.addVertex(id);
        commitIfRequired();
        return vertex;
    }

    public Edge addEdge(final Object id, final Vertex outVertex, final Vertex inVertex, final String label) {
        Edge edge = super.addEdge(id, outVertex, inVertex, label);
        commitIfRequired();
        return edge;
    }

    private void commitIfRequired() {
        // Check whether a commit should be executed
        if (++numberOfItems % maxNumberOfItems == 0) {
            // Stop the transaction
            stopTransaction(Conclusion.SUCCESS);
            // Immediately start a new one
            startTransaction();
        }
    }
}
```

3. Shortest path calculation

Although Blueprints allows you to abstract away the Neo4J implementation details, it still provides access to the raw Neo4J data store when needed. Hence, one can still use the graph algorithms provided in the neo4j-graph-algo component to calculate shortest paths between random subjects.
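The intermediate-commit pattern itself is independent of Neo4J. A minimal, database-agnostic sketch of the counting logic (the class and method names here are hypothetical, chosen only for illustration):

```java
// A minimal, database-agnostic sketch of the intermediate-commit pattern:
// count items as they are added and trigger a commit every N items.
public class BatchCommitter {
    private final long maxNumberOfItems;
    private long numberOfItems = 0;
    private long commits = 0; // how many times the batch boundary was hit

    public BatchCommitter(long maxNumberOfItems) {
        this.maxNumberOfItems = maxNumberOfItems;
    }

    public void addItem() {
        if (++numberOfItems % maxNumberOfItems == 0) {
            commit(); // in the real binding: stopTransaction(...) + startTransaction()
        }
    }

    protected void commit() {
        commits++;
    }

    public long getCommits() {
        return commits;
    }

    public static void main(String[] args) {
        BatchCommitter committer = new BatchCommitter(100000);
        for (int i = 0; i < 250000; i++) committer.addItem();
        System.out.println(committer.getCommits()); // prints 2: two full batches of 100000
    }
}
```

Because the counter is checked on every add, memory usage is bounded by the batch size rather than by the total import size, which is what makes the 50-million-triple import feasible.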
The complete source code can be found on the Datablend public GitHub repository.
October 24, 2011
· 24,516 Views

Comments

JavaScript Events Not Firing

Oct 10, 2012 · Ben Amada

Hi Mark,

That is correct. I'm preparing a blog article that describes the example used in full detail. Just need some time to finish it ;-)

Davy

