The Latest Data Engineering Topics

Amazon EMR Tutorial: Running a Hadoop MapReduce Job Using Custom JAR
See the original post at https://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html

Introduction

Amazon EMR is a web service that makes it easy to process enormous amounts of data efficiently. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon EC2 and Amazon S3. Amazon EMR removes most of the cumbersome details of Hadoop, taking care of provisioning Hadoop, running and terminating the job flow, moving data between Amazon EC2 and Amazon S3, and optimizing Hadoop. In this tutorial, we will develop a WordCount Java example using Hadoop and then execute our program on Amazon Elastic MapReduce.

Prerequisites

You must have valid AWS account credentials. You should also have a general familiarity with the Eclipse IDE before you begin, although any other IDE of your choice will do.

Step 1 – Develop the MapReduce WordCount Java Program

In this section we are first going to develop the WordCount application. A WordCount program determines how many times different words appear in a set of files.

In Eclipse (or whichever IDE you are using), create a simple Java project named "WordCount". Create a Java class named Map and override the map method as follows:

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Create a Java class named Reduce and override the reduce method as shown below:

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Create a Java class named WordCount and define the main method as below:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

Export the WordCount program as a JAR using Eclipse and save it somewhere on disk. Make sure you specify WordCount as the Main Class when exporting the JAR file. Our JAR is ready!

Step 2 – Upload the WordCount JAR and Input Files to Amazon S3

Now we are going to upload the WordCount JAR to Amazon S3. First, go to the following URL: https://console.aws.amazon.com/s3/home Next, click "Create Bucket", give your bucket a name, and click the "Create" button. Select your new S3 bucket in the left-hand pane and upload the WordCount JAR along with a sample input file whose words we want to count.

Step 3 – Running an Elastic MapReduce Job

Now that the JAR is uploaded to S3, all we need to do is create a new job flow. Let's execute the steps below.
(I encourage readers to check out the AWS guide "How to Create a Job Flow Using a Custom JAR" for details regarding each step.)

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/
2. Click Create New Job Flow.
3. On the DEFINE JOB FLOW page, enter the following details:
   a) Job Flow Name = WordCountJob
   b) Select "Run your own application"
   c) Select "Custom JAR" in the drop-down list
   d) Click Continue
4. On the SPECIFY PARAMETERS page, enter values in the boxes using the following as a guide, and then click Continue:
   JAR Location = bucketName/jarFileLocation
   JAR Arguments = s3n://bucketName/inputFileLocation s3n://bucketName/outputpath

Please note that the output path must be unique each time we execute the job; Hadoop always creates a folder with the exact name specified here. After starting the job, just wait and monitor it as it runs through the Hadoop flow. You can also look for errors using the Debug button. The job should complete within 10 to 15 minutes (depending on the size of the input). Once the job has finished, you can view the results in the S3 Browser panel, or download the files from S3 and analyze the outcome locally.

Amazon Elastic MapReduce Resources

Amazon Elastic MapReduce Documentation: http://aws.amazon.com/documentation/elasticmapreduce/
Amazon Elastic MapReduce Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
Amazon Elastic MapReduce Developer Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/
Apache Hadoop: http://hadoop.apache.org/

See more at https://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html
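As an aside (not part of the original tutorial), the same console walkthrough can also be reproduced programmatically with the AWS SDK for Java. The sketch below is only illustrative: the credentials, bucket paths, instance types and Hadoop version are placeholders you would replace with your own.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class RunWordCountJobFlow {
    public static void main(String[] args) {
        AmazonElasticMapReduceClient emr =
                new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // The JAR and arguments mirror what is typed into the SPECIFY PARAMETERS page.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3n://bucketName/WordCount.jar")
                .withArgs("s3n://bucketName/input/", "s3n://bucketName/output/run-001/");

        StepConfig wordCount = new StepConfig()
                .withName("WordCount")
                .withHadoopJarStep(jarStep)
                .withActionOnFailure("TERMINATE_JOB_FLOW");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("WordCountJob")
                .withLogUri("s3n://bucketName/logs/")
                .withSteps(wordCount)
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small")
                        .withHadoopVersion("0.20"));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}

As in the console flow, remember that the output prefix passed as the second argument must not already exist in the bucket.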
April 23, 2012
by Muhammad Ali Khojaye
· 58,641 Views
Face Detection Using HTML5, JavaScript, WebRTC, WebSockets, Jetty and OpenCV
How to create a real-time face detection system using HTML5, JavaScript, and OpenCV, leveraging WebRTC for webcam access and WebSockets for client-server communication.
April 23, 2012
by Jos Dirksen
· 52,131 Views
How-to: Python Data into Graphite for Monitoring Bliss
This post shows code examples in Python (2.7) for sending data to Graphite. Once you have a Graphite server setup, with Carbon running/collecting, you need to send it data for graphing. Basically, you write a program to collect numeric values and send them to Graphite's backend aggregator (Carbon). To send data, you create a socket connection to the graphite/carbon server and send a message (string) in the format: "metric_path value timestamp\n" `metric_path`: arbitrary namespace containing substrings delimited by dots. The most general name is at the left and the most specific is at the right. `value`: numeric value to store. `timestamp`: epoch time. messages must end with a trailing newline. multiple messages maybe be batched and sent in a single socket operation. each message is delimited by a newline, with a trailing newline at the end of the message batch. Example message: "foo.bar.baz 42 74857843\n" Let's look at some (Python 2.7) code for sending data to graphite... Here is a simple client that sends a single message to graphite. Code: #!/usr/bin/env python import socket import time CARBON_SERVER = '0.0.0.0' CARBON_PORT = 2003 message = 'foo.bar.baz 42 %d\n' % int(time.time()) print 'sending message:\n%s' % message sock = socket.socket() sock.connect((CARBON_SERVER, CARBON_PORT)) sock.sendall(message) sock.close() Here is a command line client that sends a single message to graphite: Usage: $ python client-cli.py metric_path value Code: #!/usr/bin/env python import argparse import socket import time CARBON_SERVER = '0.0.0.0' CARBON_PORT = 2003 parser = argparse.ArgumentParser() parser.add_argument('metric_path') parser.add_argument('value') args = parser.parse_args() if __name__ == '__main__': timestamp = int(time.time()) message = '%s %s %d\n' % (args.metric_path, args.value, timestamp) print 'sending message:\n%s' % message sock = socket.socket() sock.connect((CARBON_SERVER, CARBON_PORT)) sock.sendall(message) sock.close() Here is a client that collects load average (Linux-only) and sends a batch of 3 messages (1min/5min/15min loadavg) to graphite. It will run continuously in a loop until killed. (adjust the delay for faster/slower collection interval): #!/usr/bin/env python import platform import socket import time CARBON_SERVER = '0.0.0.0' CARBON_PORT = 2003 DELAY = 15 # secs def get_loadavgs(): with open('/proc/loadavg') as f: return f.read().strip().split()[:3] def send_msg(message): print 'sending message:\n%s' % message sock = socket.socket() sock.connect((CARBON_SERVER, CARBON_PORT)) sock.sendall(message) sock.close() if __name__ == '__main__': node = platform.node().replace('.', '-') while True: timestamp = int(time.time()) loadavgs = get_loadavgs() lines = [ 'system.%s.loadavg_1min %s %d' % (node, loadavgs[0], timestamp), 'system.%s.loadavg_5min %s %d' % (node, loadavgs[1], timestamp), 'system.%s.loadavg_15min %s %d' % (node, loadavgs[2], timestamp) ] message = '\n'.join(lines) + '\n' send_msg(message) time.sleep(DELAY) Resources: Graphite Docs Graphite Docs - Getting Your Data Into Graphite Installing Graphite 0.9.9 on Ubuntu 12.04 LTS Installing and configuring Graphite END
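The plaintext protocol described above is language-agnostic, so the same idea works from any runtime. As a hedged aside for JVM users (not part of the original post, which is Python-only), here is a minimal Java sketch that sends one metric to Carbon; the host and port are the same placeholder defaults used in the Python examples.

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class GraphiteClient {
    public static void main(String[] args) throws Exception {
        String carbonServer = "0.0.0.0";   // placeholder address, as in the Python examples
        int carbonPort = 2003;             // Carbon's default plaintext listener port

        long timestamp = System.currentTimeMillis() / 1000L;    // epoch seconds
        String message = "foo.bar.baz 42 " + timestamp + "\n";  // "metric_path value timestamp\n"

        try (Socket socket = new Socket(carbonServer, carbonPort);
             Writer writer = new OutputStreamWriter(socket.getOutputStream(), "UTF-8")) {
            writer.write(message);  // batches work the same way: newline-delimited lines
            writer.flush();
        }
        System.out.print("sent message:\n" + message);
    }
}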
April 20, 2012
by Corey Goldberg
· 24,961 Views
Back To The Future with Datomic
At the beginning of March, Rich Hickey and his team released Datomic. Datomic is a novel distributed database system designed to enable scalable, flexible and intelligent applications, running on next-generation cloud architectures. Its launch was surrounded with quite some buzz and skepticism, mainly related to its rather disruptive architectural proposal. Instead of trying to recapitulate the various pros and cons of its architectural approach, I will try to focus on the other innovation it introduces, namely its powerful data model (based upon the concept of Datoms) and its expressive query language (based upon the concept of Datalog). The remainder of this article will describe how to store facts and query them through Datalog expressions and rules. Additionally, I will show how Datomic introduces an explicit notion of time, which allows for the execution of queries against both the previous and future states of the database. As an example, I will use a very simple data model that is able to describe genealogical information. As always, the complete source code can be found on the Datablend public GitHub repository. 1. The Datomic data model Datomic stores facts (i.e. your data points) as datoms. A datom represents the addition (or retraction) of a relation between an entity, an attribute, a value, and a transaction. The datom concept is closely related to the concept of a RDF triple, where each triple is a statement about a particular resource in the form of a subject-predicate-object expression. Datomic adds the notion of time by explicitly tagging a datom with a transaction identifier (i.e. the exact time-point at which the fact was persisted into the Datomic database). This allows Datomic to promote data immutability: updates are not changing your existing facts; they are merely creating new datoms that are tagged with a more recent transaction. Hence, the system keeps track of all the facts, forever. Datomic does not enforce an explicit entity schema; it’s up to the user to decide what type of attributes he/she want to store for a particular entity. Attributes are part of the Datomic meta model, which specifies the characteristics (i.e. attributes) of the attributes themselves. Our genealogical example data model stores information about persons and their ancestors. For this, we will require two attributes: name and parent. An attribute is basically an entity, expressed in terms of the built-in system attributes such as cardinality, value type and attribute description. // Open a connection to the database String uri = "datomic:mem://test"; Peer.createDatabase(uri); Connection conn = Peer.connect(uri); // Declare attribute schema List tx = new ArrayList(); tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"), ":db/ident", ":person/name", ":db/valueType", ":db.type/string", ":db/cardinality", ":db.cardinality/one", ":db/doc", "A person's name", ":db.install/_attribute", ":db.part/db")); tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"), ":db/ident", ":person/parent", ":db/valueType", ":db.type/ref", ":db/cardinality", ":db.cardinality/many", ":db/doc", "A person's parent", ":db.install/_attribute", ":db.part/db")); // Store it conn.transact(tx).get(); All entities in a Datomic database need to have an internal key, called the entity id. In our case, we generate a temporary id through the tempid utility method. All entities are stored within a specific database partition that groups together logically related entities. 
Attribute definitions need to reside in the :db.part/db partition, a dedicated system partition employed exclusively for storing system entities and schema definitions. :person/name is a single-valued attribute of value type string. :person/parent is a multi-valued attribute of value type ref. The value of a reference attribute points to (the id) of another entity stored within the Datomic database. Once our attribute schema is persisted, we can start populating our database with concrete person entities. // Define person entities List tx = new ArrayList(); Object edmond = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", edmond, ":person/name", "Edmond Suvee")); Object gilbert = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", gilbert, ":person/name", "Gilbert Suvee", ":person/parent", edmond)); Object davy = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", davy, ":person/name", "Davy Suvee", ":person/parent", gilbert)); // Store them conn.transact(tx).get(); We will create three concrete persons: myself, my dad Gilbert Suvee and my grandfather Edmond Suvee. Similarly to the definition of attributes, we again employ the tempid utility method to retrieve temporary ids for our newly created entities. This time however, we store our persons within the :db.part/user database partition, which is the default partition for storing application entities. Each person is given a name (via the :person/name attribute) and parent (via the :person/parent attribute). When calling the transact method, each entity is translated into a set of individual datoms that together describe the entity. Once persisted, Datomic ensures that temporary ids are replaced with their final counterparts. 2. The Datomic query language Datomic’s query model is an extended form of Datalog. Datalog is a deductive query system which will feel quite familiar to people who have experience with SPARQL and/or Prolog. The declarative query language makes use of a pattern matching mechanism to find all combinations of values (i.e. facts) that satisfy a particular set of conditions expressed as clauses. Let’s have a look at a few example queries: // Find all persons System.out.println(Peer.q("[:find ?name " + ":where [?person :person/name ?name] ]", conn.db())); // Find the parents of all persons System.out.println(Peer.q("[:find ?name ?parentname " + ":where [?person :person/name ?name] " + "[?person :person/parent ?parent] " + "[?parent :person/name ?parentname] ]" , conn.db())); // Find the grandparent of all persons System.out.println(Peer.q("[:find ?name ?grandparentname " + ":where [?person :person/name ?name] " + "[?person :person/parent ?parent] " + "[?parent :person/parent ?grandparent] " + "[?grandparent :person/name ?grandparentname] ]" , conn.db())); We consider entities to be of type person if they own a :person/name attribute. The :where-part of the first query, which aims at finding all persons stored in the Datomic database, specifies the following “conditional” clause: [?person :person/name ?name]. ?person and ?name are variables which act as placeholders. The Datalog query engine retrieves all facts (i.e. datoms) that match this clause. The :find-part of the query specifies the “values” that should be returned as the result of the query. Result query 1: [["Davy Suvee"], ["Edmond Suvee"], ["Gilbert Suvee"]] The second and the third query aim at retrieving the parents and grandparents of all persons stored in the Datomic database. 
These queries specify multiple clauses that are solved through the use of unification: when a variable name is used more than once, it must represent the same value in every clause in order to satisfy the total set of clauses. As expected, only Davy Suvee has been identified as having a grandparent, as the necessary facts to satisfy this query are not available for neither Gilbert Suvee and Edmond Suvee. Result query 2: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]] Result query 3: [["Davy Suvee" "Edmond Suvee"]] If several queries require this “grandparent” notion, one can define a reusable rule that encapsulates the required clauses. Rules can be flexibly combined with clauses (and other rules) in the :where-part of a query. Our third query can be rewritten using the following rules and clauses: String grandparentrule = "[ [ (grandparent ?person ?grandparent) [?person :person/parent ?parent] " + "[?parent :person/parent ?grandparent] ] ]"; System.out.println(Peer.q("[:find ?name ?grandparentname " + ":in $ % " + ":where [?person :person/name ?name] " + "(grandparent ?person ?grandparent) " + "[?grandparent :person/name ?grandparentname] ]" , conn.db(), grandparentrule)); Rules can also be used to write recursive queries. Imagine the ancestor-relationship. It’s impossible to predict the number of parent-levels one needs to go up in order to retrieve the ancestors of a person. As Datomic rules supports the notion of recursion, a rule can call itself within its definition. Similar to recursion in other languages, recursive rules are build up out of a simple base case and a set of clauses which reduce all other cases toward this base case. String ancestorrule = "[ [ (ancestor ?person ?ancestor) [?person :person/parent ?ancestor] ] " + "[ (ancestor ?person ?ancestor) [?person :person/parent ?parent] " + "(ancestor ?parent ?ancestor) ] ] ]"; System.out.println(Peer.q("[:find ?name ?ancestorname " + ":in $ % " + ":where [?person :person/name ?name] " + "[ancestor ?person ?ancestor] " + "[?ancestor :person/name ?ancestorname] ]" , conn.db(), ancestorrule)); Result query 4: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]] 3. Back To The Future I As already mentioned in section 1, Datomic does not perform in-place updates. Instead, all facts are stored and tagged with a transaction such that the most up-to-date value of a particular entity attribute can be retrieved. By doing so, Datomic allows you to travel back into time and perform queries against previous states of the database. Using the asOf method, one can retrieve a version of the database that only contains facts that were part of the database at that particular moment in time. The use of a checkpoint that predates the storage of my own person entity will result in parent-query results that do not longer contain results related to myself. System.out.println(Peer.q("[:find ?name ?parentname " + ":where [?person :person/name ?name] " + "[?person :person/parent ?parent] " + "[?parent :person/name ?parentname] ]", conn.db().asOf(getCheckPoint(checkpoint)))); Result query 2: [["Gilbert Suvee" "Edmond Suvee"]] 4. Back To The Future II Datomic also allows to predict the future. Well, sort of … Similar to the asOf method, one can use the with method to retrieve a version of the database that gets extended with a list of not-yet transacted datoms. This allows to run queries against future states of the database and to observe the implications if these new facts were to be added. 
List tx = new ArrayList(); tx.add(Util.map(":db/id", Peer.tempid(":db.part/user"), ":person/name", "FutureChild Suvee", ":person/parent", Peer.q("[:find ?person :where [?person :person/name \"Davy Suvee\"] ]", conn.db()).iterator().next().get(0))); System.out.println(Peer.q("[:find ?name ?ancestorname " + ":in $ % " + ":where [?person :person/name ?name] " + "[ancestor ?person ?ancestor] " + "[?ancestor :person/name ?ancestorname] ]" , conn.db().with(tx), ancestorrule)); Result query 4: [["FutureChild Suvee" "Edmond Suvee"], ["FutureChild Suvee" "Gilbert Suvee"], ["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"], ["FutureChild Suvee" "Davy Suvee"]] 5. Conclusion The use of Datoms and Datalog allows you to express simple, yet powerful queries. This article introduces only a fraction of the features offered by Datomic. To get myself better acquainted with the various Datomic gotchas, I implemented the Tinkerpop Blueprints API on top of Datomic. By doing so, you basically get a distributed, temporal graph database, which is, as far as I know, unique within the Graph database ecosystem. The source code of this Blueprints implementation can currently be found on the Datablend public GitHub repository and will soon be merged within the Tinkerpop project..
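The query in section 3 relies on a getCheckPoint helper that is not shown in the article. As a hedged sketch (my own assumption, not the author's code), one simple way to obtain such a checkpoint is to record a java.util.Date between transactions, since Database.asOf also accepts a date:

import java.util.Date;
import datomic.Connection;
import datomic.Database;
import datomic.Peer;

public class CheckpointExample {
    public static void queryAtCheckpoint(Connection conn) throws Exception {
        // ... transact Edmond and Gilbert here ...
        Date checkpoint = new Date();   // remember "now", before Davy is added
        // ... transact Davy here ...

        Database past = conn.db().asOf(checkpoint);  // database as it was at the checkpoint
        System.out.println(Peer.q(
                "[:find ?name ?parentname " +
                ":where [?person :person/name ?name] " +
                "[?person :person/parent ?parent] " +
                "[?parent :person/name ?parentname] ]", past));
    }
}

Transaction ids or basis-t values can be used in place of a wall-clock date, but a date keeps this illustration self-contained.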
April 14, 2012
by Davy Suvee
· 17,802 Views
Caching With WCF Services
This is the first part of a two part article about caching in WCF services. In this part I will explain the in-process memory cache available in .NET 4.0. In the second part I will describe the Windows AppFabric distributed memory cache. The .NET framework has provided a cache for ASP.NET applications since version 1.0. For other types of applications like WPF applications or console application, caching was never possible out of the box. Only WCF services were able to use the ASP.NET cache if they were configured to run in ASP.NET compatibility mode. But this mode has some performance drawbacks and only works when the WCF service is hosted inside IIS and uses an HTTP-based binding. With the release of the .NET 4.0 framework this has luckily changed. Microsoft has now developed an in-process memory cache that does not rely on the ASP.NET framework. This cache can be found in the “System.Runtime.Caching.dll” assembly. In order to explain the working of the cache, I have a created a simple sample application. It consists of a very slow repository called “SlowRepository”. public class SlowRepository { public IEnumerable GetPizzas() { Thread.Sleep(10000); return new List() { "Hawaii", "Pepperoni", "Bolognaise" }; } } This repository is used by my sample WCF service to gets its data. public class PizzaService : IPizzaService { private const string CacheKey = "availablePizzas"; private SlowRepository repository; public PizzaService() { this.repository = new SlowRepository(); } public IEnumerable GetAvailablePizzas() { ObjectCache cache = MemoryCache.Default; if(cache.Contains(CacheKey)) return (IEnumerable)cache.Get(CacheKey); else { IEnumerable availablePizzas = repository.GetPizzas(); // Store data in the cache CacheItemPolicy cacheItemPolicy = new CacheItemPolicy(); cacheItemPolicy.AbsoluteExpiration = DateTime.Now.AddHours(1.0); cache.Add(CacheKey, availablePizzas, cacheItemPolicy); return availablePizzas; } } } When the WCF service method GetAvailablePizzas is called, the service first retrieves the default memory cache instance ObjectCache cache = MemoryCache.Default; Next, it checks if the data is already available in the cache. If so, the cached data is used. If not, the repository is called to get the data and afterwards the data is stored in the cache. For my sample service, I also choose to restrict the maximum memory to 20% of the total physical memory. This can be done in the web.config.
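The article stops at "this can be done in the web.config" without showing the snippet. As a hedged illustration (not taken from the original post), a memory limit for the default cache is typically declared through the System.Runtime.Caching configuration section; the polling interval below is an arbitrary example value.

<configuration>
  <system.runtime.caching>
    <memoryCache>
      <namedCaches>
        <!-- "Default" configures the instance returned by MemoryCache.Default -->
        <add name="Default"
             physicalMemoryLimitPercentage="20"
             pollingInterval="00:02:00" />
      </namedCaches>
    </memoryCache>
  </system.runtime.caching>
</configuration>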
April 13, 2012
by Pieter De Rycke
· 21,486 Views · 1 Like
How to Use Sigma.js with Neo4j
i’ve done a few posts recently using d3.js and now i want to show you how to use two other great javascript libraries to visualize your graphs. we’ll start with sigma.js and soon i’ll do another post with three.js . we’re going to create our graph and group our nodes into five clusters. you’ll notice later on that we’re going to give our clustered nodes colors using rgb values so we’ll be able to see them move around until they find their right place in our layout. we’ll be using two sigma.js plugins, the gefx (graph exchange xml format) parser and the forceatlas2 layout. you can see what a gefx file looks like below. notice it comes from gephi which is an interactive visualization and exploration platform, which runs on all major operating systems, is open source, and is free. ... ... in order to build this file, we will need to get the nodes and edges from the graph and create an xml file. get '/graph.xml' do @nodes = nodes @edges = edges builder :graph end we’ll use cypher to get our nodes and edges: def nodes neo = neography::rest.new cypher_query = " start node = node:nodes_index(type='user')" cypher_query << " return id(node), node" neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0]}.merge(n[1]["data"])} end we need the node and relationship ids, so notice i’m using the id() function in both cases. def edges neo = neography::rest.new cypher_query = " start source = node:nodes_index(type='user')" cypher_query << " match source -[rel]-> target" cypher_query << " return id(rel), id(source), id(target)" neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0], "source" => n[1], "target" => n[2]} } end so far we have seen graphs represented as json, and we’ve built these manually. today we’ll take advantage of the builder ruby gem to build our graph in xml. xml.instruct! :xml xml.gexf 'xmlns' => "http://www.gephi.org/gexf", 'xmlns:viz' => "http://www.gephi.org/gexf/viz" do xml.graph 'defaultedgetype' => "directed", 'idtype' => "string", 'type' => "static" do xml.nodes :count => @nodes.size do @nodes.each do |n| xml.node :id => n["id"], :label => n["name"] do xml.tag!("viz:size", :value => n["size"]) xml.tag!("viz:color", :b => n["b"], :g => n["g"], :r => n["r"]) xml.tag!("viz:position", :x => n["x"], :y => n["y"]) end end end xml.edges :count => @edges.size do @edges.each do |e| xml.edge:id => e["id"], :source => e["source"], :target => e["target"] end end end end you can get the code on github as usual and see it running live on heroku. you will want to see it live on heroku so you can see the nodes in random positions and then move to form clusters. use your mouse wheel to zoom in, and click and drag to move around. credit goes out to alexis jacomy and mathieu jacomy . you’ve seen me create numerous random graphs, but for completeness here is the code for this graph. notice how i create 5 clusters and for each node i assign half its relationships to other nodes in their cluster and half to random nodes? this is so the forceatlas2 layout plugin clusters our nodes neatly. 
def create_graph neo = neography::rest.new graph_exists = neo.get_node_properties(1) return if graph_exists && graph_exists['name'] names = 500.times.collect{|x| generate_text} clusters = 5.times.collect{|x| {:r => rand(256), :g => rand(256), :b => rand(256)} } commands = [] names.each_index do |n| cluster = clusters[n % clusters.size] commands << [:create_node, {:name => names[n], :size => 5.0 + rand(20.0), :r => cluster[:r], :g => cluster[:g], :b => cluster[:b], :x => rand(600) - 300, :y => rand(150) - 150 }] end names.each_index do |from| commands << [:add_node_to_index, "nodes_index", "type", "user", "{#{from}"] connected = [] # create clustered relationships members = 20.times.collect{|x| x * 10 + (from % clusters.size)} members.delete(from) rels = 3 rels.times do |x| to = members[x] connected << to commands << [:create_relationship, "follows", "{#{from}", "{#{to}"] unless to == from end # create random relationships rels = 3 rels.times do |x| to = rand(names.size) commands << [:create_relationship, "follows", "{#{from}", "{#{to}"] unless (to == from) || connected.include?(to) end end batch_result = neo.batch *commands end
April 12, 2012
by Max De Marzi
· 15,084 Views
F1 Live Timing Map
this is a live timing map application for f1 championship races made using javascript and google maps markers. the live timing data is supplied by formula1.com. it’s interactive, you can press over a driver to track him or press into an empty map zone to untrack and have a general view. it has also been made with a responsive design to adapt it to mobile browsers using jquerymobile framework. how it works: the client side: until the race start date a countdown and a demo race is showed. when the countdown finishes it will connect to server (using ajax) to get the live timing data from server (every five seconds) and the interface will be updated using this data. the server side: it uses a django app for the web page and the static race data (circuit, laps, drivers) is put into the html using the django template system. for the dynamic data (live timing) i have modified the source of a c program for the linux terminal called live-f1 to generate a json with the data that the client requires instead of printing it on terminal screen. enjoy the race!
April 12, 2012
by Luis Sobrecueva
· 15,191 Views
A Regular Expression HashMap Implementation in Java
Below is an implementation of a regular-expression HashMap. It works with key-value pairs in which the key is a regular expression. The key (regular expression) is compiled while adding (i.e. putting), so there is no compilation cost while getting. When getting an element, you don't pass a regular expression; you pass any string that a stored regular expression can match. As a result, the numerous strings matched by one regular expression all map onto the same value. The class does not depend on any external libraries and uses only the default java.util package, so it can simply be dropped in wherever such behaviour is required.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.regex.Pattern;

/**
 * This class is an extended version of the Java HashMap
 * and includes pattern-value lists which are used to
 * evaluate regular expression values. If a given item
 * is a regular expression, it is saved in the regexp lists.
 * If a requested item matches a regular expression,
 * its value is taken from the regexp lists.
 *
 * @author cb
 *
 * @param <K> Key of the map item.
 * @param <V> Value of the map item.
 */
public class RegExHashMap<K, V> extends HashMap<K, V> {

    // list of regular expression patterns
    private ArrayList<Pattern> regExPatterns = new ArrayList<Pattern>();

    // list of regular expression values which match patterns
    private ArrayList<V> regExValues = new ArrayList<V>();

    /**
     * Compile the regular expression and add it to the regexp list as key.
     */
    @Override
    public V put(K key, V value) {
        regExPatterns.add(Pattern.compile(key.toString()));
        regExValues.add(value);
        return value;
    }

    /**
     * If the requested key matches a regular expression,
     * return its value from the regexp lists.
     */
    @Override
    public V get(Object key) {
        CharSequence cs = key.toString();
        for (int i = 0; i < regExPatterns.size(); i++) {
            if (regExPatterns.get(i).matcher(cs).matches()) {
                return regExValues.get(i);
            }
        }
        return super.get(key);
    }
}
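A short usage sketch (not from the original post) shows the intended behaviour: several concrete strings that match the same pattern all resolve to the same value.

public class RegExHashMapDemo {
    public static void main(String[] args) {
        RegExHashMap<String, String> map = new RegExHashMap<String, String>();
        map.put("ERR-\\d+", "known error code");          // key is a regular expression
        map.put("[a-z]+@example\\.com", "internal address");

        // Lookups use concrete strings, not patterns.
        System.out.println(map.get("ERR-42"));            // -> known error code
        System.out.println(map.get("ERR-1234"));          // -> known error code
        System.out.println(map.get("bob@example.com"));   // -> internal address
        System.out.println(map.get("no match here"));     // -> null (falls back to HashMap.get)
    }
}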
April 11, 2012
by Cagdas Basaraner
· 24,425 Views
Configuring Quartz With JDBCJobStore in Spring
I am starting a little series about Quartz scheduler internals, tips and tricks, this is chapter 0 - how to configure persistent job store.
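Only the teaser survives in this listing, so as a hedged sketch (my own illustration, not the author's chapter) here is what a JDBCJobStore setup commonly looks like with Spring's SchedulerFactoryBean. The data source, delegate class and table prefix are assumptions that must match the Quartz DDL you installed.

import java.util.Properties;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.quartz.SchedulerFactoryBean;

@Configuration
public class QuartzConfig {

    @Bean
    public SchedulerFactoryBean scheduler(DataSource dataSource) {
        SchedulerFactoryBean factory = new SchedulerFactoryBean();
        // Handing Spring a DataSource switches Quartz to a JDBC-backed (persistent) job store.
        factory.setDataSource(dataSource);
        factory.setOverwriteExistingJobs(true);

        Properties quartzProperties = new Properties();
        // Delegate and table prefix are assumptions; pick the delegate for your database.
        quartzProperties.setProperty("org.quartz.jobStore.driverDelegateClass",
                "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        quartzProperties.setProperty("org.quartz.jobStore.tablePrefix", "QRTZ_");
        factory.setQuartzProperties(quartzProperties);

        return factory;
    }
}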
April 7, 2012
by Tomasz Nurkiewicz
· 37,129 Views
Algorithm of the Week: Rabin-Karp String Searching
Brute force string matching is a very basic sub-string matching algorithm, but it’s good for some reasons. For example it doesn’t require preprocessing of the text or the pattern. The problem is that it’s very slow. That is why in many cases brute force matching can’t be very useful. For pattern matching we need something faster, but to understand other sub-string matching algorithms let’s take a look once again at brute force matching. In brute force sub-string matching we checked every single character from the text with the first character of the pattern. Once we have a match between them we shift the comparison between the second character of the pattern with the next character of the text, as shown on the picture below. This algorithm is slow for mainly two reasons. First, we have to check every single character from the text. On the other hand even if we find a match between a text character and the first character of the pattern we continue to check step by step (character by character) every single symbol of the pattern in order to find whether it is in the text. So is there any other approach to find whether the text contains the pattern? In fact there is a “faster” approach. In this case, in order to avoid the comparison between the pattern and the text character by character, we’ll try to compare them all at once, so we need a good hash function. With its help we can hash the pattern and check against hashed sub-strings of the text. We must be sure that the hash function is returning “small” hash codes for larger sub-strings. Another problem is that for larger patterns we can’t expect to have short hashes. But besides this the approach should be quite effective compared to the brute force string matching. This approach is known as Rabin-Karp algorithm. Overview Michael O. Rabin and Richard M. Karp came up with the idea of hashing the pattern and to check it against a hashed sub-string from the text in 1987. In general the idea seems quite simple, the only thing is that we need a hash function that gives different hashes for different sub-strings. Said hash function, for instance, may use the ASCII codes for every character, but we must be careful for multi-lingual support. The hash function may vary depending on many things, so it may consist of ASCII char to number converting, but it can also be anything else. The only thing we need is to convert a string (pattern) into some hash that is faster to compare. Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”. So in every step, we take from the text a sub-string with the length of m, where m is the pattern length. Thus we hash this sub-string and we can directly compare it to the hashed pattern, as in the picture above. Implementation So far we saw some diagrams explaining the Rabin-Karp algorithm, but let’s take a look at its implementation here, in this very basic example where a simple hash table is used in order to convert the characters into integers. The code is PHP and it’s used only to illustrate the principles of this algorithm. 
function hash_string($str, $len) { $hash = ''; $hash_table = array( 'h' => 1, 'e' => 2, 'l' => 3, 'o' => 4, 'w' => 5, 'r' => 6, 'd' => 7, ); for ($i = 0; $i < $len; $i++) { $hash .= $hash_table[$str{$i}]; } return (int)$hash; } function rabin_karp($text, $pattern) { $n = strlen($text); $m = strlen($pattern); $text_hash = hash_string(substr($text, 0, $m), $m); $pattern_hash = hash_string($pattern, $m); for ($i = 0; $i < $n-$m+1; $i++) { if ($text_hash == $pattern_hash) { return $i; } $text_hash = hash_string(substr($text, $i, $m), $m); } return -1; } // 2 echo rabin_karp('hello world', 'ello'); Multiple Pattern Match It’s great to say that the Rabin-Karp algorithm is great for multiple pattern match. Indeed its nature is supposed to support such functionality, which is its advantage in comparison to other string searching algorithms. Complexity The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. So where is it compared to brute-force matching? Well, brute force matching complexity is O(nm), so as it seems there’s not much of a gain in performance. However, it’s considered that Rabin-Karp’s complexity is O(n+m) in practice, and that makes it a bit faster, as shown on the chart below. Note that the Rabin-Karp algorithm also needs O(m) preprocessing time. Application As we saw Rabin-Karp is not much faster than brute force matching. So where we should use it? 3 Reasons Why Rabin-Karp is Cool 1. Good for plagiarism, because it can deal with multiple pattern matching! 2. Not faster than brute force matching in theory, but in practice its complexity is O(n+m)! 3. With a good hashing function it can be quite effective and it’s easy to implement! 2 Reasons Why Rabin-Karp is Not Cool 1. There are lots of string matching algorithms that are faster than O(n+m) 2. It’s practically as slow as brute force matching and it requires additional space Final Words Rabin-Karp is a great algorithm for one simple reason – it can be used to match against multiple patterns. This makes it perfect to detect plagiarism even for larger phrases.
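The PHP example above rehashes every window from scratch, which is why it stays at O(nm). As a hedged illustration (my own sketch, not from the article), here is the usual rolling-hash version in Java, where the hash of the next window is derived from the previous one in constant time.

public class RabinKarp {
    private static final int BASE = 256;        // alphabet size
    private static final int MOD = 1_000_003;   // a prime modulus keeps the hashes small

    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1;                           // BASE^(m-1) % MOD, weight of the leading char
        for (int i = 0; i < m - 1; i++) high = (high * BASE) % MOD;

        long patternHash = 0, windowHash = 0;
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }

        for (int i = 0; i + m <= n; i++) {
            // On a hash hit, verify character by character to rule out collisions.
            if (windowHash == patternHash && text.regionMatches(i, pattern, 0, m)) {
                return i;
            }
            if (i + m < n) {
                // Roll the hash: drop text[i], shift, and append text[i + m].
                windowHash = (windowHash - text.charAt(i) * high % MOD + MOD) % MOD;
                windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("hello world", "ello")); // 1
    }
}

Note that "ello" actually begins at index 1 of "hello world"; the PHP snippet reports 2 because it recomputes the hash for position $i instead of $i + 1 at the end of each loop iteration.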
April 3, 2012
by Stoimen Popov
· 36,427 Views
Wrapping Begin/End Async API Into C#5 Tasks
Microsoft offered programmers several different ways of dealing with the asynchronous programming since .NET 1.0. The first model was Asynchronous programming model or APM for short. The pattern is implemented with two methods named BeginOperation and EndOperation. .NET 4 introduced new pattern – Task Asynchronous Pattern and with the introduction of .NET 4.5, Microsoft added language support for language integrated asynchronous coding style. You can check the MSDN for more samples and information. I will assume that you are familiar with it and have written code using it. You can wrap existing APM pattern into TPL pattern using the Task.Factory.FromAsync methods. For example: public static Task> ExecuteAsync(this DataServiceQuery query, object state) { return Task.Factory.FromAsync>(query.BeginExecute, query.EndExecute, state); } It is easy to wrap most of the asynchronous functions this way, but some cannot be since the wrapper functions assume that the last two parameters to the BeginOperation are AsyncCallback and object, and there are some versions of asynchronous operations that have different specifications. Examples: Extra parameters after the object state parameter: IAsyncResult DataServiceContext.BeginExecuteBatch( AsyncCallback callback, object state, params DataServiceRequest[] queries); Missing the expected object state parameter and different return type: ICancelableAsyncResult BeginQuery(AsyncCallback callBack); WorkItemCollection EndQuery(ICancelableAsyncResult car); Short solution for the first example The short and elegant way for wrapping the first example is to provide the following wrapper: public static Task ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries) { if (context == null) throw new ArgumentNullException("context"); return Task.Factory.FromAsync( context.BeginExecuteBatch(null, state, queries), context.EndExecuteBatch); } We simply call the Begin method ourselves and then wrap it using an another overload for FromAsync function. The longer way However, we can fully wrap it ourselves by simulating what the FromAsync wrapper does. The complete code is listed below. public static Task ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries) { // this will be our sentry that will know when our async operation is completed var tcs = new TaskCompletionSource(); try { context.BeginExecuteBatch((iar) => { try { var result = context.EndExecuteBatch(iar as ICancelableAsyncResult); tcs.TrySetResult(result); } catch (OperationCanceledException ex) { // if the inner operation was canceled, this task is cancelled too tcs.TrySetCanceled(); } catch (Exception ex) { // general exception has been set bool flag = tcs.TrySetException(ex); if (flag && ex as ThreadAbortException != null) { tcs.Task.m_contingentProperties.m_exceptionsHolder.MarkAsHandled(false); } } }, state, queries); } catch { tcs.TrySetResult(default(DataServiceResponse)); // propagate exceptions to the outside throw; } return tcs.Task; } Besides educational benefits, writing the full wrapper code allows us to add cancellation, logging and diagnostic information. Once we understand how to wrap APM pattern, We can now tackle the second problem easily. Handling the BeginQuery/EndQuery We will first create our own wrapper function in the spirit of the above code with the notable difference that we use the ICancelableAsyncResult interface instead of the IAsyncResult. 
public static class TaskEx { public static Task FromAsync(Func beginMethod, Func endMethod) { if (beginMethod == null) throw new ArgumentNullException("beginMethod"); if (endMethod == null) throw new ArgumentNullException("endMethod"); var tcs = new TaskCompletionSource(); try { beginMethod((iar) => { try { var result = endMethod(iar as ICancelableAsyncResult); tcs.TrySetResult(result); } catch (OperationCanceledException ex) { tcs.TrySetCanceled(); } catch (Exception ex) { bool flag = tcs.TrySetException(ex); if (flag && ex as ThreadAbortException != null) { tcs.Task.m_contingentProperties.m_exceptionsHolder.MarkAsHandled(false); } } }); } catch { tcs.TrySetResult(default(TResult)); throw; } return tcs.Task; } } The code is pretty self-explanatory and we can go ahead with the wrapping. There are four different operations that are exposed both in synchronous and asynchronous version: Query, LinkQuery, CountOnlyQuery and RegularQuery. The extension methods are short since we have already created our generic wrapper above: public static Task RunQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginQuery, query.EndQuery); } public static Task RunLinkQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginLinkQuery, query.EndLinkQuery); } public static Task RunCountOnlyQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginCountOnlyQuery, query.EndCountOnlyQuery); } public static Task RunRegularQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginRegularQuery, query.EndRegularQuery); } That is it for today, you can write your own handy extensions easily for APM functions out there.
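For comparison, the same wrapping idea exists on the JVM: a callback-based API can be adapted to a CompletableFuture in much the same way TaskCompletionSource is used above. This is a hedged Java sketch of the general pattern, not the article's C#; the LegacyClient and BatchCallback types are hypothetical stand-ins for any Begin/End- or callback-style API.

import java.util.concurrent.CompletableFuture;

public class AsyncAdapter {

    // Hypothetical callback-style API, standing in for a Begin/End-style operation.
    public interface LegacyClient {
        void executeBatch(BatchCallback callback);
    }

    public interface BatchCallback {
        void onSuccess(String result);
        void onFailure(Exception error);
    }

    // Adapter: the CompletableFuture plays the role of the TaskCompletionSource.
    public static CompletableFuture<String> executeBatchAsync(LegacyClient client) {
        CompletableFuture<String> future = new CompletableFuture<>();
        try {
            client.executeBatch(new BatchCallback() {
                @Override public void onSuccess(String result) {
                    future.complete(result);              // like tcs.TrySetResult(...)
                }
                @Override public void onFailure(Exception error) {
                    future.completeExceptionally(error);  // like tcs.TrySetException(...)
                }
            });
        } catch (Exception e) {
            future.completeExceptionally(e);              // propagate synchronous failures
        }
        return future;
    }
}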
April 2, 2012
by Toni Petrina
· 10,893 Views
Converting a Value to String in JavaScript
In JavaScript, there are three main ways in which any value can be converted to a string. This blog post explains each way, along with its advantages and disadvantages. Three approaches for converting to string The three approaches for converting to string are: value.toString() "" + value String(value) The problem with approach #1 is that it doesn’t work if the value is null or undefined. That leaves us with approaches #2 and #3, which are basically equivalent. ""+value: The plus operator is fine for converting a value when it is surrounded by non-empty strings. As a way for converting a value to string, I find it less descriptive of one’s intentions. But that is a matter of taste, some people prefer this approach to String(value). String(value): This approach is nicely explicit: Apply the function String() to value. The only problem is that this function call will confuse some people, especially those coming from Java, because String is also a constructor. However, function and constructor produce completely different results: > String("abc") === new String("abc") false > typeof String("abc") 'string' > String("abc") instanceof String false > typeof new String("abc") 'object' > new String("abc") instanceof String true The function produces, as promised, a string (a primitive [1]). The constructor produces an instance of the type String (an object). The latter is hardly ever useful in JavaScript, which is why you can usually forget about String as a constructor and concentrate on its role as converting to string. A minor difference between ""+value and String(value) Until now you have heard that + and String() convert their “argument” to string. But how do they actually do that? It turns out that they do it in slightly different ways, but usually arrive at the same result. Converting primitives to string Both approaches use the internal ToString() operation to convert primitives to string. “Internal” means: a function specified by the ECMAScript 5.1 (§9.8) that isn’t accessible to the language itself. The following table explains how ToString() operates on primitives. Argument Result undefined "undefined" null "null" boolean value either "true" or "false" number value the number as a string, e.g. "1.765" string value no conversion necessary Converting objects to string Both approaches first convert an object to a primitive, before converting that primitive to string. However, + uses the internal ToNumber() operator (except for dates [2]), while String() uses ToString(). ToNumber(): To convert an object obj to a primitive, invoke obj.valueOf(). If the result is primitive, return that result. Otherwise, invoke obj.toString(). If the result is primitive, return that result. Otherwise, throw a TypeError. ToString(): Works the same, but invokes obj.toString() before obj.valueOf(). With the following object, you can observe the difference: var obj = { valueOf: function () { console.log("valueOf"); return {}; // not a primitive, keep going }, toString: function () { console.log("toString"); return {}; // not a primitive, keep going } }; Interaction: > "" + obj valueOf toString TypeError: Cannot convert object to primitive value > String(obj) toString valueOf TypeError: Cannot convert object to primitive value Most objects use the default implementation of valueOf() which returns this for objects. Hence, that method will always be skipped by ToNumber(). > var x = {} > x.valueOf() === x true Instances of Boolean, Number, and String wrap primitives and valueOf returns the wrapped primitive. 
But that still means that the final result will be the same as for toString(), even though it will have been produced in a different manner. > var n = new Number(756) > n.valueOf() === n false > n.valueOf() === 756 true Conclusion Which of the three approaches for converting to string should you choose? value.toString() can be OK, if you are sure that value will never be null or undefined. Otherwise, ""+value and String(value) are mostly equivalent. Which one people prefer is a matter of taste. I find String(value) more explicit. Related posts JavaScript values: not everything is an object [primitives versus objects] What is {} + {} in JavaScript? [explains how the + operator works] String concatenation in JavaScript [how to best concatenate many strings]
March 25, 2012
by Axel Rauschmayer
· 31,376 Views · 2 Likes
Using "Natural": A NLP Module for node.js
Like most node modules "natural" is packaged as an NPM and can be installed from the command line with node.js.
March 27, 2012
by Christopher Umbel
· 63,151 Views · 3 Likes
Algorithm of the Week: Brute Force String Matching
String matching is something crucial for database development and text processing software. Fortunately, every modern programming language and library is full of functions for string processing that help us in our everyday work. However it's important to understand their principles. String algorithms can typically be divided into several categories. One of these categories is string matching. When it comes to string matching, the most basic approach is what is known as brute force, which simply means to check every single character from the text to match against the pattern. In general we have a text and a pattern (most commonly shorter than the text). What we need to do is to answer the question whether this pattern appears in the text. Overview The principles of brute force string matching are quite simple. We must check for a match between the first characters of the pattern with the first character of the text as on the picture bellow. If they don’t match, we move forward to the second character of the text. Now we compare the first character of the pattern with the second character of the text. If they don’t match again, we move forward until we get a match or until we reach the end of the text. In case they match, we move forward to the second character of the pattern comparing it with the “next” character of the text, as shown in the picture bellow. Just because we have found a match between the first character from the pattern and some character of the text, doesn’t mean that the pattern appears in the text. We must move forward to see whether the full pattern is contained in the text. Implementation Implementation of brute force string matching is easy and here we can see a short PHP example. The bad news is that this algorithm is naturally quite slow. function sub_string($pattern, $subject) { $n = strlen($subject); $m = strlen($pattern); for ($i = 0; i < $n-$m; $i++) { $j = 0; while ($j < $m && $subject[$i+$j] == $pattern[$j]) { $j++; } if ($j == $m) return $i; } return -1; } echo sub_string('o wo', 'hello world!'); Complexity As I said this algorithm is slow. Actually every algorithm that contains “brute force” in its name is slow, but to show how slow string matching is, I can say that its complexity is O(n.m). Here n is the length of the text, while m is the length of the pattern. In case we fix the length of the text and test against variable length of the pattern, again we get a rapidly growing function. Application Brute force string matching can be very ineffective, but it can also be very handy in some cases. Just like the sequential search. It can be very useful… Doesn’t require pre-processing of the text – Indeed if we search the text only once we don’t need to pre-process it. Most of the algorithms for string matching need to build an index of the text in order to search quickly. This is great when you’ve to search more than once into a text, but if you do only once, perhaps (for short texts) brute force matching is great! Doesn’t require additional space – Because brute force matching doesn’t need pre-processing it also doesn’t require more space, which is one cool feature of this algorithm Can be quite effective for short texts and patterns It can be ineffective… If we search the text more than once – As I said in the previous section if you perform the search more than once it’s perhaps better to use another string matching algorithm that builds an index, and it’s faster. 
It’s slow – In general brute force algorithms are slow and brute force matching isn’t an exception. Final Words String matching is something very special in software development and it is used in various cases, so every developer must be familiar with this topic.
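Since the PHP snippet above has a small bug (the loop condition reads "i < $n-$m", missing the $, and the bound should include position $n - $m), here is a hedged Java rendering of the same brute-force loop, added for clarity rather than taken from the article.

public class BruteForceMatch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    public static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        for (int i = 0; i <= n - m; i++) {      // inclusive upper bound: last valid start is n - m
            int j = 0;
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
                j++;                            // advance while characters keep matching
            }
            if (j == m) {
                return i;                       // the full pattern matched at position i
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("hello world!", "o wo")); // 4
    }
}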
March 27, 2012
by Stoimen Popov
· 61,462 Views · 3 Likes
Cassandra Indexing: The Good, the Bad and the Ugly
Within NoSQL, the operations of indexing, fetching and searching for information are intimately tied to the physical storage mechanisms. It is important to remember that rows are stored across hosts, but a single row is stored on a single host. (with replicas) Columns families are stored in sorted order, which makes querying a set of columns efficient (provided you are spanning rows). The Bad : Partitioning One of the tough things to get used to at first is that without any indexes queries that span rows can (very) be bad. Thinking back to our storage model however, that isn't surprising. The strategy that Cassandra uses to distribute the rows across hosts is called Partitioning. Partitioning is the act of carving up the range of rowkeys assigning them into the "token ring", which also assigns responsibility for a segment (i.e. partition) of the rowkey range to each host. You've probably seen this when you initialized your cluster with a "token". The token gives the host a location along the token ring, which assigns responsibility for a section of the token range. Partitioning is the act of mapping the rowkey into the token range. There are two primary partitioners: Random and Order Preserving. They are appropriately named. The RandomPartitioner hashes the rowkeys into tokens. With the RandomPartitioner, the token is a hash of the rowkey. This does a good job of evenly distributing your data across a set of nodes, but makes querying a range of the rowkey space incredibly difficult. From only a "start rowkey" value and an "end rowkey" value, Cassandra can't determine what range of the token space you need. It essentially needs to perform a "table scan" to answer the query, and a "table scan" in Cassandra is bad because it needs to go to each machine (most likely ALL machines if you have a good hash function) to answer the query. Now, at the great cost of even data distribution, you can employ the OrderPreservingPartitioner (OPP). I am *not* down with OPP. The OPP preserves order as it translates rowkeys into tokens. Now, given a start rowkey value and a end rowkey value, Cassandra *can* determine exactly which hosts have the data you are looking for. It computes the start value to a token the end value to a token, and simply selects and returns everything in between. BUT, by preserving order, unless your rowkeys are evenly distributed across the space, your tokens won't be either and you'll get a lopsided cluster, which greatly increases the cost of configuration and administration of the cluster. (not worth it) The Good : Secondary Indexes Cassandra does provide a native indexing mechanism in Secondary Indexes. Secondary Indexes work off of the columns values. You declare a secondary index on a Column Family. Datastax has good documentation on the usage. Under the hood, Cassandra maintains a "hidden column family" as the index. (See Ed Anuff's presentation for specifics) Since Cassandra doesn't maintain column value information in any one node, and secondary indexes are on columns value (rather than rowkeys), a query still needs to be sent to all nodes. Additionally, secondary indexes are not recommended for high-cardinality sets. I haven't looked yet, but I'm assuming this is because of the data model used within the "hidden column family". If the hidden column family stores a row per unique value (with rowkeys as columns), then it would mean scanning the rows to determine if they are within the range in the query. 
From Ed's presentation: Not recommended for high cardinality values(i.e.timestamps,birthdates,keywords,etc.) Requires at least one equality comparison in a query--not great for less-than/greater-than/range queries Unsorted - results are in token order, not query value order Limited to search on datatypes, Cassandra natively understands With all that said, secondary indexes work out of the box and we've had good success using them on simple values. The Ugly : Do-It-Yourself (DIY) / Wide-Rows Now, beauty is in the eye of the beholder. One of the beautiful things about NoSQL is the simplicity. The constructs are simple: Keyspaces, Column Families, Rows and Columns. Keeping it simple however means sometimes you need to take things into your own hands. This is the case with wide-row indexes. Utilizing Cassandra's storage model, its easy to build your own indexes where each row-key becomes a column in the index. This is sometimes hard to get your head around, but lets imagine we have a case whereby we want to select all users in a zip code. The main users column family is keyed on userid, zip code is a column on each user row. We could use secondary indexes, but there are quite a few zip codes. Instead we could maintain a column family with a single row called "idx_zipcode". We could then write columns into this row of the form "zipcode_userid". Since the columns are stored in sorted order, it is fast to query for all columns that start with "18964" (e.g. we could use 18964_ and 18964_ZZZZZZ as start and end values). One obvious downside of this approach is that rows are self-contained on a host. (again except for replicas) This means that all queries are going to hit a single node. I haven't yet found a good answer for this. Additionally, and IMHO, the ugliest part of DIY wide-row indexing is from a client perspective. In our implementation, we've done our best to be language agnostic on the client-side, allowing people to pick the best tool for the job to interact with the data in Cassandra. With that mentality, the DIY indexes present some trouble. Wide-rows often use composite keys (imagine if you had an idx_state_zip, which would allow you to query by state then zip). Although there is "native" support for composite keys, all of the client libraries implement their own version of them (Hector, Astyanax, and Thrift). This means that client needing to query data needs to have the added logic to first query the index, and additionally all clients need to construct the composite key in the same manner. Making It Better... For this very reason, we've decided to release two open source projects that help push this logic to the server-side. The first project is Cassandra-Triggers. This allows you to attached asynchronous activities to writes in Cassandra. (one such activity could be indexing) We've also released Cassandra-Indexing. This is hot off the presses and is still in its infancy (e.g. it only supports UT8Types in the index), but the intent is to provide a generic server-side mechanism that indexes data as its written to Cassandra. Employing the same server-side technique we used in Cassandra-Indexing, you simply configure the columns you want indexed, and the AOP code does the rest as you write to the target CF. As always, questions, comments and thoughts are welcome. (especially if I'm off-base somewhere)
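To make the wide-row idea concrete without tying it to a particular client library (Hector, Astyanax and Thrift each expose this differently, as noted above), here is a hedged, client-agnostic Java sketch of how the index entries and slice bounds for the zip-code example might be built. The delimiter and upper-bound character are assumptions, not the author's implementation.

import java.util.ArrayList;
import java.util.List;

public class WideRowIndexSketch {

    static final String INDEX_ROW_KEY = "idx_zipcode";
    static final char DELIMITER = '_';

    // Column name written into the idx_zipcode row for one user, e.g. "18964_user42".
    static String indexColumnName(String zipCode, String userId) {
        return zipCode + DELIMITER + userId;
    }

    // Start and end column names for a slice query over one zip code.
    // Columns are stored in sorted order, so everything between these bounds matches.
    static String sliceStart(String zipCode) {
        return zipCode + DELIMITER;
    }

    static String sliceEnd(String zipCode) {
        return zipCode + DELIMITER + "\uFFFF";  // sorts after any userId suffix
    }

    public static void main(String[] args) {
        // "Writing" the index: one column per user in the zip code.
        List<String> indexColumns = new ArrayList<>();
        indexColumns.add(indexColumnName("18964", "user42"));
        indexColumns.add(indexColumnName("18964", "user7"));
        indexColumns.add(indexColumnName("19002", "user13"));

        // "Querying" the index: whichever client you use, issue a column slice
        // on row idx_zipcode between these two bounds.
        System.out.println("slice " + INDEX_ROW_KEY + " from " + sliceStart("18964")
                + " to " + sliceEnd("18964"));
    }
}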
March 23, 2012
by Brian O'Neill
· 35,096 Views
PHP objects in MongoDB with Doctrine
An ODM (Object Document Mapper) is the equivalent of an Object-Relational Mapper, but its targets are documents of a NoSQL database instead of table rows. No one said that a Data Mapper must always rely on a relational database as its back end. In the PHP world, the Doctrine ODM for MongoDB is probably the most successful. This follows from the popularity of MongoDB, which is a transitional product between SQL and NoSQL, still based on some relational concepts like queries.

Lots of features
The Doctrine Mongo ODM supports mapping of objects via annotations placed in the class source code, or via external XML or YAML files. In this and in many other aspects it is based on the same concepts as the Doctrine ORM: it features a Facade DocumentManager object and a Unit of Work that batches changes to the database when objects are added to it. Moreover, two different types of relationships between objects are supported: references and embedded documents. The first is the equivalent of the classical pointer to another row, which ORMs always transform object references into; the second actually stores an object inside another one, like you would do with a Value Object. Thus, at least in Doctrine's case, it is easier to map objects as documents than as rows.

As said before, the ODM borrows some concepts and classes from the ORM, in particular from the Doctrine\Common package, which features a standard collection class. So if you have built objects mapped with the Doctrine ORM, nothing changes for persisting them in MongoDB, except for the mapping metadata itself.

Advantages
If an ORM is sometimes a leaky abstraction, an ODM probably becomes an issue less often. It has less overhead than an ORM, since there is no schema to define, and the ability to embed objects means there should be no compromises between the object model and the capabilities of the database. How many times have we renounced introducing a potential Value Object because of the difficulty of persisting it?

The case for an ODM over a plain Mongo connection object is easy to make: you will still be able to use objects with proper encapsulation (like private fields and associations) and behavior (many methods) instead of extracting just a JSON document from your database.

Installation
A prerequisite for the ODM is the presence of the mongo extension, which can be installed via pecl. After having verified the extension is present, grab the Doctrine\Common 2.2.x package, and a zip of the doctrine-mongodb and doctrine-mongodb-odm projects from Github. Decompress everything into a Doctrine/ folder. After having set up autoloading for classes in Doctrine\, use this bootstrap to get a DocumentManager (the equivalent of EntityManager):

use Doctrine\Common\Annotations\AnnotationReader,
    Doctrine\ODM\MongoDB\DocumentManager,
    Doctrine\MongoDB\Connection,
    Doctrine\ODM\MongoDB\Configuration,
    Doctrine\ODM\MongoDB\Mapping\Driver\AnnotationDriver;

private function getADm()
{
    $config = new Configuration();
    $config->setProxyDir(__DIR__ . '/mongocache');
    $config->setProxyNamespace('MongoProxies');
    $config->setDefaultDB('test');
    $config->setHydratorDir(__DIR__ . '/mongocache');
    $config->setHydratorNamespace('MongoHydrators');
    $reader = new AnnotationReader();
    $config->setMetadataDriverImpl(new AnnotationDriver($reader, __DIR__ . '/Documents'));
    return DocumentManager::create(new Connection(), $config);
}

You will be able to call persist() and flush() on the DocumentManager, along with a set of other methods for querying, like find() and getRepository().
Integration with an ORM
We are researching a solution for versioning objects mapped with the Doctrine ORM. Doing this with a version column would be invasive, and also strange where multiple objects are involved (do you version just the root of an object graph? Duplicate the other ones when they change? How can you detect that?) The idea is taking a snapshot and putting it in a read-only MongoDB instance, where all previous versions can be retrieved later for auditing (business reasons).

This has been verified to be technically possible: the DocumentManager and EntityManager are totally separate object graphs, so they won't clash with each other. The only point of conflict is the annotations of model classes, since both use different versions of @Id, and each can see the other's annotations, like @Entity and @Document, while parsing. This can be solved by using aliases for all the annotations, using their parent namespace basename as a prefix:

use Doctrine\ORM\Mapping as ORM;
use Doctrine\ODM\MongoDB\Mapping\Annotations as ODM;

/** @ORM\Entity @ODM\Document */
class Car
{
    /** @ORM\Id @ORM\Column(type="integer") @ORM\GeneratedValue */
    private $id;

    /** @ODM\Id */
    private $document_id;

    /** @ORM\Column(type="string") @ODM\String */
    private $model;

    public function __construct($model)
    {
        $this->model = $model;
    }

    public function __toString()
    {
        return "Car #$this->document_id: $this->id, $this->model";
    }
}

This makes it possible to save a copy of an ORM object into Mongo:

$car = new Car('Ford');
$this->em->persist($car);
$this->em->flush();
$this->dm->persist($car);
$this->dm->flush();
var_dump($car->__toString());
$this->assertTrue(strlen($car->__toString()) > 20);

The output produced by this test is:

.string(38) "Car #4f61a8322f762f1121000000: 3, Ford"

When retrieving the object, one of the two ids will be null, as it is ignored by the ORM or the ODM. I am not using the same field for both because I want to store multiple copies of a row, so its id alone won't be unique.

If you're interested, check out my hack on Github. It contains the running example presented in this post. Remember to create the relational schema with:

$ php doctrine.php orm:schema-tool:create

before running the test with:

phpunit --bootstrap bootstrap.php DoubleMappingTest.php

MongoDB won't need the schema setup, of course. There are still some use cases to test, like the behavior in the presence of proxies, but it seems that the non-invasive approach of Data Mappers like Doctrine 2 is paying off: try mapping an object into multiple databases with Active Record.
March 20, 2012
by Giorgio Sironi
· 21,955 Views
Adding a .first() method to Django's QuerySet
In my last Django project, we had a set of helper functions that we used a lot. The most used was helpers.first, which takes a query set and returns the first element, or None if the query set was empty. Instead of writing this:

try:
    object = MyModel.objects.get(key=value)
except MyModel.DoesNotExist:
    object = None

you can write this:

def first(query):
    try:
        return query.all()[0]
    except IndexError:
        return None

object = helpers.first(MyModel.objects.filter(key=value))

Note that this is not identical. The get method will ensure that there is exactly one row in the database that matches the query. The helpers.first() method will silently eat all but the first matching row. As long as you're aware of that, you might choose to use the second form in some cases, primarily for style reasons.

But the syntax on the helper is a little verbose, plus you're constantly importing helpers.py. Here is a version that makes this available as a method on the end of your query set chain. All you have to do is have your models inherit from this AbstractModel.

class FirstQuerySet(models.query.QuerySet):

    def first(self):
        try:
            return self[0]
        except IndexError:
            return None


class ManagerWithFirstQuery(models.Manager):

    def get_query_set(self):
        return FirstQuerySet(self.model)


class AbstractModel(models.Model):

    objects = ManagerWithFirstQuery()

    class Meta:
        abstract = True


class MyModel(AbstractModel):
    ...

Now, you can do the following:

object = MyModel.objects.filter(key=value).first()
March 19, 2012
by Chase Seibert
· 12,362 Views
GapList – a Lightning-Fast List Implementation
This article introduces GapList, a List implementation that strives to combine the strengths of both ArrayList and LinkedList.
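As a quick taste, the snippet below sketches how GapList is meant to slot in wherever a java.util.List is used today. The org.magicwerk.brownies.collections package name and the no-argument constructor are assumptions about the library rather than details taken from the article, so verify them against the library's documentation.

import java.util.List;
import org.magicwerk.brownies.collections.GapList;

public class GapListDemo {
    public static void main(String[] args) {
        // GapList implements java.util.List, so it can replace an ArrayList
        // without touching the rest of the code.
        List<String> names = new GapList<String>();
        names.add("alpha");
        names.add("omega");
        names.add(1, "beta");  // inserts near recently used positions are cheap in GapList

        for (String name : names) {
            System.out.println(name);
        }
    }
}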
March 19, 2012
by Thomas Mauch
· 64,353 Views · 4 Likes
Defensive Programming vs. Batshit Crazy Paranoid Programming
Hey, let’s be careful out there.
--Sergeant Esterhaus, daily briefing to the force of Hill Street Blues

When developers run into an unexpected bug and can’t fix it, they’ll “add some defensive code” to make the code safer and to make the problem easier to find. Sometimes just doing this will make the problem go away. They’ll tighten up data validation – making sure to check input and output fields and return values. They’ll review and improve error handling – maybe add some checking around “impossible” conditions. They’ll add some helpful logging and diagnostics. In other words, the kind of code that should have been there in the first place.

Expect the Unexpected
The whole point of defensive programming is guarding against errors you don’t expect.
---Steve McConnell, Code Complete

The few basic rules of defensive programming are explained in a short chapter in Steve McConnell’s classic book on programming, Code Complete:
  • Protect your code from invalid data coming from “outside”, wherever you decide “outside” is: data from an external system or the user or a file, or any data from outside of the module/component.
  • Establish “barricades” or “safe zones” or “trust boundaries” – everything outside of the boundary is dangerous, everything inside of the boundary is safe. In the barricade code, validate all input data: check all input parameters for the correct type, length, and range of values, and double-check limits and bounds. (A small sketch of such a barricade appears at the end of this post.)
  • After you have checked for bad data, decide how to handle it. Defensive Programming is NOT about swallowing errors or hiding bugs. It’s about deciding on the trade-off between robustness (keep running if there is a problem you can deal with) and correctness (never return inaccurate results). Choose a strategy to deal with bad data – return an error and stop right away (fast fail), return a neutral value, substitute data values, … – and make sure that the strategy is clear and consistent.
  • Don’t assume that a function call or method call outside of your code will work as advertised. Make sure that you understand and test the error handling around external APIs and libraries.
  • Use assertions to document assumptions and to highlight “impossible” conditions, at least in development and testing. This is especially important in large systems that have been maintained by different people over time, or in high-reliability code.
  • Add diagnostic code, logging and tracing intelligently to help explain what’s going on at run time, especially if you run into a problem.
  • Standardize error handling. Decide how to handle “normal errors” or “expected errors” and warnings, and do all of this consistently.
  • Use exception handling only when you need to, and make sure that you understand the language’s exception handling inside out.

Programs that use exceptions as part of their normal processing suffer from all the readability and maintainability problems of classic spaghetti code.
--The Pragmatic Programmer

I would add a couple of other rules. From Michael Nygard’s Release It!: never ever wait forever on an external call, especially a remote call – forever can be a long time when something goes wrong. Use time-out/retry logic and his Circuit Breaker stability pattern to deal with remote failures. And for languages like C and C++, defensive programming also includes using safe function calls to avoid buffer overflows and common coding mistakes.

Different Kinds of Paranoia
The Pragmatic Programmer describes defensive programming as “Pragmatic Paranoia”.
  • Protect your code from other people’s mistakes, and your own mistakes.
  • If in doubt, validate. Check for data consistency and integrity.
  • You can’t test for every error, so use assertions and exception handlers for things that “can’t happen”.
  • Learn from failures in test and production – if this failed, look for what else can fail.
  • Focus on critical sections of code – the core, the code that runs the business.

Healthy Paranoid Programming is the right kind of programming. But paranoia can be taken too far. In the Error Handling chapter of Clean Code, Michael Feathers cautions that

many code bases are dominated by error handling
--Michael Feathers, Clean Code

Too much error handling code not only obscures the main path of the code (what the code is actually trying to do), but it also obscures the error handling logic itself – so that it is harder to get it right, harder to review and test, and harder to change without making mistakes. Instead of making the code more resilient and safer, it can actually make the code more error-prone and brittle.

There’s healthy paranoia, then there’s over-the-top error checking, and then there’s bat shit crazy crippling paranoia – where defensive programming takes over and turns in on itself.

The first real world system I worked on was a “Store and Forward” network control system for servers (they were called minicomputers back then) across the US and Canada. It shared data between distributed systems, scheduled jobs, and coordinated reporting across the network. It was designed to be resilient to network problems and automatically recover and restart from operational failures. This was ground breaking stuff at the time, and a hell of a technical challenge.

The original programmer on this system didn’t trust the network, didn’t trust the O/S, didn’t trust Operations, didn’t trust other people’s code, and didn’t trust his own code – for good reason. He was a chemical engineer turned self-taught system programmer who drank a lot while coding late at night and wrote thousands of lines of unstructured FORTRAN and Assembler under the influence. The code was full of error checking and self diagnostics and error-correcting code, the files and data packets had their own checksums and file-level passwords and hidden control labels, and there was lots of code to handle sequence accounting exceptions and timing-related problems – code that mostly worked most of the time. If something went wrong that it couldn’t recover from, programs would crash and report a “label of exit” and dump the contents of variables – like today’s stack traces. You could theoretically use this information to walk back through the code to figure out what the hell happened. None of this looked anything like anything that I learned about in school. Reading and working with this code was like programming your way out of Arkham Asylum.

If the programmer ran into bugs and couldn’t fix them, that wouldn’t stop him. He would find a way to work around the bugs and make the system keep running. Then later, after he left the company, I would find and fix a bug and congratulate myself until it broke some “error-correcting” code somewhere else in the network that now depended on the bug being there. So after I finally figured out what was going on, I took out as much of this “protection” as I could safely remove, and cleaned up the error handling so that I could actually maintain the system without losing what was left of my mind.
I set up trust boundaries for the code – although I didn’t know that’s what it was called then – deciding what data couldn’t be trusted and what could. Once this was done, I was able to simplify the defensive code so that I could make changes without the system falling over itself, and still protect the core code from bad data, mistakes in the rest of the code, and operational problems.

Making code safer is simple
The point of defensive coding is to make the code safer and to help whoever is going to maintain and support the code – not to make their job harder. Defensive code is code – all code has bugs, and, because defensive code is dealing with exceptions, it is especially hard to test and to be sure that it will work when it has to. Understanding what conditions to check for and how much defensive coding is needed takes experience, working with code in production and seeing what can go wrong in the real world.

A lot of the work involved in designing and building secure, resilient systems is technically difficult or expensive. Defensive programming is neither – like defensive driving, it’s something that everyone can understand and do. It requires discipline and awareness and attention to detail, but it’s something that we all need to do if we want to make the world safe.
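To make the “barricade” idea above concrete, here is a minimal sketch of fail-fast validation at a trust boundary. The OrderRequest type, its fields, and the limits are made up for illustration; real boundary checks come from your own business rules.

public final class OrderRequestBarricade {

    // Illustrative limit; a real limit comes from the business rules.
    private static final int MAX_QUANTITY = 1_000;

    /** Fail fast: reject bad input at the boundary instead of letting it leak inward. */
    public static OrderRequest validate(String customerId, String symbol, int quantity) {
        if (customerId == null || customerId.isEmpty() || customerId.length() > 36) {
            throw new IllegalArgumentException("customerId missing or length out of range");
        }
        if (symbol == null || !symbol.matches("[A-Z]{1,5}")) {
            throw new IllegalArgumentException("symbol must be 1-5 uppercase letters");
        }
        if (quantity <= 0 || quantity > MAX_QUANTITY) {
            throw new IllegalArgumentException("quantity out of range: " + quantity);
        }
        // Past this point, code inside the barricade can trust the data.
        return new OrderRequest(customerId, symbol, quantity);
    }

    /** Trusted value object that only the barricade can construct. */
    public static final class OrderRequest {
        final String customerId;
        final String symbol;
        final int quantity;

        private OrderRequest(String customerId, String symbol, int quantity) {
            this.customerId = customerId;
            this.symbol = symbol;
            this.quantity = quantity;
        }
    }
}

Whether you throw, return a neutral value, or substitute a default at this point is the robustness-versus-correctness trade-off described above; the important part is picking one strategy and applying it consistently.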
March 19, 2012
by Jim Bird
· 23,915 Views
Hadoop Basics—Creating a MapReduce Program
The MapReduce framework processes data in two main phases: the "map" phase and the "reduce" phase.
March 18, 2012
by Carlo Scarioni
· 212,269 Views · 4 Likes