DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
Amazon EMR Tutorial: Running a Hadoop MapReduce Job Using Custom JAR
See original post at https://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html Introduction Amazon EMR is a web service which can be used to easily and efficiently process enormous amounts of data. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon EC2 and Amazon S3. Amazon EMR removes most of the cumbersome details of Hadoop while taking care of provisioning of Hadoop, running the job flow, terminating the job flow, moving the data between Amazon EC2 and Amazon S3, and optimizing Hadoop. In this tutorial, we will use a developed WordCount Java example using Hadoop and thereafter, we execute our program on Amazon Elastic MapReduce. Prerequisites You must have valid AWS account credentials. You should also have a general familiarity with using the Eclipse IDE before you begin. The reader can also use any other IDE of their choice. Step 1 – Develop MapReduce WordCount Java Program In this section, we are first going to develop a WordCount application. A WordCount program will determine how many times different words appear in a set of files. In Eclipse (or whatever the IDE you are using), Create simple Java Project with the name "WordCount". Create a java class name Map and override the map method as follow, public class Map extends Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); @Override public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } Create a java class named Reduce and override the reduce method as shown below, public class Reduce extends Reducer { @Override protected void reduce(Text key, java.lang.Iterable values, org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable value : values) { sum += value.get(); } context.write(key, new IntWritable(sum)); } } Create a java class named WordCount and defined the main method as below, public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setJarByClass(WordCount.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } Export the WordCount program in a jar using eclipse and save it to some location on disk. Make sure that you have provided the Main Class (WordCount.jar) during extraction ofu8u the jar file as shown below. Our jar is ready!!! Step 2 – Upload the WordCount JAR and Input Files to Amazon S3 Now we are going to upload the WordCount jar to Amazon S3. First, go to the following URL: https://console.aws.amazon.com/s3/home Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane. Upload the WordCount JAR and sample input file for counting the words. Step 3 – Running an Elastic MapReduce job Now that the JAR is uploaded into S3, all we need to do is to create a new Job flow. let's execute the steps below. (I encourage readers to check out the following link for details regarding each step, How to Create a Job Flow Using a Custom JAR ) Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/ Click Create New Job Flow. In the DEFINE JOB FLOW page, enter the following details, a) Job Flow Name = WordCountJob b) Select Run your own applications) Select Custom JAR in the drop-down list) Click Continue In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide, and then click Continue.JAR Location = bucketName/jarFileLocationJAR Arguments =s3n://bucketName/inputFileLocations3n://bucketName/outputpath Please note that the output path must be unique each time we execute the job. The Hadoop always create a folder with the same name specified here. After executing the job, just wait and monitor your job that runs through the Hadoop flow. You can also look for errors by using the Debug button. The job should be complete within 10 to 15 minutes (can also depend on the size of the input). After completing the job, You can view results in the S3 Browser panel. You can also download the files from S3 and can analyze the outcome of the job. Amazon Elastic MapReduce Resources Amazon Elastic MapReduce Documentation,http://aws.amazon.com/documentation/elasticmapreduce/ Amazon Elastic MapReduce Getting Started Guide,http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/ Amazon Elastic MapReduce Developer Guide,http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ Apache Hadoop,http://hadoop.apache.org/ See more at https://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html
April 23, 2012
by Muhammad Ali Khojaye
· 59,054 Views
article thumbnail
How-to: Python Data into Graphite for Monitoring Bliss
This post shows code examples in Python (2.7) for sending data to Graphite. Once you have a Graphite server setup, with Carbon running/collecting, you need to send it data for graphing. Basically, you write a program to collect numeric values and send them to Graphite's backend aggregator (Carbon). To send data, you create a socket connection to the graphite/carbon server and send a message (string) in the format: "metric_path value timestamp\n" `metric_path`: arbitrary namespace containing substrings delimited by dots. The most general name is at the left and the most specific is at the right. `value`: numeric value to store. `timestamp`: epoch time. messages must end with a trailing newline. multiple messages maybe be batched and sent in a single socket operation. each message is delimited by a newline, with a trailing newline at the end of the message batch. Example message: "foo.bar.baz 42 74857843\n" Let's look at some (Python 2.7) code for sending data to graphite... Here is a simple client that sends a single message to graphite. Code: #!/usr/bin/env python import socket import time CARBON_SERVER = '0.0.0.0' CARBON_PORT = 2003 message = 'foo.bar.baz 42 %d\n' % int(time.time()) print 'sending message:\n%s' % message sock = socket.socket() sock.connect((CARBON_SERVER, CARBON_PORT)) sock.sendall(message) sock.close() Here is a command line client that sends a single message to graphite: Usage: $ python client-cli.py metric_path value Code: #!/usr/bin/env python import argparse import socket import time CARBON_SERVER = '0.0.0.0' CARBON_PORT = 2003 parser = argparse.ArgumentParser() parser.add_argument('metric_path') parser.add_argument('value') args = parser.parse_args() if __name__ == '__main__': timestamp = int(time.time()) message = '%s %s %d\n' % (args.metric_path, args.value, timestamp) print 'sending message:\n%s' % message sock = socket.socket() sock.connect((CARBON_SERVER, CARBON_PORT)) sock.sendall(message) sock.close() Here is a client that collects load average (Linux-only) and sends a batch of 3 messages (1min/5min/15min loadavg) to graphite. It will run continuously in a loop until killed. (adjust the delay for faster/slower collection interval): #!/usr/bin/env python import platform import socket import time CARBON_SERVER = '0.0.0.0' CARBON_PORT = 2003 DELAY = 15 # secs def get_loadavgs(): with open('/proc/loadavg') as f: return f.read().strip().split()[:3] def send_msg(message): print 'sending message:\n%s' % message sock = socket.socket() sock.connect((CARBON_SERVER, CARBON_PORT)) sock.sendall(message) sock.close() if __name__ == '__main__': node = platform.node().replace('.', '-') while True: timestamp = int(time.time()) loadavgs = get_loadavgs() lines = [ 'system.%s.loadavg_1min %s %d' % (node, loadavgs[0], timestamp), 'system.%s.loadavg_5min %s %d' % (node, loadavgs[1], timestamp), 'system.%s.loadavg_15min %s %d' % (node, loadavgs[2], timestamp) ] message = '\n'.join(lines) + '\n' send_msg(message) time.sleep(DELAY) Resources: Graphite Docs Graphite Docs - Getting Your Data Into Graphite Installing Graphite 0.9.9 on Ubuntu 12.04 LTS Installing and configuring Graphite END
April 20, 2012
by Corey Goldberg
· 25,291 Views
article thumbnail
Back To The Future with Datomic
At the beginning of March, Rich Hickey and his team released Datomic. Datomic is a novel distributed database system designed to enable scalable, flexible and intelligent applications, running on next-generation cloud architectures. Its launch was surrounded with quite some buzz and skepticism, mainly related to its rather disruptive architectural proposal. Instead of trying to recapitulate the various pros and cons of its architectural approach, I will try to focus on the other innovation it introduces, namely its powerful data model (based upon the concept of Datoms) and its expressive query language (based upon the concept of Datalog). The remainder of this article will describe how to store facts and query them through Datalog expressions and rules. Additionally, I will show how Datomic introduces an explicit notion of time, which allows for the execution of queries against both the previous and future states of the database. As an example, I will use a very simple data model that is able to describe genealogical information. As always, the complete source code can be found on the Datablend public GitHub repository. 1. The Datomic data model Datomic stores facts (i.e. your data points) as datoms. A datom represents the addition (or retraction) of a relation between an entity, an attribute, a value, and a transaction. The datom concept is closely related to the concept of a RDF triple, where each triple is a statement about a particular resource in the form of a subject-predicate-object expression. Datomic adds the notion of time by explicitly tagging a datom with a transaction identifier (i.e. the exact time-point at which the fact was persisted into the Datomic database). This allows Datomic to promote data immutability: updates are not changing your existing facts; they are merely creating new datoms that are tagged with a more recent transaction. Hence, the system keeps track of all the facts, forever. Datomic does not enforce an explicit entity schema; it’s up to the user to decide what type of attributes he/she want to store for a particular entity. Attributes are part of the Datomic meta model, which specifies the characteristics (i.e. attributes) of the attributes themselves. Our genealogical example data model stores information about persons and their ancestors. For this, we will require two attributes: name and parent. An attribute is basically an entity, expressed in terms of the built-in system attributes such as cardinality, value type and attribute description. // Open a connection to the database String uri = "datomic:mem://test"; Peer.createDatabase(uri); Connection conn = Peer.connect(uri); // Declare attribute schema List tx = new ArrayList(); tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"), ":db/ident", ":person/name", ":db/valueType", ":db.type/string", ":db/cardinality", ":db.cardinality/one", ":db/doc", "A person's name", ":db.install/_attribute", ":db.part/db")); tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"), ":db/ident", ":person/parent", ":db/valueType", ":db.type/ref", ":db/cardinality", ":db.cardinality/many", ":db/doc", "A person's parent", ":db.install/_attribute", ":db.part/db")); // Store it conn.transact(tx).get(); All entities in a Datomic database need to have an internal key, called the entity id. In our case, we generate a temporary id through the tempid utility method. All entities are stored within a specific database partition that groups together logically related entities. Attribute definitions need to reside in the :db.part/db partition, a dedicated system partition employed exclusively for storing system entities and schema definitions. :person/name is a single-valued attribute of value type string. :person/parent is a multi-valued attribute of value type ref. The value of a reference attribute points to (the id) of another entity stored within the Datomic database. Once our attribute schema is persisted, we can start populating our database with concrete person entities. // Define person entities List tx = new ArrayList(); Object edmond = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", edmond, ":person/name", "Edmond Suvee")); Object gilbert = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", gilbert, ":person/name", "Gilbert Suvee", ":person/parent", edmond)); Object davy = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", davy, ":person/name", "Davy Suvee", ":person/parent", gilbert)); // Store them conn.transact(tx).get(); We will create three concrete persons: myself, my dad Gilbert Suvee and my grandfather Edmond Suvee. Similarly to the definition of attributes, we again employ the tempid utility method to retrieve temporary ids for our newly created entities. This time however, we store our persons within the :db.part/user database partition, which is the default partition for storing application entities. Each person is given a name (via the :person/name attribute) and parent (via the :person/parent attribute). When calling the transact method, each entity is translated into a set of individual datoms that together describe the entity. Once persisted, Datomic ensures that temporary ids are replaced with their final counterparts. 2. The Datomic query language Datomic’s query model is an extended form of Datalog. Datalog is a deductive query system which will feel quite familiar to people who have experience with SPARQL and/or Prolog. The declarative query language makes use of a pattern matching mechanism to find all combinations of values (i.e. facts) that satisfy a particular set of conditions expressed as clauses. Let’s have a look at a few example queries: // Find all persons System.out.println(Peer.q("[:find ?name " + ":where [?person :person/name ?name] ]", conn.db())); // Find the parents of all persons System.out.println(Peer.q("[:find ?name ?parentname " + ":where [?person :person/name ?name] " + "[?person :person/parent ?parent] " + "[?parent :person/name ?parentname] ]" , conn.db())); // Find the grandparent of all persons System.out.println(Peer.q("[:find ?name ?grandparentname " + ":where [?person :person/name ?name] " + "[?person :person/parent ?parent] " + "[?parent :person/parent ?grandparent] " + "[?grandparent :person/name ?grandparentname] ]" , conn.db())); We consider entities to be of type person if they own a :person/name attribute. The :where-part of the first query, which aims at finding all persons stored in the Datomic database, specifies the following “conditional” clause: [?person :person/name ?name]. ?person and ?name are variables which act as placeholders. The Datalog query engine retrieves all facts (i.e. datoms) that match this clause. The :find-part of the query specifies the “values” that should be returned as the result of the query. Result query 1: [["Davy Suvee"], ["Edmond Suvee"], ["Gilbert Suvee"]] The second and the third query aim at retrieving the parents and grandparents of all persons stored in the Datomic database. These queries specify multiple clauses that are solved through the use of unification: when a variable name is used more than once, it must represent the same value in every clause in order to satisfy the total set of clauses. As expected, only Davy Suvee has been identified as having a grandparent, as the necessary facts to satisfy this query are not available for neither Gilbert Suvee and Edmond Suvee. Result query 2: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]] Result query 3: [["Davy Suvee" "Edmond Suvee"]] If several queries require this “grandparent” notion, one can define a reusable rule that encapsulates the required clauses. Rules can be flexibly combined with clauses (and other rules) in the :where-part of a query. Our third query can be rewritten using the following rules and clauses: String grandparentrule = "[ [ (grandparent ?person ?grandparent) [?person :person/parent ?parent] " + "[?parent :person/parent ?grandparent] ] ]"; System.out.println(Peer.q("[:find ?name ?grandparentname " + ":in $ % " + ":where [?person :person/name ?name] " + "(grandparent ?person ?grandparent) " + "[?grandparent :person/name ?grandparentname] ]" , conn.db(), grandparentrule)); Rules can also be used to write recursive queries. Imagine the ancestor-relationship. It’s impossible to predict the number of parent-levels one needs to go up in order to retrieve the ancestors of a person. As Datomic rules supports the notion of recursion, a rule can call itself within its definition. Similar to recursion in other languages, recursive rules are build up out of a simple base case and a set of clauses which reduce all other cases toward this base case. String ancestorrule = "[ [ (ancestor ?person ?ancestor) [?person :person/parent ?ancestor] ] " + "[ (ancestor ?person ?ancestor) [?person :person/parent ?parent] " + "(ancestor ?parent ?ancestor) ] ] ]"; System.out.println(Peer.q("[:find ?name ?ancestorname " + ":in $ % " + ":where [?person :person/name ?name] " + "[ancestor ?person ?ancestor] " + "[?ancestor :person/name ?ancestorname] ]" , conn.db(), ancestorrule)); Result query 4: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]] 3. Back To The Future I As already mentioned in section 1, Datomic does not perform in-place updates. Instead, all facts are stored and tagged with a transaction such that the most up-to-date value of a particular entity attribute can be retrieved. By doing so, Datomic allows you to travel back into time and perform queries against previous states of the database. Using the asOf method, one can retrieve a version of the database that only contains facts that were part of the database at that particular moment in time. The use of a checkpoint that predates the storage of my own person entity will result in parent-query results that do not longer contain results related to myself. System.out.println(Peer.q("[:find ?name ?parentname " + ":where [?person :person/name ?name] " + "[?person :person/parent ?parent] " + "[?parent :person/name ?parentname] ]", conn.db().asOf(getCheckPoint(checkpoint)))); Result query 2: [["Gilbert Suvee" "Edmond Suvee"]] 4. Back To The Future II Datomic also allows to predict the future. Well, sort of … Similar to the asOf method, one can use the with method to retrieve a version of the database that gets extended with a list of not-yet transacted datoms. This allows to run queries against future states of the database and to observe the implications if these new facts were to be added. List tx = new ArrayList(); tx.add(Util.map(":db/id", Peer.tempid(":db.part/user"), ":person/name", "FutureChild Suvee", ":person/parent", Peer.q("[:find ?person :where [?person :person/name \"Davy Suvee\"] ]", conn.db()).iterator().next().get(0))); System.out.println(Peer.q("[:find ?name ?ancestorname " + ":in $ % " + ":where [?person :person/name ?name] " + "[ancestor ?person ?ancestor] " + "[?ancestor :person/name ?ancestorname] ]" , conn.db().with(tx), ancestorrule)); Result query 4: [["FutureChild Suvee" "Edmond Suvee"], ["FutureChild Suvee" "Gilbert Suvee"], ["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"], ["FutureChild Suvee" "Davy Suvee"]] 5. Conclusion The use of Datoms and Datalog allows you to express simple, yet powerful queries. This article introduces only a fraction of the features offered by Datomic. To get myself better acquainted with the various Datomic gotchas, I implemented the Tinkerpop Blueprints API on top of Datomic. By doing so, you basically get a distributed, temporal graph database, which is, as far as I know, unique within the Graph database ecosystem. The source code of this Blueprints implementation can currently be found on the Datablend public GitHub repository and will soon be merged within the Tinkerpop project..
April 14, 2012
by Davy Suvee
· 18,300 Views
article thumbnail
How to Use Sigma.js with Neo4j
i’ve done a few posts recently using d3.js and now i want to show you how to use two other great javascript libraries to visualize your graphs. we’ll start with sigma.js and soon i’ll do another post with three.js . we’re going to create our graph and group our nodes into five clusters. you’ll notice later on that we’re going to give our clustered nodes colors using rgb values so we’ll be able to see them move around until they find their right place in our layout. we’ll be using two sigma.js plugins, the gefx (graph exchange xml format) parser and the forceatlas2 layout. you can see what a gefx file looks like below. notice it comes from gephi which is an interactive visualization and exploration platform, which runs on all major operating systems, is open source, and is free. ... ... in order to build this file, we will need to get the nodes and edges from the graph and create an xml file. get '/graph.xml' do @nodes = nodes @edges = edges builder :graph end we’ll use cypher to get our nodes and edges: def nodes neo = neography::rest.new cypher_query = " start node = node:nodes_index(type='user')" cypher_query << " return id(node), node" neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0]}.merge(n[1]["data"])} end we need the node and relationship ids, so notice i’m using the id() function in both cases. def edges neo = neography::rest.new cypher_query = " start source = node:nodes_index(type='user')" cypher_query << " match source -[rel]-> target" cypher_query << " return id(rel), id(source), id(target)" neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0], "source" => n[1], "target" => n[2]} } end so far we have seen graphs represented as json, and we’ve built these manually. today we’ll take advantage of the builder ruby gem to build our graph in xml. xml.instruct! :xml xml.gexf 'xmlns' => "http://www.gephi.org/gexf", 'xmlns:viz' => "http://www.gephi.org/gexf/viz" do xml.graph 'defaultedgetype' => "directed", 'idtype' => "string", 'type' => "static" do xml.nodes :count => @nodes.size do @nodes.each do |n| xml.node :id => n["id"], :label => n["name"] do xml.tag!("viz:size", :value => n["size"]) xml.tag!("viz:color", :b => n["b"], :g => n["g"], :r => n["r"]) xml.tag!("viz:position", :x => n["x"], :y => n["y"]) end end end xml.edges :count => @edges.size do @edges.each do |e| xml.edge:id => e["id"], :source => e["source"], :target => e["target"] end end end end you can get the code on github as usual and see it running live on heroku. you will want to see it live on heroku so you can see the nodes in random positions and then move to form clusters. use your mouse wheel to zoom in, and click and drag to move around. credit goes out to alexis jacomy and mathieu jacomy . you’ve seen me create numerous random graphs, but for completeness here is the code for this graph. notice how i create 5 clusters and for each node i assign half its relationships to other nodes in their cluster and half to random nodes? this is so the forceatlas2 layout plugin clusters our nodes neatly. def create_graph neo = neography::rest.new graph_exists = neo.get_node_properties(1) return if graph_exists && graph_exists['name'] names = 500.times.collect{|x| generate_text} clusters = 5.times.collect{|x| {:r => rand(256), :g => rand(256), :b => rand(256)} } commands = [] names.each_index do |n| cluster = clusters[n % clusters.size] commands << [:create_node, {:name => names[n], :size => 5.0 + rand(20.0), :r => cluster[:r], :g => cluster[:g], :b => cluster[:b], :x => rand(600) - 300, :y => rand(150) - 150 }] end names.each_index do |from| commands << [:add_node_to_index, "nodes_index", "type", "user", "{#{from}"] connected = [] # create clustered relationships members = 20.times.collect{|x| x * 10 + (from % clusters.size)} members.delete(from) rels = 3 rels.times do |x| to = members[x] connected << to commands << [:create_relationship, "follows", "{#{from}", "{#{to}"] unless to == from end # create random relationships rels = 3 rels.times do |x| to = rand(names.size) commands << [:create_relationship, "follows", "{#{from}", "{#{to}"] unless (to == from) || connected.include?(to) end end batch_result = neo.batch *commands end
April 12, 2012
by Max De Marzi
· 15,413 Views
article thumbnail
F1 Live Timing Map
this is a live timing map application for f1 championship races made using javascript and google maps markers. the live timing data is supplied by formula1.com. it’s interactive, you can press over a driver to track him or press into an empty map zone to untrack and have a general view. it has also been made with a responsive design to adapt it to mobile browsers using jquerymobile framework. how it works: the client side: until the race start date a countdown and a demo race is showed. when the countdown finishes it will connect to server (using ajax) to get the live timing data from server (every five seconds) and the interface will be updated using this data. the server side: it uses a django app for the web page and the static race data (circuit, laps, drivers) is put into the html using the django template system. for the dynamic data (live timing) i have modified the source of a c program for the linux terminal called live-f1 to generate a json with the data that the client requires instead of printing it on terminal screen. enjoy the race!
April 12, 2012
by Luis Sobrecueva
· 16,088 Views
article thumbnail
Configuring Quartz With JDBCJobStore in Spring
I am starting a little series about Quartz scheduler internals, tips and tricks, this is chapter 0 - how to configure persistent job store.
April 7, 2012
by Tomasz Nurkiewicz
· 37,724 Views
article thumbnail
Wrapping Begin/End Async API Into C#5 Tasks
Microsoft offered programmers several different ways of dealing with the asynchronous programming since .NET 1.0. The first model was Asynchronous programming model or APM for short. The pattern is implemented with two methods named BeginOperation and EndOperation. .NET 4 introduced new pattern – Task Asynchronous Pattern and with the introduction of .NET 4.5, Microsoft added language support for language integrated asynchronous coding style. You can check the MSDN for more samples and information. I will assume that you are familiar with it and have written code using it. You can wrap existing APM pattern into TPL pattern using the Task.Factory.FromAsync methods. For example: public static Task> ExecuteAsync(this DataServiceQuery query, object state) { return Task.Factory.FromAsync>(query.BeginExecute, query.EndExecute, state); } It is easy to wrap most of the asynchronous functions this way, but some cannot be since the wrapper functions assume that the last two parameters to the BeginOperation are AsyncCallback and object, and there are some versions of asynchronous operations that have different specifications. Examples: Extra parameters after the object state parameter: IAsyncResult DataServiceContext.BeginExecuteBatch( AsyncCallback callback, object state, params DataServiceRequest[] queries); Missing the expected object state parameter and different return type: ICancelableAsyncResult BeginQuery(AsyncCallback callBack); WorkItemCollection EndQuery(ICancelableAsyncResult car); Short solution for the first example The short and elegant way for wrapping the first example is to provide the following wrapper: public static Task ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries) { if (context == null) throw new ArgumentNullException("context"); return Task.Factory.FromAsync( context.BeginExecuteBatch(null, state, queries), context.EndExecuteBatch); } We simply call the Begin method ourselves and then wrap it using an another overload for FromAsync function. The longer way However, we can fully wrap it ourselves by simulating what the FromAsync wrapper does. The complete code is listed below. public static Task ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries) { // this will be our sentry that will know when our async operation is completed var tcs = new TaskCompletionSource(); try { context.BeginExecuteBatch((iar) => { try { var result = context.EndExecuteBatch(iar as ICancelableAsyncResult); tcs.TrySetResult(result); } catch (OperationCanceledException ex) { // if the inner operation was canceled, this task is cancelled too tcs.TrySetCanceled(); } catch (Exception ex) { // general exception has been set bool flag = tcs.TrySetException(ex); if (flag && ex as ThreadAbortException != null) { tcs.Task.m_contingentProperties.m_exceptionsHolder.MarkAsHandled(false); } } }, state, queries); } catch { tcs.TrySetResult(default(DataServiceResponse)); // propagate exceptions to the outside throw; } return tcs.Task; } Besides educational benefits, writing the full wrapper code allows us to add cancellation, logging and diagnostic information. Once we understand how to wrap APM pattern, We can now tackle the second problem easily. Handling the BeginQuery/EndQuery We will first create our own wrapper function in the spirit of the above code with the notable difference that we use the ICancelableAsyncResult interface instead of the IAsyncResult. public static class TaskEx { public static Task FromAsync(Func beginMethod, Func endMethod) { if (beginMethod == null) throw new ArgumentNullException("beginMethod"); if (endMethod == null) throw new ArgumentNullException("endMethod"); var tcs = new TaskCompletionSource(); try { beginMethod((iar) => { try { var result = endMethod(iar as ICancelableAsyncResult); tcs.TrySetResult(result); } catch (OperationCanceledException ex) { tcs.TrySetCanceled(); } catch (Exception ex) { bool flag = tcs.TrySetException(ex); if (flag && ex as ThreadAbortException != null) { tcs.Task.m_contingentProperties.m_exceptionsHolder.MarkAsHandled(false); } } }); } catch { tcs.TrySetResult(default(TResult)); throw; } return tcs.Task; } } The code is pretty self-explanatory and we can go ahead with the wrapping. There are four different operations that are exposed both in synchronous and asynchronous version: Query, LinkQuery, CountOnlyQuery and RegularQuery. The extension methods are short since we have already created our generic wrapper above: public static Task RunQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginQuery, query.EndQuery); } public static Task RunLinkQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginLinkQuery, query.EndLinkQuery); } public static Task RunCountOnlyQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginCountOnlyQuery, query.EndCountOnlyQuery); } public static Task RunRegularQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginRegularQuery, query.EndRegularQuery); } That is it for today, you can write your own handy extensions easily for APM functions out there.
April 2, 2012
by Toni Petrina
· 11,269 Views
article thumbnail
Cassandra Indexing: The Good, the Bad and the Ugly
Within NoSQL, the operations of indexing, fetching and searching for information are intimately tied to the physical storage mechanisms. It is important to remember that rows are stored across hosts, but a single row is stored on a single host. (with replicas) Columns families are stored in sorted order, which makes querying a set of columns efficient (provided you are spanning rows). The Bad : Partitioning One of the tough things to get used to at first is that without any indexes queries that span rows can (very) be bad. Thinking back to our storage model however, that isn't surprising. The strategy that Cassandra uses to distribute the rows across hosts is called Partitioning. Partitioning is the act of carving up the range of rowkeys assigning them into the "token ring", which also assigns responsibility for a segment (i.e. partition) of the rowkey range to each host. You've probably seen this when you initialized your cluster with a "token". The token gives the host a location along the token ring, which assigns responsibility for a section of the token range. Partitioning is the act of mapping the rowkey into the token range. There are two primary partitioners: Random and Order Preserving. They are appropriately named. The RandomPartitioner hashes the rowkeys into tokens. With the RandomPartitioner, the token is a hash of the rowkey. This does a good job of evenly distributing your data across a set of nodes, but makes querying a range of the rowkey space incredibly difficult. From only a "start rowkey" value and an "end rowkey" value, Cassandra can't determine what range of the token space you need. It essentially needs to perform a "table scan" to answer the query, and a "table scan" in Cassandra is bad because it needs to go to each machine (most likely ALL machines if you have a good hash function) to answer the query. Now, at the great cost of even data distribution, you can employ the OrderPreservingPartitioner (OPP). I am *not* down with OPP. The OPP preserves order as it translates rowkeys into tokens. Now, given a start rowkey value and a end rowkey value, Cassandra *can* determine exactly which hosts have the data you are looking for. It computes the start value to a token the end value to a token, and simply selects and returns everything in between. BUT, by preserving order, unless your rowkeys are evenly distributed across the space, your tokens won't be either and you'll get a lopsided cluster, which greatly increases the cost of configuration and administration of the cluster. (not worth it) The Good : Secondary Indexes Cassandra does provide a native indexing mechanism in Secondary Indexes. Secondary Indexes work off of the columns values. You declare a secondary index on a Column Family. Datastax has good documentation on the usage. Under the hood, Cassandra maintains a "hidden column family" as the index. (See Ed Anuff's presentation for specifics) Since Cassandra doesn't maintain column value information in any one node, and secondary indexes are on columns value (rather than rowkeys), a query still needs to be sent to all nodes. Additionally, secondary indexes are not recommended for high-cardinality sets. I haven't looked yet, but I'm assuming this is because of the data model used within the "hidden column family". If the hidden column family stores a row per unique value (with rowkeys as columns), then it would mean scanning the rows to determine if they are within the range in the query. From Ed's presentation: Not recommended for high cardinality values(i.e.timestamps,birthdates,keywords,etc.) Requires at least one equality comparison in a query--not great for less-than/greater-than/range queries Unsorted - results are in token order, not query value order Limited to search on datatypes, Cassandra natively understands With all that said, secondary indexes work out of the box and we've had good success using them on simple values. The Ugly : Do-It-Yourself (DIY) / Wide-Rows Now, beauty is in the eye of the beholder. One of the beautiful things about NoSQL is the simplicity. The constructs are simple: Keyspaces, Column Families, Rows and Columns. Keeping it simple however means sometimes you need to take things into your own hands. This is the case with wide-row indexes. Utilizing Cassandra's storage model, its easy to build your own indexes where each row-key becomes a column in the index. This is sometimes hard to get your head around, but lets imagine we have a case whereby we want to select all users in a zip code. The main users column family is keyed on userid, zip code is a column on each user row. We could use secondary indexes, but there are quite a few zip codes. Instead we could maintain a column family with a single row called "idx_zipcode". We could then write columns into this row of the form "zipcode_userid". Since the columns are stored in sorted order, it is fast to query for all columns that start with "18964" (e.g. we could use 18964_ and 18964_ZZZZZZ as start and end values). One obvious downside of this approach is that rows are self-contained on a host. (again except for replicas) This means that all queries are going to hit a single node. I haven't yet found a good answer for this. Additionally, and IMHO, the ugliest part of DIY wide-row indexing is from a client perspective. In our implementation, we've done our best to be language agnostic on the client-side, allowing people to pick the best tool for the job to interact with the data in Cassandra. With that mentality, the DIY indexes present some trouble. Wide-rows often use composite keys (imagine if you had an idx_state_zip, which would allow you to query by state then zip). Although there is "native" support for composite keys, all of the client libraries implement their own version of them (Hector, Astyanax, and Thrift). This means that client needing to query data needs to have the added logic to first query the index, and additionally all clients need to construct the composite key in the same manner. Making It Better... For this very reason, we've decided to release two open source projects that help push this logic to the server-side. The first project is Cassandra-Triggers. This allows you to attached asynchronous activities to writes in Cassandra. (one such activity could be indexing) We've also released Cassandra-Indexing. This is hot off the presses and is still in its infancy (e.g. it only supports UT8Types in the index), but the intent is to provide a generic server-side mechanism that indexes data as its written to Cassandra. Employing the same server-side technique we used in Cassandra-Indexing, you simply configure the columns you want indexed, and the AOP code does the rest as you write to the target CF. As always, questions, comments and thoughts are welcome. (especially if I'm off-base somewhere)
March 23, 2012
by Brian O' Neill
· 35,558 Views
article thumbnail
PHP objects in MongoDB with Doctrine
An is equivalent to an Object-Relational Mapper, but with its targets are documents of a NoSQL database instead of table rows. No one said that a Data Mapper must always rely on a relational database as its back end. In the PHP world, probably the Doctrine ODM for MongoDB is the most successful. This followes to the opularity of Mongo, which is a transitional product between SQL and NoSQL, still based on some relational concepts like queries. Lots of features The Doctrine Mongo ODM supports mapping of objects via annotations placed in the class source code, or via external XML or YAML files. In this and in many aspects it is based on the same concepts as the Doctrine ORM: it features a Facade DocumentManager object and a Unit Of Work that batches changes to the database when objects are added to it. Moreover, two different types of relationships between objects are supported: references and embedded documents. The first is the equivalent of the classical pointer to another row which ORM always transform object references into; the second actually stores an object inside another one, like you would do with a Value Object. Thus, at least in Doctrine's case, it is easier to map objects as documents that as rows. As said before, the ODM borrows some concepts and classes from the ORM, in particular from the Doctrine\Common package which features a standard collection class. So if you have built objects mapped with the Doctrine ORM nothing changes for persisting them in MongoDB, except for the mapping metadata itself. Advantages If an ORM is sometimes a leaky abstraction, an ODM probably becomes an issue less often. It has less overhead than an ORM, since there is no schema to define and the ability to embed objects means there should be no compromises between the object model and the capabilities of the database. How many times we have renounced introducing a potential Value Object because of the difficulty in persisting it? The case for an ODM over a plain Mongo connection object is easy to make: you will still be able to use objects with proper encapsulation (like private fields and associations) and behavior (many methods) instead of extracting just a JSON package from your database. Installation A prerequisite for the ODM is the presence of the mongo extension, that can be installed via pecl. After having verified the extension is present, grab the Doctrine\Common as the 2.2.x package, and a zip of the doctrine-mongodb and doctrine-mongodb-odm projects from Github. Decompress everything into a Doctrine/ folder. After having setup autoloading for classes in Doctrine\, use this bootstrap to get a DocumentManager (the equivalent of EntityManager): use Doctrine\Common\Annotations\AnnotationReader, Doctrine\ODM\MongoDB\DocumentManager, Doctrine\MongoDB\Connection, Doctrine\ODM\MongoDB\Configuration, Doctrine\ODM\MongoDB\Mapping\Driver\AnnotationDriver; private function getADm() { $config = new Configuration(); $config->setProxyDir(__DIR__ . '/mongocache'); $config->setProxyNamespace('MongoProxies'); $config->setDefaultDB('test'); $config->setHydratorDir(__DIR__ . '/mongocache'); $config->setHydratorNamespace('MongoHydrators'); $reader = new AnnotationReader(); $config->setMetadataDriverImpl(new AnnotationDriver($reader, __DIR__ . '/Documents')); return DocumentManager::create(new Connection(), $config); } You will be able to call persist() and flush() on the DocumentManager, along with a set of other methods for querying like find() and getRepository(). Integration with an ORM We are researching a solution for versioning objects mapped with the Doctrine ORM. Doing this with a version column would be invasive, and also strange where multiple objects are involved (do you version just the root of an object graph? Duplicate the other ones when they change? How can you detect that?) The idea is taking a snapshot and putting it in a read only MongoDB instance, where all previous versions can be retrieved later for auditing (business reasons). This has been verified to be technically possible: the DocumentManager and EntityManager are totally separate object graphs, so they won't clash with each other. The only point of conflict is the annotations of model classes, since both use different version of @Id, and can see the other's annotation like @Entity and @Document while parsing. This can be solved by using aliases for all the annotations, using their parent namespace basename as a prefix: model = $model; } public function __toString() { return "Car #$this->document_id: $this->id, $this->model"; } } This make us able to save a copy of an ORM object into Mongo: $car = new Car('Ford'); $this->em->persist($car); $this->em->flush(); $this->dm->persist($car); $this->dm->flush(); var_dump($car->__toString()); $this->assertTrue(strlen($car->__toString()) > 20); The output produces by this test is: .string(38) "Car #4f61a8322f762f1121000000: 3, Ford" When retrieving the object, one of the two ids will be null as it is ignored by the ORM or ODM. I am not using the same field because I want to store multiple copies of a row, so it's id alone won't be unique. If you're interested, checkout my hack on Github. It contains the running example presented in this post. Remember to create the relational schema with: $ php doctrine.php orm:schema-tool:create before running the test with phpunit --bootstrap bootstrap.php DoubleMappingTest.php MongoDB won't need the schema setup, of course. There are still some use cases to test, like the behavior in the presence of proxies, but it seems that non-invasive approach of Data Mappers like Doctrine 2 is paying off: try mapping an object in multiple database with Active Records.
March 20, 2012
by Giorgio Sironi
· 22,414 Views
article thumbnail
Adding a .first() method to Django's QuerySet
In my last Django project, we had a set of helper functions that we used a lot. The most used was helpers.first, which takes a query set and returns the first element, or None if the query set was empty. Instead of writing this: try: object = MyModel.objects.get(key=value) except model.DoesNotExist: object = None You can write this: def first(query): try: return query.all()[0] except: return None object = helpers.first(MyModel.objects.filter(key=value)) Note, that this is not identical. The get method will ensure that there is exactly one row in the database that matches the query. The helper.first() method will silently eat all but the first matching row. As long as you're aware of that, you might choose to use the second form in some cases, primarily for style reasons. But the syntax on the helper is a little verbose, plus you're constantly including helpers.py. Here is a version that makes this available as a method on the end of your query set chain. All you have to do is have your models inherit from this AbstractModel. class FirstQuerySet(models.query.QuerySet): def first(self): try: return self[0] except: return None class ManagerWithFirstQuery(models.Manager): def get_query_set(self): return FirstQuerySet(self.model) class AbstractModel(models.Model): objects = ManagerWithFirstQuery() class Meta: abstract = True class MyModel(AbstractModel): ... Now, you can do the following. object = MyModel.objects.filter(key=value).first()
March 19, 2012
by Chase Seibert
· 12,626 Views
article thumbnail
Display an OLE Object from a Microsoft Access Database using OLE Stripper
In database programming, it happens a lot that you need to bind a picture box to a field with type of photo or image. For example, if you want to show an Employee’s picture from Northwind.mdb database, you might want to try the following code: picEmployees.DataBindings.Add(“Image”, bsEmployees, “Photo”, true); This code works if the images are stored in the database with no OLE header or the images stored as a raw image file formats. As the pictures stored in the Northwind database in are not stored in raw image file formats and they are stored as an OLE image documents, then you have to strip off the OLE header to work with the image properly. Binding imageBinding = new Binding("Image", bsEmployees, "ImageBlob.ImageBlob", true); imageBinding.Format += new ConvertEventHandler(this.PictureFormat); private void PictureFormat(object sender, ConvertEventArgs e) { Byte[] img = (Byte[])e.Value; MemoryStream ms = new MemoryStream(); int offset = 78; ms.Write(img, offset, img.Length - offset); Bitmap bmp = new Bitmap(ms); ms.Close(); // Writes the new value back e.Value = bmp; } Fortunately, there are some overload methods in .NET Framework to take care of this mechanism, but it cannot be guaranteed whether you need to strip off the OLE object by yourself or not. For example, you can use the following technique to access the images of the Northwind.mdb that ships with Microsoft Access and they will be rendered properly. picEmployees.DataBindings.Add(“Image”, bsEmployees, “Photo”, true, DataSourceUpdateMode.Never, new Bitmap(typeof(Button), “Button.bmp”)); Unfortunately, there are some scenarios that you need a better solution. For example, the Xtreme.mdb database that ships with Crystal Reports has a photo filed that cannot be handled by the preceding methods. For these complex scenarios, you can download the OLEStripper classes from here and re-write the PictureFormat method as it is shown below: private void PictureFormat(object sender, ConvertEventArgs e) { // photoIndex is same as Employee ID int photoIndex = Convert.ToInt32(e.Value); // Read the original OLE object ReadOLE olePhoto = new ReadOLE(); string PhotoPath = olePhoto.GetOLEPhoto(photoIndex); // Strip the original OLE object StripOLE stripPhoto = new StripOLE(); string StripPhotoPath = stripPhoto.GetStripOLE(PhotoPath); FileStream PhotoStream = new FileStream(StripPhotoPath , FileMode.Open); Image EmployeePhoto = Image.FromStream(PhotoStream); e.Value = EmployeePhoto; PhotoStream.Close(); }
March 15, 2012
by Amir Ahani
· 11,113 Views
article thumbnail
Circos: An Amazing Tool for Visualizing Big Data
storing massive amounts of data in a nosql data store is just one side of the big data equation. being able to visualize your data in such a way that you can easily gain deeper insights , is where things really start to get interesting. lately, i've been exploring various options for visualizing (directed) graphs, including circos . circos is an amazing software package that visualizes your data through a circular layout . although it's originally designed for displaying genomic data , it allows to create good-looking figures from data in any field. just transform your data set into a tabular format and you are ready to go. the figure below illustrates the core concept behind circos. the table's columns and rows are represented by segments around the circle. individual cells are shown as ribbons , which connect the corresponding row and column segments. the ribbons themselves are proportional in width to the value in the cell. when visualizing a directed graph , nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. the proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table. in my case, i want to better understand the flow of visitors to and within the datablend site and blog; where do visitors come from (direct, referral, search, ...) and how do they navigate between pages. the rest of this article details how to 1) retrieve the raw visit information through the google analytics api, 2) persist this information as a graph in neo4j and 3) query and preprocess this data for visualization through circos. as always, the complete source code can be found on the datablend public github repository . 1. retrieving your google analytics data let's start by retrieving the raw google analytics data . the google analytics data api provides access to all dimensions and metrics that can be queried through the web application. in my case, i'm interested in retrieving the previous page path property for each page view. if a visitor enters through a page outside of the datablend website, the previous page path is marked as (entrance) . otherwise, it contains the internal path . we will use google's java data api to connect and retrieve this information. we are particularly interested in the pagepath , pagetitle , previouspagepath and medium dimensions, while our metric of choice is the number of pageviews . after setting the date range, the feed of entries that satisfy this criteria can be retrieved. for ease of use, we transform this data to a domain entity and filter/clean the data accordingly. if a visit originates from outside the datablend website, we store the specific medium (direct, referral, search, ...) as previous path. // authenticate analyticsservice = new analyticsservice(configuration.service); analyticsservice.setusercredentials(configuration.client_username, configuration.client_pass); // create query dataquery query = new dataquery(new url(configuration.data_url)); query.setids(configuration.table_id); query.setdimensions("ga:medium,ga:previouspagepath,ga:pagepath,ga:pagetitle"); query.setmetrics("ga:pageviews"); query.setstartdate(datestring); query.setenddate(datestring); // execute datafeed feed = analyticsservice.getfeed(createqueryurl(date), datafeed.class); // iterate and clean for (dataentry entry : feed.getentries()) { string pagepath = entry.stringvalueof("ga:pagepath"); string pagetitle = entry.stringvalueof("ga:pagetitle"); string previouspagepath = entry.stringvalueof("ga:previouspagepath"); string medium = entry.stringvalueof("ga:medium"); long views = entry.longvalueof("ga:pageviews"); // filter the data if (filter(pagepath) && filter(previouspagepath) && (!clean(previouspagepath).equals(clean(pagepath)))) { // check criteria are satisfied navigation navigation = new navigation(clean(previouspagepath), clean(pagepath), pagetitle, date, views); if (navigation.getsource().equals("(entrance)")) { // in case of an entrace, save its medium instead navigation.setsource(medium); } navigations.add(navigation); } } 2. storing navigational data as a directed graph in neo4j the set of site navigations can easily be stored as a directed graph in the neo4j graph database . nodes are site paths (or mediums), while relationships are the navigations themselves. we start by retrieving the navigations for a particular date range and retrieve (or lazily create) the nodes representing the source and target paths (or mediums). next we de-normalize the pageviews metric (for instance, 6 individual relationships will be created for 6 page-views). although this de-normalization step is not really required, i did so to make sure that the degree of my nodes is correct if i would perform other types of calculations. for each individual navigation relationship, we also store the date of visit . // retrieve navigations for a particular date list navigations = retrieval.getnavigations(date); // save them in the graph database transaction tx = graphdb.begintx(); // iterate and create for (navigation nav : navigations) { node source = getpath(nav.getsource()); node target = getpath(nav.gettarget()); if (!target.hasproperty("title")) { target.setproperty("title", nav.gettargettitle()); } for (long i = 0; i < nav.getamount(); i++) { // duplicate relationships relationship transition = source.createrelationshipto(target, relationships.navigation); transition.setproperty("date", date.gettime()); // save time as long } } // commit tx.success(); tx.finish(); 3. creating the circos tabular data format the circos tabular data format is quite easy to construct. it's basically a tab-delimited file with row and column headers. a cell is interpreted as a value that flows from the row entity to the column entity . we will use the neo4j cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period . doing so allows us to create historical visualizations of our navigations and observe how visit flow behaviors are changing over time. // access the graph database graphdb = new embeddedgraphdatabase("var/analytics"); engine = new executionengine(graphdb); // execute the data range cypher query map params = new hashmap(); params.put("fromdate", from.gettime()); params.put("todate", to.gettime()); // execute the query executionresult result = engine.execute("start sourcepath=node:index(\"path:*\") " + "match sourcepath-[r]->targetpath " + "where r.date >= {fromdate} and r.date <= {todate} " + "return sourcepath,targetpath", params); next, we create the tab delimited file itself. we iterate through all entries (i.e. navigations) that match our cypher query and store them in a temporary list. afterwards, we start building the two-dimensional array by normalizing (i.e. summing) the number of navigations between the source and target paths. at the end, we filter this occurrence matrix on the minimal number of required navigations. this ensures that we will only create segments for paths that are relevant in the total population. as a final step, we print the occurrences matrix as a tab-delimited file. for each path, we will use a shorthand as the circos renderer seems to have problem with long string identifiers. // retrieve the results iterator> it = result.javaiterator(); list navigations = new arraylist(); map titles = new hashmap(); set paths = new hashset(); // iterate the results while (it.hasnext()) { map record = it.next(); string source = (string)((node) record.get("sourcepath")).getproperty("path"); string target = (string) ((node) record.get("targetpath")).getproperty("path"); string targettitle = (string) ((node) record.get("targetpath")).getproperty("title"); // reuse the navigation object as temorary holder navigations.add(new navigation(source, target, targettitle, new date(), 1)); paths.add(source); paths.add(target); if (!titles.containskey(target)) { titles.put(target, targettitle); } } // retrieve the various paths list pathids = arrays.aslist(paths.toarray(new string[]{})); // create the matrix that holds the info int[][] occurences = new int[pathids.size()][pathids.size()]; // iterate through all the navigations and update accordingly for (navigation navigation : navigations) { int sourceindex = pathids.indexof(navigation.getsource()); int targetindex = pathids.indexof(navigation.gettarget()); occurences[sourceindex][targetindex] = occurences[sourceindex][targetindex] + 1; } // matrix build, filter on threshold for (int i = 0; i < occurences.length; i++) { for (int j = 0; j < occurences.length; j++) { if (occurences[i][j] < threshold) { occurences[i][j] = 0; } } // print printcircosdata(pathids, titles, occurences); the text below is a sample of the output generated by the printcircosdata method. it first prints the legend (matching shorthands with actual paths). next it prints the tab-delimited circos table. link0 - /?p=411/wp-admin - storing and querying rdf data in neo4j through sail - datablend link1 - /?p=1146 - visualizing rdf schema inferencing through neo4j, tinkerpop, sail and gephi - datablend link2 - /?p=164 - big data / concise articles - datablend link3 - referral - null link4 - /?p=1400 - the joy of algorithms and nosql revisited: the mongodb aggregation framework - datablend ... datal0l1l2l3l4... l000000 l100000 l200000 l3059400197 l400000 4. use the circos power although circos can be installed on your local computer, we will use its online version to create the visualization of our data. upload your tab-delimited file and just wait a few seconds before enjoying the beautiful rendering of your site's navigation information. with just a glimpse of an eye we can already see that the l3-segment (i.e. the referrals) is significantly larger (almost 6000 navigations) compared to the others segments. the outer 3 rings visualize the total amounts of navigations that are leaving and entering this particular path. in case of referrals, no navigations have this path as target (indicated by the empty middle ring). its total segment count (inner ring) is entirely build up out of navigations that have a referral as source. the l6-segment seems to be the path that attracts the most traffic (around 2500 navigations). this segment visualizes the navigation data related to my "the joy of algorithms and nosql: a mongodb example" -article. most of its traffic is received through referrals, while a decent amount is also generated through direct (l17-segment) and search (l27-segment) traffic. the l15-segment (my blog's main page) is the only path that receives an almost equal amount of incoming and outgoing traffic. with just a few tweaks to the circos input data, we can easily focus on particular types of navigation data. in the figure below, i made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors. 5. conclusions in the era of big data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. circos specializes in a very specific type of visualization, but does its job extremely well. i would be delighted to hear about other types of visualizations for directed graphs.
March 13, 2012
by Davy Suvee
· 36,376 Views · 2 Likes
article thumbnail
Joins with MapReduce
i have been reading up on join implementations available for hadoop for past few days. in this post i recap some techniques i learnt during the process. the joins can be done at both map side and join side according to the nature of data sets of to be joined. reduce side join let’s take the following tables containing employee and department data. let’s see how join query below can be achieved using reduce side join. select employees.name, employees.age, department.name from employees inner join department on employees.dept_id=department.dept_id map side is responsible for emitting the join predicate values along with the corresponding record from each table so that records having same department id in both tables will end up at on same reducer which would then do the joining of records having same department id. however it is also required to tag the each record to indicate from which table the record originated so that joining happens between records of two tables. following diagram illustrates the reduce side join process. here is the pseudo code for map function for this scenario. map (k table, v rec) { dept_id = rec.dept_id tagged_rec.tag = table tagged_rec.rec = rec emit(dept_id, tagged_rec) } at reduce side join happens within records having different tags. reduce (k dept_id, list tagged_recs) { for (tagged_rec : tagged_recs) { for (tagged_rec1 : taagged_recs) { if (tagged_rec.tag != tagged_rec1.tag) { joined_rec = join(tagged_rec, tagged_rec1) } emit (tagged_rec.rec.dept_id, joined_rec) } } map side join (replicated join) using distributed cache on smaller table for this implementation to work one relation has to fit in to memory. the smaller table is replicated to each node and loaded to the memory. the join happens at map side without reducer involvement which significantly speeds up the process since this avoids shuffling all data across the network even-though most of the records not matching are later dropped. smaller table can be populated to a hash-table so look-up by dept_id can be done. the pseudo code is outlined below. map (k table, v rec) { list recs = lookup(rec.dept_id) // get smaller table records having this dept_id for (small_table_rec : recs) { joined_rec = join (small_table_rec, rec) } emit (rec.dept_id, joined_rec) } using distributed cache on filtered table if the smaller table doesn’t fit the memory it may be possible to prune the contents of it if filtering expression has been specified in the query. consider following query. select employees.name, employees.age, department.name from employees inner join department on employees.dept_id=department.dept_id where department.name="eng" here a smaller data set can be derived from department table by filtering out records having department names other than “eng”. now it may be possible to do replicated map side join with this smaller data set. replicated semi-join reduce side join with map side filtering even of the filtered data of small table doesn’t fit in to the memory it may be possible to include just the dept_id s of filtered records in the replicated data set. then at map side this cache can be used to filter out records which would be sent over to reduce side thus reducing the amount of data moved between the mappers and reducers. the map side logic would look as follows. map (k table, v rec) { // check if this record needs to be sent to reducer boolean sendtoreducer = check_cache(rec.dept_id) if (sendtoreducer) { dept_id = rec.dept_id tagged_rec.tag = table tagged_rec.rec = rec emit(dept_id, tagged_rec) } } reducer side logic would be same as the reduce side join case. using a bloom filter a bloom filter is a construct which can be used to test the containment of a given element in a set. a smaller representation of filtered dept_ids can be derived if dept_id values can be augmented in to a bloom filter. then this bloom filter can be replicated to each node. at the map side for each record fetched from the smaller table the bloom filter can be used to check whether the dept_id in the record is present in the bloom filter and only if so to emit that particular record to reduce side. since a bloom filter is guaranteed not to provide false negatives the result would be accurate. references [1] hadoop in action [2] hadoop : the definitive guide
March 12, 2012
by Buddhika Chamith
· 31,062 Views
article thumbnail
Using Maven's -U Command Line Option
My prefered solution was to use the Maven ‘update snapshots’ command line argument.
March 11, 2012
by Roger Hughes
· 106,960 Views · 1 Like
article thumbnail
Resetting the Database Connection in Django
Django handles database connections transparently in almost all cases. It will start a new connection when your request starts up, and commit it at the end of the request lifetime. Other times you need to dive in further and do your own granular transaction management. But for the most part, it's fully automatic. However, sometimes your use case may require that you close the current database connection and open a new one. While this is possible in Django, it's not well documented. Why would you want to do this? I my case, I was writing an automation test framework. Some of the automation tests make database calls through the Django ORM to setup records, clean up after the test, etc. Each test is executed in the same process space, via a thread pool. We found that if one of the early tests threw an unrecoverable database error, such as an IntegrityError due to violating a unique constraint, the database connection would be aborted. Subsequent tests that tried to use the database would raise a DatabaseError: Traceback (most recent call last): File /home/user/project/app/test.py, line 73, in tearDown MyModel.objects.all() File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 444, in delete collector.collect(del_query) File /usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py, line 146, in collect reverse_dependency=reverse_dependency) File /usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py, line 91, in add if not objs: File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 113, in __nonzero__ iter(self).next() File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 107, in _result_iter self._fill_cache() File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 772, in _fill_cache self._result_cache.append(self._iter.next()) File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 273, in iterator for row in compiler.results_iter(): File /usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py, line 680, in results_iter for rows in self.execute_sql(MULTI): File /usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py, line 735, in execute_sql cursor.execute(sql, params) File /usr/local/lib/python2.6/dist-packages/django/db/backends/postgresql_psycopg2/base.py, line 44, in execute return self.cursor.execute(query, args) DatabaseError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. It turns out that it's relatively easy to reset the database connection. We just called the following function at the start of every test. Django is smart enough to re-initialize the connection the next time it's used, assuming that it's disconnected properly. def reset_database_connection(): from django import db db.close_connection()
March 9, 2012
by Chase Seibert
· 9,198 Views
article thumbnail
Connecting to Multiple Databases Using Hibernate
In a recent project, I had a requirement of connecting to multiple databases using hibernate. As tapestry-hibernate module does not provide an out-of-box support, I thought of adding one. https://github.com/tawus/tapestry5 Now that the application is in production, I thought of writing a simple “How to”. I have cloned the latest stable(5.3.2) tapestry project at https://github.com/tawus/tapestry5 and have added multiple database support to it. Single Database It is almost fully compatible with the previous integration when using a single database except for a few things 1) HibernateConfigurer has changed public interface HibernateConfigurer { /** * Passed the configuration so as to make changes. */ void configure(Configuration configuration); /** * Factory Id for which this configurer is meant for */ Class getMarker(); /** * Entity package names * * @return */ String[] getPackageNames(); } 2) There is no HibernateEntityPackageManager, as the packages can be contributed by adding more HibernateConfigurers with the same Markers. Multiple databases For multiple database, a marker has to be used for accessing Session or HibernateSessionManager @Inject @XDB private Session session; @Inject @YDB private HibernateSessionManager sessionManager; @XDB @CommitAfter void myMethod(){ } Also you have to define a HibernateSessionManager and a Session for the secondary database in the Module class. @Scope(ScopeConstants.PERTHREAD) @Marker(DatabaseTwo.class) public static HibernateSessionManager buildHibernateSessionManagerForFinacle( HibernateSessionSource sessionSource, PerthreadManager perthreadManager) { HibernateSessionManagerImpl service = new HibernateSessionManagerImpl(sessionSource, DatabaseTwo.class); perthreadManager.addThreadCleanupListener(service); return service; } @Marker(DatabaseTwo.class) public static Session buildSessionForFinacle( @Local HibernateSessionManager sessionManager, PropertyShadowBuilder propertyShadowBuilder) { return propertyShadowBuilder.build(sessionManager, "session", Session.class); } Notice an annotation @DatabaseTwo.class. This is a Factory marker and is used to identify a service related to a particular SessionFactory. @Retention(RetentionPolicy.RUNTIME) @Target( {ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD}) @FactoryMarker @Documented public @interface DatabaseTwo { } A typical AppModule for two databases will be public class AppModule { public static void bind(ServiceBinder binder) { binder.bind(DemoService.class, DemoServiceImpl.class); } @Contribute(HibernateSessionSource.class) public static void configureHibernateSources(OrderedConfiguration configurers) { configurers.add("databaseOne", new HibernateConfigurer() { public void configure(org.hibernate.cfg.Configuration configuration) { configuration.configure("/databaseOne.xml"); } public Class getMarker() { return DefaultFactory.class; } public String[] getPackageNames() { return new String[] {"org.example.demo.one"}; } }); configurers.add("databaseTwo", new HibernateConfigurer() { public void configure(org.hibernate.cfg.Configuration configuration) { configuration.configure("/databaseTwo.xml"); } public Class getMarker() { return DatabaseTwo.class; } public String[] getPackageNames() { return new String[] {"org.example.demo.two"}; } }); } @Contribute(SymbolProvider.class) @ApplicationDefaults public static void addSymbols(MappedConfiguration configuration) { configuration.add(HibernateSymbols.DEFAULT_CONFIGURATION, "false"); configuration.add("tapestry.app-package", "org.example.demo"); } @Scope(ScopeConstants.PERTHREAD) @Marker(DatabaseTwo.class) public static HibernateSessionManager buildHibernateSessionManagerForFinacle( HibernateSessionSource sessionSource, PerthreadManager perthreadManager) { HibernateSessionManagerImpl service = new HibernateSessionManagerImpl(sessionSource, DatabaseTwo.class); perthreadManager.addThreadCleanupListener(service); return service; } @Marker(DatabaseTwo.class) public static Session buildSessionForFinacle( @Local HibernateSessionManager sessionManager, PropertyShadowBuilder propertyShadowBuilder) { return propertyShadowBuilder.build(sessionManager, "session", Session.class); } } Injecting into Services You can inject a session in a service using the marker. As DatabaseOne is being used as the default configuration, in order to inject its Session, you have to annotate it with @DefaultFactory. For DatabaseTwo, you can use @DatabaseTwo annotation. public class DemoServiceImpl implements DemoService { private Session sessionOne; private Session sessionTwo; public DemoServiceImpl( @DefaultFactory Session sessionOne, @DatabaseTwo Session sessionTwo) { this.sessionOne = sessionOne; this.sessionTwo = sessionTwo; } @SuppressWarnings("unchecked") public List listOnes() { return sessionOne.createCriteria(EntityOne.class).list(); } @SuppressWarnings("unchecked") public List listTwos() { return sessionTwo.createCriteria(EntityTwo.class).list(); } public void save(EntityOne entityOne) { sessionOne.saveOrUpdate(entityOne); } public void save(EntityTwo entityTwo) { sessionTwo.saveOrUpdate(entityTwo); } } Using @CommitAfter You can add an advice the same way you used to. The only change is in @CommitAfter. You have to additionally annotate the method with the respective marker. public interface DemoService { List listOnes(); List listTwos(); @CommitAfter @DefaultFactory void save(EntityOne entityOne); @CommitAfter @DatabaseTwo void save(EntityTwo entityTwo); } Here is an example. From http://tawus.wordpress.com/2012/03/03/tapestry-hibernate-multiple-databases/
March 7, 2012
by Taha Siddiqi
· 100,294 Views · 3 Likes
article thumbnail
Blobs and More: Storing Images and Files in IndexedDB
The desired future approach for storing things client-side in web browsers is utilizing IndexedDB. Here I’ll walk you through how to store images and files in IndexedDB and then present them through an ObjectURL. The general approach First, let’s talk about the steps we will go through to create an IndexedDB data base, save the file into it and then read it out and present in the page: Create or open a database. Create an objectStore (if it doesn’t already exist) Retrieve an image file as a blob Initiate a database transaction Save that blob into the database Read out that saved file and create an ObjectURL from it and set it as the src of an image element in the page Creating the code Let’s break down all parts of the code that we need to do this: Create or open a database. // IndexedDB var indexedDB = window.indexedDB || window.webkitIndexedDB || window.mozIndexedDB || window.OIndexedDB || window.msIndexedDB, IDBTransaction = window.IDBTransaction || window.webkitIDBTransaction || window.OIDBTransaction || window.msIDBTransaction, dbVersion = 1; // Create/open database var request = indexedDB.open("elephantFiles", dbVersion); request.onsuccess = function (event) { console.log("Success creating/accessing IndexedDB database"); db = request.result; db.onerror = function (event) { console.log("Error creating/accessing IndexedDB database"); }; // Interim solution for Google Chrome to create an objectStore. Will be deprecated if (db.setVersion) { if (db.version != dbVersion) { var setVersion = db.setVersion(dbVersion); setVersion.onsuccess = function () { createObjectStore(db); getImageFile(); }; } else { getImageFile(); } } else { getImageFile(); } } // For future use. Currently only in latest Firefox versions request.onupgradeneeded = function (event) { createObjectStore(event.target.result); }; The intended way to use this is to have the onupgradeneeded event triggered when a database is created or gets a higher version number. This is currently only supported in Firefox, but will soon be in other web browsers. If the web browser doesn’t support this event, you can use the deprecated setVersion method and connect to its onsuccess event. Create an objectStore (if it doesn’t already exist) // Create an objectStore console.log("Creating objectStore") dataBase.createObjectStore("elephants"); Here you create an ObjectStore that you will store your data – or in our case, files – and once created you don’t need to recreate it, just update its contents. Retrieve an image file as a blob // Create XHR var xhr = new XMLHttpRequest(), blob; xhr.open("GET", "elephant.png", true); // Set the responseType to blob xhr.responseType = "blob"; xhr.addEventListener("load", function () { if (xhr.status === 200) { console.log("Image retrieved"); // File as response blob = xhr.response; // Put the received blob into IndexedDB putElephantInDb(blob); } }, false); // Send XHR xhr.send(); This code gets the contents of a file as a blob directly. Currently that’s only supported in Firefox. Once you have received the entire file, you send the blob to the function to store it in the database. Initiate a database transaction // Open a transaction to the database var transaction = db.transaction(["elephants"], IDBTransaction.READ_WRITE); To start writing something to the database, you need to initiate a transaction with an objectStore name and the type of action you want to do – in this case read and write. Save that blob into the database // Put the blob into the dabase transaction.objectStore("elephants").put(blob, "image"); Once the transaction is in place, you get a reference to the desired objectStore and then put your blob into it and give it a key. Read out that saved file and create an ObjectURL from it and set it as the src of an image element in the page // Retrieve the file that was just stored transaction.objectStore("elephants").get("image").onsuccess = function (event) { var imgFile = event.target.result; console.log("Got elephant!" + imgFile); // Get window.URL object var URL = window.URL || window.webkitURL; // Create and revoke ObjectURL var imgURL = URL.createObjectURL(imgFile); // Set img src to ObjectURL var imgElephant = document.getElementById("elephant"); imgElephant.setAttribute("src", imgURL); // Revoking ObjectURL URL.revokeObjectURL(imgURL); }; Use the same transaction to get the image file you just stored, and then create an objectURL and set it to the src of an image in the page. This could just as well, for instance, have been a JavaScript file that you attached to a script element, and then it would parse the JavaScript. The complete code So, here’s the complete working code: (function () { // IndexedDB var indexedDB = window.indexedDB || window.webkitIndexedDB || window.mozIndexedDB || window.OIndexedDB || window.msIndexedDB, IDBTransaction = window.IDBTransaction || window.webkitIDBTransaction || window.OIDBTransaction || window.msIDBTransaction, dbVersion = 1.0; // Create/open database var request = indexedDB.open("elephantFiles", dbVersion), db, createObjectStore = function (dataBase) { // Create an objectStore console.log("Creating objectStore") dataBase.createObjectStore("elephants"); }, getImageFile = function () { // Create XHR var xhr = new XMLHttpRequest(), blob; xhr.open("GET", "elephant.png", true); // Set the responseType to blob xhr.responseType = "blob"; xhr.addEventListener("load", function () { if (xhr.status === 200) { console.log("Image retrieved"); // Blob as response blob = xhr.response; console.log("Blob:" + blob); // Put the received blob into IndexedDB putElephantInDb(blob); } }, false); // Send XHR xhr.send(); }, putElephantInDb = function (blob) { console.log("Putting elephants in IndexedDB"); // Open a transaction to the database var transaction = db.transaction(["elephants"], IDBTransaction.READ_WRITE); // Put the blob into the dabase var put = transaction.objectStore("elephants").put(blob, "image"); // Retrieve the file that was just stored transaction.objectStore("elephants").get("image").onsuccess = function (event) { var imgFile = event.target.result; console.log("Got elephant!" + imgFile); // Get window.URL object var URL = window.URL || window.webkitURL; // Create and revoke ObjectURL var imgURL = URL.createObjectURL(imgFile); // Set img src to ObjectURL var imgElephant = document.getElementById("elephant"); imgElephant.setAttribute("src", imgURL); // Revoking ObjectURL URL.revokeObjectURL(imgURL); }; }; request.onerror = function (event) { console.log("Error creating/accessing IndexedDB database"); }; request.onsuccess = function (event) { console.log("Success creating/accessing IndexedDB database"); db = request.result; db.onerror = function (event) { console.log("Error creating/accessing IndexedDB database"); }; // Interim solution for Google Chrome to create an objectStore. Will be deprecated if (db.setVersion) { if (db.version != dbVersion) { var setVersion = db.setVersion(dbVersion); setVersion.onsuccess = function () { createObjectStore(db); getImageFile(); }; } else { getImageFile(); } } else { getImageFile(); } } // For future use. Currently only in latest Firefox versions request.onupgradeneeded = function (event) { createObjectStore(event.target.result); }; })(); Web browser support IndexedDB Supported since long (a number of versions back) in Firefox and Google Chrome. Planned to be in IE10, unclear about Safari and Opera. onupgradeneeded Supported in latest Firefox. Planned to be in Google Chrome soon and hopefully IE10. Unclear about Safari and Opera. Storing files in IndexedDB Supported in Firefox 11 and later. Planned to be supported in Google Chrome. Hopefully IE10 will support it. Unclear about Safari and Opera. XMLHttpRequest Level 2 Supported in Firefox and Google Chrome since long, Safari 5+ and planned to be in IE10 and Opera 12. responseType “blob” Currently only supported in Firefox. Will soon be in Google Chrome and is planned to be in IE10. Unclear about Safari and Opera. Demo and code I’ve put together a demo with IndexedDB and saving images and files in it where you can see it all in action. Make sure to use any Developer Tool to Inspect Element on the image to see the value of its src attribute. Also make sure to check the console.log messages to follow the actions. The code for storing files in IndexedDB is also available on GitHub, so go play now!
March 6, 2012
by Robert Nyman
· 23,723 Views
article thumbnail
Solr Date Math, NOW and filter queries
Or “How to never re-use cached filter query results even though you meant to”: Filter queries (“fq” clauses) are a means to restrict the number of documents that are considered for scoring. A common use of “fq” clauses is to restrict the dates of documents returned, things like “in the last day”, “in the last week” etc. You find this pattern often used in conjunction with faceting. Filter queries make use of a filterCache (see solrconfig.xml) to calculate the set of documents satisfying the query once and then re-use that result set. Often, using NOW in filter queries causes this caching to be useless. Here’s why. Solr maintains a filterCache, where it stores the results of “fq” clauses. You can think of it as a map, where the key is the “fq” clause and the value is the set of documents satisfying that clause. I’m going to skip the details of how the document set (the “value” in this map) is stored, since this post is really concentrating on the key. So, let’s say you have two filter queries (whether they’re in the same query or not is irrelevant), something like: “fq=category:books&fq=source:library”. There will be two entries in the filterCache, something like: category:books => 1, 2, 5, 89… source:library => 7, 45, 101… All well and good so far. I’ll add one short diversion here. This bears on why it is often better to have several “fq” clauses than a single one. The same results could be obtained by “fq=category:books AND source:library”, but then the filter cache would look like: category:books AND source:library => 1, 2, 5, 7, 45, 89, 101….. and an fq like “fq=category:books” would NOT re-use the cache since the key is much different. But enough of a diversion… OK, you mentioned dates. Get to the point. It’s common to have date ranges as filter queries, things like “in the last day”, “in the last week”, etc. And there’s the convenient date math to make this easy. So it’s tempting, very tempting to have filter clauses on date ranges like “fq=date:[NOW-1DAY TO NOW]“. Be careful when using NOW! Here’s the problem. In the above example, date:[NOW-1DAY TO NOW] is not what’s used as the key for the fq in the filterCache, the expansion is used as the key. This translates into a form like: “date:[2012-01-20T08:56:23Z TO 2012-01-27T8:56:23Z]” for the key into the filter cache. Now the user adds a term to the “q” and re-submits the query 30 seconds after the first one. The fq clause now looks something like: “fq=date:[2012-01-20T08:56:53Z TO 2012-01-27T8:56:53Z]” note that the seconds went from 23 to 53! The key for this fq does not match the key for the first, even though it’s often the case that the intent is that submitting this kind of fq 30 seconds later would result in the same set of documents matching the filter. Bare NOW entries in filter clauses will pretty much guarantee that the cached result sets will never be reused. Fine. What do you do to make it better? Here’s where rounding makes sense. Using midnight can make sense from two perspectives. The sense you often want is “anything with a timestamp in a particular day” (or month or year or hour or….). So just using NOW for the lower bound would miss anything published between midnight and whenever the user happens to submit the query on the day (in this example) of the lower bound. Re-using the filter cache can substantially speed up your queries, especially if you’re providing links like “in the last day”, “1-7 days ago” etc. So your fq clauses start to look like “fq=date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]“. The thing to note about the date math “/” operator is that is is a “round down” operator. So let’s break this up a bit: NOW/DAY rounds down to midnight last night. -7DAYS subtracts 7 whole days. So the lower limit is really “midnight 7 days ago”. Similarly, NOW/DAY rounds to midnight last night and +1DAY moves that to midnight tonight for the upper limit. These clauses are invariant until after midnight tonight so these clauses will return the same results all day today, and only the first submission of this fq will incur the cost of figuring out which documents satisfy it, all the queries after the first will just read the cached result set from the filterCache. Of course the caches are invalidated if you update your index and/or a replication happens, but that’s always the case. You will note that there is a bit of “slop” here. If your index has dates in the future, you may get them too. Suppose you have a situation where your index contains documents you don’t want the users to see until it’s later than their timestamp. I actually have a hard time contriving an example here, but let’s just assume it’s the case. Also say it’s noon and your index contains timestamps on documents through midnight tonight. The above technique will show documents that will be officially published at, say, 15:00 even though it’s only 12:00 and you may not want that. In that case, you’ll have to use a bare NOW clause and live with the fact that your cache isn’t being used for these clauses. Like I said, this is contrived, but I mention it for completeness’ sake. A couple of notes about dates: Before I finish, a couple of notes about dates. Use the coarsest dates you can and still satisfy your use case. This is especially true if you’re sorting by dates. The sorting resource requirements go up by the number of unique terms. So storing millisecond resolution when all you care about is day can be wasteful. This is also true when faceting. It’s often useful to index multiple fields with some date data, especially if you intend to facet. The above examples in the 3.x code line have a slight problem when more than one adjoining range is required. The range operator “[]” is inclusive, so if you have a document indexed at exactly midnight in these examples, it might be included in two ranges. Trunk Solr (4.0) allows mixing inclusive “[]” and exclusive “{}” endpoints, so expressions like “date:[NOW/DAY-1DAY TO NOW/DAY+1DAY}” are possible. An exercise for the reader: What are the consequences of using different kinds of rounding? E.g. NOW/5MIN, NOW/72MIN (does this even work?).
February 27, 2012
by Erick Erickson
· 52,604 Views
article thumbnail
How to migrate databases between SQL Server and SQL Server Compact
In this post, I will try to give an overview of the free tools available for developers to move databases from SQL Server to SQL Server Compact and vice versa. I will also show how you can do this with the SQL Server Compact Toolbox (add-in and standalone editions). Moving databases from SQL Server Compact to SQL Server This can be useful for situations where you already have developed an application that depends on SQL Server Compact, and would like the increased power of SQL Server or would like to use some feature, that is not available on SQL Server Compact. I have an informal comparison of the two products here. Microsoft offers a GUI based tool and a command line tool to do this: WebMatrix and MsDeploy. You can also use the ExportSqlCe command line tool or the SQL Server Compact Toolbox to do this. To use the ExportSqlCE (or ExportSqlCE40) command line, use a command similar to: ExportSQLCE.exe "Data Source=D:\Northwind.sdf;" Northwind.sql The resulting script file (Northwind.sql) can the be run against a SQL Server database, using for example the SQL Server sqlcmd command line utility: sqlcmd -S mySQLServer –d NorthWindSQL -i C:\Northwind.sql To use the SQL Server Compact Toolbox: Connect the Toolbox to the database file that you want to move to SQL Server: Right click the database connection, and select to script Schema and Data: Optionally, select which tables to script and click OK: Enter the filename for the script, default extension is .sqlce: Click OK to the confirmation message: You can now open the generated script in Management Studio and execute it against a SQL Server database, or run it with sqlcmd as described above. Moving databases from SQL Server to SQL Server Compact Microsoft offers no tools for doing this “downsizing” of a SQL Server database to SQL Server Compact, and of course not all objects in a SQL Server database CAN be downsized, as only tables exists in a SQL Server Compact database, so no stored procedures, views, triggers, users, schema and so on. I have blogged about how this can be done from the command line, and you can also do this with the SQL Server Compact Toolbox (of course): From the root node, select Script Server Data and Schema: Follow a procedure like the one above, but connecting to a SQL Server database instead. The export process will convert the SQL Server data types to a matching SQL Server Compact data type, for example varchar(50) becomes nvarchar(50) and so on. Any unsupported data types will be ignored, this includes for example computed columns and sql_variant. The new date types in SQL Server 2008+, like date, time, datetime2 will be converted to nvarchar based data types, as only datetime is supported in SQL Server Compact. A full list of the SQL Server Compact data types is available here.
February 26, 2012
by Erik Ejlskov Jensen
· 16,927 Views
article thumbnail
wxPython: Creating a "Dark Mode"
One day at work, I was told that we had a feature request for one of my programs. They wanted a “dark mode” for when they used my application at night as the normal colors were kind of glaring. My program is used in laptops in police cars, so I could understand their frustration. I spent some time looking into the matter and got a mostly working script put together which I’m going to share with my readers. Of course, if you’re a long time reader, you probably know I’m talking about a wxPython program. I write almost all my GUIs using wxPython. Anyway, let’s get on with the story! Into the Darkness Getting the widgets to change color in wxPython is quite easy. The only two methods you need are SetBackgroundColour and SetForegroundColour. The only major problem I ran into when I was doing this was getting my ListCtrl / ObjectListView widget to change colors appropriately. You need to loop over each ListItem and change their colors individually. I alternate row colors, so that made things more interesting. The other problem I had was restoring the ListCtrl’s background color. Normally you can set a widget’s background color to wx.NullColour (or wx.NullColor) and it will go back to its default color. However, some widgets don’t work that way and you have to actually specify a color. It should also be noted that some widgets don’t seem to pay any attention to SetBackgroundColour at all. One such widget that I’ve found is the wx.ToggleButton. Now you know what I know, so let’s look at the code I came up with to solve my issue: import wx try: from ObjectListView import ObjectListView except: ObjectListView = False #---------------------------------------------------------------------- def getWidgets(parent): """ Return a list of all the child widgets """ items = [parent] for item in parent.GetChildren(): items.append(item) if hasattr(item, "GetChildren"): for child in item.GetChildren(): items.append(child) return items #---------------------------------------------------------------------- def darkRowFormatter(listctrl, dark=False): """ Toggles the rows in a ListCtrl or ObjectListView widget. Based loosely on the following documentation: http://objectlistview.sourceforge.net/python/recipes.html#recipe-formatter and http://objectlistview.sourceforge.net/python/cellEditing.html """ listItems = [listctrl.GetItem(i) for i in range(listctrl.GetItemCount())] for index, item in enumerate(listItems): if dark: if index % 2: item.SetBackgroundColour("Dark Grey") else: item.SetBackgroundColour("Light Grey") else: if index % 2: item.SetBackgroundColour("Light Blue") else: item.SetBackgroundColour("Yellow") listctrl.SetItem(item) #---------------------------------------------------------------------- def darkMode(self, normalPanelColor): """ Toggles dark mode """ widgets = getWidgets(self) panel = widgets[0] if normalPanelColor == panel.GetBackgroundColour(): dark_mode = True else: dark_mode = False for widget in widgets: if dark_mode: if isinstance(widget, ObjectListView) or isinstance(widget, wx.ListCtrl): darkRowFormatter(widget, dark=True) widget.SetBackgroundColour("Dark Grey") widget.SetForegroundColour("White") else: if isinstance(widget, ObjectListView) or isinstance(widget, wx.ListCtrl): darkRowFormatter(widget) widget.SetBackgroundColour("White") widget.SetForegroundColour("Black") continue widget.SetBackgroundColour(wx.NullColor) widget.SetForegroundColour("Black") self.Refresh() return dark_mode This code is a little convoluted, but it gets the job done. Let’s break it down a bit and see how it works. First off, we try to import ObjectListView, a cool 3rd party widget that wraps wx.ListCtrl and makes it a LOT easier to use. However, it’s not part of wxPython right now, so you need to test for it’s existence. I just set it to False if it doesn’t exist. The GetWidgets function takes a parent parameter, which would usually be a wx.Frame or wx.Panel and goes through all of its children to create a list of widgets, which it then returns to the calling function. The main function is darkMode. It takes two parameters too, the poorly named “self”, which refers to a parent widget, and a default panel color. It calls GetWidgets and then uses a conditional statement to decide if dark mode should be enabled or not. Next it loops over the widgets and changes the colors accordingly. When it’s done, it will refresh the passed in parent and return a bool to let you know if dark mode is on or off. There is one more function called darkRowFormatter that is only for setting the colors of the ListItems in a wx.ListCtrl or an ObjectListView widget. Here we use a list comprehension to create a list of wx.ListItems that we then iterate over, changing their colors. To actually apply the color change, we need to call SetItem and pass it a wx.ListItem object instance. Trying Out Dark Mode So now you’re probably wondering how to actually use the script above. Well, this section will show you how it’s done. Here’s a simple program with a list control in it and a toggle button too! import wx import darkMode ######################################################################## class MyPanel(wx.Panel): """""" #---------------------------------------------------------------------- def __init__(self, parent): """Constructor""" wx.Panel.__init__(self, parent) self.defaultColor = self.GetBackgroundColour() rows = [("Ford", "Taurus", "1996", "Blue"), ("Nissan", "370Z", "2010", "Green"), ("Porche", "911", "2009", "Red") ] self.list_ctrl = wx.ListCtrl(self, style=wx.LC_REPORT) self.list_ctrl.InsertColumn(0, "Make") self.list_ctrl.InsertColumn(1, "Model") self.list_ctrl.InsertColumn(2, "Year") self.list_ctrl.InsertColumn(3, "Color") index = 0 for row in rows: self.list_ctrl.InsertStringItem(index, row[0]) self.list_ctrl.SetStringItem(index, 1, row[1]) self.list_ctrl.SetStringItem(index, 2, row[2]) self.list_ctrl.SetStringItem(index, 3, row[3]) if index % 2: self.list_ctrl.SetItemBackgroundColour(index, "white") else: self.list_ctrl.SetItemBackgroundColour(index, "yellow") index += 1 btn = wx.ToggleButton(self, label="Toggle Dark") btn.Bind(wx.EVT_TOGGLEBUTTON, self.onToggleDark) normalBtn = wx.Button(self, label="Test") sizer = wx.BoxSizer(wx.VERTICAL) sizer.Add(self.list_ctrl, 0, wx.ALL|wx.EXPAND, 5) sizer.Add(btn, 0, wx.ALL, 5) sizer.Add(normalBtn, 0, wx.ALL, 5) self.SetSizer(sizer) #---------------------------------------------------------------------- def onToggleDark(self, event): """""" darkMode.darkMode(self, self.defaultColor) ######################################################################## class MyFrame(wx.Frame): """""" #---------------------------------------------------------------------- def __init__(self): """Constructor""" wx.Frame.__init__(self, None, wx.ID_ANY, "MvP ListCtrl Dark Mode Demo") panel = MyPanel(self) self.Show() #---------------------------------------------------------------------- if __name__ == "__main__": app = wx.App(False) frame = MyFrame() app.MainLoop() If you run the program above, you should see something like this: If you click the ToggleButton, you should see something like this: Notice how the toggle button was unaffected by the SetBackgroundColour method. Also notice that the list control’s column headers don’t change colors either. Unfortunately, wxPython doesn’t expose access to the column headers, so there’s no way to manipulate their color. Anyway, let’s take a moment to see how the dark mode code is used. First we need to import it. In this case, the module is called darkMode. To actually call it, we need to look at the ToggleButton’s event handler: darkMode.darkMode(self, self.defaultColor) As you can see, all we did was call darkMode.darkMode with the panel object (the “self) and a defaultColor that we set at the beginning of the wx.Panel’s init method. That’s all we had to do too. We should probably set it up with a variable to catch the returned value, but for this example we don’t really care. Wrapping Up Now we’re done and you too can create a “dark mode” for your applications. At some point, I’d like to generalize this some more to make into a color changer script where I can pass whatever colors I want to it. What would be really cool is to make it into a mixin. But that’s something for the future. For now, enjoy! Further Reading ObjectListView documentation An ObjectListView tutorial wx.ListCtrl documentation Source Code 2011-11-5-wxPython-dark-mode You can also pull the source from Bitbucket Source: http://www.blog.pythonlibrary.org/2011/11/05/wxpython-creating-a-dark-mode/
February 26, 2012
by Mike Driscoll
· 7,680 Views
  • Previous
  • ...
  • 514
  • 515
  • 516
  • 517
  • 518
  • 519
  • 520
  • 521
  • 522
  • 523
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×