Databases Resources

The Latest Databases Topics

F1 Live Timing Map

This is a live timing map application for F1 championship races, built with JavaScript and Google Maps markers. The live timing data is supplied by formula1.com. It is interactive: you can press on a driver to track him, or press on an empty map zone to stop tracking and return to the general view. It also has a responsive design, built with the jQuery Mobile framework, so that it adapts to mobile browsers.

How it works

The client side: until the race start date, a countdown and a demo race are shown. When the countdown finishes, the client connects to the server (using AJAX) and requests the live timing data every five seconds, updating the interface with each response.

The server side: a Django app serves the web page, and the static race data (circuit, laps, drivers) is put into the HTML using the Django template system. For the dynamic data (live timing), I have modified the source of live-f1, a C program for the Linux terminal, so that it generates JSON with the data the client requires instead of printing it on the terminal screen.

Enjoy the race!

April 12, 2012

by Luis Sobrecueva

· 16,116 Views

Configuring Quartz With JDBCJobStore in Spring

I am starting a short series about Quartz scheduler internals, tips, and tricks. This is chapter 0: how to configure a persistent job store.

April 7, 2012

by Tomasz Nurkiewicz

· 37,803 Views

Wrapping Begin/End Async API Into C#5 Tasks

Microsoft has offered programmers several different ways of dealing with asynchronous programming since .NET 1.0. The first model was the Asynchronous Programming Model, or APM for short. The pattern is implemented with two methods named BeginOperation and EndOperation. .NET 4 introduced a new pattern, the Task-based Asynchronous Pattern, and with .NET 4.5 Microsoft added language support for a language-integrated asynchronous coding style. You can check MSDN for more samples and information. I will assume that you are familiar with it and have written code using it.

You can wrap an existing APM pattern into the TPL pattern using the Task.Factory.FromAsync methods. For example:

    public static Task<IEnumerable<T>> ExecuteAsync<T>(this DataServiceQuery<T> query, object state)
    {
        return Task.Factory.FromAsync<IEnumerable<T>>(query.BeginExecute, query.EndExecute, state);
    }

It is easy to wrap most asynchronous functions this way, but some cannot be wrapped, because the wrapper functions assume that the last two parameters of BeginOperation are AsyncCallback and object, and some asynchronous operations have different signatures.

Examples

Extra parameters after the object state parameter:

    IAsyncResult DataServiceContext.BeginExecuteBatch(
        AsyncCallback callback, object state, params DataServiceRequest[] queries);

Missing the expected object state parameter, and a different return type:

    ICancelableAsyncResult BeginQuery(AsyncCallback callBack);
    WorkItemCollection EndQuery(ICancelableAsyncResult car);

Short solution for the first example

The short and elegant way to wrap the first example is to provide the following wrapper:

    public static Task<DataServiceResponse> ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries)
    {
        if (context == null)
            throw new ArgumentNullException("context");

        return Task.Factory.FromAsync<DataServiceResponse>(
            context.BeginExecuteBatch(null, state, queries),
            context.EndExecuteBatch);
    }

We simply call the Begin method ourselves and then wrap the returned IAsyncResult using another overload of the FromAsync function.

The longer way

However, we can fully wrap it ourselves by simulating what the FromAsync wrapper does. The complete code is listed below.
    public static Task<DataServiceResponse> ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries)
    {
        // this will be our sentry that will know when our async operation is completed
        var tcs = new TaskCompletionSource<DataServiceResponse>();
        try
        {
            context.BeginExecuteBatch((iar) =>
            {
                try
                {
                    // EndExecuteBatch takes the plain IAsyncResult handed to the callback
                    var result = context.EndExecuteBatch(iar);
                    tcs.TrySetResult(result);
                }
                catch (OperationCanceledException)
                {
                    // if the inner operation was canceled, this task is canceled too
                    tcs.TrySetCanceled();
                }
                catch (Exception ex)
                {
                    // a general exception has been set
                    tcs.TrySetException(ex);
                    // FromAsync additionally marks a ThreadAbortException as handled
                    // through internal Task members (m_contingentProperties), which
                    // are not accessible from user code, so that step is omitted here.
                }
            }, state, queries);
        }
        catch
        {
            tcs.TrySetResult(default(DataServiceResponse));
            // propagate exceptions to the outside
            throw;
        }
        return tcs.Task;
    }

Besides the educational benefits, writing the full wrapper code allows us to add cancellation, logging, and diagnostic information. Once we understand how to wrap the APM pattern, we can tackle the second problem easily.

Handling BeginQuery/EndQuery

We will first create our own wrapper function in the spirit of the above code, with the notable difference that we use the ICancelableAsyncResult interface instead of IAsyncResult.
    public static class TaskEx
    {
        public static Task<TResult> FromAsync<TResult>(
            Func<AsyncCallback, ICancelableAsyncResult> beginMethod,
            Func<ICancelableAsyncResult, TResult> endMethod)
        {
            if (beginMethod == null)
                throw new ArgumentNullException("beginMethod");
            if (endMethod == null)
                throw new ArgumentNullException("endMethod");

            var tcs = new TaskCompletionSource<TResult>();
            try
            {
                beginMethod((iar) =>
                {
                    try
                    {
                        var result = endMethod(iar as ICancelableAsyncResult);
                        tcs.TrySetResult(result);
                    }
                    catch (OperationCanceledException)
                    {
                        tcs.TrySetCanceled();
                    }
                    catch (Exception ex)
                    {
                        // the framework's internal ThreadAbortException bookkeeping
                        // is not accessible from user code and is omitted here
                        tcs.TrySetException(ex);
                    }
                });
            }
            catch
            {
                tcs.TrySetResult(default(TResult));
                throw;
            }
            return tcs.Task;
        }
    }

The code is pretty self-explanatory, and we can go ahead with the wrapping. Four different operations are exposed in both synchronous and asynchronous versions: Query, LinkQuery, CountOnlyQuery, and RegularQuery. The extension methods are short, since we have already created our generic wrapper above:

    public static Task<WorkItemCollection> RunQueryAsync(this Query query)
    {
        return TaskEx.FromAsync<WorkItemCollection>(query.BeginQuery, query.EndQuery);
    }

    public static Task RunLinkQueryAsync(this Query query)
    {
        return TaskEx.FromAsync(query.BeginLinkQuery, query.EndLinkQuery);
    }

    public static Task RunCountOnlyQueryAsync(this Query query)
    {
        return TaskEx.FromAsync(query.BeginCountOnlyQuery, query.EndCountOnlyQuery);
    }

    public static Task RunRegularQueryAsync(this Query query)
    {
        return TaskEx.FromAsync(query.BeginRegularQuery, query.EndRegularQuery);
    }

That is it for today. You can easily write your own handy extensions for the APM functions out there.
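The same completion-source idea carries over to other runtimes. As a rough analogue only (this is Python's asyncio, not the .NET API, and `begin_query` is a hypothetical callback-style function invented for illustration), a future can play the role of TaskCompletionSource:

```python
import asyncio
import threading


def begin_query(callback):
    """Hypothetical callback-style API: works on another thread,
    then calls callback(result, error) exactly once."""
    def worker():
        callback("42 work items", None)
    threading.Thread(target=worker).start()


def from_callback(begin):
    """Wrap a callback-style 'begin' function in an awaitable future."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()  # plays the role of TaskCompletionSource

    def on_done(result, error):
        # The callback may fire on a foreign thread, so hop back onto the loop.
        def settle():
            if future.done():
                return
            if error is not None:
                future.set_exception(error)
            else:
                future.set_result(result)
        loop.call_soon_threadsafe(settle)

    begin(on_done)  # exceptions raised while starting propagate to the caller
    return future


async def main():
    return await from_callback(begin_query)


result = asyncio.run(main())
print(result)
```

As in the C# version, the wrapper starts the operation itself and only completes the future from inside the callback, so callers never see the callback plumbing.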

April 2, 2012

by Toni Petrina

· 11,308 Views

Cassandra Indexing: The Good, the Bad and the Ugly

Within NoSQL, the operations of indexing, fetching, and searching for information are intimately tied to the physical storage mechanisms. It is important to remember that rows are stored across hosts, but a single row is stored on a single host (with replicas). Column families are stored in sorted order, which makes querying a set of columns efficient (provided you are not spanning rows).

The Bad: Partitioning

One of the tough things to get used to at first is that, without any indexes, queries that span rows can be (very) bad. Thinking back to our storage model, however, that isn't surprising. The strategy that Cassandra uses to distribute rows across hosts is called partitioning.

Partitioning is the act of carving up the range of rowkeys and assigning them into the "token ring", which also assigns responsibility for a segment (i.e. partition) of the rowkey range to each host. You've probably seen this when you initialized your cluster with a "token". The token gives the host a location along the token ring, which assigns responsibility for a section of the token range. Partitioning is the act of mapping the rowkey into the token range.

There are two primary partitioners: Random and Order Preserving. They are appropriately named. With the RandomPartitioner, the token is a hash of the rowkey. This does a good job of evenly distributing your data across a set of nodes, but makes querying a range of the rowkey space incredibly difficult. From only a "start rowkey" value and an "end rowkey" value, Cassandra can't determine what range of the token space you need. It essentially needs to perform a "table scan" to answer the query, and a "table scan" in Cassandra is bad because it needs to go to each machine (most likely ALL machines if you have a good hash function) to answer the query.

Now, at the great cost of even data distribution, you can employ the OrderPreservingPartitioner (OPP). I am *not* down with OPP. The OPP preserves order as it translates rowkeys into tokens. Now, given a start rowkey value and an end rowkey value, Cassandra *can* determine exactly which hosts have the data you are looking for. It converts the start value to a token and the end value to a token, and simply selects and returns everything in between. BUT, by preserving order, unless your rowkeys are evenly distributed across the space, your tokens won't be either, and you'll get a lopsided cluster, which greatly increases the cost of configuring and administering the cluster (not worth it).

The Good: Secondary Indexes

Cassandra does provide a native indexing mechanism in secondary indexes. Secondary indexes work off of the column values. You declare a secondary index on a column family. DataStax has good documentation on the usage. Under the hood, Cassandra maintains a "hidden column family" as the index. (See Ed Anuff's presentation for specifics.) Since Cassandra doesn't maintain column value information in any one node, and secondary indexes are on column values (rather than rowkeys), a query still needs to be sent to all nodes. Additionally, secondary indexes are not recommended for high-cardinality sets. I haven't looked yet, but I'm assuming this is because of the data model used within the "hidden column family". If the hidden column family stores a row per unique value (with rowkeys as columns), then it would mean scanning the rows to determine whether they are within the range in the query.

From Ed's presentation:

- Not recommended for high-cardinality values (i.e. timestamps, birthdates, keywords, etc.)
- Requires at least one equality comparison in a query; not great for less-than/greater-than/range queries
- Unsorted: results are in token order, not query value order
- Limited to search on datatypes Cassandra natively understands

With all that said, secondary indexes work out of the box, and we've had good success using them on simple values.
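To make the partitioning trade-off described earlier concrete, here is a minimal Python sketch. It is a toy model, not Cassandra's implementation: the four-node ring, the node names, and the `user0000`-style rowkeys are all assumptions made up for illustration.

```python
import hashlib
from bisect import bisect_left


def md5_token(rowkey):
    """Hash-style token: an integer derived from the MD5 of the rowkey."""
    return int.from_bytes(hashlib.md5(rowkey.encode("utf-8")).digest(), "big")


def owner(token, ring):
    """ring is a sorted list of (token, host); a host owns every token up to
    and including its own, and tokens past the last entry wrap to the first."""
    tokens = [t for t, _ in ring]
    return ring[bisect_left(tokens, token) % len(ring)][1]


# Hypothetical four-node rings (assumed for this sketch).
hashed_ring = [(2 ** 126 * k, "node-%d" % k) for k in range(1, 5)]
ordered_ring = [("g", "node-1"), ("n", "node-2"), ("t", "node-3"), ("z", "node-4")]

# A contiguous range of rowkeys, as a range query would request.
rowkeys = ["user%04d" % i for i in range(100)]

# Hash partitioning scatters the range over the ring.
hashed_hosts = {owner(md5_token(k), hashed_ring) for k in rowkeys}
# Order-preserving partitioning keeps the range together, on a single host here.
ordered_hosts = {owner(k, ordered_ring) for k in rowkeys}

print(sorted(hashed_hosts))
print(sorted(ordered_hosts))
```

The second set shows both sides of the OPP bargain: the range query touches one host, but that host also holds every `user...` key, which is the lopsided cluster the article warns about.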
The Ugly: Do-It-Yourself (DIY) / Wide-Rows

Now, beauty is in the eye of the beholder. One of the beautiful things about NoSQL is the simplicity. The constructs are simple: keyspaces, column families, rows, and columns. Keeping it simple, however, means that sometimes you need to take things into your own hands.

This is the case with wide-row indexes. Utilizing Cassandra's storage model, it's easy to build your own indexes where each rowkey becomes a column in the index. This is sometimes hard to get your head around, but let's imagine we have a case where we want to select all users in a zip code. The main users column family is keyed on userid, and zip code is a column on each user row. We could use secondary indexes, but there are quite a few zip codes. Instead, we could maintain a column family with a single row called "idx_zipcode". We could then write columns into this row of the form "zipcode_userid". Since the columns are stored in sorted order, it is fast to query for all columns that start with "18964" (e.g. we could use 18964_ and 18964_ZZZZZZ as start and end values).

One obvious downside of this approach is that rows are self-contained on a host (again, except for replicas). This means that all queries are going to hit a single node. I haven't yet found a good answer for this.

Additionally, and IMHO, the ugliest part of DIY wide-row indexing is from a client perspective. In our implementation, we've done our best to be language agnostic on the client side, allowing people to pick the best tool for the job to interact with the data in Cassandra. With that mentality, the DIY indexes present some trouble. Wide-rows often use composite keys (imagine if you had an idx_state_zip, which would allow you to query by state and then zip). Although there is "native" support for composite keys, all of the client libraries implement their own version of them (Hector, Astyanax, and Thrift). This means that a client needing to query data needs the added logic to first query the index, and additionally all clients need to construct the composite key in the same manner.

Making It Better...

For this very reason, we've decided to release two open source projects that help push this logic to the server side. The first project is Cassandra-Triggers. This allows you to attach asynchronous activities to writes in Cassandra (one such activity could be indexing). We've also released Cassandra-Indexing. This is hot off the presses and is still in its infancy (e.g. it only supports UTF8Type in the index), but the intent is to provide a generic server-side mechanism that indexes data as it's written to Cassandra. Employing the same server-side technique we used in Cassandra-Indexing, you simply configure the columns you want indexed, and the AOP code does the rest as you write to the target CF.

As always, questions, comments, and thoughts are welcome (especially if I'm off-base somewhere).
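The wide-row zip code index described above can be sketched in a few lines of Python. This is only an in-memory model of one sorted row, not Cassandra client code; the `WideRowIndex` class and the user IDs are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort


class WideRowIndex:
    """Models a single index row (e.g. "idx_zipcode") whose column names
    are kept in sorted order, as Cassandra keeps columns within a row."""

    def __init__(self):
        self._columns = []

    def add(self, zipcode, userid):
        # Column name has the form "zipcode_userid".
        insort(self._columns, "%s_%s" % (zipcode, userid))

    def users_in(self, zipcode):
        # Slice between the start and end values suggested in the article.
        # Caveat: the "ZZZZZZ" end marker only covers user IDs that sort
        # before it, so lowercase IDs would need a different end value.
        start = "%s_" % zipcode
        end = "%s_ZZZZZZ" % zipcode
        lo = bisect_left(self._columns, start)
        hi = bisect_right(self._columns, end)
        return [name.split("_", 1)[1] for name in self._columns[lo:hi]]


index = WideRowIndex()
index.add("18964", "U1002")
index.add("19000", "U2001")
index.add("18964", "U1001")
index.add("18963", "U0001")

print(index.users_in("18964"))
```

Because the column names are sorted, the lookup is two binary searches and a slice, which is why the range scan inside a single row is cheap.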

March 23, 2012

by Brian O' Neill

· 35,617 Views

PHP objects in MongoDB with Doctrine

An Object-Document Mapper (ODM) is the equivalent of an Object-Relational Mapper, but its targets are documents of a NoSQL database instead of table rows. No one said that a Data Mapper must always rely on a relational database as its back end. In the PHP world, the Doctrine ODM for MongoDB is probably the most successful. This follows from the popularity of Mongo, which is a transitional product between SQL and NoSQL, still based on some relational concepts like queries.

Lots of features

The Doctrine Mongo ODM supports mapping of objects via annotations placed in the class source code, or via external XML or YAML files. In this and in many other aspects it is based on the same concepts as the Doctrine ORM: it features a Facade DocumentManager object and a Unit Of Work that batches changes to the database when objects are added to it.

Moreover, two different types of relationships between objects are supported: references and embedded documents. The first is the equivalent of the classical pointer to another row, which an ORM always transforms object references into; the second actually stores an object inside another one, like you would do with a Value Object. Thus, at least in Doctrine's case, it is easier to map objects as documents than as rows.

As said before, the ODM borrows some concepts and classes from the ORM, in particular from the Doctrine\Common package, which features a standard collection class. So if you have built objects mapped with the Doctrine ORM, nothing changes for persisting them in MongoDB, except for the mapping metadata itself.

Advantages

If an ORM is sometimes a leaky abstraction, an ODM probably becomes an issue less often. It has less overhead than an ORM, since there is no schema to define, and the ability to embed objects means there should be no compromises between the object model and the capabilities of the database. How many times have we given up on introducing a potential Value Object because of the difficulty in persisting it?
The case for an ODM over a plain Mongo connection object is easy to make: you will still be able to use objects with proper encapsulation (like private fields and associations) and behavior (many methods) instead of extracting just a JSON package from your database.

Installation

A prerequisite for the ODM is the presence of the mongo extension, which can be installed via pecl. After having verified that the extension is present, grab Doctrine\Common as the 2.2.x package, and a zip of the doctrine-mongodb and doctrine-mongodb-odm projects from GitHub. Decompress everything into a Doctrine/ folder. After having set up autoloading for classes in Doctrine\, use this bootstrap to get a DocumentManager (the equivalent of EntityManager):

    use Doctrine\Common\Annotations\AnnotationReader,
        Doctrine\ODM\MongoDB\DocumentManager,
        Doctrine\MongoDB\Connection,
        Doctrine\ODM\MongoDB\Configuration,
        Doctrine\ODM\MongoDB\Mapping\Driver\AnnotationDriver;

    private function getADm()
    {
        $config = new Configuration();
        $config->setProxyDir(__DIR__ . '/mongocache');
        $config->setProxyNamespace('MongoProxies');
        $config->setDefaultDB('test');
        $config->setHydratorDir(__DIR__ . '/mongocache');
        $config->setHydratorNamespace('MongoHydrators');
        $reader = new AnnotationReader();
        $config->setMetadataDriverImpl(new AnnotationDriver($reader, __DIR__ . '/Documents'));
        return DocumentManager::create(new Connection(), $config);
    }

You will be able to call persist() and flush() on the DocumentManager, along with a set of other methods for querying, like find() and getRepository().

Integration with an ORM

We are researching a solution for versioning objects mapped with the Doctrine ORM. Doing this with a version column would be invasive, and also strange where multiple objects are involved (do you version just the root of an object graph? Duplicate the other ones when they change? How can you detect that?).
The idea is to take a snapshot and put it in a read-only MongoDB instance, where all previous versions can be retrieved later for auditing (business reasons). This has been verified to be technically possible: the DocumentManager and EntityManager are totally separate object graphs, so they won't clash with each other. The only point of conflict is the annotations of model classes, since the two use different versions of @Id, and each can see the other's annotations, like @Entity and @Document, while parsing. This can be solved by using aliases for all the annotations, using their parent namespace basename as a prefix:

    model = $model;
        }

        public function __toString()
        {
            return "Car #$this->document_id: $this->id, $this->model";
        }
    }

This lets us save a copy of an ORM object into Mongo:

    $car = new Car('Ford');
    $this->em->persist($car);
    $this->em->flush();
    $this->dm->persist($car);
    $this->dm->flush();
    var_dump($car->__toString());
    $this->assertTrue(strlen($car->__toString()) > 20);

The output produced by this test is:

    .string(38) "Car #4f61a8322f762f1121000000: 3, Ford"

When retrieving the object, one of the two ids will be null, as it is ignored by the ORM or the ODM. I am not using the same field because I want to store multiple copies of a row, so its id alone won't be unique.

If you're interested, check out my hack on GitHub. It contains the running example presented in this post. Remember to create the relational schema with:

    $ php doctrine.php orm:schema-tool:create

before running the test with:

    phpunit --bootstrap bootstrap.php DoubleMappingTest.php

MongoDB won't need the schema setup, of course. There are still some use cases to test, like the behavior in the presence of proxies, but it seems that the non-invasive approach of Data Mappers like Doctrine 2 is paying off: try mapping an object in multiple databases with Active Record.

March 20, 2012

by Giorgio Sironi

· 22,481 Views

Adding a .first() method to Django's QuerySet

In my last Django project, we had a set of helper functions that we used a lot. The most used was helpers.first, which takes a query set and returns the first element, or None if the query set is empty.

Instead of writing this:

    try:
        object = MyModel.objects.get(key=value)
    except MyModel.DoesNotExist:
        object = None

You can write this:

    def first(query):
        try:
            return query.all()[0]
        except IndexError:
            # an empty query set has no element at index 0
            return None

    object = helpers.first(MyModel.objects.filter(key=value))

Note that this is not identical. The get method ensures that there is exactly one row in the database that matches the query, and raises an exception otherwise. The helpers.first() function will silently eat all but the first matching row. As long as you're aware of that, you might choose to use the second form in some cases, primarily for style reasons.

But the syntax of the helper is a little verbose, plus you're constantly including helpers.py. Here is a version that makes this available as a method at the end of your query set chain. All you have to do is have your models inherit from this AbstractModel.

    from django.db import models

    class FirstQuerySet(models.query.QuerySet):
        def first(self):
            try:
                return self[0]
            except IndexError:
                return None

    class ManagerWithFirstQuery(models.Manager):
        def get_query_set(self):
            return FirstQuerySet(self.model)

    class AbstractModel(models.Model):
        objects = ManagerWithFirstQuery()

        class Meta:
            abstract = True

    class MyModel(AbstractModel):
        ...

Now you can do the following:

    object = MyModel.objects.filter(key=value).first()
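The difference between "first match or None" and "exactly one match" can be seen without a database. The following is a plain-Python sketch over ordinary iterables, not Django code; `get_exactly_one` is a hypothetical stand-in for the strictness of `get`.

```python
def first(iterable, default=None):
    """Return the first element, or default if the iterable is empty."""
    return next(iter(iterable), default)


def get_exactly_one(iterable):
    """Return the only element; raise if there are zero or several."""
    items = list(iterable)
    if len(items) != 1:
        raise LookupError("expected exactly one match, found %d" % len(items))
    return items[0]


matches = ["alice", "bob"]

print(first(matches))   # silently ignores every match after the first
print(first([]))        # falls back to the default

try:
    get_exactly_one(matches)
except LookupError as error:
    print(error)
```

Which behavior is right depends on whether a duplicate match is a data error you want to hear about or a detail you are happy to ignore.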

March 19, 2012

by Chase Seibert

· 12,651 Views

Display an OLE Object from a Microsoft Access Database using OLE Stripper

In database programming, you often need to bind a picture box to a field of type photo or image. For example, if you want to show an employee's picture from the Northwind.mdb database, you might want to try the following code:

    picEmployees.DataBindings.Add("Image", bsEmployees, "Photo", true);

This code works if the images are stored in the database as raw image files with no OLE header. The pictures in the Northwind database, however, are not stored in a raw image file format but as OLE image documents, so you have to strip off the OLE header to work with the image properly.

    Binding imageBinding = new Binding("Image", bsEmployees, "ImageBlob.ImageBlob", true);
    imageBinding.Format += new ConvertEventHandler(this.PictureFormat);

    private void PictureFormat(object sender, ConvertEventArgs e)
    {
        Byte[] img = (Byte[])e.Value;
        MemoryStream ms = new MemoryStream();
        // skip the OLE header that precedes the image data
        int offset = 78;
        ms.Write(img, offset, img.Length - offset);
        Bitmap bmp = new Bitmap(ms);
        ms.Close();
        // Writes the new value back
        e.Value = bmp;
    }

Fortunately, some overloads in the .NET Framework take care of this mechanism, although there is no guarantee that you will not still need to strip off the OLE header yourself. For example, you can use the following technique to access the images of the Northwind.mdb that ships with Microsoft Access, and they will be rendered properly.

    picEmployees.DataBindings.Add("Image", bsEmployees, "Photo", true,
        DataSourceUpdateMode.Never, new Bitmap(typeof(Button), "Button.bmp"));

Unfortunately, there are some scenarios in which you need a better solution. For example, the Xtreme.mdb database that ships with Crystal Reports has a photo field that cannot be handled by the preceding methods. For these complex scenarios, you can download the OLEStripper classes and rewrite the PictureFormat method as shown below:

    private void PictureFormat(object sender, ConvertEventArgs e)
    {
        // photoIndex is same as Employee ID
        int photoIndex = Convert.ToInt32(e.Value);

        // Read the original OLE object
        ReadOLE olePhoto = new ReadOLE();
        string PhotoPath = olePhoto.GetOLEPhoto(photoIndex);

        // Strip the original OLE object
        StripOLE stripPhoto = new StripOLE();
        string StripPhotoPath = stripPhoto.GetStripOLE(PhotoPath);

        FileStream PhotoStream = new FileStream(StripPhotoPath, FileMode.Open);
        Image EmployeePhoto = Image.FromStream(PhotoStream);
        e.Value = EmployeePhoto;
        PhotoStream.Close();
    }
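The fixed-offset approach from the first PictureFormat method can be illustrated in Python on raw bytes. This is a sketch under two assumptions: the header is exactly 78 bytes, as in the article's Northwind example, and the payload is a BMP file, which starts with the bytes "BM". It is not a general OLE parser, and the sample blob is fabricated for the demonstration.

```python
OLE_HEADER_LENGTH = 78  # offset used in the article for Northwind photos


def strip_ole_header(blob, offset=OLE_HEADER_LENGTH):
    """Drop a fixed-length OLE header and return the image bytes."""
    if len(blob) <= offset:
        raise ValueError("blob is shorter than the OLE header")
    image = blob[offset:]
    # Assumption: the wrapped image is a BMP, whose files begin with b"BM".
    if not image.startswith(b"BM"):
        raise ValueError("no bitmap signature after the header")
    return image


# Fabricated test blob: 78 header bytes followed by a pretend bitmap.
fake_blob = bytes(OLE_HEADER_LENGTH) + b"BM" + b"pixel-data"
image_bytes = strip_ole_header(fake_blob)
print(len(fake_blob), len(image_bytes))
```

The signature check is what makes the failure visible when a database uses a different header length, which is the situation the OLEStripper classes are meant to handle.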

March 15, 2012

by Amir Ahani

· 11,140 Views

Circos: An Amazing Tool for Visualizing Big Data

Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights is where things really start to get interesting. Lately, I've been exploring various options for visualizing (directed) graphs, including Circos. Circos is an amazing software package that visualizes your data through a circular layout. Although it was originally designed for displaying genomic data, it lets you create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go.

The figure below illustrates the core concept behind Circos. The table's columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments. The ribbons themselves are proportional in width to the value in the cell.

When visualizing a directed graph, nodes are displayed as segments on the circle, and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table. In my case, I want to better understand the flow of visitors to and within the Datablend site and blog: where do visitors come from (direct, referral, search, ...) and how do they navigate between pages? The rest of this article details how to 1) retrieve the raw visit information through the Google Analytics API, 2) persist this information as a graph in Neo4j, and 3) query and preprocess this data for visualization through Circos. As always, the complete source code can be found on the Datablend public GitHub repository.

1. Retrieving your Google Analytics data

Let's start by retrieving the raw Google Analytics data.
The Google Analytics Data API provides access to all dimensions and metrics that can be queried through the web application. In my case, I'm interested in retrieving the previous page path property for each page view. If a visitor enters through a page outside of the Datablend website, the previous page path is marked as (entrance). Otherwise, it contains the internal path. We will use Google's Java Data API to connect and retrieve this information. We are particularly interested in the pagePath, pageTitle, previousPagePath, and medium dimensions, while our metric of choice is the number of pageviews. After setting the date range, the feed of entries that satisfy these criteria can be retrieved. For ease of use, we transform this data into a domain entity and filter/clean the data accordingly. If a visit originates from outside the Datablend website, we store the specific medium (direct, referral, search, ...) as the previous path.

    // authenticate
    analyticsService = new AnalyticsService(Configuration.SERVICE);
    analyticsService.setUserCredentials(Configuration.CLIENT_USERNAME, Configuration.CLIENT_PASS);

    // create query
    DataQuery query = new DataQuery(new URL(Configuration.DATA_URL));
    query.setIds(Configuration.TABLE_ID);
    query.setDimensions("ga:medium,ga:previousPagePath,ga:pagePath,ga:pageTitle");
    query.setMetrics("ga:pageviews");
    query.setStartDate(dateString);
    query.setEndDate(dateString);

    // execute
    DataFeed feed = analyticsService.getFeed(createQueryUrl(date), DataFeed.class);

    // iterate and clean
    for (DataEntry entry : feed.getEntries()) {
        String pagePath = entry.stringValueOf("ga:pagePath");
        String pageTitle = entry.stringValueOf("ga:pageTitle");
        String previousPagePath = entry.stringValueOf("ga:previousPagePath");
        String medium = entry.stringValueOf("ga:medium");
        long views = entry.longValueOf("ga:pageviews");
        // filter the data
        if (filter(pagePath) && filter(previousPagePath) && (!clean(previousPagePath).equals(clean(pagePath)))) {
            // check criteria are satisfied
            Navigation navigation = new Navigation(clean(previousPagePath), clean(pagePath), pageTitle, date, views);
            if (navigation.getSource().equals("(entrance)")) {
                // in case of an entrance, save its medium instead
                navigation.setSource(medium);
            }
            navigations.add(navigation);
        }
    }

2. Storing navigational data as a directed graph in Neo4j

The set of site navigations can easily be stored as a directed graph in the Neo4j graph database. Nodes are site paths (or mediums), while relationships are the navigations themselves. We start by retrieving the navigations for a particular date range and retrieve (or lazily create) the nodes representing the source and target paths (or mediums). Next, we de-normalize the pageviews metric (for instance, 6 individual relationships will be created for 6 page views). Although this de-normalization step is not really required, I did so to make sure that the degree of my nodes is correct if I were to perform other types of calculations. For each individual navigation relationship, we also store the date of the visit.

    // retrieve navigations for a particular date
    List<Navigation> navigations = retrieval.getNavigations(date);

    // save them in the graph database
    Transaction tx = graphDb.beginTx();

    // iterate and create
    for (Navigation nav : navigations) {
        Node source = getPath(nav.getSource());
        Node target = getPath(nav.getTarget());
        if (!target.hasProperty("title")) {
            target.setProperty("title", nav.getTargetTitle());
        }
        for (long i = 0; i < nav.getAmount(); i++) {
            // duplicate relationships
            Relationship transition = source.createRelationshipTo(target, Relationships.NAVIGATION);
            transition.setProperty("date", date.getTime()); // save time as long
        }
    }

    // commit
    tx.success();
    tx.finish();

3. Creating the Circos tabular data format

The Circos tabular data format is quite easy to construct. It's basically a tab-delimited file with row and column headers. A cell is interpreted as a value that flows from the row entity to the column entity.
We will use the Neo4j Cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period. Doing so allows us to create historical visualizations of our navigations and observe how visit flow behaviors change over time.

    // access the graph database
    graphDb = new EmbeddedGraphDatabase("var/analytics");
    engine = new ExecutionEngine(graphDb);

    // execute the date range Cypher query
    Map<String, Object> params = new HashMap<String, Object>();
    params.put("fromdate", from.getTime());
    params.put("todate", to.getTime());

    // execute the query
    ExecutionResult result = engine.execute("start sourcepath=node:index(\"path:*\") " +
                                            "match sourcepath-[r]->targetpath " +
                                            "where r.date >= {fromdate} and r.date <= {todate} " +
                                            "return sourcepath,targetpath", params);

Next, we create the tab-delimited file itself. We iterate through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we start building the two-dimensional array by normalizing (i.e. summing) the number of navigations between the source and target paths. At the end, we filter this occurrence matrix on the minimal number of required navigations. This ensures that we only create segments for paths that are relevant in the total population. As a final step, we print the occurrence matrix as a tab-delimited file. For each path, we use a shorthand, as the Circos renderer seems to have problems with long string identifiers.

    // retrieve the results
    Iterator<Map<String, Object>> it = result.javaIterator();
    List<Navigation> navigations = new ArrayList<Navigation>();
    Map<String, String> titles = new HashMap<String, String>();
    Set<String> paths = new HashSet<String>();

    // iterate the results
    while (it.hasNext()) {
        Map<String, Object> record = it.next();
        String source = (String) ((Node) record.get("sourcepath")).getProperty("path");
        String target = (String) ((Node) record.get("targetpath")).getProperty("path");
        String targetTitle = (String) ((Node) record.get("targetpath")).getProperty("title");
        // reuse the navigation object as temporary holder
        navigations.add(new Navigation(source, target, targetTitle, new Date(), 1));
        paths.add(source);
        paths.add(target);
        if (!titles.containsKey(target)) {
            titles.put(target, targetTitle);
        }
    }

    // retrieve the various paths
    List<String> pathIds = Arrays.asList(paths.toArray(new String[]{}));

    // create the matrix that holds the info
    int[][] occurences = new int[pathIds.size()][pathIds.size()];

    // iterate through all the navigations and update accordingly
    for (Navigation navigation : navigations) {
        int sourceIndex = pathIds.indexOf(navigation.getSource());
        int targetIndex = pathIds.indexOf(navigation.getTarget());
        occurences[sourceIndex][targetIndex] = occurences[sourceIndex][targetIndex] + 1;
    }

    // matrix built, filter on threshold
    for (int i = 0; i < occurences.length; i++) {
        for (int j = 0; j < occurences.length; j++) {
            if (occurences[i][j] < threshold) {
                occurences[i][j] = 0;
            }
        }
    }

    // print
    printCircosData(pathIds, titles, occurences);

The text below is a sample of the output generated by the printCircosData method. It first prints the legend (matching shorthands with actual paths). Next, it prints the tab-delimited Circos table.
link0 - /?p=411/wp-admin - storing and querying rdf data in neo4j through sail - datablend link1 - /?p=1146 - visualizing rdf schema inferencing through neo4j, tinkerpop, sail and gephi - datablend link2 - /?p=164 - big data / concise articles - datablend link3 - referral - null link4 - /?p=1400 - the joy of algorithms and nosql revisited: the mongodb aggregation framework - datablend ... datal0l1l2l3l4... l000000 l100000 l200000 l3059400197 l400000 4. use the circos power although circos can be installed on your local computer, we will use its online version to create the visualization of our data. upload your tab-delimited file and just wait a few seconds before enjoying the beautiful rendering of your site's navigation information. with just a glimpse of an eye we can already see that the l3-segment (i.e. the referrals) is significantly larger (almost 6000 navigations) compared to the others segments. the outer 3 rings visualize the total amounts of navigations that are leaving and entering this particular path. in case of referrals, no navigations have this path as target (indicated by the empty middle ring). its total segment count (inner ring) is entirely build up out of navigations that have a referral as source. the l6-segment seems to be the path that attracts the most traffic (around 2500 navigations). this segment visualizes the navigation data related to my "the joy of algorithms and nosql: a mongodb example" -article. most of its traffic is received through referrals, while a decent amount is also generated through direct (l17-segment) and search (l27-segment) traffic. the l15-segment (my blog's main page) is the only path that receives an almost equal amount of incoming and outgoing traffic. with just a few tweaks to the circos input data, we can easily focus on particular types of navigation data. in the figure below, i made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors. 5. 
conclusions in the era of big data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. circos specializes in a very specific type of visualization, but does its job extremely well. i would be delighted to hear about other types of visualizations for directed graphs.

March 13, 2012

by Davy Suvee

· 36,471 Views · 2 Likes

Joins with MapReduce

I have been reading up on the join implementations available for Hadoop for the past few days. In this post I recap some techniques I learnt during the process. Joins can be done on either the map side or the reduce side, depending on the nature of the data sets to be joined.

Reduce side join

Let’s take the following tables containing employee and department data. Let’s see how the join query below can be achieved using a reduce side join.

    select employees.name, employees.age, department.name from employees inner join department on employees.dept_id=department.dept_id

The map side is responsible for emitting the join predicate values along with the corresponding record from each table, so that records having the same department id in both tables end up at the same reducer, which then joins the records having that department id. However, each record must also be tagged to indicate which table it originated from, so that joining happens only between records of the two different tables. The following diagram illustrates the reduce side join process. Here is the pseudo code for the map function for this scenario.

    map (k table, v rec) {
        dept_id = rec.dept_id
        tagged_rec.tag = table
        tagged_rec.rec = rec
        emit(dept_id, tagged_rec)
    }

At the reduce side, the join happens between records having different tags.

    reduce (k dept_id, list tagged_recs) {
        for (tagged_rec : tagged_recs) {
            for (tagged_rec1 : tagged_recs) {
                if (tagged_rec.tag != tagged_rec1.tag) {
                    joined_rec = join(tagged_rec, tagged_rec1)
                    emit(tagged_rec.rec.dept_id, joined_rec)
                }
            }
        }
    }

Map side join (replicated join)

Using distributed cache on smaller table

For this implementation to work, one relation has to fit into memory. The smaller table is replicated to each node and loaded into memory. The join happens at the map side without reducer involvement, which significantly speeds up the process: it avoids shuffling all the data across the network, which a reduce side join does even though most of the non-matching records are later dropped.
The smaller table can be loaded into a hash table so that look-ups by dept_id can be done. The pseudo code is outlined below.

    map (k table, v rec) {
        list recs = lookup(rec.dept_id) // get smaller table records having this dept_id
        for (small_table_rec : recs) {
            joined_rec = join(small_table_rec, rec)
            emit(rec.dept_id, joined_rec)
        }
    }

Using distributed cache on filtered table

If the smaller table doesn’t fit into memory, it may be possible to prune its contents if a filtering expression has been specified in the query. Consider the following query.

    select employees.name, employees.age, department.name from employees inner join department on employees.dept_id=department.dept_id where department.name="eng"

Here a smaller data set can be derived from the department table by filtering out records having department names other than “eng”. Now it may be possible to do a replicated map side join with this smaller data set.

Replicated semi-join

Reduce side join with map side filtering

Even if the filtered data of the small table doesn’t fit into memory, it may be possible to include just the dept_ids of the filtered records in the replicated data set. Then at the map side this cache can be used to filter out records that would otherwise be sent over to the reduce side, thus reducing the amount of data moved between the mappers and reducers. The map side logic would look as follows.

    map (k table, v rec) {
        // check if this record needs to be sent to reducer
        boolean sendtoreducer = check_cache(rec.dept_id)
        if (sendtoreducer) {
            dept_id = rec.dept_id
            tagged_rec.tag = table
            tagged_rec.rec = rec
            emit(dept_id, tagged_rec)
        }
    }

The reducer side logic would be the same as in the reduce side join case.

Using a Bloom filter

A Bloom filter is a construct which can be used to test the containment of a given element in a set. A smaller representation of the filtered dept_ids can be derived if the dept_id values are added to a Bloom filter. This Bloom filter can then be replicated to each node.
At the map side, for each record fetched from the smaller table, the Bloom filter can be used to check whether the dept_id in the record is present in the Bloom filter, and only if so is that record emitted to the reduce side. Since a Bloom filter is guaranteed not to produce false negatives, the result would be accurate; a false positive only means that a non-matching record is sent to a reducer, where it finds no join partner and is dropped.

References
[1] Hadoop in Action
[2] Hadoop: The Definitive Guide
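The two basic strategies above can be compared side by side in a small single-process simulation. This is only a sketch in Python, not Hadoop code: the sample records, the function names, and the dictionary that stands in for the shuffle phase are all assumptions made for illustration. Unlike the nested-loop pseudo code, the reducer here pairs each employee record with each department record exactly once.

```python
from collections import defaultdict

# Hypothetical sample data; the original tables are not reproduced here.
EMPLOYEES = [
    {"name": "alice", "age": 30, "dept_id": 1},
    {"name": "bob", "age": 41, "dept_id": 2},
    {"name": "carol", "age": 25, "dept_id": 1},
]
DEPARTMENTS = [
    {"dept_id": 1, "name": "eng"},
    {"dept_id": 2, "name": "ops"},
    {"dept_id": 3, "name": "hr"},
]


def map_record(table, rec):
    """Emit the join key together with the record tagged by its table."""
    return rec["dept_id"], (table, rec)


def reduce_join(tagged_recs):
    """Join records that share a key but carry different tags."""
    emps = [rec for tag, rec in tagged_recs if tag == "employees"]
    depts = [rec for tag, rec in tagged_recs if tag == "department"]
    return [(e["name"], e["age"], d["name"]) for e in emps for d in depts]


def reduce_side_join(employees, departments):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for table, recs in (("employees", employees), ("department", departments)):
        for rec in recs:
            key, value = map_record(table, rec)
            groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key]))
    return joined


def replicated_join(employees, departments):
    """Map side join: the small table is held in memory as a hash table."""
    lookup = defaultdict(list)
    for dept in departments:
        lookup[dept["dept_id"]].append(dept)
    return [
        (e["name"], e["age"], d["name"])
        for e in employees
        for d in lookup.get(e["dept_id"], [])
    ]


if __name__ == "__main__":
    print(reduce_side_join(EMPLOYEES, DEPARTMENTS))
    print(replicated_join(EMPLOYEES, DEPARTMENTS))
```

Both functions produce the same joined rows; the difference is that the replicated variant never groups records by key, which corresponds to skipping the shuffle.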

March 12, 2012

by Buddhika Chamith

· 31,087 Views

Using Maven's -U Command Line Option

My preferred solution was to use the Maven ‘update snapshots’ command line argument.

March 11, 2012

by Roger Hughes

· 107,058 Views · 1 Like

Resetting the Database Connection in Django

Django handles database connections transparently in almost all cases. It will start a new connection when your request starts up, and commit it at the end of the request lifetime. Other times you need to dive in further and do your own granular transaction management. But for the most part, it's fully automatic. However, sometimes your use case may require that you close the current database connection and open a new one. While this is possible in Django, it's not well documented.

Why would you want to do this? In my case, I was writing an automation test framework. Some of the automation tests make database calls through the Django ORM to set up records, clean up after the test, etc. Each test is executed in the same process space, via a thread pool. We found that if one of the early tests threw an unrecoverable database error, such as an IntegrityError due to violating a unique constraint, the database connection would be aborted. Subsequent tests that tried to use the database would raise a DatabaseError:

Traceback (most recent call last):
  File "/home/user/project/app/test.py", line 73, in tearDown
    MyModel.objects.all()
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 444, in delete
    collector.collect(del_query)
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py", line 146, in collect
    reverse_dependency=reverse_dependency)
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py", line 91, in add
    if not objs:
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 113, in __nonzero__
    iter(self).next()
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 107, in _result_iter
    self._fill_cache()
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 772, in _fill_cache
    self._result_cache.append(self._iter.next())
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/query.py", line 273, in iterator
    for row in compiler.results_iter():
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py", line 680, in results_iter
    for rows in self.execute_sql(MULTI):
  File "/usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py", line 735, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python2.6/dist-packages/django/db/backends/postgresql_psycopg2/base.py", line 44, in execute
    return self.cursor.execute(query, args)
DatabaseError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

It turns out that it's relatively easy to reset the database connection. We just called the following function at the start of every test. Django is smart enough to re-initialize the connection the next time it's used, assuming that it's disconnected properly.

def reset_database_connection():
    from django import db
    db.close_connection()
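The idea behind this reset (close the connection now, let the next use reopen it lazily) can be illustrated without Django. The sketch below is an assumption-laden stand-in: `LazyConnection` is a hypothetical class invented for this example and is not part of Django's API. It only mimics the behavior described above, where a broken connection keeps failing until it is closed.

```python
class LazyConnection:
    """Hypothetical stand-in for a lazily opened database connection."""

    def __init__(self):
        self._handle = None   # no real connection until first use
        self.opened = 0       # how many times a connection was opened
        self.broken = False   # set when the server drops the connection

    def cursor(self):
        if self.broken:
            raise RuntimeError("server closed the connection unexpectedly")
        if self._handle is None:
            # Lazily (re)open on first use after creation or close().
            self._handle = object()
            self.opened += 1
        return self._handle

    def close(self):
        # Discard the old handle; the next cursor() call reconnects.
        self._handle = None
        self.broken = False


def reset_database_connection(conn):
    """Call at the start of every test so each test gets a fresh connection."""
    conn.close()


if __name__ == "__main__":
    conn = LazyConnection()
    conn.cursor()                      # first test opens the connection
    conn.broken = True                 # an earlier test killed it
    reset_database_connection(conn)    # start of the next test
    conn.cursor()                      # transparently reconnects
    print(conn.opened)
```

In a real project the function body stays as the one-liner shown in the post. As far as I know, later Django releases removed `close_connection()` in favor of closing the connection object directly, so check the documentation for the version in use.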

March 9, 2012

by Chase Seibert

· 9,215 Views

Connecting to Multiple Databases Using Hibernate

In a recent project, I had a requirement of connecting to multiple databases using hibernate. As tapestry-hibernate module does not provide an out-of-box support, I thought of adding one. https://github.com/tawus/tapestry5 Now that the application is in production, I thought of writing a simple “How to”. I have cloned the latest stable(5.3.2) tapestry project at https://github.com/tawus/tapestry5 and have added multiple database support to it. Single Database It is almost fully compatible with the previous integration when using a single database except for a few things 1) HibernateConfigurer has changed public interface HibernateConfigurer { /** * Passed the configuration so as to make changes. */ void configure(Configuration configuration); /** * Factory Id for which this configurer is meant for */ Class getMarker(); /** * Entity package names * * @return */ String[] getPackageNames(); } 2) There is no HibernateEntityPackageManager, as the packages can be contributed by adding more HibernateConfigurers with the same Markers. Multiple databases For multiple database, a marker has to be used for accessing Session or HibernateSessionManager @Inject @XDB private Session session; @Inject @YDB private HibernateSessionManager sessionManager; @XDB @CommitAfter void myMethod(){ } Also you have to define a HibernateSessionManager and a Session for the secondary database in the Module class. 
@Scope(ScopeConstants.PERTHREAD) @Marker(DatabaseTwo.class) public static HibernateSessionManager buildHibernateSessionManagerForFinacle( HibernateSessionSource sessionSource, PerthreadManager perthreadManager) { HibernateSessionManagerImpl service = new HibernateSessionManagerImpl(sessionSource, DatabaseTwo.class); perthreadManager.addThreadCleanupListener(service); return service; } @Marker(DatabaseTwo.class) public static Session buildSessionForFinacle( @Local HibernateSessionManager sessionManager, PropertyShadowBuilder propertyShadowBuilder) { return propertyShadowBuilder.build(sessionManager, "session", Session.class); } Notice an annotation @DatabaseTwo.class. This is a Factory marker and is used to identify a service related to a particular SessionFactory. @Retention(RetentionPolicy.RUNTIME) @Target( {ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD}) @FactoryMarker @Documented public @interface DatabaseTwo { } A typical AppModule for two databases will be public class AppModule { public static void bind(ServiceBinder binder) { binder.bind(DemoService.class, DemoServiceImpl.class); } @Contribute(HibernateSessionSource.class) public static void configureHibernateSources(OrderedConfiguration configurers) { configurers.add("databaseOne", new HibernateConfigurer() { public void configure(org.hibernate.cfg.Configuration configuration) { configuration.configure("/databaseOne.xml"); } public Class getMarker() { return DefaultFactory.class; } public String[] getPackageNames() { return new String[] {"org.example.demo.one"}; } }); configurers.add("databaseTwo", new HibernateConfigurer() { public void configure(org.hibernate.cfg.Configuration configuration) { configuration.configure("/databaseTwo.xml"); } public Class getMarker() { return DatabaseTwo.class; } public String[] getPackageNames() { return new String[] {"org.example.demo.two"}; } }); } @Contribute(SymbolProvider.class) @ApplicationDefaults public static void addSymbols(MappedConfiguration 
configuration) { configuration.add(HibernateSymbols.DEFAULT_CONFIGURATION, "false"); configuration.add("tapestry.app-package", "org.example.demo"); } @Scope(ScopeConstants.PERTHREAD) @Marker(DatabaseTwo.class) public static HibernateSessionManager buildHibernateSessionManagerForFinacle( HibernateSessionSource sessionSource, PerthreadManager perthreadManager) { HibernateSessionManagerImpl service = new HibernateSessionManagerImpl(sessionSource, DatabaseTwo.class); perthreadManager.addThreadCleanupListener(service); return service; } @Marker(DatabaseTwo.class) public static Session buildSessionForFinacle( @Local HibernateSessionManager sessionManager, PropertyShadowBuilder propertyShadowBuilder) { return propertyShadowBuilder.build(sessionManager, "session", Session.class); } } Injecting into Services You can inject a session in a service using the marker. As DatabaseOne is being used as the default configuration, in order to inject its Session, you have to annotate it with @DefaultFactory. For DatabaseTwo, you can use @DatabaseTwo annotation. public class DemoServiceImpl implements DemoService { private Session sessionOne; private Session sessionTwo; public DemoServiceImpl( @DefaultFactory Session sessionOne, @DatabaseTwo Session sessionTwo) { this.sessionOne = sessionOne; this.sessionTwo = sessionTwo; } @SuppressWarnings("unchecked") public List listOnes() { return sessionOne.createCriteria(EntityOne.class).list(); } @SuppressWarnings("unchecked") public List listTwos() { return sessionTwo.createCriteria(EntityTwo.class).list(); } public void save(EntityOne entityOne) { sessionOne.saveOrUpdate(entityOne); } public void save(EntityTwo entityTwo) { sessionTwo.saveOrUpdate(entityTwo); } } Using @CommitAfter You can add an advice the same way you used to. The only change is in @CommitAfter. You have to additionally annotate the method with the respective marker. 
public interface DemoService { List listOnes(); List listTwos(); @CommitAfter @DefaultFactory void save(EntityOne entityOne); @CommitAfter @DatabaseTwo void save(EntityTwo entityTwo); } Here is an example. From http://tawus.wordpress.com/2012/03/03/tapestry-hibernate-multiple-databases/

March 7, 2012

by Taha Siddiqi

· 100,317 Views · 3 Likes

Blobs and More: Storing Images and Files in IndexedDB

The desired future approach for storing things client-side in web browsers is utilizing IndexedDB. Here I’ll walk you through how to store images and files in IndexedDB and then present them through an ObjectURL. The general approach First, let’s talk about the steps we will go through to create an IndexedDB data base, save the file into it and then read it out and present in the page: Create or open a database. Create an objectStore (if it doesn’t already exist) Retrieve an image file as a blob Initiate a database transaction Save that blob into the database Read out that saved file and create an ObjectURL from it and set it as the src of an image element in the page Creating the code Let’s break down all parts of the code that we need to do this: Create or open a database. // IndexedDB var indexedDB = window.indexedDB || window.webkitIndexedDB || window.mozIndexedDB || window.OIndexedDB || window.msIndexedDB, IDBTransaction = window.IDBTransaction || window.webkitIDBTransaction || window.OIDBTransaction || window.msIDBTransaction, dbVersion = 1; // Create/open database var request = indexedDB.open("elephantFiles", dbVersion); request.onsuccess = function (event) { console.log("Success creating/accessing IndexedDB database"); db = request.result; db.onerror = function (event) { console.log("Error creating/accessing IndexedDB database"); }; // Interim solution for Google Chrome to create an objectStore. Will be deprecated if (db.setVersion) { if (db.version != dbVersion) { var setVersion = db.setVersion(dbVersion); setVersion.onsuccess = function () { createObjectStore(db); getImageFile(); }; } else { getImageFile(); } } else { getImageFile(); } } // For future use. Currently only in latest Firefox versions request.onupgradeneeded = function (event) { createObjectStore(event.target.result); }; The intended way to use this is to have the onupgradeneeded event triggered when a database is created or gets a higher version number. 
This is currently only supported in Firefox, but will soon be in other web browsers. If the web browser doesn’t support this event, you can use the deprecated setVersion method and connect to its onsuccess event.

Create an objectStore (if it doesn’t already exist)

    // Create an objectStore
    console.log("Creating objectStore")
    dataBase.createObjectStore("elephants");

Here you create an objectStore in which you will store your data – or in our case, files – and once it is created you don’t need to recreate it, just update its contents.

Retrieve an image file as a blob

    // Create XHR
    var xhr = new XMLHttpRequest(),
        blob;
    xhr.open("GET", "elephant.png", true);
    // Set the responseType to blob
    xhr.responseType = "blob";
    xhr.addEventListener("load", function () {
        if (xhr.status === 200) {
            console.log("Image retrieved");
            // File as response
            blob = xhr.response;
            // Put the received blob into IndexedDB
            putElephantInDb(blob);
        }
    }, false);
    // Send XHR
    xhr.send();

This code gets the contents of a file as a blob directly. Currently that’s only supported in Firefox. Once you have received the entire file, you send the blob to the function that stores it in the database.

Initiate a database transaction

    // Open a transaction to the database
    var transaction = db.transaction(["elephants"], IDBTransaction.READ_WRITE);

To start writing something to the database, you need to initiate a transaction with an objectStore name and the type of action you want to do – in this case read and write.

Save that blob into the database

    // Put the blob into the database
    transaction.objectStore("elephants").put(blob, "image");

Once the transaction is in place, you get a reference to the desired objectStore and then put your blob into it and give it a key.
Read out that saved file and create an ObjectURL from it and set it as the src of an image element in the page // Retrieve the file that was just stored transaction.objectStore("elephants").get("image").onsuccess = function (event) { var imgFile = event.target.result; console.log("Got elephant!" + imgFile); // Get window.URL object var URL = window.URL || window.webkitURL; // Create and revoke ObjectURL var imgURL = URL.createObjectURL(imgFile); // Set img src to ObjectURL var imgElephant = document.getElementById("elephant"); imgElephant.setAttribute("src", imgURL); // Revoking ObjectURL URL.revokeObjectURL(imgURL); }; Use the same transaction to get the image file you just stored, and then create an objectURL and set it to the src of an image in the page. This could just as well, for instance, have been a JavaScript file that you attached to a script element, and then it would parse the JavaScript. The complete code So, here’s the complete working code: (function () { // IndexedDB var indexedDB = window.indexedDB || window.webkitIndexedDB || window.mozIndexedDB || window.OIndexedDB || window.msIndexedDB, IDBTransaction = window.IDBTransaction || window.webkitIDBTransaction || window.OIDBTransaction || window.msIDBTransaction, dbVersion = 1.0; // Create/open database var request = indexedDB.open("elephantFiles", dbVersion), db, createObjectStore = function (dataBase) { // Create an objectStore console.log("Creating objectStore") dataBase.createObjectStore("elephants"); }, getImageFile = function () { // Create XHR var xhr = new XMLHttpRequest(), blob; xhr.open("GET", "elephant.png", true); // Set the responseType to blob xhr.responseType = "blob"; xhr.addEventListener("load", function () { if (xhr.status === 200) { console.log("Image retrieved"); // Blob as response blob = xhr.response; console.log("Blob:" + blob); // Put the received blob into IndexedDB putElephantInDb(blob); } }, false); // Send XHR xhr.send(); }, putElephantInDb = function (blob) { 
console.log("Putting elephants in IndexedDB"); // Open a transaction to the database var transaction = db.transaction(["elephants"], IDBTransaction.READ_WRITE); // Put the blob into the dabase var put = transaction.objectStore("elephants").put(blob, "image"); // Retrieve the file that was just stored transaction.objectStore("elephants").get("image").onsuccess = function (event) { var imgFile = event.target.result; console.log("Got elephant!" + imgFile); // Get window.URL object var URL = window.URL || window.webkitURL; // Create and revoke ObjectURL var imgURL = URL.createObjectURL(imgFile); // Set img src to ObjectURL var imgElephant = document.getElementById("elephant"); imgElephant.setAttribute("src", imgURL); // Revoking ObjectURL URL.revokeObjectURL(imgURL); }; }; request.onerror = function (event) { console.log("Error creating/accessing IndexedDB database"); }; request.onsuccess = function (event) { console.log("Success creating/accessing IndexedDB database"); db = request.result; db.onerror = function (event) { console.log("Error creating/accessing IndexedDB database"); }; // Interim solution for Google Chrome to create an objectStore. Will be deprecated if (db.setVersion) { if (db.version != dbVersion) { var setVersion = db.setVersion(dbVersion); setVersion.onsuccess = function () { createObjectStore(db); getImageFile(); }; } else { getImageFile(); } } else { getImageFile(); } } // For future use. Currently only in latest Firefox versions request.onupgradeneeded = function (event) { createObjectStore(event.target.result); }; })(); Web browser support IndexedDB Supported since long (a number of versions back) in Firefox and Google Chrome. Planned to be in IE10, unclear about Safari and Opera. onupgradeneeded Supported in latest Firefox. Planned to be in Google Chrome soon and hopefully IE10. Unclear about Safari and Opera. Storing files in IndexedDB Supported in Firefox 11 and later. Planned to be supported in Google Chrome. Hopefully IE10 will support it. 
Unclear about Safari and Opera. XMLHttpRequest Level 2 Supported in Firefox and Google Chrome since long, Safari 5+ and planned to be in IE10 and Opera 12. responseType “blob” Currently only supported in Firefox. Will soon be in Google Chrome and is planned to be in IE10. Unclear about Safari and Opera. Demo and code I’ve put together a demo with IndexedDB and saving images and files in it where you can see it all in action. Make sure to use any Developer Tool to Inspect Element on the image to see the value of its src attribute. Also make sure to check the console.log messages to follow the actions. The code for storing files in IndexedDB is also available on GitHub, so go play now!

March 6, 2012

by Robert Nyman

· 23,860 Views

Solr Date Math, NOW and filter queries

Or “How to never re-use cached filter query results even though you meant to”: Filter queries (“fq” clauses) are a means to restrict the number of documents that are considered for scoring. A common use of “fq” clauses is to restrict the dates of documents returned, things like “in the last day”, “in the last week” etc. You find this pattern often used in conjunction with faceting. Filter queries make use of a filterCache (see solrconfig.xml) to calculate the set of documents satisfying the query once and then re-use that result set. Often, using NOW in filter queries causes this caching to be useless. Here’s why. Solr maintains a filterCache, where it stores the results of “fq” clauses. You can think of it as a map, where the key is the “fq” clause and the value is the set of documents satisfying that clause. I’m going to skip the details of how the document set (the “value” in this map) is stored, since this post is really concentrating on the key. So, let’s say you have two filter queries (whether they’re in the same query or not is irrelevant), something like: “fq=category:books&fq=source:library”. There will be two entries in the filterCache, something like: category:books => 1, 2, 5, 89… source:library => 7, 45, 101… All well and good so far. I’ll add one short diversion here. This bears on why it is often better to have several “fq” clauses than a single one. The same results could be obtained by “fq=category:books AND source:library”, but then the filter cache would look like: category:books AND source:library => 1, 2, 5, 7, 45, 89, 101….. and an fq like “fq=category:books” would NOT re-use the cache since the key is much different. But enough of a diversion… OK, you mentioned dates. Get to the point. It’s common to have date ranges as filter queries, things like “in the last day”, “in the last week”, etc. And there’s the convenient date math to make this easy. 
So it’s tempting, very tempting to have filter clauses on date ranges like “fq=date:[NOW-1DAY TO NOW]”. Be careful when using NOW! Here’s the problem. In the above example, date:[NOW-1DAY TO NOW] is not what’s used as the key for the fq in the filterCache; the expansion is used as the key. This translates into a form like “date:[2012-01-26T08:56:23Z TO 2012-01-27T08:56:23Z]” for the key into the filter cache. Now the user adds a term to the “q” and re-submits the query 30 seconds after the first one. The fq clause now looks something like “fq=date:[2012-01-26T08:56:53Z TO 2012-01-27T08:56:53Z]”. Note that the seconds went from 23 to 53! The key for this fq does not match the key for the first, even though the intent is usually that submitting this kind of fq 30 seconds later should match the same set of documents. Bare NOW entries in filter clauses will pretty much guarantee that the cached result sets will never be reused. Fine. What do you do to make it better? Here’s where rounding makes sense. Using midnight can make sense from two perspectives. The sense you often want is “anything with a timestamp in a particular day” (or month or year or hour or….). So using an unrounded NOW in the lower bound would miss anything published between midnight and whatever time of day the user happens to submit the query, on the day (in this example) of the lower bound. Re-using the filter cache can substantially speed up your queries, especially if you’re providing links like “in the last day”, “1-7 days ago” etc. So your fq clauses start to look like “fq=date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]”. The thing to note about the date math “/” operator is that it is a “round down” operator. So let’s break this up a bit: NOW/DAY rounds down to midnight last night. -7DAYS subtracts 7 whole days. So the lower limit is really “midnight 7 days ago”. Similarly, NOW/DAY rounds to midnight last night and +1DAY moves that to midnight tonight for the upper limit.
These clauses are invariant until after midnight tonight so these clauses will return the same results all day today, and only the first submission of this fq will incur the cost of figuring out which documents satisfy it, all the queries after the first will just read the cached result set from the filterCache. Of course the caches are invalidated if you update your index and/or a replication happens, but that’s always the case. You will note that there is a bit of “slop” here. If your index has dates in the future, you may get them too. Suppose you have a situation where your index contains documents you don’t want the users to see until it’s later than their timestamp. I actually have a hard time contriving an example here, but let’s just assume it’s the case. Also say it’s noon and your index contains timestamps on documents through midnight tonight. The above technique will show documents that will be officially published at, say, 15:00 even though it’s only 12:00 and you may not want that. In that case, you’ll have to use a bare NOW clause and live with the fact that your cache isn’t being used for these clauses. Like I said, this is contrived, but I mention it for completeness’ sake. A couple of notes about dates: Before I finish, a couple of notes about dates. Use the coarsest dates you can and still satisfy your use case. This is especially true if you’re sorting by dates. The sorting resource requirements go up by the number of unique terms. So storing millisecond resolution when all you care about is day can be wasteful. This is also true when faceting. It’s often useful to index multiple fields with some date data, especially if you intend to facet. The above examples in the 3.x code line have a slight problem when more than one adjoining range is required. The range operator “[]” is inclusive, so if you have a document indexed at exactly midnight in these examples, it might be included in two ranges. 
Trunk Solr (4.0) allows mixing inclusive “[]” and exclusive “{}” endpoints, so expressions like “date:[NOW/DAY-1DAY TO NOW/DAY+1DAY}” are possible. An exercise for the reader: What are the consequences of using different kinds of rounding? E.g. NOW/5MIN, NOW/72MIN (does this even work?).
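The effect of rounding on the cache key can be checked with a few lines of date arithmetic. The following is a sketch in Python rather than Solr itself: the function names are made up for this example, and the string format only approximates how Solr expands NOW. It shows that two submissions 30 seconds apart produce different keys with a bare NOW and identical keys with NOW/DAY rounding.

```python
from datetime import datetime, timedelta

ISO = "%Y-%m-%dT%H:%M:%SZ"


def bare_now_key(now):
    """Approximate expansion of date:[NOW-7DAYS TO NOW]."""
    lower = now - timedelta(days=7)
    return "date:[%s TO %s]" % (lower.strftime(ISO), now.strftime(ISO))


def rounded_key(now):
    """Approximate expansion of date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]."""
    # NOW/DAY rounds down to the most recent midnight.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    lower = midnight - timedelta(days=7)
    upper = midnight + timedelta(days=1)
    return "date:[%s TO %s]" % (lower.strftime(ISO), upper.strftime(ISO))


if __name__ == "__main__":
    first = datetime(2012, 1, 27, 8, 56, 23)
    second = first + timedelta(seconds=30)
    # Bare NOW: the two keys differ, so the cached entry is never reused.
    print(bare_now_key(first))
    print(bare_now_key(second))
    # Rounded: the same key all day, so the cached entry is reused.
    print(rounded_key(first))
    print(rounded_key(second))
```

The rounded key only changes once the clock passes midnight, which matches the "invariant until after midnight tonight" behavior described above.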

February 27, 2012

by Erick Erickson

· 52,669 Views

How to migrate databases between SQL Server and SQL Server Compact

In this post, I will try to give an overview of the free tools available for developers to move databases from SQL Server to SQL Server Compact and vice versa. I will also show how you can do this with the SQL Server Compact Toolbox (add-in and standalone editions).

Moving databases from SQL Server Compact to SQL Server

This can be useful when you have already developed an application that depends on SQL Server Compact and would like the increased power of SQL Server, or would like to use a feature that is not available in SQL Server Compact. I have an informal comparison of the two products here. Microsoft offers a GUI-based tool and a command line tool to do this: WebMatrix and MsDeploy. You can also use the ExportSqlCe command line tool or the SQL Server Compact Toolbox.

To use the ExportSqlCE (or ExportSqlCE40) command line, use a command similar to:

ExportSQLCE.exe "Data Source=D:\Northwind.sdf;" Northwind.sql

The resulting script file (Northwind.sql) can then be run against a SQL Server database, using for example the SQL Server sqlcmd command line utility:

sqlcmd -S mySQLServer -d NorthWindSQL -i C:\Northwind.sql

To use the SQL Server Compact Toolbox: Connect the Toolbox to the database file that you want to move to SQL Server. Right click the database connection, and select to script Schema and Data. Optionally, select which tables to script and click OK. Enter the filename for the script (the default extension is .sqlce). Click OK to the confirmation message. You can now open the generated script in Management Studio and execute it against a SQL Server database, or run it with sqlcmd as described above.
Moving databases from SQL Server to SQL Server Compact Microsoft offers no tools for doing this “downsizing” of a SQL Server database to SQL Server Compact, and of course not all objects in a SQL Server database CAN be downsized, as only tables exists in a SQL Server Compact database, so no stored procedures, views, triggers, users, schema and so on. I have blogged about how this can be done from the command line, and you can also do this with the SQL Server Compact Toolbox (of course): From the root node, select Script Server Data and Schema: Follow a procedure like the one above, but connecting to a SQL Server database instead. The export process will convert the SQL Server data types to a matching SQL Server Compact data type, for example varchar(50) becomes nvarchar(50) and so on. Any unsupported data types will be ignored, this includes for example computed columns and sql_variant. The new date types in SQL Server 2008+, like date, time, datetime2 will be converted to nvarchar based data types, as only datetime is supported in SQL Server Compact. A full list of the SQL Server Compact data types is available here.
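The conversion rules just described can be pictured with a short Python sketch. This is not the Toolbox's actual code: the function name `to_compact_type`, the nvarchar length used for date-like types, and the pass-through of every type not named in the text are assumptions made purely for illustration.

```python
import re

# Types the text names as unsupported; such columns are skipped.
UNSUPPORTED = {"sql_variant"}
# SQL Server 2008+ date types that the text says become nvarchar-based.
DATE_LIKE = {"date", "time", "datetime2"}


def to_compact_type(sql_server_type, date_text_length=30):
    """Return a SQL Server Compact type, or None if the column is skipped."""
    match = re.fullmatch(r"\s*(\w+)\s*(\(.*\))?\s*", sql_server_type)
    if match is None:
        raise ValueError("cannot parse type: %r" % sql_server_type)
    name = match.group(1).lower()
    size = match.group(2) or ""
    if size.lower() == "(max)":
        # Not covered by the rules in the text, so refuse rather than guess.
        raise NotImplementedError("(max) sizes are not handled in this sketch")
    if name in UNSUPPORTED:
        return None
    if name in DATE_LIKE:
        # Assumed length; the text only says "nvarchar based".
        return "nvarchar(%d)" % date_text_length
    if name == "varchar":
        return "nvarchar" + size
    # Assumption: anything else exists under the same name in Compact.
    return name + size
```

For example, `to_compact_type("varchar(50)")` returns the nvarchar(50) mentioned above, and a sql_variant column yields None, meaning it would be left out of the script.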

February 26, 2012

by Erik Ejlskov Jensen

· 16,944 Views

wxPython: Creating a "Dark Mode"

One day at work, I was told that we had a feature request for one of my programs. They wanted a “dark mode” for when they used my application at night as the normal colors were kind of glaring. My program is used in laptops in police cars, so I could understand their frustration. I spent some time looking into the matter and got a mostly working script put together which I’m going to share with my readers. Of course, if you’re a long time reader, you probably know I’m talking about a wxPython program. I write almost all my GUIs using wxPython. Anyway, let’s get on with the story! Into the Darkness Getting the widgets to change color in wxPython is quite easy. The only two methods you need are SetBackgroundColour and SetForegroundColour. The only major problem I ran into when I was doing this was getting my ListCtrl / ObjectListView widget to change colors appropriately. You need to loop over each ListItem and change their colors individually. I alternate row colors, so that made things more interesting. The other problem I had was restoring the ListCtrl’s background color. Normally you can set a widget’s background color to wx.NullColour (or wx.NullColor) and it will go back to its default color. However, some widgets don’t work that way and you have to actually specify a color. It should also be noted that some widgets don’t seem to pay any attention to SetBackgroundColour at all. One such widget that I’ve found is the wx.ToggleButton. 
Now you know what I know, so let’s look at the code I came up with to solve my issue:

    import wx

    try:
        from ObjectListView import ObjectListView
    except ImportError:
        ObjectListView = False

    # Classes whose rows are coloured one by one. ObjectListView is only
    # included when the import succeeded, so isinstance() never gets False.
    LIST_TYPES = (wx.ListCtrl, ObjectListView) if ObjectListView else (wx.ListCtrl,)

    #----------------------------------------------------------------------
    def getWidgets(parent):
        """
        Return a list of all the child widgets
        """
        items = [parent]
        for item in parent.GetChildren():
            items.append(item)
            if hasattr(item, "GetChildren"):
                for child in item.GetChildren():
                    items.append(child)
        return items

    #----------------------------------------------------------------------
    def darkRowFormatter(listctrl, dark=False):
        """
        Toggles the rows in a ListCtrl or ObjectListView widget.
        Based loosely on the following documentation:
        http://objectlistview.sourceforge.net/python/recipes.html#recipe-formatter
        and http://objectlistview.sourceforge.net/python/cellEditing.html
        """
        listItems = [listctrl.GetItem(i) for i in range(listctrl.GetItemCount())]
        for index, item in enumerate(listItems):
            if dark:
                if index % 2:
                    item.SetBackgroundColour("Dark Grey")
                else:
                    item.SetBackgroundColour("Light Grey")
            else:
                if index % 2:
                    item.SetBackgroundColour("Light Blue")
                else:
                    item.SetBackgroundColour("Yellow")
            # The colour change only takes effect once the item is set again
            listctrl.SetItem(item)

    #----------------------------------------------------------------------
    def darkMode(self, normalPanelColor):
        """
        Toggles dark mode
        """
        widgets = getWidgets(self)
        panel = widgets[0]
        # Still showing the normal colour means we are switching to dark mode
        dark_mode = normalPanelColor == panel.GetBackgroundColour()
        for widget in widgets:
            if dark_mode:
                if isinstance(widget, LIST_TYPES):
                    darkRowFormatter(widget, dark=True)
                widget.SetBackgroundColour("Dark Grey")
                widget.SetForegroundColour("White")
            else:
                if isinstance(widget, LIST_TYPES):
                    darkRowFormatter(widget)
                    widget.SetBackgroundColour("White")
                    widget.SetForegroundColour("Black")
                    continue
                widget.SetBackgroundColour(wx.NullColour)
                widget.SetForegroundColour("Black")
        self.Refresh()
        return dark_mode

This code is a little convoluted, but it gets the job done. Let’s break it down a bit and see how it works. First off, we try to import ObjectListView, a cool 3rd party widget that wraps wx.ListCtrl and makes it a LOT easier to use. However, it’s not part of wxPython right now, so you need to test for its existence. I just set it to False if it doesn’t exist, and leave it out of the tuple of list classes that is checked with isinstance.

The getWidgets function takes a parent parameter, which would usually be a wx.Frame or wx.Panel, and goes through its children (and their children) to create a list of widgets, which it then returns to the calling function.

The main function is darkMode. It takes two parameters: the poorly named “self”, which refers to a parent widget, and a default panel color. It calls getWidgets and then compares the panel’s current color with the default to decide if dark mode should be enabled or not. Next it loops over the widgets and changes the colors accordingly. When it’s done, it refreshes the passed-in parent and returns a bool to let you know if dark mode is on or off.

There is one more function called darkRowFormatter that is only for setting the colors of the ListItems in a wx.ListCtrl or an ObjectListView widget. Here we use a list comprehension to create a list of wx.ListItems that we then iterate over, changing their colors. To actually apply the color change, we need to call SetItem and pass it a wx.ListItem object instance.

Trying Out Dark Mode

So now you’re probably wondering how to actually use the script above. Well, this section will show you how it’s done. Here’s a simple program with a list control in it and a toggle button too!
    import wx
    import darkMode

    ########################################################################
    class MyPanel(wx.Panel):
        """Panel with a list control, a toggle button and a normal button"""

        #----------------------------------------------------------------------
        def __init__(self, parent):
            """Constructor"""
            wx.Panel.__init__(self, parent)
            # Remember the normal colour so darkMode can tell which way to toggle
            self.defaultColor = self.GetBackgroundColour()

            rows = [("Ford", "Taurus", "1996", "Blue"),
                    ("Nissan", "370Z", "2010", "Green"),
                    ("Porsche", "911", "2009", "Red")
                    ]
            self.list_ctrl = wx.ListCtrl(self, style=wx.LC_REPORT)

            self.list_ctrl.InsertColumn(0, "Make")
            self.list_ctrl.InsertColumn(1, "Model")
            self.list_ctrl.InsertColumn(2, "Year")
            self.list_ctrl.InsertColumn(3, "Color")

            index = 0
            for row in rows:
                self.list_ctrl.InsertStringItem(index, row[0])
                self.list_ctrl.SetStringItem(index, 1, row[1])
                self.list_ctrl.SetStringItem(index, 2, row[2])
                self.list_ctrl.SetStringItem(index, 3, row[3])
                # Alternate the row colours
                if index % 2:
                    self.list_ctrl.SetItemBackgroundColour(index, "white")
                else:
                    self.list_ctrl.SetItemBackgroundColour(index, "yellow")
                index += 1

            btn = wx.ToggleButton(self, label="Toggle Dark")
            btn.Bind(wx.EVT_TOGGLEBUTTON, self.onToggleDark)
            normalBtn = wx.Button(self, label="Test")

            sizer = wx.BoxSizer(wx.VERTICAL)
            sizer.Add(self.list_ctrl, 0, wx.ALL|wx.EXPAND, 5)
            sizer.Add(btn, 0, wx.ALL, 5)
            sizer.Add(normalBtn, 0, wx.ALL, 5)
            self.SetSizer(sizer)

        #----------------------------------------------------------------------
        def onToggleDark(self, event):
            """Toggle dark mode on this panel and its children"""
            darkMode.darkMode(self, self.defaultColor)

    ########################################################################
    class MyFrame(wx.Frame):
        """Top-level frame that hosts the demo panel"""

        #----------------------------------------------------------------------
        def __init__(self):
            """Constructor"""
            wx.Frame.__init__(self, None, wx.ID_ANY,
                              "MvP ListCtrl Dark Mode Demo")
            panel = MyPanel(self)
            self.Show()

    #----------------------------------------------------------------------
    if __name__ == "__main__":
        app = wx.App(False)
        frame = MyFrame()
        app.MainLoop()

If you run the program above, you should see the demo window in its normal colors. If you click the ToggleButton, the window switches to the dark colors. Notice how the toggle button was unaffected by the SetBackgroundColour method. Also notice that the list control’s column headers don’t change colors either. Unfortunately, wxPython doesn’t expose access to the column headers, so there’s no way to manipulate their color.

Anyway, let’s take a moment to see how the dark mode code is used. First we need to import it. In this case, the module is called darkMode. To actually call it, we need to look at the ToggleButton’s event handler:

    darkMode.darkMode(self, self.defaultColor)

As you can see, all we did was call darkMode.darkMode with the panel object (the “self”) and a defaultColor that we set at the beginning of the wx.Panel’s init method. That’s all we had to do. We should probably set it up with a variable to catch the returned value, but for this example we don’t really care.

Wrapping Up

Now we’re done and you too can create a “dark mode” for your applications. At some point, I’d like to generalize this some more and make it into a color changer script where I can pass whatever colors I want to it. What would be really cool is to make it into a mixin. But that’s something for the future. For now, enjoy!

Further Reading

- ObjectListView documentation
- An ObjectListView tutorial
- wx.ListCtrl documentation

Source Code

- 2011-11-5-wxPython-dark-mode
- You can also pull the source from Bitbucket

Source: http://www.blog.pythonlibrary.org/2011/11/05/wxpython-creating-a-dark-mode/
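One limitation worth noting: the getWidgets helper in the article descends only two levels below the parent, so widgets nested more deeply (for example inside a panel within a panel) would be missed. A recursive variant is one possible way around that. The sketch below is not from the original post; the name `get_widgets_recursive` is hypothetical, and it assumes only that every widget offers `GetChildren()`, as wx.Window objects do.

```python
def get_widgets_recursive(parent):
    """Return parent followed by all of its descendants, depth-first."""
    widgets = [parent]
    for child in parent.GetChildren():
        # Each child contributes itself plus its own descendants.
        widgets.extend(get_widgets_recursive(child))
    return widgets
```

Since the parent is still the first element of the returned list, it could stand in for getWidgets inside darkMode without other changes.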

February 26, 2012

by Mike Driscoll

· 7,710 Views

Our experience with Domain Events

domain-driven design background

there are a series of domain model patterns that describe the objects and object groups built with domain-driven design. aggregates describe a cohesive object graph with a single point of entry, called the root: the internal objects of the aggregate cannot be persistently referenced from the outside. the domain classes whose instances live inside aggregates are subdivided into entities and value objects: the former have a lifecycle (like a post or a user), while the latter are just values with methods, comparable to strings, that model other domain concepts. a prerequisite of these patterns is the immutability of value objects, which can then be shared between aggregates, just like string instances can be in many languages. value objects such as numbers and colors are modified by calling a method on them that returns a new instance: every change to their state should produce a new value object. repositories are collections of aggregates: they model operations such as finding an aggregate or persisting a new one.

a great departure of modern ddd from the entity/relationship modelling everyone knows is the duplication of data between aggregates to support new scenarios: a field or object may be repeated in different aggregates. when an aggregate is updated, the change is not necessarily reflected atomically in the other copies of its data. for generality, i'll refer to writing calls to indicate the command side of command query separation, which corresponds to everything that causes a change of state in the domain objects (as opposed to the reading side).

events as mail messages

thus it has become common to copy data between objects in different aggregates: for example, think of a document and an invoice object that share the same start/finish date interval. traditionally this duplication is dealt with by extracting a common object, mapped to a common row in the database, with a name invented on the spot.
domain events are an alternative that allows these data to stay duplicated: they reflect changes that happened in a single aggregate, and are sent to other aggregates so that they can update themselves. technically speaking, domain events are plain old $yourlanguage objects, containing the modified data but, unlike the main domain objects, not mapped by the orm. domain events are handy for modelling "when" rules that should always be respected no matter who is writing to an aggregate; moreover, their handling can take place in the same transaction or in a new one.

my skeptical view of events was that it can be unclear which events are communicated between objects. after a while, i accepted that unit tests tell us that; moreover, communicating with events is a further level of abstraction, which is unnecessary in simple domains but amounts to just a giant observer pattern in others. the underlying idea is that no matter who applies a command or modifies a domain object, we have already configured the event handling mechanism so that consistency across aggregates is reached according to the policy defined in the event handlers (which may be immediate consistency or eventual consistency, or may result in sending a mail to a human asking them to review the changes: whatever you want).

the only alternative for propagating changes between aggregates would be to pass many collaborators to the various repositories, but this solution couples the aggregates to each other in many ways, while with events you're forced to define one-way messages. the event generator makes no assumption about who will listen to the event, or whether it will be listened to at all: events are a point of decoupling, like interfaces are for object collaboration. and it's not that we call static methods by passing a string.
we have a clear contract, a domainevents static class, and we publish interesting events (like createdcar or updatedvoyageplan) as plain old domain objects which contain all the information about the update, often even composing the relevant domain object. udi dahan discourages references to domain objects, and considers events just special value objects; indeed, as our solution matures we are moving towards simpler objects. this choice may force us to consider just what needs to be inserted in the message instead of a full reference (where serialization is used to transmit the event, it is in fact simpler to use a value object). moreover, it avoids possible further accidental writing calls to the domain object originating the event.

in the application layer

events are published by calling a static domain class: as a result, event launchers cannot be decoupled from the event (as in udi dahan's approach). we launch events from the repository after an update has been performed, either by choosing an event class directly (in case of an update or creation) or by collecting the events from a queue on the relevant domain object, usually the root of the aggregate. this was a nice idea from a colleague of mine that let us decouple at least the entities from the domainevents static class.

for now we do not have the requirement to decouple the handling of events from the transaction, so the application layer (which sits above the domain layer) opens and commits or aborts a transaction, while reconstituting an aggregate, doing some "writing" work (updating it or executing a command) and saving it. the save triggers the event launch, which may trigger work on other aggregates through the configured handlers: in case of an error the whole transaction is aborted, ensuring immediate consistency. so we aren't getting the scaling advantages of deferred handling (we're not interested in that for now), but we do get the simplicity of communicating with events while writing code.
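the flow described above (the aggregate root queues events, the repository publishes them after saving) can be sketched as follows. the article's code base is php, so this python version is only an illustration under assumptions: `voyage`, `voyagerepository` and the event fields are hypothetical, and a plain instance stands in for the static domainevents class to keep the example self-contained.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdatedVoyagePlan:
    """Event as a simple value object: data only, no entity reference."""
    voyage_id: int
    finish_date: str


class DomainEvents:
    """Routes each published event to the handlers registered for its class."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        for event_type, handlers in self._handlers.items():
            if isinstance(event, event_type):
                for handler in handlers:
                    handler(event)


class Voyage:
    """Aggregate root that queues events instead of publishing them itself."""

    def __init__(self, voyage_id, finish_date):
        self.voyage_id = voyage_id
        self.finish_date = finish_date
        self.pending_events = []

    def reschedule(self, finish_date):
        self.finish_date = finish_date
        self.pending_events.append(UpdatedVoyagePlan(self.voyage_id, finish_date))


class VoyageRepository:
    """Saves the aggregate, then launches whatever events it had queued."""

    def __init__(self, events):
        self._events = events
        self._store = {}

    def save(self, voyage):
        self._store[voyage.voyage_id] = voyage
        queued, voyage.pending_events = voyage.pending_events, []
        for event in queued:
            self._events.publish(event)
```

in this arrangement the entity never touches the publisher, which mirrors the decoupling credited to the colleague's idea; wrapping `save` in a transaction would be the application layer's job and is left out here.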
dynamic language

this is a php-specific section; domain events, however, are an approach more typical of java or .net enterprise applications. we use php classes (or interfaces) for routing the events with instanceof; php is a shared-nothing environment, so event configuration is currently done on a per-action basis, to avoid having to create all the objects handling events on each request. however, we want to move the configuration to the application level, with some lazy-loading: for example, configuring lazy event handlers as methods on factory objects that create the real handler and return it along with the name of the method to call. all communication between aggregates happens in a single process and a single address space (for now), so we are not exploiting some of the decoupling properties of events.

we map value objects into the relational database either onto the parent entity's table (decomposing their fields onto the entity) or as rows of their own table. in any case, we have to ensure immutability via encapsulation, only assigning to $this->anyfield in the constructor. our standard pattern is to define setters as new self($this->field1, ..., $newfieldvalue, ..., $this->fieldn); where n is a small number of fields. we map all domain objects with doctrine 2, the php equivalent of hibernate. we are investigating how to deal with orphaned value objects, which are no longer reached by any entity.
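the "setter returns a new instance" pattern described above can be shown in a few lines. this is a python sketch rather than the php the article uses, and the `dateinterval` class with its fields is a hypothetical example borrowed from the document/invoice case mentioned earlier.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DateInterval:
    """Immutable value object: fields are assigned only at construction."""
    start: str
    finish: str

    def with_finish(self, finish):
        # Equivalent in spirit to `new self($this->start, $newfinish)`:
        # the original instance is left untouched.
        return replace(self, finish=finish)
```

because instances never change, the same interval can safely be shared by objects in different aggregates.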

February 23, 2012

by Giorgio Sironi

· 26,865 Views

A Performance Comparison of LevelDB and MySQL

In January, Google released LevelDB, "a fast and lightweight key/value database library." A recent post on the "High Availability MySQL" blog has generated a discussion around the possibility of using LevelDB as a storage engine for MySQL because of its performance benefits. The discussion generated some insight into how LevelDB's performance compares with MySQL's.

The LevelDB site provides some insight into these performance benefits. When creating a brand new database, the various write methods show speeds ranging from 0.4 MB/s to 62.7 MB/s. Read performance ranged from 152 MB/s to 232 MB/s. You can see a more detailed explanation of these benchmarks on the LevelDB site here.

The "High Availability MySQL" blog also suggests that LevelDB may be a "great fit" for MongoDB because it does not require multi-statement transactions. Commenters pointed out a few more details about LevelDB that may limit its performance:

"Unfortunately, there is a trade off between number of SST files and query latency variation: the larger single storage file is - the more time will require to compact it" -- Vladmir Rodionov

A recent GitHub post also compared MySQL and LevelDB. For sequential insert performance, LevelDB was found to get higher throughput and lower latency overall, although MySQL was more stable. For both average latency and update performance, MySQL and LevelDB performed essentially the same.

Have you had a chance to use LevelDB? How does it compare to other libraries? Please post your comments below.

February 14, 2012

by Eric Genesky

· 15,504 Views · 2 Likes

Using Self Referencing Tables With Entity Framework

Since EF was released I have been a fan. However, every once in a while I’ll run into a table design situation that I am not sure how to handle with EF. This week, I needed to set up a self-referencing table in order to store some hierarchical data. A self-referencing table is a table in which a foreign key column references the table’s own primary key. Sounds a little confusing, right? Let’s clarify with an example.

Let’s say I am building an application where I have a list of categories and subcategories. One of my top level categories is “Programming Languages”, and under it I have two subcategories, “C#” and “Java”. In order to store this data I can use a single table with three columns: CategoryId, Name and ParentId. The data would then hold one row for “Programming Languages” and one row each for “C#” and “Java”.

Just to clarify, a top level category will have a null value for the ParentId field. For all child categories, the ParentId field holds the parent’s primary key value. As a programmer you may want to think about the ParentId field as a pointer. To complete the example, let’s take a look at the SQL used to create the table.

    CREATE TABLE [dbo].[Categories] (
        [CategoryId] [int] IDENTITY(1,1) NOT NULL,
        [Name] [nvarchar](255) NOT NULL,
        [ParentId] [int] NULL,
        PRIMARY KEY CLUSTERED ( [CategoryId] ASC )
            WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
                  ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    ALTER TABLE [dbo].[Categories] WITH CHECK
        ADD CONSTRAINT [Category_Parent] FOREIGN KEY([ParentId])
        REFERENCES [dbo].[Categories] ([CategoryId])
    GO

    ALTER TABLE [dbo].[Categories] CHECK CONSTRAINT [Category_Parent]
    GO

Upon examining the SQL, you should have noticed that CategoryId is the primary key on the table and the ParentId field is a foreign key which points back to the CategoryId field. Since we have a key referencing another key on the same table, we can classify this as a self-referencing table.
Now that we fully understand what a self-referencing table is, we can move forward to the Entity Framework code. To get started we first need to create a simple C# class to represent the Category table. Of course, keep in mind that if you are using EF Code First you do not need to create the table or database ahead of time. I only showed the table first because I wanted to better illustrate what a self-referencing table is.

    public class Category
    {
        public int CategoryId { get; set; }
        public string Name { get; set; }
        public int? ParentId { get; set; }
    }

So far the Category class is very simple. However, we really want to add a few more properties in order to make this class useful. For example, from a child category you really want to be able to use dot notation to get the name of the parent category (e.g. subCategory.Parent.Name). Using EF, we will create a virtual property named Parent. By making the property virtual we let EF override it, so that the related data can be loaded when the property is accessed. Based on your configuration settings and the code you use to retrieve your data (whether or not you used DbSet.Include), EF will lazy load or eager load this data.

    public class Category
    {
        public int CategoryId { get; set; }
        public string Name { get; set; }
        public int? ParentId { get; set; }
        public virtual Category Parent { get; set; }
    }

Finally, we also want a property called Children so we can use dot notation to enumerate the child categories. Once again, here is the modified class:

    public class Category
    {
        public int CategoryId { get; set; }
        public string Name { get; set; }
        public int? ParentId { get; set; }
        public virtual Category Parent { get; set; }
        public virtual ICollection<Category> Children { get; set; }
    }

The final step is to let EF know how these properties are related to one another. This can be done using EF’s fluent API. If you are new to EF and are unaware of the fluent API then you may want to read this article first.
    public class CommodityCategoryMap : EntityTypeConfiguration<Category>
    {
        public CommodityCategoryMap()
        {
            HasKey(x => x.CategoryId);

            Property(x => x.CategoryId)
                .HasDatabaseGeneratedOption(DatabaseGeneratedOption.Identity);

            Property(x => x.Name)
                .IsRequired()
                .HasMaxLength(255);

            // Optional parent, many children, joined through ParentId
            HasOptional(x => x.Parent)
                .WithMany(x => x.Children)
                .HasForeignKey(x => x.ParentId)
                .WillCascadeOnDelete(false);
        }
    }

Hopefully you paid careful attention to the last section of code, where we state that a Category has an optional Parent property. In database speak, this simply means that the ParentId field is nullable. The code also states that a Category object can have zero or many children. In order to specify that a record is a child, we use the ParentId field to hold the primary key value of the parent record. As I mentioned earlier, if you are a programmer it’s easier to think of the ParentId field as a pointer.

Finally, I disabled the cascade on delete option. This step is optional and probably depends on your own personal preferences. If you enable cascade on delete and you delete a category that has 100 children, then you will effectively remove 101 records. For whatever reason this scares me a little bit. Perhaps my short career as a DBA taught me not to trust people with large-volume delete statements. However, you may decide differently depending on your circumstances.

Hopefully, this short EF tutorial will help you if you are working through a scenario where you need to capture and manipulate hierarchical data. If you have any questions please leave a comment.
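To see why thinking of ParentId as a pointer is useful, here is a small language-neutral illustration, written in Python rather than C#, of how rows from such a table can be turned back into a hierarchy. The ids in the sample rows are assumed for the example, and the function names are hypothetical; EF does the equivalent work for you through the Parent and Children properties.

```python
from collections import defaultdict


def build_children_index(rows):
    """Group (category_id, name) pairs under their parent_id."""
    children = defaultdict(list)
    for category_id, name, parent_id in rows:
        children[parent_id].append((category_id, name))
    return children


def render_tree(rows):
    """Return the category names, indented by depth in the hierarchy."""
    children = build_children_index(rows)
    lines = []

    def walk(parent_id, depth):
        for category_id, name in children.get(parent_id, []):
            lines.append("  " * depth + name)
            walk(category_id, depth + 1)

    walk(None, 0)  # top-level categories have a null ParentId
    return lines


# (CategoryId, Name, ParentId); the ids are assumed for illustration.
ROWS = [
    (1, "Programming Languages", None),
    (2, "C#", 1),
    (3, "Java", 1),
]
```

Calling `render_tree(ROWS)` yields the parent category followed by its two indented children, the same shape you would get by walking `category.Children` recursively in EF.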

February 13, 2012

by Michael Ceranski

· 71,978 Views · 2 Likes

StAXON - JSON via StAX

XML is for dinosaurs, right? Everybody uses JSON these days. So you do, don’t you? But what about things like XSD, XSLT, JAXB, XPath, etc: is it all evil? In this article, I’d like to introduce the StAXON project (APL2), which tries to give you the best of both worlds: JSON outside, but XML inside. One benefit of this is that you can integrate JSON with powerful XML-related technologies for free.

StAXON lets you read and write JSON using the Java Streaming API for XML (javax.xml.stream), also known as StAX. More specifically, StAXON provides JSON implementations of the

- StAX Cursor API (XMLStreamReader and XMLStreamWriter)
- StAX Event API (XMLEventReader and XMLEventWriter)
- StAX Factory API (XMLInputFactory and XMLOutputFactory)

You may know the Jettison project, which also has XMLStreamReader and XMLStreamWriter implementations. However, StAXON aims to provide a more comprehensive and consistent solution and tries to avoid some of the issues users are having with Jettison. Anyway, let’s get started and see what this “anti-aging substance” for XML can do.

Setup

Add the following dependency to your Maven POM file:

    <dependency>
        <groupId>de.odysseus.staxon</groupId>
        <artifactId>staxon</artifactId>
        <version>1.0</version>
    </dependency>

or get the latest StAXON JAR from the Downloads page and add it to your classpath.

Mapping Convention

The purpose of StAXON’s mapping convention is to generate a more compact JSON.
It borrows the "$" syntax for text elements from the Badgerfish convention but attempts to avoid needless text-only JSON objects:

- Element names become object properties: <alice/> <-> {"alice":null}
- Attributes go in properties whose names begin with "@": <alice charlie="david"/> <-> {"alice":{"@charlie":"david"}}
- Text-only elements go to a simple key/value property: <alice>bob</alice> <-> {"alice":"bob"}
- Otherwise, text content is mapped to the "$" property: <alice charlie="david">bob</alice> <-> {"alice":{"@charlie":"david","$":"bob"}}
- Nested elements go to nested properties: <alice><bob>charlie</bob></alice> <-> {"alice":{"bob":"charlie"}}
- A default namespace declaration goes in the element’s "@xmlns" property: <alice xmlns="http://foo.com"/> <-> {"alice":{"@xmlns":"http://foo.com"}}
- A prefixed namespace declaration goes in the element’s "@xmlns:<prefix>" property.

[...]

With an XML-based writer, the output would be <customer><name>John Doe</name><phone>555-1111</phone></customer>. However, with our JSON-based writer, the output is {"customer":{"name":"John Doe","phone":"555-1111"}}

Reading JSON

Create a JSON-based reader:

    String json = "{\"customer\":{\"name\":\"John Doe\",\"phone\":\"555-1111\"}}";
    XMLInputFactory factory = new JsonXMLInputFactory();
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(json));

Read your document:

    assert reader.getEventType() == XMLStreamConstants.START_DOCUMENT;
    reader.nextTag();
    assert reader.isStartElement() && "customer".equals(reader.getLocalName());
    reader.next();
    assert reader.isStartElement() && "name".equals(reader.getLocalName());
    reader.next();
    assert reader.hasText() && "John Doe".equals(reader.getText());
    reader.nextTag();
    assert reader.isEndElement();
    reader.next();
    assert reader.isStartElement() && "phone".equals(reader.getLocalName());
    reader.next();
    assert reader.hasText() && "555-1111".equals(reader.getText());
    reader.nextTag();
    assert reader.isEndElement();
    reader.next();
    assert reader.isEndElement();
    reader.next();
    assert reader.getEventType() == XMLStreamConstants.END_DOCUMENT;
    reader.close();

Factory Configuration

The JsonXMLInputFactory and JsonXMLOutputFactory classes can be configured via the standard setProperty(String, Object) API.
The factory classes define several constants for the properties they support. However, the JsonXMLConfig interface provides a convenient way to hold the configuration of both input and output factories:

    JsonXMLConfig config = new JsonXMLConfigBuilder().
        virtualRoot("customer").
        prettyPrint(true).
        build();
    XMLInputFactory inputFactory = new JsonXMLInputFactory(config);
    ...
    XMLOutputFactory outputFactory = new JsonXMLOutputFactory(config);
    ...

Virtual Roots

Set the virtualRoot configuration property to strip the root element from the JSON representation, e.g.

    {
      "name" : "John Doe",
      "phone" : "555-1111"
    }

As XML requires a single root element, but JSON documents often don’t have one, this is an important feature for reading and writing existing JSON formats.

Mastering Arrays

What about JSON arrays? Unfortunately, there’s nothing like this in XML. And to be honest, this causes most of the trouble when writing JSON via an XML API like StAX. Simply omitting the array boundaries would lead to non-unique JSON properties, which is usually not desired. StAXON provides several ways to deal with JSON arrays.

At the core is the idea of using XML processing instructions to tell the writer that an array is about to start: the <?xml-multiple?> processing instruction maps a sequence of XML elements with the same name to a JSON array. The processing instruction optionally takes the array element tag name (with prefix) as data. There’s no end-of-array hint, as StAXON detects the end of an array sequence and closes it automatically. Consider the following JSON document:

    {
      "alice" : {
        "bob" : [ "edgar", "charlie" ],
        "peter" : null
      }
    }

In order to get a "bob" array instead of two separate "bob" properties, we need to provide XML events corresponding to

    <alice>
      <?xml-multiple bob?>
      <bob>edgar</bob>
      <bob>charlie</bob>
      <peter/>
    </alice>

I.e., with the cursor API, you would just insert

    writer.writeProcessingInstruction(JsonXMLStreamConstants.MULTIPLE_PI_TARGET);

to start an array.
Initiating Arrays with Element Paths

Sometimes it is undesirable or even impossible to generate processing instructions to control arrays. This may be the case if the actual writing isn’t done by your code but by some other framework like JAXB, and you only provide a stream writer. For such a scenario, wouldn’t it be nice to be able to tell the writer beforehand which elements should trigger a JSON array? This is where the XMLMultipleStreamWriter and XMLMultipleEventWriter wrappers step in. E.g., to specify a sequence of bob elements below the root element alice as a multiple path:

    writer = new XMLMultipleStreamWriter(writer, true, "/alice/bob");

The boolean parameter specifies whether the paths include the root node (alice). That is, we could also use

    writer = new XMLMultipleStreamWriter(writer, false, "/bob");

To wrap all bob fields into arrays (not just the children of alice), we can use a relative path, without a leading slash:

    writer = new XMLMultipleStreamWriter(writer, false, "bob");

Now we (or some legacy code, framework, …) may write our document, and the writer will take care of triggering the bob array for us.

Triggering Arrays Automatically

Finally, if nothing else works for you, you may also let StAXON determine array boundaries fully automatically. Use this only if you cannot provide processing instructions and cannot provide the paths of the elements that should be wrapped into JSON arrays. This method has several drawbacks:

- The writer basically needs to cache the entire document in memory, costing both space and time.
- The writer will not be able to produce empty arrays or arrays with a single element.

To enable this feature, set the JsonXMLOutputFactory.PROP_AUTO_ARRAY property to true.

Triggering Document Arrays

StAXON’s writer implementation allows you to wrap a sequence of documents into a JSON array.
To do this, write the PI before writing anything else:

    writer.writeProcessingInstruction(JsonXMLStreamConstants.MULTIPLE_PI_TARGET);
    writer.writeStartDocument(); // first array component
    ...
    writer.writeEndDocument();
    writer.writeStartDocument(); // second array component
    ...
    writer.writeEndDocument();
    ...
    writer.close();

The writer.close() call is crucial here, as it will close the JSON array.

Using JAXB

Consider a JAXB-annotated Customer class:

    @JsonXML(virtualRoot = true, prettyPrint = true, multiplePaths = "phone")
    @XmlRootElement
    public class Customer {
        public String name;
        public List<String> phone;
    }

The @JsonXML annotation is used to configure the mapping details. In the above example, the customer root element is stripped from the JSON representation, phone elements are wrapped into an array and the JSON output is nicely formatted, e.g.

    {
      "name" : "John Doe",
      "phone" : [ "555-1111" ]
    }

Now, the JsonXMLMapper class allows dead-simple mapping to and from JSON:

    /*
     * Create mapper instance.
     */
    JsonXMLMapper<Customer> mapper = new JsonXMLMapper<Customer>(Customer.class);

    /*
     * Read customer.
     */
    InputStream input = getClass().getResourceAsStream("input.json");
    Customer customer = mapper.readObject(input);
    input.close();

    /*
     * Write back to console
     */
    mapper.writeObject(System.out, customer);

Using JAX-RS

StAXON provides the staxon-jaxrs module, which enables your RESTful services to serialize/deserialize JAXB-annotated classes to/from JSON. It includes the following JAX-RS @Provider classes:

- de.odysseus.staxon.json.jaxrs.jaxb.JsonXMLObjectProvider is used to read and write JSON objects
- de.odysseus.staxon.json.jaxrs.jaxb.JsonXMLArrayProvider is used to read and write JSON arrays

In order to select the StAXON message body readers/writers for your resource, a @JsonXML annotation is required.
When used with JAX-RS, the @JsonXML annotation can be placed on a model type (@XmlRootElement or @XmlType) to configure its serialization and deserialization; on a JAX-RS resource method to configure serialization of the result type; or on a parameter of a JAX-RS resource method to configure deserialization of the parameter type. If a @JsonXML annotation is present on both a model type and a resource method or parameter, the latter overrides the model type annotation. If neither is present, StAXON will not handle the resource. You can find a sample project using Jersey with StAXON here.

Using XPath

XPath is another standard that can be easily adopted for use with JSON. The Java XPath API (javax.xml.xpath) doesn’t let us provide an XMLStreamReader or similar as a source, but requires a Document Object Model (DOM). Therefore, we need to read our JSON into a DOM first and then apply expressions against that DOM. This could be done by performing an XSLT identity transformation to a DOMResult. However, StAXON provides the DOMEventConsumer class to translate XML events to DOM nodes, which should be faster and simpler than leveraging XSLT. Once we have a DOM, there’s nothing special about applying XPath expressions.

    StringReader json = new StringReader("{\"edgar\":\"david\",\"bob\":\"charlie\"}");

    // Our sample JSON has no root element, so specify "alice" as virtual root
    JsonXMLConfig config = new JsonXMLConfigBuilder().virtualRoot("alice").build();

    // create event reader
    XMLEventReader reader = new JsonXMLInputFactory(config).createXMLEventReader(json);

    // parse JSON into Document Object Model (DOM)
    Document document = DOMEventConsumer.consume(reader);

    // evaluate an XPath expression
    XPath xpath = XPathFactory.newInstance().newXPath();
    System.out.println(xpath.evaluate("//alice/bob", document));

Running the above sample will print charlie to the console.

What else?
In the end, using an XML API to read and write JSON may still look like a compromise, but it may turn out to be a good choice. The availability of a StAX implementation for JSON opens the door to powerful XML-related technologies and makes dual-format (XML and JSON) services easy to build. There’s more we can do with StAXON: XSD, XSLT, XQuery, and XML-JSON/JSON-XML conversions, to name a few. Please check the Wiki for some of those.
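
As a footnote to the XPath section: the text mentions that an XSLT identity transformation into a DOMResult would also produce the DOM. A minimal sketch of that alternative, using only JDK classes, might look like the following. StAXON is not assumed to be on the classpath here, so a plain XML stream reader stands in for the JSON-backed reader; in principle any XMLStreamReader could be passed in instead. The class name StaxToDomXPath and its two helper methods are hypothetical and not part of StAXON.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;

// Hypothetical helper class for illustration; not part of StAXON.
public class StaxToDomXPath {

    // Copies the StAX stream into a DOM via an identity transformation,
    // then evaluates the XPath expression against that DOM.
    public static String evaluate(XMLStreamReader reader, String expression) throws Exception {
        // newTransformer() without a stylesheet yields the identity transformation
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new StAXSource(reader), result);
        Node document = result.getNode();

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, document);
    }

    // Convenience wrapper: a plain XML reader stands in for a JSON-backed one.
    public static String evaluateXml(String xml, String expression) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            return evaluate(reader, expression);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // XML equivalent of the sample JSON with "alice" as virtual root
        String xml = "<alice><edgar>david</edgar><bob>charlie</bob></alice>";
        System.out.println(evaluateXml(xml, "//alice/bob")); // prints "charlie"
    }
}
```

The identity transformation needs no extra dependency, but it routes every event through the XSLT machinery, which is why the article prefers DOMEventConsumer when StAXON is available.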

February 8, 2012

by Christoph Beck

· 23,013 Views