The Latest Data Engineering Topics

PHP objects in MongoDB with Doctrine
An ODM (Object Document Mapper) is the equivalent of an Object-Relational Mapper, but its targets are documents of a NoSQL database instead of table rows. No one said that a Data Mapper must always rely on a relational database as its back end. In the PHP world, the Doctrine ODM for MongoDB is probably the most successful. This follows from the popularity of Mongo, which is a transitional product between SQL and NoSQL, still based on some relational concepts like queries.

Lots of features
The Doctrine Mongo ODM supports mapping of objects via annotations placed in the class source code, or via external XML or YAML files. In this and in many other aspects it is based on the same concepts as the Doctrine ORM: it features a Facade DocumentManager object and a Unit of Work that batches changes to the database when objects are added to it. Moreover, two different types of relationships between objects are supported: references and embedded documents. The first is the equivalent of the classical pointer to another row which ORMs always transform object references into; the second actually stores an object inside another one, like you would do with a Value Object. Thus, at least in Doctrine's case, it is easier to map objects as documents than as rows. As said before, the ODM borrows some concepts and classes from the ORM, in particular from the Doctrine\Common package, which features a standard collection class. So if you have built objects mapped with the Doctrine ORM, nothing changes for persisting them in MongoDB, except for the mapping metadata itself.

Advantages
If an ORM is sometimes a leaky abstraction, an ODM probably becomes an issue less often. It has less overhead than an ORM, since there is no schema to define, and the ability to embed objects means there should be no compromises between the object model and the capabilities of the database. How many times have we renounced introducing a potential Value Object because of the difficulty in persisting it? The case for an ODM over a plain Mongo connection object is easy to make: you will still be able to use objects with proper encapsulation (like private fields and associations) and behavior (many methods) instead of extracting just a JSON package from your database.

Installation
A prerequisite for the ODM is the presence of the mongo extension, which can be installed via pecl. After having verified the extension is present, grab Doctrine\Common as the 2.2.x package, and a zip of the doctrine-mongodb and doctrine-mongodb-odm projects from Github. Decompress everything into a Doctrine/ folder. After having set up autoloading for classes in Doctrine\, use this bootstrap to get a DocumentManager (the equivalent of EntityManager):

use Doctrine\Common\Annotations\AnnotationReader, Doctrine\ODM\MongoDB\DocumentManager, Doctrine\MongoDB\Connection, Doctrine\ODM\MongoDB\Configuration, Doctrine\ODM\MongoDB\Mapping\Driver\AnnotationDriver; private function getADm() { $config = new Configuration(); $config->setProxyDir(__DIR__ . '/mongocache'); $config->setProxyNamespace('MongoProxies'); $config->setDefaultDB('test'); $config->setHydratorDir(__DIR__ . '/mongocache'); $config->setHydratorNamespace('MongoHydrators'); $reader = new AnnotationReader(); $config->setMetadataDriverImpl(new AnnotationDriver($reader, __DIR__ . '/Documents')); return DocumentManager::create(new Connection(), $config); }

You will be able to call persist() and flush() on the DocumentManager, along with a set of other methods for querying like find() and getRepository().
Integration with an ORM
We are researching a solution for versioning objects mapped with the Doctrine ORM. Doing this with a version column would be invasive, and also strange where multiple objects are involved (do you version just the root of an object graph? Duplicate the other ones when they change? How can you detect that?) The idea is taking a snapshot and putting it in a read-only MongoDB instance, where all previous versions can be retrieved later for auditing (business reasons). This has been verified to be technically possible: the DocumentManager and EntityManager are totally separate object graphs, so they won't clash with each other. The only point of conflict is the annotations of model classes, since both use different versions of @Id, and can see the other's annotations like @Entity and @Document while parsing. This can be solved by using aliases for all the annotations, using their parent namespace basename as a prefix: model = $model; } public function __toString() { return "Car #$this->document_id: $this->id, $this->model"; } } This makes us able to save a copy of an ORM object into Mongo: $car = new Car('Ford'); $this->em->persist($car); $this->em->flush(); $this->dm->persist($car); $this->dm->flush(); var_dump($car->__toString()); $this->assertTrue(strlen($car->__toString()) > 20); The output produced by this test is: .string(38) "Car #4f61a8322f762f1121000000: 3, Ford" When retrieving the object, one of the two ids will be null as it is ignored by the ORM or ODM. I am not using the same field because I want to store multiple copies of a row, so its id alone won't be unique. If you're interested, check out my hack on Github. It contains the running example presented in this post. Remember to create the relational schema with: $ php doctrine.php orm:schema-tool:create before running the test with phpunit --bootstrap bootstrap.php DoubleMappingTest.php MongoDB won't need the schema setup, of course. There are still some use cases to test, like the behavior in the presence of proxies, but it seems that the non-invasive approach of Data Mappers like Doctrine 2 is paying off: try mapping an object in multiple databases with Active Records.
March 20, 2012
by Giorgio Sironi
· 21,956 Views
Adding a .first() method to Django's QuerySet
In my last Django project, we had a set of helper functions that we used a lot. The most used was helpers.first, which takes a query set and returns the first element, or None if the query set was empty. Instead of writing this: try: object = MyModel.objects.get(key=value) except MyModel.DoesNotExist: object = None You can write this: def first(query): try: return query.all()[0] except: return None object = helpers.first(MyModel.objects.filter(key=value)) Note that this is not identical. The get method will ensure that there is exactly one row in the database that matches the query. The helpers.first() method will silently eat all but the first matching row. As long as you're aware of that, you might choose to use the second form in some cases, primarily for style reasons. But the syntax on the helper is a little verbose, plus you're constantly including helpers.py. Here is a version that makes this available as a method on the end of your query set chain. All you have to do is have your models inherit from this AbstractModel. class FirstQuerySet(models.query.QuerySet): def first(self): try: return self[0] except: return None class ManagerWithFirstQuery(models.Manager): def get_query_set(self): return FirstQuerySet(self.model) class AbstractModel(models.Model): objects = ManagerWithFirstQuery() class Meta: abstract = True class MyModel(AbstractModel): ... Now, you can do the following. object = MyModel.objects.filter(key=value).first()
March 19, 2012
by Chase Seibert
· 12,364 Views
GapList – a Lightning-Fast List Implementation
This article introduces GapList, a List implementation that strives to combine the strengths of both ArrayList and LinkedList.
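As a quick orientation, here is a minimal usage sketch (mine, not the article's). It assumes the class lives in the org.magicwerk.brownies.collections package, as documented for the brownies-collections library; since GapList implements java.util.List, it can be used as a drop-in replacement for ArrayList.

import java.util.List;
import org.magicwerk.brownies.collections.GapList;

public class GapListDemo {
    public static void main(String[] args) {
        // GapList implements java.util.List, so existing code keeps working
        List<String> names = new GapList<String>();
        names.add("alpha");      // appends at the end, as fast as ArrayList
        names.add(0, "omega");   // inserts near the current "gap" are cheap as well
        System.out.println(names.get(0) + ", size=" + names.size());
    }
}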
March 19, 2012
by Thomas Mauch
· 64,354 Views · 4 Likes
Defensive Programming vs. Batshit Crazy Paranoid Programming
Hey, let’s be careful out there. --Sergeant Esterhaus, daily briefing to the force of Hill Street Blues

When developers run into an unexpected bug and can’t fix it, they’ll “add some defensive code” to make the code safer and to make it easier to find the problem. Sometimes just doing this will make the problem go away. They’ll tighten up data validation – making sure to check input and output fields and return values. Review and improve error handling – maybe add some checking around “impossible” conditions. Add some helpful logging and diagnostics. In other words, the kind of code that should have been there in the first place.

Expect the Unexpected
The whole point of defensive programming is guarding against errors you don’t expect. --Steve McConnell, Code Complete

The few basic rules of defensive programming are explained in a short chapter in Steve McConnell’s classic book on programming, Code Complete: Protect your code from invalid data coming from “outside”, wherever you decide “outside” is. Data from an external system or the user or a file, or any data from outside of the module/component. Establish “barricades” or “safe zones” or “trust boundaries” – everything outside of the boundary is dangerous, everything inside of the boundary is safe. In the barricade code, validate all input data: check all input parameters for the correct type, length, and range of values. Double check for limits and bounds. After you have checked for bad data, decide how to handle it. Defensive Programming is NOT about swallowing errors or hiding bugs. It’s about deciding on the trade-off between robustness (keep running if there is a problem you can deal with) and correctness (never return inaccurate results). Choose a strategy to deal with bad data: return an error and stop right away (fast fail), return a neutral value, substitute data values, … Make sure that the strategy is clear and consistent. Don’t assume that a function call or method call outside of your code will work as advertised. Make sure that you understand and test error handling around external APIs and libraries. Use assertions to document assumptions and to highlight “impossible” conditions, at least in development and testing. This is especially important in large systems that have been maintained by different people over time, or in high-reliability code. Add diagnostic code, logging and tracing intelligently to help explain what’s going on at run-time, especially if you run into a problem. Standardize error handling. Decide how to handle “normal errors” or “expected errors” and warnings, and do all of this consistently. Use exception handling only when you need to, and make sure that you understand the language’s exception handling inside out. Programs that use exceptions as part of their normal processing suffer from all the readability and maintainability problems of classic spaghetti code. --The Pragmatic Programmer

I would add a couple of other rules. From Michael Nygard’s Release It!: Never ever wait forever on an external call, especially a remote call. Forever can be a long time when something goes wrong. Use time-out/retry logic and his Circuit Breaker stability pattern to deal with remote failures. And for languages like C and C++, defensive programming also includes using safe function calls to avoid buffer overflows and common coding mistakes.

Different Kinds of Paranoia
The Pragmatic Programmer describes defensive programming as “Pragmatic Paranoia”.
Protect your code from other people’s mistakes, and your own mistakes. If in doubt, validate. Check for data consistency and integrity. You can’t test for every error, so use assertions and exception handlers for things that “can’t happen”. Learn from failures in test and production – if this failed, look for what else can fail. Focus on critical sections of code – the core, the code that runs the business. Healthy Paranoid Programming is the right kind of programming. But paranoia can be taken too far. In the Error Handling chapter of Clean Code, Michael Feathers cautions that “many code bases are dominated by error handling” --Michael Feathers, Clean Code Too much error handling code not only obscures the main path of the code (what the code is actually trying to do), but it also obscures the error handling logic itself – so that it is harder to get it right, harder to review and test, and harder to change without making mistakes. Instead of making the code more resilient and safer, it can actually make the code more error-prone and brittle. There’s healthy paranoia, then there’s over-the-top-error-checking, and then there’s bat shit crazy crippling paranoia – where defensive programming takes over and turns in on itself. The first real world system I worked on was a “Store and Forward” network control system for servers (they were called minicomputers back then) across the US and Canada. It shared data between distributed systems, scheduled jobs, and coordinated reporting across the network. It was designed to be resilient to network problems and automatically recover and restart from operational failures. This was ground breaking stuff at the time, and a hell of a technical challenge. The original programmer on this system didn’t trust the network, didn’t trust the O/S, didn’t trust Operations, didn’t trust other people’s code, and didn’t trust his own code – for good reason. He was a chemical engineer turned self-taught system programmer who drank a lot while coding late at night and wrote thousands of lines of unstructured FORTRAN and Assembler under the influence. The code was full of error checking and self diagnostics and error-correcting code, the files and data packets had their own checksums and file-level passwords and hidden control labels, and there was lots of code to handle sequence accounting exceptions and timing-related problems – code that mostly worked most of the time. If something went wrong that it couldn’t recover from, programs would crash and report a “label of exit” and dump the contents of variables – like today’s stack traces. You could theoretically use this information to walk back through the code to figure out what the hell happened. None of this looked anything like anything that I learned about in school. Reading and working with this code was like programming your way out of Arkham Asylum. If the programmer ran into bugs and couldn’t fix them, that wouldn’t stop him. He would find a way to work around the bugs and make the system keep running. Then later after he left the company, I would find and fix a bug and congratulate myself until it broke some “error-correcting” code somewhere else in the network that now depended on the bug being there. So after I finally figured out what was going on, I took out as much of this “protection” as I could safely remove, and cleaned up the error handling so that I could actually maintain the system without losing what was left of my mind. 
I set up trust boundaries for the code – although I didn’t know that’s what it was called then – deciding what data couldn’t be trusted and what could. Once this was done I was able to simplify the defensive code so that I could make changes without the system falling over itself, and still protect the core code from bad data, mistakes in the rest of the code, and operational problems.

Making code safer is simple
The point of defensive coding is to make the code safer and to help whoever is going to maintain and support the code – not make their job harder. Defensive code is code – all code has bugs, and, because defensive code is dealing with exceptions, it is especially hard to test and to be sure that it will work when it has to. Understanding what conditions to check for and how much defensive coding is needed takes experience, working with code in production and seeing what can go wrong in the real world. A lot of the work involved in designing and building secure, resilient systems is technically difficult or expensive. Defensive programming is neither – like defensive driving, it’s something that everyone can understand and do. It requires discipline and awareness and attention to detail, but it’s something that we all need to do if we want to make the world safe.
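To make the "barricade" idea concrete, here is a small illustrative Java sketch (mine, not from the article): input crossing the trust boundary is validated and rejected fast, while an assertion documents a condition that should be impossible once inside the boundary. The class and method names are invented for the example.

public final class PaymentBarricade {

    // Everything passed into this method is treated as untrusted, "outside" data.
    public static long toCents(String rawAmount) {
        if (rawAmount == null || rawAmount.trim().isEmpty()) {
            throw new IllegalArgumentException("amount is required");   // fail fast on bad data
        }
        final long cents;
        try {
            cents = Math.round(Double.parseDouble(rawAmount.trim()) * 100);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("amount is not a number: " + rawAmount, e);
        }
        if (cents < 0 || cents > 100_000_000L) {
            throw new IllegalArgumentException("amount out of range: " + rawAmount);
        }
        // Inside the barricade: this "can't happen" if the checks above are correct.
        assert cents >= 0 : "negative amount slipped past validation";
        return cents;
    }
}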
March 19, 2012
by Jim Bird
· 23,918 Views
Hadoop Basics—Creating a MapReduce Program
The MapReduce framework processes data in two main phases: the "map" phase and the "reduce" phase.
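As a rough illustration of those two phases (a sketch of the classic word count, not the article's own example), a mapper and reducer written against Hadoop's org.apache.hadoop.mapreduce API look roughly like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token in the input split
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts that the framework grouped by word
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}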
March 18, 2012
by Carlo Scarioni
· 212,270 Views · 4 Likes
Deploying an Artifact to the Local Cache in Gradle
One question that came up a couple of times this week is how to set gradle up to deploy jars locally. For the most part I was satisfied with just having people push snapshot releases to our Artifactory server but some people did express a real desire to be able to publish a jar to the local resolution cache to test changes out locally. I’m still a fan of deploying snapshots from feature branches but luckily you can do a local publish and resolve with gradle. First off, ask yourself if the dependency is coupled enough to warrant being a submodule. Also, could just linking the project in your IDE be enough to get what you want done? If the answer to both questions is no, then your next recourse is to use gradle’s excellent maven compatibility (don’t run!). For the project you want to publish locally you simply need to apply the maven plugin and make sure you have version and group set for the project (usually I put group and version in gradle.properties). apply plugin: 'java' apply plugin: 'maven' version = '0.5.1-SNAPSHOT' group = 'org.jamescarr.examples' That’s all you need to install it locally; just run gradle install from the project root to install it to the local m2 cache. Now let’s update your project that will depend on it. ... repositories { mavenCentral() mavenLocal() } dependencies { compile('org.jamescarr.examples:example-api:0.5.1-SNAPSHOT'){ changing=true } } The magic sauce here is using mavenLocal() as one of your resolution repositories. This will resolve against the local m2 cache. mavenCentral() can be replaced by whatever repositories you might use; it is only included since it’s the most often used. That’s it! I know some people dislike this approach due to ingrained disdain for maven but the beauty of it is that maven is silently at work and you really don’t get bothered by it.
March 16, 2012
by James Carr
· 33,245 Views
Display an OLE Object from a Microsoft Access Database using OLE Stripper
In database programming, it happens a lot that you need to bind a picture box to a field of type photo or image. For example, if you want to show an Employee’s picture from the Northwind.mdb database, you might want to try the following code: picEmployees.DataBindings.Add("Image", bsEmployees, "Photo", true); This code works if the images are stored in the database with no OLE header or the images are stored in raw image file formats. As the pictures in the Northwind database are not stored in raw image file formats but as OLE image documents, you have to strip off the OLE header to work with the image properly. Binding imageBinding = new Binding("Image", bsEmployees, "ImageBlob.ImageBlob", true); imageBinding.Format += new ConvertEventHandler(this.PictureFormat); private void PictureFormat(object sender, ConvertEventArgs e) { Byte[] img = (Byte[])e.Value; MemoryStream ms = new MemoryStream(); int offset = 78; ms.Write(img, offset, img.Length - offset); Bitmap bmp = new Bitmap(ms); ms.Close(); // Writes the new value back e.Value = bmp; } Fortunately, there are some overload methods in the .NET Framework to take care of this mechanism, but it cannot be guaranteed whether you need to strip off the OLE object by yourself or not. For example, you can use the following technique to access the images of the Northwind.mdb that ships with Microsoft Access and they will be rendered properly. picEmployees.DataBindings.Add("Image", bsEmployees, "Photo", true, DataSourceUpdateMode.Never, new Bitmap(typeof(Button), "Button.bmp")); Unfortunately, there are some scenarios in which you need a better solution. For example, the Xtreme.mdb database that ships with Crystal Reports has a photo field that cannot be handled by the preceding methods. For these complex scenarios, you can download the OLEStripper classes from here and re-write the PictureFormat method as it is shown below: private void PictureFormat(object sender, ConvertEventArgs e) { // photoIndex is same as Employee ID int photoIndex = Convert.ToInt32(e.Value); // Read the original OLE object ReadOLE olePhoto = new ReadOLE(); string PhotoPath = olePhoto.GetOLEPhoto(photoIndex); // Strip the original OLE object StripOLE stripPhoto = new StripOLE(); string StripPhotoPath = stripPhoto.GetStripOLE(PhotoPath); FileStream PhotoStream = new FileStream(StripPhotoPath, FileMode.Open); Image EmployeePhoto = Image.FromStream(PhotoStream); e.Value = EmployeePhoto; PhotoStream.Close(); }
March 15, 2012
by Amir Ahani
· 10,756 Views
Circos: An Amazing Tool for Visualizing Big Data
storing massive amounts of data in a nosql data store is just one side of the big data equation. being able to visualize your data in such a way that you can easily gain deeper insights is where things really start to get interesting. lately, i've been exploring various options for visualizing (directed) graphs, including circos. circos is an amazing software package that visualizes your data through a circular layout. although it's originally designed for displaying genomic data, it allows you to create good-looking figures from data in any field. just transform your data set into a tabular format and you are ready to go. the figure below illustrates the core concept behind circos. the table's columns and rows are represented by segments around the circle. individual cells are shown as ribbons, which connect the corresponding row and column segments. the ribbons themselves are proportional in width to the value in the cell. when visualizing a directed graph, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. the proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table. in my case, i want to better understand the flow of visitors to and within the datablend site and blog; where do visitors come from (direct, referral, search, ...) and how do they navigate between pages. the rest of this article details how to 1) retrieve the raw visit information through the google analytics api, 2) persist this information as a graph in neo4j and 3) query and preprocess this data for visualization through circos. as always, the complete source code can be found on the datablend public github repository.

1. retrieving your google analytics data
let's start by retrieving the raw google analytics data. the google analytics data api provides access to all dimensions and metrics that can be queried through the web application. in my case, i'm interested in retrieving the previous page path property for each page view. if a visitor enters through a page outside of the datablend website, the previous page path is marked as (entrance). otherwise, it contains the internal path. we will use google's java data api to connect and retrieve this information. we are particularly interested in the pagepath, pagetitle, previouspagepath and medium dimensions, while our metric of choice is the number of pageviews. after setting the date range, the feed of entries that satisfy these criteria can be retrieved. for ease of use, we transform this data to a domain entity and filter/clean the data accordingly. if a visit originates from outside the datablend website, we store the specific medium (direct, referral, search, ...) as previous path.
// authenticate analyticsservice = new analyticsservice(configuration.service); analyticsservice.setusercredentials(configuration.client_username, configuration.client_pass); // create query dataquery query = new dataquery(new url(configuration.data_url)); query.setids(configuration.table_id); query.setdimensions("ga:medium,ga:previouspagepath,ga:pagepath,ga:pagetitle"); query.setmetrics("ga:pageviews"); query.setstartdate(datestring); query.setenddate(datestring); // execute datafeed feed = analyticsservice.getfeed(createqueryurl(date), datafeed.class); // iterate and clean for (dataentry entry : feed.getentries()) { string pagepath = entry.stringvalueof("ga:pagepath"); string pagetitle = entry.stringvalueof("ga:pagetitle"); string previouspagepath = entry.stringvalueof("ga:previouspagepath"); string medium = entry.stringvalueof("ga:medium"); long views = entry.longvalueof("ga:pageviews"); // filter the data if (filter(pagepath) && filter(previouspagepath) && (!clean(previouspagepath).equals(clean(pagepath)))) { // check criteria are satisfied navigation navigation = new navigation(clean(previouspagepath), clean(pagepath), pagetitle, date, views); if (navigation.getsource().equals("(entrance)")) { // in case of an entrace, save its medium instead navigation.setsource(medium); } navigations.add(navigation); } } 2. storing navigational data as a directed graph in neo4j the set of site navigations can easily be stored as a directed graph in the neo4j graph database . nodes are site paths (or mediums), while relationships are the navigations themselves. we start by retrieving the navigations for a particular date range and retrieve (or lazily create) the nodes representing the source and target paths (or mediums). next we de-normalize the pageviews metric (for instance, 6 individual relationships will be created for 6 page-views). although this de-normalization step is not really required, i did so to make sure that the degree of my nodes is correct if i would perform other types of calculations. for each individual navigation relationship, we also store the date of visit . // retrieve navigations for a particular date list navigations = retrieval.getnavigations(date); // save them in the graph database transaction tx = graphdb.begintx(); // iterate and create for (navigation nav : navigations) { node source = getpath(nav.getsource()); node target = getpath(nav.gettarget()); if (!target.hasproperty("title")) { target.setproperty("title", nav.gettargettitle()); } for (long i = 0; i < nav.getamount(); i++) { // duplicate relationships relationship transition = source.createrelationshipto(target, relationships.navigation); transition.setproperty("date", date.gettime()); // save time as long } } // commit tx.success(); tx.finish(); 3. creating the circos tabular data format the circos tabular data format is quite easy to construct. it's basically a tab-delimited file with row and column headers. a cell is interpreted as a value that flows from the row entity to the column entity . we will use the neo4j cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period . doing so allows us to create historical visualizations of our navigations and observe how visit flow behaviors are changing over time. 
// access the graph database graphdb = new embeddedgraphdatabase("var/analytics"); engine = new executionengine(graphdb); // execute the data range cypher query map params = new hashmap(); params.put("fromdate", from.gettime()); params.put("todate", to.gettime()); // execute the query executionresult result = engine.execute("start sourcepath=node:index(\"path:*\") " + "match sourcepath-[r]->targetpath " + "where r.date >= {fromdate} and r.date <= {todate} " + "return sourcepath,targetpath", params); next, we create the tab delimited file itself. we iterate through all entries (i.e. navigations) that match our cypher query and store them in a temporary list. afterwards, we start building the two-dimensional array by normalizing (i.e. summing) the number of navigations between the source and target paths. at the end, we filter this occurrence matrix on the minimal number of required navigations. this ensures that we will only create segments for paths that are relevant in the total population. as a final step, we print the occurrences matrix as a tab-delimited file. for each path, we will use a shorthand as the circos renderer seems to have problem with long string identifiers. // retrieve the results iterator> it = result.javaiterator(); list navigations = new arraylist(); map titles = new hashmap(); set paths = new hashset(); // iterate the results while (it.hasnext()) { map record = it.next(); string source = (string)((node) record.get("sourcepath")).getproperty("path"); string target = (string) ((node) record.get("targetpath")).getproperty("path"); string targettitle = (string) ((node) record.get("targetpath")).getproperty("title"); // reuse the navigation object as temorary holder navigations.add(new navigation(source, target, targettitle, new date(), 1)); paths.add(source); paths.add(target); if (!titles.containskey(target)) { titles.put(target, targettitle); } } // retrieve the various paths list pathids = arrays.aslist(paths.toarray(new string[]{})); // create the matrix that holds the info int[][] occurences = new int[pathids.size()][pathids.size()]; // iterate through all the navigations and update accordingly for (navigation navigation : navigations) { int sourceindex = pathids.indexof(navigation.getsource()); int targetindex = pathids.indexof(navigation.gettarget()); occurences[sourceindex][targetindex] = occurences[sourceindex][targetindex] + 1; } // matrix build, filter on threshold for (int i = 0; i < occurences.length; i++) { for (int j = 0; j < occurences.length; j++) { if (occurences[i][j] < threshold) { occurences[i][j] = 0; } } // print printcircosdata(pathids, titles, occurences); the text below is a sample of the output generated by the printcircosdata method. it first prints the legend (matching shorthands with actual paths). next it prints the tab-delimited circos table. link0 - /?p=411/wp-admin - storing and querying rdf data in neo4j through sail - datablend link1 - /?p=1146 - visualizing rdf schema inferencing through neo4j, tinkerpop, sail and gephi - datablend link2 - /?p=164 - big data / concise articles - datablend link3 - referral - null link4 - /?p=1400 - the joy of algorithms and nosql revisited: the mongodb aggregation framework - datablend ... datal0l1l2l3l4... l000000 l100000 l200000 l3059400197 l400000 4. use the circos power although circos can be installed on your local computer, we will use its online version to create the visualization of our data. 
upload your tab-delimited file and just wait a few seconds before enjoying the beautiful rendering of your site's navigation information. at a single glance we can already see that the l3-segment (i.e. the referrals) is significantly larger (almost 6000 navigations) compared to the other segments. the outer 3 rings visualize the total amount of navigations that are leaving and entering this particular path. in case of referrals, no navigations have this path as target (indicated by the empty middle ring). its total segment count (inner ring) is entirely built up out of navigations that have a referral as source. the l6-segment seems to be the path that attracts the most traffic (around 2500 navigations). this segment visualizes the navigation data related to my "the joy of algorithms and nosql: a mongodb example" article. most of its traffic is received through referrals, while a decent amount is also generated through direct (l17-segment) and search (l27-segment) traffic. the l15-segment (my blog's main page) is the only path that receives an almost equal amount of incoming and outgoing traffic. with just a few tweaks to the circos input data, we can easily focus on particular types of navigation data. in the figure below, i made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.

5. conclusions
in the era of big data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. circos specializes in a very specific type of visualization, but does its job extremely well. i would be delighted to hear about other types of visualizations for directed graphs.
March 13, 2012
by Davy Suvee
· 35,898 Views · 2 Likes
All about JMS messages
JMS providers like ActiveMQ are based on the concept of passing one-directional messages between nodes and brokers asynchronously. A thorough knowledge of the types of messages that can be sent through a JMS middleware can greatly simplify your work in mapping the communication patterns to real code.

The basic Message interface
Some object members are shared by all messages: header fields, used to uniquely identify a message and to route it to the right brokers and consumers. A dynamic map of properties which can be read programmatically by JMS brokers in order to filter or to route messages. A body, which is differentiated in the various implementations we'll see.

Header fields
The set of getJMS*() methods on the Message interface defines the available headers. Two of them are oriented to message identification: getJMSMessageID() contains a generated ID for identifying a message, unique at least for the current broker. All generated IDs start with the prefix 'ID:', but you can override it with the corresponding setter. getJMSCorrelationID() (and getJMSCorrelationIDAsBytes()) can link a message with another, usually one that has been sent previously. For example, a reply can carry the ID of the original message when put in another queue. Two relate to sender and recipient identification: getJMSDestination() returns a Destination object (a Topic or a Queue, or their temporary version) describing where the message was directed. getJMSReplyTo() is a Destination object where replies should be sent; it can be null of course. Three tune the delivery mechanism: getJMSDeliveryMode() can be DeliveryMode.NON_PERSISTENT or DeliveryMode.PERSISTENT; only persistent messages guarantee delivery in case of a crash of the brokers that transport it. getJMSExpiration() returns a timestamp indicating the expiration time of the message; it can be 0 on a message without a defined expiration. getJMSPriority() returns a 0-9 integer value (higher is better) defining the priority for delivery. It is only a best-effort value. The remaining ones contain metadata: getJMSRedelivered() returns a boolean indicating if the message is being delivered again after a delivery which was not acknowledged. getJMSTimestamp() returns a long indicating the time of sending. getJMSType() defines a field for provider-specific or application-specific message types. Of these headers, only JMSCorrelationID, JMSReplyTo and JMSType have to be set when needed. The others are generated or managed by the send() and publish() methods if not specified (there are setters available for each of these headers).

Properties
Generic properties with a String name can be added to messages and read with getBooleanProperty(), getStringProperty() and similar methods. The corresponding setters setBooleanProperty(), setStringProperty(), ... can be used for their addition. The reason for keeping some properties out of the content of the message (which a MapMessage could contain) is so they can be read before reaching the destination, for example in a JMS broker. The use case for this access to message properties is routing and filtering: downstream brokers and consumers may define a filter such as "I am interested only in messages on this Topic that have property X = 'value' and Y = 'value2'".

Bodies
All the subinterfaces of javax.jms.Message defined by the API provide different types of message bodies (while the actual classes are defined by the providers and are not part of the API).
Actual instantiation is then handled by the Session, which implements an Abstract Factory pattern. On the receiving side, a cast is necessary for any message type (at least in the Java JMS API), since only a generic Message is read. BytesMessage is the most basic type: it contains a sequence of uninterpreted bytes. Hence, it can in theory contain anything, but the generation and interpretation of the content is the client's job. BytesMessage m = session.createBytesMessage(); m.writeByte(65); m.writeBytes(new byte[] { 66, 68, 70 }); // on receipt (cast shown only here) BytesMessage m = (BytesMessage) genericMessage; byte[] content = new byte[4]; m.readBytes(content); MapMessage defines a message containing an (unordered) set of key/value pairs, also called a map or dictionary or hash. However, the keys are String objects, while the values are primitives or Strings; since they are primitives, they shouldn't be null. MapMessage m = session.createMapMessage(); m.setString("key", "value"); // or m.setObject("key", "value") to avoid specifying a type // on receipt m.getString("key"); // or m.getObject("key") ObjectMessage wraps a generic Object for transmission. The Object should be Serializable. ObjectMessage m = session.createObjectMessage(); m.setObject(new ValueObject("field1", 42)); // on receipt ValueObject vo = (ValueObject) m.getObject(); StreamMessage wraps a stream of primitive values of indefinite length. StreamMessage m = session.createStreamMessage(); m.writeBoolean(true); m.writeBoolean(false); m.writeBoolean(true); // on receipt System.out.println(m.readBoolean()); System.out.println(m.readBoolean()); System.out.println(m.readBoolean()); // prints true, false, true TextMessage wraps a String of any length. TextMessage m = session.createTextMessage("Contents"); // or use m.setText() afterwards // on receipt String text = m.getText(); Usually all messages are in a read-only mode after receipt, so only getters can be called on them.
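To illustrate the property-based filtering described in the Properties section, here is a small sketch of mine (not from the article): the producer attaches two properties and a consumer registers a JMS message selector so the provider only delivers matching messages. The session, producer, and topic are assumed to have been created elsewhere.

import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

class FilteredMessaging {
    // Producer side: attach routing metadata as properties, not as body content
    static void sendTagged(Session session, MessageProducer producer) throws JMSException {
        TextMessage message = session.createTextMessage("order payload");
        message.setStringProperty("X", "value");
        message.setStringProperty("Y", "value2");
        producer.send(message);
    }

    // Consumer side: the selector is evaluated by the provider before delivery
    static MessageConsumer subscribeTagged(Session session, Topic topic) throws JMSException {
        return session.createConsumer(topic, "X = 'value' AND Y = 'value2'");
    }
}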
March 12, 2012
by Giorgio Sironi
· 61,429 Views · 1 Like
Joins with MapReduce
i have been reading up on join implementations available for hadoop for the past few days. in this post i recap some techniques i learnt during the process. the joins can be done at either the map side or the reduce side according to the nature of the data sets to be joined.

reduce side join
let’s take the following tables containing employee and department data. let’s see how the join query below can be achieved using a reduce side join. select employees.name, employees.age, department.name from employees inner join department on employees.dept_id=department.dept_id the map side is responsible for emitting the join predicate values along with the corresponding record from each table, so that records having the same department id in both tables will end up at the same reducer, which would then do the joining of records having the same department id. however, it is also required to tag each record to indicate from which table the record originated, so that joining happens between records of the two tables. the following diagram illustrates the reduce side join process. here is the pseudo code for the map function for this scenario. map (k table, v rec) { dept_id = rec.dept_id tagged_rec.tag = table tagged_rec.rec = rec emit(dept_id, tagged_rec) } at the reduce side, the join happens between records having different tags. reduce (k dept_id, list tagged_recs) { for (tagged_rec : tagged_recs) { for (tagged_rec1 : tagged_recs) { if (tagged_rec.tag != tagged_rec1.tag) { joined_rec = join(tagged_rec, tagged_rec1) } emit (tagged_rec.rec.dept_id, joined_rec) } } }

map side join (replicated join) using distributed cache on smaller table
for this implementation to work, one relation has to fit into memory. the smaller table is replicated to each node and loaded into memory. the join happens at the map side without reducer involvement, which significantly speeds up the process since this avoids shuffling all data across the network, even though most of the non-matching records are later dropped. the smaller table can be loaded into a hash-table so that look-up by dept_id can be done. the pseudo code is outlined below. map (k table, v rec) { list recs = lookup(rec.dept_id) // get smaller table records having this dept_id for (small_table_rec : recs) { joined_rec = join (small_table_rec, rec) } emit (rec.dept_id, joined_rec) }

using distributed cache on filtered table
if the smaller table doesn’t fit into memory, it may be possible to prune its contents if a filtering expression has been specified in the query. consider the following query. select employees.name, employees.age, department.name from employees inner join department on employees.dept_id=department.dept_id where department.name="eng" here a smaller data set can be derived from the department table by filtering out records having department names other than “eng”. now it may be possible to do a replicated map side join with this smaller data set.

replicated semi-join (reduce side join with map side filtering)
even if the filtered data of the small table doesn’t fit into memory, it may be possible to include just the dept_ids of the filtered records in the replicated data set. then at the map side this cache can be used to filter out records which would be sent over to the reduce side, thus reducing the amount of data moved between the mappers and reducers. the map side logic would look as follows.
map (k table, v rec) { // check if this record needs to be sent to reducer boolean sendtoreducer = check_cache(rec.dept_id) if (sendtoreducer) { dept_id = rec.dept_id tagged_rec.tag = table tagged_rec.rec = rec emit(dept_id, tagged_rec) } } reducer side logic would be same as the reduce side join case. using a bloom filter a bloom filter is a construct which can be used to test the containment of a given element in a set. a smaller representation of filtered dept_ids can be derived if dept_id values can be augmented in to a bloom filter. then this bloom filter can be replicated to each node. at the map side for each record fetched from the smaller table the bloom filter can be used to check whether the dept_id in the record is present in the bloom filter and only if so to emit that particular record to reduce side. since a bloom filter is guaranteed not to provide false negatives the result would be accurate. references [1] hadoop in action [2] hadoop : the definitive guide
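As a rough Java sketch of the bloom filter idea above (my illustration, not the article's code), Hadoop ships an org.apache.hadoop.util.bloom.BloomFilter that could be populated with the filtered dept_ids and consulted in the mapper; the vector size and hash count below are arbitrary assumptions, and the wrapper class is invented for the example.

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

class DeptIdBloomFilter {
    // vector size and number of hash functions are tuning knobs, chosen arbitrarily here
    private final BloomFilter filter = new BloomFilter(10_000, 5, Hash.MURMUR_HASH);

    void add(String deptId) {
        filter.add(new Key(deptId.getBytes()));
    }

    // No false negatives: a "false" answer is always safe to drop at the map side
    boolean mightContain(String deptId) {
        return filter.membershipTest(new Key(deptId.getBytes()));
    }
}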
March 12, 2012
by Buddhika Chamith
· 30,743 Views
Using Maven's -U Command Line Option
My preferred solution was to use the Maven ‘update snapshots’ command line argument.
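For reference, that argument is -U (long form --update-snapshots), which forces Maven to check remote repositories for updated snapshot dependencies; a typical invocation looks like:

mvn clean install -U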
March 11, 2012
by Roger Hughes
· 104,826 Views · 1 Like
Writing a simple named pipes server in C#
I solved a little problem last night when playing with named pipes. I created a named pipe that writes all output to a file. Named pipes are opened for all users on a single machine. In this post I will show you a simple class that works as a pipe server. In .NET-based languages we can use the System.IO.Pipes namespace classes to work with named pipes. Here is my simple pipe server that writes all client output to file. public class MyPipeServer { public void Run() { var sid = new SecurityIdentifier(WellKnownSidType.WorldSid, null); var rule = new PipeAccessRule(sid, PipeAccessRights.ReadWrite, AccessControlType.Allow); var sec = new PipeSecurity(); sec.AddAccessRule(rule); using (NamedPipeServerStream pipeServer = new NamedPipeServerStream ("testpipe",PipeDirection.InOut, 100, PipeTransmissionMode.Byte, PipeOptions.None, 0, 0, sec)) { pipeServer.WaitForConnection(); var read = 0; var bytes = new byte[4096]; using(var file=File.Open(@"c:\tmp\myfile.dat", FileMode.Create)) while ((read = pipeServer.Read(bytes, 0, bytes.Length)) > 0) { file.Write(bytes, 0, read); file.Flush(); } } } } Real-life pipe scenarios are usually more complex but this simple class is good to get things running like they should be.
March 10, 2012
by Gunnar Peipman
· 30,606 Views
Best Practices for Variable and Method Naming
Use short enough and long enough variable names in each scope of code. Generally, length may be 1 char for loop counters, 1 word for condition/loop variables, 1-2 words for methods, 2-3 words for classes, and 3-4 words for globals.
Use specific names for variables; for example "value", "equals", "data", ... are not valid names for any case.
Use meaningful names for variables. A variable name must exactly describe its content.
Don't start variables with o_, obj_, m_, etc. A variable does not need tags which state it is a variable.
Obey company naming standards and write variable names consistently in the application: e.g. txtUserName, lblUserName, cmbSchoolType, ... Otherwise readability will suffer and find/replace tools will be unusable.
Obey programming language standards and don't use lowercase/uppercase characters inconsistently: e.g. userName, UserName, USER_NAME, m_userName, username, ... For example, for Java:
use Camel Case (aka Upper Camel Case) for classes: VelocityResponseWriter
use Lower Case for packages: com.company.project.ui
use Mixed Case (aka Lower Camel Case) for variables: studentName
use Upper Case for constants: MAX_PARAMETER_COUNT = 100
use Camel Case for enum class names and Upper Case for enum values.
don't use '_' anywhere except constants and enum values (which are constants).
Don't reuse the same variable name in the same class in different contexts, e.g. in a method, constructor, and class. This provides more simplicity for understandability and maintainability.
Don't use the same variable for different purposes in a method, conditional, etc. Create a new and differently named variable instead. This is also important for maintainability and readability.
Don't use non-ASCII chars in variable names. Those may run on your platform but may not on others.
Don't use too long variable names (e.g. 50 chars). Long names will bring ugly and hard-to-read code, and may not run on some compilers because of character limits.
Decide on and use one natural language for naming; e.g. using mixed English and German names will be inconsistent and unreadable.
Use meaningful names for methods. The name must specify the exact action of the method and in most cases must start with a verb (e.g. createPasswordHash).
Obey company naming standards and write method names consistently in the application: e.g. getTxtUserName(), getLblUserName(), isStudentApproved(), ... Otherwise readability will suffer and find/replace tools will be unusable.
Obey programming language standards and don't use lowercase/uppercase characters inconsistently: e.g. getUserName, GetUserName, getusername, ... For example, for Java:
use Mixed Case for method names: getStudentSchoolType
use Mixed Case for method parameters: setSchoolName(String schoolName)
Use meaningful names for method parameters, so the code can document itself in the absence of documentation.
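A tiny illustrative Java snippet (mine, not the article's) showing several of these conventions side by side, reusing names mentioned above:

package com.company.project.ui;                         // lower case for packages

public class VelocityResponseWriter {                   // Upper Camel Case for classes
    public static final int MAX_PARAMETER_COUNT = 100;  // UPPER_CASE for constants

    private String studentName;                         // lower camel case for variables

    public void setStudentName(String studentName) {    // verb-based method name, camel case parameter
        this.studentName = studentName;
    }

    public boolean isStudentApproved() {                // boolean accessors read naturally
        return studentName != null;
    }
}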
March 10, 2012
by Cagdas Basaraner
· 153,206 Views · 5 Likes
Resetting the Database Connection in Django
Django handles database connections transparently in almost all cases. It will start a new connection when your request starts up, and commit it at the end of the request lifetime. Other times you need to dive in further and do your own granular transaction management. But for the most part, it's fully automatic. However, sometimes your use case may require that you close the current database connection and open a new one. While this is possible in Django, it's not well documented. Why would you want to do this? In my case, I was writing an automation test framework. Some of the automation tests make database calls through the Django ORM to set up records, clean up after the test, etc. Each test is executed in the same process space, via a thread pool. We found that if one of the early tests threw an unrecoverable database error, such as an IntegrityError due to violating a unique constraint, the database connection would be aborted. Subsequent tests that tried to use the database would raise a DatabaseError: Traceback (most recent call last): File /home/user/project/app/test.py, line 73, in tearDown MyModel.objects.all() File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 444, in delete collector.collect(del_query) File /usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py, line 146, in collect reverse_dependency=reverse_dependency) File /usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py, line 91, in add if not objs: File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 113, in __nonzero__ iter(self).next() File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 107, in _result_iter self._fill_cache() File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 772, in _fill_cache self._result_cache.append(self._iter.next()) File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 273, in iterator for row in compiler.results_iter(): File /usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py, line 680, in results_iter for rows in self.execute_sql(MULTI): File /usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py, line 735, in execute_sql cursor.execute(sql, params) File /usr/local/lib/python2.6/dist-packages/django/db/backends/postgresql_psycopg2/base.py, line 44, in execute return self.cursor.execute(query, args) DatabaseError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. It turns out that it's relatively easy to reset the database connection. We just called the following function at the start of every test. Django is smart enough to re-initialize the connection the next time it's used, assuming that it's disconnected properly. def reset_database_connection(): from django import db db.close_connection()
March 9, 2012
by Chase Seibert
· 8,939 Views
The Dark Side of Big Data: Pseudo-Science & Fooled By Randomness
Over the last couple of months I have read up on volumes of Technical Analysis (“TA”) information, and I have back tested probably hundreds of automated trading strategies against massive amounts of data, both exchange intraday- and tick data, as well as other sources. Some of these strategies have been massively profitable in back testing, others not so much. Some of the TA patterns I’ve discarded before they even left the book, because they did not stand up to any sort of scientific scrutiny: they lacked a clear predictive thesis, were riddled with forward-looking bias (“Head and Shoulders patterns”), and in some cases were just plain bulls**t (“Elliott Wave Principle” comes to mind). The outcomes of my testing have made me think about the implications of large scale data analysis in general: it is very easy to get fooled by randomness. In many cases in my testing, results have been amazing, but I cannot come up with a plausible causal explanation as to why, and when I gently nudge the parameters just ever so slightly, outcomes can look entirely different. Taking a step back from the data, looking at it in a larger perspective, I’m inclined to conclude that if data across multiple parameter variations looks like a random walk and lacks a plausible causal explanation, then it is a random walk. If I cannot say “X is caused by A and B”, I’m inclined to believe that the actual reason is “X is the result because A and B fit the historical data D, but may not do so in the future”. And herein lies the crux of the matter: how many data scientists are inclined to take a step back, rather than just assume that there is a pattern there? How many are prepared to do so if their livelihood is largely based on them finding patterns, rather than discarding them because they do not hold up to deeper scrutiny? I’d say very few. My conclusion to this is that the age of Big Data will see a radical increase of pseudo-scientific “discoveries”, driven out of an interest in announcing new great “patterns”. This pseudo-science will pervade academia, the public sector and the private sector; God knows I’ve seen a fair number of academic research papers already that simply do not hold if you investigate their thesis in a deeper manner. I suspect we will arrive at a point much like with any new technology whereby people will tire of the claims made by “Big Data Scientists”, because at least half of what they say will have been proven to be hokey and pseudo-science in the pursuit of being able to make even more outlandish claims in a game of one-upping the competition. Some of this will be driven by malice and self-interest, but I suspect in equal parts it will be driven by ignorance and perverted incentives putting blinders on people in the business.
March 9, 2012
by Wille Faler
· 13,662 Views
Connecting to Multiple Databases Using Hibernate
In a recent project, I had a requirement to connect to multiple databases using hibernate. As the tapestry-hibernate module does not provide out-of-the-box support, I thought of adding one. https://github.com/tawus/tapestry5 Now that the application is in production, I thought of writing a simple “How to”. I have cloned the latest stable (5.3.2) tapestry project at https://github.com/tawus/tapestry5 and have added multiple database support to it.

Single Database
It is almost fully compatible with the previous integration when using a single database, except for a few things: 1) HibernateConfigurer has changed public interface HibernateConfigurer { /** * Passed the configuration so as to make changes. */ void configure(Configuration configuration); /** * Factory Id for which this configurer is meant for */ Class getMarker(); /** * Entity package names * * @return */ String[] getPackageNames(); } 2) There is no HibernateEntityPackageManager, as the packages can be contributed by adding more HibernateConfigurers with the same Markers.

Multiple databases
For multiple databases, a marker has to be used for accessing Session or HibernateSessionManager @Inject @XDB private Session session; @Inject @YDB private HibernateSessionManager sessionManager; @XDB @CommitAfter void myMethod(){ } Also you have to define a HibernateSessionManager and a Session for the secondary database in the Module class. @Scope(ScopeConstants.PERTHREAD) @Marker(DatabaseTwo.class) public static HibernateSessionManager buildHibernateSessionManagerForFinacle( HibernateSessionSource sessionSource, PerthreadManager perthreadManager) { HibernateSessionManagerImpl service = new HibernateSessionManagerImpl(sessionSource, DatabaseTwo.class); perthreadManager.addThreadCleanupListener(service); return service; } @Marker(DatabaseTwo.class) public static Session buildSessionForFinacle( @Local HibernateSessionManager sessionManager, PropertyShadowBuilder propertyShadowBuilder) { return propertyShadowBuilder.build(sessionManager, "session", Session.class); } Notice the annotation marker DatabaseTwo.class. This is a Factory marker and is used to identify a service related to a particular SessionFactory.
@Retention(RetentionPolicy.RUNTIME) @Target( {ElementType.FIELD, ElementType.PARAMETER, ElementType.METHOD}) @FactoryMarker @Documented public @interface DatabaseTwo { } A typical AppModule for two databases will be public class AppModule { public static void bind(ServiceBinder binder) { binder.bind(DemoService.class, DemoServiceImpl.class); } @Contribute(HibernateSessionSource.class) public static void configureHibernateSources(OrderedConfiguration configurers) { configurers.add("databaseOne", new HibernateConfigurer() { public void configure(org.hibernate.cfg.Configuration configuration) { configuration.configure("/databaseOne.xml"); } public Class getMarker() { return DefaultFactory.class; } public String[] getPackageNames() { return new String[] {"org.example.demo.one"}; } }); configurers.add("databaseTwo", new HibernateConfigurer() { public void configure(org.hibernate.cfg.Configuration configuration) { configuration.configure("/databaseTwo.xml"); } public Class getMarker() { return DatabaseTwo.class; } public String[] getPackageNames() { return new String[] {"org.example.demo.two"}; } }); } @Contribute(SymbolProvider.class) @ApplicationDefaults public static void addSymbols(MappedConfiguration configuration) { configuration.add(HibernateSymbols.DEFAULT_CONFIGURATION, "false"); configuration.add("tapestry.app-package", "org.example.demo"); } @Scope(ScopeConstants.PERTHREAD) @Marker(DatabaseTwo.class) public static HibernateSessionManager buildHibernateSessionManagerForFinacle( HibernateSessionSource sessionSource, PerthreadManager perthreadManager) { HibernateSessionManagerImpl service = new HibernateSessionManagerImpl(sessionSource, DatabaseTwo.class); perthreadManager.addThreadCleanupListener(service); return service; } @Marker(DatabaseTwo.class) public static Session buildSessionForFinacle( @Local HibernateSessionManager sessionManager, PropertyShadowBuilder propertyShadowBuilder) { return propertyShadowBuilder.build(sessionManager, "session", Session.class); } } Injecting into Services You can inject a session in a service using the marker. As DatabaseOne is being used as the default configuration, in order to inject its Session, you have to annotate it with @DefaultFactory. For DatabaseTwo, you can use @DatabaseTwo annotation. public class DemoServiceImpl implements DemoService { private Session sessionOne; private Session sessionTwo; public DemoServiceImpl( @DefaultFactory Session sessionOne, @DatabaseTwo Session sessionTwo) { this.sessionOne = sessionOne; this.sessionTwo = sessionTwo; } @SuppressWarnings("unchecked") public List listOnes() { return sessionOne.createCriteria(EntityOne.class).list(); } @SuppressWarnings("unchecked") public List listTwos() { return sessionTwo.createCriteria(EntityTwo.class).list(); } public void save(EntityOne entityOne) { sessionOne.saveOrUpdate(entityOne); } public void save(EntityTwo entityTwo) { sessionTwo.saveOrUpdate(entityTwo); } } Using @CommitAfter You can add an advice the same way you used to. The only change is in @CommitAfter. You have to additionally annotate the method with the respective marker. public interface DemoService { List listOnes(); List listTwos(); @CommitAfter @DefaultFactory void save(EntityOne entityOne); @CommitAfter @DatabaseTwo void save(EntityTwo entityTwo); } Here is an example. From http://tawus.wordpress.com/2012/03/03/tapestry-hibernate-multiple-databases/
March 7, 2012
by Taha Siddiqi
· 99,962 Views · 3 Likes
Blobs and More: Storing Images and Files in IndexedDB
The desired future approach for storing things client-side in web browsers is utilizing IndexedDB. Here I'll walk you through how to store images and files in IndexedDB and then present them through an ObjectURL.

The general approach

First, let's talk about the steps we will go through to create an IndexedDB database, save a file into it, and then read it out and present it in the page:

• Create or open a database
• Create an objectStore (if it doesn't already exist)
• Retrieve an image file as a blob
• Initiate a database transaction
• Save that blob into the database
• Read out the saved file, create an ObjectURL from it and set it as the src of an image element in the page

Creating the code

Let's break down all the parts of the code we need to do this.

Create or open a database:

// IndexedDB
var indexedDB = window.indexedDB || window.webkitIndexedDB || window.mozIndexedDB || window.OIndexedDB || window.msIndexedDB,
    IDBTransaction = window.IDBTransaction || window.webkitIDBTransaction || window.OIDBTransaction || window.msIDBTransaction,
    dbVersion = 1;

// Create/open database
var request = indexedDB.open("elephantFiles", dbVersion);

request.onsuccess = function (event) {
    console.log("Success creating/accessing IndexedDB database");
    db = request.result;

    db.onerror = function (event) {
        console.log("Error creating/accessing IndexedDB database");
    };

    // Interim solution for Google Chrome to create an objectStore. Will be deprecated
    if (db.setVersion) {
        if (db.version != dbVersion) {
            var setVersion = db.setVersion(dbVersion);
            setVersion.onsuccess = function () {
                createObjectStore(db);
                getImageFile();
            };
        } else {
            getImageFile();
        }
    } else {
        getImageFile();
    }
};

// For future use. Currently only in latest Firefox versions
request.onupgradeneeded = function (event) {
    createObjectStore(event.target.result);
};

The intended way to use this is to have the onupgradeneeded event triggered when a database is created or gets a higher version number. This is currently only supported in Firefox, but will soon be in other web browsers. If the web browser doesn't support this event, you can use the deprecated setVersion method and connect to its onsuccess event.

Create an objectStore (if it doesn't already exist):

// Create an objectStore
console.log("Creating objectStore");
dataBase.createObjectStore("elephants");

Here you create the objectStore in which you will store your data (in our case, files). Once it has been created you don't need to recreate it, just update its contents.

Retrieve an image file as a blob:

// Create XHR
var xhr = new XMLHttpRequest(),
    blob;

xhr.open("GET", "elephant.png", true);
// Set the responseType to blob
xhr.responseType = "blob";

xhr.addEventListener("load", function () {
    if (xhr.status === 200) {
        console.log("Image retrieved");

        // File as response
        blob = xhr.response;

        // Put the received blob into IndexedDB
        putElephantInDb(blob);
    }
}, false);
// Send XHR
xhr.send();

This code gets the contents of a file as a blob directly. Currently that's only supported in Firefox. Once you have received the entire file, you send the blob to the function that stores it in the database.

Initiate a database transaction:

// Open a transaction to the database
var transaction = db.transaction(["elephants"], IDBTransaction.READ_WRITE);

To start writing something to the database, you need to initiate a transaction with an objectStore name and the type of action you want to do, in this case read and write.
Save that blob into the database:

// Put the blob into the database
transaction.objectStore("elephants").put(blob, "image");

Once the transaction is in place, you get a reference to the desired objectStore, then put your blob into it and give it a key.

Read out the saved file, create an ObjectURL from it and set it as the src of an image element in the page:

// Retrieve the file that was just stored
transaction.objectStore("elephants").get("image").onsuccess = function (event) {
    var imgFile = event.target.result;
    console.log("Got elephant!" + imgFile);

    // Get window.URL object
    var URL = window.URL || window.webkitURL;

    // Create ObjectURL
    var imgURL = URL.createObjectURL(imgFile);

    // Set img src to ObjectURL
    var imgElephant = document.getElementById("elephant");
    imgElephant.setAttribute("src", imgURL);

    // Revoke ObjectURL
    URL.revokeObjectURL(imgURL);
};

Use the same transaction to get the image file you just stored, then create an ObjectURL and set it as the src of an image in the page. This could just as well have been, for instance, a JavaScript file that you attached to a script element, which would then be parsed as JavaScript.

The complete code

So, here's the complete working code:

(function () {
    // IndexedDB
    var indexedDB = window.indexedDB || window.webkitIndexedDB || window.mozIndexedDB || window.OIndexedDB || window.msIndexedDB,
        IDBTransaction = window.IDBTransaction || window.webkitIDBTransaction || window.OIDBTransaction || window.msIDBTransaction,
        dbVersion = 1.0;

    // Create/open database
    var request = indexedDB.open("elephantFiles", dbVersion),
        db,
        createObjectStore = function (dataBase) {
            // Create an objectStore
            console.log("Creating objectStore");
            dataBase.createObjectStore("elephants");
        },

        getImageFile = function () {
            // Create XHR
            var xhr = new XMLHttpRequest(),
                blob;

            xhr.open("GET", "elephant.png", true);
            // Set the responseType to blob
            xhr.responseType = "blob";

            xhr.addEventListener("load", function () {
                if (xhr.status === 200) {
                    console.log("Image retrieved");

                    // Blob as response
                    blob = xhr.response;
                    console.log("Blob:" + blob);

                    // Put the received blob into IndexedDB
                    putElephantInDb(blob);
                }
            }, false);
            // Send XHR
            xhr.send();
        },

        putElephantInDb = function (blob) {
            console.log("Putting elephants in IndexedDB");

            // Open a transaction to the database
            var transaction = db.transaction(["elephants"], IDBTransaction.READ_WRITE);

            // Put the blob into the database
            var put = transaction.objectStore("elephants").put(blob, "image");

            // Retrieve the file that was just stored
            transaction.objectStore("elephants").get("image").onsuccess = function (event) {
                var imgFile = event.target.result;
                console.log("Got elephant!" + imgFile);

                // Get window.URL object
                var URL = window.URL || window.webkitURL;

                // Create ObjectURL
                var imgURL = URL.createObjectURL(imgFile);

                // Set img src to ObjectURL
                var imgElephant = document.getElementById("elephant");
                imgElephant.setAttribute("src", imgURL);

                // Revoke ObjectURL
                URL.revokeObjectURL(imgURL);
            };
        };

    request.onerror = function (event) {
        console.log("Error creating/accessing IndexedDB database");
    };

    request.onsuccess = function (event) {
        console.log("Success creating/accessing IndexedDB database");
        db = request.result;

        db.onerror = function (event) {
            console.log("Error creating/accessing IndexedDB database");
        };

        // Interim solution for Google Chrome to create an objectStore. Will be deprecated
        if (db.setVersion) {
            if (db.version != dbVersion) {
                var setVersion = db.setVersion(dbVersion);
                setVersion.onsuccess = function () {
                    createObjectStore(db);
                    getImageFile();
                };
            } else {
                getImageFile();
            }
        } else {
            getImageFile();
        }
    };

    // For future use. Currently only in latest Firefox versions
    request.onupgradeneeded = function (event) {
        createObjectStore(event.target.result);
    };
})();

Web browser support

IndexedDB: Supported for quite a while (a number of versions back) in Firefox and Google Chrome. Planned to be in IE10; unclear about Safari and Opera.

onupgradeneeded: Supported in the latest Firefox. Planned to be in Google Chrome soon, and hopefully IE10. Unclear about Safari and Opera.

Storing files in IndexedDB: Supported in Firefox 11 and later. Planned to be supported in Google Chrome, and hopefully IE10 will support it. Unclear about Safari and Opera.

XMLHttpRequest Level 2: Supported in Firefox and Google Chrome for quite a while, in Safari 5+, and planned to be in IE10 and Opera 12.

responseType "blob": Currently only supported in Firefox. Will soon be in Google Chrome and is planned to be in IE10. Unclear about Safari and Opera.

Demo and code

I've put together a demo of saving images and files in IndexedDB where you can see it all in action. Make sure to use a developer tool to inspect the image element and look at the value of its src attribute, and check the console.log messages to follow the actions. The code for storing files in IndexedDB is also available on GitHub, so go play now!
March 6, 2012
by Robert Nyman
· 22,039 Views
Solr Date Math, NOW and filter queries
Or "How to never re-use cached filter query results even though you meant to."

Filter queries ("fq" clauses) are a means to restrict the number of documents that are considered for scoring. A common use of "fq" clauses is to restrict the dates of documents returned, things like "in the last day", "in the last week", etc. You often find this pattern used in conjunction with faceting. Filter queries make use of a filterCache (see solrconfig.xml) to calculate the set of documents satisfying the query once and then re-use that result set. Often, using NOW in filter queries makes this caching useless. Here's why.

Solr maintains a filterCache, where it stores the results of "fq" clauses. You can think of it as a map, where the key is the "fq" clause and the value is the set of documents satisfying that clause. I'm going to skip the details of how the document set (the "value" in this map) is stored, since this post really concentrates on the key. So, let's say you have two filter queries (whether they're in the same query or not is irrelevant), something like "fq=category:books&fq=source:library". There will be two entries in the filterCache, something like:

category:books => 1, 2, 5, 89…
source:library => 7, 45, 101…

All well and good so far. I'll add one short diversion here, which bears on why it is often better to have several "fq" clauses than a single one. The same results could be obtained with "fq=category:books AND source:library", but then the filterCache would look like:

category:books AND source:library => 1, 2, 5, 7, 45, 89, 101…

and an fq like "fq=category:books" would NOT re-use the cache, since the key is different. But enough of a diversion…

OK, you mentioned dates. Get to the point.

It's common to have date ranges as filter queries, things like "in the last day", "in the last week", etc., and there's the convenient date math to make this easy. So it's tempting, very tempting, to have filter clauses on date ranges like "fq=date:[NOW-1DAY TO NOW]". Be careful when using NOW! Here's the problem: in the above example, date:[NOW-1DAY TO NOW] is not what's used as the key for the fq in the filterCache; the expansion is used as the key. This translates into a form like "date:[2012-01-20T08:56:23Z TO 2012-01-27T08:56:23Z]" as the key into the filterCache. Now the user adds a term to the "q" and re-submits the query 30 seconds after the first one. The fq clause now expands to something like "date:[2012-01-20T08:56:53Z TO 2012-01-27T08:56:53Z]" (note that the seconds went from 23 to 53!). The key for this fq does not match the key for the first, even though the intent is usually that submitting this kind of fq 30 seconds later should match the same set of documents. Bare NOW entries in filter clauses will pretty much guarantee that the cached result sets are never reused.

Fine. What do you do to make it better?

Here's where rounding makes sense. Using midnight can make sense from two perspectives. The sense you often want is "anything with a timestamp in a particular day" (or month, or year, or hour, or…). So just using NOW for the lower bound would miss anything published between midnight and whenever the user happens to submit the query on the day (in this example) of the lower bound. And re-using the filterCache can substantially speed up your queries, especially if you're providing links like "in the last day", "1-7 days ago", etc. So your fq clauses start to look like "fq=date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]".
The thing to note about the date math "/" operator is that it is a "round down" operator. So let's break this up a bit: NOW/DAY rounds down to midnight last night, and -7DAYS subtracts 7 whole days, so the lower limit is really "midnight 7 days ago". Similarly, NOW/DAY rounds down to midnight last night and +1DAY moves that to midnight tonight for the upper limit. This expression is invariant until after midnight tonight, so the clause returns the same results all day today; only the first submission of this fq incurs the cost of figuring out which documents satisfy it, and all queries after the first just read the cached result set from the filterCache. Of course the caches are invalidated if you update your index and/or a replication happens, but that's always the case.

You will note that there is a bit of "slop" here. If your index has dates in the future, you may get them too. Suppose you have a situation where your index contains documents you don't want the users to see until it's later than their timestamp. I actually have a hard time contriving an example here, but let's just assume it's the case. Also say it's noon and your index contains timestamps on documents through midnight tonight. The above technique will show documents that will officially be published at, say, 15:00 even though it's only 12:00, and you may not want that. In that case, you'll have to use a bare NOW clause and live with the fact that your cache isn't being used for these clauses. Like I said, this is contrived, but I mention it for completeness' sake.

A couple of notes about dates before I finish:

• Use the coarsest dates you can and still satisfy your use case. This is especially true if you're sorting by dates: the sorting resource requirements go up with the number of unique terms, so storing millisecond resolution when all you care about is the day can be wasteful. The same applies to faceting.
• It's often useful to index the same date data in multiple fields, especially if you intend to facet.
• The above examples in the 3.x code line have a slight problem when more than one adjoining range is required. The range operator "[]" is inclusive, so a document indexed at exactly midnight might be included in two ranges. Trunk Solr (4.0) allows mixing inclusive "[]" and exclusive "{}" endpoints, so expressions like "date:[NOW/DAY-1DAY TO NOW/DAY+1DAY}" are possible.

An exercise for the reader: what are the consequences of using different kinds of rounding? E.g. NOW/5MIN, NOW/72MIN (does this even work?).
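To make the caching point concrete, here is a minimal SolrJ sketch of my own (not from the original post) that issues the separate filter queries and the rounded date range discussed above. It assumes a SolrJ 4.x-style HttpSolrServer client, a core at http://localhost:8983/solr, and fields named category, source and date:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CacheFriendlyFilters {
    public static void main(String[] args) throws SolrServerException {
        // Assumed core URL; adjust for your installation.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        // Two separate fq clauses: each gets its own filterCache entry,
        // so "category:books" can be reused on its own by other queries.
        query.addFilterQuery("category:books");
        query.addFilterQuery("source:library");
        // Rounded date range: the expanded key is stable until midnight,
        // so repeated queries today hit the same cached document set.
        query.addFilterQuery("date:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]");

        QueryResponse response = solr.query(query);
        System.out.println("Matches: " + response.getResults().getNumFound());
    }
}

Because the date clause only changes after midnight, every query issued during the day reuses the same three filterCache entries.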
February 27, 2012
by Erick Erickson
· 52,015 Views
Configuration Drift
In my previous article on the server lifecycle I mentioned Configuration Drift, a term that I've either coined or forgotten where I originally heard it; most likely I picked it up from the Puppet Labs folks. Configuration drift is the phenomenon where running servers in an infrastructure become more and more different from one another as time goes on, due to manual ad-hoc changes and updates, and general entropy.

A nice automated server provisioning process, as I've advocated, helps ensure machines are consistent when they are created, but during a given machine's lifetime it will drift from the baseline, and from the other machines.

There are two main methods to combat configuration drift. One is to use automated configuration tools such as Puppet or Chef, and run them frequently and repeatedly to keep machines in line. The other is to rebuild machine instances frequently, so that they don't have much time to drift from the baseline.

The challenge with automated configuration tools is that they only manage a subset of a machine's state. Writing and maintaining manifests/recipes/scripts is time consuming, so most teams tend to focus their efforts on automating the most important areas of the system, leaving fairly large gaps. There are diminishing returns in trying to close these gaps: you end up spending inordinate amounts of effort nailing down parts of the system that don't change very often and don't matter very much day to day.

On the other hand, if you rebuild machines frequently enough, you don't need to worry about running configuration updates after provisioning happens. However, this may increase the burden of fairly trivial changes, such as tweaking a web server configuration.

In practice, most infrastructures are probably best off using a combination of these methods: use automated configuration, continuously updated, for the areas of machine configuration where it gives the most benefit, and also ensure that machines are rebuilt frequently. The frequency of rebuilds will vary depending on the nature of the services provided and the infrastructure implementation, and may even vary for different types of machines. For example, machines that provide network services such as DNS may be rebuilt weekly, while those which handle batch processing tasks may be rebuilt on demand.

Source: http://kief.com/configuration-drift.html
February 27, 2012
by Kief Morris
· 25,608 Views · 3 Likes
How to migrate databases between SQL Server and SQL Server Compact
In this post, I will try to give an overview of the free tools available to developers for moving databases from SQL Server to SQL Server Compact and vice versa. I will also show how you can do this with the SQL Server Compact Toolbox (add-in and standalone editions).

Moving databases from SQL Server Compact to SQL Server

This can be useful in situations where you have already developed an application that depends on SQL Server Compact and would like the increased power of SQL Server, or some feature that is not available in SQL Server Compact. I have an informal comparison of the two products here. Microsoft offers a GUI-based tool and a command line tool to do this: WebMatrix and MsDeploy. You can also use the ExportSqlCe command line tool or the SQL Server Compact Toolbox.

To use the ExportSqlCe (or ExportSqlCe40) command line tool, use a command similar to:

ExportSQLCE.exe "Data Source=D:\Northwind.sdf;" Northwind.sql

The resulting script file (Northwind.sql) can then be run against a SQL Server database, using for example the SQL Server sqlcmd command line utility:

sqlcmd -S mySQLServer -d NorthWindSQL -i C:\Northwind.sql

To use the SQL Server Compact Toolbox:

• Connect the Toolbox to the database file that you want to move to SQL Server.
• Right-click the database connection and select to script Schema and Data.
• Optionally, select which tables to script, and click OK.
• Enter the filename for the script; the default extension is .sqlce.
• Click OK to the confirmation message.

You can now open the generated script in Management Studio and execute it against a SQL Server database, or run it with sqlcmd as described above.

Moving databases from SQL Server to SQL Server Compact

Microsoft offers no tools for doing this "downsizing" of a SQL Server database to SQL Server Compact, and of course not all objects in a SQL Server database can be downsized: only tables exist in a SQL Server Compact database, so no stored procedures, views, triggers, users, schemas and so on. I have blogged about how this can be done from the command line, and you can also do it with the SQL Server Compact Toolbox (of course):

• From the root node, select Script Server Data and Schema.
• Follow a procedure like the one above, but connect to a SQL Server database instead.

The export process will convert the SQL Server data types to matching SQL Server Compact data types, for example varchar(50) becomes nvarchar(50) and so on. Any unsupported data types will be ignored; this includes for example computed columns and sql_variant. The new date types in SQL Server 2008+, like date, time and datetime2, will be converted to nvarchar-based data types, as only datetime is supported in SQL Server Compact. A full list of the SQL Server Compact data types is available here.
February 26, 2012
by Erik Ejlskov Jensen
· 16,671 Views