DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Data Topics

article thumbnail
A Gentle Introduction to Making HTML5 Canvas Interactive
this is an big overhaul of one of my tutorials on making and moving shapes on an html5 canvas. this new tutorial is vastly cleaner than my old one, but if you still want to see that one or are looking for the concept of a “ghost context” you can find that one here. this tutorial will show you how to create a simple data structure for shapes on an html5 canvas and how to have them be selectable. the finished canvas will look like this: this article’s code is written primarily to be easy to understand. we’ll be going over a few things that are essential to interactive apps such as games (drawing loop, hit testing), and in later tutorials i will probably turn this example into a small game of some kind. i will try to accommodate javascript beginners but this introduction does expect at least a rudimentary understanding of js. not every piece of code is explained in the text, but almost every piece of code is thoroughly commented! the html5 canvas a canvas is made by using the tag in html: this text is displayed if your browser does not support html5 canvas. a canvas isn’t smart: it’s just a place for drawing pixels. if you ask it to draw something it will execute the drawing command and then immediately forget everything about what it has just drawn. this is sometimes referred to as an immediate drawing surface, as contrasted with svg as a retained drawing surface, since svg keeps a reference to everything drawn. because we have no such references, we have to keep track ourselves of all the things we want to draw (and re-draw) each frame. canvas also has no built-in way of dealing with animation. if you want to make something you’ve drawn move, you have to clear the entire canvas and redraw all of the objects with one or more of them moved. and you have to do it often, of course, if you want a semblance of animation or motion. so we’ll need to add: code for keeping track of objects code for keeping track of canvas state code for mouse events code for drawing the objects as they are made and move around keeping track of what we draw to keep things simple for this example we will start with a shape class to represent rectangular objects. javascript doesn’t technically have classes, but that isn’t a problem because javascript programmers are very good at playing pretend. functionally (well, for our example) we are going to have a shape class and create shape instances with it. what we are really doing is defining a function named shape and adding functions to shape’s prototype. you can make new instances of the function shape and all instances will share the functions defined on shape’s prototype. if you’ve never encountered prototypes in javascript before or if the above sounds confusing to you, i highly recommend reading crockford’s javascript: the good parts . the book is an intermediate overview of javascript that gives a good understanding of why programmers choose to create objects in different ways, why certain conventions are frowned upon, and just what makes javascript so different. here’s our shape constructor and one of the two prototype methods, which are comparable to a class instance methods: // constructor for shape objects to hold data for all drawn objects. // for now they will just be defined as rectangles. function shape(x, y, w, h, fill) { // this is a very simple and unsafe constructor. // all we're doing is checking if the values exist. // "x || 0" just means "if there is a value for x, use that. otherwise use 0." this.x = x || 0; this.y = y || 0; this.w = w || 1; this.h = h || 1; this.fill = fill || '#aaaaaa'; } // draws this shape to a given context shape.prototype.draw = function(ctx) { ctx.fillstyle = this.fill; ctx.fillrect(this.x, this.y, this.w, this.h); } keeping track of canvas state we’re going to have a second class/function called canvasstate. we’re only going to make one instance of this class and it will hold all of the state in this tutorial that is not associated with shapes themselves. canvasstate is going to foremost need a reference to the canvas and a few other field for convenience. we’re also going to compute and save the border and padding (if there is any) so that we can get accurate mouse coordinates. in the canvasstate constructor we will also have a collection of state relating to the objects on the canvas and the current status of dragging. we’ll make an array of shapes, a flag “dragging” that will be true while we are dragging, a field to keep track of which object is selected and a “valid” flag that will be set to false will cause the canvas to clear everything and redraw. is going to need an array of shapes to keep track of whats been drawn so far. i’m going to add a bunch of variables for keeping track of the drawing and mouse state. i already added boxes[] to keep track of each object, but we’ll also need a var for the canvas, the canvas’ 2d context (where wall drawing is done), whether the mouse is dragging, width/height of the canvas, and so on. we’ll also want to make a second canvas, for selection purposes, but i’ll talk about that later. function canvasstate(canvas) { // ... // i removed some setup code to save space // see the full source at the end // **** keep track of state! **** this.valid = false; // when set to true, the canvas will redraw everything this.shapes = []; // the collection of things to be drawn this.dragging = false; // keep track of when we are dragging // the current selected object. // in the future we could turn this into an array for multiple selection this.selection = null; this.dragoffx = 0; // see mousedown and mousemove events for explanation this.dragoffy = 0; mouse events we’ll add events for mousedown, mouseup, and mousemove that will control when an object starts and stops dragging. we’ll also disable the selectstart event, which stops double-clicking on canvas from accidentally selecting text on the page. finally we’ll add a double-click event that will create a new shape and add it to the canvasstate’s list of shapes. the mousedown event begins by calling getmouse on our canvasstate to return the x and y position of the mouse. we then iterate through the list of shapes to see if any of them contain the mouse position. we go through them backwards because they are drawn forwards, and we want to select the one that appears topmost, so we must find the potential shape that was drawn last. if we find the shape we save the offset, save that shape as our selection, set dragging to true and set the valid flag to false. already we’ve used most of our state! finally if we didn’t find any objects we need to see if there was a selection saved from last time. if there is we should clear it. since we clicked on nothing, we obviously didn’t click on the already-selected object! clearing the selection means we will have to clear the canvas and redraw everything without the selection ring, so we set the valid flag to false. // ... // (we are still in the canvasstate constructor) // this is an example of a closure! // right here "this" means the canvasstate. but we are making events on the canvas itself, // and when the events are fired on the canvas the variable "this" is going to mean the canvas! // since we still want to use this particular canvasstate in the events we have to save a reference to it. // this is our reference! var mystate = this; //fixes a problem where double clicking causes text to get selected on the canvas canvas.addeventlistener('selectstart', function(e) { e.preventdefault(); return false; }, false); // up, down, and move are for dragging canvas.addeventlistener('mousedown', function(e) { var mouse = mystate.getmouse(e); var mx = mouse.x; var my = mouse.y; var shapes = mystate.shapes; var l = shapes.length; for (var i = l-1; i >= 0; i--) { if (shapes[i].contains(mx, my)) { var mysel = shapes[i]; // keep track of where in the object we clicked // so we can move it smoothly (see mousemove) mystate.dragoffx = mx - mysel.x; mystate.dragoffy = my - mysel.y; mystate.dragging = true; mystate.selection = mysel; mystate.valid = false; return; } } // havent returned means we have failed to select anything. // if there was an object selected, we deselect it if (mystate.selection) { mystate.selection = null; mystate.valid = false; // need to clear the old selection border } }, true); the mousemove event checks to see if we have set the dragging flag to true. if we have it gets the current mouse positon and moves the selected object to that position, remembering the offset of where we were grabbing it. if the dragging flag is false the mousemove event does nothing. canvas.addeventlistener('mousemove', function(e) { if (mystate.dragging){ var mouse = mystate.getmouse(e); // we don't want to drag the object by its top-left corner, // we want to drag from where we clicked. // thats why we saved the offset and use it here mystate.selection.x = mouse.x - mystate.dragoffx; mystate.selection.y = mouse.y - mystate.dragoffy; mystate.valid = false; // something's dragging so we must redraw } }, true); the mouseup event is simple, all it has to do is update the canvasstate so that we are no longer dragging! so once you lift the mouse, the mousemove event is back to doing nothing. canvas.addeventlistener('mouseup', function(e) { mystate.dragging = false; }, true); the double click event we’ll use to add more shapes to our canvas. it calls addshape on the canvasstate with a new instance of shape. all addshape does is add the argument to the list of shapes in the canvasstate. // double click for making new shapes canvas.addeventlistener('dblclick', function(e) { var mouse = mystate.getmouse(e); mystate.addshape(new shape(mouse.x - 10, mouse.y - 10, 20, 20, 'rgba(0,255,0,.6)')); }, true); there are a few options i implemented, what the selection ring looks like and how often we redraw. setinterval simply calls our canvasstate’s draw method. our interval of 30 means that we call the draw method every 30 milliseconds. // **** options! **** this.selectioncolor = '#cc0000'; this.selectionwidth = 2; this.interval = 30; setinterval(function() { mystate.draw(); }, mystate.interval); } drawing now we’re set up to draw every 30 milliseconds, which will allow us to continuously update the canvas so it appears like the shapes we drag are smoothly moving around. however, drawing doesn’t just mean drawing the shapes over and over; we also have to clear the canvas on every draw. if we don’t clear it, dragging will look like the shape is making a solid line because none of the old shape-positions will go away. because of this, we clear the entire canvas before each draw frame. this can get expensive, and we only want to draw if something has actually changed within our framework, which is why we have the “valid” flag in our canvasstate. after everything is drawn the draw method will set the valid flag to true. then, ocne we do something like adding a new shape or trying to drag a shape, the state will get invalidated and draw() will clear, redraw all objects, and set the valid flag again. // while draw is called as often as the interval variable demands, // it only ever does something if the canvas gets invalidated by our code canvasstate.prototype.draw = function() { // if our state is invalid, redraw and validate! if (!this.valid) { var ctx = this.ctx; var shapes = this.shapes; this.clear(); // ** add stuff you want drawn in the background all the time here ** // draw all shapes var l = shapes.length; for (var i = 0; i < l; i++) { var shape = shapes[i]; // we can skip the drawing of elements that have moved off the screen: if (shape.x > this.width || shape.y > this.height || shape.x + shape.w < 0 || shape.y + shape.h < 0) return; shapes[i].draw(ctx); } // draw selection // right now this is just a stroke along the edge of the selected shape if (this.selection != null) { ctx.strokestyle = this.selectioncolor; ctx.linewidth = this.selectionwidth; var mysel = this.selection; ctx.strokerect(mysel.x,mysel.y,mysel.w,mysel.h); } // ** add stuff you want drawn on top all the time here ** this.valid = true; } } we go through all of shapes[] and draw each one in order. this will give the nice appearance of later shapes looking as if they are on top of earlier shapes. after all the shapes are drawn, a selection handle (if there is a selection) gets drawn around the shape that this.selection references. if you wanted a background (like a city) or a foreground (like clouds), one way to add them is to put them before or after the main two drawing bits. there are often better ways though, like using multiple canvases or a css background-image, but we won’t go over that here. getting mouse coordinates on canvas getting good mouse coordinates is a little tricky on canvas. you could use offsetx/y and layerx/y, but layerx/y is deprecated in webkit (chrome and safari) and firefox does not have offsetx/y. the most bulletproof way to get the correct mouse position is shown below. you have to walk up the tree adding the offsets together. then you must add any padding or border to the offset. finally, to fix coordinate problems when you have fixed-position elements on the page (like the wordpress admin bar or a stumbleupon bar) you must add the ’s offsettop and offsetleft. then you simply subtract that offset from the e.pagex/y values and you’ll get perfect coordinates in almost every possible situation. // creates an object with x and y defined, // set to the mouse position relative to the state's canvas // if you wanna be super-correct this can be tricky, // we have to worry about padding and borders canvasstate.prototype.getmouse = function(e) { var element = this.canvas, offsetx = 0, offsety = 0, mx, my; // compute the total offset if (element.offsetparent !== undefined) { do { offsetx += element.offsetleft; offsety += element.offsettop; } while ((element = element.offsetparent)); } // add padding and border style widths to offset // also add the offsets in case there's a position:fixed bar offsetx += this.stylepaddingleft + this.styleborderleft + this.htmlleft; offsety += this.stylepaddingtop + this.stylebordertop + this.htmltop; mx = e.pagex - offsetx; my = e.pagey - offsety; // we return a simple javascript object (a hash) with x and y defined return {x: mx, y: my}; } there are a few little methods i added that are not shown, such as shape’s method to see if a point is inside its bounds. you can see and download the full demo source here . now that we have a basic structure down, it is easy to write code that handles more complex shapes, like paths or images or video. rotation and scaling these things takes a bit more work, but is quite doable with the canvas and our selection method is already set up to deal with them. if you would like to see this code enhanced in future posts (or have any fixes), let me know. source: http://simonsarris.com/blog/510-making-html5-canvas-useful
December 21, 2011
by Simon Sarris
· 11,275 Views · 1 Like
article thumbnail
Google Guava Cache
This Post is a continuation of my series on Google Guava, this time covering Guava Cache. Guava Cache offers more flexibility and power than either a HashMap or ConcurrentHashMap, but is not as heavy as using EHCache or Memcached (or robust for that matter, as Guava Cache operates solely in memory). The Cache interface has methods you would expect to see like ‘get’, and ‘invalidate’. A method you won’t find is ‘put’, because Guava Cache is ‘self-populating’, values that aren’t present when requested are fetched or calculated, then stored. This means a ‘get’ call will never return null. In all fairness, the previous statement is not %100 accurate. There is another method ‘asMap’ that exposes the entries in the cache as a thread safe map. Using ‘asMap’ will result in not having any of the self loading operations performed, so calls to ‘get’ will return null if the value is not present (What fun is that?). Although this is a post about Guava Cache, I am going to spend the bulk of the time talking about CacheLoader and CacheBuilder. CacheLoader specifies how to load values, and CacheBuilder is used to set the desired features and actually build the cache. CacheLoader CacheLoader is an abstract class that specifies how to calculate or load values, if not present. There are two ways to create an instance of a CacheLoader: Extend the CacheLoader class Use the static factory method CacheLoader.from If you extend CacheLoader you need to override the V load(K key) method, instructing how to generate the value for a given key. Using the static CacheLoader.from method you build a CacheLoader either by supplying a Function or Supplier interface. When supplying a Function object, the Function is applied to the key to calculate or retrieve the results. Using a Supplier interface the value is obtained independent of the key. CacheBuilder The CacheBuilder is used to construct cache instances. It uses the fluent style of building and gives you the option of setting the following properties on the cache: Cache Size limit (removals use a LRU algorithm) Wrapping keys in WeakReferences (Strong references used by default for keys) Wrapping values in either WeakReferences or SoftReferences (Strong references used by default) Time to expire entires after last access Time based expiration of entries after being written or updated Setting a RemovalListener that can recieve events once an entry is removed from the cache Concurrency Level of the cache (defaults to 4) The concurrency level option is used to partition the table internally such that updates can occur without contention. The ideal setting would be the maximum number of threads that could potentially access the cache at one time. Here is an example of a possible usage scenario for Guava Cache. public class PersonSearchServiceImpl implements SearchService> { public PersonSearchServiceImpl(SampleLuceneSearcher luceneSearcher, SampleDBService dbService) { this.luceneSearcher = luceneSearcher; this.dbService = dbService; buildCache(); } @Override public List search(String query) throws Exception { return cache.get(query); } private void buildCache() { cache = CacheBuilder.newBuilder().expireAfterWrite(10, TimeUnit.MINUTES) .maximumSize(1000) .build(new CacheLoader>() { @Override public List load(String queryKey) throws Exception { List ids = luceneSearcher.search(queryKey); return dbService.getPersonsById(ids); } }); } } In this example, I am setting the cache entries to expire after 10 minutes of being written or updated in the cache, with a maximum amount of 1,000 entires. Note the usage of CacheLoader on line 15. RemovalListener The RemovalListener will receive notification of an item being removed from the cache. These notifications could be from manual invalidations or from a automatic one due to time expiration or garbage collection. The RemovalListener parameters can be set to listen for specific type. To receive notifications for any key or value set them to use Object. It should be noted here that a RemovalListener will receive a RemovalNotification object that implements the Map.Entry interface. The key or value could be null if either has already been garbage collected. Also the key and value object will be strong references, regardless of the type of references used by the cache. CacheStats There is also a very useful class CacheStats that can be retrieved via a call to Cache.stats(). The CacheStats object can give insight into the effectiveness and performance of your cache by providing statistics such as: hit count miss count total load timme total requests CacheStats provides many other counts in addition to the ones listed above. Conclusion The Guava Cache presents some very compelling functionality. The decision to use a Guava Cache really comes down to the tradeoff between memory availability/usage versus increases in performance. I have added a unit test CacheTest demonstrating the usages discussed here. As alway comments and suggestions are welcomed. Thanks for your time. Resources Guava Project Home Cache API Source Code for blog series From http://codingjunkie.net/google-guava-cache/
December 16, 2011
by Bill Bejeck
· 52,872 Views · 1 Like
article thumbnail
How To Sort String Array Using LINQ In C#
A string[] array and use a LINQ query expression to order its contents alphabetically. Note that we are ordering the strings, not the letters in the strings. using System; using System.Linq; class Program { static void Main() { string[] a = new string[] {"Indonesian","Korean","Japanese","English","German"}; var sort = from s in a orderby s select s; foreach (string c in sort) { Console.WriteLine(c); } } } /*OUTPUT English German Indonesian Japanese Korean */ Eclipse IDE and Eclipse
December 14, 2011
by Snippets Manager
· 10,943 Views · 1 Like
article thumbnail
Eventual Consistency in NoSQL Databases: Theory and Practice
One of NoSQL's goals: handle previously-unthinkable amounts of data. One of unthinkable-amounts-of-data's problems: previously-improbable events become extremely probable, precisely because the set of interactions is so large. Flip a coin a hundred times, and you're not likely to get 50 heads in a row. But flip it a few trillion times, and you probably will find some 50-heads streaks. So NoSQL's performance strength is also its mathematical weakness. This order of scale can result in lots of problems, but one of the most common is consistency -- the C in ACID -- clearly a fundamental desideratum for any database system, but in principle much harder to acheive for NoSQL databases than for others. Emerging database technologies have forced developers and computer scientists to define more exactly what kind of consistency is really needed, for any given application. Two years ago, ACM (the Association for Computing Machinery) published an extremely helpful examination of the attenuated notion of consistency called 'eventual consistency'. Their summary: Data inconsistency in large-scale reliable distributed systems must be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running. The article surveys technical solutions as well as user considerations that might soften the undesirability of anything less than perfect, instantaneous consistency. It's not long (4 pages plus pictures), and explains some deep database issues quite clearly. On the more practical side of the problem: Russell Brown recently gave a talk at the NoSQL Exchange 2011 on exactly this topic. More specifically, he showed how some distributed systems (Riak in particular) try to minimize conflicts, and suggested some ways to reconcile conflicts automatically using smart semantic techniques. Check out the NoSQL Exchange page for Russell's talk here, which includes an embedded video. But read the ACM article first for a broader overview, since Russell launches into technical details pretty quickly.
November 22, 2011
by John Esposito
· 12,501 Views
article thumbnail
Tackling the Circular Dependency in Java...
Let me first define what we mean by circular dependency in OOAD terms vis-a-vis Java. Suppose we have a class called A which has class B’s Object. (in UML terms A HAS B). at the same time we class B is also composed of Object of class A (in UML terms B HAS A). obviously this represents circular dependency because while creating the object of A, the compiler must know the size of B... on the other hand while creating object of B, the compiler must know the size of A. this is something like egg vs. chicken problem... this may be possible in real life situation as well. for example suppose a multi storied building has a lift. so in the UML terms, the building HAS lift... but at the same time, suppose, while constructing the lift object, we need to give it the information about the building object to access various functionalities of the Building class... for example, suppose the speed of the lift is set depending on the number of floors of the Building... hence while constructing the Lift object it must access the functionalities of the Building object which will give the number of floors the building has got...hence in UML terms the lift HAS building... so this is a sure case of circular dependency... in real java code it will look something as follows: public class Building { private Lift lift; private int floor; public Building(){ lift = new Lift(); setFloor(15); } public int getFloor(){ return floor; } public void setFloor(int floor){ this.floor = floor; } }//end of class building //class Lift public class Lift { private Building building; private int Speed; public Lift(){ building = new Building(); setSpeed(); } public void setSpeed(){ if (building.getFloor()>20){ //one set of functionalities //may be the the speed of the lift will be more this.Speed = 10; } else { //different set of functionalities //may be the speed of the lift will be less this.Speed = 5; } } public int getSpeed(){ return Speed; } }//end of class Lift As it becomes clear from the above code, that while creating the Building object it will create the Lift object, and while creating the Lift object, it will try to create a Building object to access some of its functionalities... So, ultimately it will go out of memory and we get a StackOverflow runtime exception... So how do we handle this problem in Java? We actually tackle this problem by declaring an IBuildingProxy interface and by deriving our Building class from that... the lift class, instead of Having Building object, it Has IBuildingProxy... the source code of the solution looks like the following... public interface IBuildingProxy { int getFloor(); void setFloor(int floor); } public class Building implements IBuildingProxy{ private Lift lift; private int floor; public Building(){ lift = new Lift(this); setFloor(15); } public int getFloor(){ return floor; } public void setFloor(int floor){ this.floor = floor; } } public class Lift { private IBuildingProxy building; private int Speed; public Lift(Building b){ this.building = b; setSpeed(); } private void setSpeed(){ if (building.getFloor()>20){ / /one set of functionalities //may be the the speed of the lift will be more this.Speed = 10; } else { //different set of functionalities //may be the speed of the lift will be less this.Speed = 5; } } public int getSpeed(){ return Speed; } } public class CircularDependencyTest { public static void main(String[] args){ Building b = new Building(); Lift l = new Lift(b); } } So whats the principle behind such work around... It will be clear soon... As it becomes clear from the code that Building HAS Lift... That is not a problem... Now when it comes to solve the part that Lift HAS Building, instead of the Building object, we have created an IBuildingProxy interface and we pass it to the Lift class... what it essentially means, that the building class knows the memory requirement to initialize the Lift object, and as the Lift class HAS just a proxy interface of the Building, it does not have to care for the Building's memory requirement... and that solves the problem... Hope this discussion becomes helpful for Java learners...
November 17, 2011
by Somenath Mukhopadhyay
· 32,472 Views · 2 Likes
article thumbnail
Recommendation Engine Models
In a classical model of recommendation system, there are "users" and "items". User has associated metadata (or content) such as age, gender, race and other demographic information. Items also has its metadata such as text description, price, weight ... etc. On top of that, there are interaction (or transaction) between user and items, such as userA download/purchase movieB, userX give a rating 5 to productY ... etc. Now given all the metadata of user and item, as well as their interaction over time, can we answer the following questions ... What is the probability that userX purchase itemY ? What rating will userX give to itemY ? What is the top k unseen items that should be recommended to userX ? Content-based Approach In this approach, we make use of the metadata to categorize user and item and then match them at the category level. One example is to recommend jobs to candidates, we can do a IR/text search to match the user's resume with the job descriptions. Another example is to recommend an item that is "similar" to the one that the user has purchased. Similarity is measured according to the item's metadata and various distance function can be used. The goal is to find k nearest neighbors of the item we know the user likes. Collaborative Filtering Approach In this approach, we look purely at the interactions between user and item, and use that to perform our recommendation. The interaction data can be represented as a matrix. Notice that each cell represents the interaction between user and item. For example, the cell can contain the rating that user gives to the item (in the case the cell is a numeric value), or the cell can be just a binary value indicating whether the interaction between user and item has happened. (e.g. a "1" if userX has purchased itemY, and "0" otherwise. The matrix is also extremely sparse, meaning that most of the cells are unfilled. We need to be careful about how we treat these unfilled cells, there are 2 common ways ... Treat these unknown cells as "0". Make them equivalent to user giving a rate "0". This may or may not be a good idea depends on your application scenarios. Guess what the missing value should be. For example, to guess what userX will rate itemA given we know his has rate on itemB, we can look at all users (or those who is in the same age group of userX) who has rate both itemA and itemB, then compute an average rating from them. Use the average rating of itemA and itemB to interpolate userX's rating on itemA given his rating on itemB. User-based Collaboration Filter In this model, we do the following Find a group of users that is “similar” to user X Find all movies liked by this group that hasn’t been seen by user X Rank these movies and recommend to user X This introduces the concept of user-to-user similarity, which is basically the similarity between 2 row vectors of the user/item matrix. To compute the K nearest neighbor of a particular users. A naive implementation is to compute the "similarity" for all other users and pick the top K. Different similarity functions can be used. Jaccard distance function is defined as the number of intersections of movies that both users has seen divided by the number of union of movies they both seen. Pearson similarity is first normalizing the user's rating and then compute the cosine distance. There are two problems with this approach Compare userX and userY is expensive as they have millions of attributes Find top k similar users to userX require computing all pairs of userX and userY Location Sensitive Hashing and Minhash To resolve problem 1, we approximate the similarity using a cheap estimation function, called minhash. The idea is to find a hash function h() such that the probability of h(userX) = h(userY) is proportion to the similarity of userX and userY. And if we can find 100 of h() function, we can just count the number of such function where h(userX) = h(userY) to determine how similar userX is to userY. The idea is depicted as follows ... It will be expensive to permute the rows if the number of rows is large. Remember that the purpose of h(c1) is to return row number of the first row that is 1. So we can scan each row of c1 to see if it is 1, if so we apply a function newRowNum = hash(rowNum) to simulate a permutation. Take the minimum of the newRowNum seen so far. As an optimization, instead of doing one column at a time, we can do it a row at the time, the algorithm is as follows To solve problem 2, we need to avoid computing all other users' similarity to userX. The idea is to hash users into buckets such that similar users will be fall into the same bucket. Therefore, instead of computing all users, we only compute the similarity of those users who is in the same bucket of userX. The idea is to horizontally partition the column into b bands, each with r rows. By pick the parameter b and r, we can control the likelihood (function of similarity) that they will fall into the same bucket in at least one band. Item-based Collaboration Filter If we transpose the user/item matrix and do the same thing, we can compute the item to item similarity. In this model, we do the following ... Find the set of movies that user X likes (from interaction data) Find a group of movies that is similar to these set of movies that we know user X likes Rank these movies and recommend to user X It turns out that computing item-based collaboration filter has more benefit than computing user to user similarity for the following reasons ... Number of items typically smaller than number of users While user's taste will change over time and hence the similarity matrix need to be updated more frequent, item to item similarity tends to be more stable and requires less update. Singular Value Decomposition If we look back at the matrix, we can see the matrix multiplication is equivalent to mapping an item from the item space to the user space. In other words, if we view each of the existing item as an axis in the user space (notice, each user is a vector of their rating on existing items), then multiplying a new item with the matrix gives the same vector like the user. So we can then compute a dot product with this projected new item with user to determine its similarity. It turns out that this is equivalent to map the user to the item space and compute a dot product there. In other words, multiply the matrix is equivalent to mapping between item space and user space. Now lets imagine there is a hidden concept space in between. Instead of jumping directly from user space to item space, we can think of jumping from user space to a concept space, and then to the item space. Notice that here we first map the user space to the concept space and also map the item space to the concept space. Then we match both user and item at the concept space. This is a generalization of our recommender. We can use SVD to factor the matrix into 2 parts. Let P be the m by n matrix (m rows and n columns). P = UDV where U is an m by m matrix, each column represents the eigenvectors of P*transpose(P). And V is an n by n matrix with each row represents the eigenvector of transpose(P)*P. D is a diagonal matrix containing eigenvalues of P*transpose(P), or transpose(P)*P. In other words, we can decompose P into U*squareroot(D) and squareroot(D)*V. Notice that D can be thought as the strength of each "concept" in the concept space. And the value is order in terms of their magnitude in decreasing order. If we remove some of the weakest concept by making them zero, we reduce the number of non-zero elements in D, which effective generalize the concept space (make them focus in the important concepts). Calculate SVD decomposition for matrix with large dimensions is expensive. Fortunately, if our goal is to compute an SVD approximation (with k diagonal non-zero value), we can use the random projection mechanism as describer here. Association Rule Based In this model, we use the market/basket association rule algorithm to discover rule like ... {item1, item2} => {item3, item4, item5} We represent each user as a basket and each viewing as an item (notice that we ignore the rating and use a binary value). After that we use association rule mining algorithm to detect frequent item set and the association rules. Then for each user, we match the user's previous viewing items to the set of rules to determine what other movies should we recommend. Evaluate the recommender After we have a recommender, how do we evaluate the performance of it ? The basic idea is to use separate the data into the training set and the test set. For the test set, we remove certain user-to-movies interaction (change certain cells from 1 to 0) and pretending the user hasn't seen the item. Then we use the training set to train a recommender and then fit the test set (with removed interaction) to the recommender. The performance is measured by how much overlap between the recommended items with the one that we have removed. In other words, a good recommender should be able to recover the set of items that we have removed from the test set. Leverage tagging information on items In some cases, items has explicit tags associated with them (we can considered the tags is a user-annotated concept space added to the items). Consider each item is described with a vector of tags. Now user can also be auto-tagged based on the items they have interacted. For example, if userX purchase itemY which is tagged with Z1, and Z2. Then user will increase her tag Z1 and Z2 in her existing tag vector. We can use a time decay mechanism to update the user's tag vector as follows ... current_user_tag = alpha * item_tag + (1 - alpha) * prev_user_tag To recommend an item to the user, we simply need to calculate the top k items by computing the dot product (ie: cosine distance) of the user tag vector and the item tag vector. Source: http://horicky.blogspot.com/2011/09/recommendation-engine.html
November 2, 2011
by Ricky Ho
· 26,760 Views · 2 Likes
article thumbnail
Smart Batching
How often have we all heard that “batching” will increase latency? As someone with a passion for low-latency systems this surprises me. In my experience when batching is done correctly, not only does it increase throughput, it can also reduce average latency and keep it consistent. Well then, how can batching magically reduce latency? It comes down to what algorithm and data structures are employed. In a distributed environment we are often having to batch up messages/events into network packets to achieve greater throughput. We also employ similar techniques in buffering writes to storage to reduce the number of IOPS. That storage could be a block device backed file-system or a relational database. Most IO devices can only handle a modest number of IO operations per second, so it is best to fill those operations efficiently. Many approaches to batching involve waiting for a timeout to occur and this will by its very nature increase latency. The batch can also get filled before the timeout occurs making the latency even more unpredictable. Figure 1. Figure 1. above depicts decoupling the access to an IO device, and therefore the contention for access to it, by introducing a queue like structure to stage the messages/events to be sent and a thread doing the batching for writing to the device. The Algorithm An approach to batching uses the following algorithm in Java pseudo code: public final class NetworkBatcher implements Runnable { private final NetworkFacade network; private final Queue queue; private final ByteBuffer buffer; public NetworkBatcher(final NetworkFacade network, final int maxPacketSize, final Queue queue) { this.network = network; buffer = ByteBuffer.allocate(maxPacketSize); this.queue = queue; } @Override public void run() { while (!Thread.currentThread().isInterrupted()) { while (null == queue.peek()) { employWaitStrategy(); // block, spin, yield, etc. } Message msg; while (null != (msg = queue.poll())) { if (msg.size() > buffer.remaining()) { sendBuffer(); } buffer.put(msg.getBytes()); } sendBuffer(); } } private void sendBuffer() { buffer.flip(); network.send(buffer); buffer.clear(); } } Basically, wait for data to become available and as soon as it is, send it right away. While sending a previous message or waiting on new messages, a burst of traffic may arrive which can all be sent in a batch, up to the size of the buffer, to the underlying resource. This approach can use ConcurrentLinkedQueue which provides low-latency and avoid locks. However it has an issue in not creating back pressure to stall producing/publishing threads if they are outpacing the batcher whereby the queue could grow out of control because it is unbounded. I’ve often had to wrap ConcurrentLinkedQueue to track its size and thus create back pressure. This size tracking can add 50% to the processing cost of using this queue in my experience. This algorithm respects the single writer principle and can often be employed when writing to a network or storage device, and thus avoid lock contention in third party API libraries. By avoiding the contention we avoid the J-Curve latency profile normally associated with contention on resources, due to the queuing effect on locks. With this algorithm, as load increases, latency stays constant until the underlying device is saturated with traffic resulting in a more "bathtub" profile than the J-Curve. Let’s take a worked example of handling 10 messages that arrive as a burst of traffic. In most systems traffic comes in bursts and is seldom uniformly spaced out in time. One approach will assume no batching and the threads write to device API directly as in Figure 1. above. The other will use a lock free data structure to collect the messages plus a single thread consuming messages in a loop as per the algorithm above. For the example let’s assume it takes 100µs to write a single buffer to the network device as a synchronous operation and have it acknowledged. The buffer will ideally be less than the MTU of the network in size when latency is critical. Many network sub-systems are asynchronous and support pipelining but we will make the above assumption to clarify the example. If the network operation is using a protocol like HTTP under REST or Web Services then this assumption matches the underlying implementation. Best (µs) Average (µs) Worst (µs) Packets Sent Serial 100 500 1,000 10 Smart Batching 100 150 200 1-2 The absolute lowest latency will be achieved if a message is sent from the thread originating the data directly to the resource, if the resource is un-contended. The table above shows what happens when contention occurs and a queuing effect kicks in. With the serial approach 10 individual packets will have to be sent and these typically need to queue on a lock managing access to the resource, therefore they get processed sequentially. The above figures assume the locking strategy works perfectly with no perceivable overhead which is unlikely in a real application. For the batching solution it is likely all 10 packets will be picked up in first batch if the concurrent queue is efficient, thus giving the best case latency scenario. In the worst case only one message is sent in the first batch with the other nine following in the next. Therefore in the worst case scenario one message has a latency of 100µs and the following 9 have a latency of 200µs thus giving a worst case average of 190µs which is significantly better than the serial approach. This is one good example when the simplest solution is just a bit too simple because of the contention. The batching solution helps achieve consistent low-latency under burst conditions and is best for throughput. It also has a nice effect across the network on the receiving end in that the receiver has to process fewer packets and therefore makes the communication more efficient both ends. Most hardware handles data in buffers up to a fixed size for efficiency. For a storage device this will typically be a 4KB block. For networks this will be the MTU and is typically 1500 bytes for Ethernet. When batching, it is best to understand the underlying hardware and write batches down in ideal buffer size to be optimally efficient. However keep in mind that some devices need to envelope the data, e.g. the Ethernet and IP headers for network packets so the buffer needs to allow for this. There will always be an increased latency from a thread switch and the cost of exchange via the data structure. However there are a number of very good non-blocking structures available using lock-free techniques. For the Disruptor this type of exchange can be achieved in as little as 50-100ns thus making the choice of taking the smart batching approach a no brainer for low-latency or high-throughput distributed systems. This technique can be employed for many problems and not just IO. The core of the Disruptor uses this technique to help rebalance the system when the publishers burst and outpace the EventProcessors. The algorithm can be seen inside the BatchEventProcessor. Note: For this algorithm to work the queueing structure must handle the contention better than the underlying resource. Many queue implementations are extremely poor at managing contention. Use science and measure before coming to a conclusion. Batching with the Disruptor The code below shows the same algorithm in action using the Disruptor's EventHandler mechanism. In my experience, this is a very effective technique for handling any IO device efficiently and keeping latency low when dealing with load or burst traffic. public final class NetworkBatchHandler implements EventHander { private final NetworkFacade network; private final ByteBuffer buffer; public NetworkBatchHandler(final NetworkFacade network, final int maxPacketSize) { this.network = network; buffer = ByteBuffer.allocate(maxPacketSize); } public void onEvent(Message msg, long sequence, boolean endOfBatch) throws Exception { if (msg.size() > buffer.remaining()) { sendBuffer(); } buffer.put(msg.getBytes()); if (endOfBatch) { sendBuffer(); } } private void sendBuffer() { buffer.flip(); network.send(buffer); buffer.clear(); } } The endOfBatch parameter greatly simplifies the handling of the batch compared to the double loop in the algorithm above. I have simplified the examples to illustrate the algorithm. Clearly error handling and other edge conditions need to be considered. Separation of IO from Work Processing There is another very good reason to separate the IO from the threads doing the work processing. Handing off the IO to another thread means the worker thread, or threads, can continue processing without blocking in a nice cache friendly manner. I've found this to be critical in achieving high-performance throughput. If the underlying IO device or resource becomes briefly saturated then the messages can be queued for the batcher thread allowing the work processing threads to continue. The batching thread then feeds the messages to the IO device in the most efficient way possible allowing the data structure to handle the burst and if full apply the necessary back pressure, thus providing a good separation of concerns in the workflow. Conclusion So there you have it. Smart Batching can be employed in concert with the appropriate data structures to achieve consistent low-latency and maximum throughput. From http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html
October 26, 2011
by Martin Thompson
· 11,384 Views · 1 Like
article thumbnail
Magic: The Gathering in JavaScript and HTML5
as a user interface fan, i could not miss the development with html 5. so the goal of this post is to walk through a graphic application that uses javascript and html 5. we will see through examples one way (among others) to develop this kind of project. application overview tools the html 5 page data gathering cards loading & cache handling cards display mouse management state storage animations handling multi-devices conclusion to go further application overview we will produce an application that will let us display a magic the gathering ©(courtesy of www.wizards.com/magic ) cards collection. users will be able to scroll and zoom using the mouse (like bing maps, for example). you can see the final result here: http://bolaslenses.catuhe.com the project source files can be downloaded here: http://www.catuhe.com/msdn/bolaslenses.zip cards are stored on windows azure storage and use the azure content distribution network ( cdn : a service that deploys data near the final users) in order to achieve maximum performances. an asp.net service is used to return cards list (using json format). tools to write our application, we will use visual studio 2010 sp1 with web standards update . this extension adds intellisense support in html 5 page (which is a really important thing ). so, our solution will contain an html 5 page side by side with .js files (these files will contain javascript scripts). about debug, it is possible to set a breakpoint directly in the .js files under visual studio. it is also possible to use the developer bar of internet explorer 9 (use f12 key to display it). debug with visual studio 2010 debug with internet explorer 9 (f12/developer bar) so, we have a modern developer environment with intellisense and debug support. therefore, we are ready to start and first of all, we will write the html 5 page. the html 5 page our page will be built around an html 5 canvas which will be used to draw the cards: cards scanned by mwshq team magic the gathering official site : http://www.wizards.com/magic bolas lenses your browser does not support html5 canvas. loading data... if we dissect this page, we can note that it is divided into two parts: the header part with the title, the logo and the special mentions the main part (section) holds the canvas and the tooltips that will display the status of the application. there is also a hidden image ( backimage ) used as source for not yet loaded cards. to build the layout of the page, a style sheet ( full.css ) is applied. style sheets are a mechanism used to change the tags styles (in html, a style defines the entire display options for a tag): html, body { height: 100%; } body { background-color: #888888; font-size: .85em; font-family: "segoe ui, trebuchet ms" , verdana, helvetica, sans-serif; margin: 0; padding: 0; color: #696969; } a:link { color: #034af3; text-decoration: underline; } a:visited { color: #505abc; } a:hover { color: #1d60ff; text-decoration: none; } a:active { color: #12eb87; } header, footer, nav, section { display: block; } table { width: 100%; } header, #header { position: relative; margin-bottom: 0px; color: #000; padding: 0; } #title { font-weight: bold; color: #fff; border: none; font-size: 60px !important; vertical-align: middle; margin-left: 70px } #legal { text-align: right; color: white; font-size: 14px; width: 50%; position: absolute; top: 15px; right: 10px } #leftheader { width: 50%; vertical-align: middle; } section { margin: 20px 20px 20px 20px; } #maincanvas{ border: 4px solid #000000; } #cardscount { font-weight: bolder; font-size: 1.1em; } .tooltip { position: absolute; bottom: 5px; color: black; background-color: white; margin-right: auto; margin-left: auto; left: 35%; right: 35%; padding: 5px; width: 30%; text-align: center; border-radius: 10px; -webkit-border-radius: 10px; -moz-border-radius: 10px; box-shadow: 2px 2px 2px #333333; } #bolaslogo { width: 64px; height: 64px; } #picturecell { float: left; width: 64px; margin: 5px 5px 5px 5px; vertical-align: middle; } thus, this sheet is responsible for setting up the following display: style sheets are powerful tools that allow an infinite number of displays. however, they are sometimes complicated to setup (for example if a tag is affected by a class, an identifier and its container). to simplify this setup, the development bar of internet explorer 9 is particularly useful because we can use it to see styles hierarchy that is applied to a tag. for example let’s take a look at the waittext tooltip with the development bar. to do this, you must press f12 in internet explorer and use the selector to choose the tooltip: once the selection is done, we can see the styles hierarchy: thus, we can see that our div received its styles from the body tag and the . tooltip entry of the style sheet. with this tool, it becomes possible to see the effect of each style (which can be disabled). it is also possible to add new style on the fly. another important point of this window is the ability to change the rendering mode of internet explorer 9. indeed, we can test how, for example, internet explorer 8 will handle the same page. to do this, go to the [ browser mode ] menu and select the engine of internet explorer 8. this change will especially impact our tooltip as it uses border-radius (rounded edge) and box-shadow that are features of css 3: internet explorer 9 internet explorer 8 our page provides a graceful degradation as it still works (with no annoying visual difference) when the browser does not support all the required technologies. now that our interface is ready, we will take a look at the data source to retrieve the cards to display. the server provides the cards list using json format on this url: http://bolaslenses.catuhe.com/ home/listofcards/?colorstring=0 it takes one parameter ( colorstring ) to select a specific color (0 = all). when developing with javascript, there is a good reflex to have (reflex also good in other languages too, but really important in javascript): one must ask whether what we want to develop has not been already done in an existing framework. indeed, there is a multitude of open source projects around javascript. one of them is jquery which provides a plethora of convenient services. thus, in our case to connect to the url of our server and get the cards list, we could go through a xmlhttprequest and have fun to parse the returned json. or we can use jquery . so we will use the getjson function which will take care of everything for us: function getlistofcards() { var url = "http://bolaslenses.catuhe.com/home/listofcards/?jsoncallback=?"; $.getjson(url, { colorstring: "0" }, function (data) { listofcards = data; $("#cardscount").text(listofcards.length + " cards displayed"); $("#waittext").slidetoggle("fast"); }); } as we can see, our function stores the cards list in the listofcards variable and calls two jquery functions: text that change the text of a tag slidetoggle that hides (or shows) a tag by animating its height the listofcards list contains objects whose format is: id : unique identifier of the card path : relative path of the card (without the extension) it should be noted that the url of the server is called with the “ ?jsoncallback=? ” suffix. indeed, ajax calls are constrained in terms of security to connect only to the same address as the calling script. however, there is a solution called jsonp that will allow us to make a concerted call to the server (which of course must be aware of the operation). and fortunately, jquery can handle it all alone by just adding the right suffix. once we have our cards list, we can set up the pictures loading and caching. cards loading & cache handling the main trick of our application is to draw only the cards effectively visible on the screen. the display window is defined by a zoom level and an offset (x, y) in the overall system. var visucontrol = { zoom : 0.25, offsetx : 0, offsety : 0 }; the overall system is defined by 14819 cards that are spread over 200 columns and 75 rows. also, we must be aware that each card is available in three versions: high definition: 480x680 without compression (.jpg suffix) medium definition: 240x340 with standard compression (.50.jpg suffix) low definition: 120x170 with strong compression (.25.jpg suffix) thus, depending on the zoom level, we will load the correct version to optimize networks transfer. to do this we will develop a function that will give an image for a given card. this function will be configured to download a certain level of quality. in addition it will be linked with lower quality level to return it if the card for the current level is not yet uploaded: function imagecache(substr, replacementcache) { var extension = substr; var backimage = document.getelementbyid("backimage"); this.load = function (card) { var localcache = this; if (this[card.id] != undefined) return; var img = new image(); localcache[card.id] = { image: img, isloaded: false }; currentdownloads++; img.onload = function () { localcache[card.id].isloaded = true; currentdownloads--; }; img.onerror = function() { currentdownloads--; }; img.src = "http://az30809.vo.msecnd.net/" + card.path + extension; }; this.getreplacementfromlowercache = function (card) { if (replacementcache == undefined) return backimage; return replacementcache.getimageforcard(card); }; this.getimageforcard = function(card) { var img; if (this[card.id] == undefined) { this.load(card); img = this.getreplacementfromlowercache(card); } else { if (this[card.id].isloaded) img = this[card.id].image; else img = this.getreplacementfromlowercache(card); } return img; }; } an imagecache is built by giving the associated suffix and the underlying cache. here you can see two important functions: load : this function will load the right picture and will store it in a cache (the msecnd.net url is the azure cdn address of the cards) getimageforcard : this function returns the card picture from the cache if already loaded. otherwise it requests the underlying cache to return its version (and so on) so to handle our 3 levels of caches, we have to declare three variables: var imagescache25 = new imagecache(".25.jpg"); var imagescache50 = new imagecache(".50.jpg", imagescache25); var imagescachefull = new imagecache(".jpg", imagescache50); selecting the right cover is only depending on zoom: function getcorrectimagecache() { if (visucontrol.zoom <= 0.25) return imagescache25; if (visucontrol.zoom <= 0.8) return imagescache50; return imagescachefull; } to give a feedback to the user, we will add a timer that will manage a tooltip that indicates the number of images currently loaded: function updatestats() { var stats = $("#stats"); stats.html(currentdownloads + " card(s) currently downloaded."); if (currentdownloads == 0 && statsvisible) { statsvisible = false; stats.slidetoggle("fast"); } else if (currentdownloads > 1 && !statsvisible) { statsvisible = true; stats.slidetoggle("fast"); } } setinterval(updatestats, 200); again we note the use of jquery to simplify animations. we will now discuss the display of cards. cards display to draw our cards, we need to actually fill the canvas using its 2d context (which exists only if the browser supports html 5 canvas): var maincanvas = document.getelementbyid("maincanvas"); var drawingcontext = maincanvas.getcontext('2d'); the drawing will be made by processlistofcards function (called 60 times per second): function processlistofcards() { if (listofcards == undefined) { drawwaitmessage(); return; } maincanvas.width = document.getelementbyid("center").clientwidth; maincanvas.height = document.getelementbyid("center").clientheight; totalcards = listofcards.length; var localcardwidth = cardwidth * visucontrol.zoom; var localcardheight = cardheight * visucontrol.zoom; var effectivetotalcardsinwidth = colscount * localcardwidth; var rowscount = math.ceil(totalcards / colscount); var effectivetotalcardsinheight = rowscount * localcardheight; initialx = (maincanvas.width - effectivetotalcardsinwidth) / 2.0 - localcardwidth / 2.0; initialy = (maincanvas.height - effectivetotalcardsinheight) / 2.0 - localcardheight / 2.0; // clear clearcanvas(); // computing of the viewing area var initialoffsetx = initialx + visucontrol.offsetx * visucontrol.zoom; var initialoffsety = initialy + visucontrol.offsety * visucontrol.zoom; var startx = math.max(math.floor(-initialoffsetx / localcardwidth) - 1, 0); var starty = math.max(math.floor(-initialoffsety / localcardheight) - 1, 0); var endx = math.min(startx + math.floor((maincanvas.width - initialoffsetx - startx * localcardwidth) / localcardwidth) + 1, colscount); var endy = math.min(starty + math.floor((maincanvas.height - initialoffsety - starty * localcardheight) / localcardheight) + 1, rowscount); // getting current cache var imagecache = getcorrectimagecache(); // render for (var y = starty; y < endy; y++) { for (var x = startx; x < endx; x++) { var localx = x * localcardwidth + initialoffsetx; var localy = y * localcardheight + initialoffsety; // clip if (localx > maincanvas.width) continue; if (localy > maincanvas.height) continue; if (localx + localcardwidth < 0) continue; if (localy + localcardheight < 0) continue; var card = listofcards[x + y * colscount]; if (card == undefined) continue; // get from cache var img = imagecache.getimageforcard(card); // render try { if (img != undefined) drawingcontext.drawimage(img, localx, localy, localcardwidth, localcardheight); } catch (e) { $.grep(listofcards, function (item) { return item.image != img; }); } } }; // scroll bars drawscrollbars(effectivetotalcardsinwidth, effectivetotalcardsinheight, initialoffsetx, initialoffsety); // fps computefps(); } this function is built around many key points: if the cards list is not yet loaded, we display a tooltip indicating that download is in progress: var pointcount = 0; function drawwaitmessage() { pointcount++; if (pointcount > 200) pointcount = 0; var points = ""; for (var index = 0; index < pointcount / 10; index++) points += "."; $("#waittext").html("loading...please wait" + points); subsequently, we define the position of the display window (in terms of cards and coordinates), then we proceed to clean the canvas: function clearcanvas() { maincanvas.width = document.body.clientwidth - 50; maincanvas.height = document.body.clientheight - 140; drawingcontext.fillstyle = "rgb(0, 0, 0)"; drawingcontext.fillrect(0, 0, maincanvas.width, maincanvas.height); } then we browse the cards list and call the drawimage function of the canvas context. the current image is provided by the active cache (depending on the zoom): // get from cache var img = imagecache.getimageforcard(card); // render try { if (img != undefined) drawingcontext.drawimage(img, localx, localy, localcardwidth, localcardheight); } catch (e) { $.grep(listofcards, function (item) { return item.image != img; }); we also have to draw the scroll bar with the roundedrectangle function that uses quadratic curves: function roundedrectangle(x, y, width, height, radius) { drawingcontext.beginpath(); drawingcontext.moveto(x + radius, y); drawingcontext.lineto(x + width - radius, y); drawingcontext.quadraticcurveto(x + width, y, x + width, y + radius); drawingcontext.lineto(x + width, y + height - radius); drawingcontext.quadraticcurveto(x + width, y + height, x + width - radius, y + height); drawingcontext.lineto(x + radius, y + height); drawingcontext.quadraticcurveto(x, y + height, x, y + height - radius); drawingcontext.lineto(x, y + radius); drawingcontext.quadraticcurveto(x, y, x + radius, y); drawingcontext.closepath(); drawingcontext.stroke(); drawingcontext.fill(); } function drawscrollbars(effectivetotalcardsinwidth, effectivetotalcardsinheight, initialoffsetx, initialoffsety) { drawingcontext.fillstyle = "rgba(255, 255, 255, 0.6)"; drawingcontext.linewidth = 2; // vertical var totalscrollheight = effectivetotalcardsinheight + maincanvas.height; var scaleheight = maincanvas.height - 20; var scrollheight = maincanvas.height / totalscrollheight; var scrollstarty = (-initialoffsety + maincanvas.height * 0.5) / totalscrollheight; roundedrectangle(maincanvas.width - 8, scrollstarty * scaleheight + 10, 5, scrollheight * scaleheight, 4); // horizontal var totalscrollwidth = effectivetotalcardsinwidth + maincanvas.width; var scalewidth = maincanvas.width - 20; var scrollwidth = maincanvas.width / totalscrollwidth; var scrollstartx = (-initialoffsetx + maincanvas.width * 0.5) / totalscrollwidth; roundedrectangle(scrollstartx * scalewidth + 10, maincanvas.height - 8, scrollwidth * scalewidth, 5, 4); } and finally, we need to compute the number of frames per second: function computefps() { if (previous.length > 60) { previous.splice(0, 1); } var start = (new date).gettime(); previous.push(start); var sum = 0; for (var id = 0; id < previous.length - 1; id++) { sum += previous[id + 1] - previous[id]; } var diff = 1000.0 / (sum / previous.length); $("#cardscount").text(diff.tofixed() + " fps. " + listofcards.length + " cards displayed"); } drawing cards relies heavily on the browser's ability to speed up canvas rendering. for the record, here are the performances on my machine with the minimum zoom level (0.05): browser fps internet explorer 9 30 firefox 5 30 chrome 12 17 ipad (with a zoom level of 0.8) 7 windows phone mango (with a zoom level of 0.8) 20 (!!) the site even works on mobile phones and tablets as long as they support html 5. here we can see the inner power of html 5 browsers that can handle a full screen of cards more than 30 times per second! mouse management to browse our cards collection, we have to manage the mouse (including its wheel). for the scrolling, we'll just handle the onmouvemove , onmouseup and onmousedown events. onmouseup and onmousedown events will be used to detect if the mouse is clicked or not: var mousedown = 0; document.body.onmousedown = function (e) { mousedown = 1; getmouseposition(e); previousx = posx; previousy = posy; }; document.body.onmouseup = function () { mousedown = 0; }; the onmousemove event is connected to the canvas and used to move the view: var previousx = 0; var previousy = 0; var posx = 0; var posy = 0; function getmouseposition(eventargs) { var e; if (!eventargs) e = window.event; else { e = eventargs; } if (e.offsetx || e.offsety) { posx = e.offsetx; posy = e.offsety; } else if (e.clientx || e.clienty) { posx = e.clientx; posy = e.clienty; } } function onmousemove(e) { if (!mousedown) return; getmouseposition(e); mousemovefunc(posx, posy, previousx, previousy); previousx = posx; previousy = posy; } this function (onmousemove) calculates the current position and provides also the previous value in order to move the offset of the display window: function move(posx, posy, previousx, previousy) { currentaddx = (posx - previousx) / visucontrol.zoom; currentaddy = (posy - previousy) / visucontrol.zoom; } mousehelper.registermousemove(maincanvas, move); note that jquery also provides tools to manage mouse events. for the management of the wheel, we will have to adapt to different browsers that do not behave the same way on this point: function wheel(event) { var delta = 0; if (event.wheeldelta) { delta = event.wheeldelta / 120; if (window.opera) delta = -delta; } else if (event.detail) { /** mozilla case. */ delta = -event.detail / 3; } if (delta) { wheelfunc(delta); } if (event.preventdefault) event.preventdefault(); event.returnvalue = false; } we can see that everyone does what he wants :). the function to register with this event is: mousehelper.registerwheel = function (func) { wheelfunc = func; if (window.addeventlistener) window.addeventlistener('dommousescroll', wheel, false); window.onmousewheel = document.onmousewheel = wheel; }; and we will use this function to change the zoom with the wheel: // mouse mousehelper.registerwheel(function (delta) { currentaddzoom += delta / 500.0; }); finally we will add a bit of inertia when moving the mouse (and the zoom) to give some kind of smoothness: // inertia var inertia = 0.92; var currentaddx = 0; var currentaddy = 0; var currentaddzoom = 0; function doinertia() { visucontrol.offsetx += currentaddx; visucontrol.offsety += currentaddy; visucontrol.zoom += currentaddzoom; var effectivetotalcardsinwidth = colscount * cardwidth; var rowscount = math.ceil(totalcards / colscount); var effectivetotalcardsinheight = rowscount * cardheight var maxoffsetx = effectivetotalcardsinwidth / 2.0; var maxoffsety = effectivetotalcardsinheight / 2.0; if (visucontrol.offsetx < -maxoffsetx + cardwidth) visucontrol.offsetx = -maxoffsetx + cardwidth; else if (visucontrol.offsetx > maxoffsetx) visucontrol.offsetx = maxoffsetx; if (visucontrol.offsety < -maxoffsety + cardheight) visucontrol.offsety = -maxoffsety + cardheight; else if (visucontrol.offsety > maxoffsety) visucontrol.offsety = maxoffsety; if (visucontrol.zoom < 0.05) visucontrol.zoom = 0.05; else if (visucontrol.zoom > 1) visucontrol.zoom = 1; processlistofcards(); currentaddx *= inertia; currentaddy *= inertia; currentaddzoom *= inertia; // epsilon if (math.abs(currentaddx) < 0.001) currentaddx = 0; if (math.abs(currentaddy) < 0.001) currentaddy = 0; } this kind of small function does not cost a lot to implement, but adds a lot to the quality of user experience. state storage also to provide a better user experience, we will save the display window’s position and zoom. to do this, we will use the service of localstorage (which saves pairs of keys / values for the long term (the data is retained after the browser is closed) and only accessible by the current window object): function saveconfig() { if (window.localstorage == undefined) return; // zoom window.localstorage["zoom"] = visucontrol.zoom; // offsets window.localstorage["offsetx"] = visucontrol.offsetx; window.localstorage["offsety"] = visucontrol.offsety; } // restore data if (window.localstorage != undefined) { var storedzoom = window.localstorage["zoom"]; if (storedzoom != undefined) visucontrol.zoom = parsefloat(storedzoom); var storedoffsetx = window.localstorage["offsetx"]; if (storedoffsetx != undefined) visucontrol.offsetx = parsefloat(storedoffsetx); var storedoffsety = window.localstorage["offsety"]; if (storedoffsety != undefined) visucontrol.offsety = parsefloat(storedoffsety); } animations to add even more dynamism to our application we will allow our users to double-click on a card to zoom and focus on it. our system should animate three values: the two offsets (x, y) and the zoom. to do this, we will use a function that will be responsible of animating a variable from a source value to a destination value with a given duration: var animationhelper = function (root, name) { var paramname = name; this.animate = function (current, to, duration) { var offset = (to - current); var ticks = math.floor(duration / 16); var offsetpart = offset / ticks; var tickscount = 0; var intervalid = setinterval(function () { current += offsetpart; root[paramname] = current; tickscount++; if (tickscount == ticks) { clearinterval(intervalid); root[paramname] = to; } }, 16); }; }; the use of this function is: // prepare animations parameters var zoomanimationhelper = new animationhelper(visucontrol, "zoom"); var offsetxanimationhelper = new animationhelper(visucontrol, "offsetx"); var offsetyanimationhelper = new animationhelper(visucontrol, "offsety"); var speed = 1.1 - visucontrol.zoom; zoomanimationhelper.animate(visucontrol.zoom, 1.0, 1000 * speed); offsetxanimationhelper.animate(visucontrol.offsetx, targetoffsetx, 1000 * speed); offsetyanimationhelper.animate(visucontrol.offsety, targetoffsety, 1000 * speed); the advantage of the animationhelper function is that it is able to animate as many parameters as you wish (and that only with the settimer function!) handling multi-devices finally we will ensure that our page can also be seen on tablets pc and even on phones. to do this, we will use a feature of css 3: the media-queries . with this technology, we can apply style sheets according to some queries such as a specific display size: here we see that if the screen width is less than 480 pixels, the following style sheet will be added: #legal { font-size: 8px; } #title { font-size: 30px !important; } #waittext { font-size: 12px; } #bolaslogo { width: 48px; height: 48px; } #picturecell { width: 48px; } finally we will ensure that our page can also be seen on tablets pc and even on phones. to do this, we will use a feature of css 3: #legal { font-size: 8px; } #title { font-size: 30px !important; } #waittext { font-size: 12px; } #bolaslogo { width: 48px; height: 48px; } #picturecell { width: 48px; } conclusion html 5 / css 3 / javascript and visual studio 2010 allow to develop portable and efficient solutions (within the limits of browsers that support html 5 of course) with some great features such as hardware accelerated rendering. this kind of development is also simplified by the use of frameworks like jquery. also, i am especially fan of javascript that turns out to be a very powerful dynamic language. of course, c# or vb.net developers have to change theirs reflexes but for the development of web pages it's worth. in conclusion, i think that the best to be convinced is to try! to go further internet explorer test drive: http://ie.microsoft.com/testdrive/ internet explorer 9 guide for developer : http://msdn.microsoft.com/en-us/ie/ff468705 w3c site for html 5 : http://dev.w3.org/html5/spec/overview.html internet explorer site : http://msdn.microsoft.com/en-us/ie/aa740469 about the author david catuhe is a developer evangelist for microsoft france in charge of user experience development tools (from xaml to directx/xna and html5). he defines himself as a geek and likes coding all that refer to graphics. before working for microsoft, he founded a company that developed a realtime 3d engine written with directx ( www.vertice.fr ). source: http://blogs.msdn.com/b/eternalcoding/archive/2011/07/25/feedback-of-a-graphic-development-using-html5-amp-javascript.aspx
October 24, 2011
by David Catuhe
· 25,140 Views · 1 Like
article thumbnail
How to retrieve/extract metadata information from audio files using Java and Apache Tika API?
i guess, i’m writing this post after a long time. this time, i’m writing about apache tika api that a friend of mine and i tried out to extract/retrieve metadata information from audio files supported by it – .mp3, .aiff, .au, .midi, .wav. to make it clear, here’s a screenshot of the information shown by windows vista about an audio file: we wanted to extract this using java and with googling, found that apache tika would help. we needed this metadata to index audio files for it to be searchable in a search application that we’re building using apache lucene . here’s a sample java program that extracts metadata from an mp3 file: package singz.samples.search.audio.metadata; import java.io.file; import java.io.fileinputstream; import java.io.filenotfoundexception; import java.io.ioexception; import java.io.inputstream; import org.apache.tika.exception.tikaexception; import org.apache.tika.metadata.metadata; import org.apache.tika.parser.parsecontext; import org.apache.tika.parser.parser; import org.apache.tika.parser.mp3.mp3parser; import org.xml.sax.contenthandler; import org.xml.sax.saxexception; import org.xml.sax.helpers.defaulthandler; /** * @author singaram subramanian * extract metadata of an audio file using apache tika api * */ public class audiometadataextractordemo { public static void main(string[] args) { // this audio file has metadata embedded in xmp (extensible metadata platform) standard // created by adobe systems inc. xmp standardizes the definition, creation, and // processing of extensible metadata. string audiofileloc = "c:\\pop\\backstreetboys_showmethemeaningofbeinglonely.mp3"; try { inputstream input = new fileinputstream(new file(audiofileloc)); contenthandler handler = new defaulthandler(); metadata metadata = new metadata(); parser parser = new mp3parser(); parsecontext parsectx = new parsecontext(); parser.parse(input, handler, metadata, parsectx); input.close(); // list all metadata string[] metadatanames = metadata.names(); for(string name : metadatanames){ system.out.println(name + ": " + metadata.get(name)); } // retrieve the necessary info from metadata // names - title, xmpdm:artist etc. - mentioned below may differ based // on the standard used for processing and storing standardized and/or // proprietary information relating to the contents of a file. system.out.println("title: " + metadata.get("title")); system.out.println("artists: " + metadata.get("xmpdm:artist")); system.out.println("genre: " + metadata.get("xmpdm:genre")); } catch (filenotfoundexception e) { e.printstacktrace(); } catch (ioexception e) { e.printstacktrace(); } catch (saxexception e) { e.printstacktrace(); } catch (tikaexception e) { e.printstacktrace(); } } } maven pom xml 4.0.0 singz.samples.search.audio audiometadataextractor 0.0.1 jar audiometadataextractor http://maven.apache.org utf-8 org.apache.tika tika-core 0.10 org.apache.tika tika-parsers 0.10 output xmpdm:releasedate: 2001 xmpdm:audiochanneltype: stereo xmpdm:album: top 100 pop author: backstreet boys xmpdm:artist: backstreet boys channels: 2 xmpdm:audiosamplerate: 44100 xmpdm:logcomment: eng xmpdm:tracknumber: 04 version: mpeg 3 layer iii version 1 xmpdm:composer: null xmpdm:audiocompressor: mp3 title: show me the meaning of being lonely samplerate: 44100 xmpdm:genre: pop content-type: audio/mpeg title: show me the meaning of being lonely artists: backstreet boys genre: pop about apache tika http://tika.apache.org/index.html “the apache tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.” http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika#article.tika “apache tika is a content type detection and content extraction framework. tika provides a general application programming interface that can be used to detect the content type of a document and also parse textual content and metadata from several document formats. tika does not try to understand the full variety of different document formats by itself but instead delegates the real work to various existing parser libraries such as apache poi for microsoft formats, pdfbox for adobe pdf, neko html for html etc. the grand idea behind tika is that it offers a generic interface for parsing multiple formats. the tika api hides the technical differences of the various parser implementations. this means that you don’t have to learn and consume one api for every format you use but can instead use a single api – the tika api. internally tika usually delegates the parsing work to existing parsing libraries and adapts the parse result so that client applications can easily manage variety of formats. tika aims to be efficient in using available resources (mainly ram) while parsing. the tika api is stream oriented so that the parsed source document does not need to be loaded into memory all at once but only as it is needed. ultimately, however, the amount of resources consumed is mandated by the parser libraries that tika uses. at the time of writing this, tika supports directly around 30 document formats. see list of supported document formats . the list of supported document formats is not limited by tika in any way. in the simplest case you can add support for new document formats by implementing a thin adapter that that implements the parser interface for the new document format.” about xmp standard http://en.wikipedia.org/wiki/extensible_metadata_platform “the adobe extensible metadata platform ( xmp ) is a standard, created by adobe systems inc. , for processing and storing standardized and proprietary information relating to the contents of a file. xmp standardizes the definition, creation, and processing of extensible metadata . serialized xmp can be embedded into a significant number of popular file formats, without breaking their readability by non-xmp-aware applications. embedding metadata avoids many problems that occur when metadata is stored separately. xmp is used in pdf , photography and photo editing applications. xmp can be used in several file formats such as pdf , jpeg , jpeg 2000 , jpeg xr , gif , png , html , tiff , adobe illustrator , psd , mp3 , mp4 , audio video interleave , wav , rf64 , audio interchange file format , postscript , encapsulated postscript , and proposed for djvu . in a typical edited jpeg file, xmp information is typically included alongside exif and iptc information interchange model data.” from http://singztechmusings.wordpress.com/2011/10/17/how-to-retrieveextract-metadata-information-from-audio-files-using-java-and-apache-tika-api/
October 20, 2011
by Singaram Subramanian
· 34,202 Views
article thumbnail
Getters and Setters Are Not Evil
Every now and then some OOP purist comes and tells us that getters and setters are evil, because they break encapsulation. And you should never, ever use getters and setters because this is a sign of a bad design and leads to maintainability nightmares. Well, don’t worry, because those people are wrong. Not completely wrong of course, because getters and setters can break encapsulation, but in the usual scenario for regular business projects they don’t. What is the purpose of encapsulation? First, to hide how exactly an object performs its job. And to protect the internal data of an object, so that no external object can violate its state space. In other words, only the object knows which combination of field values is valid and which isn’t. Exposing fields to the outside world can leave the object in inconsistent state. For example what if you could change the backing array in an ArrayList, without setting the size field? The ArrayList instance will be inconsistent and will be violating its contract. So no getter and setter for the array list internal array. But the majority of objects for which people generate getters and setters are simple data holders. They don’t have any rules to enforce on their state, the state space consists of all possible combinations of values, and furthermore – there is nothing they can do with that data. And before you call me “anemic”, it doesn’t matter if you are doing “real OOP” with domain-driven design, where you have business logic & state in the same object, or you are doing fat service layer + anemic objects. Why it doesn’t matter? Because even in domain-driven projects you have DTOs. And DTOs are simply data holders, which need getters and setters. Another thing is that in many cases your object state is public anyway. Tools use reflection to make use of objects – view technologies use EL to access objects, ORMs use reflection to persist your entities, jackson uses reflection to serialize your objects to JSON, jasper reports uses reflection to get details from its model, etc. Virtually anything you do in the regular project out there requires data being passed outside of the application: to the user, to the database, to the printer, as a result of an API call. And you have to know what that data is. In EL you have ${foo.bar} – with, or without a getter, you consume that field. In an ORM you need to know what database types to use. In the documentation of your JSON API you should specify the structure (another topic here is whether rest-like services need documentation). The overall point here is that you win nothing by not having getters and setters on your data holder objects. Their internal state is public anyway, and it has to be. And any change in those fields means a change has to be made in other places. Change is something people fear – “you will have to change it everywhere in your project” .. well, yeah, you have, because it has changed. If you change the structure of an address from String to an Address class, it’s likely that you should revisit all places it is used and split it there as well. If you change a double to BigDecimal you’d better go and fix all your calculations. Another point – the above examples emphasized on reading the data. However, you must set that data somehow. You have roughly 3 options – constructor, builder, setters. A constructor with 15 arguments is obviously not an option. A builder for every object is just too verbose. So we use setters, because it is more practical and more readable. And that’s the main point here – setters and getters are practical when used on data holder objects. I have supported quite big projects that had a lot of setters and getters, and I had absolutely no problem with that. In fact, tracing “who sets that data” is the same as “where did this object (that encapsulates its data) came from”. And yes, in an ideal OO world you wouldn’t need data holders / DTOs, and there will be no flow of data in the system. But in the real world there is. To conclude – be careful with setters and getters on non-data-only objects. Encapsulation is a real thing. When designing a library, a component or some base frameworks in your project – don’t simply generate getters and setters. But for the regular data object – don’t worry, there’s no evil in that. From http://techblog.bozho.net/?p=621
October 14, 2011
by Bozhidar Bozhanov
· 23,715 Views · 1 Like
article thumbnail
Practical PHP Refactoring: Replace Record with Data Class
We often find ourselves tempted by the shortcut of using directly a record-like data structure provided by the language or a framework. There are many example scenarios where a record emerges: when you use a database for persistence (not only a with relational one) there can be a data structure containing the results of a query. when you use associative arrays, often they have a small number of fixed keys. Record-like structures are the equivalent of C structs, or Ruby hashes. This refactoring is a generalization of Replace Array with Object: in this case the starting point is not just an array which always has the same number and type of fields, but any data structure homoegeneous to it: Zend_Db_Table_Row, implementing a Row Data Gateway and giving access to a row in the database. stdClass instances (or arrays) fetched by PDOStatement. In some languages (see C) the record is a particular data structure which is not an object; in PHP, it is always an array or an object of some vendor class. Even ORMs based on Active Record are more advanced than this kind of usage: they usually let you add methods on the model classes, which are populated with data by the ORM itself. In this refactoring, we are talking about a data structure managed by the language of the persistence layer, and whose code you cannot modify. Why replacing a record-like structure? Generic classes do not let you add methods to manipulate their data: when using directly a Zend_Db_Table_Row or an associative array for storage of the result of a query, you have to resort continuously to foreign methods. Each time you need new logic, you have to compromise encapsulation on the object itself, and these methods get duplicated in various instance of client code. Solutions There are different ways to eliminate the coupling to a record-like structure. The first is to refactor to subclassing, where the data structure becomes an Active Record. You extends the vendor class with your own one: this option is only available if the structure is defined as an object. Moreover, the name of the class to instantiate must be configurable in the persistence mechanism. Zend_Db_Table_Record supports this kind of usage. A second option is to refactor to composition: Zend_Db examples exist also for this case. This approach can be used to avoid large hierarchies: your model classes compose the Zend_Db objects and hide the database; methods can be added for once at your own models. A third and final alternative is to use hydration: data is copied to your model objects, and records are thrown away after the fact. Doctrine 2 and Data Mappers in general choose this approach. Steps I will describe a short procedure for refactoring to hydration since it is the most complex approach and it is always applicable. The other are instead specific for the particular data structure (for example with Zend_Db you have to write some subclasses and configure some protected fields to contain the right class names.) Create a new class, as one of your models. Its state should be represented by one row (or more rows joined into one) in the database. This class should gain a private field for each of the record's fields, usually with getters and setters. This class should accept in the constructor or in a Factory Method an instance of the record, so that it can produce a new instance. If you want to decouple from the persistsnce, or you want to go two ways (also save and not only visualize, since this is not just a presentation model), look for a Data Mapper such as Doctrine 2, which will even hide all the record structures from you and manager associations where objects compose other ones. Example In the example, the data of a single user are returned in an array. I chose an array as the record-like structure to minimize the external dependencies of this code. find(42); $this->assertEquals('Giorgio', $giorgio['name']); } } /** * This is a Fake Table Data Gateway. The machinery for making it work with * a database will be distracting for our purposes, so they will be omitted. */ class UsersTable { /** * @return mixed the returned value can be a Zend_Db_Table_Row, * an Active Record, a stdClass, an associative array... * It should just represent a single entity. */ public function find($id) { // execute a PDOStatement and fetch the data return array('id' => 42, 'name' => 'Giorgio'); } } After the refactoring, we have a User class where we can add all the methods we need: find(42); $this->assertEquals('Giorgio', $giorgio->getName()); } } /** * This is a Fake Table Data Gateway. The machinery for making it work with * a database will be distracting for our purposes, so they will be omitted. */ class UsersTable { /** * @return mixed the returned value can be a Zend_Db_Table_Row, * an Active Record, a stdClass, an associative array... * It should just represent a single entity. */ public function find($id) { // execute a PDOStatement and fetch the data return User::fromRecord(array('id' => 42, 'name' => 'Giorgio')); } } class User { private $id; private $name; public static function fromRecord(array $record) { $object = new self(); $object->id = $record['id']; $object->name = $record['name']; return $object; } public function getName() { return $this->name; } }
September 19, 2011
by Giorgio Sironi
· 11,048 Views
article thumbnail
Memory Barriers/Fences
In this article I'll discuss the most fundamental technique in concurrent programming known as memory barriers, or fences, that make the memory state within a processor visible to other processors. CPUs have employed many techniques to try and accommodate the fact that CPU execution unit performance has greatly outpaced main memory performance. In my “Write Combining” article I touched on just one of these techniques. The most common technique employed by CPUs to hide memory latency is to pipeline instructions and then spend significant effort, and resource, on trying to re-order these pipelines to minimise stalls related to cache misses. When a program is executed it does not matter if its instructions are re-ordered provided the same end result is achieved. For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it. The compiler and CPU are free to re-order the instructions to best utilise the CPU provided it is updated by the time the next iteration is about to commence. Also over the execution of a loop this variable may be stored in a register and never pushed out to cache or main memory, thus it is never visible to another CPU. CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation. Each execution unit can do some combination of these tasks. These execution units operate in parallel allowing instructions to be executed in parallel. This introduces another level of non-determinism to program order if it was observed from another CPU. Finally, when a cache-miss occurs, a modern CPU can make an assumption on the results of a memory load and continue executing based on this assumption until the load returns the actual data. Provided “program order” is preserved the CPU, and compiler, are free to do whatever they see fit to improve performance. Figure 1. Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache. Figure 1 above depicts a simplified view of a modern multi-core CPU. It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system. In a multi-threaded environment techniques need to be employed for making program results visible in a timely manner. I will not cover cache coherence in this article. Just assume that once memory has been pushed to the cache then a protocol of messages will occur to ensure all caches are coherent for any shared data. The techniques for making memory visible from a processor core are known as memory barriers or fences. Memory barriers provide two properties. Firstly, they preserve externally visible program order by ensuring all instructions either side of the barrier appear in the correct program order if observed from another CPU and, secondly, they make the memory visible by ensuring the data is propagated to the cache sub-system. Memory barriers are a complex subject. They are implemented very differently across CPU architectures. At one end of the spectrum there is a relatively strong memory model on Intel CPUs that is more simple than say the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers. Since x86 CPUs are the most common for multi-threaded programming I’ll try and simplify to this level. Store Barrier A store barrier, “sfence” instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued. This will make the program state visible to other CPUs so they can act on it if necessary. A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor. When the sequence is updated other consumers and producers know how far this consumer has progressed and thus can take appropriate action. All previous updates to memory that happened before the barrier are now visible. private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE; // from inside the run() method T event = null; long nextSequence = sequence + 1L; while (running) { try { final long availableSequence = dependencyBarrier.waitFor(nextSequence); while (nextSequence <= availableSequence) { event = dependencyBarrier.getEvent(nextSequence); eventHandler.onEvent(event, nextSequence == availableSequence); nextSequence++; } sequence = event.getSequence(); // store barrier inserted here !!! } catch (final Exception ex) { exceptionHandler.handle(ex, event); sequence = event.getSequence(); // store barrier inserted here !!! nextSequence = event.getSequence() + 1L; } } Load Barrier A load barrier, “lfence” instruction on x86, forces all load instructions after the barrier to happen after the barrier and then wait on the load buffer to drain for that CPU. This makes program state exposed from other CPUs visible to this CPU before making further progress. A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor. Full Barrier A full barrier, "mfence" instruction on x86, is a composite of both load and store barriers happening on a CPU. Java Memory Model In the Java Memory Model a volatile field has a store barrier inserted after a write to it and a load barrier inserted before a read of it. Qualified final fields of a class have a store barrier inserted after their initialisation to ensure these fields are visible once the constructor completes when a reference to the object is available. Atomic Instructions and Software Locks Atomic instructions, such as the “lock ...” instructions on x86, are effectively a full barrier as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order. Performance Impact of Memory Barriers Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency therefore they have a significant performance cost which must be considered. To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. Taking this approach allows the processor to optimise the units of work without restriction. There is an advantage to grouping necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them. From http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html
September 12, 2011
by Martin Thompson
· 26,067 Views · 8 Likes
article thumbnail
JSON data migration
JSON data format is simple and still powerful. Nowadays you can encounter more and more web applications communicating using JSON format then a couple of years ago. It is simple for a developer to read the format, it is effective for a web browser to parse the format and there are databases using it as its primary data format. But what happens when the data structure changes? You need to migrate. And that is where acris-json-migration might help you! Example situation Let's shed a light into it and assume we have a data like this: { "firstName":"John" "secondName":"Doe" "street":"Over the rainbow" "streetNr":21 } Such data can be represented by following Java domain object: public class Person { String firstName; String secondName; String street; Integer streetNr; // ... and getters and setters... } Well, this seems like data about a person named John Doe. We stored it in a database and you can clearly see, that secondName is probably not the field name we really like to have. But a developer made a mistake and in second version of our domain model we are going to fix it: { "firstName":"John" "surname":"Doe" "street":"Over the rainbow" "streetNr":21 } Now you can see the point - thousands of data stored in the format defined by Person class in its version #1 but our program communicating in version #2 with changed secondName to surname in Person class. Clients can wonder why the don't see surnames, can't they? ;) One thing to remember (for the following context) - the class Person changed and there is only Person class in version #2. Simple migration script In this situation I would like to write a script: public class PersonV1toV2Script extends JacksonTransformationScript { @Override public void process(ObjectNode node) { rename(node, "secondName", "surname"); } } From the above example it is clear that the script will do the job. And you can do pretty anything with the whole tree of JSON data - adding new nodes, removing existing ones, transforming here and there - all thanks to Jackson's tree model. How can I execute it? There is a Transformer abstract class representing a transofmer responsible for passing JSON data to a script and writing it back. Currently there are tow kinds of transformers: Jackson-based JSONT-based Jackson-based is the preferred one and is more developed then JSONT-based. So to execute a transformation on a data set you have to specify only two lines of code: JacksonTransformer t = new JacksonTransformer(input, output); t.transform(PersonV1toV2Script.class.getName()); ... where input and output represent directories. In the input directory all files are treated as files containing JSON data and are transformed and written to the output directory. For a detailed test you can look into TransformerTest in the project. Conclusion The script's helper API is evolving and provides you with nice methods like removeIfExists or addNonExistent methods. We would like to hear about your use-cases which are not handled by acris-json-migration yet so the project can generally serve the purpose of JSON data migration.
September 9, 2011
by Ladislav Gažo
· 12,536 Views
article thumbnail
False Sharing
Memory is stored within the cache system in units know as cache lines. Cache lines are a power of 2 of contiguous bytes which are typically 32-256 in size. The most common cache line size is 64 bytes. False sharing is a term which applies when threads unwittingly impact the performance of each other while modifying independent variables sharing the same cache line. Write contention on cache lines is the single most limiting factor on achieving scalability for parallel threads of execution in an SMP system. I’ve heard false sharing described as the silent performance killer because it is far from obvious when looking at code. To achieve linear scalability with number of threads, we must ensure no two threads write to the same variable or cache line. Two threads writing to the same variable can be tracked down at a code level. To be able to know if independent variables share the same cache line we need to know the memory layout, or we can get a tool to tell us. Intel VTune is such a profiling tool. In this article I’ll explain how memory is laid out for Java objects and how we can pad out our cache lines to avoid false sharing. Figure 1. Figure 1. above illustrates the issue of false sharing. A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y. Unfortunately these two hot variables reside in the same cache line. Each thread will race for ownership of the cache line so they can update it. If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2. When Core 2 gets ownership and performs its update, then core 1 will be told to invalidate its copy of the cache line. This will ping pong back and forth via the L3 cache greatly impacting performance. The issue would be further exacerbated if competing cores are on different sockets and additionally have to cross the socket interconnect. Java Memory Layout For the Hotspot JVM, all objects have a 2-word header. First is the “mark” word which is made up of 24-bits for the hash code and 8-bits for flags such as the lock state, or it can be swapped for lock objects. The second is a reference to the class of the object. Arrays have an additional word for the size of the array. Every object is aligned to an 8-byte granularity boundary for performance. Therefore to be efficient when packing, the object fields are re-ordered from declaration order to the following order based on size in bytes: doubles (8) and longs (8) ints (4) and floats (4) shorts (2) and chars (2) booleans (1) and bytes (1) references (4/8) With this knowledge we can pad a cache line between any fields with 7 longs. Within the Disruptor we pad cache lines around the RingBuffer cursor and BatchEventProcessor sequences. To show the performance impact let’s take a few threads each updating their own independent counters. These counters will be volatile longs so the world can see their progress. public final class FalseSharing implements Runnable { public final static int NUM_THREADS = 4; // change public final static long ITERATIONS = 500L * 1000L * 1000L; private final int arrayIndex; private static VolatileLong[] longs = new VolatileLong[NUM_THREADS]; static { for (int i = 0; i < longs.length; i++) { longs[i] = new VolatileLong(); } } public FalseSharing(final int arrayIndex) { this.arrayIndex = arrayIndex; } public static void main(final String[] args) throws Exception { final long start = System.nanoTime(); runTest(); System.out.println("duration = " + (System.nanoTime() - start)); } private static void runTest() throws InterruptedException { Thread[] threads = new Thread[NUM_THREADS]; for (int i = 0; i < threads.length; i++) { threads[i] = new Thread(new FalseSharing(i)); } for (Thread t : threads) { t.start(); } for (Thread t : threads) { t.join(); } } public void run() { long i = ITERATIONS + 1; while (0 != --i) { longs[arrayIndex].value = i; } } public final static class VolatileLong { public volatile long value = 0L; public long p1, p2, p3, p4, p5, p6; // comment out } } Results Running the above code while ramping the number of threads and adding/removing the cache line padding, I get the results depicted in Figure 2. below. This is measuring the duration of test runs on my 4-core Nehalem at home. Figure 2. The impact of false sharing can clearly be seen by the increased execution time required to complete the test. Without the cache line contention we achieve near linear scale up with threads. This is not a perfect test because we cannot be sure where the VolatileLongs will be laid out in memory. They are independent objects. However experience shows that objects allocated at the same time tend to be co-located. So there you have it. False sharing can be a silent performance killer. From http://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html
August 31, 2011
by Martin Thompson
· 38,989 Views · 10 Likes
article thumbnail
Concurrency Pattern: Producer and Consumer
In my career spanning 15 years, the problem of Producer and Consumer is one that I have come across only a few times. In most programming cases, what we are doing is performing functions in a synchronous fashion where the JVM or the web container handles the complexities of multi-threading on its own. However, when writing certain kinds of use cases where we need this. Last week, I came acros one such use case that sent me 3 years back when I last did it. However, the way it was done last time was very different. When I first heard the problem statement, I knew instantly what was needed. However, my approach to doing it this time was going to be different from last time. It had simply to do with how I am viewing technology in my life today. I will not go into any non-technical side and will jump straight into the problem and its solution. I started to look at what existed in the market and did come across a couple of posts that helped me in channelizing my thoughts in the right way. Problem Statement We need a solution for a batch migration. We are migrating data form System 1 to System 2 and in the process we need to do three tasks: Load data from Database based on groups Process the data Update the records loaded in step#1 with modifications We have to handle 100s of groups and each group will have around 40K records. You can imagine the amount of time it would take if we were to perform this exercise in a synchronous fashion. Image here explains this problem in an effective way. Producer Consumer: The Problem Producer and Consumer Pattern Let us take a look at the Producer Consumer pattern to begin with. If you refer to the problem statement above and look at the image, we see that there are so many entities who are ready with their part of data. However, there are not enough workers who can process all the data. Hence, as the producers continue to line-up in a queue it just continues to grow. We see that the systems start to hog up threads and take a lot of time. Intermediate Solution Producer Consumer: The Intermediate approch We do have an intermediate solution. Refer to the image and you will immediately notice that the producers are piling up their work in a filing cabinet and the worker continues to pick it up as they get done with the previous task. However, this approach does have some glaring shortcomings: There is still one worker who has to do all the work. The external systems may be happy, but the task will still continue to exist until the worker has completed all of the tasks The producers will pile up their data in a queue and it needs resources to hold the same. Just as in this example the cabinet can fill up, the same can happen with the JVM resources too. We need to be careful how much data we are going to place in memory and in some cases it may not be much. The Solution Producer Consumer: The Solution The solution is what we see everyday in many places – like the cinema hall queue, Petrol Pumps etc. There are so many people who come in to book a ticket and based on how many people come in, the more people are added to issue tickets. Essentially, refer to image here and you will notice that Producers will keep adding their jobs to the cabinet and we have more workers to handle the work load. Java provided concurrency package to solve this issue. Till now, I have always worked on threading at a much lower level and this was first time I was going to work with this package. As I started to explore the web and read fellow bloggers with what they have to say, I came across one very good article. It helped in understanding the use of BlockingQueue in a very effective manner. However, the solutions provided by Dhruba would not have helped me in achieving the high throughput which is needed. So, I started to explore the use of ArrayBlockingQueue for the same. The Controller This is the first class where the contract between the producers and consumers are managed. The controller will setup 1 thread for the Producer and 2 threads for the consumer. Based on the needs we can create as many threads as we need; and even can even read the data from a properties or do some dynamic magic. For now, we will keep this simple. package com.kapil.techieforever.producerconsumer; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; public class TestProducerConsumer { public static void main(String args[]) { try { Broker broker = new Broker(); ExecutorService threadPool = Executors.newFixedThreadPool(3); threadPool.execute(new Consumer("1", broker)); threadPool.execute(new Consumer("2", broker)); Future producerStatus = threadPool.submit(new Producer(broker)); // this will wait for the producer to finish its execution. producerStatus.get(); threadPool.shutdown(); } catch (Exception e) { e.printStackTrace(); } } } I am using ExecuteService to create a thread pool and manage it. Instead of using the basic Thread implementation, this is a more effective way as it will handle the exiting and restarting the threads as needed. You will also notice that I am using Future class to get the status of the producer thread. This class is very effective and will halt my program from further execution. This is a nice way of replacing the “.join” method on the threads. Note: I am not using Future very effectively in this example; so you may have to try a few things as you feel fit. Also, you should note the Broker class which is being used as filing cabinet between the producers and consumers. We will see its implementation in just a little while. The Producer This class is responsible for producing the data that needs to be worked upon. package com.kapil.techieforever.producerconsumer; public class Producer implements Runnable { private Broker broker; public Producer(Broker broker) { this.broker = broker; } @Override public void run() { try { for (Integer i = 1; i < 5 + 1; ++i) { System.out.println("Producer produced: " + i); Thread.sleep(100); broker.put(i); } this.broker.continueProducing = Boolean.FALSE; System.out.println("Producer finished its job; terminating."); } catch (InterruptedException ex) { ex.printStackTrace(); } } } This class is doing the most simplest of things that it can do – adding an integer to the broker. Some key areas to note are: 1. There is a property on Broker which is updated in the end by the producer when its done producing. This is also known as the “final” or “poison” entry. This is used by the consumers to know that there are no more data coming up 2. I have used Thread.sleep to simulate that some producers may take more time to produce the data. You can tweak this value and see the consumers act The Consumer This class is responsible for reading the data from the broker and doing its job package com.kapil.techieforever.producerconsumer; public class Consumer implements Runnable { private String name; private Broker broker; public Consumer(String name, Broker broker) { this.name = name; this.broker = broker; } @Override public void run() { try { Integer data = broker.get(); while (broker.continueProducing || data != null) { Thread.sleep(1000); System.out.println("Consumer " + this.name + " processed data from broker: " + data); data = broker.get(); } System.out.println("Comsumer " + this.name + " finished its job; terminating."); } catch (InterruptedException ex) { ex.printStackTrace(); } } } This is again a simple class that reads the Integer and prints it on the console. However, key points to note are: 1. The loop to process data is an endless loop, that runs on two conditions – until the producer is consuming and there is some data with the broker 2. Again, the Thread.sleep is used to create effective and different scenarios The Broker package com.kapil.techieforever.producerconsumer; import java.util.concurrent.ArrayBlockingQueue; import java.util.concurrent.TimeUnit; public class Broker { public ArrayBlockingQueue queue = new ArrayBlockingQueue(100); public Boolean continueProducing = Boolean.TRUE; public void put(Integer data) throws InterruptedException { this.queue.put(data); } public Integer get() throws InterruptedException { return this.queue.poll(1, TimeUnit.SECONDS); } } The very first thing to note is that we are using ArrayBlockingQueue as the data holder. I am not going to say what this does, but insist you to read it on the JavaDocs here. however, I will explain that the producers are going to place the data in the queue and the consumers will fetch from the queue in FIFO format. But, if the producers are slow, the consumers will wait for data to come in and if the array is full, the producers will wait for it to fill up. Also, note that I am using the ‘poll’ function instead of get in the queue. This is to ensure that the consumers will not keep waiting for ever and the waiting will time out after a few seconds. This helps us in inter-communication and kill the consumers when all the data is processed. (Note: try replacing poll with get and you will see some interesting outputs). Code I have the code sitting on Google project hosting. Feel free to go across and download it from there. It is essentially an eclipse (Spring STS) project. You may also get additional packages and classes when you download it based on when you are downloading it. Feel free to look into those too and share your comments - You can browse the source code on the SVN browser or; - You can download it from the project itself From http://scratchpad101.com/2011/08/22/concurrency-pattern-producer-consumer/
August 29, 2011
by Kapil Viren Ahuja
· 68,557 Views · 2 Likes
article thumbnail
Java NIO vs. IO
when studying both the java nio and io api's, a question quickly pops into mind: when should i use io and when should i use nio? in this text i will try to shed some light on the differences between java nio and io, their use cases, and how they affect the design of your code. main differences of java nio and io the table below summarizes the main differences between java nio and io. i will get into more detail about each difference in the sections following the table. io nio stream oriented buffer oriented blocking io non blocking io selectors stream oriented vs. buffer oriented the first big difference between java nio and io is that io is stream oriented, where nio is buffer oriented. so, what does that mean? java io being stream oriented means that you read one or more bytes at a time, from a stream. what you do with the read bytes is up to you. they are not cached anywhere. furthermore, you cannot move forth and back in the data in a stream. if you need to move forth and back in the data read from a stream, you will need to cache it in a buffer first. java nio's buffer oriented approach is slightly different. data is read into a buffer from which it is later processed. you can move forth and back in the buffer as you need to. this gives you a bit more flexibility during processing. however, you also need to check if the buffer contains all the data you need in order to fully process it. and, you need to make sure that when reading more data into the buffer, you do not overwrite data in the buffer you have not yet processed. blocking vs. non-blocking io java io's various streams are blocking. that means, that when a thread invokes a read() or write(), that thread is blocked until there is some data to read, or the data is fully written. the thread can do nothing else in the meantime. java nio's non-blocking mode enables a thread to request reading data from a channel, and only get what is currently available, or nothing at all, if no data is currently available. rather than remain blocked until data becomes available for reading, the thread can go on with something else. the same is true for non-blocking writing. a thread can request that some data be written to a channel, but not wait for it to be fully written. the thread can then go on and do something else in the mean time. what threads spend their idle time on when not blocked in io calls, is usually performing io on other channels in the meantime. that is, a single thread can now manage multiple channels of input and output. selectors java nio's selectors allow a single thread to monitor multiple channels of input. you can register multiple channels with a selector, then use a single thread to "select" the channels that have input available for processing, or select the channels that are ready for writing. this selector mechanism makes it easy for a single thread to manage multiple channels. how nio and io influences application design whether you choose nio or io as your io toolkit may impact the following aspects of your application design: the api calls to the nio or io classes. the processing of data. the number of thread used to process the data. the api calls of course the api calls when using nio look different than when using io. this is no surprise. rather than just read the data byte for byte from e.g. an inputstream, the data must first be read into a buffer, and then be processed from there. the processing of data the processing of the data is also affected when using a pure nio design, vs. an io design. in an io design you read the data byte for byte from an inputstream or a reader. imagine you were processing a stream of line based textual data. for instance: name: anna age: 25 email: [email protected] phone: 1234567890 this stream of text lines could be processed like this: inputstream input = ... ; // get the inputstream from the client socket bufferedreader reader = new bufferedreader(new inputstreamreader(input)); string nameline = reader.readline(); string ageline = reader.readline(); string emailline = reader.readline(); string phoneline = reader.readline(); notice how the processing state is determined by how far the program has executed. in other words, once the first reader.readline() method returns, you know for sure that a full line of text has been read. the readline() blocks until a full line is read, that's why. you also know that this line contains the name. similarly, when the second readline() call returns, you know that this line contains the age etc. as you can see, the program progresses only when there is new data to read, and for each step you know what that data is. once the executing thread have progressed past reading a certain piece of data in the code, the thread is not going backwards in the data (mostly not). this principle is also illustrated in this diagram: java io: reading data from a blocking stream. a nio implementation would look different. here is a simplified example: bytebuffer buffer = bytebuffer.allocate(48); int bytesread = inchannel.read(buffer); notice the second line which reads bytes from the channel into the bytebuffer. when that method call returns you don't know if all the data you need is inside the buffer. all you know is that the buffer contains some bytes. this makes processing somewhat harder. imagine if, after the first read(buffer) call, that all what was read into the buffer was half a line. for instance, "name: an". can you process that data? not really. you need to wait until at leas a full line of data has been into the buffer, before it makes sense to process any of the data at all. so how do you know if the buffer contains enough data for it to make sense to be processed? well, you don't. the only way to find out, is to look at the data in the buffer. the result is, that you may have to inspect the data in the buffer several times before you know if all the data is inthere. this is both inefficient, and can become messy in terms of program design. for instance: bytebuffer buffer = bytebuffer.allocate(48); int bytesread = inchannel.read(buffer); while(! bufferfull(bytesread) ) { bytesread = inchannel.read(buffer); } the bufferfull() method has to keep track of how much data is read into the buffer, and return either true or false, depending on whether the buffer is full. in other words, if the buffer is ready for processing, it is considered full. the bufferfull() method scans through the buffer, but must leave the buffer in the same state as before the bufferfull() method was called. if not, the next data read into the buffer might not be read in at the correct location. this is not impossible, but it is yet another issue to watch out for. if the buffer is full, it can be processed. if it is not full, you might be able to partially process whatever data is there, if that makes sense in your particular case. in many cases it doesn't. the is-data-in-buffer-ready loop is illustrated in this diagram: java nio: reading data from a channel until all needed data is in buffer. summary nio allows you to manage multiple channels (network connections or files) using only a single (or few) threads, but the cost is that parsing the data might be somewhat more complicated than when reading data from a blocking stream. if you need to manage thousands of open connections simultanously, which each only send a little data, for instance a chat server, implementing the server in nio is probably an advantage. similarly, if you need to keep a lot of open connections to other computers, e.g. in a p2p network, using a single thread to manage all of your outbound connections might be an advantage. this one thread, multiple connections design is illustrated in this diagram: java nio: a single thread managing multiple connections. if you have fewer connections with very high bandwidth, sending a lot of data at a time, perhaps a classic io server implementation might be the best fit. this diagram illustrates a classic io server design: java io: a classic io server design - one connection handled by one thread. from http://tutorials.jenkov.com/java-nio/nio-vs-io.html
August 28, 2011
by Jakob Jenkov
· 133,867 Views · 19 Likes
article thumbnail
Practical PHP Refactoring: Replace Array with Object
This refactoring is a specialization of Replace Data Value with Object: its goal is to replace a scalar or primitive structure (in this case, an ever-present array) with an object where we can host methods that act on those data. We have already seen a lightweight version of this refactoring in the code sample of that article: this time we go all the way to a real object, which has private fields representing the elements of the array. Usually the target of the refactoring is an associative array, but it may also be a numeric one, with a limited number of elements. When to introduce an object where a simple array already works? A clue that points to the need for this refactoring in numerical arrays is the fact that the elements are not homogeneous: they may be in type (all strings or integers) but not in meaning. For example, if two or them are flipped the array loses meaning or becomes very strange: array( 'FirstName LastName', '[email protected]' ) For associative arrays, the refactoring is viable everytime the number of elements is strictly fixed: array( 'name' => ... 'email' => ... ) Private fields are self-documenting, and they're easier to understand and maintain that the documentation of the keys of an array. Documentation on array structures always gets repeated in docblocks and doesn't have a real place to live in without a class; moreover, it's the death of encapsulation as nothing stops client code (even in the parts that should only pass the array to other methods) from accessing every single element of the array. And of course, a class is a place where to put methods, while an array cannot host them. Steps The technique described by Fowler for this refactoring is composed of many little steps: create a new class: it should contain only a public field encapsulating a little the array. Change the client code to use this new class in place of the primitive variable. In an iterative cycle, add a getter and a setter for each field and change client code. At each step, the relevant tests should be run. The methods should still use internally the elements of the array. When this phase has been completed, make the array private and see if the code still works. Add private fields to substitute the elements of the array, and change getters and setters accordingly. This change now ripples only into the source code of the new class. When you're finished, delete the field storing the array. Many little steps are often appropriate as the usage of the array spans over dozens of differente classes, and raises the risk of reaching an irreparably broken build. After you have reached the final state, an object with getters and setters, you can go on and remove methods accordingly for immutability or encapsulation; or move Foreign Methods to the new class now that it has become a first class citizen. Note that tests may encompass even end-to-end ones if the array was used on a large scale. For example, we replaced arrays with objects in the two upper layers of the application, forcing us to run tests at the end-to-end scale. Example In the initial state, a response is created by putting together an array. Client code is omitted for brevity, and only the creation part will be our target. true, 'content' => '{someJson:"ok"}' ); } } The array is moved onto a public field of a new class. true, 'content' => '{someJson:"ok"}' )); } } class HttpResponse { public $data; public function __construct(array $data) { $this->data = $data; } } We add setters (also getters in case we need them.) class HttpResponse { public $data; public function __construct(array $data) { $this->data = $data; } public function setSuccess($boolean) { $this->data['success'] = $boolean; } public function setContent($content) { $this->data['content'] = $content; } } The array becomes private, to check that only getters, setters and methods are really used externally. true, 'content' => '{someJson:"ok"}' )); $response->setSuccess(false); $response->setContent('{}'); $this->assertEquals(new HttpResponse(array( 'success' => false, 'content' => '{}' )), $response); } } class HttpResponse { private $data; public function __construct(array $data) { $this->setSuccess($data['success']); $this->setContent($data['content']); } public function setSuccess($boolean) { $this->data['success'] = $boolean; } public function setContent($content) { $this->data['content'] = $content; } } Private fields replace the array elements. We can start move logic into methods on the new class. class HttpResponse { private $success; private $content; public function __construct(array $data) { $this->setSuccess($data['success']); $this->setContent($data['content']); } public function setSuccess($boolean) { $this->success = $boolean; } public function setContent($content) { $this->content = $content; } }
August 24, 2011
by Giorgio Sironi
· 11,559 Views
article thumbnail
Clojure: partition-by, split-with, group-by, and juxt
Today I ran into a common situation: I needed to split a list into 2 sublists - elements that passed a predicate and elements that failed a predicate. I'm sure I've run into this problem several times, but it's been awhile and I'd forgotten what options were available to me. A quick look at http://clojure.github.com/clojure/ reveals several potential functions: partition-by, split-with, and group-by. partition-by From the docs: Usage: (partition-by f coll) Applies f to each value in coll, splitting it each time f returns a new value. Returns a lazy seq of partitions. Let's assume we have a collection of ints and we want to split them into a list of evens and a list of odds. The following REPL session shows the result of calling partition-by with our list of ints. user=> (partition-by even? [1 2 4 3 5 6]) ((1) (2 4) (3 5) (6)) The partition-by function works as described; unfortunately, it's not exactly what I'm looking for. I need a function that returns ((1 3 5) (2 4 6)). split-with From the docs: Usage: (split-with pred coll) Returns a vector of [(take-while pred coll) (drop-while pred coll)] The split-with function sounds promising, but a quick REPL session shows it's not what we're looking for. user=> (split-with even? [1 2 4 3 5 6]) [() (1 2 4 3 5 6)] As the docs state, the collection is split on the first item that fails the predicate - (even? 1). group-by From the docs: Usage: (group-by f coll) Returns a map of the elements of coll keyed by the result of f on each element. The value at each key will be a vector of the corresponding elements, in the order they appeared in coll. The group-by function works, but it gives us a bit more than we're looking for. user=> (group-by even? [1 2 4 3 5 6]) {false [1 3 5], true [2 4 6]} The result as a map isn't exactly what we desire, but using a bit of destructuring allows us to grab the values we're looking for. user=> (let [{evens true odds false} (group-by even? [1 2 4 3 5 6])] [evens odds]) [[2 4 6] [1 3 5]] The group-by results mixed with destructuring do the trick, but there's another option. juxt From the docs: Usage: (juxt f) (juxt f g) (juxt f g h) (juxt f g h & fs) Alpha - name subject to change. Takes a set of functions and returns a fn that is the juxtaposition of those fns. The returned fn takes a variable number of args, and returns a vector containing the result of applying each fn to the args (left-to-right). ((juxt a b c) x) => [(a x) (b x) (c x)] The first time I ran into juxt I found it a bit intimidating. I couldn't tell you why, but if you feel the same way - don't feel bad. It turns out, juxt is exactly what we're looking for. The following REPL session shows how to combine juxt with filter and remove to produce the desired results. user=> ((juxt filter remove) even? [1 2 4 3 5 6]) [(2 4 6) (1 3 5)] There's one catch to using juxt in this way, the entire list is processed with filter and remove. In general this is acceptable; however, it's something worth considering when writing performance sensitive code. From http://blog.jayfields.com/2011/08/clojure-partition-by-split-with-group.html
August 24, 2011
by Jay Fields
· 13,192 Views
article thumbnail
Edge Side Includes with Varnish in 10 minutes
Varnish is a tool built to be an intermediate server in the HTTP chain, not an origin one like Apache or IIS. You can outsource caching, logging, zipping and other filters to Varnish, since they are not the main feature of an HTTP server like Apache. What we'll see today is how to work with Edge Side Includes in Varnish, as a way to compose dynamic pages from independently generated and cached fragments; we won't encounter logging or other features. If you are familiar with PHP, ESI is an (almost) standard for executing include()-like statements on a front end server like Varnish; the proxy is able not only to assembly pages but also to cache them according to different policies: a certain time, for a single user, and so on. Thijs Feryn and Alessandro Nadalin introduced me to Varnish and ESI respectively, for the first time. I recommend you to consider their blogs and talks as additional sources on these topics. Installation The default version of Varnish in Ubuntu 11.04 is instead 2.1, and apparently does not support ESI very much. Installation via packages means adding a public key and a repository to your list of software sources, and install the varnish package via apt-get or an equivalent command. You can install version 3.0.0 via packages, but only in Ubuntu LTS (10.04). A way that always works in these cases is the installation from sources. The linked page will list the package dependencies and give you a sequence of 3-4 commands to seamlessly compile varnish. I used checkinstall instead of make install to get a binary package that I can reuse later: $ sudo checkinstall -D --install=no --fstrans=no [email protected] --reset-uids=yes --nodoc --pkgname=varnish --pkgversion=3.0.0 --pkgrelease=201108231000 --arch=i386 After installation with dpkg, check that varnishd is available and of the right version: [10:18:17][giorgio@Desmond:~]$ varnishd -V varnishd (varnish-3.0.0 revision 3bd5997) Copyright (c) 2006 Verdens Gang AS Copyright (c) 2006-2011 Varnish Software AS Varnish needs minimal configuration: a server to point at. For our tests you can edit /etc/varnish/default.vcl and check (or add) the following: backend default { .host = "127.0.0.1"; .port = "80"; } You can execute ps -A | grep varnishd at any time to see if varnish is already in execution. Execution [09:55:18][giorgio@Desmond:~]$ sudo varnishd -f /etc/varnish/default.vcl -s malloc,1G -T 127.0.0.1:2000 -a 0.0.0.0:8080 storage_malloc: max size 1024 MB. 1 gigabyte of memory is allocated for keeping fragments in RAM. An administrative interface will respond on port 2000, and only be accessible from localhost. http://localhost:8080/ is the exposed HTTP server, and will point to http://localhost:80 as defined in the configuration. Look at man varnishd for more switched and to man vcl for additional explanations on the configuration language. A bit of ESI ESI is a technique for leveraging HTTP cache and at the same time build dynamic pages. The problem with today's pages is that they are highly dynamic: some sections change very often or according to the current user (Welcome, John Doe or the current posts timeline); some sections do not change at all for days (the navigation bar and the layout structure); some sections change in response to external events (the list of incoming messages only when a new message arrives). It would be ideal to set different caching configurations for all the page's fragments. But implementing this strategy in the application code is error-prone and means reinventing the wheel. To use HTTP cache you will be forced to load with Ajax every single fragment of the page, even a single paragraph. With ESI, your application produces only the pieces, and lets an implementor of the Edge Side Include specification like Varnish assemble the whole thing. Example HTML page (very static): Varnish will work on this page: . PHP page (really dynamic, can change at any time): Varnish will work on this page: 2011-08-23. No sign of Varnish interventions, and totally transparent for the client. And sometimes you can also throw away Zend_Layout and similar components to assemble HTML on the PHP side.
August 23, 2011
by Giorgio Sironi
· 25,149 Views · 1 Like
article thumbnail
Practical PHP Refactoring: Replace Data Value with Object
One of the rules of simple design is the necessity to minimize the number of moving parts, like classes and methods, as long as the tests are satisfied and we are not accepting duplication or feeling the lack of an explicit concept. Thus, a rule that aids simple design is to use primitive types unless a field has already some behavior attached: we don't create a class for the user's name or the user's password; we just use some strings. As we make progress, however, we must be able to revise our decisions via refactoring: if a field gains some logic, this behavior shouldn't be modelled by methods in the containing class, but by a new object. The code in this new class can be reused, while the containing object will change from case to case and you will end up duplicating the same methods. Transforming a scalar value into an object is the essence of the Replace Data Value with Object refactoring. In most of the cases, a Value Object or a Parameter Object come out as a result: while DDD pursue Value Objects as concepts in the domain layer, this refactoring is more general and can be applied anywhere. For instance, in a project we started introducing Data Transfer Objects to model the data sent by the controller to a Service Layer. Data values in PHP In PHP, all scalar values are by nature data values as they cannot host methods: string, integers, and booleans are proper scalar. arrays are not scalar in the Perl or mathematical sense, but they are still a primitive type. On the borderline, we find some simple objects used as data containers in PHP: ArrayObjects. SplHeap and other SPL data structures. The classes on the borderline may host methods, but the original class is out of reach for modification, and an indirection has to be introduced."Local Extension" Steps Create the new class: it should contain as a private field just the value you want to substitute. The methods you immediately need have to be chosen between a constructor, getters, and setters (where needed). Change the field in the containing class. Update the constructor to also create the new object and populate the field, or accept injection (a rarer case). Update the original getter to delegate to the new one. Update the original setter to delegate to the new one (where present) or to create a new object. Run tests at the functional level; the changes should be propagated to the construction phases, while the external usage should not change very much. Example In the initial state, magic arrays are passed around. It's very easy to build an array where a key is missing or is called incorrectly. newPassword(array( 'userId' => 42, 'oldPassword' => 'gismo', 'newPassword' => 'supersecret', 'repeatNewPassword' => 'supersecret' )); $this->markTestIncomplete('This refactoring is about the introduction of an object; it suffices that the test does not explode.'); } } class UserService { public function newPassword($changePasswordData) { /* it's not interesting to do something here */ } } After the introduction of an ArrayObject extension, a little type safety is ensure and we gained a place to put methods at a little cost. newPassword(new ChangePasswordCommand(array( 'userId' => 42, 'oldPassword' => 'gismo', 'newPassword' => 'supersecret', 'repeatNewPassword' => 'supersecret' ))); $this->markTestIncomplete('This refactoring is about the introduction of an object; it suffices that the test does not explode.'); } } class UserService { public function newPassword(ChangePasswordCommand $changePasswordData) { /* it's not interesting to do something here */ } } class ChangePasswordCommand extends ArrayObject { } We add methods to implement logic on this object; in this case, validation logic; in general cases, any kind of code that should not be duplicated by the different clients. For a stricter implementation, wrap an array or another data structure (scalars, SPL objects) instead of extending ArrayObject as you gain immutability and encapsulation (but this kind of objects need little encapsulation.) class ChangePasswordCommand extends ArrayObject { public function __construct($data) { if (!isset($data['userId'])) { throw new Exception('User id is missing.'); } parent::__construct($data); } public function getPassword() { if ($this['newPassword'] != $this['repeatNewPassword']) { throw new Exception('Password do not match.'); } return $this['newPassword']; } } Being this a refactoring however, this is the less invasive kind of introduction of objects you can make as the client code can still use the ArrayAccess interface and treat the object as a scalar array.
August 15, 2011
by Giorgio Sironi
· 9,877 Views
  • Previous
  • ...
  • 426
  • 427
  • 428
  • 429
  • 430
  • 431
  • 432
  • 433
  • 434
  • 435
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×