The Latest Data Engineering Topics

Connecting to SQL Azure with SQL Management Studio
Intro
If you want to manage your SQL Databases in Azure using tools that you’re a little more familiar and comfortable with – for example, SQL Management Studio – how do you go about connecting? You could read the help article from Microsoft, or you can follow my intuitive screen-based instructions below.
Assumptions
1. I’m assuming you have a version of SQL Management Studio already installed. I believe you’ll need at least SQL Server 2008 R2’s version or newer.
2. I’m further assuming you’ve already created a SQL Database in Azure.
Steps to Connect SSMS to SQL Azure
1. Authenticate to the Azure Portal.
2. Click on SQL Databases.
3. Click on Servers.
4. Click on the name of the server you wish to connect to.
5. Click on Configure. If not already in place, click on ‘Add to the allowed IP addresses’ to add your current IP address (or specify an address you wish to connect from) and click ‘Save’.
6. Open SQL Management Studio and connect to Database services (usually comes up by default). Enter the fully qualified server name (servername.database.windows.net), change to SQL Server Authentication, enter the preferred login (if a new database, the username you specified when you created the DB server) and enter the correct password.
7. Hit the Connect button.
Troubleshooting
Ensure you have the appropriate ports open outbound from your local network or connection (typically port 1433).
Ensure you have allowed the correct public IP address you’re trying to connect from via the Azure Portal (steps 1-5 above).
Ensure you are using the correct server name and user name. For SSMS, the server name is the name shown in step 4 followed by .database.windows.net.
Ensure you are using SQL Server Authentication. For SSMS the username format is username@servername.
If you forgot the password of your username, you can reset the password in the Azure Portal: in step 4, click on Dashboard.
Lastly… you can click on the Database (in step 2) to see your connection options.
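If you later want to script against the same database outside SSMS, the identical connection details apply. Below is a minimal, illustrative JDBC sketch (not part of the original walkthrough): the server, database and credentials are placeholders, and it assumes the Microsoft JDBC driver for SQL Server is on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlAzureSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholders: substitute your own server, database, login and password.
        String url = "jdbc:sqlserver://myserver.database.windows.net:1433;"
                   + "databaseName=mydb;encrypt=true;loginTimeout=30";
        String user = "mylogin@myserver"; // SQL Azure expects the user@servername format
        String password = "myPassword";

        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT @@VERSION")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // prints the server version string
            }
        }
    }
}
Note the same troubleshooting checklist applies here: outbound port 1433 must be open and your public IP must be in the server's allowed list.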
September 25, 2013
by Rob Sanders
· 262,687 Views
The Real Cost of Change in Software Development
There are two widely opposed (and often misunderstood) positions on how expensive it can be to change or fix software once it has been designed, coded, tested and implemented. One holds that it is extremely expensive to leave changes until late, that the cost of change rises exponentially. The other position is that changes should be left as late as possible, because the cost of changing software is – or at least can be – essentially flat (that’s why we call it software). Which position is right? Why should we care? And what can we do about it? Exponential Cost of Change Back in the early 1980s, Barry Boehm published some statistics (Software Engineering Economics, 1981) which showed that the cost of making a software change or fix increases significantly over time – you can see the original curve that he published here. Boehm looked at data collected from Waterfall-based projects at TRW and IBM in the 1970s, and found that the cost of making a change increases as you move from the stages of requirements analysis to architecture, design, coding, testing and deployment. A requirements mistake found and corrected while you are still defining the requirements costs almost nothing. But if you wait until after you've finished designing, coding and testing the system and delivering it to the customer, it can cost up to 100 times as much. A few caveats here. First, the cost curve is much higher in large projects (in smaller projects, the cost curve is more like 1:4 instead of 1:100). Those cases when the cost of change rises up to 100 times are rare, what Boehm calls Architecture-Breakers, where the team gets a fundamental architectural assumption wrong (scaling, performance, reliability) and doesn't find out until after customers are already using the system and running into serious operational problems. This analysis was all done on a small data sample from more than 30 years ago, when developing code was much more expensive and time-consuming and paperworky, and the tools sucked. A few other studies have been done since then that mostly back up Boehm's findings – at least the basic idea that the longer it takes for you to find out that you made a mistake, the more expensive it is to correct it. These studies have been widely referenced in books like Steve McConnell’s Code Complete, and used to justify the importance of early reviews and testing: Studies over the last 25 years have proven conclusively that it pays to do things right the first time. Unnecessary changes are expensive. Researchers at Hewlett-Packard, IBM, Hughes Aircraft, TRW, and other organizations have found that purging an error by the beginning of construction allows rework to be done 10 to 100 times less expensively than when it's done in the last part of the process, during system test or after release (Fagan 1976; Humphrey, Snyder, and Willis 1991; Leffingwell 1997; Willis et al. 1998; Grady 1999; Shull et al. 2002; Boehm and Turner 2004). In general, the principle is to find an error as close as possible to the time at which it was introduced. The longer the defect stays in the software food chain, the more damage it causes further down the chain. Since requirements are done first, requirements defects have the potential to be in the system longer and to be more expensive. Defects inserted into the software upstream also tend to have broader effects than those inserted further downstream. That also makes early defects more expensive. 
There’s some controversy over how accurate and complete this data is, how much we can rely on it, and how relevant it is today when we have much better development tools and many teams have moved from heavyweight sequential Waterfall development to lightweight iterative, incremental development approaches.
Flattening the Cost of Changing Code
The rules of the game should change with iterative and incremental development – because they have to. Boehm realized back in the 1980s that we could catch more mistakes early (and therefore reduce the cost of development) if we think about risks upfront and design and build software in increments, using what he called the Spiral Model, rather than trying to define, design and build software in a Waterfall sequence. The same ideas are behind more modern, lighter Agile development approaches. In Extreme Programming Explained (the first edition, but not the second) Kent Beck states that minimizing the cost of change is one of the goals of Extreme Programming, and that a flattened change cost curve is “the technical premise of XP”:
Under certain circumstances, the exponential rise in the cost of changing software over time can be flattened. If we can flatten the curve, old assumptions about the best way to develop software no longer hold … You would make big decisions as late in the process as possible, to defer the cost of making the decisions and to have the greatest possible chance that they would be right. You would only implement what you had to, in hopes that the needs you anticipate for tomorrow wouldn't come true. You would introduce elements to the design only as they simplified existing code or made writing the next bit of code simpler.
It’s important to understand that Beck doesn't say that with XP the change curve is flat. He says that these costs can be flattened if teams work toward this, leveraging key practices and principles in XP, such as:
Simple Design, doing the simplest thing that works, and deferring design decisions as late as possible (YAGNI), so that the design is easy to understand and easy to change
Continuous, disciplined refactoring to keep the code easy to understand and easy to change
Test-First Development – writing automated tests upfront to catch coding mistakes immediately, and to build up a testing safety net to catch mistakes in the future
Developers collaborating closely and constantly with the customer to confirm their understanding of what they need to build, and working together in pairs to design solutions and solve problems, and catch mistakes and misunderstandings early
Relying on working software over documentation to minimize the amount of paperwork that needs to be done with each change (write code, not specs)
The team’s experience working incrementally and iteratively – the more that people work and think this way, the better they will get at it.
All of this makes sense and sounds right, although there are no studies that back up these assertions, which is why Beck dropped this change curve discussion from the second edition of his XP book. But, by then, the idea that change could be flat with Agile development had already become accepted by many people.
The Importance of Feedback
Scott Ambler agrees that the cost curve can be flattened in Agile development, not because of Simple Design, but because of the feedback loops that are fundamental to iterative, incremental development.
Agile methods optimize feedback within the team, developers working closely together with each other and with the customer and relying on continuous face-to-face communications. Following technical practices like test-first development, pair programming and continuous integration makes these feedback loops even tighter. But what really matters is getting feedback from the people using the system – it’s only then that you know if you got it right or what you missed. The longer that it takes to design and build something and get feedback from real users, the more time and work that is required to get working software into a real customer’s hands, the higher your cost of change really is. Optimizing and streamlining this feedback loop is what is driving the lean startup approach to development: defining a minimum viable product (something that just barely does the job), getting it out to customers as quickly as you can, and then responding to user feedback through continuous deployment and A/B testing techniques until you find out what customers really want. Even Flat Change Can Still Be Expensive Even if you do everything to optimize these feedback loops and minimize your overheads, this still doesn’t mean that change will come cheap. Being fast isn’t good enough if you make too many mistakes along the way. The Post Agilist uses the example of painting a house: Assume that it costs $1,000 each time you paint the house, whether you paint it blue, red or white. The cost of change is flat. But if you have to paint it blue first, then red, then white before everyone is happy, you’re wasting time and money. “No matter how expensive or cheap the "cost of change" curve may be, the fewer changes that are made, the cheaper and faster the result will be … Planning is not a four letter word.” (However, I would like to point out that “plan” is.) Spending too much time upfront in planning and design is waste. But not spending enough time upfront to find out what you should be building and how you should be building it before you build it, and not taking the care to build it carefully, is also a waste. Change Gets More Expensive Over Time You also have to accept that the incremental cost of change will go up over the life of a system, especially once a system is being used. This is not just a technical debt problem. The more people using the system, the more people who might be impacted by the change if you get it wrong, the more careful you have to be. This means that you need to spend more time on planning and communicating changes, building and testing a roll-back capability, and roll changes out slowly using canary releases and dark launching – which add costs and delays to getting feedback. There are also more operational dependencies that you have to understand and take care of, and more data that you have to change or fix up, making changes even more difficult and expensive. If you do things right, keep a good team together and manage technical debt responsibly, these costs should rise gently over the life of a system – and if you don’t, that exponential change curve will kick in. What is the real cost of change? Is the real cost of change exponential, or is it flat? The truth is somewhere in between. There’s no reason that the cost of making a change to software has to be as high as it was 30 years ago. We can definitely do better today, with better tools and better, cheaper ways of developing software. 
The keys to minimizing the costs of change seem to be: Get your software into customer hands as quickly as you can. I am not convinced that any organization really needs to push out software changes 10 to 50 to 100 times a day, but you don’t want to wait months or years for feedback, either. Deliver less, but more often. And because you’re going to deliver more often, it makes sense to build a continuous delivery pipeline so that you can push changes out efficiently and with confidence. Use ideas from lean software development and maybe Kanban to identify and eliminate waste and to minimize cycle time. We know that, even with lots of upfront planning and design thinking, we won’t get everything right upfront -- this is the Waterfall fallacy. But it’s also important not to waste time and money iterating when you don’t need to. Spending enough time upfront in understanding requirements and in design to get it at least mostly right the first time can save a lot later on. Whether you’re working incrementally and iteratively, or sequentially, it makes good sense to catch mistakes early when you can, whether you do this through test-first development and pairing, or requirements workshops and code reviews -- whatever works for you.
September 20, 2013
by Jim Bird
· 20,284 Views
Top 10 Methods for Java Arrays
The following are the top 10 methods for Java arrays. They are the most-voted questions from Stack Overflow.
0. Declare an array
String[] aArray = new String[5];
String[] bArray = {"a", "b", "c", "d", "e"};
String[] cArray = new String[]{"a", "b", "c", "d", "e"};
1. Print an array in Java
int[] intArray = { 1, 2, 3, 4, 5 };
String intArrayString = Arrays.toString(intArray);
// printing the array directly prints the reference value
System.out.println(intArray); // [I@7150bd4d
System.out.println(intArrayString); // [1, 2, 3, 4, 5]
2. Create an ArrayList from an array
String[] stringArray = { "a", "b", "c", "d", "e" };
ArrayList<String> arrayList = new ArrayList<String>(Arrays.asList(stringArray));
System.out.println(arrayList); // [a, b, c, d, e]
3. Check if an array contains a certain value
String[] stringArray = { "a", "b", "c", "d", "e" };
boolean b = Arrays.asList(stringArray).contains("a");
System.out.println(b); // true
4. Concatenate two arrays
int[] intArray = { 1, 2, 3, 4, 5 };
int[] intArray2 = { 6, 7, 8, 9, 10 };
// Apache Commons Lang library
int[] combinedIntArray = ArrayUtils.addAll(intArray, intArray2);
5. Declare an array inline
method(new String[]{"a", "b", "c", "d", "e"});
6. Join the elements of the provided array into a single String
// Apache Commons Lang
String j = StringUtils.join(new String[] { "a", "b", "c" }, ", ");
System.out.println(j); // a, b, c
7. Convert an ArrayList to an array
String[] stringArray = { "a", "b", "c", "d", "e" };
ArrayList<String> arrayList = new ArrayList<String>(Arrays.asList(stringArray));
String[] stringArr = new String[arrayList.size()];
arrayList.toArray(stringArr);
for (String s : stringArr)
    System.out.println(s);
8. Convert an array to a Set
Set<String> set = new HashSet<String>(Arrays.asList(stringArray));
System.out.println(set); // [d, e, b, c, a]
9. Reverse an array
int[] intArray = { 1, 2, 3, 4, 5 };
ArrayUtils.reverse(intArray);
System.out.println(Arrays.toString(intArray)); // [5, 4, 3, 2, 1]
10. Remove an element from an array
int[] intArray = { 1, 2, 3, 4, 5 };
int[] removed = ArrayUtils.removeElement(intArray, 3); // creates a new array
System.out.println(Arrays.toString(removed));
One more – convert an int to a byte array
byte[] bytes = ByteBuffer.allocate(4).putInt(8).array();
for (byte t : bytes) {
    System.out.format("0x%x ", t);
}
In addition, do you know what arrays look like in memory?
September 18, 2013
by Ryan Wang
· 70,745 Views · 2 Likes
Solving the Detached Many-to-Many Problem with the Entity Framework
Introduction
This article is part of the ongoing series I’ve been writing recently, but can be read as a standalone article. I’m going to do a better job of integrating the changes documented here into the ongoing solution I’ve been building. However, considering how much time and effort I put into solving this issue, I’ve decided to document the approach independently in case it is of use to others in the interim.
The Problem Defined
This issue presents itself when you are dealing with disconnected/detached Entity Framework POCO objects, as the DbContext doesn’t track changes to entities. Specifically, trouble occurs with entities participating in a many-to-many relationship, where the EF has hidden a “join table” from the model itself. The problem with detached entities is that the data context has no way of knowing what changes have been made to an object graph, without fetching the data from the data store and doing an entity-by-entity comparison – and that’s assuming it’s possible to fetch it the same way as it was originally. In this solution, all the entities are detached, don’t use proxy types and are designed to move between WCF service boundaries.
Some Inspiration
There are no out-of-the-box solutions that I’m aware of which can process POCO object graphs that are detached. I did find an interesting solution called GraphDiff which is available from GitHub and also as a NuGet package, but it didn’t work with the latest RC version of the Entity Framework (v6). I also found a very comprehensive article on how to implement a generic repository pattern with the Entity Framework, but it was unable to handle detached many-to-many relationships. In any case, I highly recommend a read of this article; it was the inspiration for some of the approach I’ve ended up taking with my own design.
The Approach
This morning I put together a simple data model with the relationships that I wanted to support with detached entities. I’ve attached the solution with a sample schema and test data at the bottom of this article. If you prefer to open and play with it, be sure to add the Entity Framework (v6 RC) via NuGet (I’ve omitted it for file size and licensing reasons). Here’s a logical view of the model I wanted to support: Here’s the schema view from SQL Server: Here’s the Entity Model which is generated from the above SQL schema: In the spirit of punching myself in the head, I’ve elected to have one table implement an identity specification (meaning the underlying schema allocates PK ID values) whereas for the other two tables the ID must be specified. Theoretically, if I can handle the entity types in a generic fashion, then this solution can scale out to larger and more complex models. The scenarios I’m specifically looking to solve in this solution with detached object graphs are as follows:
Add a relationship (many-to-many)
Add a relationship (FK-based)
Update a related entity (many-to-many)
Update a related entity (FK-based)
Remove a relationship (many-to-many)
Remove a relationship (FK-based)
Per the above, here are the scenarios within the context of the above data model:
Add a new Secondary entity to a Primary entity
Add an Other entity to a Secondary entity
Update a Secondary entity by updating a Primary entity
Update an Other entity from a Secondary entity (or Primary entity)
Remove (but not delete!) a Secondary entity from a Primary entity
Remove (but not delete) an Other entity from a Secondary entity
Establishing Test Data
Just to give myself a baseline, the data model is populated (by default) with the following data. This gives us some “existing entities” to query and modify.
More Work for the Consumer
Although I tried my best, I couldn’t come to a design which didn’t require the consuming client to do slightly more work to enable this to work properly. Unfortunately the best place for change tracking to occur with disconnected entities is with the layer making changes – be it a business layer or something downstream. To this effect, entities will need to implement a property which reflects the state of the entity (added, modified, deleted etc.). For the object graph to be updated/managed successfully, the consumer of the entities needs to set the entity state properly. This isn’t at all as bad as it sounds, but it’s not nothing.
Establishing Some Scaffolding
After generating the data model, the first thing to be done is to ensure each entity derives from the same base class (“EntityBase”); this is used later to establish the active state of an entity when it needs to be processed. I’ve also created an enum (“ObjectState”) which is a property of the base class, and a helper function which maps ObjectState to an EF EntityState. In case this isn’t clear, here’s a class view:
Constructing Data Access
To ensure that the usage is consistent, I’ve defined a single Data Access class, mainly to establish the pattern for handling detached object graphs. I can’t stress enough that this is not intended as a guide to an appropriate way to structure your data access – I’ll be updating my ongoing series of articles to go into more detail – this is only to articulate a design approach to handling detached object graphs. Having said all that, here’s a look at my “DataAccessor” class, which can be used with generic data access entities (by way of generics). As with my ongoing project, the Entity Framework DbContext is instantiated by this class on construction, and the class implements IDisposable to ensure the DbContext is disposed of properly. Here’s the constructor showing the EF configuration options I’m using:
public DataAccessor()
{
    _accessor = new SampleEntities();
    _accessor.Configuration.LazyLoadingEnabled = false;
    _accessor.Configuration.ProxyCreationEnabled = false;
}
Updating an Entity
We start with a basic scenario to ensure that the scaffolding has been implemented properly. The scenario is to query for a Primary entity and then change a property and update the entity in the data store.
[TestMethod]
public void UpdateSingleEntity()
{
    Primary existing = null;
    String existingValue = String.Empty;
    using (DataAccessor a = new DataAccessor())
    {
        existing = a.DataContext.Primaries.Include("Secondaries").First();
        Assert.IsNotNull(existing);
        existingValue = existing.Title;
        existing.Title = "Unit " + DateTime.Now.ToString("MMdd hh:mm:ss");
    }
    using (DataAccessor b = new DataAccessor())
    {
        existing.State = ObjectState.Modified;
        b.InsertOrUpdate(existing);
    }
    using (DataAccessor c = new DataAccessor())
    {
        existing.Title = existingValue;
        existing.State = ObjectState.Modified;
        c.InsertOrUpdate(existing);
    }
}
You’ll notice that there is nothing particularly significant here, except that the object’s State is reset to Modified between operations.
Updating a Many-to-Many Relationship
Now things get interesting.
I’m going to query for a Primary entity, then I’ll update both a property of the Primary entity itself, and a property of one of the entity’s relationships.
[TestMethod]
public void UpdateManyToMany()
{
    Primary existing = null;
    Secondary other = null;
    String existingValue = String.Empty;
    String existingOtherValue = String.Empty;
    using (DataAccessor a = new DataAccessor())
    {
        // Note that we include the navigation property in the query
        existing = a.DataContext.Primaries.Include("Secondaries").First();
        Assert.IsTrue(existing.Secondaries.Count() > 1, "Should be at least 1 linked item");
    }
    // save the original description
    existingValue = existing.Description;
    // set a new dummy value (with a date/time so we can see it working)
    existing.Description = "Edit " + DateTime.Now.ToString("yyyyMMdd hh:mm:ss");
    existing.State = ObjectState.Modified;
    other = existing.Secondaries.First();
    // save the original value
    existingOtherValue = other.AlternateDescription;
    // set a new value
    other.AlternateDescription = "Edit " + DateTime.Now.ToString("yyyyMMdd hh:mm:ss");
    other.State = ObjectState.Modified;
    // a new data access class (new DbContext)
    using (DataAccessor b = new DataAccessor())
    {
        // single method to handle inserts and updates
        // set a breakpoint here to see the result in the DB
        b.InsertOrUpdate(existing);
    }
    // return the values to the original ones
    existing.Description = existingValue;
    other.AlternateDescription = existingOtherValue;
    existing.State = ObjectState.Modified;
    other.State = ObjectState.Modified;
    using (DataAccessor c = new DataAccessor())
    {
        // update the entities back to normal
        // set a breakpoint here to see the data before it reverts back
        c.InsertOrUpdate(existing);
    }
}
If we actually run this unit test and set the breakpoints accordingly, you’ll see the following in the database:
Database at Breakpoint #1 / Database at Breakpoint #2
Database when Unit Test completes
You’ll notice at the second breakpoint that the descriptions of the first entities have both been updated.
Examining the Insert/Update Code
The function exposed by the “data access” class really just passes through to another private function which does the heavy lifting. This is mainly in case we need to reuse the logic, since it essentially processes state actions on attached entities.
public void InsertOrUpdate<T>(params T[] entities) where T : EntityBase
{
    ApplyStateChanges(entities);
    DataContext.SaveChanges();
}
Here’s the definition of the ApplyStateChanges function, which I’ll discuss below:
private void ApplyStateChanges<T>(params T[] items) where T : EntityBase
{
    DbSet<T> dbSet = DataContext.Set<T>();
    foreach (T item in items)
    {
        // loads related entities into the current context
        dbSet.Attach(item);
        if (item.State == ObjectState.Added || item.State == ObjectState.Modified)
        {
            dbSet.AddOrUpdate(item);
        }
        else if (item.State == ObjectState.Deleted)
        {
            dbSet.Remove(item);
        }
        foreach (DbEntityEntry<EntityBase> entry in DataContext.ChangeTracker.Entries<EntityBase>()
            .Where(c => c.Entity.State != ObjectState.Processed && c.Entity.State != ObjectState.Unchanged))
        {
            var y = DataContext.Entry(entry.Entity);
            y.State = HelperFunctions.ConvertState(entry.Entity.State);
            entry.Entity.State = ObjectState.Processed;
        }
    }
}
Notes on this Implementation
What this function does is iterate through the items to be examined, attach them to the current Data Context (which also attaches their children), act on each item accordingly (add/update/remove) and then process new entities which have been added to the Data Context’s change tracker.
For each newly “discovered” entity (and ignoring entities which are unchanged or have already been examined), each entity’s DbEntityEntry is set according to the entity’s ObjectState (which is set by the calling client). Doing this allows the Entity Framework to understand what actions it needs to perform on the entities when SaveChanges() is invoked later. You’ll also note that I set the entity’s state to “Processed” when it has been examined, so we don’t act on it more than once (for performance purposes). Fun note: the AddOrUpdate extension method is something I found in the System.Data.Entity.Migrations namespace and it acts as an ‘Upsert’ operation, inserting or updating entities depending on whether they exist or not already. Bonus! That’s it for adding and updating, believe it or not.
Corresponding Unit Test
The following unit test establishes the creation of a new many-to-many entity; it is then removed (by relationship) and then finally deleted altogether from the database:
[TestMethod]
public void AddRemoveRelationship()
{
    Primary existing = null;
    using (DataAccessor a = new DataAccessor())
    {
        existing = a.DataContext.Primaries.Include("Secondaries").FirstOrDefault();
        Assert.IsNotNull(existing);
    }
    Secondary newEntity = new Secondary();
    newEntity.State = ObjectState.Added;
    newEntity.AlternateTitle = "Unit";
    newEntity.AlternateDescription = "Test";
    newEntity.SecondaryId = 1000;
    existing.Secondaries.Add(newEntity);
    using (DataAccessor a = new DataAccessor())
    {
        // breakpoint #1 here
        a.InsertOrUpdate(existing);
    }
    newEntity.State = ObjectState.Unchanged;
    existing.State = ObjectState.Modified;
    using (DataAccessor b = new DataAccessor())
    {
        // breakpoint #2 here
        b.RemoveEntities(existing, x => x.Secondaries, newEntity);
    }
    using (DataAccessor c = new DataAccessor())
    {
        // breakpoint #3 here
        c.Delete(newEntity);
    }
}
Test Results:
Pre-Test – Breakpoint #1 / Breakpoint #2
Breakpoint #3 / Post execution (new entity deleted)
SQL Profile Trace
Removing a Many-to-Many Relationship
Now this is where it gets tricky. I’d like to have something a little more polished, but the best I have come up with to date is a separate operation on the data provider which exposes functionality akin to “remove relationship”. The fundamental problem with how the EF POCO entities work without any modifications is that, when they are detached, removing a many-to-many relationship means physically removing the related entity from the collection. When the object graph is sent back for processing, there’s a missing related entity, and the service or data context would have to make an assumption that the omission was on purpose, not to mention that it would have to compare against data currently in the data store. To make this easier, I’ve implemented a function called “RemoveEntities” which alters the relationship between the parent and the child/children. The one big catch is that you need to specify the navigation property or collection, which might make it slightly undesirable to implement generically. In any case, I’ve provided two options – with the navigation property as a string parameter or as a LINQ expression – they both do the same thing.
public void RemoveEntities<T, T2>(T parent, Expression<Func<T, object>> expression, params T2[] children)
    where T : EntityBase
    where T2 : EntityBase
{
    DataContext.Set<T>().Attach(parent);
    ObjectContext obj = DataContext.ToObjectContext();
    foreach (T2 child in children)
    {
        DataContext.Set<T2>().Attach(child);
        obj.ObjectStateManager.ChangeRelationshipState(parent, child, expression, EntityState.Deleted);
    }
    DataContext.SaveChanges();
}
Notes on this Implementation
The “ToObjectContext” is an extension method, and is akin to (DataContext as IObjectContextAdapter).ObjectContext. This is to expose a more fundamental part of the Entity Framework’s object model. We need this level of access to get to the functionality which controls relationships. For each child to be removed (note: not deleted from the physical database), we nominate the parent object, the child, the navigation property (collection) and the nature of the relationship change (delete). Note that this will NOT WORK for Foreign Key defined relationships – more on that below. To delete entities which have active relationships, you’ll need to drop the relationship before attempting to delete, or else you’ll have data integrity/referential integrity errors, unless you have accounted for cascading deletion (which I haven’t). Example execution:
using (DataAccessor c = new DataAccessor())
{
    //c.RemoveEntities(existing, "Secondaries", s);
    // (or can use an expression):
    c.RemoveEntities(existing, x => x.Secondaries, s);
}
Removing FK Relationships
As mentioned above, you can’t just edit the relationship to remove an FK-based relationship. Instead, you have to follow the EF practice of setting the FK entity to NULL. Here’s a Unit Test which demonstrates how this is achieved:
Secondary s = ExistingEntity();
using (DataAccessor c = new DataAccessor())
{
    s.Other = null;
    s.OtherId = null;
    s.State = ObjectState.Modified;
    o.State = ObjectState.Unchanged;
    c.InsertOrUpdate(s);
}
We use the same “Insert or Update” call – being aware that you have to set the ObjectState properties accordingly. Note: I’m in the process of testing the reverse removal – i.e. what happens if you want to remove a Secondary entity from an Other entity’s collection.
Deleting Entities
This is fairly straightforward, but I’ve taken a few more precautions to ensure that the entity to be deleted is valid on the server side.
public void Delete<T>(params T[] entities) where T : EntityBase
{
    foreach (T entity in entities)
    {
        T attachedEntity = Exists(entity);
        if (attachedEntity != null)
        {
            var attachedEntry = DataContext.Entry(attachedEntity);
            attachedEntry.State = EntityState.Deleted;
        }
    }
    DataContext.SaveChanges();
}
To understand the above, you should take a look at the implementation of the “Exists” function, which essentially checks the data store and local cache to see if there is an attached representation:
protected T Exists<T>(T entity) where T : EntityBase
{
    var objContext = ((IObjectContextAdapter)this.DataContext).ObjectContext;
    var objSet = objContext.CreateObjectSet<T>();
    var entityKey = objContext.CreateEntityKey(objSet.EntitySet.Name, entity);
    DbSet<T> set = DataContext.Set<T>();
    var keys = (from x in entityKey.EntityKeyValues select x.Value).ToArray();
    // Remember, there can be surrogate keys, so don't assume there's
    // just one column/one value.
    // If a surrogate key isn't ordered properly, the Set<T>().Find()
    // method will fail; use attributes on the entity to determine the
    // proper order.
    //context.Configuration.AutoDetectChangesEnabled = false;
    return set.Find(keys);
}
This is a fairly expensive operation, which is why it’s pretty much reserved for deletes and not more frequent operations. It essentially determines the target entity’s primary key and then checks whether the entity exists or not. Note: I haven’t tested this on entities with surrogate keys, but I’ll get to it at some point. If you have surrogate key tables, you can define the PK key order using attributes on the model entity, but I haven’t done this (yet).
Summary
This article is the culmination of about two days of heavy analysis and investigation. I’ve got a whole lot more to contribute on this topic, but for now, I felt it was worthy enough to post as-is. What you’ve got here is still incredibly rough, and I haven’t done nearly enough testing. To be honest, I was quite excited by the initial results, which is why I decided to write this post. There’s an incredibly good chance that I’ve missed something in the design and implementation, so please be aware of that. I’ll be continuing to refine this approach in my main series of articles with a much cleaner implementation. In the meantime though, if any of this helps anyone out there struggling with detached entities, I hope it helps. There are precious few articles and samples that are up to date, and very few that seem to work. This is provided without any warranty of any kind! If you find any issues please e-mail me at rob.sanders@sanderstechnology.com and I’ll attempt to refactor/debug and find ways around some of the inherent limitations. In the meantime, there are a few helpful links I’ve come across in my travels on the WWW. See below.
Example Solution Files
[ Files ]
Note: you’ll need to add the Entity Framework v6 RC package via NuGet; I haven’t included it in the archive.
Helpful Links
http://blog.magnusmontin.net/2013/05/30/generic-dal-using-entity-framework/
https://github.com/refactorthis/GraphDiff
http://stackoverflow.com/questions/11686225/dbset-find-method-ridiculously-slow-compared-to-singleordefault-on-id
http://stackoverflow.com/questions/10381106/cannot-update-many-to-many-relationships-in-entity-framework
http://stackoverflow.com/questions/8413248/how-to-save-an-updated-many-to-many-collection-on-detached-entity-framework-4-1
http://stackoverflow.com/questions/6018711/generic-way-to-check-if-entity-exists-in-entity-framework
September 18, 2013
by Rob Sanders
· 162,826 Views
Introduction to ElasticSearch
Learn about ElasticSearch, an open source tool developed with Java. It is a Lucene-based, scalable, full-text search engine, and a data analysis tool.
September 17, 2013
by Hüseyin Akdoğan
· 12,113 Views · 5 Likes
EasyNetQ: Big Breaking Changes in the Advanced Bus
EasyNetQ is my little, easy to use, client API for RabbitMQ. It’s been doing really well recently. As I write this, it has 24,653 downloads on NuGet, making it by far the most popular high-level RabbitMQ API. The goal of EasyNetQ is to make working with RabbitMQ as easy as possible. I wanted junior developers to be able to use basic messaging patterns out-of-the-box with just a few lines of code and have EasyNetQ do all the heavy lifting: exchange-binding-queue configuration, error management, connection management, serialization, thread handling; all the things that make working against the low level AMQP C# API, provided by RabbitMQ, such a steep learning curve. To meet this goal, EasyNetQ has to be a very opinionated library. It has a set way of configuring exchanges, bindings and queues based on the .NET type of your messages. However, right from the first release, many users said that they liked the connection management, thread handling, and error management, but wanted to be able to set up their own broker topology. To support this, we introduced the advanced API, an idea stolen shamelessly from Ayende’s RavenDB client. You access the advanced bus (IAdvancedBus) via the Advanced property on IBus:
var advancedBus = RabbitHutch.CreateBus("host=localhost").Advanced;
Sometimes something can seem like a good idea at the time, and then later you think, “WTF! Why on earth did I do that?” It happens to me all the time. I thought it would be cool if I created the exchange-binding-queue topology and then passed it to the publish and subscribe methods, which would then internally declare the exchanges and queues and do the binding. I implemented a tasty little visitor pattern in my ITopologyVisitor. I optimized for my own programming pleasure, rather than a simple, obvious, easy-to-understand API. I realized a while ago that a more straightforward set of declares on IAdvancedBus would be a far more obvious and intentional design. To this end, I’ve refactored the advanced bus to separate declares from publishing and consuming. I just pushed the changes to NuGet and have also updated the Advanced Bus documentation. Note that these are breaking changes, so please be careful if you are upgrading to the latest version, 0.12, and upwards. Here is a taste of how it works. Declare a queue, exchange and binding, and consume raw message bytes:
var advancedBus = RabbitHutch.CreateBus("host=localhost").Advanced;
var queue = advancedBus.QueueDeclare("my_queue");
var exchange = advancedBus.ExchangeDeclare("my_exchange", ExchangeType.Direct);
advancedBus.Bind(exchange, queue, "routing_key");
advancedBus.Consume(queue, (body, properties, info) => Task.Factory.StartNew(() =>
{
    var message = Encoding.UTF8.GetString(body);
    Console.Out.WriteLine("Got message: '{0}'", message);
}));
Note that I’ve renamed ‘Subscribe’ to ‘Consume’ to better reflect the underlying AMQP method.
Declare an exchange and publish a message:
var advancedBus = RabbitHutch.CreateBus("host=localhost").Advanced;
var exchange = advancedBus.ExchangeDeclare("my_exchange", ExchangeType.Direct);
using (var channel = advancedBus.OpenPublishChannel())
{
    var body = Encoding.UTF8.GetBytes("Hello World!");
    channel.Publish(exchange, "routing_key", new MessageProperties(), body);
}
You can also delete exchanges, queues and bindings:
var advancedBus = RabbitHutch.CreateBus("host=localhost").Advanced;
// declare some objects
var queue = advancedBus.QueueDeclare("my_queue");
var exchange = advancedBus.ExchangeDeclare("my_exchange", ExchangeType.Direct);
var binding = advancedBus.Bind(exchange, queue, "routing_key");
// and then delete them
advancedBus.BindingDelete(binding);
advancedBus.ExchangeDelete(exchange);
advancedBus.QueueDelete(queue);
advancedBus.Dispose();
I think these changes make for a much better advanced API. Have a look at the documentation for the details.
September 13, 2013
by Mike Hadlow
· 11,672 Views
Introducing the Spring YARN framework for Developing Apache Hadoop YARN Applications
Originally posted on the SpringSource blog by Janne Valkealahti. We're super excited to let the cat out of the bag and release support for writing YARN-based applications as part of the Spring for Apache Hadoop 2.0 M1 release. In this blog post I’ll introduce you to YARN, what you can do with it, and how Spring simplifies the development of YARN-based applications. If you have been following the Hadoop community over the past year or two, you’ve probably seen a lot of discussions around YARN and the next version of Hadoop's MapReduce called MapReduce v2. YARN (Yet Another Resource Negotiator) is a component of the MapReduce project created to overcome some performance issues in Hadoop's original design. The fundamental idea of MapReduce v2 is to split the functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and a per-application Application Master (AM). A generic diagram for YARN component dependencies can be found in the YARN architecture documentation. MapReduce Version 2 is an application running on top of YARN. It is also possible to make a similar custom YARN-based application which has nothing to do with MapReduce; it is simply a running YARN application. However, writing a custom YARN-based application is difficult. The YARN APIs are low-level infrastructure APIs and not developer APIs. Take a look at the documentation for writing a YARN application to get an idea of what is involved. Starting with the 2.0 version, Spring for Apache Hadoop introduces the Spring YARN sub-project to provide support for building Spring-based YARN applications. This support for YARN steps in by trying to make development easier. “Spring handles the infrastructure so you can focus on your application” applies to writing Hadoop applications as well as other types of Java applications. Spring’s YARN support also makes it easier to test your YARN application. With Spring’s YARN support, you're going to use all the familiar concepts of the Spring Framework itself, including configuration and, generally speaking, what you can do in your application. At a high level, Spring YARN provides three different components, YarnClient, YarnAppmaster and YarnContainer, which together can be called a Spring YARN Application. We provide default implementations for all components while still giving the end user an option to customize as much as he or she wants. Let's take a quick look at a very simplistic Spring YARN application which runs some custom code in a Hadoop cluster. The YarnClient is used to communicate with YARN's Resource Manager. This provides management actions like submitting new application instances, listing applications and killing running applications. When submitting applications from the YarnClient, the main concerns relate to how the Application Master is configured and launched. Both the YarnAppmaster and YarnContainer share the same common launch context config logic, so you'll see a lot of similarities in YarnClient and YarnAppmaster configuration. Similar to how the YarnClient will define the launch context for the YarnAppmaster, the YarnAppmaster defines the launch context for the YarnContainer. The launch context defines the commands to start the container, localized files, command line parameters, environment variables and resource limits (memory, CPU). The YarnContainer is a worker that does the heavy lifting of what a YARN application will actually do.
The YarnAppmaster communicates with the YARN Resource Manager and starts and stops YarnContainers accordingly. You can create a Spring application that launches an ApplicationMaster by using the YARN XML namespace to define a Spring Application Context. Context configuration for the YarnClient defines the launch context for the YarnAppmaster. This includes resources and libraries needed by the YarnAppmaster and its environment settings. An example of this is shown below. Note: Future releases will provide a Java-based API for configuration, similar to what is done in Spring Security 3.2. The purpose of the YarnAppmaster is to control the instance of a running application. The YarnAppmaster is responsible for controlling the lifecycle of all its YarnContainers, the whole running application after the application is submitted, as well as itself. The example above is defining a context configuration for the YarnAppmaster. Similar to what we saw in the YarnClient configuration, we define local resources for the YarnContainer and its environment. The classpath setting picks up Hadoop jars as well as your own application jars in default locations; change the setting if you want to use non-default directories. Also within the YarnAppmaster we define components handling the container allocation and bootstrapping. The Allocator component interacts with the YARN Resource Manager to handle resource scheduling. The Runner component is responsible for bootstrapping the allocated containers. The above example defines a simple YarnContainer context configuration. To implement the functionality of the container, you implement the interface YarnContainer. The YarnContainer interface is similar to Java’s Runnable interface: it has a run() method, as well as two additional methods related to getting environment and command line information. Below is a simple hello world application that will be run inside of a YARN container:
public class MyCustomYarnContainer implements YarnContainer {
    private static final Log log = LogFactory.getLog(MyCustomYarnContainer.class);

    @Override
    public void run() {
        log.info("Hello from MyCustomYarnContainer");
    }

    @Override
    public void setEnvironment(Map environment) {}

    @Override
    public void setParameters(Properties parameters) {}
}
We just showed the configuration of a Spring YARN Application and the core application logic, so what remains is how to bootstrap the application to run inside the Hadoop cluster. The utility class CommandLineClientRunner provides this functionality. You can use CommandLineClientRunner either manually from a command line or use it from your own code (see the short sketch after the project layout below).
# java -cp org.springframework.yarn.client.CommandLineClientRunner application-context.xml yarnClient -submit
A Spring YARN Application is packaged into a jar file which can then be transferred into HDFS with the rest of the dependencies. A YarnClient can transfer all needed libraries into HDFS during the application submit process, but generally speaking it is more advisable to do this manually in order to avoid unnecessary network I/O. Your application won't change until a new version is created, so it can be copied into HDFS prior to the first application submit. You can, for example, use Hadoop's hdfs dfs -copyFromLocal command. Below you can see an example of a typical project setup.
src/main/java/org/example/MyCustomYarnContainer.java src/main/resources/application-context.xml src/main/resources/appmaster-context.xml src/main/resources/container-context.xml As a wild guess, we'll make a bet that you have now figured out that you are not actually configuring YARN, instead you are configuring Spring Application Contexts for all three components, YarnClient, YarnAppmaster and YarnContainer. We have just scratched the surface of what we can do with Spring YARN. While we’re preparing more blog posts, go ahead and check existing samples in GitHub. Basically, to reflect the concepts we described in this blog post, see the multi-context example in our samples repository. Future blog posts will cover topics like Unit Testing and more advanced YARN application development.
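As a complement to the shell command above, the same submission can also be triggered from Java code. A minimal sketch, assuming the standard main entry point of CommandLineClientRunner and reusing the context file, bean name and action shown earlier (the wrapper class name here is just an illustration):
import org.springframework.yarn.client.CommandLineClientRunner;

public class SubmitFromCode {
    public static void main(String[] args) throws Exception {
        // Same arguments as the command-line invocation above:
        // the client context file, the YarnClient bean name and the action to perform.
        CommandLineClientRunner.main(new String[] {
            "application-context.xml", "yarnClient", "-submit"
        });
    }
}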
September 11, 2013
by Pieter Humphrey
· 13,344 Views
How to shard a cron
Sharding is a database partitioning technique that distributes Aggregates such as rows or documents across multiple servers; this choice for horizontal partitioning trades some client complexity (queries must include a shard key such as a zip code or a customer id) for the capability of distributing the dataset between multiple servers, scaling not only the read but also the write capacity. On the application side, there are several singleton processes - for example cron configurations - that are usually run only once on the whole data set. For a certain category of singleton processes we can switch to a shard-like architecture that can scale first to multiple processes and, when necessary, to multiple servers.
Step 1: identify the candidate process
Take a look at your crontab or at your process scheduling configuration if you use another infrastructure. Some of the processes are aggregations of data producing statistics, and their work can already be distributed with patterns such as MapReduce. The kind of process interesting for client sharding is the one where an operation is performed over every single element of the data set. Each element is an Aggregate and as such does not interact, in a single transaction, with other ones. Therefore these processes are intrinsically parallelizable: you only need a way to distribute the load. Some examples of shard-able processes are:
rebuild the data aggregates for a new day for each user
perform some consistency checks on every customer order
send all pending orders
perform a renewal for each user subscription
Step 2: choose a shard key
Once you have identified the aggregates along which to parallelize a lengthy operation, the choice of the shard key will usually be straightforward. This key must be uniformly distributed between the N shards you want to create on the client side. Some examples:
the zip code for customers
the numerical, sequential id for orders
a UUID for uploaded videos
the transaction reference number for money transactions
Step 3: divide the work with the shard key
A process before the application of sharding is usually composed of two phases:
select all Aggregates which satisfy condition C
apply operation O to all the selected Aggregates.
The first operation can sometimes be sharded directly, transforming each of the N processes into:
select all Aggregates which satisfy condition C and whose shard key, modulo N, is equal to this shard's number
apply operation O to all the selected Aggregates.
For example, the first operation for shard 0 of 4 can be accomplished by an SQL query:
SELECT * FROM aggregate_table
WHERE outdated = true    -- condition C
AND aggregate_id % 4 = 0 -- sharding
Precalculating aggregate_id % 4 can improve the performance of the query, depending on your database; however, it can make it more difficult to rescale the number of processes. When you switch to 8 or 16 client shards it will be necessary to stop all currently running processes, recalculate the column and restart the new batch. Furthermore, the performance of ALTER TABLE is usually not good on the large tables which are the subject of this client sharding technique. Sometimes we're not able to divide the query into a partition of the original data set directly in the database. The pattern becomes:
select the id (and shard key if different) for all Aggregates which satisfy condition C.
filter the subset by only considering the ids whose value modulo N is equal to this shard's number.
apply operation O to all the selected Aggregates.
For example, I use this second form when running multiple clients against MongoDB. I do not know if it's possible to query ObjectIds by their modulo, so I resort to selecting all of them and then filtering them out on the client side:
$id = (string) $document['_id'];
// without substr() the conversion would overflow 32-bit integers
$numericalValue = hexdec(substr($id, -4));
if ($numericalValue % $shards == $shard) {
    // ...
}
This is only useful under the assumption that it is not the query that's taking too much time in the original process, but the application of the O operation to all of its results. Note also that these solutions need well-behaved processes that do not intervene on each other's data: the filtering is left to the programmer. In general, it is also necessary to guarantee mutual exclusion with the original singleton process; this usually comes up when deploying the battery of N crons, as they should not start until the last run of the singleton process has terminated, and the singleton must not be started again.
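The generic three-step pattern above translates directly into other languages as well. Here is a short, illustrative Java sketch; the shard count, the candidate IDs and the process() operation are hypothetical placeholders rather than part of the original article:
import java.util.Arrays;
import java.util.List;

public class ShardedCronWorker {

    private final int shards; // total number of client shards (N)
    private final int shard;  // the shard number this process is responsible for

    public ShardedCronWorker(int shards, int shard) {
        this.shards = shards;
        this.shard = shard;
    }

    public void run(List<Long> candidateIds) {
        for (long id : candidateIds) {   // step 1: ids of Aggregates satisfying condition C
            if (id % shards == shard) {  // step 2: keep only this shard's ids
                process(id);             // step 3: apply operation O
            }
        }
    }

    private void process(long id) {
        // placeholder for the real operation O
        System.out.println("Processing aggregate " + id);
    }

    public static void main(String[] args) {
        // e.g. this process is shard 0 of 4
        new ShardedCronWorker(4, 0).run(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L));
    }
}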
September 4, 2013
by Giorgio Sironi
· 6,505 Views
Efficient Techniques For Loading Data Into Memory
Data loading usually has to do with initializing cache data on start up. However, quite often caches need to be loaded or reloaded periodically, and not only on start up. In cases in which you need to load lots of data, either at start up or at any point afterward, using standard cache put(...) or putAll(...) operations is generally inefficient, especially when transactional boundaries are not important. This is especially true when data has to be partitioned across the network, so you don't know in advance on which node the data will end up. For fast loading of large amounts of data, GridGain provides a cool mechanism called the data loader (implemented via GridDataLoader). The data loader will properly batch keys together and collocate those batches with the nodes on which the data will be cached. By controlling the size of the batch and the size of internal transactions it is possible to achieve very fast data loading rates. The code below shows an example of how it can be done:
// Get the data loader reference.
try (GridDataLoader ldr = grid.dataLoader("partitioned")) {
    // Load the entries.
    for (int i = 0; i < ENTRY_COUNT; i++)
        ldr.addData(i, Integer.toString(i));
}
Whenever data is submitted to the data loader, it is stored in a buffer, which is consumed by loader threads. If the buffer is full, the user thread will block on the addData(...) call until the loader threads free enough room for new entries. Another method of data preloading is to load it directly from a persistent data store. GridGain supports that via the GridCache.loadCache(...) method. Note that this method of loading data into the cache is very efficient as it is local, non-transactional and is usually implemented using bulk data store operations. The reason it can afford to be non-transactional is that it will not override any values in the cache; it can only insert new values. This means that if some transaction has already updated an entry, this entry will not be overwritten by the loadCache(...) call. Whenever the GridCache.loadCache(...) method is called, it will internally delegate to the underlying persistent store implementation by invoking the GridCacheStore.loadAll(...) method. Usually the implementation of this method will load either a full or a partial set of objects from the DB, depending on requirements. Here is an example of how the GridCacheStore.loadAll(...) method may be implemented:
@Override
public void loadAll(@Nullable String cacheName, GridInClosure2 c, Object... args) throws GridException {
    try (Connection conn = getConnection()) {
        // Load all Persons from database (perhaps to warm up cache?)
        try (PreparedStatement st = conn.prepareStatement("select * from PERSONS")) {
            ResultSet rs = st.executeQuery();
            while (rs.next())
                c.apply(
                    // Key.
                    UUID.fromString(rs.getString(1)),
                    // New value.
                    person(rs.getString(1), rs.getString(2), rs.getString(3), rs.getString(4))
                );
        }
    }
    catch (SQLException e) {
        throw new GridException("Failed to load objects", e);
    }
}
Note that, instead of returning a collection of loaded entries, this method passes each loaded entry into the closure provided by the system, which avoids costly large collection creations and internal resizing. GridGain will then take the values passed into the closure and store them in the cache. Using the above loading routines will often yield a 10x or greater performance improvement over simple put(...) calls.
September 4, 2013
by Dmitriy Setrakyan
· 9,048 Views
article thumbnail
API Gateway and API Portal - The pillars of API Management and the evolution of SOA
API Management solutions must combine an API Portal (for signing up developers) with an API Gateway (to link back to the enterprise). But where do these come from, and what is the relationship with SOA? To answer these questions, first let's look at a bit of history: In the 2000's, we had the SOA Gateway and the SOA Registry, working hand-in-hand. This was "SOA Governance". The SOA Registry (with a Repository) was intended to be the "central store of truth" for information about Web Services. It was often the public face of SOA Governance, the part which people could see. Usually the services in the registry took the form of heavyweight SOAP services, defined by WSDLs. The problem was that developers were often forced to register their SOAP services in the registry, rather than feeling that it was something beneficial to them. Browsing the registry was also a chore, involving the use of UDDI, also a heavyweight protocol (in fact, it was built on SOAP). Fast-forward to the current decade, and we find that the SOA Registry has been replaced by the API Portal. An API portal is also the "central store of truth", but now it includes REST APIs definitions (usually expressed using a Swagger-type format) as well as SOAP services. The API Portal is designed to be useful and helpful to developers who wish to build apps, rather than feeling like a chore to use. The lesson of SOA was that an attitude of "If we build it, they will come" (or "If we put it in the SOA Registry, people will use it") does not work. You have to make it into a pleasant experience for developers. API portals work for the very reason that SOA registries did not work: usability. Just like the SOA Gateway worked with the SOA Registry, so the API Gateway works hand-in-hand with the API Portal. Together, the combination of the API Portal with the API Gateway constitutes "API Management". The API Portal is for developers to sign up to use APIs, receive API Keys and quotas, and the API Gateway operates at runtime, managing the API Key usage and enforcing the API usage quotas. The API Gateway also performs the very important task of bridging from the technologies used by API clients (REST, OAuth) to the technologies used in the enterprise (Kerberos, SAML, or proprietary identity tokens such as CA SiteMinder smsession tokens). For more on this bridging, check out my webinar with Jason Cardinal from Identica tomorrow on "Bridging APIs to Enterprise Infrastructure". Gartner defines the combination of SOA Governance and API Management as "Application Services Governance". I'm proud to say that Axway (which acquired Vordel in 2012) is recognized by Gartner as a Leader in the category of Application Services Governance. We've seen an evolution of technologies (SOAP to REST) and approach (the UDDI registry to the web-based API Portal) in the journey from SOA Governance to API Management. From 30,000 feet, SOA Governance and API Management might look similar, but the new approach of API Management has already outshone SOA. The API Gateway and API Portal are key to this.
September 3, 2013
by Mitch Pronschinske
· 7,504 Views
article thumbnail
When Reading Excel with POI, Beware of Floating Points
Our problem began when we tried to read a certain cell that contained the value 929 as a numeric field and store it into an integer.
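The excerpt hints at the classic gotcha: POI exposes numeric cells as a double, so a cell displaying 929 can come back as something like 928.9999999999999, and a plain (int) cast truncates it. A hedged sketch of the safer pattern follows; the workbook path and cell coordinates are made up, not from the original article:
// Hedged sketch (not the article's code): round the double POI returns
// instead of truncating it with a plain (int) cast.
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.File;

public class NumericCellExample {
    public static void main(String[] args) throws Exception {
        Workbook wb = WorkbookFactory.create(new File("input.xls")); // hypothetical file
        Sheet sheet = wb.getSheetAt(0);
        Row row = sheet.getRow(0);
        Cell cell = row.getCell(0);

        double raw = cell.getNumericCellValue();   // may be 928.9999999999999
        int truncated = (int) raw;                 // risks 928
        int rounded = (int) Math.round(raw);       // 929

        System.out.println(raw + " -> truncated " + truncated + " vs rounded " + rounded);
    }
}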
August 30, 2013
by Lieven Doclo
· 47,361 Views · 1 Like
article thumbnail
Assigning UUIDs to Neo4j Nodes and Relationships
TL;DR: This blog post features a small demo project on github: neo4j-uuid and explains how to automatically assign UUIDs to nodes and relationships in Neo4j. A very brief introduction to Neo4j 1.9's KernelExtensionFactory is included as well. A Little Rant on Neo4j Node/Relationship IDs In a lot of use cases there is demand for storing a reference to a Neo4j node or relationship in a third party system. The first naive idea probably is to use the internal node/relationship id that Neo4j provides. Do not do that! Ever! You ask why? Well, Neo4j's id is basically an offset in one of the store files Neo4j uses (with some math involved). Assume you delete a couple of nodes. This produces holes in the store files that Neo4j might reclaim when creating new nodes later on. And since the id is a file offset, there is a chance that the new node will have exactly the same id as the previously deleted node. If you don't synchronously update all node id references stored elsewhere, you're in trouble. If Neo4j were completely redeveloped from scratch, the getId() method would not be part of the public API. As long as you use node ids only inside a single request of an application, for example, there's nothing wrong. To repeat myself: Never ever store a node id in a third party system. I have officially warned you. UUIDs Enough of ranting, let's see what we can do to safely store node references in an external system. Basically, we need an identifier that, in contrast to the node id, has no semantics. A common approach to this is using Universally Unique Identifiers (UUID). The Java JDK offers a UUID implementation, so we could potentially use UUID.randomUUID(). Unfortunately random UUIDs are slow to generate. A preferred approach is to use the machine's MAC address and a timestamp as the base for the UUID – this should provide enough uniqueness. There is a nice library out there at http://wiki.fasterxml.com/JugHome providing exactly what we need. Automatic UUID Assignments For convenience it would be great if all freshly created nodes and relationships automatically got assigned a uuid property without doing this explicitly. Fortunately Neo4j supports TransactionEventHandlers, a callback interface plugging into transaction handling. A TransactionEventHandler has a chance to modify or veto any transaction. It's a sharp tool which can have a significant negative performance impact if used the wrong way.
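Before looking at the handler, here is what time-based UUID generation with the JUG library mentioned above looks like in isolation; the compact hex formatting mirrors what the transaction event handler below does with the two long halves, and the class name is just for illustration:
// Rough sketch of generating a time-based UUID with java-uuid-generator (JUG).
import com.fasterxml.uuid.Generators;
import com.fasterxml.uuid.impl.TimeBasedGenerator;

import java.util.UUID;

public class TimeBasedUuidExample {
    public static void main(String[] args) {
        TimeBasedGenerator generator = Generators.timeBasedGenerator();
        UUID uuid = generator.generate();
        // Same compact hex representation the handler below stores as a property.
        String compact = Long.toHexString(uuid.getMostSignificantBits())
                + Long.toHexString(uuid.getLeastSignificantBits());
        System.out.println(compact);
    }
}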
I’ve implemented a UUIDTransactionEventHandler that performs the following tasks: it populates a UUID property for each new node or relationship, and it rejects a transaction if a manual modification of a UUID is attempted, whether assignment or removal. public class UUIDTransactionEventHandler implements TransactionEventHandler<Object> { public static final String UUID_PROPERTY_NAME = "uuid"; private final TimeBasedGenerator uuidGenerator = Generators.timeBasedGenerator(); @Override public Object beforeCommit(TransactionData data) throws Exception { checkForUuidChanges(data.removedNodeProperties(), "remove"); checkForUuidChanges(data.assignedNodeProperties(), "assign"); checkForUuidChanges(data.removedRelationshipProperties(), "remove"); checkForUuidChanges(data.assignedRelationshipProperties(), "assign"); populateUuidsFor(data.createdNodes()); populateUuidsFor(data.createdRelationships()); return null; } @Override public void afterCommit(TransactionData data, java.lang.Object state) { } @Override public void afterRollback(TransactionData data, java.lang.Object state) { } /** * @param propertyContainers set the UUID property for an iterable of nodes or relationships */ private void populateUuidsFor(Iterable<? extends PropertyContainer> propertyContainers) { for (PropertyContainer propertyContainer : propertyContainers) { if (!propertyContainer.hasProperty(UUID_PROPERTY_NAME)) { final UUID uuid = uuidGenerator.generate(); final StringBuilder sb = new StringBuilder(); sb.append(Long.toHexString(uuid.getMostSignificantBits())).append(Long.toHexString(uuid.getLeastSignificantBits())); propertyContainer.setProperty(UUID_PROPERTY_NAME, sb.toString()); } } } private void checkForUuidChanges(Iterable<? extends PropertyEntry<? extends PropertyContainer>> changeList, String action) { for (PropertyEntry<? extends PropertyContainer> removedProperty : changeList) { if (removedProperty.key().equals(UUID_PROPERTY_NAME)) { throw new IllegalStateException("you are not allowed to " + action + " " + UUID_PROPERTY_NAME + " properties"); } } } } Setting up Using KernelExtensionFactory There are two remaining tasks for full automation of UUID assignments: we need to set up autoindexing for uuid properties to have a convenient way to look up nodes or relationships by UUID, and we need to register the UUIDTransactionEventHandler with the graph database. Since version 1.9 Neo4j has the notion of KernelExtensionFactory. Using KernelExtensionFactory you can supply a class that receives lifecycle callbacks when e.g. Neo4j is started or stopped. This is the right place for configuring autoindexing and setting up the TransactionEventHandler. Since the JVM's ServiceLoader is used, KernelExtensionFactories need to be registered in a file META-INF/services/org.neo4j.kernel.extension.KernelExtensionFactory by listing all implementations you want to use: org.neo4j.extension.uuid.UUIDKernelExtensionFactory KernelExtensionFactories can declare dependencies; to do so, declare an inner interface ("Dependencies" in the code below) that just has getters. Using proxies, Neo4j will implement this interface and supply you with the required dependencies. The dependencies are matched by requested type; see Neo4j's source code for which classes are supported as dependencies. KernelExtensionFactories must implement a newKernelExtension method that is supposed to return an instance of Lifecycle.
For our UUID project we return an instance of UUIDLifeCycle: package org.neo4j.extension.uuid; import org.neo4j.graphdb.GraphDatabaseService; import org.neo4j.graphdb.PropertyContainer; import org.neo4j.graphdb.event.TransactionEventHandler; import org.neo4j.graphdb.factory.GraphDatabaseSettings; import org.neo4j.graphdb.index.AutoIndexer; import org.neo4j.graphdb.index.IndexManager; import org.neo4j.kernel.configuration.Config; import org.neo4j.kernel.lifecycle.LifecycleAdapter; import java.util.Map; /** * handles the setup of auto indexing for UUIDs and registers a {@link UUIDTransactionEventHandler} */ class UUIDLifeCycle extends LifecycleAdapter { private TransactionEventHandler transactionEventHandler; private GraphDatabaseService graphDatabaseService; private IndexManager indexManager; private Config config; UUIDLifeCycle(GraphDatabaseService graphDatabaseService, Config config) { this.graphDatabaseService = graphDatabaseService; this.indexManager = graphDatabaseService.index(); this.config = config; } /** * since {@link org.neo4j.kernel.NodeAutoIndexerImpl#start()} is called *after* {@link org.neo4j.extension.uuid.UUIDLifeCycle#start()}, it would apply config settings for auto indexing. To prevent this we change the config here. * @throws Throwable */ @Override public void init() throws Throwable { Map<String, String> params = config.getParams(); params.put(GraphDatabaseSettings.node_auto_indexing.name(), "true"); params.put(GraphDatabaseSettings.relationship_auto_indexing.name(), "true"); config.applyChanges(params); } @Override public void start() throws Throwable { startUUIDIndexing(indexManager.getNodeAutoIndexer()); startUUIDIndexing(indexManager.getRelationshipAutoIndexer()); transactionEventHandler = new UUIDTransactionEventHandler(); graphDatabaseService.registerTransactionEventHandler(transactionEventHandler); } @Override public void stop() throws Throwable { stopUUIDIndexing(indexManager.getNodeAutoIndexer()); stopUUIDIndexing(indexManager.getRelationshipAutoIndexer()); graphDatabaseService.unregisterTransactionEventHandler(transactionEventHandler); } void startUUIDIndexing(AutoIndexer autoIndexer) { autoIndexer.startAutoIndexingProperty(UUIDTransactionEventHandler.UUID_PROPERTY_NAME); } void stopUUIDIndexing(AutoIndexer autoIndexer) { autoIndexer.stopAutoIndexingProperty(UUIDTransactionEventHandler.UUID_PROPERTY_NAME); } } Most of the code is pretty much straightforward: lines 44/45 set up autoindexing for the uuid property, and line 48 registers the UUIDTransactionEventHandler with the graph database. Less obvious is the code in the init() method. Neo4j's NodeAutoIndexerImpl configures autoindexing itself and switches it on or off depending on the respective config option. However, we want autoindexing always switched on. Unfortunately NodeAutoIndexerImpl runs after our code and overrides our settings. That's why lines 37-40 tweak the config settings to force the desired behaviour of NodeAutoIndexerImpl. Looking up Nodes or Relationships for UUID For completeness, the project also contains a trivial unmanaged extension for looking up nodes and relationships using the REST interface; see UUIDRestInterface. By sending an HTTP GET to http://localhost:7474/db/data/node/ the node's internal id is returned. Build System and Testing For building the project, Gradle is used; build.gradle is trivial. Of course a couple of tests are included. As a long-standing addict, I've obviously used Spock for testing. See the test code here.
Final Words A downside of this implementation is that each and every node and relationship gets indexed. Indexing always trades write performance for read performance; keep that in mind. It might make sense to get rid of unconditional auto indexing and put some domain knowledge into the TransactionEventHandler, so that UUIDs are assigned and indexed only for those nodes that are really referenced from an external system.
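As a rough illustration of that last idea (not part of the neo4j-uuid project), the population step could check a marker before assigning a UUID. The sketch below is meant as a drop-in variant of populateUuidsFor in the handler above and reuses its fields; the marker property name "externalRef" is invented:
// Hedged sketch: only assign a UUID when a (hypothetical) marker property
// "externalRef" signals that the entity will be referenced externally.
private void populateUuidsFor(Iterable<? extends PropertyContainer> propertyContainers) {
    for (PropertyContainer propertyContainer : propertyContainers) {
        // "externalRef" is an invented marker; substitute real domain knowledge here.
        if (!propertyContainer.hasProperty("externalRef")) {
            continue;
        }
        if (!propertyContainer.hasProperty(UUID_PROPERTY_NAME)) {
            UUID uuid = uuidGenerator.generate();
            String compact = Long.toHexString(uuid.getMostSignificantBits())
                    + Long.toHexString(uuid.getLeastSignificantBits());
            propertyContainer.setProperty(UUID_PROPERTY_NAME, compact);
        }
    }
}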
August 22, 2013
by Stefan Armbruster
· 11,342 Views
article thumbnail
OpenStack Savanna: Fast Hadoop Cluster Provisioning on OpenStack
introduction openstack is one of the most popular open source cloud computing projects to provide infrastructure as a service solution. its key components are compute (nova), networking (neutron, formerly known as quantum), storage (object and block storage, swift and cinder, respectively), openstack dashboard (horizon), identity service (keystone) and image service (glance). there are other official incubated projects like metering (celiometer) and orchestration and service definition (heat). savanna is a hadoop as a service for openstack introduced by mirantis . it is still in an early phase (version .02 was released in summer 2013) and according to its roadmap version 1.0 is targeted for official openstack incubation. in principle, heat also could be used for hadoop cluster provisioning but savanna is especially tuned for providing hadoop-specific api functionality while heat is meant to be used for generic purposes. savanna architecture savanna is integrated with the core openstack components such as keystone, nova, glance, swift and horizon. it has a rest api that supports the hadoop cluster provisioning steps. savanna api is implemented as a wsgi server that, by default, listens to port 8386. in addition, savanna can also be integrated with horizon, the openstack dashboard to create a hadoop cluster from the management console. savanna also comes with a vanilla plugin that deploys a hadoop cluster image. the standard out-of-the-box vanilla plugin supports hadoop 1.1.2 version. installing savanna the simplest option to try out savanna is to use devstack in a virtual machine. i was using an ubuntu 12.04 virtual instance in my tests. in that environment we need to execute the following commands to install devstack and savanna api: $ sudo apt-get install git-core $ git clone https://github.com/openstack-dev/devstack.git $ vi localrc # edit localrc admin_password=nova mysql_password=nova rabbit_password=nova service_password=$admin_password service_token=nova # enable swift enabled_services+=,swift swift_hash=66a3d6b56c1f479c8b4e70ab5c2000f5 swift_replicas=1 swift_data_dir=$dest/data # force checkout prerequsites # force_prereq=1 # keystone is now configured by default to use pki as the token format which produces huge tokens. # set uuid as keystone token format which is much shorter and easier to work with. keystone_token_format=uuid # change the floating_range to whatever ips vm is working in. # in nat mode it is subnet vmware fusion provides, in bridged mode it is your local network. floating_range=192.168.55.224/27 # enable auto assignment of floating ips. by default savanna expects this setting to be enabled extra_opts=(auto_assign_floating_ip=true) # enable logging screen_logdir=$dest/logs/screen $ ./stack.sh # this will take a while to execute $ sudo apt-get install python-setuptools python-virtualenv python-dev $ virtualenv savanna-venv $ savanna-venv/bin/pip install savanna $ mkdir savanna-venv/etc $ cp savanna-venv/share/savanna/savanna.conf.sample savanna-venv/etc/savanna.conf # to start savanna api: $ savanna-venv/bin/python savanna-venv/bin/savanna-api --config-file savanna-venv/etc/savanna.conf to install savanna ui integrated with horizon, we need to run the following commands: $ sudo pip install savanna-dashboard $ cd /opt/stack/horizon/openstack-dashboard $ vi settings.py horizon_config = { 'dashboards': ('nova', 'syspanel', 'settings', 'savanna'), installed_apps = ( 'savannadashboard', .... 
$ cd /opt/stack/horizon/openstack-dashboard/local $ vi local_settings.py savanna_url = 'http://localhost:8386/v1.0' $ sudo service apache2 restart provisioning a hadoop cluster as a first step, we need to configure keystone-related environment variables to get the authentication token: ubuntu@ip-10-59-33-68:~$ vi .bashrc $ export os_auth_url=http://127.0.0.1:5000/v2.0/ $ export os_tenant_name=admin $ export os_username=admin $ export os_password=nova ubuntu@ip-10-59-33-68:~$ source .bashrc ubuntu@ip-10-59-33-68:~$ ubuntu@ip-10-59-33-68:~$ env | grep os os_password=nova os_auth_url=http://127.0.0.1:5000/v2.0/ os_username=admin os_tenant_name=admin ubuntu@ip-10-59-33-68:~$ keystone token-get +-----------+----------------------------------+ | property | value | +-----------+----------------------------------+ | expires | 2013-08-09t20:31:12z | | id | bdb582c836e3474f979c5aa8f844c000 | | tenant_id | 2f46e214984f4990b9c39d9c6222f572 | | user_id | 077311b0a8304c8e86dc0dc168a67091 | +-----------+----------------------------------+ $ export auth_token="bdb582c836e3474f979c5aa8f844c000" $ export tenant_id="2f46e214984f4990b9c39d9c6222f572" then we need to create the glance image that we want to use for our hadoop cluster. in our example we have used mirantis's vanilla image but we can also build our own image: $ wget http://savanna-files.mirantis.com/savanna-0.2-vanilla-1.1.2-ubuntu-12.10.qcow2 $ glance image-create --name=savanna-0.2-vanilla-hadoop-ubuntu.qcow2 --disk-format=qcow2 --container-format=bare < ./savanna-0.2-vanilla-1.1.2-ubuntu-12.10.qcow2 ubuntu@ip-10-59-33-68:~/devstack$ glance image-list +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ | id | name | disk format | container format | size | status | +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ | d0d64f5c-9c15-4e7b-ad4c-13859eafa7b8 | cirros-0.3.1-x86_64-uec | ami | ami | 25165824 | active | | fee679ee-e0c0-447e-8ebd-028050b54af9 | cirros-0.3.1-x86_64-uec-kernel | aki | aki | 4955792 | active | | 1e52089b-930a-4dfc-b707-89b568d92e7e | cirros-0.3.1-x86_64-uec-ramdisk | ari | ari | 3714968 | active | | d28051e2-9ddd-45f0-9edc-8923db46fdf9 | savanna-0.2-vanilla-hadoop-ubuntu.qcow2 | qcow2 | bare | 551699456 | active | +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ $ export image_id=d28051e2-9ddd-45f0-9edc-8923db46fdf9 then we have installed httpie , an open source http client that can be used to send rest requests to savanna api: $ sudo pip install httpie from now on we will use httpie to send savanna commands. 
we need to register the image with savanna: $ export savanna_url="http://localhost:8386/v1.0/$tenant_id" $ http post $savanna_url/images/$image_id x-auth-token:$auth_token username=ubuntu http/1.1 202 accepted content-length: 411 content-type: application/json date: thu, 08 aug 2013 21:28:07 gmt { "image": { "os-ext-img-size:size": 551699456, "created": "2013-08-08t21:05:55z", "description": "none", "id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "metadata": { "_savanna_description": "none", "_savanna_username": "ubuntu" }, "mindisk": 0, "minram": 0, "name": "savanna-0.2-vanilla-hadoop-ubuntu.qcow2", "progress": 100, "status": "active", "tags": [], "updated": "2013-08-08t21:28:07z", "username": "ubuntu" } } $ http $savanna_url/images/$image_id/tag x-auth-token:$auth_token tags:='["vanilla", "1.1.2", "ubuntu"]' http/1.1 202 accepted content-length: 532 content-type: application/json date: thu, 08 aug 2013 21:29:25 gmt { "image": { "os-ext-img-size:size": 551699456, "created": "2013-08-08t21:05:55z", "description": "none", "id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "metadata": { "_savanna_description": "none", "_savanna_tag_1.1.2": "true", "_savanna_tag_ubuntu": "true", "_savanna_tag_vanilla": "true", "_savanna_username": "ubuntu" }, "mindisk": 0, "minram": 0, "name": "savanna-0.2-vanilla-hadoop-ubuntu.qcow2", "progress": 100, "status": "active", "tags": [ "vanilla", "ubuntu", "1.1.2" ], "updated": "2013-08-08t21:29:25z", "username": "ubuntu" } } then we need to create a nodegroup templates (json files) that will be sent to savanna. there is one template for the master nodes ( namenode , jobtracker ) and another template for the worker nodes such as datanode and tasktracker . the hadoop version is 1.1.2. $ vi ng_master_template_create.json { "name": "test-master-tmpl", "flavor_id": "2", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_processes": ["jobtracker", "namenode"] } $ vi ng_worker_template_create.json { "name": "test-worker-tmpl", "flavor_id": "2", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_processes": ["tasktracker", "datanode"] } $ http $savanna_url/node-group-templates x-auth-token:$auth_token < ng_master_template_create.json http/1.1 202 accepted content-length: 387 content-type: application/json date: thu, 08 aug 2013 21:58:00 gmt { "node_group_template": { "created": "2013-08-08t21:58:00", "flavor_id": "2", "hadoop_version": "1.1.2", "id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "name": "test-master-tmpl", "node_configs": {}, "node_processes": [ "jobtracker", "namenode" ], "plugin_name": "vanilla", "updated": "2013-08-08t21:58:00", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } } $ http $savanna_url/node-group-templates x-auth-token:$auth_token < ng_worker_template_create.json http/1.1 202 accepted content-length: 388 content-type: application/json date: thu, 08 aug 2013 21:59:41 gmt { "node_group_template": { "created": "2013-08-08t21:59:41", "flavor_id": "2", "hadoop_version": "1.1.2", "id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "name": "test-worker-tmpl", "node_configs": {}, "node_processes": [ "tasktracker", "datanode" ], "plugin_name": "vanilla", "updated": "2013-08-08t21:59:41", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } } the next step is to define the cluster template: $ vi cluster_template_create.json { "name": "demo-cluster-template", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_groups": [ { "name": "master", "node_group_template_id": 
"b3a79c88-b6fb-43d2-9a56-310218c66f7c", "count": 1 }, { "name": "workers", "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "count": 2 } ] } $ http $savanna_url/cluster-templates x-auth-token:$auth_token < cluster_template_create.json http/1.1 202 accepted content-length: 815 content-type: application/json date: fri, 09 aug 2013 07:04:24 gmt { "cluster_template": { "anti_affinity": [], "cluster_configs": {}, "created": "2013-08-09t07:04:24", "hadoop_version": "1.1.2", "id": "{ "name": "cluster-1", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "cluster_template_id" : "64c4117b-acee-4da7-937b-cb964f0471a9", "user_keypair_id": "stack", "default_image_id": "3f9fc974-b484-4756-82a4-bff9e116919b" }", "name": "demo-cluster-template", "node_groups": [ { "count": 1, "flavor_id": "2", "name": "master", "node_configs": {}, "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "node_processes": [ "jobtracker", "namenode" ], "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 }, { "count": 2, "flavor_id": "2", "name": "workers", "node_configs": {}, "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "node_processes": [ "tasktracker", "datanode" ], "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } ], "plugin_name": "vanilla", "updated": "2013-08-09t07:04:24" } } now we are ready to create the hadoop cluster: $ vi cluster_create.json { "name": "cluster-1", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "cluster_template_id" : "64c4117b-acee-4da7-937b-cb964f0471a9", "user_keypair_id": "savanna", "default_image_id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9" } $ http $savanna_url/clusters x-auth-token:$auth_token < cluster_create.json http/1.1 202 accepted content-length: 1153 content-type: application/json date: fri, 09 aug 2013 07:28:14 gmt { "cluster": { "anti_affinity": [], "cluster_configs": {}, "cluster_template_id": "64c4117b-acee-4da7-937b-cb964f0471a9", "created": "2013-08-09t07:28:14", "default_image_id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "hadoop_version": "1.1.2", "id": "d919f1db-522f-45ab-aadd-c078ba3bb4e3", "info": {}, "name": "cluster-1", "node_groups": [ { "count": 1, "created": "2013-08-09t07:28:14", "flavor_id": "2", "instances": [], "name": "master", "node_configs": {}, "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "node_processes": [ "jobtracker", "namenode" ], "updated": "2013-08-09t07:28:14", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 }, { "count": 2, "created": "2013-08-09t07:28:14", "flavor_id": "2", "instances": [], "name": "workers", "node_configs": {}, "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "node_processes": [ "tasktracker", "datanode" ], "updated": "2013-08-09t07:28:14", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } ], "plugin_name": "vanilla", "status": "validating", "updated": "2013-08-09t07:28:14", "user_keypair_id": "savanna" } } after a while we can run the nova command to check if the instances are created and running: $ nova list +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ | id | name | status | task state | power state | networks | +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ | 1a9f43bf-cddb-4556-877b-cc993730da88 | cluster-1-master-001 | active | 
none | running | private=10.0.0.2, 192.168.55.227 | | bb55f881-1f96-4669-a94a-58cbf4d88f39 | cluster-1-workers-001 | active | none | running | private=10.0.0.3, 192.168.55.226 | | 012a24e2-fa33-49f3-b051-9ee2690864df | cluster-1-workers-002 | active | none | running | private=10.0.0.4, 192.168.55.225 | +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ now we can log in to the hadoop master instance and run the required hadoop commands: $ ssh -i savanna.pem ubuntu@10.0.0.2 $ sudo chmod 777 /usr/share/hadoop $ sudo su hadoop $ cd /usr/share/hadoop $ hadoop jar hadoop-example-1.1.2.jar pi 10 100 savanna ui via horizon in order to create nodegroup templates, cluster templates and the cluster itself we used a command line tool -- httpie -- to send rest api calls. the same functionality is also available via horizon, the standard openstack dashboard. first we need to register the image with savanna: then we need to create the nodegroup templates: after that we have to create the cluster template: and finally we have to create the cluster:
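The REST calls shown above with httpie can of course be issued from any HTTP client. Purely as an illustration, a Java sketch of the image-registration call might look like the following; the endpoint, tenant id, token, and image id are the placeholder values from this walkthrough, not something to reuse as-is:
// Hedged sketch: mirrors the "http post $savanna_url/images/$image_id ... username=ubuntu" call.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RegisterImage {
    public static void main(String[] args) throws Exception {
        String tenantId = "2f46e214984f4990b9c39d9c6222f572";          // example value from above
        String imageId = "d28051e2-9ddd-45f0-9edc-8923db46fdf9";       // example value from above
        String authToken = "bdb582c836e3474f979c5aa8f844c000";         // example value from above

        URL url = new URL("http://localhost:8386/v1.0/" + tenantId + "/images/" + imageId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("X-Auth-Token", authToken);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        byte[] body = "{\"username\": \"ubuntu\"}".getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }

        System.out.println("HTTP " + conn.getResponseCode()); // expect 202 Accepted
    }
}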
August 20, 2013
by Istvan Szegedi
· 9,151 Views
article thumbnail
Remove Characters at the Start and End of a String in PHP
In a previous article about how you can remove whitespace from a string, I spoke about using the functions ltrim() and rtrim(). These work by passing in a string: ltrim() removes the whitespace from the start of the string, while rtrim() removes the whitespace from the end of the string. But you can also use these functions to remove characters from a string. These functions take a second parameter that allows you to specify what characters to remove. // This will search for the word start at the beginning of the string and remove it ltrim($string, 'start'); // This will search for the word end at the end of the string and remove it rtrim($string, 'end'); Remove Trailing Slashes From a String A common use for this functionality is to remove the trailing slash from a URL. Below is a code snippet that allows you to easily do this using the rtrim() function. function remove_trailing_slashes( $url ) { return rtrim($url, '/'); } A common use for the ltrim() function is to remove the "http://" from a URL. Use the function below to remove both "http" and "https" from a URL: function remove_http( $url ) { $url = ltrim($url, 'http://'); $url = ltrim($url, 'https://'); return $url; }
August 20, 2013
by Paul Underwood
· 41,174 Views
article thumbnail
DyngoDB: A MongoDB Interface for DynamoDB
You might be asking yourself, 'Why do I need a MongoDB-like experience for DynamoDB when there are already full-MongoDB cloud services like MMS, MongoLab, MongoHQ and MongoDirector?' One developer believes there is a need and has set up an experimental project called DyngoDB. It provides a MongoDB-style interface in front of Amazon's DynamoDB and their CloudSearch service. Apparently, in the developer's case, he only wants the MongoDB interface but prefers the DynamoDB storage engine. We'll have to see if other developers also have this specific set of preferences.
August 20, 2013
by Mitch Pronschinske
· 5,449 Views
article thumbnail
A Consistent Approach To Client-Side Cache Invalidation
Download the source code for this blog entry here: ClientSideCacheInvalidation.zip TL;DR? Please scroll down to the bottom of this article to review the summary. I ran into a problem not long ago where some JSON results from an AJAX call to an ASP.NET MVC JsonResult action were being cached by the browser, quite intentionally by design, but were no longer up-to-date, and without devising a new approach to route manipulation or any of the other fundamental infrastructural designs for the endpoints (because there were too many) our hands were tied. The caching was being done using the ASP.NETOutputCacheAttribute on the action being invoked in the AJAX call, something like this (not really, but this briefly demonstrates caching): [OutputCache(Duration = 300)] public JsonResult GetData() { return Json(new { LastModified = DateTime.Now.ToString() }, JsonRequestBehavior.AllowGet); } @model dynamic @{ ViewBag.Title = "Home"; } Home Reload @section scripts { } Since we were using a generalized approach to output caching (as we should), I knew that any solution to this problem should also be generalized. My first thought was in the mistaken assumption that the default [OutputCache] behavior was to rely on client-side caching, since client-side caching was what I was observing while using Fiddler. (Mind you, in the above sample this is not the case, it is actually server-side, but this is probably because of the amount of data being transferred. I’ll explain after I explain what I did in my false assumption.) Microsoft’s default convention for implementing cache invalidation is to rely on “VaryBy..” semantics, such as varying the route parameters. That is great except that the route and parameters were currently not changing in our implementation. So, my initial proposal was to force the caching to be done on the server instead of on the client, and to invalidate when appropriate. public JsonResult DoSomething() { // // Do something here that has a side-effect // of making the cached data stale // Response.RemoveOutputCacheItem(Url.Action("GetData")); return Json("OK"); } [OutputCache(Duration = 300, Location = OutputCacheLocation.Server)] public JsonResult GetData() { return Json(new { LastModified = DateTime.Now.ToString() }, JsonRequestBehavior.AllowGet); } Invalidate $('#invalidate').on('click', function() { $.post($APPROOT + "Home/DoSomething", null, function(o) { window.location.reload(); }, 'json'); }); While Reload has no effect on the Last modified value, the Invalidate button causes the date to increment. When testing, this actually worked quite well. But concerns were raised about the payload of memory on the server. Personally I think the memory payload in practically any server-side caching is negligible, certainly if it is small enough that it would be transmitted over the wire to a client, so long as it is measured in kilobytes or tens of kilobytes and not megabytes. I think the real concern is that transmission; the point of caching is to make the user experience as smooth and seamless as possible with minimal waiting, so if the user is waiting for a (cached) payload, while it may be much faster than the time taken to recalculate or re-acquire the data, it is still measurably slower than relying on browser cache. The default implementation of OutputCacheAttribute is actually OutputCacheLocation.Any. This indicates that the cached item can be cached on the client, on a proxy server, or on the web server. 
From my tests, for tiny payloads, the behavior seemed to be caching on the server and no caching on the client; for a large payload from GET requests with querystring parameters seemed to be caching on the client but with an HTTP query with an “If-Modified-Since” header, resulting in a 304 Not Modified on the server (indicating it was also cached on the server but verified by the server that the client’s cache remains valid); and for a large payload from GET requests with all parameters in the path, the behavior seemed to be caching on the client without any validation checking from the client (no HTTP request for an If-Modified-Since check). Now, to be quite honest I am only guessing that these were the distinguishing factors of these behavior observations. Honestly, I saw variations of these behaviors happening all over the place as I tinkered with scenarios, and this was the initial pattern I felt I was observing. At any rate, for our purposes we were currently stuck with relying on “Any” as the location, which in theory would remove server-side caching if the server ran short on RAM (in theory, I don’t know, although the truth can probably be researched, which I don’t have time to get into). The point of all this is, we have client-side caching that we cannot get away from. So, how do you invalidate the client-side cache? Technically, you really can’t. The browser controls the cache bucket and no browsers provide hooks into the cache to invalidate them. But we can get smart about this, and work around the problem, by bypassing the cached data. Cached HTTP results are stored on the basis of varying by the full raw URL on HTTP GET methods, they are cached with an expiration (in the above sample’s case, 300 seconds, or 5 minutes), and are only cached if allowed to be cached in the first place as per the HTTP header directives in the HTTP response. So, to bypass the cache you don’t cache, or you need to know up front how long the cache should remain until it expires—neither of these being acceptable in a dynamic application—or you need to use POST instead of GET, or you need to vary up the URL. Microsoft originally got around the caching problem in ASP.NET 1.x by forcing the “normal” development cycle in the lifecycle of tags that always used the POST method over HTTP. Responses from POST requests are never cached. But POSTing is not clean as it does not follow the semantics of the verbiage if nothing is being sent up and data is only being retrieved. You can also use ETag in the HTTP headers, which isn’t particularly helpful in a dynamic application as it is no different from a URL + expiration policy. To summarize, to control cache: Disable caching from the server in the Response header (Pragma: no-cache) Predict the lifetime of the content and use an expiration policy Use POST not GET Etag Vary the URL (case-sensitive) Given our options, we need to vary up the URL. There a number of approaches to this, but almost all of the approaches involve relying on appending or modifying the querystring with parameters that are expected to be ignored by the server. $.getJSON($APPROOT + "Home/GetData?_="+Date.now(), function (o) { $('#results').text("Last modified: " + o.LastModified); }); In this sample, the URL is appended with “?_=”+Date.now(), resulting in this URL in the GET: /Home/GetData?_=1376170287015 This technique is often referred to as cache-busting. (And if you’re reading this blog article, you’re probably rolling your eyes. 
“Duh.”) jQuery inherently supports cache-busting, but it does not do it on its own from $.getJSON(), it only does it in $.ajax() when the options parameter includes {cache: false}, unless you invoke $.ajaxSetup({ cache: false }); first to disable all caching. Otherwise, for $.getJSON() you would have to do it manually by appending the URL. (Alright, you can stop rolling your eyes at me now, I’m just trying to be thorough here..) This is not our complete solution. We have a couple problems we still have to solve. First of all, in a complex client codebase, hacking at the URL from application logic might not be the most appropriate approach. Consider if you’re using Backbone.js with routes that synchronize objects to and from the server. It would be inappropriate to modify the routes themselves just for cache invalidation. A more generalized cache invalidation technique needs to be implemented in the XHR-invoking AJAX function itself. The approach in doing this will depend upon your Javascript libraries you are using, but, for example, if jQuery.getJSON() is being used in application code, then jQuery.getJSON itself could perhaps be replaced with an invalidation routine. var gj = $.getJSON; $.getJSON = function (url, data, callback) { url = invalidateCacheIfAppropriate(url); // todo: implement something like this return gj.call(this, url, data, callback); }; This is unconventional and probably a bad example since you’re hacking at a third party library, a better approach might be to wrap the invocation of $.getJSON() with an application function. var getJSONWrapper = function (url, data, callback) { url = invalidateCacheIfAppropriate(url); // todo: implement something like this return $.getJSON(url, data, callback); }; And from this point on, instead of invoking $.getJSON() in application code, you would invoke getJSONWrapper, in this example. The second problem we still need to solve is that the invalidation of cached data that derived from the server needs to be triggered by the server because it is the server, not the client, that knows that client cached data is no longer up-to-date. Depending on the application, the client logic might just know by keeping track of what server endpoints it is touching, but it might not! Besides, a server endpoint might have conditional invalidation triggers; the data might be stale given specific conditions that only the server may know and perhaps only upon some calculation. In other words, invalidation needs to be pushed by the server. One brute force, burdensome, and perhaps a little crazy approach to this might be to use actual “push technology”, formerly “Comet” or “long-polling”, now WebSockets, implemented perhaps with ASP.NET SignalR, where a connection is maintained between the client and the server and the server then has this open socket that can push invalidation flags to the client. We had no need for that level of integration and you probably don’t either, I just wanted to mention it because it might come back as food for thought for a related solution. One scenario I suppose where this might be useful is if another user of the web application has caused the invalidation, in which case the current user will not be in the request/response cycle to acquire the invalidation flag. Otherwise, it is perhaps a reasonable assumption that invalidation is only needed, and only triggered, in the context of a user’s own session. If not, perhaps it is a “good enough” assumption even if it is sometimes not true. 
The expiration policy can be set low enough that a reasonable compromise can be made between the current user’s changes and changes invoked by other systems or other users. While we may not know what server endpoint might introduce the invalidation of client cache data, we could assume that the invalidation will be triggered by any server endpoint(s), and build invalidation trigger logic on the response of server HTTP responses. To begin implementing some sort of invalidation trigger on the server I could flag invalidations to the client using HTTP header(s). public JsonResult DoSomething() { // // Do something here that has a side-effect // of making the cached data stale // InvalidateCacheItem(Url.Action("GetData")); return Json("OK"); } public void InvalidateCacheItem(string url) { Response.RemoveOutputCacheItem(url); // invalidate on server Response.AddHeader("X-Invalidate-Cache-Item", url); // invalidate on client } [OutputCache(Duration = 300)] public JsonResult GetData() { return Json(new { LastModified = DateTime.Now.ToString() }, JsonRequestBehavior.AllowGet); } At this point, the server is emitting a trigger to the HTTP client that says that “as a result of a recent operation, that other URL, the one for GetData, is no longer valid for your current cache, if you have one”. The header alone can be handled by different client implementations (or proxies) in different ways. I didn’t come across any “standard” HTTP response that does this “officially”, so I’ll come up with a convention here. Now we need to handle this on the client. Before I do anything first of all I need to refactor the existing AJAX functionality on the client so that instead of using $.getJSON, I might use $.ajax or some other flexible XHR handler, and wrap it all in custom functions such as httpGET()/httpPOST() and handleResponse(). var httpGET = function(url, data, callback) { return httpAction(url, data, callback, "GET"); }; var httpPOST = function (url, data, callback) { return httpAction(url, data, callback, "POST"); }; var httpAction = function(url, data, callback, method) { url = cachebust(url); if (typeof(data) === "function") { callback = data; data = null; } $.ajax(url, { data: data, type: "GET", success: function(responsedata, status, xhr) { handleResponse(responsedata, status, xhr, callback); } }); }; var handleResponse = function (data, status, xhr, callback) { handleInvalidationFlags(xhr); callback.call(this, data, status, xhr); }; function handleInvalidationFlags(xhr) { // not yet implemented }; function cachebust(url) { // not yet implemented return url; }; // application logic httpGET($APPROOT + "Home/GetData", function(o) { $('#results').text("Last modified: " + o.LastModified); }); $('#reload').on('click', function() { window.location.reload(); }); $('#invalidate').on('click', function() { httpPOST($APPROOT + "Home/Invalidate", function (o) { window.location.reload(); }); }); At this point we’re not doing anything yet, we’ve just broken up the HTTP/XHR functionality into wrapper functions that we can now modify to manipulate the request and to deal with the invalidation flag in the response. Now all our work will be in handleInvalidationFlags() for capturing that new header we just emitted from the server, and cachebust() for hijacking the URLs of future requests. To deal with the invalidation flag in the response, we need to detect that the header is there, and add the cached item to a cached data set that can be stored locally in the browser with web storage. 
The best place to put this cached data set is in sessionStorage, which is supported by all current browsers. Putting it in a session cookie (a cookie with no expiration flag) works but is less ideal because it adds to the payload of all HTTP requests. Putting it in localStorage is less ideal because we do want the invalidation flag(s) to go away when the browser session ends, because that’s when the original browser cache will expire anyway. There is one caveat to sessionStorage: if a user opens a new tab or window, the browser will drop the sessionStorage in that new tab or window, but may reuse the browser cache. The only workaround I know of at the moment is to use localStorage (permanently retaining the invalidation flags) or a session cookie. In our case, we used a session cookie. Note also that IIS is case-insensitive on URI paths, but HTTP itself is not, and therefore browser caches will not be. We will need to ignore case when matching URLs with cache invalidation flags. Here is a more or less complete client-side implementation that seems to work in my initial test for this blog entry. function handleInvalidationFlags(xhr) { // capture HTTP header var invalidatedItemsHeader = xhr.getResponseHeader("X-Invalidate-Cache-Item"); if (!invalidatedItemsHeader) return; invalidatedItemsHeader = invalidatedItemsHeader.split(';'); // get invalidation flags from session storage var invalidatedItems = sessionStorage.getItem("invalidated-cache-items"); invalidatedItems = invalidatedItems ? JSON.parse(invalidatedItems) : {}; // update invalidation flags data set for (var i in invalidatedItemsHeader) { invalidatedItems[prepurl(invalidatedItemsHeader[i])] = Date.now(); } // store revised invalidation flags data set back into session storage sessionStorage.setItem("invalidated-cache-items", JSON.stringify(invalidatedItems)); } // since we're using IIS/ASP.NET which ignores case on the path, we need a function to force lower-case on the path function prepurl(u) { return u.split('?')[0].toLowerCase() + (u.indexOf("?") > -1 ? "?" + u.split('?')[1] : ""); } function cachebust(url) { // get invalidation flags from session storage var invalidatedItems = sessionStorage.getItem("invalidated-cache-items"); invalidatedItems = invalidatedItems ? JSON.parse(invalidatedItems) : {}; // if item match, return concatonated URL var invalidated = invalidatedItems[prepurl(url)]; if (invalidated) { return url + (url.indexOf("?") > -1 ? "&" : "?") + "_nocache=" + invalidated; } // no match; return unmodified return url; } Note that the date/time value of when the invalidation occurred is permanently stored as the concatenation value. This allows the data to remain cached, just updated to that point in time. If invalidation occurs again, that concatenation value is revised to the new date/time. Running this now, after invalidation is triggered by the server, the subsequent request of data is appended with a cache-buster querystring field. In Summary, .. .. a consistent approach to client-side cache invalidation triggered by the server might be by following these steps. Use X-Invalidate-Cache-Item as an HTTP response header to flag potentially cached URLs as expired. You might consider using a semicolon-delimited response to list multiple items. (Do not URI-encode the semicolon when using it as a URI list delimiter.) Semicolon is a reserved/invalid character in URI and is a valid delimiter in HTTP headers, so this is valid. 
Someday, browsers might support this HTTP response header by automatically invalidating browser cache items declared in this header, which would be awesome. In the mean time ... Capture these flags on the client into a data set, and store the data set into session storage in the format: { "http://url.com/route/action": (date_value_of_invalidation_flag), "http://url.com/route/action/2": (date_value_of_invalidation_flag) } Hijack all XHR requests so that the URL is appropriately appended with a cache-busting querystring parameter if the URL was found in the invalidation flags data set, i.e. http://url.com/route/action becomes something like http://url.com/route/action?_nocache=(date_value_of_invalidation_flag), being sure to hijack only the XHR request and not any logic that generated the URL in the first place. Remember that IIS and ASP.NET by default convention ignore case (“/Route/Action” == “/route/action”) on the path, but the HTTP specification does not and therefore the browser cache bucket will not ignore case. Force all URL checks for invalidation flags to be case-insensitive to the left of the querystring (if there is a querystring, otherwise for the entire URL). Make sure the AJAX requests’ querystring parameters are in consistent order. Changing the sequential order of parameters may be handled the same on the server but will be cached differently on the client. These steps are for “pull”-based invalidation flags being pulled from the server via XHR. For “push”-based invalidation triggered by the server, consider using something like a SignalR channel or hub to maintain an open channel of communication using WebSockets or long polling. Server application logic can then invoke this channel or hub to send an invalidation flag to the client or to all clients. On the client side, an invalidation flag “push” triggered in #7 above, for which #1 and #2 above would no longer apply, can still utilize #3 through #6. You can download the project I used for this blog entry here: ClientSideCacheInvalidation.zip
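The samples in this article are ASP.NET MVC and jQuery, but the X-Invalidate-Cache-Item convention itself is stack-agnostic. Purely as an illustration (not part of the article's code), a Java servlet could emit the same flag after a mutating operation; the servlet name and the invalidated URL below are hypothetical:
// Illustration only: emit the article's X-Invalidate-Cache-Item convention from a servlet.
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class DoSomethingServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // ... perform the operation that has the side effect of making
        // the cached data for the (hypothetical) /home/getdata endpoint stale ...

        // Flag the cached URL as invalidated for the client.
        response.addHeader("X-Invalidate-Cache-Item", "/home/getdata");

        response.setContentType("application/json");
        response.getWriter().write("\"OK\"");
    }
}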
August 16, 2013
by Jon Davis
· 10,575 Views
article thumbnail
Destroy Cookie While Logging Out
I was facing a problem where, when a person logs out, his session is invalidated but the JSESSIONID cookie still remains in the browser. As a result, while logging in, the Java API would receive the request from the browser along with a JSESSIONID (just the ID, since the session was invalidated) and would create the new session with the same ID. To fix this problem I used the code below, so that whenever a user logs out the JSESSIONID is emptied and thus the cookie won't exist for that site. Anyone using Java can utilize this in their code. @RequestMapping(value = "/logout", method = RequestMethod.POST) public void logout(HttpServletRequest request, HttpServletResponse response) { /* Getting the session and then invalidating it */ HttpSession session = request.getSession(false); if (request.isRequestedSessionIdValid() && session != null) { session.invalidate(); } handleLogOutResponse(request, response); } /** * This method edits the cookie information and makes JSESSIONID empty * while responding to logout. This helps to avoid getting the same cookie ID * each time a person logs in. * @param request * @param response */ private void handleLogOutResponse(HttpServletRequest request, HttpServletResponse response) { Cookie[] cookies = request.getCookies(); if (cookies == null) { return; } for (Cookie cookie : cookies) { cookie.setMaxAge(0); cookie.setValue(null); cookie.setPath("/"); response.addCookie(cookie); } }
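If expiring every cookie is too aggressive, a narrower variant (a sketch of my own, not from the original post, intended to slot into the same controller) expires just the JSESSIONID cookie by name; the "/" path assumes the default session cookie path:
// Sketch: expire only the JSESSIONID cookie instead of every cookie.
private void expireSessionCookie(HttpServletResponse response) {
    Cookie cookie = new Cookie("JSESSIONID", "");
    cookie.setMaxAge(0);     // instructs the browser to delete the cookie
    cookie.setPath("/");     // assumes the default session cookie path
    response.addCookie(cookie);
}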
August 15, 2013
by Shiv Kumar Ganesh
· 40,834 Views · 2 Likes
article thumbnail
Algorithm of the Week: Quicksort - Three-way vs. Dual-pivot
It's no news that quicksort is considered one of the most important algorithms of the century and that it is the de facto system sort for many languages, including Arrays.sort in Java. So, what's new about quicksort? Well, nothing except that I just now figured out (two damn years after the release of Java 7) that the quicksort implementation of Arrays.sort has been replaced with a variant called dual-pivot quicksort. This thread is awesome not only for this reason but also for how humble Jon Bentley and Joshua Bloch really are. What did I do then? Just like everybody else, I wanted to implement it and do some benchmarking against some 10 million integers (random and duplicate). Oddly enough, I found the following results: Random Data Basic quicksort: 1222 ms Three-way quicksort: 1295 ms (seriously!) Dual-pivot quicksort: 1066 ms Duplicate Data Basic quicksort: 378 ms Three-way quicksort: 15 ms Dual-pivot quicksort: 6 ms Stupid Question One I am afraid that I am missing something in the implementation of the three-way partition. Across several runs against random inputs (of 10 million numbers), I could see that the single pivot always performs better (although the difference is less than 100 milliseconds for 10 million numbers). I understand that the whole purpose of making the three-way quicksort the default quicksort until now is that it does not give O(n²) performance on duplicate keys, which is very evident when I run it against duplicate inputs. But is it true that, for the sake of handling duplicate data, a small penalty is taken by the three-way quicksort? Or is my implementation bad? Stupid Question Two My dual-pivot implementation (link below) does not handle duplicates well. It takes forever (O(n²)) to execute. Is there a good way to avoid this? Referring to the Arrays.sort implementation, I figured out that ascending sequences and duplicates are eliminated well before the actual sorting is done. So, as a dirty fix, if the pivots are equal I fast-forward the lowerIndex until it is different from pivot2. Is this a fair implementation?
else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex=j) break; exchange(input, i, j); } exchange(input, pivotIndex, j); return j; } } Three-way package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSort3Way { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } public void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int lt=lowIndex; int gt=highIndex; int i=lowIndex+1; int pivotIndex=lowIndex; int pivotValue=input[pivotIndex]; while (i<=gt){ if (less(input[i],pivotValue)){ exchange(input, i++, lt++); } else if (less (pivotValue, input[i])){ exchange(input, i, gt--); } else{ i++; } } sort (input, lowIndex, lt-1); sort (input, gt+1, highIndex); } } Dual-pivot package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSortDualPivot { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } private void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int pivot1=input[lowIndex]; int pivot2=input[highIndex]; if (pivot1>pivot2){ exchange(input, lowIndex, highIndex); pivot1=input[lowIndex]; pivot2=input[highIndex]; //sort(input, lowIndex, highIndex); } else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex
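The timings above come from runs over 10 million random and duplicate integers. A minimal harness for reproducing that kind of comparison is sketched below; the class names match the listings above, but how the original random and duplicate inputs were generated is not shown in the article, so the data generation here is an assumption:
// Minimal benchmark sketch. QuickSort3Way and QuickSortDualPivot are the classes listed above.
import java.util.Random;

public class SortBenchmark {
    static final int N = 10_000_000;

    static int[] randomData() {
        Random rnd = new Random(42);
        int[] a = new int[N];
        for (int i = 0; i < N; i++) a[i] = rnd.nextInt();
        return a;
    }

    static int[] duplicateData() {
        int[] a = new int[N];
        java.util.Arrays.fill(a, 7); // heavy duplication (assumed setup)
        return a;
    }

    static long timeMillis(Runnable r) {
        long start = System.nanoTime();
        r.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int[] a = randomData();
        System.out.println("3-way, random:      " + timeMillis(() -> new QuickSort3Way().sort(a)) + " ms");
        int[] b = randomData();
        System.out.println("dual-pivot, random: " + timeMillis(() -> new QuickSortDualPivot().sort(b)) + " ms");
        int[] c = duplicateData();
        System.out.println("3-way, duplicates:  " + timeMillis(() -> new QuickSort3Way().sort(c)) + " ms");
        int[] d = duplicateData();
        System.out.println("dual-pivot, dupes:  " + timeMillis(() -> new QuickSortDualPivot().sort(d)) + " ms");
    }
}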
August 14, 2013
by Arun Manivannan
· 45,953 Views
article thumbnail
Resource Pooling, Virtualization, Fabric, and the Cloud
One of the five essential attributes of cloud computing (see The 5-3-2 Principle of Cloud Computing) is resource pooling, which is an important differentiator separating the thought process of traditional IT from that of a service-based, cloud computing approach. Resource pooling in the context of cloud computing and from a service provider’s viewpoint denotes a set of strategies and a methodical way of managing resources. For a user, resource pooling institutes an abstraction for presenting and consuming resources in a consistent and transparent fashion. This article presents these key concepts derived from resource pooling: Resource Pools Virtualization in the Context of Cloud Computing Standardization, Automation, and Optimization Fabric Cloud Closing Thoughts Resource Pools Ultimately, data center resources can be logically placed into three categories. They are: compute, networks, and storage. For many, this grouping may appear trivial. It is, however, a foundation upon which some cloud computing methodologies are developed, products designed, and solutions formulated. Compute This is a collection of all CPU capabilities. Essentially all data center servers, either for supporting or actually running a workload, are all part of this compute group. Compute pool represents the total capacity for executing code and running instances. The process to construct a compute pool is to first inventory all servers and identify virtualization candidates followed by implementing server virtualization. It is never too early to introduce a system management solution to facilitate the processes, which in my view is a strategic investment and a critical component for all cloud initiatives. Networks The physical and logical artifacts putting in place to connect resources, segment, and isolate resources from layer three and below, etc., are gathered in the network pool. Networking enables resources becoming visible and hence possibly manageable. In the age of instant gratification, networks and mobility are redefining the security and system administration boundaries, and play a direct and impactful role in user productivity and customer satisfaction. Networking in cloud computing is more than just remote access, but empowerment for a user to self-serve and consume resources anytime, anywhere, with any device. BYOD and consumerization of IT are various expressions of these concepts. Storage This has long been a very specialized and sometimes mysterious part of IT. An enterprise storage solution is frequently characterized as a high-cost item with a significant financial and contractual commitment, specialized hardware, proprietary API and software, a dependency on direct vendor support, etc. In cloud computing, storage has become even more noticeable since the ability to grow and shrink based on demands, i.e. elasticity, demands an enterprise-level, massive, reliable, and resilient storage solution at a global scale. While enterprise IT is consolidating resources and transforming the existing establishment into a cloud computing environment, how to leverage existing storage devices from various vendors and integrate them with the next generation storage solutions is among the highest priorities for modernizing a data center. Virtualization in the Context of Cloud Computing In the last decade, virtualization has proved its value and accelerated the realization of cloud computing. 
Then, virtualization was mainly server virtualization, which, in an over-simplified statement, means hosting multiple server instances on the same hardware while each instance runs transparently and in isolation, as if each consumed the entire hardware and were the only instance running. Much of the customer expectations, business needs, and methodologies have since evolved. Now, we should validate virtualization in the context of cloud computing to fully address the innovations rapidly changing how IT conducts business and delivers services. As discussed below, in the context of cloud computing, consumable resources are delivered in some virtualized form, and the various virtualization layers collectively construct and form the so-called fabric.

Server Virtualization

The concept of server virtualization remains: running multiple server instances on the same hardware while each instance runs transparently and in isolation, as if each instance were the only instance running and consumed the entire server hardware. In addition to virtualizing and consolidating servers, server virtualization also signifies the practice of standardizing server deployment by switching away from physical boxes to VMs. Server virtualization is the means for packaging, delivering, and consuming a compute pool. There are a few important considerations when virtualizing servers. IT needs the ability to identify and manage bare metal such that the entire resource life cycle, from commissioning to decommissioning, can be standardized and automated. To fundamentally reduce support and training costs while increasing productivity, a consistent platform with tools applicable across physical, virtual, on-premises, and off-premises deployments is essential. The last thing IT wants is one set of tools for physical resources and another for those virtualized, one set of tools for on-premises deployments and another for those deployed to a service provider, and one set of tools for developing applications and another for deploying them. The requirement is one methodology for all, one skill set for all, and one set of tools for all. This advantage is obvious when developing applications and deploying Windows Server 2012 R2 on premises or off premises to Windows Azure. The Active Directory security model can work across sites, System Center can manage resources deployed off premises to Windows Azure, and Visual Studio can publish applications across platforms. Windows infrastructure architecture, security, and deployment models are all directly applicable.

Network Virtualization

A similar idea applies here. Network virtualization is the ability to run multiple networks on the same network device while each network runs transparently and in isolation, as if each network were the only network running and consumed the entire network hardware. Conceptually, since each network instance runs in isolation, one tenant’s 192.168.x network is not aware of another tenant’s identical 192.168.x network running on the same network device. Network virtualization provides the translation between physical network characteristics and the representation of a resource identity in a virtualized network. Consequently, above the network virtualization layer, various tenants, while running in isolation, can have identical network configurations. A great example of network virtualization is Windows Azure virtual networking.
At any given time, there can be multiple Windows Azure subscribers all allocating the same 192.168.x address space with an identical subnet scheme (192.168.x.x/16) for deploying VMs. The VMs belonging to one subscriber will, however, not be aware of or visible to those deployed by others, despite the fact that the network configuration, IP scheme, and IP address assignments may all be identical. Network virtualization in Windows Azure isolates one subscriber from the others, such that each subscriber operates as if that subscription were the only one employing a 192.168.x address space.

Storage Virtualization

I believe this is where the next wave of drastic IT cost reduction, after server virtualization, happens. Historically, storage has been a high-cost item in any IT budget in every aspect, including hardware, software, staffing, maintenance, SLA, etc. Since the introduction of Windows Server 2012, there is a clear direction: storage virtualization is built into the OS and is becoming a commodity. New capabilities like Storage Pools, Hyper-V over SMB, Scale-Out File Server, etc., are now part of the Windows Server OS, making storage virtualization part of server administration routines and easily manageable with tools and utilities like PowerShell, which is familiar to many IT professionals. The concept of storage virtualization remains consistent with the idea of logically separating a computing object from its hardware, in this case the storage capacity. Storage virtualization is the ability to integrate multiple and heterogeneous storage devices, aggregate their storage capacities, and present and manage them as one logical storage device with a continuous storage space. JBOD is one technology for realizing this concept.

Standardization, Automation, and Optimization

Each of the three resource pools has an abstraction to logically present itself with characteristics and work patterns. A compute pool is a collection of physical (virtualization and infrastructure) hosts and VMs. A virtualization host hosts VMs that run workloads deployed by service owners and consumed by authorized users. A network pool encompasses network resources including physical devices, logical switches, address spaces, and site configurations. Network virtualization, as enabled and defined in configuration, can identify and translate a logical/virtual IP address into a physical one, such that tenants on the same network hardware can implement an identical network scheme without concern. A storage pool is based on storage virtualization, which is the concept of presenting an aggregated storage capacity as one continuous storage space, as if provided by one logical storage device. In other words, the three resource pools are wrapped in server virtualization, network virtualization, and storage virtualization, respectively. Each virtualization presents a set of methodologies from which work patterns are derived and common practices are developed. These virtualization layers provide opportunities to standardize, automate, and optimize deployments, and they considerably facilitate the adoption of cloud computing.

Standardization

Virtualizing resources decouples the dependency between instances and the underlying hardware. This offers an opportunity to simplify and standardize the logical representation of a resource. For instance, a VM is defined and deployed with a VM template that provides a level of consistency with a standardized configuration.
Automation

Once VM characteristics are identified and standardized, we can generate an instance by providing only instance-based information, or information that depends on run time, such as the VM machine name, which must be validated at run time to prevent duplicate names. Requiring only minimal information at deployment significantly simplifies and streamlines operations for automation. And with automation, resources can then be deployed, instantiated, relocated, taken offline, brought back online, or removed rapidly and automatically based on set criteria. Standardization and automation are essential mechanisms for scaling a workload on demand, i.e., for making it elastic.

Optimization

Standardization provides a set of common criteria. Automation executes operations based on those criteria with volume, consistency, and expediency. With standardization and automation, instances can be instantiated with consistency, efficiency, and predictability. In other words, resources can be operated in bulk with consistency and predictability. The next logical step is then to optimize usage based on SLA. This progression is what resource pooling and virtualization can provide and facilitate. These methodologies are now built into products and solutions. Windows Server 2012 R2 and System Center 2012 and later integrate server virtualization, network virtualization, and storage virtualization into one consistent solution platform with standardization, automation, and optimization for building and managing clouds.

Fabric

This is a significant abstraction in cloud computing. Fabric implies accessibility and discoverability, and denotes the ability to discover, identify, and manage a resource. Conceptually, fabric is an umbrella term encompassing all the underlying infrastructure supporting a cloud computing environment. At the same time, a fabric controller represents the system management solution that manages, i.e. owns, the fabric. In cloud architecture, fabric consists of the three resource pools: compute, networks, and storage. Compute provides the computing capabilities, executes code, and runs instances. Networks glue the resources together based on requirements. Storage is where VMs, configurations, data, and resources are kept. Fabric shields the physical complexities of the three resource pools, which are presented through server virtualization, network virtualization, and storage virtualization. All operations are ultimately directed by the fabric controller of a data center. Above the fabric, there are logical views of consumable resources, including VMs, virtual networks, and logical storage drives. By deploying VMs, configuring virtual networks, or acquiring storage, a user consumes resources. Under the fabric, there are virtualization and infrastructure hosts, Active Directory, DNS, clusters, load balancers, address pools, network sites, library shares, storage arrays, topology, racks, cables, etc., all under the fabric controller’s command to collectively present and support the fabric. For a service provider, building a cloud computing environment is essentially establishing a fabric controller and constructing fabric: namely, instituting a comprehensive management solution, building the three resource pools, and integrating server virtualization, network virtualization, and storage virtualization to form the fabric. From a user’s point of view, how and where a resource is physically provided is not a concern; accessibility, readiness, scalability, and fulfillment of the SLA are.
Cloud

This is a well-defined term, and we should not be confused by it (see NIST SP 800-145 and The 5-3-2 Principle of Cloud Computing). We need to be very clear on what a cloud must exhibit (the five essential attributes), how it is consumed (with SaaS, PaaS, or IaaS), and the model a service is deployed in (private cloud, public cloud, or hybrid cloud). Cloud is a concept, a state, a set of capabilities such that a business function can be delivered as a service, i.e. available on demand. The architecture of a cloud computing environment is presented with three resource pools: compute, networks, and storage. Each is an abstraction provided by a virtualization layer. Server virtualization presents a compute pool with VMs that supply the computing power, i.e. CPUs, to execute code and run instances. Network virtualization offers a network pool and is the mechanism that allows multiple tenants with identical network configurations to run on the same virtualization host, while connecting, segmenting, and isolating network traffic with virtual NICs, logical switches, address spaces, network sites, IP pools, etc. Storage virtualization provides a logical storage device whose capacity appears continuous and aggregated, with a pool of storage devices behind the scenes. The three resource pools together constitute the fabric (of a cloud), while the three virtualization layers collectively form the abstraction, such that even though the underlying physical infrastructure may be intricate, the user experience above the fabric remains logical and consistent. Deploying a VM, configuring a virtual network, or acquiring storage is transparent with virtualization, regardless of where the VM actually resides, how the virtual network is physically wired, or which devices in the aggregate provide the requested storage.

Closing Thoughts

Cloud is a very consumer-focused approach. It is about a customer’s ability and control, based on SLA, to get resources when needed and at scale, and, equally important, to release resources when they are no longer required. It is not about products and technologies. It is about servicing, consuming, and strengthening the bottom line.
August 12, 2013
by Yung Chou
· 9,997 Views
article thumbnail
neo4j: Extracting a subgraph as an adjacency matrix and calculating eigenvector centrality with JBLAS
Earlier in the week I wrote a blog post showing how to calculate the eigenvector centrality of an adjacency matrix using JBLAS, and the next step was to work out the eigenvector centrality of a neo4j subgraph. There were three steps involved in doing this:

1. Export the neo4j subgraph as an adjacency matrix
2. Run JBLAS over it to get eigenvector centrality scores for each node
3. Write those scores back into neo4j

I decided to make use of the Paul Revere data set from Kieran Healy’s blog post, which consists of people and the groups they had membership of. The script to import the data is on my fork of the revere repository. Having imported the data, the next step was to write a cypher query which would give me the people in an adjacency matrix, with the number at each column/row intersection showing how many common groups that pair of people had. I thought it’d be easier to build this query incrementally, so I started out writing a query which would return one row of the adjacency matrix:

MATCH p1:Person, p2:Person
WHERE p1.name = "Paul Revere"
WITH p1, p2
MATCH p = p1-[?:MEMBER_OF]->()<-[?:MEMBER_OF]-p2
WITH p1.name AS p1, p2.name AS p2, COUNT(p) AS links
ORDER BY p2
RETURN p1, COLLECT(links) AS row

Here we start with Paul Revere and then find the relationships between him and every other person by way of a common group membership. We use an optional relationship since we need a value in each column/row of our adjacency matrix, so we have to return a 0 for anyone he doesn’t intersect with. If we run that query we get back the following:

+---------------+------------------------------------------------------------+
| p1            | row                                                        |
+---------------+------------------------------------------------------------+
| "Paul Revere" | [2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,3,1,1,1,1,1,1,3,3,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,3,2,1,1,2,1,2,1,1,1,1,1,0,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,2,1,3,1,3,2,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,0,1,0,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,2,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,3,1,1,2,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,2,1,1,1,1,1,1,1,1,3,1,1,1,1,3,1,1,1,1,0,1,2,1,1,1,1,1,1,1] |
+---------------+------------------------------------------------------------+
As it turns out, we only have to remove the WHERE clause and order by everybody’s name to get the adjacency matrix for everyone:

MATCH p1:Person, p2:Person
WITH p1, p2
MATCH p = p1-[?:MEMBER_OF]->()<-[?:MEMBER_OF]-p2
WITH p1.name AS p1, p2.name AS p2, COUNT(p) AS links
ORDER BY p2
RETURN p1, COLLECT(links) AS row
ORDER BY p1

+------------------+------------------------------------------------------------+
| p1               | row                                                        |
+------------------+------------------------------------------------------------+
| "Abiel Ruddock"  | [0,1,1,1,0,1,0,1,0,0,1,1,1,0,1,2,0,1,0,1,1,1,2,2,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,1,1,0,0,2,2,0,0,1,1,2,1,1,1,0,1,0,1,1,0,0,2,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,0,2,1,2,1,0,0,0,0,1,1,0,1,0,0,1,0,2,0,0,1,0,0,0,1,0,0,2,0,1,0,1,1,1,0,0,1,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,1,1,0,1,1,1,2,0,0,1,1,0,0,2,0,1,2,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,2,1,0,1,1,1,1,1,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,2,1,0,0,1,1,1,1,0,1,0,0,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,0,1,0,2,1,1,0,0,2,0,1,0,0,0,0,1,0,1,0,1,0,1,0] |
| "Abraham Hunt"   | [1,0,1,1,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,1,0,1,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,1,1,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,1,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0] |
...
+------------------+------------------------------------------------------------+
254 rows
9897 ms

The next step was to wire up the query results with the JBLAS code that I wrote in the previous post. I ended up with the following:

public class Neo4jAdjacencyMatrixSpike {
    public static void main(String[] args) throws SQLException {
        ClientResponse response = client()
                .resource("http://localhost:7474/db/data/cypher")
                .entity(queryAsJson(), MediaType.APPLICATION_JSON)
                .accept(MediaType.APPLICATION_JSON)
                .post(ClientResponse.class);

        JsonNode result = response.getEntity(JsonNode.class);
        ArrayNode rows = (ArrayNode) result.get("data");

        List<Double> principalEigenvector =
                JBLASSpike.getPrincipalEigenvector(new DoubleMatrix(asMatrix(rows)));

        List<Person> people = asPeople(rows);
        updatePeopleWithEigenvector(people, principalEigenvector);

        System.out.println(sort(people).take(10));
    }

    private static double[][] asMatrix(ArrayNode rows) {
        double[][] matrix = new double[rows.size()][254];
        int rowCount = 0;
        for (JsonNode row : rows) {
            ArrayNode matrixRow = (ArrayNode) row.get(2);

            double[] rowInMatrix = new double[254];
            matrix[rowCount] = rowInMatrix;

            int columnCount = 0;
            for (JsonNode jsonNode : matrixRow) {
                matrix[rowCount][columnCount] = jsonNode.asInt();
                columnCount++;
            }
            rowCount++;
        }
        return matrix;
    }

    // rest cut for brevity
}

Here we take the query result and convert it into an array of arrays before passing it to our JBLAS code to calculate the principal eigenvector. We then return the top 10 people:

Person{name='William Cooper', eigenvector=0.172604992239612, nodeId=68},
Person{name='Nathaniel Barber', eigenvector=0.17260499223961198, nodeId=18},
Person{name='John Hoffins', eigenvector=0.17260499223961195, nodeId=118},
Person{name='Paul Revere', eigenvector=0.17171142003936804, nodeId=207},
Person{name='Caleb Davis', eigenvector=0.16383970722169897, nodeId=71},
Person{name='Caleb Hopkins', eigenvector=0.16383970722169897, nodeId=121},
Person{name='Henry Bass', eigenvector=0.16383970722169897, nodeId=21},
Person{name='Thomas Chase', eigenvector=0.16383970722169897, nodeId=54},
Person{name='William Greenleaf', eigenvector=0.16383970722169897, nodeId=104},
Person{name='Edward Proctor', eigenvector=0.15600043886738055, nodeId=201}

I get back the same 10 people as Kieran Healy, although they have different eigenvector values. As far as I understand, the absolute value doesn’t matter; what’s more important is the relative score compared to other people, so I think we’re OK.
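The JBLASSpike.getPrincipalEigenvector method comes from the earlier post and isn’t reproduced here. As a rough sketch of what such a calculation can look like — this uses plain power iteration rather than whatever JBLASSpike actually does, and the class and method names are only stand-ins — something like the following works for a non-negative adjacency matrix:

import org.jblas.DoubleMatrix;

import java.util.ArrayList;
import java.util.List;

public class PrincipalEigenvectorSketch {

    // Power iteration: repeatedly multiply a start vector by the matrix and
    // renormalise; for a non-negative (connected) adjacency matrix this converges
    // to the principal eigenvector, whose entries are the eigenvector centralities.
    public static List<Double> getPrincipalEigenvector(DoubleMatrix matrix) {
        DoubleMatrix vector = DoubleMatrix.ones(matrix.getColumns());
        for (int iteration = 0; iteration < 100; iteration++) {
            vector = matrix.mmul(vector);
            vector = vector.div(vector.norm2()); // keep the vector at unit length
        }

        List<Double> result = new ArrayList<Double>();
        for (int i = 0; i < vector.length; i++) {
            result.add(vector.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // A tiny symmetric adjacency matrix: node 0 is connected to nodes 1 and 2.
        DoubleMatrix adjacency = new DoubleMatrix(new double[][]{
                {0, 1, 1},
                {1, 0, 0},
                {1, 0, 0}
        });
        System.out.println(getPrincipalEigenvector(adjacency));
    }
}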
The final step was to write these eigenvector values back into neo4j, which we can do with the following code:

private static void updateNeo4jWithEigenvectors(List<Person> people) {
    for (Person person : people) {
        ObjectNode request = JsonNodeFactory.instance.objectNode();
        request.put("query", "START p = node({nodeId}) SET p.eigenvectorCentrality={value}");

        ObjectNode params = JsonNodeFactory.instance.objectNode();
        params.put("nodeId", person.nodeId);
        params.put("value", person.eigenvector);
        request.put("params", params);

        client()
                .resource("http://localhost:7474/db/data/cypher")
                .entity(request, MediaType.APPLICATION_JSON)
                .accept(MediaType.APPLICATION_JSON)
                .post(ClientResponse.class);
    }
}

Now we might use that eigenvector centrality value in other queries, such as one to show who the most central, and potentially most influential, people are in each group:

MATCH g:Group<-[:MEMBER_OF]-p
WITH g.name AS group, p.name AS personName, p.eigenvectorCentrality AS eigen
ORDER BY eigen DESC
WITH group, COLLECT(personName) AS people
RETURN group, HEAD(people) + [HEAD(TAIL(people))] + [HEAD(TAIL(TAIL(people)))] AS mostCentral

+-------------------+------------------------------------------------------+
| group             | mostCentral                                          |
+-------------------+------------------------------------------------------+
| "StAndrewsLodge"  | ["Paul Revere","Joseph Warren","Thomas Urann"]       |
| "BostonCommittee" | ["William Cooper","Nathaniel Barber","John Hoffins"] |
| "LoyalNine"       | ["Caleb Hopkins","William Greenleaf","Caleb Davis"]  |
| "LondonEnemies"   | ["William Cooper","Nathaniel Barber","John Hoffins"] |
| "LongRoomClub"    | ["Paul Revere","John Hancock","Benjamin Clarke"]     |
| "NorthCaucus"     | ["William Cooper","Nathaniel Barber","John Hoffins"] |
| "TeaParty"        | ["William Cooper","Nathaniel Barber","John Hoffins"] |
+-------------------+------------------------------------------------------+

7 rows
280 ms

Our top ten feature frequently, although it’s interesting that only one of them is in the ‘LongRoomClub’ group, which perhaps indicates that people in that group are less likely to be members of the other ones. I’d be interested if anyone can think of other potential uses for eigenvector centrality once we’ve got it back in the graph. All the code described in this post is on github if you want to take it for a spin.
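The Person class itself, and the sort(...).take(10) helper used earlier, aren’t shown in the listings. Purely as an assumption of what a minimal version might look like — a simple value holder ordered by descending eigenvector score, with a plain-Java stand-in for the sorting helper — something like this would do:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Person implements Comparable<Person> {
    public final String name;
    public double eigenvector; // filled in once the JBLAS calculation has run
    public final long nodeId;

    public Person(String name, double eigenvector, long nodeId) {
        this.name = name;
        this.eigenvector = eigenvector;
        this.nodeId = nodeId;
    }

    // Highest eigenvector centrality first.
    @Override
    public int compareTo(Person other) {
        return Double.compare(other.eigenvector, this.eigenvector);
    }

    @Override
    public String toString() {
        return "Person{name='" + name + "', eigenvector=" + eigenvector + ", nodeId=" + nodeId + "}";
    }

    // A plain-Java stand-in for sort(people).take(10).
    public static List<Person> topTen(List<Person> people) {
        List<Person> sorted = new ArrayList<Person>(people);
        Collections.sort(sorted);
        return sorted.subList(0, Math.min(10, sorted.size()));
    }
}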
August 12, 2013
by Mark Needham
· 5,368 Views