Data Resources

The Latest Data Topics

Getting Started with NHibernate and ASP.NET MVC- CRUD Operations

In this post we are going to learn how to use NHibernate in an ASP.NET MVC application.

What is NHibernate: ORMs (Object Relational Mappers) are quite popular these days. An ORM maps database tables to class entity objects, so you do not have to hand-write data access code and SQL queries; it generates the SQL and fetches the data on our behalf. NHibernate is an ORM of this kind, a port of the popular Java ORM Hibernate. It provides a framework for mapping domain model classes to a traditional relational database. It frees us from writing repetitive ADO.NET code, because it acts as our database layer. Let’s get started with NHibernate.

How to download: There are two ways to download this ORM: as a NuGet package or from the SourceForge site.
NuGet - http://www.nuget.org/packages/NHibernate/
SourceForge - http://sourceforge.net/projects/nhibernate/

Creating a table for CRUD: I am going to use SQL Server 2012 Express edition as the database. The table has four fields: Id, First Name, Last Name and Designation.

Creating an ASP.NET MVC project for NHibernate: Let’s create an ASP.NET MVC project via File -> New Project -> ASP.NET MVC 4 Web Application.

Installing the NuGet package for NHibernate: I installed the NuGet package (package id NHibernate, per the link above) from the Package Manager Console.

NHibernate configuration file: NHibernate needs a configuration file for the database connection and other details. Create a file named ‘hibernate.cfg.xml’ in the Models\Nhibernate folder of your application with the following settings:

    connection provider: NHibernate.Connection.DriverConnectionProvider
    driver class: NHibernate.Driver.SqlClientDriver
    connection string: Server=(local);database=LocalDatabase;Integrated Security=SSPI;
    dialect: NHibernate.Dialect.MsSql2012Dialect

These are the different settings for NHibernate. You need to select the driver class and connection provider to match your database.
If you are using another database such as Oracle or MySQL, the configuration will be different. NHibernate can work with many different databases.

Creating a model class for NHibernate: Now it’s time to create the model class for our CRUD operations. The code is below. The property names match the database table columns.

    namespace NhibernateMVC.Models
    {
        public class Employee
        {
            // Members are virtual so that NHibernate can create proxies.
            public virtual int Id { get; set; }
            public virtual string FirstName { get; set; }
            public virtual string LastName { get; set; }
            public virtual string Designation { get; set; }
        }
    }

Creating a mapping file between class and table: Now we need an XML mapping file between the class and the table, named “Employee.hbm.xml”, in the Nhibernate folder.

Creating a class to open session for NHibernate: I have created a class in the Models folder called NHibertnateSession, with a static function that opens an NHibernate session.

    using System.Web;
    using NHibernate;
    using NHibernate.Cfg;

    namespace NhibernateMVC.Models
    {
        public class NHibertnateSession
        {
            public static ISession OpenSession()
            {
                // Load the connection settings.
                var configuration = new Configuration();
                var configurationPath = HttpContext.Current.Server.MapPath(@"~\Models\Nhibernate\hibernate.cfg.xml");
                configuration.Configure(configurationPath);

                // Register the Employee mapping.
                var employeeConfigurationFile = HttpContext.Current.Server.MapPath(@"~\Models\Nhibernate\Employee.hbm.xml");
                configuration.AddFile(employeeConfigurationFile);

                ISessionFactory sessionFactory = configuration.BuildSessionFactory();
                return sessionFactory.OpenSession();
            }
        }
    }

Listing: Now that the OpenSession method is ready, it’s time to write the controller code that fetches data from the database. The code is below.
    using System;
    using System.Web.Mvc;
    using NHibernate;
    using NHibernate.Linq;
    using System.Linq;
    using NhibernateMVC.Models;

    namespace NhibernateMVC.Controllers
    {
        public class EmployeeController : Controller
        {
            public ActionResult Index()
            {
                using (ISession session = NHibertnateSession.OpenSession())
                {
                    // Materialize the list before the session is disposed.
                    var employees = session.Query<Employee>().ToList();
                    return View(employees);
                }
            }
        }
    }

Here I get a session via the OpenSession method and then query the database to fetch the employees. Let’s create a new view for this; you can do so by right-clicking the action method above and adding a view. We are going to create a strongly typed view. Once the listing screen is ready, running the project fetches the data and displays it.

Create/Add: Now it’s time to write the code that adds an employee. Here I have used the session.Save method to save the new employee. The first method returns a blank view, and the second method, marked with the HttpPost attribute, saves the data into the database.

    public ActionResult Create()
    {
        return View();
    }

    [HttpPost]
    public ActionResult Create(Employee employee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(employee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch (Exception)
        {
            // On failure, show the form again.
            return View();
        }
    }

Now let’s create a strongly typed Create view by right-clicking the action method and choosing Add View. Once you run the application and click on Create New, it loads the create screen.

Edit/Update: Now let’s add the edit functionality with NHibernate and ASP.NET MVC. For that I have written two action methods, one for loading the edit view and another for saving the data. The code is below.
    public ActionResult Edit(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    [HttpPost]
    public ActionResult Edit(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                // Load the stored employee and copy the posted values onto it.
                var employeetoUpdate = session.Get<Employee>(id);
                employeetoUpdate.Designation = employee.Designation;
                employeetoUpdate.FirstName = employee.FirstName;
                employeetoUpdate.LastName = employee.LastName;

                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(employeetoUpdate);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch
        {
            return View();
        }
    }

In the first action method I fetch the existing employee via the Get method of the NHibernate session, and in the second I fetch the current employee and change it with the updated details. You can create the view for this via right click -> Add View. I have created a strongly typed view for edit.

Details: Now it’s time to create a details view where the user can see a single employee. I have written the following logic for it.

    public ActionResult Details(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

You can add the view by right-clicking the action method and choosing Add View, then run the application in the browser to see the details screen.

Delete: Now it’s time to write the delete functionality. I have written the following code for that.
    public ActionResult Delete(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    [HttpPost]
    public ActionResult Delete(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Delete(employee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch (Exception)
        {
            return View();
        }
    }

Here the first action method shows the delete confirmation view, and the second performs the actual delete with the session’s Delete method.

That’s it. It’s very easy to implement CRUD operations with NHibernate. Stay tuned for more.

October 1, 2013

by Jalpesh Vadgama

· 47,326 Views

ElasticSearch: Java API

ElasticSearch provides a Java API that executes all operations asynchronously through a client object.

September 30, 2013

by Hüseyin Akdoğan

CORE

· 137,586 Views · 4 Likes

Clojure: Converting an Array/Set into a Hash Map

When I was implementing the Elo Rating algorithm a few weeks ago, one thing I needed to do was come up with a base ranking for each team. I started out with a set of teams that looked like this:

    (def teams #{ "Man Utd" "Man City" "Arsenal" "Chelsea"})

and I wanted to transform that into a map from the team to their ranking, e.g.

    Man Utd -> {:points 1200}
    Man City -> {:points 1200}
    Arsenal -> {:points 1200}
    Chelsea -> {:points 1200}

I had read the documentation of array-map, a function which can be used to transform a collection of pairs into a map, and it seemed like it might do the trick. I started out by building a flat sequence of key/value pairs using mapcat:

    > (mapcat (fn [x] [x {:points 1200}]) teams)
    ("Chelsea" {:points 1200} "Man City" {:points 1200} "Arsenal" {:points 1200} "Man Utd" {:points 1200})

array-map constructs a map from pairs of values, e.g.

    > (array-map "Chelsea" {:points 1200} "Man City" {:points 1200} "Arsenal" {:points 1200} "Man Utd" {:points 1200})
    {"Chelsea" {:points 1200}, "Man City" {:points 1200}, "Arsenal" {:points 1200}, "Man Utd" {:points 1200}}

Since we have a collection of pairs rather than individual arguments, we need to use the apply function as well:

    > (apply array-map ["Chelsea" {:points 1200} "Man City" {:points 1200} "Arsenal" {:points 1200} "Man Utd" {:points 1200}])
    {"Chelsea" {:points 1200}, "Man City" {:points 1200}, "Arsenal" {:points 1200}, "Man Utd" {:points 1200}}

And if we put it all together we end up with the following:

    > (apply array-map (mapcat (fn [x] [x {:points 1200}]) teams))
    {"Man Utd" {:points 1200}, "Man City" {:points 1200}, "Arsenal" {:points 1200}, "Chelsea" {:points 1200}}

It works, but the function we pass to mapcat feels a bit clunky. Since we just need to create a collection of team/ranking pairs, we can use the vector and repeat functions to build that up instead:

    > (mapcat vector teams (repeat {:points 1200}))
    ("Chelsea" {:points 1200} "Man City" {:points 1200} "Arsenal" {:points 1200} "Man Utd" {:points 1200})

And if we put the apply array-map code back in we still get the desired result:

    > (apply array-map (mapcat vector teams (repeat {:points 1200})))
    {"Chelsea" {:points 1200}, "Man City" {:points 1200}, "Arsenal" {:points 1200}, "Man Utd" {:points 1200}}

Alternatively we could use assoc like this:

    > (apply assoc {} (mapcat vector teams (repeat {:points 1200})))
    {"Man Utd" {:points 1200}, "Arsenal" {:points 1200}, "Man City" {:points 1200}, "Chelsea" {:points 1200}}

I also came across the into function, which seemed useful but takes a collection of vectors:

    > (into {} [["Chelsea" {:points 1200}] ["Man City" {:points 1200}] ["Arsenal" {:points 1200}] ["Man Utd" {:points 1200}]])

We therefore need to change the code to use map instead of mapcat:

    > (into {} (map vector teams (repeat {:points 1200})))
    {"Chelsea" {:points 1200}, "Man City" {:points 1200}, "Arsenal" {:points 1200}, "Man Utd" {:points 1200}}

However, my favourite version so far uses the zipmap function like so:

    > (zipmap teams (repeat {:points 1200}))
    {"Man Utd" {:points 1200}, "Arsenal" {:points 1200}, "Man City" {:points 1200}, "Chelsea" {:points 1200}}

I’m sure there are other ways to do this as well, so if you know any let me know in the comments.
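For readers who think in Java rather than Clojure, the zipmap idea (pair every key with the same starting value) can be sketched as below. This is only an illustration under simplifying assumptions: the class and method names are made up for this example, and the ranking is stored as a plain Integer rather than the {:points 1200} map used above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BaseRankings {
    // Pairs every team with the same starting value, in the spirit of
    // (zipmap teams (repeat value)). The names here are hypothetical.
    static Map<String, Integer> baseRankings(Set<String> teams, int points) {
        Map<String, Integer> rankings = new LinkedHashMap<>();
        for (String team : teams) {
            rankings.put(team, points);
        }
        return rankings;
    }

    public static void main(String[] args) {
        // A TreeSet iterates in sorted order, so the printed order is stable.
        Set<String> teams = new TreeSet<>(
                Arrays.asList("Man Utd", "Man City", "Arsenal", "Chelsea"));
        System.out.println(baseRankings(teams, 1200));
        // prints {Arsenal=1200, Chelsea=1200, Man City=1200, Man Utd=1200}
    }
}
```

The LinkedHashMap keeps whatever iteration order the input set has, which is roughly what array-map does for the order of its arguments.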

September 28, 2013

by Mark Needham

· 13,119 Views

The Real Cost of Change in Software Development

There are two widely opposed (and often misunderstood) positions on how expensive it can be to change or fix software once it has been designed, coded, tested and implemented. One holds that it is extremely expensive to leave changes until late, that the cost of change rises exponentially. The other position is that changes should be left as late as possible, because the cost of changing software is – or at least can be – essentially flat (that’s why we call it software). Which position is right? Why should we care? And what can we do about it?

Exponential Cost of Change

Back in the early 1980s, Barry Boehm published some statistics (Software Engineering Economics, 1981) which showed that the cost of making a software change or fix increases significantly over time. Boehm looked at data collected from Waterfall-based projects at TRW and IBM in the 1970s, and found that the cost of making a change increases as you move from the stages of requirements analysis to architecture, design, coding, testing and deployment. A requirements mistake found and corrected while you are still defining the requirements costs almost nothing. But if you wait until after you've finished designing, coding and testing the system and delivering it to the customer, it can cost up to 100 times as much.

A few caveats here. First, the cost curve is much higher in large projects (in smaller projects, the cost curve is more like 1:4 instead of 1:100). Second, those cases when the cost of change rises up to 100 times are rare: they are what Boehm calls Architecture-Breakers, where the team gets a fundamental architectural assumption wrong (scaling, performance, reliability) and doesn't find out until after customers are already using the system and running into serious operational problems.
This analysis was all done on a small data sample from more than 30 years ago, when developing code was much more expensive and time-consuming and paperworky, and the tools sucked.

A few other studies have been done since then that mostly back up Boehm's findings – at least the basic idea that the longer it takes for you to find out that you made a mistake, the more expensive it is to correct it. These studies have been widely referenced in books like Steve McConnell’s Code Complete, and used to justify the importance of early reviews and testing:

“Studies over the last 25 years have proven conclusively that it pays to do things right the first time. Unnecessary changes are expensive. Researchers at Hewlett-Packard, IBM, Hughes Aircraft, TRW, and other organizations have found that purging an error by the beginning of construction allows rework to be done 10 to 100 times less expensively than when it's done in the last part of the process, during system test or after release (Fagan 1976; Humphrey, Snyder, and Willis 1991; Leffingwell 1997; Willis et al. 1998; Grady 1999; Shull et al. 2002; Boehm and Turner 2004). In general, the principle is to find an error as close as possible to the time at which it was introduced. The longer the defect stays in the software food chain, the more damage it causes further down the chain. Since requirements are done first, requirements defects have the potential to be in the system longer and to be more expensive. Defects inserted into the software upstream also tend to have broader effects than those inserted further downstream. That also makes early defects more expensive.”

There’s some controversy over how accurate and complete this data is, how much we can rely on it, and how relevant it is today when we have much better development tools and many teams have moved from heavyweight sequential Waterfall development to lightweight iterative, incremental development approaches.

Flattening the Cost of Changing Code

The rules of the game should change with iterative and incremental development – because they have to. Boehm realized back in the 1980s that we could catch more mistakes early (and therefore reduce the cost of development) if we think about risks upfront and design and build software in increments, using what he called the Spiral Model, rather than trying to define, design and build software in a Waterfall sequence. The same ideas are behind more modern, lighter Agile development approaches.

In Extreme Programming Explained (the first edition, but not the second) Kent Beck states that minimizing the cost of change is one of the goals of Extreme Programming, and that a flattened change cost curve is “the technical premise of XP”:

“Under certain circumstances, the exponential rise in the cost of changing software over time can be flattened. If we can flatten the curve, old assumptions about the best way to develop software no longer hold … You would make big decisions as late in the process as possible, to defer the cost of making the decisions and to have the greatest possible chance that they would be right. You would only implement what you had to, in hopes that the needs you anticipate for tomorrow wouldn't come true. You would introduce elements to the design only as they simplified existing code or made writing the next bit of code simpler.”

It’s important to understand that Beck doesn't say that with XP the change curve is flat.
He says that these costs can be flattened if teams work toward this, leveraging key practices and principles in XP, such as:

- Simple Design, doing the simplest thing that works, and deferring design decisions as late as possible (YAGNI), so that the design is easy to understand and easy to change
- Continuous, disciplined refactoring to keep the code easy to understand and easy to change
- Test-First Development – writing automated tests upfront to catch coding mistakes immediately, and to build up a testing safety net to catch mistakes in the future
- Developers collaborating closely and constantly with the customer to confirm their understanding of what they need to build, and working together in pairs to design solutions and solve problems, and catch mistakes and misunderstandings early
- Relying on working software over documentation to minimize the amount of paperwork that needs to be done with each change (write code, not specs)
- The team’s experience working incrementally and iteratively – the more that people work and think this way, the better they will get at it.

All of this makes sense and sounds right, although there are no studies that back up these assertions, which is why Beck dropped this change curve discussion from the second edition of his XP book. But, by then, the idea that change could be flat with Agile development had already become accepted by many people.

The Importance of Feedback

Scott Ambler agrees that the cost curve can be flattened in Agile development, not because of Simple Design, but because of the feedback loops that are fundamental to iterative, incremental development. Agile methods optimize feedback within the team, with developers working closely together with each other and with the customer and relying on continuous face-to-face communications. Following technical practices like test-first development, pair programming and continuous integration makes these feedback loops even tighter.
But what really matters is getting feedback from the people using the system – it’s only then that you know if you got it right or what you missed. The longer that it takes to design and build something and get feedback from real users, the more time and work that is required to get working software into a real customer’s hands, the higher your cost of change really is.

Optimizing and streamlining this feedback loop is what is driving the lean startup approach to development: defining a minimum viable product (something that just barely does the job), getting it out to customers as quickly as you can, and then responding to user feedback through continuous deployment and A/B testing techniques until you find out what customers really want.

Even Flat Change Can Still Be Expensive

Even if you do everything to optimize these feedback loops and minimize your overheads, this still doesn’t mean that change will come cheap. Being fast isn’t good enough if you make too many mistakes along the way. The Post Agilist uses the example of painting a house: Assume that it costs $1,000 each time you paint the house, whether you paint it blue, red or white. The cost of change is flat. But if you have to paint it blue first, then red, then white before everyone is happy, you’re wasting time and money.

“No matter how expensive or cheap the "cost of change" curve may be, the fewer changes that are made, the cheaper and faster the result will be … Planning is not a four letter word.” (However, I would like to point out that “plan” is.)

Spending too much time upfront in planning and design is waste. But not spending enough time upfront to find out what you should be building and how you should be building it before you build it, and not taking the care to build it carefully, is also a waste.

Change Gets More Expensive Over Time

You also have to accept that the incremental cost of change will go up over the life of a system, especially once a system is being used.
This is not just a technical debt problem. The more people using the system, the more people who might be impacted by the change if you get it wrong, the more careful you have to be. This means that you need to spend more time on planning and communicating changes, building and testing a roll-back capability, and rolling changes out slowly using canary releases and dark launching – which add costs and delays to getting feedback. There are also more operational dependencies that you have to understand and take care of, and more data that you have to change or fix up, making changes even more difficult and expensive. If you do things right, keep a good team together and manage technical debt responsibly, these costs should rise gently over the life of a system – and if you don’t, that exponential change curve will kick in.

What is the real cost of change?

Is the real cost of change exponential, or is it flat? The truth is somewhere in between. There’s no reason that the cost of making a change to software has to be as high as it was 30 years ago. We can definitely do better today, with better tools and better, cheaper ways of developing software. The keys to minimizing the costs of change seem to be:

- Get your software into customer hands as quickly as you can. I am not convinced that any organization really needs to push out software changes 10 to 50 to 100 times a day, but you don’t want to wait months or years for feedback, either. Deliver less, but more often. And because you’re going to deliver more often, it makes sense to build a continuous delivery pipeline so that you can push changes out efficiently and with confidence. Use ideas from lean software development and maybe Kanban to identify and eliminate waste and to minimize cycle time.
- We know that, even with lots of upfront planning and design thinking, we won’t get everything right upfront -- this is the Waterfall fallacy. But it’s also important not to waste time and money iterating when you don’t need to. Spending enough time upfront in understanding requirements and in design to get it at least mostly right the first time can save a lot later on.
- Whether you’re working incrementally and iteratively, or sequentially, it makes good sense to catch mistakes early when you can, whether you do this through test-first development and pairing, or requirements workshops and code reviews -- whatever works for you.

September 20, 2013

by Jim Bird

· 22,191 Views

Top 10 Methods for Java Arrays

The following are the top 10 methods for Java arrays, drawn from the most-voted questions on Stack Overflow.

0. Declare an array

    String[] aArray = new String[5];
    String[] bArray = {"a","b","c", "d", "e"};
    String[] cArray = new String[]{"a","b","c","d","e"};

1. Print an array in Java

    int[] intArray = { 1, 2, 3, 4, 5 };
    String intArrayString = Arrays.toString(intArray);
    // printing the array directly prints its reference value
    System.out.println(intArray); // [I@7150bd4d
    System.out.println(intArrayString); // [1, 2, 3, 4, 5]

2. Create an ArrayList from an array

    String[] stringArray = { "a", "b", "c", "d", "e" };
    ArrayList<String> arrayList = new ArrayList<String>(Arrays.asList(stringArray));
    System.out.println(arrayList); // [a, b, c, d, e]

3. Check if an array contains a certain value

    String[] stringArray = { "a", "b", "c", "d", "e" };
    boolean b = Arrays.asList(stringArray).contains("a");
    System.out.println(b); // true

4. Concatenate two arrays

    int[] intArray = { 1, 2, 3, 4, 5 };
    int[] intArray2 = { 6, 7, 8, 9, 10 };
    // Apache Commons Lang library
    int[] combinedIntArray = ArrayUtils.addAll(intArray, intArray2);

5. Declare an array inline

    method(new String[]{"a", "b", "c", "d", "e"});

6. Join the elements of an array into a single String

    // Apache Commons Lang
    String j = StringUtils.join(new String[] { "a", "b", "c" }, ", ");
    System.out.println(j); // a, b, c

7. Convert an ArrayList to an array

    String[] stringArray = { "a", "b", "c", "d", "e" };
    ArrayList<String> arrayList = new ArrayList<String>(Arrays.asList(stringArray));
    String[] stringArr = new String[arrayList.size()];
    arrayList.toArray(stringArr);
    for (String s : stringArr)
        System.out.println(s);

8. Convert an array to a Set

    Set<String> set = new HashSet<String>(Arrays.asList(stringArray));
    System.out.println(set); //[d, e, b, c, a]

9. Reverse an array

    int[] intArray = { 1, 2, 3, 4, 5 };
    // Apache Commons Lang; reverses the array in place
    ArrayUtils.reverse(intArray);
    System.out.println(Arrays.toString(intArray)); //[5, 4, 3, 2, 1]

10. Remove an element of an array

    int[] intArray = { 1, 2, 3, 4, 5 };
    int[] removed = ArrayUtils.removeElement(intArray, 3); // creates a new array
    System.out.println(Arrays.toString(removed));

One more – convert an int to a byte array

    byte[] bytes = ByteBuffer.allocate(4).putInt(8).array();
    for (byte t : bytes) {
        System.out.format("0x%x ", t);
    }

In addition, do you know what arrays look like in memory?
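Several of the snippets above depend on Apache Commons Lang (ArrayUtils, StringUtils). If that library is not on the classpath, the same operations can be approximated with the JDK alone. The sketch below is one possible way to do it, not the library's implementation; the class and method names are invented for this example. It also covers a common pitfall with item 3: Arrays.asList applied to an int[] produces a list with a single int[] element, so contains(3) on it returns false, and primitive arrays need a different check.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class PlainJdkArrays {
    // Item 4 without ArrayUtils.addAll: copy both arrays into a new one.
    static int[] concat(int[] first, int[] second) {
        int[] result = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, result, first.length, second.length);
        return result;
    }

    // Item 9 without ArrayUtils.reverse: returns a reversed copy and,
    // unlike the in-place library call, leaves the input untouched.
    static int[] reversed(int[] source) {
        int[] result = new int[source.length];
        for (int i = 0; i < source.length; i++) {
            result[source.length - 1 - i] = source[i];
        }
        return result;
    }

    // Item 10 without ArrayUtils.removeElement: drops the first match only.
    static int[] removeFirst(int[] source, int value) {
        for (int i = 0; i < source.length; i++) {
            if (source[i] == value) {
                int[] result = new int[source.length - 1];
                System.arraycopy(source, 0, result, 0, i);
                System.arraycopy(source, i + 1, result, i, source.length - i - 1);
                return result;
            }
        }
        return source.clone(); // value not present: return an unchanged copy
    }

    // Item 3 for primitive arrays, where Arrays.asList(...).contains fails.
    static boolean contains(int[] source, int value) {
        return IntStream.of(source).anyMatch(x -> x == value);
    }

    // Item 6 without StringUtils.join (String.join exists since Java 8).
    static String join(String[] parts, String separator) {
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        int[] a = { 1, 2, 3, 4, 5 };
        int[] b = { 6, 7, 8, 9, 10 };
        System.out.println(Arrays.toString(concat(a, b)));      // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        System.out.println(Arrays.toString(reversed(a)));       // [5, 4, 3, 2, 1]
        System.out.println(Arrays.toString(removeFirst(a, 3))); // [1, 2, 4, 5]
        System.out.println(contains(a, 3));                     // true
        System.out.println(join(new String[] { "a", "b", "c" }, ", ")); // a, b, c
    }
}
```

Returning new arrays instead of mutating the input is a design choice made here for clarity; an in-place reverse would avoid the extra allocation.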

September 18, 2013

by Ryan Wang

· 71,517 Views · 2 Likes

Solving the Detached Many-to-Many Problem with the Entity Framework

Introduction

This article is part of the ongoing series I’ve been writing recently, but can be read as a standalone article. I’m going to do a better job of integrating the changes documented here into the ongoing solution I’ve been building. However, considering how much time and effort I put into solving this issue, I’ve decided to document the approach independently in case it is of use to others in the interim.

The Problem Defined

This issue presents itself when you are dealing with disconnected/detached Entity Framework POCO objects, as the DbContext doesn’t track changes to such entities. Specifically, trouble occurs with entities participating in a many-to-many relationship, where the EF has hidden a “join table” from the model itself. The problem with detached entities is that the data context has no way of knowing what changes have been made to an object graph without fetching the data from the data store and doing an entity-by-entity comparison – and that is assuming it’s possible to fetch the data the same way as it was originally. In this solution, all the entities are detached, don’t use proxy types and are designed to move between WCF service boundaries.

Some Inspiration

There are no out-of-the-box solutions that I’m aware of which can process POCO object graphs that are detached. I did find an interesting solution called GraphDiff, which is available from github and also as a NuGet package, but it didn’t work with the latest RC version of the Entity Framework (v6). I also found a very comprehensive article on how to implement a generic repository pattern with the Entity Framework, but it was unable to handle detached many-to-many relationships. In any case, I highly recommend a read of this article; it was the inspiration for some of the approach I’ve ended up taking with my own design.

The Approach

This morning I put together a simple data model with the relationships that I wanted to support with detached entities.
I’ve attached the solution with a sample schema and test data at the bottom of this article. If you prefer to open and play with it, be sure to add the Entity Framework (v6 RC) via NuGet (I’ve omitted it for file size and licensing reasons). The original post showed three views of the model at this point: a logical view of the model I wanted to support, the schema view from SQL Server, and the Entity Model generated from that SQL schema.

In the spirit of punching myself in the head, I’ve elected to have one table implement an identity specification (meaning the underlying schema allocates PK ID values), whereas for the other two tables the ID must be specified. Theoretically, if I can handle the entity types in a generic fashion, then this solution can scale out to larger and more complex models.

The scenarios I’m specifically looking to solve in this solution with detached object graphs are as follows:

- Add a relationship (many-to-many)
- Add a relationship (FK-based)
- Update a related entity (many-to-many)
- Update a related entity (FK-based)
- Remove a relationship (many-to-many)
- Remove a relationship (FK-based)

Per the above, here are the scenarios within the context of the above data model:

- Add a new Secondary entity to a Primary entity
- Add an Other entity to a Secondary entity
- Update a Secondary entity by updating a Primary entity
- Update an Other entity from a Secondary entity (or Primary entity)
- Remove (but not delete!) a Secondary entity from a Primary entity
- Remove (but not delete) an Other entity from a Secondary entity

Establishing Test Data

Just to give myself a baseline, the data model is populated (by default) with some test data. This gives us some “existing entities” to query and modify.

More Work for the Consumer

Although I tried my best, I couldn’t come to a design which didn’t require the consuming client to do slightly more work to enable this to work properly.
Unfortunately the best place for change tracking to occur with disconnected entities is with the layer making changes – be it a business layer or something downstream. To this effect, entities will need to implement a property which reflects the state of the entity (added, modified, deleted etc.). For the object graph to be updated/managed successfully, the consumer of the entities needs to set the entity state properly. This isn’t at all as bad as it sounds, but it’s not nothing.

Establishing some Scaffolding

After generating the data model, the first thing to be done is to ensure each entity derives from the same base class (“EntityBase”). This is used later to establish the active state of an entity when it needs to be processed. I’ve also created an enum (“ObjectState”), which is a property of the base class, and a helper function which maps ObjectState to an EF EntityState.

Constructing Data Access

To ensure that the usage is consistent, I’ve defined a single Data Access class, mainly to establish the pattern for handling detached object graphs. I can’t stress enough that this is not intended as a guide to an appropriate way to structure your data access – I’ll be updating my ongoing series of articles to go into more detail – this is only to articulate a design approach to handling detached object graphs.

Having said all that, my “DataAccessor” class can be used with any of the entity types by way of generics. As with my ongoing project, the Entity Framework DbContext is instantiated by this class on construction, and the class implements IDisposable to ensure the DbContext is disposed properly when the accessor itself is disposed.
Here’s the constructor showing the EF configuration options I’m using:

    public DataAccessor()
    {
        _accessor = new SampleEntities();
        _accessor.Configuration.LazyLoadingEnabled = false;
        _accessor.Configuration.ProxyCreationEnabled = false;
    }

Updating an Entity

We start with a basic scenario to ensure that the scaffolding has been implemented properly. The scenario is to query for a Primary entity and then change a property and update the entity in the data store.

    [TestMethod]
    public void UpdateSingleEntity()
    {
        Primary existing = null;
        String existingValue = String.Empty;

        using (DataAccessor a = new DataAccessor())
        {
            existing = a.DataContext.Primaries.Include("Secondaries").First();
            Assert.IsNotNull(existing);
            existingValue = existing.Title;
            existing.Title = "Unit " + DateTime.Now.ToString("MMdd hh:mm:ss");
        }

        using (DataAccessor b = new DataAccessor())
        {
            existing.State = ObjectState.Modified;
            b.InsertOrUpdate(existing);
        }

        using (DataAccessor c = new DataAccessor())
        {
            // restore the original title
            existing.Title = existingValue;
            existing.State = ObjectState.Modified;
            c.InsertOrUpdate(existing);
        }
    }

You’ll notice that there is nothing particularly significant here, except that the object’s State is reset to Modified between operations.

Updating a Many-to-Many Relationship

Now things get interesting. I’m going to query for a Primary entity, then I’ll update both a property of the Primary entity itself and a property of one of the entity’s relationships.
    [TestMethod]
    public void UpdateManyToMany()
    {
        Primary existing = null;
        Secondary other = null;
        String existingValue = String.Empty;
        String existingOtherValue = String.Empty;

        using (DataAccessor a = new DataAccessor())
        {
            //Note that we include the navigation property in the query
            existing = a.DataContext.Primaries.Include("Secondaries").First();
            Assert.IsTrue(existing.Secondaries.Count() >= 1, "Should be at least 1 linked item");
        }

        //save the original description
        existingValue = existing.Description;
        //set a new dummy value (with a date/time so we can see it working)
        existing.Description = "Edit " + DateTime.Now.ToString("yyyyMMdd hh:mm:ss");
        existing.State = ObjectState.Modified;

        other = existing.Secondaries.First();
        //save the original value
        existingOtherValue = other.AlternateDescription;
        //set a new value
        other.AlternateDescription = "Edit " + DateTime.Now.ToString("yyyyMMdd hh:mm:ss");
        other.State = ObjectState.Modified;

        //a new data access class (new DbContext)
        using (DataAccessor b = new DataAccessor())
        {
            //single method to handle inserts and updates
            //set a breakpoint here to see the result in the DB
            b.InsertOrUpdate(existing);
        }

        //return the values to the original ones
        existing.Description = existingValue;
        other.AlternateDescription = existingOtherValue;
        existing.State = ObjectState.Modified;
        other.State = ObjectState.Modified;

        using (DataAccessor c = new DataAccessor())
        {
            //update the entities back to normal
            //set a breakpoint here to see the data before it reverts back
            c.InsertOrUpdate(existing);
        }
    }

If we run this unit test and set the breakpoints accordingly, we see the following in the database:

[Screenshots: database at breakpoint #1; database at breakpoint #2; database when the unit test completes]

You’ll notice at the second breakpoint that the descriptions of both the Primary entity and its first Secondary entity have been updated.
Examining the Insert/Update Code

The function exposed by the “data access” class really just passes through to another private function which does the heavy lifting. This is mainly in case we need to reuse the logic, since it essentially applies the state-driven actions to attached entities.

    public void InsertOrUpdate<T>(params T[] entities) where T : EntityBase
    {
        ApplyStateChanges(entities);
        DataContext.SaveChanges();
    }

Here’s the definition of the ApplyStateChanges function, which I’ll discuss below:

    private void ApplyStateChanges<T>(params T[] items) where T : EntityBase
    {
        DbSet<T> dbSet = DataContext.Set<T>();
        foreach (T item in items)
        {
            //loads related entities into the current context
            dbSet.Attach(item);
            if (item.State == ObjectState.Added || item.State == ObjectState.Modified)
            {
                //AddOrUpdate lives in System.Data.Entity.Migrations
                dbSet.AddOrUpdate(item);
            }
            else if (item.State == ObjectState.Deleted)
            {
                dbSet.Remove(item);
            }

            //translate the client-supplied ObjectState of every tracked,
            //not yet processed entity into an EF EntityState
            foreach (DbEntityEntry<EntityBase> entry in DataContext.ChangeTracker.Entries<EntityBase>()
                .Where(c => c.Entity.State != ObjectState.Processed && c.Entity.State != ObjectState.Unchanged))
            {
                var y = DataContext.Entry(entry.Entity);
                y.State = HelperFunctions.ConvertState(entry.Entity.State);
                entry.Entity.State = ObjectState.Processed;
            }
        }
    }

Notes on this Implementation

This function iterates through the items to be examined, attaches them to the current data context (which also attaches their children), acts on each item accordingly (add/update/remove), and then processes any new entities which have been added to the data context’s change tracker. For each newly “discovered” entity (ignoring entities which are unchanged or have already been examined), the state of its DbEntityEntry is set according to the entity’s ObjectState (which is set by the calling client). Doing this allows the Entity Framework to understand what actions it needs to perform on the entities when SaveChanges() is invoked later.
You’ll also note that I set the entity’s state to “Processed” once it has been examined, so we don’t act on it more than once (for performance purposes). Fun note: the AddOrUpdate extension method is something I found in the System.Data.Entity.Migrations namespace, and it acts as an ‘upsert’ operation, inserting or updating entities depending on whether they already exist. Bonus! That’s it for adding and updating, believe it or not.

Corresponding Unit Test

The following unit test creates a new many-to-many entity, then removes it (by relationship), and finally deletes it altogether from the database:

    [TestMethod]
    public void AddRemoveRelationship()
    {
        Primary existing = null;
        using (DataAccessor a = new DataAccessor())
        {
            existing = a.DataContext.Primaries.Include("Secondaries").FirstOrDefault();
            Assert.IsNotNull(existing);
        }

        // Create a new related entity and link it to the parent
        Secondary newEntity = new Secondary();
        newEntity.State = ObjectState.Added;
        newEntity.AlternateTitle = "Unit";
        newEntity.AlternateDescription = "Test";
        newEntity.SecondaryId = 1000;
        existing.Secondaries.Add(newEntity);

        using (DataAccessor a = new DataAccessor())
        {
            //breakpoint #1 here
            a.InsertOrUpdate(existing);
        }

        newEntity.State = ObjectState.Unchanged;
        existing.State = ObjectState.Modified;

        using (DataAccessor b = new DataAccessor())
        {
            //breakpoint #2 here
            b.RemoveEntities(existing, x => x.Secondaries, newEntity);
        }

        using (DataAccessor c = new DataAccessor())
        {
            //breakpoint #3 here
            c.Delete(newEntity);
        }
    }

Test Results:

[Screenshots: pre-test; breakpoint #1; breakpoint #2; breakpoint #3; post execution (new entity deleted); SQL Profiler trace]

Removing a Many-to-Many Relationship

Now this is where it gets tricky. I’d like to have something a little more polished, but the best I have come up with to date is a separate operation on the data provider which exposes functionality akin to “remove relationship”.
The fundamental problem with how unmodified EF POCO entities work is that, when they are detached, removing a many-to-many relationship means physically removing the related entity from the collection. When the object graph is sent back for processing, there’s a missing related entity, and the service or data context would have to assume that the omission was on purpose, not to mention that it would have to compare against the data currently in the data store. To make this easier, I’ve implemented a function called “RemoveEntities” which alters the relationship between the parent and the child/children. The one big catch is that you need to specify the navigation property or collection, which might make it slightly undesirable to implement generically. In any case, I’ve provided two options, with the navigation property as a string parameter or as a LINQ expression; they both do the same thing.

    public void RemoveEntities<T, T2>(T parent, Expression<Func<T, object>> expression, params T2[] children)
        where T : EntityBase
        where T2 : EntityBase
    {
        DataContext.Set<T>().Attach(parent);
        ObjectContext obj = DataContext.ToObjectContext();
        foreach (T2 child in children)
        {
            DataContext.Set<T2>().Attach(child);
            //mark only the relationship as deleted, not the child entity itself
            obj.ObjectStateManager.ChangeRelationshipState(parent, child, expression, EntityState.Deleted);
        }
        DataContext.SaveChanges();
    }

Notes on this Implementation

“ToObjectContext” is an extension method, akin to (DataContext as IObjectContextAdapter).ObjectContext. It exposes a more fundamental part of the Entity Framework’s object model; we need this level of access to get to the functionality which controls relationships. For each child to be removed (note: not deleted from the physical database), we nominate the parent object, the child, the navigation property (collection), and the nature of the relationship change (delete). Note that this will NOT WORK for relationships defined by foreign keys; more on that below.
To delete entities which have active relationships, you’ll need to drop the relationship before attempting to delete, or else you’ll get data integrity/referential integrity errors, unless you have accounted for cascading deletion (which I haven’t). Example execution:

    using (DataAccessor c = new DataAccessor())
    {
        //c.RemoveEntities(existing, "Secondaries", s);
        //(or can use an expression):
        c.RemoveEntities(existing, x => x.Secondaries, s);
    }

Removing FK Relationships

As mentioned above, you can’t just edit the relationship to remove an FK-based relationship. Instead, you have to follow the EF practice of setting the FK entity to NULL. Here’s a unit test which demonstrates how this is achieved:

    Secondary s = ExistingEntity();
    using (DataAccessor c = new DataAccessor())
    {
        //clear both the navigation property and the FK value
        s.Other = null;
        s.OtherId = null;
        s.State = ObjectState.Modified;
        //o is not declared in this excerpt; presumably the Other entity that s referenced
        o.State = ObjectState.Unchanged;
        c.InsertOrUpdate(s);
    }

We use the same “Insert or Update” call, being aware that you have to set the ObjectState properties accordingly. Note: I’m in the process of testing the reverse removal, i.e. what happens if you want to remove a Secondary entity from an Other entity’s collection.

Deleting Entities

This is fairly straightforward, but I’ve taken a few more precautions to ensure that the entity to be deleted is valid on the server side.
    public void Delete<T>(params T[] entities) where T : EntityBase
    {
        foreach (T entity in entities)
        {
            //only delete entities that actually exist in the context or data store
            T attachedEntity = Exists(entity);
            if (attachedEntity != null)
            {
                var attachedEntry = DataContext.Entry(attachedEntity);
                attachedEntry.State = EntityState.Deleted;
            }
        }
        DataContext.SaveChanges();
    }

To understand the above, you should take a look at the implementation of the “Exists” function, which essentially checks the local cache and the data store to see if there is an attached representation:

    protected T Exists<T>(T entity) where T : EntityBase
    {
        var objContext = ((IObjectContextAdapter)this.DataContext).ObjectContext;
        var objSet = objContext.CreateObjectSet<T>();
        var entityKey = objContext.CreateEntityKey(objSet.EntitySet.Name, entity);

        DbSet<T> set = DataContext.Set<T>();
        var keys = (from x in entityKey.EntityKeyValues select x.Value).ToArray();

        //Remember, there can be composite keys, so don't assume there's
        //just one column/one value.
        //If a composite key isn't ordered properly, the Set().Find()
        //method will fail; use attributes on the entity to determine the
        //proper order.
        //context.Configuration.AutoDetectChangesEnabled = false;
        return set.Find(keys);
    }

This is a fairly expensive operation, which is why it’s pretty much reserved for deletes and not more frequent operations. It essentially determines the target entity’s primary key and then checks whether the entity exists or not. Note: I haven’t tested this on entities with composite keys, but I’ll get to it at some point. If you have tables with composite keys, you can define the PK column order using attributes on the model entity, but I haven’t done this (yet).

Summary

This article is the culmination of about two days of heavy analysis and investigation. I’ve got a whole lot more to contribute on this topic, but for now I felt it was worth posting as-is. What you’ve got here is still incredibly rough, and I haven’t done nearly enough testing.
To be honest, I was quite excited by the initial results, which is why I decided to write this post. There’s a very good chance that I’ve missed something in the design and implementation, so please be aware of that. I’ll be continuing to refine this approach in my main series of articles, with a much cleaner implementation. In the meantime, I hope this helps anyone out there struggling with detached entities. There are precious few articles and samples that are up to date, and very few that seem to work. This is provided without any warranty of any kind! If you find any issues, please e-mail me [email protected] and I’ll attempt to refactor/debug and find ways around some of the inherent limitations. In the meantime, there are a few helpful links I’ve come across in my travels on the WWW. See below.

Example Solution Files

[ Files ]

Note: you’ll need to add the Entity Framework v6 RC package via NuGet; I haven’t included it in the archive.

Helpful Links

http://blog.magnusmontin.net/2013/05/30/generic-dal-using-entity-framework/
https://github.com/refactorthis/GraphDiff
http://stackoverflow.com/questions/11686225/dbset-find-method-ridiculously-slow-compared-to-singleordefault-on-id
http://stackoverflow.com/questions/10381106/cannot-update-many-to-many-relationships-in-entity-framework
http://stackoverflow.com/questions/8413248/how-to-save-an-updated-many-to-many-collection-on-detached-entity-framework-4-1
http://stackoverflow.com/questions/6018711/generic-way-to-check-if-entity-exists-in-entity-framework

September 18, 2013

by Rob Sanders

· 163,523 Views

Introduction to ElasticSearch

Learn about ElasticSearch, an open source tool developed with Java. It is a Lucene-based, scalable, full-text search engine, and a data analysis tool.

September 17, 2013

by Hüseyin Akdoğan

· 12,113 Views · 5 Likes

Efficient Techniques For Loading Data Into Memory

Data loading usually has to do with initializing cache data on startup. However, quite often caches need to be loaded or reloaded periodically, and not only on startup. When you need to load lots of data, either at startup or at any point afterward, using standard cache put(...) or putAll(...) operations is generally inefficient, especially when transactional boundaries are not important. This is especially true when data has to be partitioned across the network, so you don't know in advance on which node the data will end up.

For fast loading of large amounts of data, GridGain provides a mechanism called the data loader (implemented via GridDataLoader). The data loader batches keys together and collocates those batches with the nodes on which the data will be cached. By controlling the size of the batch and the size of internal transactions it is possible to achieve very fast data loading rates. The code below shows an example of how it can be done:

    // Get the data loader reference.
    try (GridDataLoader<Integer, String> ldr = grid.dataLoader("partitioned")) {
        // Load the entries.
        for (int i = 0; i < ENTRY_COUNT; i++)
            ldr.addData(i, Integer.toString(i));
    }

Whenever data is submitted to the data loader, it is stored in a buffer, which is consumed by loader threads. If the buffer is full, the user thread will block on the addData(...) call until the loader threads free enough room for new entries.

Another method of data preloading is to load it directly from a persistent data store. GridGain supports that via the GridCache.loadCache(...) method. This method of loading data into the cache is very efficient, as it is local, non-transactional, and usually implemented using bulk data store operations. It can afford to be non-transactional because it never overrides values in the cache; it can only insert new values. This means that if some transaction has already updated an entry, this entry will not be overwritten by the loadCache(...) call.

Whenever the GridCache.loadCache(...) method is called, it internally delegates to the underlying persistent store implementation by invoking the GridCacheStore.loadAll(...) method. Usually an implementation of this method loads either the full or a partial set of objects from the DB, depending on requirements. Here is an example of how the GridCacheStore.loadAll(...) method may be implemented:

    @Override
    public void loadAll(@Nullable String cacheName, GridInClosure2 closure, Object... args) throws GridException {
        try (Connection conn = getConnection()) {
            // Load all Persons from database (perhaps to warm up cache?)
            try (PreparedStatement st = conn.prepareStatement("select * from PERSONS")) {
                ResultSet rs = st.executeQuery();
                while (rs.next())
                    closure.apply(
                        // Key.
                        UUID.fromString(rs.getString(1)),
                        // New value.
                        person(rs.getString(1), rs.getString(2), rs.getString(3), rs.getString(4))
                    );
            }
        }
        catch (SQLException e) {
            throw new GridException("Failed to load objects", e);
        }
    }

Note that, instead of returning a collection of loaded entries, this method passes each loaded entry into the closure provided by the system, which avoids creating and repeatedly resizing costly large collections. GridGain will then take the values passed into the closure and store them in the cache. Using the above loading routines will often yield a performance improvement of 10x or more over simple put(...) calls.

September 4, 2013

by Dmitriy Setrakyan

· 9,336 Views

When Reading Excel with POI, Beware of Floating Points

Our problem began when we tried to read a certain cell that contained the value 929 as a numeric field and store it into an integer.

August 30, 2013

by Lieven Doclo

· 48,089 Views · 1 Like

Remove Characters at the Start and End of a String in PHP

In a previous article about how you can remove whitespace from a string, I spoke about using the functions ltrim() and rtrim(). You pass in a string, and ltrim() removes whitespace from the start of it, while rtrim() removes whitespace from the end. But you can also use these functions to remove other characters from a string. They take a second parameter that lets you specify which characters to remove. Note that this parameter is treated as a list of individual characters, not as a word: every leading (or trailing) character that appears in the list is stripped until a character outside the list is reached.

    // Strips any leading 's', 't', 'a' or 'r' characters, not just the word "start"
    ltrim($string, 'start');
    // Strips any trailing 'e', 'n' or 'd' characters, not just the word "end"
    rtrim($string, 'end');

Remove Trailing Slashes From a String

A common use for this functionality is to remove the trailing slash from a URL. Below is a code snippet that allows you to easily do this using the rtrim() function.

    function remove_trailing_slashes( $url ) {
        return rtrim($url, '/');
    }

Another common need is to remove the "http://" from a URL. Because ltrim() works on a character list, ltrim($url, 'http://') would also eat leading characters of the host name (for example, "http://thepath.com" would become "epath.com"), so the scheme has to be matched as a prefix instead. Use the function below to remove both "http://" and "https://" from a URL:

    function remove_http( $url ) {
        // Remove a leading http:// or https:// (case-insensitive) and nothing else
        return preg_replace('#^https?://#i', '', $url);
    }

August 20, 2013

by Paul Underwood

· 41,737 Views

A Consistent Approach To Client-Side Cache Invalidation

Download the source code for this blog entry here: ClientSideCacheInvalidation.zip

TL;DR? Please scroll down to the bottom of this article to review the summary.

I ran into a problem not long ago where some JSON results from an AJAX call to an ASP.NET MVC JsonResult action were being cached by the browser, quite intentionally by design, but were no longer up to date. Without devising a new approach to route manipulation or any of the other fundamental infrastructural designs for the endpoints (because there were too many), our hands were tied. The caching was being done using the ASP.NET OutputCacheAttribute on the action being invoked in the AJAX call, something like this (not really, but this briefly demonstrates caching):

    [OutputCache(Duration = 300)]
    public JsonResult GetData()
    {
        return Json(new { LastModified = DateTime.Now.ToString() }, JsonRequestBehavior.AllowGet);
    }

    @model dynamic
    @{ ViewBag.Title = "Home"; }
    Home
    Reload
    @section scripts { }

Since we were using a generalized approach to output caching (as we should), I knew that any solution to this problem should also be generalized. My first thought was based on the mistaken assumption that the default [OutputCache] behavior was to rely on client-side caching, since client-side caching was what I was observing while using Fiddler. (Mind you, in the above sample this is not the case; the caching is actually server-side, but this is probably because of the amount of data being transferred. I’ll explain after I describe what I did under my false assumption.) Microsoft’s default convention for implementing cache invalidation is to rely on “VaryBy..” semantics, such as varying the route parameters. That is great, except that the route and parameters were not changing in our implementation. So, my initial proposal was to force the caching to be done on the server instead of on the client, and to invalidate when appropriate.
    public JsonResult DoSomething()
    {
        //
        // Do something here that has a side-effect
        // of making the cached data stale
        //
        Response.RemoveOutputCacheItem(Url.Action("GetData"));
        return Json("OK");
    }

    [OutputCache(Duration = 300, Location = OutputCacheLocation.Server)]
    public JsonResult GetData()
    {
        return Json(new { LastModified = DateTime.Now.ToString() }, JsonRequestBehavior.AllowGet);
    }

    Invalidate

    $('#invalidate').on('click', function() {
        $.post($APPROOT + "Home/DoSomething", null, function(o) {
            window.location.reload();
        }, 'json');
    });

While Reload has no effect on the Last modified value, the Invalidate button causes the date to update. When testing, this actually worked quite well. But concerns were raised about the memory payload on the server. Personally, I think the memory payload in practically any server-side caching is negligible, certainly if it is small enough that it would be transmitted over the wire to a client, so long as it is measured in kilobytes or tens of kilobytes and not megabytes. I think the real concern is that transmission. The point of caching is to make the user experience as smooth and seamless as possible with minimal waiting, so if the user is waiting for a (cached) payload, it may be much faster than recalculating or re-acquiring the data, but it is still measurably slower than relying on the browser cache.

The default location for OutputCacheAttribute is actually OutputCacheLocation.Any. This indicates that the item can be cached on the client, on a proxy server, or on the web server.
From my tests, I seemed to observe three behaviors. For tiny payloads, the response was cached on the server and not on the client. For a large payload from GET requests with querystring parameters, the response was cached on the client, but the client sent an HTTP request with an “If-Modified-Since” header, resulting in a 304 Not Modified from the server (indicating it was also cached on the server, and that the server verified that the client’s cache remained valid). For a large payload from GET requests with all parameters in the path, the response was cached on the client without any validation from the client (no HTTP request for an If-Modified-Since check).

Now, to be quite honest, I am only guessing that these were the distinguishing factors behind these observations. I saw variations of these behaviors happening all over the place as I tinkered with scenarios, and this was the initial pattern I felt I was observing. At any rate, for our purposes we were stuck with relying on “Any” as the location, which in theory would drop server-side caching if the server ran short on RAM (I don’t know for sure; the truth can probably be researched, but I don’t have time to get into it).

The point of all this is that we have client-side caching that we cannot get away from. So, how do you invalidate the client-side cache? Technically, you really can’t. The browser controls the cache bucket, and no browser provides hooks into the cache to invalidate entries. But we can get smart about this and work around the problem by bypassing the cached data. Cached HTTP results are stored on the basis of the full raw URL of HTTP GET requests, they are cached with an expiration (in the above sample’s case, 300 seconds, or 5 minutes), and they are only cached if allowed to be cached in the first place as per the HTTP header directives in the HTTP response.
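To make that keying concrete, here is a toy model of a cache keyed by the full raw URL with an expiration time. It is only a sketch: createUrlCache and its put/get methods are hypothetical names invented for this illustration, and real browser caches also honor response headers, validators, and much more.

```javascript
// Toy model only (hypothetical names, not a browser API):
// entries are keyed by the full raw URL, including the querystring.
function createUrlCache() {
  var entries = new Map();
  return {
    // Store a response body for a URL; "now" is a timestamp in milliseconds.
    put: function (url, body, maxAgeSeconds, now) {
      entries.set(url, { body: body, expires: now + maxAgeSeconds * 1000 });
    },
    // Return the cached body, or undefined on a miss or after expiration.
    get: function (url, now) {
      var entry = entries.get(url);
      if (!entry || now >= entry.expires) {
        return undefined;
      }
      return entry.body;
    }
  };
}

var cache = createUrlCache();
cache.put('/Home/GetData', 'cached payload', 300, 0);

var sameUrl = cache.get('/Home/GetData', 1000);        // hit: same URL, not expired
var variedUrl = cache.get('/Home/GetData?_=1', 1000);  // miss: different raw URL
var expired = cache.get('/Home/GetData', 300000);      // miss: the 300 seconds are up
```

In this model the only ways to avoid the stale entry are to never store it, to wait out the expiration, or to ask for a URL that differs from the stored key.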
So, to bypass the cache you either don’t cache at all, or you need to know up front how long the cache should remain until it expires (neither of these being acceptable in a dynamic application), or you need to use POST instead of GET, or you need to vary the URL. Microsoft originally got around the caching problem in ASP.NET 1.x by building the “normal” development cycle around the lifecycle of <form> tags that always used the POST method over HTTP. Responses to POST requests are, in practice, not cached by browsers. But POSTing is not clean, as it does not follow the semantics of the verb if nothing is being sent up and data is only being retrieved. You can also use ETag in the HTTP headers, which isn’t particularly helpful in a dynamic application, as it is no different from a URL + expiration policy.

To summarize, to control cache:

Disable caching from the server in the Response header (Cache-Control: no-cache, or the older Pragma: no-cache)
Predict the lifetime of the content and use an expiration policy
Use POST not GET
ETag
Vary the URL (case-sensitive)

Given our options, we need to vary the URL. There are a number of approaches to this, but almost all of them involve appending or modifying querystring parameters that are expected to be ignored by the server.

    $.getJSON($APPROOT + "Home/GetData?_=" + Date.now(), function (o) {
        $('#results').text("Last modified: " + o.LastModified);
    });

In this sample, the URL is appended with “?_=”+Date.now(), resulting in this URL in the GET:

    /Home/GetData?_=1376170287015

This technique is often referred to as cache-busting. (And if you’re reading this blog article, you’re probably rolling your eyes. “Duh.”) jQuery inherently supports cache-busting, but it does not do it on its own from $.getJSON(). It only does it in $.ajax() when the options parameter includes {cache: false}, unless you invoke $.ajaxSetup({ cache: false }); first to disable all caching. Otherwise, for $.getJSON() you would have to do it manually by appending to the URL.
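As a rough illustration of doing it manually, the helper below appends a cache-busting parameter to an arbitrary URL. It is a minimal sketch, not part of jQuery or of the original sample: the name appendCacheBuster and its optional stamp argument are assumptions made for this example. It handles URLs that already carry a querystring and keeps any fragment at the end.

```javascript
// Hypothetical helper (not part of jQuery or the original sample):
// appends a cache-busting "_" parameter to a URL.
function appendCacheBuster(url, stamp) {
  // Default to the current time in milliseconds, as in the sample above.
  var value = (stamp === undefined) ? Date.now() : stamp;

  // A fragment ("#...") must stay at the very end of the URL.
  var hashIndex = url.indexOf('#');
  var base = hashIndex === -1 ? url : url.slice(0, hashIndex);
  var hash = hashIndex === -1 ? '' : url.slice(hashIndex);

  // Use "&" when a querystring already exists, "?" otherwise.
  var separator = base.indexOf('?') === -1 ? '?' : '&';
  return base + separator + '_=' + encodeURIComponent(value) + hash;
}

var busted = appendCacheBuster('/Home/GetData', 1376170287015);
// busted is '/Home/GetData?_=1376170287015'
```

With jQuery loaded, the call from the earlier sample would then read $.getJSON(appendCacheBuster($APPROOT + "Home/GetData"), callback).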
(Alright, you can stop rolling your eyes at me now, I’m just trying to be thorough here..)

This is not our complete solution. We have a couple of problems we still have to solve.

First of all, in a complex client codebase, hacking at the URL from application logic might not be the most appropriate approach. Consider if you’re using Backbone.js with routes that synchronize objects to and from the server. It would be inappropriate to modify the routes themselves just for cache invalidation. A more generalized cache invalidation technique needs to be implemented in the XHR-invoking AJAX function itself. The approach will depend upon the JavaScript libraries you are using but, for example, if jQuery.getJSON() is being used in application code, then jQuery.getJSON itself could perhaps be replaced with an invalidation routine.

    var gj = $.getJSON;
    $.getJSON = function (url, data, callback) {
        url = invalidateCacheIfAppropriate(url); // todo: implement something like this
        return gj.call(this, url, data, callback);
    };

This is unconventional and probably a bad example, since you’re hacking at a third-party library. A better approach might be to wrap the invocation of $.getJSON() with an application function.

    var getJSONWrapper = function (url, data, callback) {
        url = invalidateCacheIfAppropriate(url); // todo: implement something like this
        return $.getJSON(url, data, callback);
    };

From this point on, instead of invoking $.getJSON() in application code, you would invoke getJSONWrapper, in this example.

The second problem we still need to solve is that the invalidation of cached data that derived from the server needs to be triggered by the server, because it is the server, not the client, that knows that client cached data is no longer up to date. Depending on the application, the client logic might just know by keeping track of what server endpoints it is touching, but it might not!
Besides, a server endpoint might have conditional invalidation triggers; the data might be stale given specific conditions that only the server may know and perhaps only upon some calculation. In other words, invalidation needs to be pushed by the server. One brute force, burdensome, and perhaps a little crazy approach to this might be to use actual “push technology”, formerly “Comet” or “long-polling”, now WebSockets, implemented perhaps with ASP.NET SignalR, where a connection is maintained between the client and the server and the server then has this open socket that can push invalidation flags to the client. We had no need for that level of integration and you probably don’t either, I just wanted to mention it because it might come back as food for thought for a related solution. One scenario I suppose where this might be useful is if another user of the web application has caused the invalidation, in which case the current user will not be in the request/response cycle to acquire the invalidation flag. Otherwise, it is perhaps a reasonable assumption that invalidation is only needed, and only triggered, in the context of a user’s own session. If not, perhaps it is a “good enough” assumption even if it is sometimes not true. The expiration policy can be set low enough that a reasonable compromise can be made between the current user’s changes and changes invoked by other systems or other users. While we may not know what server endpoint might introduce the invalidation of client cache data, we could assume that the invalidation will be triggered by any server endpoint(s), and build invalidation trigger logic on the response of server HTTP responses. To begin implementing some sort of invalidation trigger on the server I could flag invalidations to the client using HTTP header(s). 
    public JsonResult DoSomething()
    {
        //
        // Do something here that has a side-effect
        // of making the cached data stale
        //
        InvalidateCacheItem(Url.Action("GetData"));
        return Json("OK");
    }

    public void InvalidateCacheItem(string url)
    {
        Response.RemoveOutputCacheItem(url); // invalidate on server
        Response.AddHeader("X-Invalidate-Cache-Item", url); // invalidate on client
    }

    [OutputCache(Duration = 300)]
    public JsonResult GetData()
    {
        return Json(new { LastModified = DateTime.Now.ToString() }, JsonRequestBehavior.AllowGet);
    }

At this point, the server is emitting a trigger to the HTTP client that says “as a result of a recent operation, that other URL, the one for GetData, is no longer valid for your current cache, if you have one”. The header alone can be handled by different client implementations (or proxies) in different ways. I didn’t come across any “standard” HTTP response header that does this “officially”, so I’ll come up with a convention here.

Now we need to handle this on the client. Before I do anything else, I need to refactor the existing AJAX functionality on the client so that instead of using $.getJSON, I use $.ajax or some other flexible XHR handler, and wrap it all in custom functions such as httpGET()/httpPOST() and handleResponse().
    var httpGET = function(url, data, callback) {
        return httpAction(url, data, callback, "GET");
    };
    var httpPOST = function (url, data, callback) {
        return httpAction(url, data, callback, "POST");
    };
    var httpAction = function(url, data, callback, method) {
        url = cachebust(url);
        // allow the data argument to be omitted: httpGET(url, callback)
        if (typeof(data) === "function") {
            callback = data;
            data = null;
        }
        $.ajax(url, {
            data: data,
            type: method, // use the requested HTTP method rather than always GET
            success: function(responsedata, status, xhr) {
                handleResponse(responsedata, status, xhr, callback);
            }
        });
    };
    var handleResponse = function (data, status, xhr, callback) {
        handleInvalidationFlags(xhr);
        callback.call(this, data, status, xhr);
    };
    function handleInvalidationFlags(xhr) {
        // not yet implemented
    }
    function cachebust(url) {
        // not yet implemented
        return url;
    }

    // application logic
    httpGET($APPROOT + "Home/GetData", function(o) {
        $('#results').text("Last modified: " + o.LastModified);
    });
    $('#reload').on('click', function() {
        window.location.reload();
    });
    $('#invalidate').on('click', function() {
        // DoSomething is the server action that invalidates the GetData cache item
        httpPOST($APPROOT + "Home/DoSomething", function (o) {
            window.location.reload();
        });
    });

At this point we’re not doing anything new yet; we’ve just broken up the HTTP/XHR functionality into wrapper functions that we can now modify to manipulate the request and to deal with the invalidation flag in the response. All our remaining work will be in handleInvalidationFlags(), for capturing that new header we just emitted from the server, and cachebust(), for hijacking the URLs of future requests.

To deal with the invalidation flag in the response, we need to detect that the header is there and add the invalidated item to a data set that can be stored locally in the browser with web storage. The best place to put this data set is in sessionStorage, which is supported by all current browsers. Putting it in a session cookie (a cookie with no expiration flag) works but is less ideal because it adds to the payload of all HTTP requests.
Putting it in localStorage is less ideal because we do want the invalidation flag(s) to go away when the browser session ends, because that’s when the original browser cache will expire anyway. There is one caveat to sessionStorage: if a user opens a new tab or window, the browser will drop the sessionStorage in that new tab or window, but may reuse the browser cache. The only workaround I know of at the moment is to use localStorage (permanently retaining the invalidation flags) or a session cookie. In our case, we used a session cookie.

Note also that IIS is case-insensitive on URI paths, but HTTP itself is not, and therefore browser caches will not be. We will need to ignore case when matching URLs with cache invalidation flags.

Here is a more or less complete client-side implementation that seems to work in my initial test for this blog entry.

function handleInvalidationFlags(xhr) {
    // capture HTTP header
    var invalidatedItemsHeader = xhr.getResponseHeader("X-Invalidate-Cache-Item");
    if (!invalidatedItemsHeader) return;
    invalidatedItemsHeader = invalidatedItemsHeader.split(';');

    // get invalidation flags from session storage
    var invalidatedItems = sessionStorage.getItem("invalidated-cache-items");
    invalidatedItems = invalidatedItems ? JSON.parse(invalidatedItems) : {};

    // update invalidation flags data set
    for (var i = 0; i < invalidatedItemsHeader.length; i++) {
        invalidatedItems[prepurl(invalidatedItemsHeader[i])] = Date.now();
    }

    // store revised invalidation flags data set back into session storage
    sessionStorage.setItem("invalidated-cache-items", JSON.stringify(invalidatedItems));
}

// since we're using IIS/ASP.NET which ignores case on the path,
// we need a function to force lower-case on the path
function prepurl(u) {
    return u.split('?')[0].toLowerCase() + (u.indexOf("?") > -1 ? "?" + u.split('?')[1] : "");
}

function cachebust(url) {
    // get invalidation flags from session storage
    var invalidatedItems = sessionStorage.getItem("invalidated-cache-items");
    invalidatedItems = invalidatedItems ? JSON.parse(invalidatedItems) : {};

    // if item match, return concatenated URL
    var invalidated = invalidatedItems[prepurl(url)];
    if (invalidated) {
        return url + (url.indexOf("?") > -1 ? "&" : "?") + "_nocache=" + invalidated;
    }

    // no match; return unmodified
    return url;
}

Note that the date/time value of when the invalidation occurred is permanently stored as the concatenation value. This allows the data to remain cached, just updated to that point in time. If invalidation occurs again, that concatenation value is revised to the new date/time. Running this now, after invalidation is triggered by the server, the subsequent request of data is appended with a cache-buster querystring field.

In Summary

A consistent approach to client-side cache invalidation triggered by the server might follow these steps.

1. Use X-Invalidate-Cache-Item as an HTTP response header to flag potentially cached URLs as expired. You might consider using a semicolon-delimited response to list multiple items. (Do not URI-encode the semicolon when using it as a URI list delimiter.) Semicolon is a reserved/invalid character in URI and is a valid delimiter in HTTP headers, so this is valid.

2. Someday, browsers might support this HTTP response header by automatically invalidating browser cache items declared in this header, which would be awesome. In the meantime ...

3. Capture these flags on the client into a data set, and store the data set into session storage in the format: { "http://url.com/route/action": (date_value_of_invalidation_flag), "http://url.com/route/action/2": (date_value_of_invalidation_flag) }

4. Hijack all XHR requests so that the URL is appropriately appended with a cachebusting querystring parameter if the URL was found in the invalidation flags data set, i.e. http://url.com/route/action becomes something like http://url.com/route/action?_nocache=(date_value_of_invalidation_flag), being sure to hijack only the XHR request and not any logic that generated the URL in the first place.

5. Remember that IIS and ASP.NET by default convention ignore case (“/Route/Action” == “/route/action”) on the path, but the HTTP specification does not and therefore the browser cache bucket will not ignore case. Force all URL checks for invalidation flags to be case-insensitive to the left of the querystring (if there is a querystring, otherwise for the entire URL).

6. Make sure the AJAX requests’ querystring parameters are in consistent order. Changing the sequential order of parameters may be handled the same on the server but will be cached differently on the client.

7. These steps are for “pull”-based XHR-driven invalidation flags being pulled from the server via XHR. For “push”-based invalidation triggered by the server, consider using something like a SignalR channel or hub to maintain an open channel of communication using WebSockets or long polling. Server application logic can then invoke this channel or hub to send an invalidation flag to the client or to all clients.

On the client side, an invalidation flag “push” triggered in #7 above, for which #1 and #2 above would no longer apply, can still utilize #3 through #6.
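To make the summary steps a little more concrete, here is a minimal sketch of the header parsing, case-insensitive matching, and parameter-order handling they describe. The function names (parseInvalidationHeader, normalizeUrl, applyCacheBust) are hypothetical and are not part of the downloadable project; sorting the querystring parameters for the lookup key is an assumption about one way to keep parameter order from causing missed matches.

```javascript
// Hypothetical helpers; names are illustrative, not from the original project.

// Split a semicolon-delimited header value into trimmed, non-empty URLs.
function parseInvalidationHeader(headerValue) {
    if (!headerValue) return [];
    return headerValue.split(';')
        .map(function (u) { return u.trim(); })
        .filter(function (u) { return u.length > 0; });
}

// Build a lookup key: lower-case the path only, and sort the querystring
// parameters so that the same parameters in a different order still match.
function normalizeUrl(url) {
    var parts = url.split('?');
    var path = parts[0].toLowerCase();
    if (parts.length < 2 || parts[1] === '') return path;
    var params = parts.slice(1).join('?').split('&').sort();
    return path + '?' + params.join('&');
}

// Append the cache-busting field when the URL has an invalidation flag.
// `flags` maps normalized URLs to the time the invalidation was recorded.
function applyCacheBust(url, flags) {
    var stamp = flags[normalizeUrl(url)];
    if (!stamp) return url;
    return url + (url.indexOf('?') > -1 ? '&' : '?') + '_nocache=' + stamp;
}
```

With a flags object such as { "/home/getdata": 123 }, calling applyCacheBust("/Home/GetData", flags) would produce "/Home/GetData?_nocache=123", while a URL with no flag is returned unchanged.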
You can download the project I used for this blog entry here: ClientSideCacheInvalidation.zip

August 16, 2013

by Jon Davis

· 11,395 Views

Resource Pooling, Virtualization, Fabric, and the Cloud

One of the five essential attributes of cloud computing (see The 5-3-2 Principle of Cloud Computing) is resource pooling, which is an important differentiator separating the thought process of traditional IT from that of a service-based, cloud computing approach. Resource pooling in the context of cloud computing and from a service provider’s viewpoint denotes a set of strategies and a methodical way of managing resources. For a user, resource pooling institutes an abstraction for presenting and consuming resources in a consistent and transparent fashion. This article presents these key concepts derived from resource pooling: Resource Pools; Virtualization in the Context of Cloud Computing; Standardization, Automation, and Optimization; Fabric; Cloud; Closing Thoughts. Resource Pools Ultimately, data center resources can be logically placed into three categories. They are: compute, networks, and storage. For many, this grouping may appear trivial. It is, however, a foundation upon which some cloud computing methodologies are developed, products designed, and solutions formulated. Compute This is a collection of all CPU capabilities. Essentially all data center servers, either for supporting or actually running a workload, are part of this compute group. The compute pool represents the total capacity for executing code and running instances. The process to construct a compute pool is to first inventory all servers and identify virtualization candidates, followed by implementing server virtualization. It is never too early to introduce a system management solution to facilitate the processes, which in my view is a strategic investment and a critical component for all cloud initiatives. Networks The physical and logical artifacts put in place to connect, segment, and isolate resources from layer three and below, etc., are gathered in the network pool. Networking enables resources to become visible and hence manageable.
In the age of instant gratification, networks and mobility are redefining the security and system administration boundaries, and play a direct and impactful role in user productivity and customer satisfaction. Networking in cloud computing is more than just remote access; it is empowerment for a user to self-serve and consume resources anytime, anywhere, with any device. BYOD and consumerization of IT are various expressions of these concepts. Storage This has long been a very specialized and sometimes mysterious part of IT. An enterprise storage solution is frequently characterized as a high-cost item with a significant financial and contractual commitment, specialized hardware, proprietary API and software, a dependency on direct vendor support, etc. In cloud computing, storage has become even more noticeable since the ability to grow and shrink based on demands, i.e. elasticity, demands an enterprise-level, massive, reliable, and resilient storage solution at a global scale. While enterprise IT is consolidating resources and transforming the existing establishment into a cloud computing environment, how to leverage existing storage devices from various vendors and integrate them with the next generation storage solutions is among the highest priorities for modernizing a data center. Virtualization in the Context of Cloud Computing In the last decade, virtualization has proved its value and accelerated the realization of cloud computing. Then, virtualization was mainly server virtualization, which in an over-simplified statement means hosting multiple server instances on the same hardware while each instance runs transparently and in isolation, as if each consumes the entire hardware and is the only instance running. Many customer expectations, business needs, and methodologies have since evolved.
Now, we should validate virtualization in the context of cloud computing to fully address the innovations rapidly changing how IT conducts business and delivers services. As discussed below, in the context of cloud computing, consumable resources are delivered in some virtualized form. Various virtualization layers collectively construct and form the so-called fabric. Server Virtualization The concept of server virtualization remains: running multiple server instances with the same hardware while each instance runs transparently and in isolation, as if each instance is the only instance running and consuming the entire server hardware. In addition to virtualizing and consolidating servers, server virtualization also signifies the practices of standardizing server deployment switching away from physical boxes to VMs. Server virtualization is for packaging, delivering, and consuming a compute pool. There are a few important considerations of virtualizing servers. IT needs the ability to identify and manage bare metal such that the entire resource life-cycle management from commencing to decommissioning can be standardized and automated. To fundamentally reduce the support and training cost while increasing productivity, a consistent platform with tools applicable across physical, virtual, on-premises, and off-premises deployments is essential. The last thing IT wants is one set of tools for physical resources and another for those virtualized, one set of tools for on-premises deployment and another for those deployed to a service provider, and one set of tools for development and another for deploying applications. The requirement is one methodology for all, one skill set for all, and one set of tools for all. This advantage is obvious when developing applications and deploying Windows Server 2012 R2 on premises or off premises to Windows Azure. 
The Active Directory security model can work across sites, System Center can manage resources deployed off premises to Windows Azure, and Visual Studio can publish applications across platforms. Windows infrastructure architecture, security, and deployment models are all directly applicable. Network Virtualization An idea similar to server virtualization applies here. Network virtualization is the ability to run multiple networks on the same network device while each network runs transparently and in isolation, as if each network is the only network running and consuming the entire network hardware. Conceptually, since each network instance is running in isolation, one tenant’s 192.168.x network is not aware of another tenant’s identical 192.168.x network running on the same network device. Network virtualization provides the translation between physical network characteristics and the representation of a resource identity in a virtualized network. Consequently, above the network virtualization layer, various tenants while running in isolation can have identical network configurations. A great example of network virtualization is Windows Azure virtual networking. At any given time, there can be multiple Windows Azure subscribers all allocating the same 192.168.x address space with an identical subnet scheme (192.168.1.x/16) for deploying VMs. Those VMs belonging to one subscriber will however not be aware of or visible to those deployed by others, despite the fact that the network configuration, IP scheme, and IP address assignments may all be identical. Network virtualization in Windows Azure isolates one subscriber from the others such that each subscriber operates as if the subscription is the only one employing a 192.168.x address space. Storage Virtualization I believe this is where the next wave of drastic cost reduction of IT post-server virtualization happens.
Historically, storage has been a high-cost item in any IT budget in every aspect, including hardware, software, staffing, maintenance, SLA, etc. Since the introduction of Windows Server 2012, there is a clear direction where storage virtualization is built into the OS and is becoming a commodity. New capabilities like Storage Pool, Hyper-V over SMB, Scale-Out File Server, etc., are now part of the Windows Server OS and are making storage virtualization part of server administration routines and easily manageable with tools and utilities like PowerShell, which is familiar to many IT professionals. The concept of storage virtualization remains consistent with the idea of logically separating a computing object from its hardware, in this case the storage capacity. Storage virtualization is the ability to integrate multiple and heterogeneous storage devices, aggregate the storage capacities, and present/manage them as one logical storage device with a continuous storage space. JBOD is a technology to realize this concept. Standardization, Automation and Optimization Each of the three resource pools has an abstraction to logically present itself with characteristics and work patterns. A compute pool is a collection of physical (virtualization and infrastructure) hosts and VMs. A virtualization host hosts VMs that run workloads deployed by service owners and consumed by authorized users. A network pool encompasses network resources including physical devices, logical switches, address spaces, and site configurations. Network virtualization as enabled/defined in configurations can identify and translate a logical/virtual IP address into a physical one, such that tenants with the same network hardware can implement an identical network scheme without a concern. A storage pool is based on storage virtualization, which is a concept of presenting an aggregated storage capacity as one continuous storage space as if provided from one logical storage device.
In other words, the three resource pools are wrapped with server virtualization, network virtualization, and storage virtualization, respectively. Each virtualization presents a set of methodologies on which work patterns are derived and common practices are developed. These virtualization layers provide opportunities to standardize, automate, and optimize deployments and considerably facilitate the adoption of cloud computing. Standardization Virtualizing resources decouples the dependency between instances and the underlying hardware. This offers an opportunity to simplify and standardize the logical representation of a resource. For instance, a VM is defined and deployed with a VM template that provides a level of consistency with a standardized configuration. Automation Once VM characteristics are identified and standardized, we can generate an instance by providing only instance-based information or information that depends on run-time, such as the VM machine name, which must be validated at run-time to prevent duplicated names. This requirement for providing only minimal information at deployment can significantly simplify and streamline operations for automation. And with automation, resources can then be deployed, instantiated, relocated, taken off-line, brought back online, or removed rapidly and automatically based on set criteria. Standardization and automation are essential mechanisms so that workload can be scaled on demand, i.e., become elastic. Optimization Standardization provides a set of common criteria. Automation executes operations based on set criteria with volumes, consistency, and expediency. With standardization and automation, instances can be instantiated with consistency, efficiency, and predictability. In other words, resources can be operated in bulk with consistency and predictability. The next logical step is then to optimize the usage based on SLA.
The presented progression is what resource pooling and virtualizations can provide and facilitate. These methodologies are now built into products and solutions. Windows Server 2012 R2 and System Center 2012 and later integrate server virtualization, network virtualization, and storage virtualization into one consistent solution platform with standardization, automation, and optimization for building and managing clouds. Fabric This is a significant abstraction in cloud computing. Fabric implies accessibility and discoverability, and denotes the ability to discover, identify, and manage a resource. Conceptually, fabric is an umbrella term encompassing all the underlying infrastructure supporting a cloud computing environment. At the same time, a fabric controller represents the system management solution which manages, i.e. owns, fabric. In cloud architecture, fabric consists of the three resource pools: compute, networks, and storage. Compute provides the computing capabilities, executes code, and runs instances. Networks glues the resources based on requirements. Storage is where VMs, configurations, data, and resources are kept. Fabric shields the physical complexities of the three resource pools presented with server virtualization, network virtualization, and storage virtualization. All operations are eventually directed by the fabric controller of a data center. Above fabric, there are logical views of consumable resources including VMs, virtual networks, and logical storage drives. By deploying VMs, configuring virtual networks, or acquiring storage, a user consumes resources. Under fabric, there are virtualization and infrastructure hosts, Active Directory, DNS, clusters, load balancers, address pools, network sites, library shares, storage arrays, topology, racks, cables, etc., all under the fabric controller’s command to collectively present and support fabric. 
For a service provider, building a cloud computing environment is essentially establishing a fabric controller and constructing fabric. Namely, instituting a comprehensive management solution, building the three resource pools, and integrating server virtualization, network virtualization, and storage virtualization to form fabric. From a user’s point of view, how and where a resource is physically provided is not a concern, but the accessibility, readiness, scalability, and fulfillment of SLA are. Cloud This is a well-defined term and we should not be confused with it. (see NIST SP 800-145 and the 5-3-2 Principle of Cloud Computing) We need to be very clear on: what a cloud must exhibit (the five essential attributes), how to consume it (with SaaS, PaaS, or IaaS), and the model a service is deployed in (like private cloud, public cloud, and hybrid cloud). Cloud is a concept, a state, a set of capabilities such that a business can be delivered as a service, i.e. available on demand. The architecture of a cloud computing environment is presented with three resource pools: compute, networks, and storage. Each is an abstraction provided by a virtualization layer. Server virtualization presents a compute pool with VMs that supply the computing, i.e. CPUs, and power to execute code and run instances. Network virtualization offers a network pool and is the mechanism that allows multiple tenants with identical network configurations on the same virtualization host while connecting, segmenting, isolating network traffic with virtual NICs, logical switches, address space, network sites, IP pools, etc. Storage virtualization provides a logical storage device with the capacity to appear continuous and aggregated with a pool of storage devices behind the scene. 
The three resource pools together constitute the fabric (of a cloud) while the three virtualization layers collectively form the abstraction, such that while the underlying physical infrastructure may be intricate, the user experience above fabric remains logical and consistent. Deploying a VM, configuring a virtual network, or acquiring storage is transparent with virtualization regardless of where the VM actually resides, how the virtual network is physically wired, or what devices in the aggregate the requested storage is provided with. Closing Thoughts Cloud is a very consumer-focused approach. It is about a customer’s ability and control based on SLA in getting resources when needed and with scale, and equally important releasing resources when no longer required. It is not about products and technologies. It is about servicing, consuming, and strengthening the bottom line.

August 12, 2013

by Yung Chou

· 10,444 Views

In-Process Caching vs. Distributed Caching

In this post I will compare the pros and cons of using an in-process cache versus a distributed cache and when you should use either one. First, let's see what each one is. As the name suggests, an in-process cache is an object cache built within the same address space as your application. The Google Guava Library provides a simple in-process cache API that is a good example. On the other hand, a distributed cache is external to your application and quite possibly deployed on multiple nodes forming a large logical cache. Memcached is a popular distributed cache. Ehcache from Terracotta is a product that can be configured to function either way. Following are some considerations that should be kept in mind when making a decision.

Consideration: Consistency

In-Process Cache: While using an in-process cache, your cache elements are local to a single instance of your application. Many medium-to-large applications, however, will not have a single application instance as they will most likely be load-balanced. In such a setting, you will end up with as many caches as your application instances, each having a different state resulting in inconsistency. State may however be eventually consistent as cached items time-out or are evicted from all cache instances.

Distributed Cache: Distributed caches, although deployed on a cluster of multiple nodes, offer a single logical view (and state) of the cache. In most cases, an object stored in a distributed cache cluster will reside on a single node in the cluster. By means of a hashing algorithm, the cache engine can always determine on which node a particular key-value resides. Since there is always a single state of the cache cluster, it is never inconsistent.

Comments: If you are caching immutable objects, consistency ceases to be an issue. In such a case, an in-process cache is a better choice as many overheads typically associated with external distributed caches are simply not there. If your application is deployed on multiple nodes, you cache mutable objects and you want your reads to always be consistent rather than eventually consistent, a distributed cache is the way to go.

Consideration: Overheads

In-Process Cache: This dated but very descriptive article describes how an in-process cache can negatively affect the performance of an application with an embedded cache, primarily due to garbage collection overheads. Your results however are heavily dependent on factors such as the size of the cache and how quickly objects are being evicted and timed-out.

Distributed Cache: A distributed cache will have two major overheads that will make it slower than an in-process cache (but better than not caching at all): network latency and object serialization.

Comments: As described earlier, if you are looking for an always-consistent global cache state in a multi-node deployment, a distributed cache is what you are looking for (at the cost of performance that you may get from a local in-process cache).

Consideration: Reliability

In-Process Cache: An in-process cache makes use of the same heap space as your program so one has to be careful when determining the upper limits of memory usage for the cache. If your program runs out of memory there is no easy way to recover from it.

Distributed Cache: A distributed cache runs as independent processes across multiple nodes and therefore failure of a single node does not result in a complete failure of the cache. As a result of a node failure, items that are no longer cached will make their way into surviving nodes on the next cache miss. Also in the case of distributed caches, the worst consequence of a complete cache failure should be degraded performance of the application as opposed to complete system failure.

Comments: An in-process cache seems like a better option for a small and predictable number of frequently accessed, preferably immutable objects. For large, unpredictable volumes, you are better off with a distributed cache.
Recommendation For a small, predictable number of preferably immutable objects that have to be read multiple times, an in-process cache is a good solution because it will perform better than a distributed cache. However, for cases in which the number of objects that can be or should be cached is unpredictable and large, and consistency of reads is a must-have, a distributed cache is perhaps a better solution even though it may not bring the same performance benefits as an in-process cache. It goes without saying that your application can use both schemes for different types of objects depending on what suits the scenario best.
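The consistency discussion above mentions that a hashing algorithm lets the cache engine determine which node holds a given key. As a rough illustration only, the following sketch maps a key to a node with a simple string hash and a modulo. The hash function, the node names, and the nodeFor helper are all hypothetical; real distributed cache clients often use consistent hashing instead, so that adding or removing a node remaps fewer keys.

```javascript
// Hypothetical sketch: node names and the hash function are illustrative only.

// Simple deterministic string hash, kept as an unsigned 32-bit integer.
function hashKey(key) {
    var h = 0;
    for (var i = 0; i < key.length; i++) {
        h = (h * 31 + key.charCodeAt(i)) >>> 0;
    }
    return h;
}

// Pick the node responsible for a key using modulo hashing.
// Every client that knows the same node list computes the same answer,
// which is why each key has a single home and a single state.
function nodeFor(key, nodes) {
    return nodes[hashKey(key) % nodes.length];
}

var nodes = ["cache-node-0", "cache-node-1", "cache-node-2"];
console.log(nodeFor("user:42", nodes) === nodeFor("user:42", nodes)); // same node every time
```

The point of the sketch is only that the mapping is deterministic: any application instance asking for the same key is routed to the same node, so there is one copy of the cached value rather than one per instance.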

August 8, 2013

by Faheem Sohail

· 107,454 Views · 8 Likes

NoSQL with JPA

EclipseLink, the reference implementation of JPA, has JPA support for NoSQL databases (MongoDB and Oracle NoSQL) as of version 2.4. In this tutorial we will discuss the use of a MongoDB database with the JPA support of EclipseLink. The operations previously done using the console and the native Java driver will be done in a web application with the help of EclipseLink. Tools and technologies used in the sample application are as follows:

MongoDB version 2.4.1
MongoDB Java Driver version 2.11.1
JSF version 2.2
PrimeFaces version 3.5
EclipseLink version 2.4
Jetty 7.x Maven Plugin
JDK version 1.7
Maven 3.0.4

Project Dependencies (groupId:artifactId:version)

org.glassfish:javax.faces:2.2.0-SNAPSHOT
org.primefaces:primefaces:3.5
org.primefaces.themes:bootstrap:1.0.10
org.eclipse.persistence:org.eclipse.persistence.jpa:2.4.0-SNAPSHOT
org.eclipse.persistence:org.eclipse.persistence.nosql:2.4.0-SNAPSHOT
jboss:jboss-j2ee:4.2.2.GA
org.mongodb:mongo-java-driver:2.11.1
commons-fileupload:commons-fileupload:1.3

Entity Class

@Entity
@NoSql(dataFormat=DataFormatType.MAPPED)
public class Article implements Serializable {

    public Article() { }

    @Id
    @GeneratedValue
    @Field(name="_id")
    private String id;

    @ElementCollection
    private List<Categories> categoryLists = new ArrayList<Categories>();

    @Basic
    private String title;

    @Basic
    private String content;

    @Basic
    @Temporal(javax.persistence.TemporalType.DATE)
    private Date date;

    @Basic
    private String author;

    @ElementCollection
    private List<Tags> tagLists = new ArrayList<Tags>();

The @NoSql annotation sets the data format and type and maps the NoSQL data. Because our sample application uses MongoDB, which stores documents in BSON format, MAPPED is used as the data format. The @ElementCollection annotation maps the embedded collection into the parent document. Because an article in our sample application can have more than one category and tag, we map them as element collections.
Embedded Objects

@Embeddable
@NoSql(dataFormat=DataFormatType.MAPPED)
public class Categories implements Serializable {

    @Basic
    private String category;

@Embeddable
@NoSql(dataFormat=DataFormatType.MAPPED)
public class Tags implements Serializable {

    @Basic
    private String tag;

Unlike the Article entity class, the Categories and Tags classes carry the @Embeddable annotation at the top. The documents stored in the parent document are mapped with this annotation. Please note that embedded objects do not need a unique field.

persistence.xml

com.kodcu.entity.Article
com.kodcu.entity.Categories
com.kodcu.entity.Tags

CRUD Operations

index.xhtml

MyBean.java

public void saveArticle() {
    em.getTransaction().begin();
    if (null == article.getId())
        em.persist(article);
    else
        em.merge(article);
    em.getTransaction().commit();
}

public void removeArticle() {
    em.getTransaction().begin();
    em.remove(selectArticle);
    em.getTransaction().commit();
}

Demo Application

The original content above and the demo application can be accessed at NoSQL with JPA

August 6, 2013

by Hüseyin Akdoğan

CORE

· 31,659 Views · 1 Like

What Is NoSQL?

Dan McCreary and Ann Kelly, authors of 'Making Sense of NoSQL,' discuss the business drivers and motivations that make NoSQL so popular to organizations today.

August 1, 2013

by Eric Gregory

· 21,291 Views · 4 Likes

OLAP Operation in R

OLAP (Online Analytical Processing) is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions. This is a well-established field in Business Intelligence / Reporting. In this post, I will highlight the key ideas in OLAP operations and illustrate how to do this in R. Facts and Dimensions The core part of OLAP is a so-called "multi-dimensional data model", which contains two types of tables: "Fact" tables and "Dimension" tables. A Fact table contains records that each describe an instance of a transaction. Each transaction record contains categorical attributes (which describe contextual aspects of the transaction, such as space, time, user) as well as numeric attributes (called "measures", which describe quantitative aspects of the transaction, such as number of items sold, dollar amount). A Dimension table contains records that further elaborate the contextual attributes, such as user profile data, location details, etc. In a typical setting of a multi-dimensional model: Each Fact table contains foreign keys that reference the primary keys of multiple Dimension tables. In its simplest form, this is called a STAR schema. Dimension tables can contain foreign keys that reference other Dimension tables. This provides a sophisticated, detailed breakdown of the contextual aspects. This is also called a SNOWFLAKE schema. Although this is not a hard rule, Fact tables tend to be independent of one another and usually do not contain references to each other. However, different Fact tables usually share the same set of Dimension tables. This is also called a GALAXY schema. But it is a hard rule that a Dimension table NEVER points to / references a Fact table. A simple STAR schema is shown in the following diagram. Each dimension can also be hierarchical so that the analysis can be done at different degrees of granularity.
For example, the time dimension can be broken down into days, weeks, months, quarter and annual; Similarly, location dimension can be broken down into countries, states, cities ... etc. Here we first create a sales fact table that records each sales transaction. # Setup the dimension tables state_table <- data.frame(key=c("CA", "NY", "WA", "ON", "QU"), name=c("California", "new York", "Washington", "Ontario", "Quebec"), country=c("USA", "USA", "USA", "Canada", "Canada")) month_table <- data.frame(key=1:12, desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4")) prod_table <- data.frame(key=c("Printer", "Tablet", "Laptop"), price=c(225, 570, 1120)) # Function to generate the Sales table gen_sales <- function(no_of_recs) { # Generate transaction data randomly loc <- sample(state_table$key, no_of_recs, replace=T, prob=c(2,2,1,1,1)) time_month <- sample(month_table$key, no_of_recs, replace=T) time_year <- sample(c(2012, 2013), no_of_recs, replace=T) prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2)) unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3)) amount <- unit*prod_table[prod,]$price sales <- data.frame(month=time_month, year=time_year, loc=loc, prod=prod, unit=unit, amount=amount) # Sort the records by time order sales <- sales[order(sales$year, sales$month),] row.names(sales) <- NULL return(sales) } # Now create the sales fact table sales_fact <- gen_sales(500) # Look at a few records head(sales_fact) month year loc prod unit amount 1 1 2012 NY Laptop 1 225 2 1 2012 CA Laptop 2 450 3 1 2012 ON Tablet 2 2240 4 1 2012 NY Tablet 1 1120 5 1 2012 NY Tablet 2 2240 6 1 2012 CA Laptop 1 225 Multi-dimensional Cube Now, we turn this fact table into a hypercube with multiple dimensions. Each cell in the cube represents an aggregate value for a unique combination of each dimension. 
    # Build up a cube
    revenue_cube <- tapply(sales_fact$amount,
                           sales_fact[,c("prod", "month", "year", "loc")],
                           FUN=function(x){return(sum(x))})

    # Showing the cells of the cube
    revenue_cube

    , , year = 2012, loc = CA

             month
    prod          1     2     3     4     5     6     7     8     9    10    11    12
      Laptop   1350   225   900   675   675    NA   675  1350    NA  1575   900  1350
      Printer    NA  2280    NA    NA  1140   570   570   570    NA   570  1710    NA
      Tablet   2240  4480 12320  3360  2240  4480  3360  3360  5600  2240  2240  3360

    , , year = 2013, loc = CA

             month
    prod          1     2     3     4     5     6     7     8     9    10    11    12
      Laptop    225   225   450   675   225   900   900   450   675   225   675  1125
      Printer    NA  1140    NA  1140   570    NA    NA   570    NA  1140  1710  1710
      Tablet   3360  3360  1120  4480  2240  1120  7840  3360  3360  1120  5600  4480

    , , year = 2012, loc = NY

             month
    prod          1     2     3     4     5     6     7     8     9    10    11    12
      Laptop    450   450    NA    NA   675   450   675    NA   225   225    NA   450
      Printer    NA  2280    NA  2850   570    NA    NA  1710  1140    NA   570    NA
      Tablet   3360 13440  2240  2240  2240  5600  5600  3360  4480  3360  4480  3360

    , , year = 2013, loc = NY
    .....

    dimnames(revenue_cube)

    $prod
    [1] "Laptop"  "Printer" "Tablet"

    $month
     [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

    $year
    [1] "2012" "2013"

    $loc
    [1] "CA" "NY" "ON" "QU" "WA"

OLAP Operations

Here are some common OLAP operations:

Slice
Dice
Rollup
Drilldown
Pivot

"Slice" is about fixing certain dimensions to analyze the remaining dimensions. For example, we can focus on the sales happening in "2012", "Jan", or we can focus on the sales happening in "2012", "Jan", "Tablet".

    # Slice
    # cube data in Jan, 2012
    revenue_cube[, "1", "2012",]

             loc
    prod         CA    NY    ON    QU    WA
      Laptop   1350   450    NA   225   225
      Printer    NA    NA    NA  1140    NA
      Tablet   2240  3360  5600  1120  2240

    # cube data for Tablet in Jan, 2012
    revenue_cube["Tablet", "1", "2012",]

      CA   NY   ON   QU   WA
    2240 3360 5600 1120 2240

"Dice" is about limiting each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube. For example, we can focus on sales happening in [Jan/Feb/Mar, Laptop/Tablet, CA/NY].
    revenue_cube[c("Tablet","Laptop"),
                 c("1","2","3"),
                 ,
                 c("CA","NY")]

    , , year = 2012, loc = CA

            month
    prod          1     2     3
      Tablet   2240  4480 12320
      Laptop   1350   225   900

    , , year = 2013, loc = CA

            month
    prod          1     2     3
      Tablet   3360  3360  1120
      Laptop    225   225   450

    , , year = 2012, loc = NY

            month
    prod          1     2     3
      Tablet   3360 13440  2240
      Laptop    450   450    NA

    , , year = 2013, loc = NY

            month
    prod          1     2     3
      Tablet   3360  4480  6720
      Laptop    450    NA   225

"Rollup" is about applying an aggregation function to collapse a number of dimensions. For example, we want to focus on the annual revenue for each product and collapse the month and location dimensions (i.e., we don't care when within the year or where we sold our product).

    apply(revenue_cube, c("year", "prod"),
          FUN=function(x) {return(sum(x, na.rm=TRUE))})

          prod
    year   Laptop Printer Tablet
      2012  22275   31350 179200
      2013  25200   33060 166880

"Drilldown" is the reverse of "rollup": it applies the aggregation function at a finer level of granularity. For example, we want to focus on the annual and monthly revenue for each product and collapse only the location dimension (i.e., we don't care where we sold our product).

    apply(revenue_cube, c("year", "month", "prod"),
          FUN=function(x) {return(sum(x, na.rm=TRUE))})

    , , prod = Laptop

          month
    year       1     2     3     4     5     6     7     8     9    10    11    12
      2012  2250  2475  1575  1575  2250  1800  1575  1800   900  2250  1350  2475
      2013  2250   900  1575  1575  2250  2475  2025  1800  2025  2250  3825  2250

    , , prod = Printer

          month
    year       1     2     3     4     5     6     7     8     9    10    11    12
      2012  1140  5700   570  3990  4560  2850  1140  2850  2850  1710  3420   570
      2013  1140  4560  3420  4560  2850  1140   570  3420  1140  3420  3990  2850

    , , prod = Tablet

          month
    year       1     2     3     4     5     6     7     8     9    10    11    12
      2012 14560 23520 17920 12320 10080 14560 13440 15680 25760 12320 11200  7840
      2013  8960 11200 10080  7840 14560 10080 29120 15680 15680  8960 12320 22400

"Pivot" is about analyzing the combination of a pair of selected dimensions. For example, we may want to analyze the revenue by year and month, or by product and location.
    apply(revenue_cube, c("year", "month"),
          FUN=function(x) {return(sum(x, na.rm=TRUE))})

          month
    year       1     2     3     4     5     6     7     8     9    10    11    12
      2012 17950 31695 20065 17885 16890 19210 16155 20330 29510 16280 15970 10885
      2013 12350 16660 15075 13975 19660 13695 31715 20900 18845 14630 20135 27500

    apply(revenue_cube, c("prod", "loc"),
          FUN=function(x) {return(sum(x, na.rm=TRUE))})

             loc
    prod          CA     NY     ON     QU     WA
      Laptop   16425   9450   7650   7425   6525
      Printer  15390  19950   7980  10830  10260
      Tablet   90720 117600  45920  34720  57120

I hope this gives you a taste of the richness of the data processing model in R. However, since R does all of its processing in RAM, your data must be small enough to fit into the local memory of a single machine.

July 30, 2013

by Ricky Ho

· 18,002 Views · 3 Likes

JMS vs RabbitMQ

Definition:

JMS: Java Message Service is an API that is part of Java EE for sending messages between two or more clients. There are many JMS providers, such as OpenMQ (GlassFish’s default), HornetQ (JBoss), and ActiveMQ.

RabbitMQ: an open source message broker that uses the AMQP standard and is written in Erlang.

Messaging model: JMS supports two models: point-to-point (one-to-one) and publish/subscribe. RabbitMQ follows the AMQP model, which has four exchange types: direct, fanout, topic, and headers.

Data types: JMS supports five message types (text, map, bytes, stream, and object messages), whereas RabbitMQ treats every message body as binary data.

Workflow strategy: In AMQP, producers send to an exchange, which then routes the message to queues; in JMS, producers send to a queue or topic directly.

Technology compatibility: JMS is specific to Java, whereas RabbitMQ has clients for many languages and platforms.

Performance: If you would like to know more about their performance, this benchmark is a good place to start, but look for others as well.
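The workflow difference can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions: the class ExchangeSketch and its methods are hypothetical names made up for illustration and are not part of the RabbitMQ client or the JMS API. Only the direct and fanout exchange types are modelled, and each queue has a single binding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model of AMQP-style routing (hypothetical, not a real client API).
public class ExchangeSketch {
    public enum Type { DIRECT, FANOUT }

    private final Type type;
    // queue name -> binding key (the key is ignored by a FANOUT exchange)
    private final Map<String, String> bindings = new LinkedHashMap<>();
    // queue name -> message bodies delivered so far (bodies are plain bytes)
    private final Map<String, List<byte[]>> queues = new LinkedHashMap<>();

    public ExchangeSketch(Type type) {
        this.type = type;
    }

    // Bind a queue to this exchange under a binding key.
    public void bind(String queue, String bindingKey) {
        bindings.put(queue, bindingKey);
        queues.putIfAbsent(queue, new ArrayList<>());
    }

    // The producer only talks to the exchange; the exchange picks the queues.
    public void publish(String routingKey, byte[] body) {
        for (Map.Entry<String, String> binding : bindings.entrySet()) {
            boolean matches = type == Type.FANOUT
                    || binding.getValue().equals(routingKey);
            if (matches) {
                queues.get(binding.getKey()).add(body);
            }
        }
    }

    // Number of messages currently held by a bound queue.
    public int depth(String queue) {
        return queues.get(queue).size();
    }

    public static void main(String[] args) {
        ExchangeSketch direct = new ExchangeSketch(Type.DIRECT);
        direct.bind("orders", "order.created");
        direct.bind("audit", "audit.event");
        direct.publish("order.created", new byte[] {1});
        System.out.println("direct: orders=" + direct.depth("orders")
                + " audit=" + direct.depth("audit"));

        ExchangeSketch fanout = new ExchangeSketch(Type.FANOUT);
        fanout.bind("orders", "");
        fanout.bind("audit", "");
        fanout.publish("ignored", new byte[] {2});
        System.out.println("fanout: orders=" + fanout.depth("orders")
                + " audit=" + fanout.depth("audit"));
    }
}
```

In the direct case only the queue whose binding key equals the routing key receives the message, while the fanout exchange copies it to every bound queue. In JMS the producer would name the destination queue or topic itself.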

July 30, 2013

by Saeid Siavashi

· 51,771 Views · 16 Likes

Why String is Immutable in Java

This is an old yet still popular question. There are multiple reasons why String is designed to be immutable in Java. A good answer depends on a good understanding of memory, synchronization, data structures, etc. In the following, I will summarize some answers.

1. Requirement of the String Pool

The string pool (string intern pool) is a special storage area in the Java heap. When a string is created from a literal and the same string already exists in the pool, the reference to the existing string is returned instead of creating a new object and returning its reference. The following code will create only one String object in the heap.

    String string1 = "abcd";
    String string2 = "abcd";

If String were not immutable, changing the string through one reference would lead to the wrong value for the other references.

2. Allow String to Cache its Hashcode

The hash code of a String is frequently used in Java, for example in a HashMap. Being immutable guarantees that the hash code will always be the same, so it can be cached without worrying about changes. That means there is no need to calculate the hash code every time it is used, which is more efficient.

3. Security

String is widely used as a parameter for many Java classes, e.g. for network connections, opening files, etc. Were String not immutable, a connection target or file name could be changed after it had been checked, leading to a serious security threat: the method would think it was connecting to one machine, but would actually connect to another. Mutable strings could cause security problems in reflection too, as the parameters are strings. Here is a code example:

    boolean connect(String s) {
        if (!isSecure(s)) {
            throw new SecurityException();
        }
        // This would cause a problem if s could be changed
        // before this point through another reference.
        causeProblem(s);
        return true;
    }

In summary, the reasons include design, efficiency, and security. Actually, this is also true for many other “why” questions in a Java interview.
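The string pool behaviour from the first point can be checked with a minimal sketch. The class and method names below (StringPoolSketch and its helpers) are made up for illustration; the sketch relies only on standard behaviour of String literals, new String(...), String.intern() and String.concat().

```java
public class StringPoolSketch {

    // Two identical literals resolve to the same pooled object,
    // so a reference comparison with == succeeds.
    public static boolean literalsShareOneObject() {
        String string1 = "abcd";
        String string2 = "abcd";
        return string1 == string2;
    }

    // new String(...) always allocates a separate object;
    // intern() hands back the pooled instance for the same characters.
    public static boolean internReturnsPooledObject() {
        String literal = "abcd";
        String copy = new String("abcd");
        return copy != literal && copy.intern() == literal;
    }

    // "Changing" a String really builds a new object, so a second
    // reference to the original still sees the old value.
    public static String otherReferenceAfterConcat() {
        String original = "abcd";
        String alias = original;
        original = original.concat("ef");
        return alias;
    }

    public static void main(String[] args) {
        System.out.println(literalsShareOneObject());
        System.out.println(internReturnsPooledObject());
        System.out.println(otherReferenceAfterConcat());
    }
}
```

Because concat returns a new object instead of altering the pooled one, sharing a single "abcd" instance between many references is safe.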

July 29, 2013

by Ryan Wang

· 217,418 Views · 9 Likes

Displaying and Searching std::map Contents in WinDbg

This time we’re up for a bigger challenge. We want to automatically display and possibly search and filter std::map objects in WinDbg. The script for std::vectors was relatively easy because of the flat structure of the data in a vector; maps are more complex beasts. Specifically, a map in the Visual C++ STL is implemented as a red-black tree. Each tree node has three important pointers: _Left, _Right, and _Parent. Additionally, each node has a _Myval field that contains the std::pair with the key and value represented by the node.

Iterating a tree structure requires recursion, and WinDbg scripts don’t have any syntax to define functions. However, we can invoke a script recursively: a script is allowed to contain the $$>a< command that invokes it again with a different set of arguments. The path to the script is also readily available in ${$arg0}.

Before I show you the script, there’s just one little challenge I had to deal with. When you call a script recursively, the values of the pseudo-registers (like $t0) will be clobbered by the recursive invocation. I was on the verge of allocating memory dynamically or calling into a shell process to store and load variables, when I stumbled upon the .push and .pop commands, which store the register context and load it, respectively. These are a must for recursive WinDbg scripts.

OK, so suppose you want to display values from an std::map where the key is less than or equal to 2. Here we go:

    0:000> $$>a< traverse_map.script my_map -c ".block { .if (@@(@$t9.first) <= 2) { .echo ----; ?? @$t9.second } }"
    size = 10
    ----
    struct point
       +0x000 x    : 0n1
       +0x004 y    : 0n2
       +0x008 data : extra_data
    ----
    struct point
       +0x000 x    : 0n0
       +0x004 y    : 0n1
       +0x008 data : extra_data
    ----
    struct point
       +0x000 x    : 0n2
       +0x004 y    : 0n3
       +0x008 data : extra_data

For each pair (stored in the $t9 pseudo-register), the block checks if the first component is less than or equal to 2, and if it is, outputs the second component. Next, here’s the script.
Note it’s considerably more complex than what we had to do with vectors, because it essentially invokes itself with a different set of parameters and then repeats recursively.

    .if ($sicmp("${$arg1}", "-n") == 0) {
        .if (@@(@$t0->_Isnil) == 0) {
            .if (@$t2 == 1) {
                .printf /D "%p\n", @$t0, @$t0
                .printf "key = "
                ?? @$t0->_Myval.first
                .printf "value = "
                ?? @$t0->_Myval.second
            } .else {
                r? $t9 = @$t0->_Myval
                command
            }
        }
        $$ Recurse into _Left, _Right unless they point to the root of the tree
        .if (@@(@$t0->_Left) != @@(@$t1)) {
            .push /r /q
            r? $t0 = @$t0->_Left
            $$>a< ${$arg0} -n
            .pop /r /q
        }
        .if (@@(@$t0->_Right) != @@(@$t1)) {
            .push /r /q
            r? $t0 = @$t0->_Right
            $$>a< ${$arg0} -n
            .pop /r /q
        }
    } .else {
        r? $t0 = ${$arg1}
        .if (${/d:$arg2}) {
            .if ($sicmp("${$arg2}", "-c") == 0) {
                r $t2 = 0
                aS ${/v:command} "${$arg3}"
            }
        } .else {
            r $t2 = 1
            aS ${/v:command} " "
        }
        .printf "size = %d\n", @@(@$t0._Mysize)
        r? $t0 = @$t0._Myhead->_Parent
        r? $t1 = @$t0->_Parent
        $$>a< ${$arg0} -n
        ad command
    }

Of particular note are the aS command, which configures an alias that is then used by the recursive invocation to invoke a command block for each of the map’s elements; the $sicmp function, which compares strings; and the .printf /D command, which outputs a chunk of DML. Finally, the recursion terminates when _Left or _Right are equal to the root of the tree (that’s just how the tree is implemented in this case).

July 26, 2013

by Sasha Goldshtein

· 4,938 Views

Converting Java Objects to Byte Array, JSON and XML

Quick reference for converting Java objects to various formats (byte array, JSON, XML) and back, using different libraries for serialization and deserialization.
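As a rough illustration of the byte array case only, here is a minimal sketch that uses the JDK's built-in java.io serialization. That choice is an assumption, since this summary does not name the libraries the article uses. The ByteArraySketch class and the Person type are hypothetical names invented for the example, and the JSON and XML conversions are not shown because they normally need third-party libraries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ByteArraySketch {

    // Hypothetical value type used only for this illustration.
    public static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String name;
        public final int age;

        public Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Serialize any Serializable object into a byte array.
    public static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild the object from bytes produced by toBytes.
    public static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new Person("Ada", 36));
        Person copy = (Person) fromBytes(bytes);
        System.out.println(copy.name + " " + copy.age);
    }
}
```

Built-in serialization only works for classes that implement Serializable, and it should not be used to read bytes from untrusted sources.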

July 22, 2013

by Faheem Sohail

· 107,081 Views