Data Resources

The Latest Data Topics

My last blog demonstrated the application of Spring 3.1’s @Cacheable annotation that’s used to mark methods whose return values will be stored in a cache. However, @Cacheable is only one of a pair of annotations that the Guys at Spring have devised for caching, the other being @CacheEvict. Like @Cacheable, @CacheEvict has value, key and condition attributes. These work in exactly the same way as those supported by @Cacheable, so for more information on them see my previous blog: Spring 3.1 Caching and @Cacheable. @CacheEvict supports two additional attributes: allEntries and beforeInvocation. If I were a gambling man I'd put money on the most popular of these being allEntries. allEntries is used to completely clear the contents of a cache defined by @CacheEvict's mandatory value argument. The method below demonstrates how to apply allEntries: @CacheEvict(value = "employee", allEntries = true) public void resetAllEntries() { // Intentionally blank } resetAllEntries() sets @CacheEvict’s allEntries attribute to “true” and, assuming that the findEmployee(...) method looks like this: @Cacheable(value = "employee") public Person findEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } ...then in the following code, resetAllEntries(), will clear the “employees” cache. This means that in the JUnit test below employee1 will not reference the same object as employee2: @Test public void testCacheResetOfAllEntries() { Person employee1 = instance.findEmployee("John", "Smith", 22); instance.resetAllEntries(); Person employee2 = instance.findEmployee("John", "Smith", 22); assertNotSame(employee1, employee2); } The second attribute is beforeInvocation. This determines whether or not a data item(s) is cleared from the cache before or after your method is invoked. The code below is pretty nonsensical; however, it does demonstrate that you can apply both @CacheEvict and @Cacheable simultaneously to a method. @CacheEvict(value = "employee", beforeInvocation = true) @Cacheable(value = "employee") public Person evictAndFindEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } In the code above, @CacheEvict deletes any entries in the cache with a matching key before @Cacheable searches the cache. As @Cacheable won’t find any entries it’ll call my code storing the result in the cache. The subsequent call to my method will invoke @CacheEvict which will delete any appropriate entries with the result that in the JUnit test below the variable employee1 will never reference the same object as employee2: @Test public void testBeforeInvocation() { Person employee1 = instance.evictAndFindEmployee("John", "Smith", 22); Person employee2 = instance.evictAndFindEmployee("John", "Smith", 22); assertNotSame(employee1, employee2); } As I said above, evictAndFindEmployee(...) seems somewhat nonsensical as I’m applying both @Cacheable and @CacheEvict to the same method. But, it’s more that that, it makes the code unclear and breaks the Single Responsibility Principle; hence, I’d recommend creating separate cacheable and cache-evict methods. For example, if you have a cacheing method such as: @Cacheable(value = "employee", key = "#surname") public Person findEmployeeBySurname(String firstName, String surname, int age) { return new Person(firstName, surname, age); } then, assuming you need finer cache control than a simple ‘clear-all’, you can easily define its counterpart: @CacheEvict(value = "employee", key = "#surname") public void resetOnSurname(String surname) { // Intentionally blank } This is a simple blank marker method that uses the same SpEL expression that’s been applied to @Cacheable to evict all Person instances from the cache where the key matches the ‘surname’ argument. @Test public void testCacheResetOnSurname() { Person employee1 = instance.findEmployeeBySurname("John", "Smith", 22); instance.resetOnSurname("Smith"); Person employee2 = instance.findEmployeeBySurname("John", "Smith", 22); assertNotSame(employee1, employee2); } In the above code the first call to findEmployeeBySurname(...) creates a Person object, which Spring stores in the “employee” cache with a key defined as: “Smith”. The call to resetOnSurname(...) clears all entries from the “employee” cache with a surname of “Smith” and finally the second call to findEmployeeBySurname(...) creates a new Person object, which Spring again stores in the “employee” cache with the key of “Smith”. Hence, the variables employee1, and employee2 do not reference the same object. Having covered Spring’s caching annotations, the next piece of the puzzle is to look into setting up a practical cache: just how do you enable Spring caching and which caching implementation should you use? More on that later...

September 21, 2012

by Roger Hughes

· 123,712 Views · 7 Likes

Spring 3.1 Caching and Config

I’ve recently being blogging about Spring 3.1 and its new caching annotations @Cacheable and @CacheEvict. As with all Spring features you need to do a certain amount of setup and, as usual, this is done with Spring’s XML configuration file. In the case of caching, turning on @Cacheable and @CacheEvict couldn’t be simpler as all you need to do is to add the following to your Spring config file: ...together with the appropriate schema definition in your beans XML element declaration: ...with the salient lines being: xmlns:cache="http://www.springframework.org/schema/cache" ...and: http://www.springframework.org/schema/cache http://www.springframework.org/schema/cache/spring-cache.xsd However, that’s not the end of the story, as you also need to specify a caching manager and a caching implementation. The good news is that if you’re familiar with the set up of other Spring components, such as the database transaction manager, then there’s no surprises in how this is done. A cache manager class seems to be any class that implements Spring’s org.springframework.cache.CacheManager interface. It’s responsible for managing one or more cache implementations where the cache implementation instance(s) are responsible for actually caching your data. The XML sample below is taken from the example code used in my last two blogs. In the above configurtion, I’m using Spring’s SimpleCacheManager to manage an instance of their ConcurrentMapCacheFactoryBean with a cache implementation named: “employee”. One important point to note is that your cache manager MUST have a bean id of cacheManager. If you get this wrong then you’ll get the following exception: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.springframework.cache.interceptor.CacheInterceptor#0': Cannot resolve reference to bean 'cacheManager' while setting bean property 'cacheManager'; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No bean named 'cacheManager' is defined at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:328) at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:106) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1360) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1118) at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:517) : : trace details removed for clarity : at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: org.springframework.beans.factory.NoSuchBeanDefinitionException: No bean named 'cacheManager' is defined at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBeanDefinition(DefaultListableBeanFactory.java:553) at org.springframework.beans.factory.support.AbstractBeanFactory.getMergedLocalBeanDefinition(AbstractBeanFactory.java:1095) at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:277) at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:193) at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:322) As I said above, in my simple configuration, the whole affair is orchestrated by the SimpleCacheManager. This, according to the documentation, is normally “Useful for testing or simple caching declarations”. Although you could write your own CacheManager implementation, the Guys at Spring have provided other cache managers for different situations SimpleCacheManager - see above. NoOpCacheManager - used for testing, in that it doesn’t actually cache anything, although be careful here as testing your code without caching may trip you up when you turn caching on. CompositeCacheManager - allows the use multiple cache managers in a single application. EhCacheCacheManager - a cache manager that wraps an ehCache instance. See http://ehcache.org   Selecting which cache manager to use in any given environment seems like a really good use for Spring Profiles. See:    Using Spring Profiles in XML Config Using Spring Profiles and Java Configuration And, that just about wraps things up, although just for completeness, below is the complete configuration file used in my previous two blogs: As a Lieutenant Columbo is fond of saying “And just one more thing, you know what bothers me about this case...”; well there are several things that bother me about cache managers, for example: What do the Guys at Spring mean by “Useful for testing or simple caching declarations” when talking about the SimpleCacheManager? Just exactly when should you use it in anger rather than for testing? Would it ever be advisable to write your own CacheManager implementation or even a Cache implementation? What exactly are the advantages of using the EhCacheCacheManager? How often would you really need CompositeCacheManager? All of which I may be looking into in the future...

September 19, 2012

by Roger Hughes

· 27,548 Views · 2 Likes

8 Common Code Violations in Java

At work, recently I did a code cleanup of an existing Java project. After that exercise, I could see a common set of code violations that occur again and again in the code. So, I came up with a list of such common violations and shared it with my peers so that an awareness would help to improve the code quality and maintainability. I’m sharing the list here to a bigger audience. The list is not in any particular order and all derived from the rules enforced by code quality tools such as CheckStyle, FindBugs and PMD. Here we go! Format source code and Organize imports in Eclipse: Eclipse provides the option to auto-format the source code and organize the imports (thereby removing unused ones). You can use the following shortcut keys to invoke these functions. Ctrl + Shift + F – Formats the source code. Ctrl + Shift + O – Organizes the imports and removes the unused ones. Instead of you manually invoking these two functions, you can tell Eclipse to auto-format and auto-organize whenever you save a file. To do this, in Eclipse, go to Window -> Preferences -> Java -> Editor -> Save Actions and then enable Perform the selected actions on save and check Format source code + Organize imports. Avoid multiple returns (exit points) in methods: In your methods, make sure that you have only one exit point. Do not use returns in more than one places in a method body. For example, the below code is NOT RECOMMENDED because it has more then one exit points (return statements). private boolean isEligible(int age){ if(age > 18){ return true; }else{ return false; } } The above code can be rewritten like this (of course, the below code can be still improved, but that’ll be later). private boolean isEligible(int age){ boolean result; if(age > 18){ result = true; }else{ result = false; } return result; } Simplify if-else methods: We write several utility methods that takes a parameter, checks for some conditions and returns a value based on the condition. For example, consider the isEligible method that you just saw in the previous point. private boolean isEligible(int age){ boolean result; if(age > 18){ result = true; }else{ result = false; } return result; } The entire method can be re-written as a single return statement as below. private boolean isEligible(int age){ return age > 18; } Do not create new instances of Boolean, Integer or String: Avoid creating new instances of Boolean, Integer, String etc. For example, instead of using new Boolean(true), use Boolean.valueOf(true). The later statement has the same effect of the former one but it has improved performance. Use curly braces around block statements. Never forget to use curly braces around block level statements such as if, for, while. This reduces the ambiguity of your code and avoids the chances of introducing a new bug when you modify the block level statement. NOT RECOMMENDED if(age > 18) return true; else return false; RECOMMENDED if(age > 18){ return true; }else{ return false; } Mark method parameters as final, wherever applicable: Always mark the method parameters as final wherever applicable. If you do so, when you accidentally modify the value of the parameter, you’ll get a compiler warning. Also, it makes the compiler to optimize the byte code in a better way. RECOMMENDED private boolean isEligible(final int age){ ... } Name public static final fields in UPPERCASE: Always name the public static final fields (also known as Constants) in UPPERCASE. This lets you to easily differentiate constant fields from the local variables. NOT RECOMMENDED public static final String testAccountNo = "12345678"; RECOMMENDED public static final String TEST_ACCOUNT_NO = "12345678";, Combine multiple if statements into one: Wherever possible, try to combine multiple if statements into single one. For example, the below code; if(age > 18){ if( voted == false){ // eligible to vote. } } can be combined into single if statements, as: if(age > 18 && !voted){ // eligible to vote } switch should have default: Always add a default case for the switch statements. Avoid duplicate string literals, instead create a constant: If you have to use a string in several places, avoid using it as a literal. Instead create a String constant and use it. For example, from the below code, private void someMethod(){ logger.log("My Application" + e); .... .... logger.log("My Application" + f); } The string literal “My Application” can be made as an Constant and used in the code. public static final String MY_APP = "My Application"; private void someMethod(){ logger.log(MY_APP + e); .... .... logger.log(MY_APP + f); } Additional Resources: A collection of Java best practices. List of available Checkstyle checks. List of PMD Rule sets

September 14, 2012

by Veera Sundar

· 45,999 Views · 1 Like

Spring 3.1 Caching and @Cacheable

Caches have been around in the software world for long time. They’re one of those really useful things that once you start using them, you wonder how on earth you got along without them so, it seems a little strange that the Guys at Spring only got around to adding a caching implementation to Spring core in version 3.1. I’m guessing that previously it wasn’t seen as a priority and besides, before the introduction of Java annotations one of the difficulties of caching was the coupling of caching code with your business code, which could often become pretty messy. However, the Guys at Spring have now devised a simple to use caching system based around a couple of annotations: @Cacheable and @CacheEvict. The idea of the @Cacheable annotation is that you use it to mark the method return values that will be stored in the cache. The @Cacheable annotation can be applied either at method or type level. When applied at method level, then the annotated method’s return value is cached. When applied at type level, then the return value of every method is cached. @Cacheable(value = "employee") public class EmployeeDAO { public Person findEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } public Person findAnotherEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } } The Cacheable annotation takes three arguments: value, which is mandatory, together with key and condition. The first of these, value, is used to specify the name of the cache (or caches) in which the a method’s return value is stored. @Cacheable(value = "employee") public Person findEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } The code above ensures that the new Person object is stored in the “employee” cache. Any data stored in a cache requires a key for its speedy retrieval. Spring, by default, creates caching keys using the annotated method’s signature as demonstrated by the code above. You can override this using @Cacheable’s second parameter: key. To define a custom key you use a SpEL expression. @Cacheable(value = "employee", key = "#surname") public Person findEmployeeBySurname(String firstName, String surname, int age) { return new Person(firstName, surname, age); } In the findEmployeeBySurname(...) code, the ‘#surname’ string is a SpEL expression that means ‘go and create a key using the surname argument of the findEmployeeBySurname(...) method’. The final @Cacheable argument is the optional condition argument. Again, this references a SpEL expression, but this time it’s specifies a condition that’s used to determine whether or not your method’s return value is added to the cache. @Cacheable(value = "employee", condition = "#age < 25") public Person findEmployeeByAge(String firstName, String surname, int age) { return new Person(firstName, surname, age); } In the code above, I’ve applied the ludicrous business rule of only caching Person objects if the employee is less than 25 years old. Having quickly demonstrated how to apply some caching, the next thing to do is to take a look at what it all means. @Test public void testCache() { Person employee1 = instance.findEmployee("John", "Smith", 22); Person employee2 = instance.findEmployee("John", "Smith", 22); assertEquals(employee1, employee2); } The above test demonstrates caching at its simplest. The first call to findEmployee(...), the result isn’t yet cached so my code will be called and Spring will store its return value in the cache. In the second call to findEmployee(...) my code isn’t called and Spring returns the cached value; hence the local variable employee1 refers to the same object reference a @Test public void testCacheWithAgeAsCondition() { Person employee1 = instance.findEmployeeByAge("John", "Smith", 22); Person employee2 = instance.findEmployeeByAge("John", "Smith", 22); assertEquals(employee1, employee2); } s employee2, which means that the following is true: assertEquals(employee1, employee2); But, things aren’t always so clear cut. Remember that in findEmployeeBySurname I’ve modified the caching key so that the surname argument is used to create the key and the thing to watch out for when creating your own keying algorithm is to ensure that any key refers to a unique object. @Test public void testCacheOnSurnameAsKey() { Person employee1 = instance.findEmployeeBySurname("John", "Smith", 22); Person employee2 = instance.findEmployeeBySurname("Jack", "Smith", 55); assertEquals(employee1, employee2); } The code above finds two Person instances which are clearly refer to different employees; however, because I’m caching on surname only, Spring will return a reference to the object that’s created during my first call to findEmployeeBySurname(...). This isn’t a problem with Spring, but with my poor cache key definition. Similar care has to be taken when referring to objects created by methods that have a condition applied to the @Cachable annotation. In my sample code I’ve applied the arbitrary condition of only caching Person instances where the employee is under 25 years old. @Test public void testCacheWithAgeAsCondition() { Person employee1 = instance.findEmployeeByAge("John", "Smith", 22); Person employee2 = instance.findEmployeeByAge("John", "Smith", 22); assertEquals(employee1, employee2); } In the above code, the references to employee1 and employee2 are equal because in the second call to findEmployeeByAge(...) Spring returns its cached instance. @Test public void testCacheWithAgeAsCondition2() { Person employee1 = instance.findEmployeeByAge("John", "Smith", 30); Person employee2 = instance.findEmployeeByAge("John", "Smith", 30); assertFalse(employee1 == employee2); } Similarly, in the unit test code above, the references to employee1 and employee2 refer to different objects as, in this case, John Smith is over 25. That just about covers @Cacheable, but what about @CacheEvict and clearing items form the cache? Also, there’s the question adding caching to your Spring config and choosing a suitable caching implementation. However, more on that later....

September 14, 2012

by Roger Hughes

· 197,061 Views · 8 Likes

Caching and @Cacheable

Caches have been around in the software world for long time. They’re one of those really useful things that once you start using them you wonder how on earth you got along without them so, it seems a little strange that the guys at Spring only got around to adding a caching implementation to Spring core in version 3.1. I’m guessing that previously it wasn’t seen as a priority and besides, before the introduction of Java annotations, one of the difficulties of caching was the coupling of caching code with your business code, which could often become pretty messy. However, the guys at Spring have now devised a simple to use caching system based around a couple of annotations: @Cacheable and @CacheEvict. The idea of the @Cacheable annotation is that you use it to mark the method return values that will be stored in the cache. The @Cacheable annotation can be applied either at method or type level. When applied at method level, then the annotated method’s return value is cached. When applied at type level, then the return value of every method is cached. The code below demonstrates how to apply @Cacheable at type level: @Cacheable(value = "employee") public class EmployeeDAO { public Person findEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } public Person findAnotherEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } } The Cacheable annotation takes three arguments: value, which is mandatory, together with key and condition. The first of these, value, is used to specify the name of the cache (or caches) in which the a method’s return value is stored. @Cacheable(value = "employee") public Person findEmployee(String firstName, String surname, int age) { return new Person(firstName, surname, age); } The code above ensures that the new Person object is stored in the “employee” cache. Any data stored in a cache requires a key for its speedy retrieval. Spring, by default, creates caching keys using the annotated method’s signature as demonstrated by the code above. You can override this using @Cacheable’s second parameter: key. To define a custom key you use a SpEL expression. @Cacheable(value = "employee", key = "#surname") public Person findEmployeeBySurname(String firstName, String surname, int age) { return new Person(firstName, surname, age); } In the findEmployeeBySurname(...) code, the ‘#surname’ string is a SpEL expression that means ‘go and create a key using the surname argument of the findEmployeeBySurname(...) method’. The final @Cacheable argument is the optional condition argument. Again, this references a SpEL expression, but this time it’s specifies a condition that’s used to determine whether or not your method’s return value is added to the cache. @Cacheable(value = "employee", condition = "#age < 25") public Person findEmployeeByAge(String firstName, String surname, int age) { return new Person(firstName, surname, age); } In the code above, I’ve applied the ludicrous business rule of only caching Person objects if the employee is less than 25 years old. Having quickly demonstrated how to apply some caching, the next thing to do is to take a look at what it all means. @Test public void testCache() { Person employee1 = instance.findEmployee("John", "Smith", 22); Person employee2 = instance.findEmployee("John", "Smith", 22); assertEquals(employee1, employee2); } The above test demonstrates caching at its simplest. The first call to findEmployee(...), the result isn’t yet cached so my code will be called and Spring will store its return value in the cache. In the second call to findEmployee(...) my code isn’t called and Spring returns the cached value; hence the local variable employee1 refers to the same object reference as employee2, which means that the following is true: assertEquals(employee1, employee2); But, things aren’t always so clear cut. Remember that in findEmployeeBySurname I’ve modified the caching key so that the surname argument is used to create the key and the thing to watch out for when creating your own keying algorithm is to ensure that any key refers to a unique object. @Test public void testCacheOnSurnameAsKey() { Person employee1 = instance.findEmployeeBySurname("John", "Smith", 22); Person employee2 = instance.findEmployeeBySurname("Jack", "Smith", 55); assertEquals(employee1, employee2); } The code above finds two Person instances which are clearly refer to different employees; however, because I’m caching on surname only, Spring will return a reference to the object that’s created during my first call to findEmployeeBySurname(...). This isn’t a problem with Spring, but with my poor cache key definition. Similar care has to be taken when referring to objects created by methods that have a condition applied to the @Cachable annotation. In my sample code I’ve applied the arbitrary condition of only caching Person instances where the employee is under 25 years old. @Test public void testCacheWithAgeAsCondition() { Person employee1 = instance.findEmployeeByAge("John", "Smith", 22); Person employee2 = instance.findEmployeeByAge("John", "Smith", 22); assertEquals(employee1, employee2); } In the above code, the references to employee1 and employee2 are equal because in the second call to findEmployeeByAge(...) Spring returns its cached instance. @Test public void testCacheWithAgeAsCondition2() { Person employee1 = instance.findEmployeeByAge("John", "Smith", 30); Person employee2 = instance.findEmployeeByAge("John", "Smith", 30); assertFalse(employee1 == employee2); } Similarly, in the unit test code above, the references to employee1 and employee2 refer to different objects as, in this case, John Smith is over 25. That just about covers @Cacheable, but what about @CacheEvict and clearing items form the cache? Also, there’s the question adding caching to your Spring config and choosing a suitable caching implementation. However, more on that later...

September 12, 2012

by Roger Hughes

· 15,785 Views

Fixing Bugs - If You Can't Reproduce a Bug, You Can't Fix It

Fixing a problem usually starts with reproducing it – what Steve McConnell calls “stabilizing the error.” Technically speaking, you can’t be sure you are fixing the problem unless you can run through the same steps, see the problem happen yourself, fix it, and then run through the same steps and make sure that the problem went away. If you can’t reproduce it, then you are only guessing at what’s wrong, and that means you are only guessing that your fix is going to work. But let’s face it – it’s not always practical or even possible to reproduce a problem. Lots of bug reports don’t include enough information for you to understand what the hell the problem actually was, never mind what was going on when the problem occurred – especially bug reports from the field. Rahul Premraj and Thomas Zimmermann found in The Art of Collecting Bug Reports (from the book Making Software), that the two most important factors in determining whether a bug report will get fixed or not are: Is the description well-written, can the programmer understand what was wrong or why the customer thought something was wrong? Does it include steps to reproduce the problem, even basic information about what they were doing when the problem happened? It’s not a lot to ask – from a good tester at least. But you can’t reasonably expect this from customers. There are other cases where you have enough information, but don’t have the tools or expertise to reproduce a problem – for example, when a pen tester has found a security bug using specialist tools that you don’t have or don’t understand how to use. Sometimes you can fix a problem without being able to see it happen in front of you, come up with a theory on your own, trusting your gut – especially if this is code that you recently worked on. But reproducing the problem first gives you the confidence that you aren’t wasting your time and that you actually fixed the right issue. Trying to reproduce the problem should almost always be your first step. What’s involved in reproducing a bug? What you want to do is to find, as quickly as possible, a simple test that consistently shows the problem, so that you can then run a set of experiments, trace through the code, isolate what’s wrong, and prove that it went away after you fixed the code. The best explanation that I’ve found of how to reproduce a bug is in Debug It! where Paul Butcher patiently explains the pre-conditions (identifying the differences between your test environment and the customer’s environment, and trying to control as many of them as possible), and then how to walk backwards from the error to recreate the conditions required to make the problem happen again. Butcher is confident that if you take a methodical approach, you will (almost) always be able to reproduce the problem successfully. In Why Programs Fail: A guide to Systematic Debugging, Andreas Zeller, a German Comp Sci professor, explains that it’s not enough just to make the problem happen again. Your goal is to come up with the simplest set of circumstances that will trigger the problem – the smallest set of data and dependencies, the simplest and most efficient test(s) with the fewest variables, the shortest path to making the problem happen. You need to understand what is not relevant to the problem, what’s just noise that adds to the cost and time of debugging and testing – and get rid of it. You do this using binary techniques to slice up the input data set, narrowing in on the data and other variables that you actually need, repeating this until the problem starts to become clear. Code Complete’s chapter on Debugging is another good guide on how to reproduce a problem following a set of iterative steps, and how to narrow in on the simplest and most useful set of test conditions required to make the problem happen; as well as common places to look for bugs: checking for code that has been changed recently, code that has a history of other bugs, code that is difficult to understand (if you find it hard to understand, there’s a good chance that the programmers who worked on it before you did too). Replay Tools One of the most efficient ways to reproduce a problem, especially in server code, is by automatically replaying the events that led up to the problem. To do this you’ll need to capture a time-sequenced record of what happened, usually from an audit log, and a driver to read and play the events against the system. And for this to work properly, the behavior of the system needs to be deterministic – given the same set of inputs in the same sequence, the same results will occur each time. Otherwise you’ll have to replay the logs over and over and hope for the right set of circumstances to occur again. On one system that I worked on, the back-end engine was a deterministic state machine designed specifically to support replay. All of the data and events, including configuration and control data and timer events, were recorded in an inbound event log that we could replay. There were no random factors or unpredictable external events – the behavior of the system could always be recreated exactly by replaying the log, making it easy to reproduce bugs from the field. It was a beautiful thing, but most code isn’t designed to support replay in this way. Recent research in virtual machine technology has led to the development of replay tools to snapshot and replay events in a virtual machine. VMWare Workstation, for example, included a cool replay debugging facility for C/C++ programmers which was “guaranteed to have instruction-by-instruction identical behavior each time.” Unfortunately, this was an expensive thing to make work, and it was dropped in version 8, at the end of last year. Replay Solutions provides replay for Java programs, creating a virtual machine to record the complete stream of events (including database I/O, network I/O, system calls, interrupts) as the application is running, and then later letting you simulate and replay the same events against a copy of the running system, so that you can debug the application and observe its behavior. They also offer similar application record and replay technology for mobile HTML5 and JavaScript applications. This is exciting stuff, especially for complex systems where it is difficult to setup and reproduce problems in different environments. Fuzzing and Randomness If the problem is non-deterministic, or you can't come up with the right set of inputs, one approach to try is to simulate random data inputs and watch to see what happens - hoping to happen on a set of input variables that will trigger the problem. This is called fuzzing. Fuzzing is a brute force testing technique that is used to uncover data validation weaknesses that can cause reliability and security problems. It's effective at finding bugs, but it’s a terribly inefficient way to reproduce a specific problem. First you need to setup something to fuzz the inputs (this is easy if a program is reading from a file, or a web form – there are fuzzing tools to help with this – but a hassle if you need to write your own smart protocol fuzzer to test against internal APIs). Then you need time to run through all of the tests (with mutation fuzzing, you may need to run tens of thousands or hundreds of thousands of tests to get enough interesting combinations) and more time to sift through and review all of the test results and understand any problems that are found. Through fuzzing you will get new information about the system to help you identity problem areas in the code, and maybe find new bugs, but you may not end up any closer to fixing the problem that you started on. Reproducing problems, especially when you are working from a bad bug report (“the system was running fine all day, then it crashed… the error said something about a null pointer I think?”) can be a serious time sink. But what if you can’t reproduce the problem at all? Let’s look at that next…

September 9, 2012

by Jim Bird

· 45,605 Views

"Schemas" in CouchDB

schema noun ( pl. schemata or schemas ) 1 technical a representation of a plan or theory in the form of an outline or model: a schema of scientific reasoning. 2 Logic a syllogistic figure. 3 (in Kantian philosophy) a conception of what is common to all members of a class; a general or essential type or form. CouchDB is a schema-less document store, but there are times when a schema is a good thing to have around, one way or another. So can you have your cake and eat it too? Below I'll take a high level look at adding a kind of schema to an application and the benefits and draw backs associated with this way of working. What I describe below isn't for everyone. It goes against some of the core principles of CouchDB and makes your data much less human readable, but there are cases where that trade off is worth making. Schemas: WTF?! It might seem a bit weird to add a schema to a schema-less database but sometimes it is a very useful thing indeed. When you're dealing with large datasets verbose object key names can be a problem (e.g. cost you money) so you end up stuck between a rock and a hard place; either make your data terse and hard to use or be explicit and spend more on storage and network. { "shape": "triangle", "colour_label": "red", "opposite_length_in_mm": 767.12254256805875, "angle_in_radians": 1.5514293603308698, "adjacent_length_in_mm": 73.59881843627835 } What usually happens is some middle ground where a nice descriptive name like "angle_in_radians" gets reduced to "angle" or "rads". That's fine in that it reduces the storage and network required to deal with all that data. { "adj": 73.59881843627835, "shape": "triangle", "angle": 1.5514293603308698, "opp": 767.12254256805875, "colour": "red" } However, by making this small change you move the description of the data out of your database and into some undefined place; higher level code, documentation, shared knowledge, a whiteboard, a notebook, someones head. As your data becomes more terse you might rely on duck typing (deriving from the data itself what the data describes) to get data that quacks right in your application. That's fine so long as you have data that is sufficiently distinguishable from the other ducks on the pond; if I rely on pulling a triangle object from the database because it has an angle member I might accidentally pull out a rhombus or an icosahedron. To make sure you get the data you expect you might add an explicit type field to each data (e.g. "type=goose" or "shape=triangle") something which I've always felt was rather odd. This starts to add up on storage (remember you have a large dataset/flock of ducks) and, more importantly, it doesn't help with where the description of the data is held - you know that you have a goose but don't know what a goose is. This last point is important, especially if you're working in a team of developers. Knowing what describing a shape as a triangle means is vital in producing consistent code that many people can work on. The straight jacket of a SQL schema looks pretty comfy sometimes. Okay, I'll buy that a schema might be useful... So how do you add a schema into a CouchDB database, something that is inherently schema-less? Can I get the best of both worlds? Here's a little trick that might help. First you define a document that is the schema for a particular type of data: { "_id": "datatype/triangle/v1", "fields": [ "opposite_length_in_mm", "adjacent_length_in_mm", "angle_in_radians", "colour_label" ] } Then you change your document structure to reference that "schema": { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ] } Note that the schema is versioned and that ordering in the data list is important here! I now know precisely what the data represents without having to store that description in the data itself. This way of working has benefits beyond disk storage; you reduce wire traffic, and there is less for a client to parse before rendering it. This is especially useful if you're rendering into a browser based visualisation - you don't need a complex set of objects to make a bar chart, just a list of x and y values. I can also share the data structure with colleagues and be reasonably confident that when I'm talking about a "v1 triangle" they'll know that lengths are in millimeters, are the opposite and adjacent sides and that the angle is in radians, hopefully reducing the chance of costly mistakes. Isn't that error prone? Yes and no. If you make a mistake in the ordering of your fields then, yes you are going to have issues. This is reasonably easy to manage with some form of client verification (e.g. validation on a web form) and generating the interface from the data (e.g. use the schema definition to build the GUI). If you're adding these data into the database by hand (e.g. via a curl or futon) then you aren't going to be in the regime where this trick is useful; your dataset needs to be large for this to make sense. Things still quack What's particularly nice about this way of working is that I can still duck type the data, add additional fields to annotate it etc. since the schema isn't strictly enforced. Nothing stops me from having a triangle document like: { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location" "space" } My views that deal with the data with a schema will still work (by ignoring these additional fields), my MVC framework will still render my pages, and I'll still have all the data I want in my database. Nesting You could have a nested object structure like: { "datatype": "pattern/v1", "data": [ { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location" "space" }, { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location" "space" }, { "datatype": "square/v1", data: [ 10, "green" ] } ] } But if you're going to have a schema you may as well reflect the nesting inside it, e.g say that you have a list of triangles and a list of squares: { "_id": "datatype/pattern/v1", "fields": [ ["triangle/v1"], ["square/v1"] ] } { "datatype": "pattern/v1", "data": [ [ { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location" "space" }, { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location" "space" } ], [ { data: [ 10, "green" ] } ] } Schema evolution A nice feature of this way of working is that you can deal with schema evolutions; changing the format of your data. { "_id": "datatype/triangle/v2", "fields": [ "opposite_length_in_cm", "hypotenuse_length_in_cm", "angle_in_degrees", "colour_label" ] } There are only so many ways you can represent the data. While sometimes you may have a major schema evolution, one where old data is completely unusable, often changes are just tweaks for consistency (say changing the units of a quantity) or extending the schema by adding in optional data. In either case you should be able to use data from multiple schema versions together by using appropriate manipulations on the data. For example you could instantiate shape objects via a factory which knows how to create the right object for different schema versions. Validation The above does no validation of the data; the color field in the input data could be set to a number instead of a string, the angle to something non- physical etc. If you really needed validation you could do it with CouchDB's validation functions. If you go the fully validated route you'd want to define the schema in the design document (instead of as a normal doc) and use a CommonJS include to make sure that the validator in the app was doing the same thing as the schema. This ties you to a version of the design document (which is where the validators live), which may or may not be an issue. It will also considerably slow down insertion rate as CouchDB has to do more work to add your data. Personally I prefer to put validation logic in the client making writes. Views If I were using this way of working I would want to have a view which returned all the schema's defined on the database. This then allows me to build objects appropriately. A view to return schema's documents would look like: function(doc) { if (doc._id.slice(0, 'datatype'.length) == 'datatype') { emit (doc._id.slice('datatype/'.length, doc._id.length), doc.fields) } } You can pull out documents that have a schema with a simple view like: function(doc) { if (doc.datatype){ emit(doc.datatype, doc.data); } } This can be queried to find objects of a given shape using CouchDB's view slicing (e.g. ?startkey="square/v1"&endkey="square/v2") which returns data like: {"id":"datatype/square/v1","key":["square/v1",0],"value":["side_length_in_mm","colour_label"]}, {"id":"f98ffe7e4cd91cbb0d904f9098499ca8","key":["square/v1",1],"value":[872.4342711412228,"green"]}, {"id":"f98ffe7e4cd91cbb0d904f909849a218","key":["square/v1",1],"value":[370.29971491443905,"yellow"]}, {"id":"f98ffe7e4cd91cbb0d904f909849acd0","key":["square/v1",1],"value":[8.799279300193753,"yellow"]} You'll notice the name of the "schema" is the key and the values are held in value. This means I can parse the data into a set of appropriate objects with something like: var objects = []; function build(schema, data){ // Build the appropriate object for the schema... } for (row in data){ // build up the objects in a factory var obj = build(row.key, row.value); objects.push(obj); } If I wanted all versions of a shape the query would be, and used a vNUMERIC_COUNTER notation for versioning, ?startkey="square/v1"&endkey="square/vXXX" as numbers sort lower than strings. Taking it to the extreme If you are really worried about data size you can take this technique to the extreme by encoding the data arrays as a byte string and using the schema documents to describe that byte array. This effectively turns your JSON structure into something not dissimilar to a protocol buffer, at the expense of human readability and view complexity. If you are particularly concerned with data size over the wire (for example are writing an MMORPG) then this may be an acceptable trade off. Reminder This trick isn't suitable for every dataset. If you modify the data by hand it is prone to error. If you have a small dataset, or only ever send a small subset of the data to the client it's massive overkill. But if you have a large dataset of machine generated data, that needs to be frequently accessed over the WAN (think a monitoring app or game) then this is a nice way to reduce storage, network IO and browser render time. It's also worth reiterating that the schema is not enforced, you could have a square with 3 sides, and that adding strict schema enforcement with a validation function will considerably slow down insert rate.

September 8, 2012

by Simon Metson

· 10,381 Views

Java 7: HashMap vs ConcurrentHashMap

As you may have seen from my past performance related articles and HashMap case studies, Java thread safety problems can bring down your Java EE application and the Java EE container fairly easily. One of most common problems I have observed when troubleshooting Java EE performance problems is infinite looping triggered from the non-thread safe HashMap get() and put() operations. This problem is known since several years but recent production problems have forced me to revisit this issue one more time. This article will revisit this classic thread safety problem and demonstrate, using a simple Java program, the risk associated with a wrong usage of the plain old java.util.HashMap data structure involved in a concurrent threads context. This proof of concept exercise will attempt to achieve the following 3 goals: Revisit and compare the Java program performance level between the non-thread safe and thread safe Map data structure implementations (HashMap, Hashtable, synchronized HashMap, ConcurrentHashMap) Replicate and demonstrate the HashMap infinite looping problem using a simple Java program that everybody can compile, run and understand Review the usage of the above Map data structures in a real-life and modern Java EE container implementation such as JBoss AS7 For more detail on the ConcurrentHashMap implementation strategy, I highly recommend the great article from Brian Goetz on this subject. Tools and server specifications As a starting point, find below the different tools and software’s used for the exercise: Sun/Oracle JDK & JRE 1.7 64-bit Eclipse Java EE IDE Windows Process Explorer (CPU per Java Thread correlation) JVM Thread Dump (stuck thread analysis and CPU per Thread correlation) The following local computer was used for the problem replication process and performance measurements: Intel(R) Core(TM) i5-2520M CPU @ 2.50Ghz (2 CPU cores, 4 logical cores) 8 GB RAM Windows 7 64-bit * Results and performance of the Java program may vary depending of your workstation or server specifications. Java program In order to help us achieve the above goals, a simple Java program was created as per below: The main Java program is HashMapInfiniteLoopSimulator.java A worker Thread class WorkerThread.java was also created The program is performing the following: Initialize different static Map data structures with initial size of 2 Assign the chosen Map to the worker threads (you can chose between 4 Map implementations) Create a certain number of worker threads (as per the header configuration). 3 worker threads were created for this proof of concept NB_THREADS = 3; Each of these worker threads has the same task: lookup and insert a new element in the assigned Map data structure using a random Integer element between 1 – 1 000 000. Each worker thread perform this task for a total of 500K iterations The overall program performs 50 iterations in order to allow enough ramp up time for the HotSpot JVM The concurrent threads context is achieved using the JDK ExecutorService As you can see, the Java program task is fairly simple but complex enough to generate the following critical criteria’s: Generate concurrency against a shared / static Map data structure Use a mix of get() and put() operations in order to attempt to trigger internal locks and / or internal corruption (for the non-thread safe implementation) Use a small Map initial size of 2, forcing the internal HashMap to trigger an internal rehash/resize Finally, the following parameters can be modified at your convenience: ## Number of worker threads private static final int NB_THREADS = 3; ## Number of Java program iterations private static final int NB_TEST_ITERATIONS = 50; ## Map data structure assignment. You can choose between 4 structures // Plain old HashMap (since JDK 1.2) nonThreadSafeMap = new HashMap(2); // Plain old Hashtable (since JDK 1.0) threadSafeMap1 = new Hashtable(2); // Fully synchronized HashMap threadSafeMap2 = new HashMap(2); threadSafeMap2 = Collections.synchronizedMap(threadSafeMap2); // ConcurrentHashMap (since JDK 1.5) threadSafeMap3 = new ConcurrentHashMap(2); /*** Assign map at your convenience ****/ assignedMapForTest = threadSafeMap3; Now find below the source code of our sample program. #### HashMapInfiniteLoopSimulator.java package org.ph.javaee.training4; import java.util.Collections; import java.util.Map; import java.util.HashMap; import java.util.Hashtable; import java.util.concurrent.ConcurrentHashMap; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; /** * HashMapInfiniteLoopSimulator * @author Pierre-Hugues Charbonneau * */ public class HashMapInfiniteLoopSimulator { private static final int NB_THREADS = 3; private static final int NB_TEST_ITERATIONS = 50; private static Map assignedMapForTest = null; private static Map nonThreadSafeMap = null; private static Map threadSafeMap1 = null; private static Map threadSafeMap2 = null; private static Map threadSafeMap3 = null; /** * Main program * @param args */ public static void main(String[] args) { System.out.println("Infinite Looping HashMap Simulator"); System.out.println("Author: Pierre-Hugues Charbonneau"); System.out.println("http://javaeesupportpatterns.blogspot.com"); for (int i=0; i(2); // Plain old Hashtable (since JDK 1.0) threadSafeMap1 = new Hashtable(2); // Fully synchronized HashMap threadSafeMap2 = new HashMap(2); threadSafeMap2 = Collections.synchronizedMap(threadSafeMap2); // ConcurrentHashMap (since JDK 1.5) threadSafeMap3 = new ConcurrentHashMap(2); // ConcurrentHashMap /*** Assign map at your convenience ****/ assignedMapForTest = threadSafeMap3; long timeBefore = System.currentTimeMillis(); long timeAfter = 0; Float totalProcessingTime = null; ExecutorService executor = Executors.newFixedThreadPool(NB_THREADS); for (int j = 0; j < NB_THREADS; j++) { /** Assign the Map at your convenience **/ Runnable worker = new WorkerThread(assignedMapForTest); executor.execute(worker); } // This will make the executor accept no new threads // and finish all existing threads in the queue executor.shutdown(); // Wait until all threads are finish while (!executor.isTerminated()) { } timeAfter = System.currentTimeMillis(); totalProcessingTime = new Float( (float) (timeAfter - timeBefore) / (float) 1000); System.out.println("All threads completed in "+totalProcessingTime+" seconds"); } } } #### WorkerThread.java package org.ph.javaee.training4; import java.util.Map; /** * WorkerThread * * @author Pierre-Hugues Charbonneau * */ public class WorkerThread implements Runnable { private Map map = null; public WorkerThread(Map assignedMap) { this.map = assignedMap; } @Override public void run() { for (int i=0; i<500000; i++) { // Return 2 integers between 1-1000000 inclusive Integer newInteger1 = (int) Math.ceil(Math.random() * 1000000); Integer newInteger2 = (int) Math.ceil(Math.random() * 1000000); // 1. Attempt to retrieve a random Integer element Integer retrievedInteger = map.get(String.valueOf(newInteger1)); // 2. Attempt to insert a random Integer element map.put(String.valueOf(newInteger2), newInteger2); } } } Performance comparison between thread safe Map implementations The first goal is to compare the performance level of our program when using different thread safe Map implementations: Plain old Hashtable (since JDK 1.0) Fully synchronized HashMap (via Collections.synchronizedMap()) ConcurrentHashMap (since JDK 1.5) Find below the graphical results of the execution of the Java program for each iteration along with a sample of the program console output. # Output when using ConcurrentHashMap Infinite Looping HashMap Simulator Author: Pierre-Hugues Charbonneau http://javaeesupportpatterns.blogspot.com All threads completed in 0.984 seconds All threads completed in 0.908 seconds All threads completed in 0.706 seconds All threads completed in 1.068 seconds All threads completed in 0.621 seconds All threads completed in 0.594 seconds All threads completed in 0.569 seconds All threads completed in 0.599 seconds ……………… As you can see, the ConcurrentHashMap is the clear winner here, taking in average only half a second (after an initial ramp-up) for all 3 worker threads to concurrently read and insert data within a 500K looping statement against the assigned shared Map. Please note that no problem was found with the program execution e.g. no hang situation. The performance boost is definitely due to the improved ConcurrentHashMap performance such as the non-blocking get() operation. The 2 other Map implementations performance level was fairly similar with a small advantage for the synchronized HashMap. HashMap infinite looping problem replication The next objective is to replicate the HashMap infinite looping problem observed so often from Java EE production environments. In order to do that, you simply need to assign the non-thread safe HashMap implementation as per code snippet below: /*** Assign map at your convenience ****/ assignedMapForTest = nonThreadSafeMap; Running the program as is using the non-thread safe HashMap should lead to: No output other than the program header Significant CPU increase observed from the system At some point the Java program will hang and you will be forced to kill the Java process What happened? In order to understand this situation and confirm the problem, we will perform a CPU per Thread analysis from the Windows OS using Process Explorer and JVM Thread Dump. 1 - Run the program again then quickly capture the thread per CPU data from Process Explorer as per below. Under explore.exe you will need to right click over the javaw.exe and select properties. The threads tab will be displayed. We can see overall 4 threads using almost all the CPU of our system. 2 – Now you have to quickly capture a JVM Thread Dump using the JDK 1.7 jstack utility. For our example, we can see our 3 worker threads which seems busy/stuck performing get() and put() operations. ..\jdk1.7.0\bin>jstack 272 2012-08-29 14:07:26 Full thread dump Java HotSpot(TM) 64-Bit Server VM (21.0-b17 mixed mode): "pool-1-thread-3" prio=6 tid=0x0000000006a3c000 nid=0x18a0 runnable [0x0000000007ebe000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.put(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) "pool-1-thread-2" prio=6 tid=0x0000000006a3b800 nid=0x6d4 runnable [0x000000000805f000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.get(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:29) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) "pool-1-thread-1" prio=6 tid=0x0000000006a3a800 nid=0x2bc runnable [0x0000000007d9e000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.put(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) .............. 3 – CPU per thread correlation It is now time to convert the Process Explorer thread ID DECIMAL format to HEXA format as per below. The HEXA value allows us to map and identify each thread as per below: ## TID: 1748 (nid=0X6D4) Thread name: pool-1-thread-2 CPU @25.71% Task: Worker thread executing a HashMap.get() operation at java.util.HashMap.get(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:29) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) ## TID: 700 (nid=0X2BC) Thread name: pool-1-thread-1 CPU @23.55% Task: Worker thread executing a HashMap.put() operation at java.util.HashMap.put(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) ## TID: 6304 (nid=0X18A0) Thread name: pool-1-thread-3 CPU @12.02% Task: Worker thread executing a HashMap.put() operation at java.util.HashMap.put(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) ## TID: 5944 (nid=0X1738) Thread name: pool-1-thread-1 CPU @20.88% Task: Main Java program execution "main" prio=6 tid=0x0000000001e2b000 nid=0x1738 runnable [0x00000000029df000] java.lang.Thread.State: RUNNABLE at org.ph.javaee.training4.HashMapInfiniteLoopSimulator.main(HashMapInfiniteLoopSimulator.java:75) As you can see, the above correlation and analysis is quite revealing. Our main Java program is in a hang state because our 3 worker threads are using lot of CPU and not going anywhere. They may appear "stuck" performing HashMap get() & put() but in fact they are all involved in an infinite loop condition. This is exactly what we wanted to replicate. HashMap infinite looping deep dive Now let’s push the analysis one step further to better understand this looping condition. For this purpose, we added tracing code within the JDK 1.7 HashMap Java class itself in order to understand what is happening. Similar logging was added for the put() operation and also a trace indicating that the internal & automatic rehash/resize got triggered. The tracing added in get() and put() operations allows us to determine if the for() loop is dealing with circular dependency which would explain the infinite looping condition. #### HashMap.java get() operation public V get(Object key) { if (key == null) return getForNullKey(); int hash = hash(key.hashCode()); /*** P-H add-on- iteration counter ***/ int iterations = 1; for (Entry e = table[indexFor(hash, table.length)]; e != null; e = e.next) { /*** Circular dependency check ***/ Entry currentEntry = e; Entry nextEntry = e.next; Entry nextNextEntry = e.next != null?e.next.next:null; K currentKey = currentEntry.key; K nextNextKey = nextNextEntry != null?(nextNextEntry.key != null?nextNextEntry.key:null):null; System.out.println("HashMap.get() #Iterations : "+iterations++); if (currentKey != null && nextNextKey != null ) { if (currentKey == nextNextKey || currentKey.equals(nextNextKey)) System.out.println(" ** Circular Dependency detected! ["+currentEntry+"]["+nextEntry+"]"+"]["+nextNextEntry+"]"); } /***** END ***/ Object k; if (e.hash == hash && ((k = e.key) == key || key.equals(k))) return e.value; } return null; } HashMap.get() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.resize() in progress... HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 2 HashMap.resize() in progress... HashMap.resize() in progress... HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 2 HashMap.put() #Iterations : 1 HashMap.get() #Iterations : 1 HashMap.get() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.get() #Iterations : 1 HashMap.get() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.get() #Iterations : 1 HashMap.put() #Iterations : 1 ** Circular Dependency detected! [362565=362565][333326=333326]][362565=362565] HashMap.put() #Iterations : 2 ** Circular Dependency detected! [333326=333326][362565=362565]][333326=333326] HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.get() #Iterations : 1 HashMap.put() #Iterations : 1 ............................. HashMap.put() #Iterations : 56823 Again, the added logging was quite revealing. We can see that following a few internal HashMap.resize() the internal structure became affected, creating circular dependency conditions and triggering this infinite looping condition (#iterations increasing and increasing...) with no exit condition. It is also showing that the resize() / rehash operation is the most at risk of internal corruption, especially when using the default HashMap size of 16. This means that the initial size of the HashMap appears to be a big factor in the risk & problem replication. Finally, it is interesting to note that we were able to successfully run the test case with the non-thread safe HashMap by assigning an initial size setting at 1000000, preventing any resize at all. Find below the merged graph results: The HashMap was our top performer but only when preventing an internal resize. Again, this is definitely not a solution to the thread safe risk but just a way to demonstrate that the resize operation is the most at risk given the entire manipulation of the HashMap performed at that time. The ConcurrentHashMap, by far, is our overall winner by providing both fast performance and thread safety against that test case. JBoss AS7 Map data structures usage We will now conclude this article by looking at the different Map implementations within a modern Java EE container implementation such as JBoss AS 7.1.2. You can obtain the latest source code from the github master branch. Find below the report: Total JBoss AS7.1.2 Java files (August 28, 2012 snapshot): 7302 Total Java classes using java.util.Hashtable: 72 Total Java classes using java.util.HashMap: 512 Total Java classes using synchronized HashMap: 18 Total Java classes using ConcurrentHashMap: 46 Hashtable references were found mainly within the test suite components and from naming and JNDI related implementations. This low usage is not a surprise here. References to the java.util.HashMap were found from 512 Java classes. Again not a surprise given how common this implementation is since the last several years. However, it is important to mention that a good ratio was found either from local variables (not shared across threads), synchronized HashMap or manual synchronization safeguard so “technically” thread safe and not exposed to the above infinite looping condition (pending/hidden bugs is still a reality given the complexity with Java concurrency programming…this case study involving Oracle Service Bus 11g is a perfect example). A low usage of synchronized HashMap was found with only 18 Java classes from packages such as JMS, EJB3, RMI and clustering. Finally, find below a breakdown of the ConcurrentHashMap usage which was our main interest here. As you will see below, this Map implementation is used by critical JBoss components layers such as the Web container, EJB3 implementation etc. ## JBoss Single Sign On Used to manage internal SSO ID's involving concurrent Thread access Total: 1 ## JBoss Java EE & Web Container Not surprising here since lot of internal Map data structures are used to manage the http sessions objects, deployment registry, clustering & replication, statistics etc. with heavy concurrent Thread access. Total: 11 ## JBoss JNDI & Security Layer Used by highly concurrent structures such as internal JNDI security management. Total: 4 ## JBoss domain & managed server management, rollout plans... Total: 7 ## JBoss EJB3 Used by data structures such as File Timer persistence store, application Exception, Entity Bean cache, serialization, passivation... Total: 8 ## JBoss kernel, Thread Pools & protocol management Used by high concurrent Threads Map data structures involved in handling and dispatching/processing incoming requests such as HTTP. Total: 3 ## JBoss connectors such as JDBC/XA DataSources... Total: 2 ## Weld (reference implementation of JSR-299: Contexts and Dependency Injection for the JavaTM EE platform) Used in the context of ClassLoader and concurrent static Map data structures involving concurrent Threads access. Total: 3 ## JBoss Test Suite Used in some integration testing test cases such as an internal Data Store, ClassLoader testing etc. Total: 3 Final words I hope this article has helped you revisit this classic problem and understand one of the common problems and risks associated with a wrong usage of the non-thread safe HashMap implementation. My main recommendation to you is to be careful when using an HashMap in a concurrent threads context. Unless you are a Java concurrency expert, I recommend that you use ConcurrentHashMap instead which offers a very good balance between performance and thread safety. As usual, extra due diligence is always recommended such as performing cycles of load & performance testing. This will allow you to detect thread safety and / or performance problems before you promote the solution to your client production environment. Please provide any comments and share your experience with ConcurrentHashMap or HashMap implementations and troubleshooting.

September 7, 2012

by Pierre - Hugues Charbonneau

· 154,940 Views · 5 Likes

Adding Hibernate Entity Level Filtering feature to Spring Data JPA Repository

Original Article: http://borislam.blogspot.hk/2012/07/adding-hibernate-entity-level-filter.html Those who have used data filtering features of hibernate should know that it is very powerful. You could define a set of filtering criteria to an entity class or a collection. Spring data JPA is a very handy library but it does not have fitering features. In this post, I will demonstarte how to add the hibernate filter features at entity level. You can use this features when you are using Hibernate Entity Manager. We can just define annotation in your repositoy interface to enable this features. Step 1. Define filter at entity level as usual. Just use hibernate @FilterDef annotation @Entity @Table(name = "STUDENT") @FilterDef(name="filterBySchoolAndClass", parameters={@ParamDef(name="school", type="string"),@ParamDef(name="class", type="integer")}) public class Student extends GenericEntity implements Serializable { // add your properties ... } Step2. Define two custom annotations. These two annotations are to be used in your repository interfaces. You could apply the hibernate filter defined in step 1 to specific query through these annotations. @Target(ElementType.TYPE) @Retention(RetentionPolicy.RUNTIME) public @interface EntityFilter { FilterQuery[] filterQueries() default {}; } @Retention(RetentionPolicy.RUNTIME) public @interface FilterQuery { String name() default ""; String jpql() default ""; } Step3. Add a method to your Spring data JPA base repository. This method will read the annotation you defined (i.e. @FilterQuery) and apply hibernate filter to the query by just simply unwrap the EntityManager. You could specify the parameter in your hibernate filter and also the parameter in you query in this method. If you do not know how to add custom method to your Spring data JPA base repository, please see my previous article for how to customize your Spring data JPA base repository for detail. You can see in previous article that I intentionally expose the repository interface (i.e. the springDataRepositoryInterface property) in the GenericRepositoryImpl. This small tricks enable me to access the annotation in the repository interface easily. public List doQueryWithFilter( String filterName, String filterQueryName, Map inFilterParams, Map inQueryParams){ if (GenericRepository.class.isAssignableFrom(getSpringDataRepositoryInterface())) { Annotation entityFilterAnn = getSpringDataRepositoryInterface().getAnnotation(EntityFilter.class); if(entityFilterAnn != null){ EntityFilter entityFilter = (EntityFilter)entityFilterAnn; FilterQuery[] filterQuerys = entityFilter.filterQueries() ; for (FilterQuery fQuery : filterQuerys) { if (StringUtils.equals(filterQueryName, fQuery.name())) { String jpql = fQuery.jpql(); Filter filter = em.unwrap(Session.class).enableFilter(filterName); //set filter parameter for (Object key: inFilterParams.keySet()) { String filterParamName = key.toString(); Object filterParamValue = inFilterParams.get(key); filter.setParameter(filterParamName, filterParamValue); } //set query parameter Query query= em.createQuery(jpql); for (Object key: inQueryParams.keySet()) { String queryParamName = key.toString(); Object queryParamValue = inQueryParams.get(key); query.setParameter(queryParamName, queryParamValue); } return query.getResultList(); } } } } } return null; } Last Step: example usage In your repositry, define which query you would like to apply hibernate filter through your @EntityFilter and @FilterQuery annotation. @EntityFilter ( filterQueries = { @FilterQuery(name="query1", jpql="SELECT s FROM Student LEFT JOIN FETCH s.Subject where s.subject = :subject" ), @FilterQuery(name="query2", jpql="SELECT s FROM Student LEFT JOIN s.TeacherSubject where s.teacher = :teacher") } ) public interface StudentRepository extends GenericRepository { } In your service or business class that inject your repository, you could just simply call the doQueryWithFilter() method to enable the filtering function. @Service public class StudentService { @Inject private StudentRepository studentRepository; public List searchStudent( String subject, String school, String class) { List studentList; // Prepare parameters for query filter HashMap inFilterParams = new HashMap(); inFilterParams.put("school", "Hong Kong Secondary School"); inFilterParams.put("class", "S5"); // Prepare parameters for query HashMap inParams = new HashMap(); inParams.put("subject", "Physics"); studentList = studentRepository.doQueryWithFilter( "filterBySchoolAndClass", "query1", inFilterParams, inParams); return studentList; } }

August 24, 2012

by Boris Lam

· 56,850 Views · 1 Like

Spring Data, Spring Security and Envers integration

Learn about pros, cons, and basics of Spring security and data, plus Envers integration.

August 20, 2012

by Nicolas Fränkel

· 25,090 Views · 1 Like

EF Migrations Command Reference

Entity Framework Migrations are handled from the package manager console in Visual Studio. The usage is shown in various tutorials, but I haven’t found a complete list of the commands available and their usage, so I created my own. There are four available commands. Enable-Migrations: Enables Code First Migrations in a project. Add-Migration: Scaffolds a migration script for any pending model changes. Update-Database: Applies any pending migrations to the database. Get-Migrations: Displays the migrations that have been applied to the target database. The information here is the output of running get-help command-name -detailed for each of the commands in the package manager console (running EF 4.3.1). I’ve also added some own comments where I think some information is missing. My own comments are placed under the Additional Information heading. Please note that all commands should be entered on the same line. I’ve added line breaks to avoid vertical scrollbars. Enable-Migrations Enables Code First Migrations in a project. Syntax Enable-Migrations [-EnableAutomaticMigrations] [[-ProjectName] ] [-Force] [] Description Enables Migrations by scaffolding a migrations configuration class in the project. If the target database was created by an initializer, an initial migration will be created (unless automatic migrations are enabled via the EnableAutomaticMigrations parameter). Parameters -EnableAutomaticMigrations Specifies whether automatic migrations will be enabled in the scaffolded migrations configuration. If ommitted, automatic migrations will be disabled. -ProjectName Specifies the project that the scaffolded migrations configuration class will be added to. If omitted, the default project selected in package manager console is used. -Force Specifies that the migrations configuration be overwritten when running more than once for given project. This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters. Remarks To see the examples, type: get-help Enable-Migrations -examples. For more information, type: get-help Enable-Migrations -detailed. For technical information, type: get-help Enable-Migrations -full. Additional Information The flag for enabling automatic migrations is saved in the Migrations\Configuration.cs file, in the constructor. To later change the option, just change the assignment in the file. public Configuration() { AutomaticMigrationsEnabled = false; } Add-Migration Scaffolds a migration script for any pending model changes. Syntax Add-Migration [-Name] [-Force] [-ProjectName ] [-StartUpProjectName ] [-ConfigurationTypeName ] [-ConnectionStringName ] [-IgnoreChanges] [] Add-Migration [-Name] [-Force] [-ProjectName ] [-StartUpProjectName ] [-ConfigurationTypeName ] -ConnectionString -ConnectionProviderName [-IgnoreChanges] [] Description Scaffolds a new migration script and adds it to the project. Parameters -Name Specifies the name of the custom script. -Force Specifies that the migration user code be overwritten when re-scaffolding an existing migration. -ProjectName Specifies the project that contains the migration configuration type to be used. If ommitted, the default project selected in package manager console is used. -StartUpProjectName Specifies the configuration file to use for named connection strings. If omitted, the specified project’s configuration file is used. -ConfigurationTypeName Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project. -ConnectionStringName Specifies the name of a connection string to use from the application’s configuration file. -ConnectionString Specifies the the connection string to use. If omitted, the context’s default connection will be used. -ConnectionProviderName Specifies the provider invariant name of the connection string. -IgnoreChanges Scaffolds an empty migration ignoring any pending changes detected in the current model. This can be used to create an initial, empty migration to enable Migrations for an existing database. N.B. Doing this assumes that the target database schema is compatible with the current model. This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters. Remarks To see the examples, type: get-help Add-Migration -examples. For more information, type: get-help Add-Migration -detailed. For technical information, type: get-help Add-Migration -full. Update-Database Applies any pending migrations to the database. Syntax Update-Database [-SourceMigration ] [-TargetMigration ] [-Script] [-Force] [-ProjectName ] [-StartUpProjectName ] [-ConfigurationTypeName ] [-ConnectionStringName ] [] Update-Database [-SourceMigration ] [-TargetMigration ] [-Script] [-Force] [-ProjectName ] [-StartUpProjectName ] [-ConfigurationTypeName ] -ConnectionString -ConnectionProviderName [] Description Updates the database to the current model by applying pending migrations. Parameters -SourceMigration Only valid with -Script. Specifies the name of a particular migration to use as the update’s starting point. If ommitted, the last applied migration in the database will be used. -TargetMigration Specifies the name of a particular migration to update the database to. If ommitted, the current model will be used. -Script Generate a SQL script rather than executing the pending changes directly. -Force Specifies that data loss is acceptable during automatic migration of the database. -ProjectName Specifies the project that contains the migration configuration type to be used. If ommitted, the default project selected in package manager console is used. -StartUpProjectName Specifies the configuration file to use for named connection strings. If omitted, the specified project’s configuration file is used. -ConfigurationTypeName Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project. -ConnectionStringName Specifies the name of a connection string to use from the application’s configuration file. -ConnectionString Specifies the the connection string to use. If omitted, the context’s default connection will be used. -ConnectionProviderName Specifies the provider invariant name of the connection string. This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters. Remarks To see the examples, type: get-help Update-Database -examples. For more information, type: get-help Update-Database -detailed. For technical information, type: get-help Update-Database -full. Additional Information The command always runs any pending code-based migrations first. If the database is still incompatible with the model the additional changes required are applied as an separate automatic migration step if automatic migrations are enabled. If automatic migrations are disabled an error message is shown. Get-Migrations Displays the migrations that have been applied to the target database. Syntax Get-Migrations [-ProjectName ] [-StartUpProjectName ] [-ConfigurationTypeName ] [-ConnectionStringName ] [] Get-Migrations [-ProjectName ] [-StartUpProjectName ] [-ConfigurationTypeName ] -ConnectionString -ConnectionProviderName [] Description Displays the migrations that have been applied to the target database. Parameters -ProjectName Specifies the project that contains the migration configuration type to be used. If ommitted, the default project selected in package manager console is used. -StartUpProjectName Specifies the configuration file to use for named connection strings. If omitted, the specified project’s configuration file is used. -ConfigurationTypeName Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project. -ConnectionStringName Specifies the name of a connection string to use from the application’s configuration file. -ConnectionString Specifies the the connection string to use. If omitted, the context’s default connection will be used. -ConnectionProviderName Specifies the provider invariant name of the connection string. This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters. Remarks To see the examples, type: get-help Get-Migrations -examples. For more information, type: get-help Get-Migrations -detailed. For technical information, type: get-help Get-Migrations -full. Additional Information The powershell commands are complex powershell functions, located in the tools\EntityFramework.psm1 file of the Entity Framework installation. The powershell code is mostly a wrapper around the System.Data.Entity.Migrations.MigrationsCommands found in the tools\EntityFramework\EntityFramework.PowerShell.dll file. First a MigrationsCommands object is instantiated with all configuration parameters. Then there is a public method on the MigrationsCommands object for each of the available commands.

August 20, 2012

by Anders Abel

· 31,400 Views · 1 Like

How to Migrate Drupal to Azure Web Sites

DrupalCon Munich is next week, and I am lucky enough to be going. As part of preparing for the conference, I thought it would be worthwhile to see just how easy (or difficult) it would be to migrate an existing Drupal site to Windows Azure Web Sites. So, in this post, I’ll do just that. Fortunately, because Windows Azure Web Sites supports both PHP and MySQL, the migration process is relatively straightforward. And, because Drupal and PHP run on any platform, the process I’ll describe should work for moving Drupal to Windows Azure Web Sites regardless of what platform you are moving from. Of course, Drupal installations can vary widely, so YMMV. I tested the instructions below on relatively small (and simple) Drupal installation running on CentOS 5. (Unfortunately, I won’t be using Drush since it isn’t supported on Windows Azure Websites.) If you are considering moving a large and complex Drupal application, may want to consider moving to Windows Azure Cloud Services (more information about that here: Migrating a Drupal Site from LAMP to Windows Azure). Before getting started, it’s worth noting that Windows Azure Websites lets you run up to 10 Web Sites for free in a multitenant environment. And, you can seamlessly upgrade to private, reserved VM instances as your traffic grows. To sign up, try the Windows Azure 90-day free trial. 1. Create a Windows Azure Web Site and MySQL database There is a step-by-step tutorial on http://www.windowsazure.com that walks you through creating a new website and a MySQL database, so I’ll refer you there to get started: Create a PHP-MySQL Windows Azure web site and deploy using Git. If you intend to use Git to publish your Drupal site, then go ahead and follow the instructions for setting up a Git repository. Make sure to follow the instructions in the Get remote MySQL connection information section as you will need that information later. You can ignore the remainder of the tutorial for the purposes of deploying your Drupal site, but if you are new to Windows Azure Web Sites (and to Git), you might find the additional reading informative. Ok, now you have a new website with a MySQL database, your have your MySQL database connection information, and you have (optionally) created a remote Git repository and made note of the Git deployment instructions. Now you are ready to copy your database to MySQL in Windows Azure Web Sites. 2. Copy database to MySQL in Windows Azure Web Sites I’m sure there is more than one way to copy your Drupal database, but I found the mysqldump tool to be effective and easy to use. To copy from a local machine to Windows Azure Web Sites, here’s the command I used: mysqldump -u local_username --password=local_password drupal | mysql -h remote_host -u remote_username --password=remote_password remote_db_name You will, of course, have to provide the username and password for your existing Drupal database, and you will have to provide the hostname, username, password, and database name for the MySQL database you created in step 1. This information is available in the connection string information that you should have noted in step 1. i.e. You should have a connection string that looks something like this: Database=remote_db_name;Data Source=remote_host;User Id=remote_username;Password=remote_password Depending on the size of your database, the copying process could take several minutes. Now your Drupal database is live in Windows Azure Websites. Before you deploy your Drupal code, you need to modify it so it can connect to the new database. 3. Modify database connection info in settings.php Here, you will again need your new database connection information. Open the /drupal/sites/default/setting.php file in your favorite text editor, and replace the values of ‘database’, ‘username’, ‘password’, and ‘host’ in the $databases array with the correct values for your new database. When you are finished, you should have something similar to this: $databases = array ( 'default' => array ( 'default' => array ( 'database' => 'remote_db_name', 'username' => 'remote_username', 'password' => 'remote_password', 'host' => 'remote_host', 'port' => '', 'driver' => 'mysql', 'prefix' => '', ), ), ); Be sure to save the settings.phpfile, then you are ready to deploy. 4. Deploy Drupal code using Git or FTP The last step is to deploy your code to Windows Azure Web Sites using Git or FTP. If you are using FTP, you can get the FTP hostname and username from you website’s dashboard. Then, use your favorite FTP client to upload your Drupal files to the /site/wwwroot folder of the remote site. If you are using Git, you need to set up a Git repository in Windows Azure Web Sites (steps for this are in the tutorial mentioned earlier). And, you will need Git installed on your local machine. Then, just follow the instructions provided after you created the repository: One note about using Git here: depending on your Git settings, your .gitignore file (a hidden file and a sibling to the .git folder created in your local root directory after you executed git commit), some files in your Drupal application may be ignored. In my case, all the files in the sites directory were ignored. If this happens, you will want to edit the .gitignore file so that these files aren’t ignored and redeploy. After you have deployed Drupal to Windows Azure Web Sites, you can continue to deploy updates via Git or FTP. Related information If you are looking for more information about Windows Azure Web Sites, these posts might be helpful: Windows Azure Websites- A PHP Perspective Windows Azure Websites, Web Roles, and VMs- When to use which- Configuring PHP in Windows Azure Websites with .user.ini Files One last thing you might consider, depending on your site, is using the Windows Azure Integration Module to store and serve your site’s media files.

August 19, 2012

by Brian Swan

· 10,261 Views

Machine Learning: Measuring Similarity and Distance

Measuring similarity or distance between two data points is fundamental to many Machine Learning algorithms such as K-Nearest-Neighbor, Clustering ... etc.

August 10, 2012

by Ricky Ho

· 54,352 Views · 6 Likes

Generate a Random Alpha Numeric String

Generate a random alpha numeric string whose length is the number of characters specified. Characters will be chosen from the set of alpha-numeric characters. Count is the length of random string to create. private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; public static String randomAlphaNumeric(int count) { StringBuilder builder = new StringBuilder(); while (count-- != 0) { int character = (int)(Math.random()*ALPHA_NUMERIC_STRING.length()); builder.append(ALPHA_NUMERIC_STRING.charAt(character)); } return builder.toString(); }

August 9, 2012

by Kunal Bhatia

· 164,569 Views · 3 Likes

String Memory Internals

This article is based on my answer on StackOverflow. I am trying to explain how String class stores the texts, how interning and constant pool works. The main point to understand here is the distinction between String Java object and its contents - char[] under private value field. String is basically a wrapper around char[] array, encapsulating it and making it impossible to modify so the String can remain immutable. Also the String class remembers which parts of this array is actually used (see below). This all means that you can have two different String objects (quite lightweight) pointing to the same char[]. I will show you few examples, together with hashCode() of each String and hashCode() of internal char[] value field (I will call it text to distinguish it from string). Finally I'll show javap -c -verbose output, together with constant pool for my test class. Please do not confuse class constant pool with string literal pool. They are not quite the same. See also Understanding javap's output for the Constant Pool Prerequisites For the purpose of testing I created such a utility method that breaks String encapsulation: private int showInternalCharArrayHashCode(String s) { final Field value = String.class.getDeclaredField("value"); value.setAccessible(true); return value.get(s).hashCode(); } It will print hashCode() of char[] value, effectively helping us understand whether this particular String points to the same char[] text or not. Two string literals in a class Let's start from the simplest example. Java code String one = "abc"; String two = "abc"; BTW if you simply write "ab" + "c", Java compiler will perform concatenation at compile time and the generated code will be exactly the same. This only works if all strings are known at compile time. Class constant pool Each class has its own constant pool - a list of constant values that can be reused if they occur several times in the source code. It includes common strings, numbers, method names, etc. Here are the contents of the constant pool in our example above: const #2 = String #38; // abc //... const #38 = Asciz abc; The important thing to note is the distinction between String constant object (#2) and Unicode encoded text "abc" (#38) that the string points to. Byte code Here is generated byte code. Note that both one and two references are assigned with the same #2 constant pointing to "abc" string: ldc #2; //String abc astore_1 //one ldc #2; //String abc astore_2 //two Output For each example I am printing the following values: System.out.println("one.value: " + showInternalCharArrayHashCode(one)); System.out.println("two.value: " + showInternalCharArrayHashCode(two)); System.out.println("one" + System.identityHashCode(one)); System.out.println("two" + System.identityHashCode(two)); No surprise that both pairs are equal: one.value: 23583040 two.value: 23583040 one: 8918249 two: 8918249 Which means that not only both objects point to the same char[] (the same text underneath) so equals() test will pass. But even more, one and two are the exact same references! So one == two is true as well. Obviously if one and two point to the same object then one.value and two.value must be equal. Literal and new String() Java code Now the example we all waited for - one string literal and one new String using the same literal. How will this work? String one = "abc"; String two = new String("abc"); The fact that "abc" constant is used two times in the source code should give you some hint... Class constant pool Same as above. Byte code ldc #2; //String abc astore_1 //one new #3; //class java/lang/String dup ldc #2; //String abc invokespecial #4; //Method java/lang/String."":(Ljava/lang/String;)V astore_2 //two Look carefully! The first object is created the same way as above, no surprise. It just takes a constant reference to already created String (#2) from the constant pool. However the second object is created via normal constructor call. But! The first String is passed as an argument. This can be decompiled to: String two = new String(one); Output The output is a bit surprising. The second pair, representing references to String object is understandable - we created two String objects - one was created for us in the constant pool and the second one was created manually for two. But why, on earth the first pair suggests that both String objects point to the same char[] value array?! one.value: 41771 two.value: 41771 one: 8388097 two: 16585653 It becomes clear when you look at how String(String) constructor works (greatly simplified here): public String(String original) { this.offset = original.offset; this.count = original.count; this.value = original.value; } See? When you are creating new String object based on existing one, it reuses char[] value. Strings are immutable, there is no need to copy data structure that is known to be never modified. Moreover, since new String(someString) creates an exact copy of existing string and strings are immutable, there is clearly no reason for the two to exist at the same time. I think this is the clue of some misunderstandings: even if you have two String objects, they might still point to the same contents. And as you can see the String object itself is quite small. Runtime modification and intern() Java code Let's say you initially used two different strings but after some modifications they are all the same: String one = "abc"; String two = "?abc".substring(1); //also two = "abc" The Java compiler (at least mine) is not clever enough to perform such operation at compile time, have a look: Class constant pool Suddenly we ended up with two constant strings pointing to two different constant texts: const #2 = String #44; // abc const #3 = String #45; // ?abc const #44 = Asciz abc; const #45 = Asciz ?abc; Byte Code ldc #2; //String abc astore_1 //one ldc #3; //String ?abc iconst_1 invokevirtual #4; //Method String.substring:(I)Ljava/lang/String; astore_2 //two The fist string is constructed as usual. The second is created by first loading the constant "?abc" string and then calling substring(1) on it. Output No surprise here - we have two different strings, pointing to two different char[] texts in memory: one.value: 27379847 two.value: 7615385 one: 8388097 two: 16585653 Well, the texts aren't really different, equals() method will still yield true. We have two unnecessary copies of the same text. Now we should run two exercises. First, try running: two = two.intern(); before printing hash codes. Not only both one and two point to the same text, but they are the same reference! one.value: 11108810 two.value: 11108810 one: 15184449 two: 15184449 This means both one.equals(two) and one == two tests will pass. Also we saved some memory because "abc" text appears only once in memory (the second copy will be garbage collected). The second exercise is slightly different, check out this: String one = "abc"; String two = "abc".substring(1); Obviously one and two are two different objects, pointing to two different texts. But how come the output suggests that they both point to the same char[] array?!? one.value: 23583040two.value: 23583040one: 11108810two: 8918249 I'll leave the answer to you. It'll teach you how substring() works, what are the advantages of such approach and when it can lead to big troubles. Lessons learnt String object itself is rather cheap. It's the text it points to that consumes most of the memory String is just a thin wrapper around char[] to preserve immutability new String("abc") isn't really that expensive as the internal text representation is reused. But still avoid such construct. When String is concatenated from constant values known at compile time, concatenation is done by the compiler, not by the JVM substring() is tricky, but most importantly, it is very cheap, both in terms of used memory and run time (constant in both cases)

August 8, 2012

by Tomasz Nurkiewicz

· 52,555 Views · 5 Likes

Spring Data With Cassandra Using JPA

We recently adopted the use of Spring Data. Spring Data provides a nice pattern/API that you can layer on top of JPA to eliminate boiler-plate code. With that adoption, we started looking at the DAO layer we use against Cassandra for some of our operations. Some of the data we store in Cassandra is simple. It does *not* leverage the flexible nature of NoSQL. In other words, we know all the table names, the column names ahead of time, and we don't anticipate them changing all that often. We could have stored this data in an RDBMs, using hibernate to access it, but standing up another persistence mechanism seemed like overkill. For simplicity's sake, we preferred storing this data in Cassandra. That said, we want the flexibility to move this to an RDBMs if we need to. Enter JPA. JPA would provide us a nice layer of abstraction away from the underlying storage mechanism. Wouldn't it be great if we could annotate the objects with JPA annotations, and persist them to Cassandra? Enter Kundera. Kundera is a JPA implementation that supports Cassandra (among other storage mechanisms). OK -- so JPA is great, and would get us what we want, but we had just adopted the use of Spring Data. Could we use both? The answer is "sort of". I forked off SpringSource's spring-data-cassandra: https://github.com/boneill42/spring-data-cassandra And I started hacking on it. I managed to get an implementation of the PagingAndSortingRepository for which I wrote unit tests that worked, but I was duplicating a lot of what should have come for free in the SimpleJpaRepository. When I tried to substitute my CassandraJpaRepository for the SimpleJpaRepository, I ran into some trouble w/ Kundera. Specifically, the MetaModel implementation appeared to be incomplete. MetaModelImpl was returning null for all managedTypes(). SimpleJpa wasn't too happy with this. Instead of wrangling with Kundera, we punted. We can achieve enough of the value leveraging JPA directly. Perhaps more importantly, there is still an impedance mismatch between JPA and NoSQL. In our case, it would have been nice to get at Cassandra through Spring Data using JPA for a few cases in our app, but for the vast majority of the application, a straight up ORM layer whereby we know the tables, rows and column names ahead of time is insufficient. For those cases where we don't know the schema ahead of time, we're going to need to leverage the converters pattern in Spring Data. So, I started hacking on a proper Spring Data layer using Astyanax as the client. Follow along here: https://github.com/boneill42/spring-data-cassandra More to come on that....

July 31, 2012

by Brian O' Neill

· 30,277 Views

Use Lucene’s MMapDirectory on 64bit Platforms, Please!

Don’t be afraid – Some clarification to common misunderstandings Since version 3.1, Apache Lucene and Solr use MMapDirectory by default on 64bit Windows and Solaris systems; since version 3.3 also for 64bit Linux systems. This change lead to some confusion among Lucene and Solr users, because suddenly their systems started to behave differently than in previous versions. On the Lucene and Solr mailing lists a lot of posts arrived from users asking why their Java installation is suddenly consuming three times their physical memory or system administrators complaining about heavy resource usage. Also consultants were starting to tell people that they should not use MMapDirectory and change their solrconfig.xml to work instead with slow SimpleFSDirectory or NIOFSDirectory (which is much slower on Windows, caused by a JVM bug #6265734). From the point of view of the Lucene committers, who carefully decided that using MMapDirectory is the best for those platforms, this is rather annoying, because they know, that Lucene/Solr can work with much better performance than before. Common misinformation about the background of this change causes suboptimal installations of this great search engine everywhere. In this blog post, I will try to explain the basic operating system facts regarding virtual memory handling in the kernel and how this can be used to largely improve performance of Lucene (“VIRTUAL MEMORY for DUMMIES”). It will also clarify why the blog and mailing list posts done by various people are wrong and contradict the purpose of MMapDirectory. In the second part I will show you some configuration details and settings you should take care of to prevent errors like “mmap failed” and suboptimal performance because of stupid Java heap allocation. Virtual Memory[1] Let’s start with your operating system’s kernel: The naive approach to do I/O in software is the way, you have done this since the 1970s – the pattern is simple: whenever you have to work with data on disk, you execute a syscall to your operating system kernel, passing a pointer to some buffer (e.g. a byte[] array in Java) and transfer some bytes from/to disk. After that you parse the buffer contents and do your program logic. If you don’t want to do too many syscalls (because those may cost a lot processing power), you generally use large buffers in your software, so synchronizing the data in the buffer with your disk needs to be done less often. This is one reason, why some people suggest to load the whole Lucene index into Java heap memory (e.g., by using RAMDirectory). But all modern operating systems like Linux, Windows (NT+), MacOS X, or Solaris provide a much better approach to do this 1970s style of code by using their sophisticated file system caches and memory management features. A feature called “virtual memory” is a good alternative to handle very large and space intensive data structures like a Lucene index. Virtual memory is an integral part of a computer architecture; implementations require hardware support, typically in the form of a memory management unit (MMU) built into the CPU. The way how it works is very simple: Every process gets his own virtual address space where all libraries, heap and stack space is mapped into. This address space in most cases also start at offset zero, which simplifies loading the program code because no relocation of address pointers needs to be done. Every process sees a large unfragmented linear address space it can work on. It is called “virtual memory” because this address space has nothing to do with physical memory, it just looks like so to the process. Software can then access this large address space as if it were real memory without knowing that there are other processes also consuming memory and having their own virtual address space. The underlying operating system works together with the MMU (memory management unit) in the CPU to map those virtual addresses to real memory once they are accessed for the first time. This is done using so called page tables, which are backed by TLBs located in the MMU hardware (translation lookaside buffers, they cache frequently accessed pages). By this, the operating system is able to distribute all running processes’ memory requirements to the real available memory, completely transparent to the running programs. Schematic drawing of virtual memory (image from Wikipedia [1], http://en.wikipedia.org/wiki/File:Virtual_memory.svg, licensed by CC BY-SA 3.0) By using this virtualization, there is one more thing, the operating system can do: If there is not enough physical memory, it can decide to “swap out” pages no longer used by the processes, freeing physical memory for other processes or caching more important file system operations. Once a process tries to access a virtual address, which was paged out, it is reloaded to main memory and made available to the process. The process does not have to do anything, it is completely transparent. This is a good thing to applications because they don’t need to know anything about the amount of memory available; but also leads to problems for very memory intensive applications like Lucene. Lucene & Virtual Memory Let’s take the example of loading the whole index or large parts of it into “memory” (we already know, it is only virtual memory). If we allocate a RAMDirectory and load all index files into it, we are working against the operating system: The operating system tries to optimize disk accesses, so it caches already all disk I/O in physical memory. We copy all these cache contents into our own virtual address space, consuming horrible amounts of physical memory (and we must wait for the copy operation to take place!). As physical memory is limited, the operating system may, of course, decide to swap out our large RAMDirectory and where does it land? – On disk again (in the OS swap file)! In fact, we are fighting against our O/S kernel who pages out all stuff we loaded from disk [2]. So RAMDirectory is not a good idea to optimize index loading times! Additionally, RAMDirectory has also more problems related to garbage collection and concurrency. Because the data residing in swap space, Java’s garbage collector has a hard job to free the memory in its own heap management. This leads to high disk I/O, slow index access times, and minute-long latency in your searching code caused by the garbage collector driving crazy. On the other hand, if we don’t use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory, we have to pay another price: Our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk or filesystem cache and our buffers residing in Java heap. This needs to be done on every search request, over and over again. Memory Mapping Files The solution to the above issues is MMapDirectory, which uses virtual memory and a kernel feature called “mmap” [3] to access the disk files. In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does! Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache. It is now just a native memory access, nothing more! We don’t have to take care of paging in/out of buffers, all this is managed by the O/S kernel. Furthermore, we have no concurrency issue, the only overhead over a standard byte[] array is some wrapping caused by Java’s ByteBuffer interface (it is still slower than a real byte[] array, but that is the only way to use mmap from Java and is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all Java GC issues described before. What does this all mean to our Lucene/Solr application? We should not work against the operating system anymore, so allocate as less as possible heap space (-Xmx Java option). Remember, our index accesses rely on passed directly to O/S cache! This is also very friendly to the Java garbage collector. Free as much as possible physical memory to be available for the O/S kernel as file system cache. Remember, our Lucene code works directly on it, so reducing the number of paging/swapping between disk and memory. Allocating too much heap to our Lucene application hurts performance! Lucene does not require it with MMapDirectory. Why does this only work as expected on operating systems and Java virtual machines with 64bit? One limitation of 32bit platforms is the size of pointers, they can refer to any address within 0 and 232-1, which is 4 Gigabytes. Most operating systems limit that address space to 3 Gigabytes because the remaining address space is reserved for use by device hardware and similar things. This means the overall linear address space provided to any process is limited to 3 Gigabytes, so you cannot map any file larger than that into this “small” address space to be available as big byte[] array. And when you mapped that one large file, there is no virtual space (address like “house number”) available anymore. As physical memory sizes in current systems already have gone beyond that size, there is no address space available to make use for mapping files without wasting resources (in our case “address space”, not physical memory!). On 64bit platforms this is different: 264-1 is a very large number, a number in excess of 18 quintillion bytes, so there is no real limit in address space. Unfortunately, most hardware (the MMU, CPU’s bus system) and operating systems are limiting this address space to 47 bits for user mode applications (Windows: 43 bits) [4]. But there is still much of addressing space available to map terabytes of data. Common misunderstandings If you have read carefully what I have told you about virtual memory, you can easily verify that the following is true: MMapDirectory does not consume additional memory and the size of mapped index files is not limited by the physical memory available on your server. By mmap() files, we only reserve address space not memory! Remember, address space on 64bit platforms is for free! MMapDirectory will not load the whole index into physical memory. Why should it do this? We just ask the operating system to map the file into address space for easy access, by no means we are requesting more. Java and the O/S optionally provide the option to try loading the whole file into RAM (if enough is available), but Lucene does not use that option (we may add this possibility in a later version). MMapDirectory does not overload the server when “top” reports horrible amounts of memory. “top” (on Linux) has three columns related to memory: “VIRT”, “RES”, and “SHR”. The first one (VIRT, virtual) is reporting allocated virtual address space (and that one is for free on 64 bit platforms!). This number can be multiple times of your index size or physical memory when merges are running in IndexWriter. If you have only one IndexReader open it should be approximately equal to allocated heap space (-Xmx) plus index size. It does not show physical memory used by the process. The second column (RES, resident) memory shows how much (physical) memory the process allocated for operating and should be in the size of your Java heap space. The last column (SHR, shared) shows how much of the allocated virtual address space is shared with other processes. If you have several Java applications using MMapDirectory to access the same index, you will see this number going up. Generally, you will see the space needed by shared system libraries, JAR files, and the process executable itself (which are also mmapped). How to configure my operating system and Java VM to make optimal use of MMapDirectory? First of all, default settings in Linux distributions and Solaris/Windows are perfectly fine. But there are some paranoid system administrators around, that want to control everything (with lack of understanding). Those limit the maximum amount of virtual address space that can be allocated by applications. So please check that “ulimit -v” and “ulimit -m” both report “unlimited”, otherwise it may happen that MMapDirectory reports “mmap failed” while opening your index. If this error still happens on systems with lot’s of very large indexes, each of those with many segments, you may need to tune your kernel parameters in /etc/sysctl.conf: The default value of vm.max_map_count is 65530, you may need to raise it. I think, for Windows and Solaris systems there are similar settings available, but it is up to the reader to find out how to use them. For configuring your Java VM, you should rethink your memory requirements: Give only the really needed amount of heap space and leave as much as possible to the O/S. As a rule of thumb: Don’t use more than ¼ of your physical memory as heap space for Java running Lucene/Solr, keep the remaining memory free for the operating system cache. If you have more applications running on your server, adjust accordingly. As usual the more physical memory the better, but you don’t need as much physical memory as your index size. The kernel does a good job in paging in frequently used pages from your index. A good possibility to check that you have configured your system optimally is by looking at both "top" (and correctly interpreting it, see above) and the similar command "iotop" (can be installed, e.g., on Ubuntu Linux by "apt-get install iotop"). If your system does lots of swap in/swap out for the Lucene process, reduce heap size, you possibly used too much. If you see lot's of disk I/O, buy more RUM (Simon Willnauer) so mmapped files don't need to be paged in/out all the time, and finally: buy SSDs. Happy mmapping! Bibliography [1] http://en.wikipedia.org/wiki/Virtual_memory [2] https://www.varnish-cache.org/trac/wiki/ArchitectNotes [3] http://en.wikipedia.org/wiki/Memory-mapped_file [4] http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details

July 31, 2012

by Uwe Schindler

· 13,947 Views · 1 Like

Managing Camel Routes With JMX APIs

Here is a quick example of how to programmatically access Camel MBeans to monitor and manipulate routes... first, get a connection to a JMX server (assumes localhost, port 1099, no auth) note, always cache the connection for subsequent requests (can cause memory utilization issues otherwise) JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi"); JMXConnector jmxc = JMXConnectorFactory.connect(url); MBeanServerConnection server = jmxc.getMBeanServerConnection(); use the following to iterate over all routes and retrieve statistics (state, exchanges, etc)... ObjectName objName = new ObjectName("org.apache.camel:type=routes,*"); List cacheList = new LinkedList(server.queryNames(objName, null)); for (Iterator iter = cacheList.iterator(); iter.hasNext();) { objName = iter.next(); String keyProps = objName.getCanonicalKeyPropertyListString(); ObjectName objectInfoName = new ObjectName("org.apache.camel:" + keyProps); String routeId = (String) server.getAttribute(objectInfoName, "RouteId"); String description = (String) server.getAttribute(objectInfoName, "Description"); String state = (String) server.getAttribute(objectInfoName, "State"); ... } use the following to execute operations against a Camel route (stop,start, etc) ObjectName objName = new ObjectName("org.apache.camel:type=routes,*"); List cacheList = new LinkedList(server.queryNames(objName, null)); for (Iterator iter = cacheList.iterator(); iter.hasNext();) { objName = iter.next(); String keyProps = objName.getCanonicalKeyPropertyListString(); if(keyProps.contains(routeID)) { ObjectName objectRouteName = new ObjectName("org.apache.camel:" + keyProps); Object[] params = {}; String[] sig = {}; server.invoke(objectRouteName, operationName, params, sig); return; } } summary These APIs can easily be used to build a web or command line based tool to support remote Camel management features. All of these features are available via the JMX console and Camel does provide a web console to support some management/monitoring tasks. See these pages for more information... http://camel.apache.org/camel-jmx.html http://camel.apache.org/web-console.html

July 30, 2012

by Ben O'Day

· 12,045 Views

Understanding Vector Clocks with Riak

Riak is one databases that uses vector clocks for conflict resolution. I came across these two blog posts on Basho.com, company which develops Riak, and these posts are great at explaining the basics of Vector Clocks - definitely a must read if you're into distributed systems: Why vector clocks are easy? Why vector clocks are hard? Voldermort DB (by LinkedIn) is another DB that uses Vector Clocks, as explained below. Not surprisingly, it also takes the idea from Amazon's Dynamo (like Riak): The redundancy of storage makes the system more resilient to server failure. Since each value is stored N times, you can tolerate as many as N – 1 machine failures without data loss. This causes other problems, though. Since each value is stored in multiple places it is possible that one of these servers will not get updated (say because it is crashed when the update occurs). To help solve this problem Voldemort uses a data versioning mechanism called Vector Clocks that are common in distributed programming. This is an idea we took from Amazon’s Dynamo system. This data versioning allows the servers to detect stale data when it is read and repair it. Voldermort's code in Java can be on code.google.com. Finally, before I end this post, you may be asking "why complicate so much?" (if I could get a penny every time I heard that when discussing distributed systems... :-). But in this case, it's a good and typical question: can't we just use timestamp and last one wins? The problem, though, is that it requires times to be perfectly synchronized - which is very difficult and oftentimes impossible. By using vector clocks, you don't have this requirement on the system.

July 25, 2012

by Rodrigo De Castro

· 9,192 Views

Generational Caching and Envers

Konrad recently shared on our company’s technical room an interesting article on how caching is done is a big polish social network, nk.pl. One of the central concepts in the algorithm is generational caching (see here or here). The basic idea is that for cache keys you use some entity-specific string + version number. The version number increases whenever data changes, thus invalidating any old cache entries, and preventing stale data reads. This makes the assumption that the cache has some garbage collection, e.g. it may simply be a LRU cache. Of course on each request we must know the version number – that’s why it must be stored in a global cache (but depending on our consistency requirements, it also may be distributed across the cluster asynchronously). However the data itself can be stored in local caches. So if our system is read-most, the only “expensive” operation that we will have to do per request is retrieve the version numbers for the entities we are interested in. And this is usually very simple information, which can be kept entirely in-memory. Depending on the type of data and the usage patterns, you can cache individual entities (e.g. for a Person entity, the cache key could be person-9128-123, 9128 being the id, 123 the version number), or the whole lot (e.g. for a Countries entity, the cache key could be countries-8, 8 being the version number). Moreover in the global cache you can keep the latest version number per-id or per-entity; meaning that when the version changes, you invalidate a specific entity or all of them. Having written most of Envers, it quite naturally occurred to me that you may use the entity revision numbers as the cache versions. Subsequent Envers revisions are monotonically increasing numbers, for each transaction you get the next one. So whenever a cached entity changes, you would have to populate the global cache with the latest revision number. Envers provides several ways to get the revision numbers. During the transaction, you can call AuditReader.getCurrentRevision() method, which will give you the revision metadata, including the revision number. If you want more fine-grained control, you may implement your own listener (EntityTrackingRevisionListener), see the docs), and get notified whenever an entity is changed, and update the global cache in there. You can also register an after-transaction-completed callback, and update the cache outside of the transaction boundaries. Or, if you know the entity ids, you may lookup the maximum revision number using either AuditReader.getRevisions or an AuditQueryCreator. As you can obtain the current revision number during a transaction, you may even update the version/revision in the global cache atomically, if you use a transactional cache such as Infinispan. All of that of course in addition to auditing, which is still the main purpose of Envers :)

July 24, 2012

by Adam Warski

· 4,891 Views