The Latest and Popular SDLC Topics

Use Lucene’s MMapDirectory on 64bit Platforms, Please!

Don’t be afraid: some clarification of common misunderstandings

Since version 3.1, Apache Lucene and Solr use MMapDirectory by default on 64bit Windows and Solaris systems; since version 3.3 also on 64bit Linux systems. This change led to some confusion among Lucene and Solr users, because suddenly their systems started to behave differently than in previous versions. On the Lucene and Solr mailing lists a lot of posts arrived from users asking why their Java installation was suddenly consuming three times their physical memory, or from system administrators complaining about heavy resource usage. Consultants also started to tell people that they should not use MMapDirectory and should change their solrconfig.xml to work instead with the slow SimpleFSDirectory or NIOFSDirectory (which is much slower on Windows, caused by JVM bug #6265734). From the point of view of the Lucene committers, who carefully decided that using MMapDirectory is the best choice for those platforms, this is rather annoying, because they know that Lucene/Solr can work with much better performance than before. Common misinformation about the background of this change causes suboptimal installations of this great search engine everywhere.

In this blog post, I will try to explain the basic operating system facts regarding virtual memory handling in the kernel and how this can be used to largely improve the performance of Lucene (“VIRTUAL MEMORY for DUMMIES”). It will also clarify why the blog and mailing list posts by various people are wrong and contradict the purpose of MMapDirectory. In the second part I will show you some configuration details and settings you should take care of to prevent errors like “mmap failed” and suboptimal performance caused by stupid Java heap allocation.
Virtual Memory [1]

Let’s start with your operating system’s kernel: The naive approach to I/O in software is the way you have done it since the 1970s. The pattern is simple: whenever you have to work with data on disk, you execute a syscall to your operating system kernel, passing a pointer to some buffer (e.g. a byte[] array in Java), and transfer some bytes from/to disk. After that you parse the buffer contents and do your program logic. If you don’t want to do too many syscalls (because those may cost a lot of processing power), you generally use large buffers in your software, so synchronizing the data in the buffer with your disk needs to be done less often. This is one reason why some people suggest loading the whole Lucene index into Java heap memory (e.g., by using RAMDirectory).

But all modern operating systems like Linux, Windows (NT+), MacOS X, or Solaris provide a much better alternative to this 1970s style of code through their sophisticated file system caches and memory management features. A feature called “virtual memory” is a good alternative for handling very large and space-intensive data structures like a Lucene index. Virtual memory is an integral part of a computer architecture; implementations require hardware support, typically in the form of a memory management unit (MMU) built into the CPU. The way it works is very simple: Every process gets its own virtual address space into which all libraries, heap, and stack space are mapped. In most cases this address space also starts at offset zero, which simplifies loading the program code because no relocation of address pointers needs to be done. Every process sees a large unfragmented linear address space it can work on. It is called “virtual memory” because this address space has nothing to do with physical memory; it just looks that way to the process.
Software can then access this large address space as if it were real memory, without knowing that there are other processes also consuming memory and having their own virtual address spaces. The underlying operating system works together with the MMU (memory management unit) in the CPU to map those virtual addresses to real memory once they are accessed for the first time. This is done using so-called page tables, which are backed by TLBs located in the MMU hardware (translation lookaside buffers; they cache frequently accessed pages). In this way, the operating system is able to distribute all running processes’ memory requirements over the real available memory, completely transparently to the running programs.

Schematic drawing of virtual memory (image from Wikipedia [1], http://en.wikipedia.org/wiki/File:Virtual_memory.svg, licensed under CC BY-SA 3.0)

By using this virtualization, there is one more thing the operating system can do: If there is not enough physical memory, it can decide to “swap out” pages no longer used by the processes, freeing physical memory for other processes or for caching more important file system operations. Once a process tries to access a virtual address that was paged out, the page is reloaded into main memory and made available to the process. The process does not have to do anything; it is completely transparent. This is a good thing for applications because they don’t need to know anything about the amount of memory available, but it also leads to problems for very memory-intensive applications like Lucene.

Lucene & Virtual Memory

Let’s take the example of loading the whole index or large parts of it into “memory” (we already know it is only virtual memory). If we allocate a RAMDirectory and load all index files into it, we are working against the operating system: The operating system tries to optimize disk accesses, so it already caches all disk I/O in physical memory.
We copy all these cache contents into our own virtual address space, consuming horrible amounts of physical memory (and we must wait for the copy operation to take place!). As physical memory is limited, the operating system may, of course, decide to swap out our large RAMDirectory, and where does it land? On disk again (in the OS swap file)! In fact, we are fighting against our O/S kernel, which pages out all the stuff we loaded from disk [2]. So RAMDirectory is not a good idea for optimizing index loading times! Additionally, RAMDirectory has further problems related to garbage collection and concurrency. Because the data resides in swap space, Java’s garbage collector has a hard job freeing the memory in its own heap management. This leads to high disk I/O, slow index access times, and minute-long latency in your searching code caused by the garbage collector going crazy.

On the other hand, if we don’t use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory instead, we have to pay another price: Our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk or file system cache and our buffers residing in the Java heap. This needs to be done on every search request, over and over again.

Memory Mapping Files

The solution to the above issues is MMapDirectory, which uses virtual memory and a kernel feature called “mmap” [3] to access the disk files. In our previous approaches, we were relying on a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does! Basically, mmap does the same as treating the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space and make them look like RAM available to our Lucene process.
We can then access our index file on disk just as if it were a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code, we don’t need to do any syscalls; the processor’s MMU and TLB handle all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into the file system cache. If it is already in the cache, the MMU/TLB map it directly to the physical memory in the file system cache. It is now just a native memory access, nothing more! We don’t have to take care of paging buffers in and out; all this is managed by the O/S kernel. Furthermore, we have no concurrency issue; the only overhead over a standard byte[] array is some wrapping caused by Java’s ByteBuffer interface (it is still slower than a real byte[] array, but that is the only way to use mmap from Java, and it is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all the Java GC issues described before.

What does this all mean to our Lucene/Solr application?

- We should not work against the operating system anymore, so allocate as little heap space as possible (-Xmx Java option). Remember, our index accesses are passed directly to the O/S cache! This is also very friendly to the Java garbage collector.
- Free as much physical memory as possible to be available to the O/S kernel as file system cache. Remember, our Lucene code works directly on it, which reduces the amount of paging/swapping between disk and memory. Allocating too much heap to our Lucene application hurts performance! Lucene does not require it with MMapDirectory.

Why does this only work as expected on operating systems and Java virtual machines with 64bit?

One limitation of 32bit platforms is the size of pointers: they can refer to any address between 0 and 2^32-1, which gives 4 Gigabytes.
Most operating systems limit that address space to 3 Gigabytes because the remaining address space is reserved for use by device hardware and similar things. This means the overall linear address space provided to any process is limited to 3 Gigabytes, so you cannot map any file larger than that into this “small” address space to make it available as a big byte[] array. And once you have mapped that one large file, there is no virtual address space (think of addresses as “house numbers”) left. As physical memory sizes in current systems have already gone beyond that size, there is no address space available for mapping files without wasting resources (in our case “address space”, not physical memory!).

On 64bit platforms this is different: 2^64-1 is a very large number, in excess of 18 quintillion bytes, so there is no real limit on address space. Unfortunately, most hardware (the MMU, the CPU’s bus system) and operating systems limit this address space to 47 bits for user mode applications (Windows: 43 bits) [4]. But that still leaves plenty of address space to map terabytes of data.

Common misunderstandings

If you have read carefully what I have told you about virtual memory, you can easily verify that the following is true:

- MMapDirectory does not consume additional memory, and the size of mapped index files is not limited by the physical memory available on your server. By mmap()ing files, we only reserve address space, not memory! Remember, address space on 64bit platforms is for free!
- MMapDirectory will not load the whole index into physical memory. Why should it do this? We just ask the operating system to map the file into address space for easy access; by no means are we requesting more. Java and the O/S optionally provide the option to try loading the whole file into RAM (if enough is available), but Lucene does not use that option (we may add this possibility in a later version).
- MMapDirectory does not overload the server when “top” reports horrible amounts of memory. “top” (on Linux) has three columns related to memory: “VIRT”, “RES”, and “SHR”. The first one (VIRT, virtual) reports allocated virtual address space (and that one is for free on 64bit platforms!). This number can be several times your index size or physical memory when merges are running in IndexWriter. If you have only one IndexReader open, it should be approximately equal to the allocated heap space (-Xmx) plus the index size. It does not show physical memory used by the process. The second column (RES, resident memory) shows how much (physical) memory the process has allocated for operation and should be about the size of your Java heap space. The last column (SHR, shared) shows how much of the allocated virtual address space is shared with other processes. If you have several Java applications using MMapDirectory to access the same index, you will see this number going up. Generally, you will see the space needed by shared system libraries, JAR files, and the process executable itself (which are also mmapped).

How to configure my operating system and Java VM to make optimal use of MMapDirectory?

First of all, the default settings in Linux distributions and Solaris/Windows are perfectly fine. But there are some paranoid system administrators around who want to control everything (without understanding it). They limit the maximum amount of virtual address space that can be allocated by applications. So please check that “ulimit -v” and “ulimit -m” both report “unlimited”; otherwise it may happen that MMapDirectory reports “mmap failed” while opening your index. If this error still happens on systems with lots of very large indexes, each of them with many segments, you may need to tune your kernel parameters in /etc/sysctl.conf: The default value of vm.max_map_count is 65530; you may need to raise it.
I think there are similar settings available for Windows and Solaris systems, but it is up to the reader to find out how to use them.

For configuring your Java VM, you should rethink your memory requirements: Give it only the amount of heap space it really needs and leave as much as possible to the O/S. As a rule of thumb: Don’t use more than ¼ of your physical memory as heap space for Java running Lucene/Solr, and keep the remaining memory free for the operating system cache. If you have more applications running on your server, adjust accordingly. As usual, the more physical memory the better, but you don’t need as much physical memory as your index size. The kernel does a good job of paging in frequently used pages from your index.

A good way to check that you have configured your system optimally is to look at both "top" (and interpret it correctly, see above) and the similar command "iotop" (which can be installed, e.g., on Ubuntu Linux with "apt-get install iotop"). If your system does lots of swap in/swap out for the Lucene process, reduce the heap size; you possibly used too much. If you see lots of disk I/O, buy more RUM (Simon Willnauer) so mmapped files don't need to be paged in/out all the time, and finally: buy SSDs.

Happy mmapping!

Bibliography

[1] http://en.wikipedia.org/wiki/Virtual_memory
[2] https://www.varnish-cache.org/trac/wiki/ArchitectNotes
[3] http://en.wikipedia.org/wiki/Memory-mapped_file
[4] http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details
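As a quick check of the address-space figures quoted in the article (32, 43, 47, and 64 bits), the sizes can be computed directly. The following is only a minimal arithmetic sketch in JavaScript; the helper names addressSpaceBytes and formatBytes are made up for illustration and have nothing to do with Lucene itself.

```javascript
// Size in bytes of a flat address space with the given pointer width.
// BigInt is used because 2^64 exceeds Number.MAX_SAFE_INTEGER.
function addressSpaceBytes(bits) {
  return 2n ** BigInt(bits);
}

// Render a byte count with binary units (1 KiB = 1024 B), rounding down.
function formatBytes(bytes) {
  const units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"];
  let value = bytes;
  let i = 0;
  while (value >= 1024n && i < units.length - 1) {
    value /= 1024n; // BigInt division truncates
    i += 1;
  }
  return value.toString() + " " + units[i];
}

// Pointer widths mentioned in the article: 32bit, Windows user mode (43),
// typical x86-64 user mode (47), and the full 64 bits.
[32, 43, 47, 64].forEach(function (bits) {
  console.log(bits + " bits: " + formatBytes(addressSpaceBytes(bits)));
});
```

Even the restricted 43-bit and 47-bit user-mode spaces come out in the terabyte range, which is why mapping a multi-gigabyte index only costs address space on 64bit platforms.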

July 31, 2012

by Uwe Schindler

Implementing a Command Line With Eval in JavaScript

This blog post explores JavaScript’s eval function by implementing the foundation for an interactive command line. As a bonus, you’ll get to work with ECMAScript.next’s generators (which can already be tried out on current Firefox versions).

Writing an evaluator

Let’s say you want to implement an interactive command line for JavaScript (such as [1]). On one hand, you would need to get the graphical user interface right: The user inputs JavaScript code, the command line evaluates the code and displays the result. On the other hand, you would have to implement the evaluation. That’s what we will take on here. It is more complex than it initially seems and teaches us a lot about eval. For starters, let’s write a constructor Evaluator:

    function Evaluator() {
    }
    Evaluator.prototype.evaluate = function (str) {
        return JSON.stringify(eval(str));
    };

To use the evaluator, we create an instance and send JavaScript code to it:

    > var e = new Evaluator();
    > e.evaluate("Math.pow(2, 53)")
    '9007199254740992'
    > e.evaluate("3 * 7")
    '21'
    > e.evaluate("'foo'+'bar'")
    '"foobar"'

JSON.stringify is used so that the evaluation results can be shown to the user and look like the input. Without stringify, things look as follows:

    > console.log(123) // OK
    123
    > console.log("abc") // not OK
    abc

With stringify, everything looks OK:

    > console.log(JSON.stringify(123))
    123
    > console.log(JSON.stringify("abc"))
    "abc"

Note that undefined is not valid JSON, but stringify converts it to undefined (the value, not the string), which is fine for our purposes. What we have implemented so far works for basic things, but still has several problems. Let’s tackle them one at a time.

Problem: declarations

You can evaluate variable and function declarations, but they are forgotten immediately afterwards:

    > e.evaluate("var x = 12;")
    undefined
    > e.evaluate("x")
    ReferenceError: x is not defined

How do we fix this?
The following code is a solution:

    function Evaluator() {
        this.env = {};
    }
    Evaluator.prototype.evaluate = function (str) {
        str = rewriteDeclarations(str);
        var __environment__ = this.env;  // (1)
        with (__environment__) {  // (2)
            return JSON.stringify(eval(str));
        }
    };
    function rewriteDeclarations(str) {
        // Prefix a newline so that search and replace is simpler
        str = "\n" + str;
        str = str.replace(/\nvar\s+(\w+)\s*=/g,
                          "\n__environment__.$1 =");  // (3)
        str = str.replace(/\nfunction\s+(\w+)/g,
                          "\n__environment__.$1 = function");
        return str.slice(1); // remove prefixed newline
    }

this.env holds all variable declarations and function declarations in its properties. We make it accessible to the input in two steps.

Step 1 – declare: We assign this.env to __environment__ (1) and rewrite the input so that, among other things, each var declaration assigns to __environment__ (3). That demonstrates one important aspect of eval: it sees all variables in surrounding scopes. That is, if you invoke eval inside your function, you expose all of its internals. The only way to keep those internals secret is to put the eval call in a separate function and call that function.

Step 2 – access: Use a with statement (2) so that the properties of __environment__ appear as variables to the eval-ed code. This is not an ideal solution, more of a compromise: with should be avoided [2] and can’t be used in the advantageous strict mode [3]. But it is a quick solution for us now. A work-around is quite complex [4].

    > var e = new Evaluator();
    > e.evaluate("var x = 123;")
    '123'
    > e.evaluate("x")
    '123'

Minor drawback: Normal var declarations have the result undefined; due to our rewriting we now get the value that is assigned to the variable.
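The remark in step 1 about hiding internals by moving the eval call into a separate function can be illustrated with a small sketch. The names leakyEvaluate, shieldedEvaluate, isolatedEval, and secret are invented for this illustration and are not part of the evaluator above.

```javascript
// Direct eval inside the function: the evaluated code sees the local variable.
function leakyEvaluate(str) {
  var secret = "internal state";
  return eval(str);
}

// Helper whose only local binding is its parameter str.
function isolatedEval(str) {
  return eval(str);
}

// Same local variable, but eval now runs in the helper's scope instead.
function shieldedEvaluate(str) {
  var secret = "internal state";
  return isolatedEval(str);
}

console.log(leakyEvaluate("typeof secret"));    // string
console.log(shieldedEvaluate("typeof secret")); // undefined
```

Note that the evaluated code can still see the helper's own parameter (str here) and all global variables, so this only hides the locals of the calling function.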
Problem: exceptions

Right now, throwing an exception in evaluate’s input means that the method will throw:

    > e.evaluate("* 3")
    SyntaxError: Unexpected token *

That is obviously unacceptable: In a graphical user interface, we want to report errors back to the user, not (invisibly) throw an exception. Here is one simple way of doing so:

    Evaluator.prototype.evaluate = function (str) {
        try {
            str = rewriteDeclarations(str);
            var __environment__ = this.env;
            with (__environment__) {
                return JSON.stringify(eval(str));
            }
        } catch (e) {
            return e.toString();
        }
    };

There is nothing surprising in this code; we simply use try-catch and report back what happened. More sophisticated solutions will want to do more, e.g. display the exception’s stack trace. The new evaluator in action:

    > var e = new Evaluator();
    > e.evaluate("* 3")
    'SyntaxError: Unexpected token *'

Problem: console.log

How do we handle calls to console.log in the input? Logged messages should be shown to the user, not be sent to the browser’s console. The solution is surprisingly easy:

    function Evaluator(cons) {
        this.env = {};
        this.cons = cons;
    }
    Evaluator.prototype.evaluate = function (str) {
        try {
            str = rewriteDeclarations(str);
            var __environment__ = this.env;
            var console = this.cons;  // (1)
            with (__environment__) {
                return JSON.stringify(eval(str));
            }
        } catch (e) {
            return e.toString();
        }
    };

The constructor now receives a custom implementation of console and assigns it to this.cons. By assigning that object to a local variable named console (1), we temporarily shadow the global console for eval; there is no need to replace it. Beware that this shadowing affects the whole function: you won’t be able to use the browser’s console anywhere in evaluate.
The new evaluator in action:

    > var cons = { log: function (m) { console.log("### "+m) } };
    > var e = new Evaluator(cons);
    > e.evaluate("console.log('hello')")
    ### hello
    undefined

Problem: eval creates bindings inside the function

One scary feature of eval is that it creates variable bindings inside the function that invokes it:

    > (function () { eval("var x=3"); return x }())
    3

Fortunately, the fix is easy: use strict mode.

    > (function () { "use strict"; eval("var x=3"); return x }())
    ReferenceError: x is not defined

You can’t use with in strict mode, so you’ll have to replace it with a work-around [4].

Keeping declarations in an environment

An environment is where JavaScript keeps the parameters and variables of a function. It maps variable names to values and is thus similar to an object. We might be able to avoid rewriting the input and manage declarations via environments. The idea is as follows. eval puts declarations in some environment:

- Non-strict mode: the environment of the surrounding function.
- Strict mode: a newly created environment.

What if we could reuse that environment for the next invocation of eval, instead of throwing it away? Then eval would properly remember prior declarations. Strict mode gives us no way to access the temporary environment it creates for each invocation. However, in non-strict mode, we might be able to keep the environment of the surrounding function around. The following subsections explore two ways of doing so.

Declarations via nested scopes

If you create a function g inside another function f, then g permanently retains a reference to f’s current environment envf. Whenever g is called, a new g-specific environment envg is created. But envg points to its parent environment envf. Variables that can’t be found in g’s scope (as managed via envg) are looked up in f’s scope (via envf). Thus, envf is not lost as long as g exists. That gives us a strategy for keeping the environment of the function that calls eval around.
In the following code, that function is called evalHelper and creates a new function that has to be used for the next call of eval. Hence, declarations made in the former function are accessible in the latter function.

    function Evaluator() {
        var that = this;
        that.evalHelper = function (str) {
            that.evalHelper = function (str) {
                return eval(str);
            };
            return eval(str);
        };
    }
    Evaluator.prototype.evaluate = function (str) {
        return this.evalHelper(str);
    };

The fatal problem of this implementation is that you cannot nest to arbitrary depth. But, for the above depth of 2, it works perfectly:

    > var e = new Evaluator();
    > e.evaluate("var x = 7;");
    undefined
    > e.evaluate("x * 3")
    21

Declarations via a generator

It would be great if we could “restart” the function that calls eval, re-entering it with its previous environment still in place. ECMAScript.next’s generators [5] let you do that. Current versions of Firefox already support generators. Here is a demonstration of how they work in these versions (in ECMAScript.next, you will have to write function*, but apart from that, the code is the same):

    function mygen() {
        console.log((yield 0) + " @ 0");
        console.log((yield 1) + " @ 1");
        console.log((yield 2) + " @ 2");
    }

The above is a generator function. Invoke it and it will create a generator object. On that object, you first need to invoke the next() method to start execution. A yield x inside the code pauses execution and returns x to the previously called generator object method. After the first next(), you can either call next() or send(y). The latter means that the currently paused yield will continue and produce the value y. The former is equivalent to send(undefined). The following interaction shows mygen in use:

    > var g = mygen();
    > g.next() // can't use send() the first time
    0
    > g.send("a") // continue after yield 0, pause again
    a @ 0
    1
    > g.send("b")
    b @ 1
    2

The following is an implementation of Evaluator that calls eval via the generator evalGenerator.
Because of that, eval always sees the same environment and remembers declarations.

    function evalGenerator(console) {
        var str = yield;
        while (true) {
            try {
                var result = JSON.stringify(eval(str));
                str = yield result;
            } catch (e) {
                str = yield e.toString();
            }
        }
    }
    function Evaluator(cons) {
        this.evalGen = evalGenerator(cons);
        this.evalGen.next(); // start
    }
    Evaluator.prototype.evaluate = function (str) {
        return this.evalGen.send(str);
    };

The new evaluator works as expected.

    > var e = new Evaluator();
    > e.evaluate("var x = 7;")
    undefined
    > e.evaluate("x * 2")
    "14"
    > e.evaluate("* syntax_error")
    "SyntaxError: missing ; before statement"

The biggest problem with this solution is that it combines a deprecated feature (non-strict eval) with a new feature (generators). There will probably be a way in ECMAScript.next to make this combination work, but it will be a hack and should thus be avoided.

Conclusion

We have used eval to implement a helper type for a command line. While doing so, we learned a few interesting things about eval: Letting it remember declarations between invocations is complicated; it can access all variables in the scopes surrounding its invocation; and in non-strict mode, it can even create new variables inside the invoking function. The best solution for remembering declarations would be for eval to have an optional parameter for an environment (to be reused), but that is not in the cards. Therefore, the only truly safe solution in pure JavaScript is to use a full-featured JavaScript parser such as esprima to rewrite critical parts of the input code. That is left as an exercise to the reader.

References

[1] Combining code editing with a command line
[2] JavaScript’s with statement and why it’s deprecated
[3] JavaScript’s strict mode: a summary
[4] Handing variables to eval
[5] Asynchronous programming and continuation-passing style in JavaScript
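The generator-based evaluator in this article relies on the pre-standard Firefox API (a plain function containing yield, plus send()). In the syntax that was later standardized, generator functions are written function* and a value is passed into the paused yield via next(value), which returns an object with value and done properties. The following is a minimal sketch of the same idea under that assumption; the name StandardEvaluator is made up, and remembering var declarations still depends on the code running in non-strict mode.

```javascript
// Generator that keeps a single function environment alive across evaluations.
function* evalGenerator() {
  var str = yield; // the first next() call runs up to this yield
  while (true) {
    try {
      // Hand the result out, then wait for the next input string.
      str = yield JSON.stringify(eval(str));
    } catch (e) {
      // Report errors as strings instead of throwing.
      str = yield e.toString();
    }
  }
}

function StandardEvaluator() {
  this.evalGen = evalGenerator();
  this.evalGen.next(); // start: advance to the first yield
}
StandardEvaluator.prototype.evaluate = function (str) {
  // next(value) resumes the paused yield with value and
  // returns { value, done }; we only need value.
  return this.evalGen.next(str).value;
};

var ev = new StandardEvaluator();
console.log(ev.evaluate("3 * 7")); // 21
console.log(ev.evaluate("* 3"));   // a SyntaxError message; wording varies by engine
```

In non-strict code, a var declared through evaluate("var x = 7;") stays visible to later calls because eval keeps adding bindings to the same suspended generator environment. In strict mode (including ES modules) each eval call gets its own environment, so declarations are forgotten again.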

July 27, 2012

by Axel Rauschmayer

What is ActiveMQ?

Although the ActiveMQ website already gives a pithy, to-the-point explanation of ActiveMQ, I would like to add some more context to their definition. From the ActiveMQ project’s website: “ActiveMQ is an open sourced implementation of JMS 1.1 as part of the J2EE 1.4 specification.”

Here’s my take: ActiveMQ is open-source messaging software which can serve as the backbone for an architecture of distributed applications built upon messaging. The creators of ActiveMQ were driven to create this open-source project for two main reasons:

- The existing solutions available at the time were proprietary and very expensive.
- Developers with the Apache Software Foundation were working on a fully J2EE-compliant application server (Geronimo), and they needed a JMS solution that had a license compatible with Apache’s licensing.

Since its inception, ActiveMQ has turned into a strong competitor of the commercial alternatives, such as WebSphereMQ, EMS/TIBCO, and SonicMQ, and is deployed in production in some of the top companies in industries ranging from financial services to retail.
Using messaging as an integration or communication style leads to many benefits, such as:

- Allowing applications built with different languages and on different operating systems to integrate with each other
- Location transparency: client applications don’t need to know where the service applications are located
- Reliable communication: the producers/consumers of messages don’t have to be available at the same time, and certain segments along the route of the message can go down and come back up without impacting the message getting to the service/consumer
- Scaling: can scale horizontally by adding more services that can handle the messages if too many messages are arriving
- Asynchronous communication: a client can fire a message and continue other processing instead of blocking until the service has sent a response; it can handle the response message only when the message is ready
- Reduced coupling: the assumptions made by the clients and services are greatly reduced as a result of the previous 5 benefits. A service can change details about itself, including its location, protocol, and availability, without affecting or disrupting the client.

Please see Gregor Hohpe’s description of messaging or the book he and Bobby Woolf wrote about messaging-based enterprise application integration. There are other advantages as well (hopefully someone can add other benefits or drawbacks in the comments), and ActiveMQ is free, open-source software that can facilitate delivering those advantages and has proven to be highly reliable and scalable in production environments.

July 21, 2012

by Christian Posta

How Many Java developers are There in the World?

Oracle says it’s 9,000,000. Wikipedia claims it’s 10,000,000. And the guys from NumberOf.net seem to be the most precise - they know that there are exactly 9,007,346 Java developers out there. Nice numbers. I have used those articles as reference points while speaking about the potential market size for our memory leak detection tool. But something in these numbers has bothered me for years - there is no trustworthy and public analysis behind them. They are just conjured up from thin air. So I finally thought I would do something about it and try to figure it out for good.

It proved to be a challenging task. After all - with more than seven billion people on our planet I couldn't call everyone and ask them. Well, maybe I could, but if every call took 20 seconds on average, I would need at least 4,439 years to complete the survey. And that is if I did not sleep, eat, or rest. So I had to use other ways for estimation. After playing around with different sources of information, I decided to dig into four of them for a closer look:

- Labour statistics provided by different governments
- Language popularity sites such as Tiobe and Langpop
- Employment portals, namely Indeed.com and Monster.com
- Download numbers for popular Java tools and libraries - namely Eclipse and Tomcat.

Using that information I wanted to estimate the number using three different calculations - based on language popularity indexes, labour statistics, and download figures. So, here we go.

How many programmers could there be in total?

World population is currently above seven billion. Out of those seven billion we can leave out sub-Saharan Africa (900M) and rural Asia (about 50% of its 2.2B population) as negligible. This leaves us with approximately 5 billion people living in regions where the overall economic and cultural background can be considered suitable for software industries to spawn. Now, out of those 5,000,000,000, how many could actually be developing software?
A good answer at StackExchange gives us some pointers as to where we can find information on the percentage of software developers in different countries. Using the US, Japan, Canada, the EU27, and the UK as a baseline, we can estimate that 0.86% of the population is employed as a software developer or programmer:

    Country   Population     Developers   %
    Canada     33,476,688      387,000    1.16%
    EU27      502,486,499    5,900,000    1.17%
    Japan     127,799,000    1,016,929    0.80%
    UK         63,162,000      333,000    0.53%
    US        313,931,000    1,336,300    0.43%
    Weighted average:                     0.86%

0.86% out of five billion is 43,000,000. Let's remember this number, as it will be used as a baseline in the following calculations.

Popularity contests

In the popularity contest we will use two sources of data - the TIOBE index and the Langpop one. Other sources such as the Dataist figures were hard to interpret, so we’ll stick to just those two. For background - the TIOBE ratings are calculated by counting hits on the most popular search engines. The search query that is used is +"<language> programming", e.g. +"Java programming" in our case. Langpop uses more sources for input besides search engine queries - in equal weights it tracks open job positions, book titles, search engine results, the number of open source projects, and other data to calculate its popularity score.

Simplifying the TIOBE and Langpop results, we can conclude that according to TIOBE 17% and according to Langpop ~15% of the programmers in the world are using Java. Averaging those numbers we can say that around 16% of the 43,000,000 developers in the world use Java. This translates to 6,880,000 Java developers out there.

Job portals

Job portals, especially when considering both available positions and uploaded resumes, are definitely a good source of information. The larger ones also provide nice reports on the labour market, which we will dig into next.
Note that we used Indeed.com and Monster.com - if you can point us towards more and/or better sources of information, we would be glad to correct our calculations. But using this analysis from Monster.com and the aggregated statistics from Indeed.com we can say that ~18% of Monster.com applicants can program in Java and ~16% of the open engineering / programming positions scanned by Indeed.com are looking for Java talent. Averaging those numbers we arrive at 17%, which out of 43,000,000 programmers in total would translate to 7,310,000 Java guys and girls in the world.

Software downloads

Every Java developer uses something to build the application. Well, we expect them to use at least a JVM and a compiler. If you happen to know anyone who can get away without those two, please let us know. We would hire them immediately.

But most of us tend to use more than just a compiler and a virtual machine. We use IDEs, application servers, build tools, etc. So we figured that we would look into the publicly available download numbers of these tools and try to estimate the number of developers from them. When calculating the total number of developers from the estimated number of users, we take into account the market share of the corresponding software. To estimate the market share we use ZeroTurnaround’s statistics gathered in the spring of 2012.

Eclipse downloads. Eclipse Juno was released on June 27 and has been downloaded 1,200,000 times during the first 20 days. Looking into the historical data published by eclipse.org we can predict that Juno will be downloaded approximately 8,000,000 times in total. The last four major Eclipse releases have all followed a yearly release calendar, and all of them took place in June:

- Juno - 8,000,000 (in a year, expecting the trend to continue; currently 1,200,000 downloads in the first 20 days)
- Indigo - 6,000,000 downloads
- Helios - 4,100,000 downloads
- Galileo - 2,200,000 downloads

Averaging the Juno estimate and the Indigo result, we can say that Eclipse is downloaded approximately 7,000,000 times a year. Using ZeroTurnaround’s statistics, we expect 68% of Java developers to use Eclipse as their (primary) IDE. If we now make the bold claim that each Java developer on Eclipse will download the IDE exactly once a year, expect the number of downloads per year to be 7,000,000 and consider that 32% of Java developers do not use Eclipse at all, we come to the conclusion that there should be 10,300,000 Java developers in total (7,000,000 / 0.68).

Apache Tomcat downloads. Vadim Gritsenko has put together some nice statistics on top of the Apache logs. From there we can see that during the last year Tomcat has been downloaded approximately 550,000 times/month. This gives us a yearly total of 6,600,000 Tomcat downloads. Applying statistics from the same report used for calculating Eclipse’s market share, we can estimate that 59% of Java developers are using Tomcat as one of their development platforms. If we again make the bold claim that each Java developer on Tomcat will download every major release exactly once and consider that 41% of Java developers do not use Tomcat, we reach the conclusion that there should be 11,186,000 Java developers out there (6,600,000 / 0.59).

Averaging the numbers from Eclipse and Tomcat downloads, we end up with 10,743,000 Java developers.

Conclusions

We used three different sources for estimation - popularity contests, job market analysis and download numbers of popular Java development infrastructure products. The numbers varied quite a bit - from 6,880,000 to 10,743,000. Aggressively averaging the three numbers we can conclude that there are 8,311,000 Java developers out there. Not quite as many as Oracle or Wikipedia think, but still enough to build a business that provides development tools for the Java community.

Lies. Damn lies. And statistics.
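The arithmetic behind the three estimates can be replayed in a few lines. This is only a sketch: the class and method names are made up for illustration, and the inputs are the figures quoted in the article. It does not round the intermediate results, so the download-based figures come out slightly below the rounded 10,300,000 and 10,743,000 used in the text, and the overall average lands at roughly 8.31 million.

```java
class JavaDeveloperEstimate {
    // Baseline from the article: 0.86% of roughly five billion people.
    static final long TOTAL_DEVELOPERS = 43_000_000L;

    /** Developers implied by a language share, e.g. 0.16 for 16%. */
    static long fromShare(double javaShare) {
        return Math.round(TOTAL_DEVELOPERS * javaShare);
    }

    /** Developers implied by yearly downloads of a tool with the given market share. */
    static long fromDownloads(long downloadsPerYear, double marketShare) {
        return Math.round(downloadsPerYear / marketShare);
    }

    public static void main(String[] args) {
        long popularity = fromShare(0.16);               // TIOBE / Langpop average
        long jobPortals = fromShare(0.17);               // Monster.com / Indeed.com average
        long eclipse = fromDownloads(7_000_000L, 0.68);  // Eclipse, 68% market share
        long tomcat = fromDownloads(6_600_000L, 0.59);   // Tomcat, 59% market share
        long downloads = (eclipse + tomcat) / 2;
        long overall = (popularity + jobPortals + downloads) / 3;

        System.out.println("Popularity contests: " + popularity);
        System.out.println("Job portals:         " + jobPortals);
        System.out.println("Downloads:           " + downloads);
        System.out.println("Overall average:     " + overall);
    }
}
```

Changing any single input (for example the 0.86% baseline) shows quickly how sensitive the final number is to each assumption.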

July 20, 2012

by Nikita Salnikov-Tarnovski

· 24,419 Views

Replacing Query String Elements in C# .NET and JavaScript

While writing list navigation and search features in websites today there is a constant need to find/replace and play with query string elements, so that you can easily manipulate these mystical items while you’re carrying them around in your website’s URLs. I have a few little methods I’ve used over the years and carry with me from project to project, and this post is putting them on the record for easy access later.

I have a secret. This post is actually more aimed at an audience of 'myself', and my ability to have an easy bit of source code to call upon when I’m on the go looking for a quick solution to cut and paste – as most of my blog posts are. But you, dear reader, you get to share in this benefit with me by pulling from the awesomeness within this post as well.

Solution: .NET C#

When doing this with C# you have a few pretty cool features up your sleeve. One of these is the HttpUtility.ParseQueryString(urlPath) framework method. This static method allows you to extract an editable NameValueCollection from a given query string. Why is this cool? Because it allows you to very easily play with the query string collection like it is any other NameValueCollection – with Add() and Remove() methods. This makes it incredibly powerful.

Quick & dirty code beware! The code I’m pasting below is far from being the most elegant solution. I seem to have misplaced my nicer piece of code and am in too much of a rush to find it right now (sorry). Until I find my nicer solution, the method below will get you by – whether you have a hatred for ternaries or not.

    // Requires: using System; using System.Web;  (HttpUtility lives in System.Web)
    public static string ReplaceQueryStringParam(string currentPageUrl, string paramToReplace, string newValue)
    {
        // Split the URL into the part before '?' and the query string itself.
        int queryStart = currentPageUrl.IndexOf('?');
        string urlWithoutQuery = queryStart >= 0 ? currentPageUrl.Substring(0, queryStart) : currentPageUrl;
        string queryString = queryStart >= 0 ? currentPageUrl.Substring(queryStart) : string.Empty;

        // ParseQueryString returns an editable NameValueCollection.
        var queryParamList = HttpUtility.ParseQueryString(queryString);

        // The indexer replaces an existing value, or adds the key when it is missing.
        queryParamList[paramToReplace] = newValue;

        return String.Format("{0}?{1}", urlWithoutQuery, queryParamList);
    }

To call this, you can do the following:

    // var currentUrl = HttpContext.Current.Request.Url;
    var currentUrl = "http://www.mysite.com/mypage?category=cool-products&sort=price&page=3";
    // change my sort-by param named "sort" to "name"
    var newUrlWithChangedSort = ReplaceQueryStringParam(currentUrl, "sort", "name");

Solution: JavaScript

The second part of this post includes a JavaScript solution, as you never know when you have to do this on the client side.

    function replaceQueryString(url, param, value) {
        // Make sure the URL has a '?' to append to.
        if (url.lastIndexOf('?') <= 0) url = url + "?";

        // Match "param=..." preceded by ? or & and followed by & or the end of the string.
        var re = new RegExp("([?&])" + param + "=.*?(&|$)", "i");
        if (url.match(re))
            return url.replace(re, '$1' + param + "=" + value + '$2');

        // Not present yet: append it, adding '&' unless the URL already ends with '?'.
        return url.substring(url.length - 1) == '?'
            ? url + param + "=" + value
            : url + '&' + param + "=" + value;
    }

And to use the above code in your client-side JavaScript simply write something along the lines of:

    // var currentUrl = self.location;
    var currentUrl = "http://www.mysite.com/mypage?category=cool-products&sort=price&page=3";
    // change my sort-by param named "sort" to "name"
    var newUrlWithChangedSort = replaceQueryString(currentUrl, "sort", "name");

Easy – now next time you need to knock something together, instead of writing it yourself, you can simply cut & paste mine!

July 20, 2012

by Douglas Rathbone

· 16,500 Views

How Changing Java Package Names Transformed my System Architecture

Changing your perspective even a small amount can have profound effects on how you approach your system.

Let’s say you’re writing a web application in Java. In the system you deal with orders, customers and products. As a web application, your classes include staples like PersonController, PersonRepository, CustomerController and OrderService. How do you organize your classes into packages?

There are two fundamental ways to structure your packages. Either you can focus on the logical tiers, like com.brodwall.myapp.controllers, com.brodwall.myapp.domain or perhaps com.brodwall.myapp.services.customer. Or you can focus on the domain contexts, like com.brodwall.myapp.customer, com.brodwall.myapp.orders and com.brodwall.myapp.products. The first approach is by far the most prevalent. In my view, it’s also the least helpful.

Here are some ways your thinking changes if you structure your packages around domain concepts, rather than technological tiers:

First, and most fundamentally, your mental model will now be aligned with that of the users of your system. If you’re asked to implement a typical feature, it is now more likely to be focused around a strict subset of the packages of your system. For example, adding a new field to a form will at least affect the presentation logic, entity and persistence layer for the corresponding domain concept. If your packages are organized around tiers, this change will hit all over your system. In a word: a system organized around features, rather than technologies, has higher cohesion. This technical term means that a large percentage of the dependencies of a class are located close to that class.

Secondly, organizing around domain concepts will give you more options when your software grows. When a package contains tens of classes, you may want to split it up into several packages. The discussion can itself be enlightening.
“Maybe we should separate out the customer address classes into a com.brodwall.myapp.customer.address package. It seems to have a bit of a life of its own.” “Yeah, and maybe we can use the same classes for other places we need addresses, such as suppliers?” “Cool, so com.brodwall.myapp.address, then?” Or maybe you decide that order status codes and payment status codes deserve to be in the “com.brodwall.myapp.order.codes” package.

On the other hand, what options do you have for splitting up com.brodwall.myapp.controllers? You could create subpackages for customer, orders and products, but these subpackages may only have one or possibly two classes each.

Finally, and perhaps most intriguingly, using domain concepts for packages allows you to vary the design on a case by case basis. Maybe you really need an OrderService which coordinates the payment and shipping of an order, while ProductController only needs basic create-retrieve-update-delete functionality with a repository. A ProductService would just get in the way. If ProductService is missing from the com.brodwall.myapp.services package, this may be confusing or at the very least give you a nagging feeling that something is wrong. On the other hand, if there’s no Service in the com.brodwall.myapp.product package, it doesn’t matter much.

Also, most systems have some good parts and some not-so-good parts. If your Services package is not working for you, there’s not much you can do. But if the Products package is rotten, you can throw it out and reimplement it without the whole system being thrown into a state of chaos. By putting the classes needed to implement a feature together with each other and apart from the classes needed to implement other features, developers can be pragmatic and innovative when developing one feature without negatively affecting other features.
The flip side of this is that most developers are more comfortable with some technologies in the application and less comfortable with others. Organizing around features instead of technologies forces each developer to consider a larger set of technological challenges. Some programmers take this as a motivating challenge to learn, while others, it seems, would rather not have to learn something new. If it were my money being spent to create features, I know what kind of developer I would want.

Trivial changes can have large effects. By organizing your software around features, you get a more cohesive system that allows for growth. It may challenge your developers, but it drives down the number of hand-offs needed to implement a feature and it challenges the developers to improve the parts of the application they are working on.

See also my blog post on Architecture as tidying up.
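As a rough illustration of the feature-oriented layout, the sketch below shows what a com.brodwall.myapp.product package might contain. Everything in it is hypothetical: the Product fields, the in-memory ProductRepository and the method names are invented for the example, and the package declaration is left out so that the snippet compiles as a single file. The point is only that the repository can stay package-private and that no ProductService is needed for plain create-retrieve-update-delete work.

```java
import java.util.HashMap;
import java.util.Map;

// In a real project each type below would sit in its own file under
// package com.brodwall.myapp.product; the package line is omitted here
// so the sketch compiles as one file.

/** Domain entity of the product feature (hypothetical fields). */
class Product {
    final long id;
    final String name;

    Product(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

/** Package-private persistence: code outside the feature package cannot reach it. */
class ProductRepository {
    private final Map<Long, Product> store = new HashMap<>();

    void save(Product product) {
        store.put(product.id, product);
    }

    Product findById(long id) {
        return store.get(id);
    }
}

/** Entry point of the feature; talks to the repository directly, no ProductService. */
class ProductController {
    private final ProductRepository repository = new ProductRepository();

    void create(long id, String name) {
        repository.save(new Product(id, name));
    }

    String nameOf(long id) {
        Product product = repository.findById(id);
        return product == null ? null : product.name;
    }
}
```

If the product feature later needed coordination logic, a service class could be added inside the same package without touching the orders or customer packages.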

July 20, 2012

by Johannes Brodwall

· 17,418 Views

Spring Data - Apache Hadoop

Spring for Apache Hadoop is a Spring project that supports writing applications which benefit from the integration of the Spring Framework and Hadoop. This post describes how to use Spring Data Apache Hadoop in an Amazon EC2 environment using the “Hello World” equivalent of Hadoop programming – a Wordcount application.

1./ Launch an Amazon Web Services EC2 instance.

- Navigate to the AWS EC2 Console (“https://console.aws.amazon.com/ec2/home”)
- Select Launch Instance, then Classic Wizard, and click on Continue.

My test environment was a “Basic Amazon Linux AMI 2011.09” 32-bit, instance type: Micro (t1.micro, 613 MB), security group quick-start-1, which enables ssh to be used for login. Select your existing key pair (or create a new one). Obviously you can select another AMI and instance type depending on your favourite flavour. (Should you opt for a Windows 2008 based instance, you also need to have cygwin installed as an additional Hadoop prerequisite besides the Java JDK and ssh, see the “Install Apache Hadoop” section.)

2./ Download Apache Hadoop

As of writing this article, 1.0.0 is the latest stable version of Apache Hadoop, and that is what was used for testing purposes. I downloaded hadoop-1.0.0.tar.gz and copied it into the /home/ec2-user directory using the pscp command from my PC running Windows:

    c:\downloads>pscp -i mykey.ppk hadoop-1.0.0.tar.gz ec2-user@ec2-ipaddress-region-compute.amazonaws.com:/home/ec2-user

(the computer name above – ec2-ipaddress-region-compute.amazonaws.com – can be found on the AWS EC2 console, Instance Description, public DNS field)

3./ Install Apache Hadoop

As prerequisites, you need to have Java JDK 1.6 and ssh installed, see the Apache Single-Node Setup Guide. (ssh is automatically installed with the Basic Amazon AMI.)
Then install Hadoop itself:

    $ cd ~     # change directory to ec2-user home (/home/ec2-user)
    $ tar xvzf hadoop-1.0.0.tar.gz
    $ ln -s hadoop-1.0.0 hadoop
    $ cd hadoop/conf
    $ vi hadoop-env.sh     # edit as below
    export JAVA_HOME=/opt/jdk1.6.0_29

    $ vi core-site.xml     # edit as below – this defines the namenode to be running on localhost and listening on port 9000
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    $ vi hdfs-site.xml     # edit as below – this sets the file system replication factor to 1 (in a production environment it is 3 by default)
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

    $ vi mapred-site.xml     # edit as below – this defines the jobtracker to be running on localhost and listening on port 9001
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>

    $ cd ~/hadoop
    $ bin/hadoop namenode -format
    $ bin/start-all.sh

At this stage all Hadoop daemons are running in pseudo-distributed mode. You can verify it by running:

    $ ps -ef | grep java

You should see 5 java processes: namenode, secondarynamenode, datanode, jobtracker and tasktracker.

4./ Install Spring Data Hadoop

Download the Spring Data Hadoop package from the SpringSource community download site. As of writing this article, the latest stable version is spring-data-hadoop-1.0.0.M1.zip.

    $ cd ~
    $ unzip spring-data-hadoop-1.0.0.M1.zip
    $ ln -s spring-data-hadoop-1.0.0.M1 spring-data-hadoop

5./ Build and Run the Spring Data Hadoop Wordcount example

    $ cd spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount

Spring Data Hadoop uses gradle as its build tool. Check the build.gradle build file. The original version in the downloaded package does not compile; it complains about thrift, version 0.2.0, and jdo2-api, version 2.3-ec. Add the datanucleus.org maven repository to the build.gradle file to support jdo2-api (http://www.datanucleus.org/downloads/maven2/). Unfortunately, there seems to be no maven repo for thrift 0.2.0. You should download the thrift-0.2.0.jar and thrift-0.2.0.pom files, e.g. from this repo: “http://people.apache.org/~rawson/repo”, and then add them to the local maven repo.
    $ mvn install:install-file -DgroupId=org.apache.thrift -DartifactId=thrift -Dversion=0.2.0 -Dfile=thrift-0.2.0.jar -Dpackaging=jar
    $ vi build.gradle     # modify the build file to refer to the datanucleus maven repo for jdo2-api and the local repo for thrift

    repositories {
        // Public Spring artefacts
        mavenCentral()
        maven { url "http://repo.springsource.org/libs-release" }
        maven { url "http://repo.springsource.org/libs-milestone" }
        maven { url "http://repo.springsource.org/libs-snapshot" }
        maven { url "http://www.datanucleus.org/downloads/maven2/" }
        maven { url "file:///home/ec2-user/.m2/repository" }
    }

I also modified the META-INF/spring/context.xml file in order to run the Hadoop file system commands manually:

    $ cd /home/ec2-user/spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount/src/main/resources
    $ vi META-INF/spring/context.xml     # remove clean-script and also the dependency on it for JobRunner

    <beans xmlns="http://www.springframework.org/schema/beans"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:context="http://www.springframework.org/schema/context"
        xmlns:hdp="http://www.springframework.org/schema/hadoop"
        xmlns:p="http://www.springframework.org/schema/p"
        xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
            http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
            http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

        <hdp:configuration>
            fs.default.name=${hd.fs}
        </hdp:configuration>

        <!-- the job and JobRunner definitions (with the clean-script dependency removed) are not reproduced here -->
    </beans>

Copy the sample file – nietzsche-chapter-1.txt – to the Hadoop file system (/user/ec2-user/input directory):

    $ cd src/main/resources/data
    $ hadoop fs -mkdir /user/ec2-user/input
    $ hadoop fs -put nietzsche-chapter-1.txt /user/ec2-user/input/data
    $ cd ../../../..
    # go back to samples/wordcount directory
    $ ../gradlew

Verify the result:

    $ hadoop fs -cat /user/ec2-user/output/part-r-00000 | more
    “AWAY 1
    “BY 1
    “Beyond 1
    “By 2
    “Cheers 1
    “DE 1
    “Everywhere 1
    “FROM” 1
    “Flatterers 1
    “Freedom 1
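The shape of that output, where tokens such as “By keep their opening quote, is what one would expect if the job simply splits the text on whitespace and counts each token. The following plain-Java sketch illustrates that idea only. It does not use the Hadoop or Spring Data Hadoop APIs, the class name is made up, and whitespace tokenization is an assumption inferred from the output above.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountSketch {
    /** Counts whitespace-separated tokens; punctuation stays attached to the token. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like the job output
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Quotes are not stripped, so "By (with its leading quote) is counted as its own word.
        Map<String, Integer> counts = count("\"By night\" and \"By day\"");
        counts.forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

In the real job the map phase emits one (token, 1) pair per token and the reduce phase sums them per key, which is what the merge call collapses into a single step here.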

July 19, 2012

by Istvan Szegedi

· 11,895 Views

How to Resolve java.lang.NoClassDefFoundError: Part 3

This article is part 3 of our NoClassDefFoundError troubleshooting series. As I mentioned in my first article, there are many possible issues that can lead to a NoClassDefFoundError. This article will focus on and describe one of the most common causes of this problem: failure of a Java class static initializer block or variable.

A sample Java program will be provided and I encourage you to compile and run this example from your workstation in order to properly replicate and understand this type of NoClassDefFoundError problem.

Java static initializer revisited

The Java programming language provides you with the capability to “statically” initialize variables or a block of code. This is achieved via the “static” modifier on a variable or the usage of a static {} block in the body of a Java class. Static initializers are guaranteed to be executed only once in the JVM life cycle and are Thread safe by design, which makes their usage quite appealing for static data initialization such as internal object caches, loggers etc.

What is the problem? I will repeat again: static initializers are guaranteed to be executed only once in the JVM life cycle… This means that such code is executed when the class is first loaded and initialized and is never executed again until you restart your JVM. Now what happens if the code executed at that time (@Class loading time) terminates with an unhandled Exception? Welcome to the java.lang.NoClassDefFoundError problem case #2!

NoClassDefFoundError problem case 2 – static initializer failure

This type of problem occurs following the failure of static initializer code combined with successive attempts to create a new instance of the affected (non-initialized) class.
Sample Java program

The following simple Java program is split as per below:

- The main Java program NoClassDefFoundErrorSimulator
- The affected Java class ClassA
- ClassA provides you with an ON/OFF switch allowing you to replicate the type of problem that you want to study

This program is simply attempting to create a new instance of ClassA 3 times (one after the other). It will demonstrate that an initial failure of either a static variable or static block initializer, combined with successive attempts to create a new instance of the affected class, triggers java.lang.NoClassDefFoundError.

#### NoClassDefFoundErrorSimulator.java

    package org.ph.javaee.tools.jdk7.training2;

    /**
     * NoClassDefFoundErrorSimulator
     * @author Pierre-Hugues Charbonneau
     *
     */
    public class NoClassDefFoundErrorSimulator {

        /**
         * @param args
         */
        public static void main(String[] args) {
            System.out.println("java.lang.NoClassDefFoundError Simulator - Training 2");
            System.out.println("Author: Pierre-Hugues Charbonneau");
            System.out.println("http://javaeesupportpatterns.blogspot.com\n\n");

            try {
                // Create a new instance of ClassA (attempt #1)
                System.out.println("FIRST attempt to create a new instance of ClassA...\n");
                ClassA classA = new ClassA();
            } catch (Throwable any) {
                any.printStackTrace();
            }

            try {
                // Create a new instance of ClassA (attempt #2)
                System.out.println("\nSECOND attempt to create a new instance of ClassA...\n");
                ClassA classA = new ClassA();
            } catch (Throwable any) {
                any.printStackTrace();
            }

            try {
                // Create a new instance of ClassA (attempt #3)
                System.out.println("\nTHIRD attempt to create a new instance of ClassA...\n");
                ClassA classA = new ClassA();
            } catch (Throwable any) {
                any.printStackTrace();
            }

            System.out.println("\n\ndone!");
        }
    }

#### ClassA.java

    package org.ph.javaee.tools.jdk7.training2;

    /**
     * ClassA
     * @author Pierre-Hugues Charbonneau
     *
     */
    public class ClassA {

        private final static String CLAZZ = ClassA.class.getName();

        // Problem replication switch ON/OFF
        private final static boolean REPLICATE_PROBLEM1 = true;  // static variable initializer
        private final static boolean REPLICATE_PROBLEM2 = false; // static block{} initializer

        // Static variable executed at Class loading time
        private static String staticVariable = initStaticVariable();

        // Static initializer block executed at Class loading time
        static {
            // Static block code execution...
            if (REPLICATE_PROBLEM2) throw new IllegalStateException("ClassA.static{}: Internal Error!");
        }

        public ClassA() {
            System.out.println("Creating a new instance of " + ClassA.class.getName() + "...");
        }

        /**
         *
         * @return
         */
        private static String initStaticVariable() {
            String stringData = "";
            if (REPLICATE_PROBLEM1) throw new IllegalStateException("ClassA.initStaticVariable(): Internal Error!");
            return stringData;
        }
    }

Problem reproduction

In order to replicate the problem, we will simply “voluntarily” trigger a failure of the static initializer code. Please simply enable the problem type that you want to study, e.g. either static variable or static block initializer failure:

    // Problem replication switch ON (true) / OFF (false)
    private final static boolean REPLICATE_PROBLEM1 = true;  // static variable initializer
    private final static boolean REPLICATE_PROBLEM2 = false; // static block{} initializer

Now, let’s run the program with both switches at OFF (both boolean values at false).

## Baseline (normal execution)

    java.lang.NoClassDefFoundError Simulator - Training 2
    Author: Pierre-Hugues Charbonneau
    http://javaeesupportpatterns.blogspot.com

    FIRST attempt to create a new instance of ClassA...

    Creating a new instance of org.ph.javaee.tools.jdk7.training2.ClassA...

    SECOND attempt to create a new instance of ClassA...

    Creating a new instance of org.ph.javaee.tools.jdk7.training2.ClassA...

    THIRD attempt to create a new instance of ClassA...

    Creating a new instance of org.ph.javaee.tools.jdk7.training2.ClassA...

    done!
For the initial run (baseline), the main program was able to create 3 instances of ClassA successfully with no problem.

## Problem reproduction run (static variable initializer failure)

    java.lang.NoClassDefFoundError Simulator - Training 2
    Author: Pierre-Hugues Charbonneau
    http://javaeesupportpatterns.blogspot.com

    FIRST attempt to create a new instance of ClassA...

    java.lang.ExceptionInInitializerError
        at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:21)
    Caused by: java.lang.IllegalStateException: ClassA.initStaticVariable(): Internal Error!
        at org.ph.javaee.tools.jdk7.training2.ClassA.initStaticVariable(ClassA.java:37)
        at org.ph.javaee.tools.jdk7.training2.ClassA.<clinit>(ClassA.java:16)
        ... 1 more

    SECOND attempt to create a new instance of ClassA...

    java.lang.NoClassDefFoundError: Could not initialize class org.ph.javaee.tools.jdk7.training2.ClassA
        at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:30)

    THIRD attempt to create a new instance of ClassA...

    java.lang.NoClassDefFoundError: Could not initialize class org.ph.javaee.tools.jdk7.training2.ClassA
        at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:39)

    done!

## Problem reproduction run (static block initializer failure)

    java.lang.NoClassDefFoundError Simulator - Training 2
    Author: Pierre-Hugues Charbonneau
    http://javaeesupportpatterns.blogspot.com

    FIRST attempt to create a new instance of ClassA...

    java.lang.ExceptionInInitializerError
        at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:21)
    Caused by: java.lang.IllegalStateException: ClassA.static{}: Internal Error!
        at org.ph.javaee.tools.jdk7.training2.ClassA.<clinit>(ClassA.java:22)
        ... 1 more

    SECOND attempt to create a new instance of ClassA...

    java.lang.NoClassDefFoundError: Could not initialize class org.ph.javaee.tools.jdk7.training2.ClassA
        at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:30)

    THIRD attempt to create a new instance of ClassA...

    java.lang.NoClassDefFoundError: Could not initialize class org.ph.javaee.tools.jdk7.training2.ClassA
        at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:39)

    done!

What happened?

As you can see, the first attempt to create a new instance of ClassA did trigger a java.lang.ExceptionInInitializerError. This exception indicates the failure of our static initializer (static variable or static block), which is exactly what we wanted to achieve.

The key point to understand is that this failure prevented ClassA from being initialized, which leaves the class unusable. As you can see, attempt #2 and attempt #3 both generated a java.lang.NoClassDefFoundError. Why? Since the first attempt failed, the JVM marked ClassA as failed within the current ClassLoader and does not run the static initializer again. Successive attempts to create a new instance of ClassA within that ClassLoader therefore generate java.lang.NoClassDefFoundError (“Could not initialize class”) over and over.

As you can see, in this problem context, the NoClassDefFoundError is just a symptom or consequence of another problem. The original problem is the ExceptionInInitializerError triggered following the failure of the static initializer code. This clearly demonstrates the importance of proper error handling and logging when using Java static initializers.
Recommendations and resolution strategies

Now find below my recommendations and resolution strategies for NoClassDefFoundError problem case 2:

- Review the java.lang.NoClassDefFoundError error and identify the missing Java class
- Perform a code walkthrough of the affected class and determine if it contains static initializer code (variables & static block)
- Review your server and application logs and determine if any error (e.g. ExceptionInInitializerError) originates from the static initializer code
- Once confirmed, analyze the code further and determine the root cause of the initializer code failure. You may need to add some extra logging along with proper error handling to prevent and better handle future failures of your static initializer code

Please feel free to post any question or comment. Part 4 will start coverage of NoClassDefFoundError problems related to class loader problems.
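The last recommendation, extra logging plus proper error handling in the static initializer, could look roughly like the sketch below. It is one possible approach under stated assumptions, not the author's code: the class name SafeStaticInit, the loadConfig() helper, the "default" fallback value and the use of java.util.logging are all hypothetical choices made for illustration. The idea is that the initializer catches the failure, logs the root cause once, and leaves the class in a usable state so later callers do not hit NoClassDefFoundError.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class SafeStaticInit {
    private static final Logger LOG = Logger.getLogger(SafeStaticInit.class.getName());

    private static final String CONFIG;          // value used by the rest of the class
    private static final Throwable INIT_FAILURE; // root cause kept for diagnostics, or null

    static {
        String value;
        Throwable failure = null;
        try {
            value = loadConfig();
        } catch (RuntimeException e) {
            // Log the real root cause here, where it happens, and fall back to a
            // default instead of letting the exception escape the initializer.
            LOG.log(Level.SEVERE, "Static initialization failed, using fallback value", e);
            value = "default";
            failure = e;
        }
        CONFIG = value;
        INIT_FAILURE = failure;
    }

    /** Hypothetical loader that always fails, standing in for a real resource lookup. */
    private static String loadConfig() {
        throw new IllegalStateException("simulated initializer failure");
    }

    static String config() {
        return CONFIG;
    }

    static boolean initializedCleanly() {
        return INIT_FAILURE == null;
    }
}
```

Whether a fallback value is acceptable depends on the application. When it is not, the alternative is to keep failing fast but make sure the original ExceptionInInitializerError is logged where support teams will find it.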

July 19, 2012

by Pierre - Hugues Charbonneau

· 91,074 Views · 3 Likes

5 Tips for Proper Java Heap Size

Determining the proper Java Heap size for a production system is not a straightforward exercise. In my Java EE enterprise experience, I have seen multiple performance problem cases due to inadequate Java Heap capacity and tuning. This article will provide you with 5 tips that can help you determine an optimal Java Heap size, as a starting point, for your current or new production environment. Some of these tips are also very useful regarding the prevention and resolution of java.lang.OutOfMemoryError problems, including memory leaks.

Please note that these tips are intended to “help you” determine the proper Java Heap size. Since each IT environment is unique, you are actually in the best position to determine precisely the required Java Heap specifications of your client’s environment. Some of these tips may also not be applicable in the context of a very small standalone Java application, but I still recommend that you read the entire article. Future articles will include tips on how to choose the proper Java VM garbage collector type for your environment and applications.

#1 – JVM: you always fear what you don't understand

How can you expect to configure, tune and troubleshoot something that you don’t understand? You may never have the chance to write and improve the Java VM specifications, but you are still free to learn their foundations in order to improve your knowledge and troubleshooting skills. Some may disagree, but from my perspective, the thinking that Java programmers are not required to know the internal JVM memory management is an illusion.

Java Heap tuning and troubleshooting can especially be a challenge for Java & Java EE beginners. Find below a typical scenario:

- Your client's production environment is facing OutOfMemoryError on a regular basis, causing a lot of business impact.
Your support team is under pressure to resolve this problem
- A quick Google search allows you to find examples of similar problems and you now believe (and assume) that you are facing the same problem
- You then grab the JVM -Xms and -Xmx values from another person's OutOfMemoryError problem case, hoping to quickly resolve your client’s problem
- You then proceed and implement the same tuning in your environment. 2 days later you realize the problem is still happening (even worse or a little better)… the struggle continues…

What went wrong?

- You failed to first acquire a proper understanding of the root cause of your problem
- You may also have failed to properly understand your production environment at a deeper level (specifications, load situation etc.). Web searches are a great way to learn and share knowledge, but you have to perform your own due diligence and root cause analysis
- You may also be lacking some basic knowledge of the JVM and its internal memory management, preventing you from connecting all the dots together

My #1 tip and recommendation to you is to learn and understand the basic JVM principles along with its different memory spaces. Such knowledge is critical as it will allow you to make valid recommendations to your clients and properly understand the possible impact and risk associated with future tuning considerations.

Now find below a quick high level reference guide for the Java VM. The Java VM memory is split into 3 memory spaces:

- The Java Heap. Applicable for all JVM vendors, usually split between YoungGen (nursery) & OldGen (tenured) spaces.
- The PermGen (permanent generation). Applicable to the Sun HotSpot VM only (PermGen space will be removed in future Java 7 or Java 8 updates)
- The Native Heap (C-Heap). Applicable for all JVM vendors.

I recommend that you review each article below, including the Sun white paper on HotSpot Java memory management. I also encourage you to download and look at the OpenJDK implementation.
## Sun HotSpot VM
http://javaeesupportpatterns.blogspot.com/2011/08/java-heap-space-hotspot-vm.html

## IBM VM
http://javaeesupportpatterns.blogspot.com/2012/02/java-heap-space-ibm-vm.html

## Oracle JRockit VM
http://javaeesupportpatterns.blogspot.com/2012/02/java-heap-space-jrockit-vm.html

## Sun (Oracle) – Java memory management white paper
http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf

## OpenJDK – Open-source Java implementation
http://openjdk.java.net/

As you can see, Java VM memory management is more complex than just setting the biggest value possible via -Xmx. You have to look at all angles, including your native and PermGen space requirements along with physical memory availability (and # of CPU cores) on your physical host(s).

It can get especially tricky for a 32-bit JVM since the Java Heap and native Heap are in a race: the bigger your Java Heap, the smaller the native Heap. Attempting to set up a large Heap for a 32-bit VM, e.g. 2.5 GB+, increases the risk of native OutOfMemoryError depending on your application(s) footprint, number of Threads etc. A 64-bit JVM resolves this problem, but you are still limited by physical resource availability and garbage collection overhead (the cost of major GC collections goes up with size). The bottom line is that bigger is not always better, so please do not assume that you can run all your 20 Java EE applications on a single 16 GB 64-bit JVM process.

#2 – Data and application is king: review your static footprint requirement

Your application(s) along with its associated data will dictate the Java Heap footprint requirement. By static memory, I mean “predictable” memory requirements as per below.

- Determine how many different applications you are planning to deploy to a single JVM process, e.g. number of EAR files, WAR files, jar files etc. The more applications you deploy to a single JVM, the higher the demand on the native Heap
- Determine how many Java classes will potentially be loaded at runtime, including third party APIs. The more class loaders and classes that you load at runtime, the higher the demand on the HotSpot VM PermGen space and internal JIT related optimization objects
- Determine the data cache footprint, e.g. internal cache data structures loaded by your application (and third party APIs) such as cached data from a database, data read from a file etc. The more data caching that you use, the higher the demand on the Java Heap OldGen space
- Determine the number of Threads that your middleware is allowed to create. This is very important since Java threads require enough native memory or OutOfMemoryError will be thrown

For example, you will need much more native memory and PermGen space if you are planning to deploy 10 separate EAR applications on a single JVM process vs. only 2 or 3. Data caching not serialized to a disk or database will require extra memory from the OldGen space.

Try to come up with reasonable estimates of the static memory footprint requirement. This will be very useful to set up some starting point JVM capacity figures before your true measurement exercise (e.g. tip #4). For a 32-bit JVM, I usually do not recommend a Java Heap size higher than 2 GB (-Xms2048m, -Xmx2048m) since you need enough memory for PermGen and native Heap for your Java EE applications and threads.

This assessment is especially important since too many applications deployed in a single 32-bit JVM process can easily lead to native Heap depletion, especially in a multi-threaded environment.

For a 64-bit JVM, a Java Heap size of 3 GB or 4 GB per JVM process is usually my recommended starting point.

#3 – Business traffic sets the rules: review your dynamic footprint requirement

Your business traffic will typically dictate your dynamic memory footprint.
Concurrent users & requests generate the JVM GC “heartbeat” that you can observe from various monitoring tools due to very frequent creation and garbage collection of short & long lived objects. As you saw from the above JVM diagram, a typical ratio of YoungGen vs. OldGen is 1:3 (YoungGen is about 33% of the OldGen size, or 25% of the total heap). For a typical 32-bit JVM, a Java Heap size set at 2 GB (using a generational & concurrent collector) will typically allocate 500 MB for the YoungGen space and 1.5 GB for the OldGen space.

Minimizing the frequency of major GC collections is a key aspect of optimal performance, so it is very important that you understand and estimate how much memory you need during your peak volume.

Again, your type of application and data will dictate how much memory you need. Shopping cart type applications (long lived objects) involving large and non-serialized session data typically need a large Java Heap and a lot of OldGen space. Stateless and XML processing heavy applications (lots of short lived objects) require proper YoungGen space in order to minimize the frequency of major collections.

Example:

- You have 5 EAR applications (~2,000 Java classes) to deploy (which include middleware code as well…)
- Your native heap requirement is estimated at 1 GB (has to be large enough to handle Threads creation etc.)
- Your PermGen space is estimated at 512 MB
- Your internal static data caching is estimated at 500 MB
- Your total forecast traffic is 5000 concurrent users at peak hours
- Each user session data footprint is estimated at 500 KB
- Total footprint requirement for session data alone is 2.5 GB under peak volume

As you can see, with such requirements, there is no way you can have all this traffic sent to a single 32-bit JVM process. A typical solution involves splitting (tip #5) traffic across a few JVM processes and / or physical hosts (assuming you have enough hardware and CPU cores available).
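The arithmetic in the example above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, “500 K” is read as 500 KB per session, 1 GB is taken as 1024 MB, and the cached data and session data are assumed to live entirely in the Java Heap while PermGen and native memory are counted separately.

```java
/** Rough footprint arithmetic for the example above (hypothetical helper, not a JVM API). */
final class FootprintEstimate {

    /** Session data in MB for a number of concurrent users and a per-session size in KB. */
    static long sessionDataMb(long concurrentUsers, long sessionKb) {
        return concurrentUsers * sessionKb / 1024;
    }

    /** Heap demand in MB: static data cache plus dynamic session data (assumed all on-heap). */
    static long heapDemandMb(long cacheMb, long sessionMb) {
        return cacheMb + sessionMb;
    }

    /** Total process demand in MB: Java Heap plus PermGen plus native heap. */
    static long processDemandMb(long heapMb, long permGenMb, long nativeMb) {
        return heapMb + permGenMb + nativeMb;
    }

    public static void main(String[] args) {
        long sessionMb = sessionDataMb(5000, 500);         // 5000 users x 500 KB each
        long heapMb = heapDemandMb(500, sessionMb);        // 500 MB cache + session data
        long totalMb = processDemandMb(heapMb, 512, 1024); // + 512 MB PermGen + 1 GB native
        System.out.println("Session data:  " + sessionMb + " MB");
        System.out.println("Heap demand:   " + heapMb + " MB");
        System.out.println("Process total: " + totalMb + " MB");
    }
}
```

With these inputs the session data alone comes to roughly 2.4 GB (the 2.5 GB quoted above, rounded), and the process total lands above the 4 GB that a 32-bit process can address at all, which is why the example cannot fit in a single 32-bit JVM.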
For this example, however, given the high demand on static memory and to ensure a scalable environment in the long run, I would also recommend a 64-bit VM, but with a smaller Java Heap as a starting point, such as 3 GB, to minimize the GC cost. You definitely want to have extra buffer for the OldGen space, so I typically recommend up to a 50% memory footprint post major collection in order to keep the frequency of Full GC low and leave enough buffer for fail-over scenarios.

Most of the time, your business traffic will drive most of your memory footprint, unless you need a significant amount of data caching to achieve proper performance, which is typical for portal (media) heavy applications. Too much data caching should raise a yellow flag that you may need to revisit some design elements sooner rather than later.

#4 – Don’t guess it, measure it!

At this point you should:

- Understand the basic JVM principles and memory spaces
- Have a deep view and understanding of all applications along with their characteristics (size, type, dynamic traffic, stateless vs. stateful objects, internal memory caches etc.)
- Have a very good view or forecast of the business traffic (# of concurrent users etc.) for each application
- Have some idea whether you need a 64-bit VM or not and which JVM settings to start with
- Have some idea whether you need more than one JVM (middleware) process

But wait, your work is not done yet. While the above information is crucial for coming up with “best guess” Java Heap settings, it is always best to simulate your application(s) behaviour and validate the Java Heap memory requirement via proper profiling, load & performance testing.

You can learn and take advantage of tools such as JProfiler (future articles will include tutorials on JProfiler). From my perspective, learning how to use a profiler is the best way to properly understand your application memory footprint.
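Before reaching for a full profiler, a first measurement can come from the JVM itself. The sketch below uses the standard java.lang.management API to print current heap usage and the individual memory pools. Pool names vary by JVM vendor, version and collector, and the class name HeapSnapshot is only illustrative. It is not a substitute for profiling or load testing, just a quick way to check what your -Xms / -Xmx settings actually produce.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Prints heap and memory pool usage of the running JVM (illustrative helper). */
final class HeapSnapshot {

    /** Converts bytes to whole megabytes; a negative value (undefined) is reported as -1. */
    static long toMb(long bytes) {
        return bytes < 0 ? -1 : bytes / (1024 * 1024);
    }

    /** Used heap as a fraction of the maximum heap, or -1 if the maximum is undefined. */
    static double heapUtilization(MemoryUsage heap) {
        return heap.getMax() <= 0 ? -1.0 : (double) heap.getUsed() / heap.getMax();
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.println("Heap used/committed/max (MB): " + toMb(heap.getUsed())
                + "/" + toMb(heap.getCommitted()) + "/" + toMb(heap.getMax()));
        System.out.println("Heap utilization: " + heapUtilization(heap));

        // One line per pool, for example the young and old generation spaces.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            System.out.println(pool.getName() + " (" + pool.getType() + "): "
                    + toMb(usage.getUsed()) + " MB used, max " + toMb(usage.getMax()) + " MB");
        }
    }
}
```

Running it once with your intended heap flags and once under load gives a rough feel for how the spaces fill up before you invest in a profiling session.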
Another approach I use for existing production environments is heap dump analysis using the Eclipse MAT tool. Heap dump analysis is very powerful and allows you to view and understand the entire memory footprint of the Java Heap, including class loader related data. It is a must-do exercise in any memory footprint analysis, especially for memory leaks.

Java profilers and heap dump analysis tools allow you to understand and validate your application memory footprint, including detection and resolution of memory leaks. Load and performance testing is also a must since this will allow you to validate your earlier estimates by simulating your forecast concurrent users. It will also expose your application bottlenecks and allow you to further fine tune your JVM settings. You can use tools such as Apache JMeter, which is very easy to learn and use, or explore commercial products.

Finally, I have quite often seen Java EE environments running perfectly fine until the day one piece of the infrastructure starts to fail, e.g. a hardware failure. Suddenly the environment is running at reduced capacity (reduced # of JVM processes) and the whole environment goes down. What happened?

There are many scenarios that can lead to domino effects, but lack of JVM tuning and capacity to handle fail-over (short term extra load) is very common. If your JVM processes are running at 80%+ OldGen space capacity with frequent garbage collections, how can you expect to handle any fail-over scenario?

Your load and performance testing exercise performed earlier should simulate such a scenario, and you should adjust your tuning settings so your Java Heap has enough buffer to handle extra load (extra objects) in the short term. This is mainly applicable to the dynamic memory footprint since fail-over means redirecting a certain % of your concurrent users to the available JVM processes (middleware instances).
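The fail-over question above can be turned into a quick back-of-the-envelope check. The sketch below assumes, as a simplification, that the OldGen footprint scales linearly with the number of users a JVM serves and that traffic from failed JVMs is spread evenly over the survivors. Since the static part of the footprint does not grow with traffic, the result is on the pessimistic side. The class and method names are hypothetical.

```java
/** Back-of-the-envelope fail-over check (hypothetical helper; assumes load scales linearly). */
final class FailoverHeadroom {

    /**
     * Projected OldGen utilization per surviving JVM after some JVMs fail.
     * The utilization argument is the current fraction in use, for example 0.8 for 80%.
     */
    static double projectedUtilization(double utilization, int totalJvms, int failedJvms) {
        if (failedJvms < 0 || failedJvms >= totalJvms) {
            throw new IllegalArgumentException("need at least one surviving JVM");
        }
        return utilization * totalJvms / (totalJvms - failedJvms);
    }

    /** True if the survivors would still fit within their OldGen capacity. */
    static boolean survivesFailover(double utilization, int totalJvms, int failedJvms) {
        return projectedUtilization(utilization, totalJvms, failedJvms) <= 1.0;
    }

    public static void main(String[] args) {
        // 4 JVMs at 80% OldGen, one fails: the survivors would need more than 100%.
        System.out.println(survivesFailover(0.80, 4, 1)); // false
        // 4 JVMs at 50% OldGen, one fails: roughly 67% each.
        System.out.println(survivesFailover(0.50, 4, 1)); // true
    }
}
```

A check like this does not replace a fail-over load test, but it shows quickly why running at 80%+ OldGen leaves no room to absorb redirected users.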
#5 – Divide and conquer

At this point you have performed dozens of load testing iterations. You know that your JVM is not leaking memory. Your application memory footprint cannot be reduced any further. You tried several tuning strategies, such as using a large 64-bit Java Heap space of 10 GB+ and multiple GC policies, but you still do not find your performance level acceptable?

In my experience, with current JVM specifications, proper vertical and horizontal scaling, which involves creating a few JVM processes per physical host and across several hosts, will give you the throughput and capacity that you are looking for. Your IT environment will also be more fault tolerant if you break your application list into a few logical silos, each with its own JVM process, Threads and tuning values.

This “divide and conquer” strategy involves splitting your application(s) traffic across multiple JVM processes and will provide you with:

- Reduced Java Heap size per JVM process (both static & dynamic footprint)
- Reduced complexity of JVM tuning
- Reduced GC elapsed and pause time per JVM process
- Increased redundancy and fail-over capabilities
- Alignment with the latest Cloud and IT virtualization strategies

The bottom line is that when you find yourself spending too much time tuning that single elephant 64-bit JVM process, it is time to revisit your middleware and JVM deployment strategy and take advantage of vertical & horizontal scaling. This implementation strategy is more taxing for the hardware but will really pay off in the long run.

Please provide any comment and share your experience on JVM Heap sizing and tuning.

July 19, 2012

by Pierre-Hugues Charbonneau

· 143,244 Views · 7 Likes

Render Geographic Information in 3D With Three.js and D3.js

The last couple of days I've been playing around with three.js and geo information. I wanted to be able to render map/geo data (e.g. in geojson format) inside the three.js scene. That way I have another dimension I could use to show a specific metric instead of just using the color in a 2D map. In this article I'll show you how you can do this. The example we'll create shows a 3D map of the Netherlands, rendered in Three.js, that uses a color to indicate the population density per municipality, while the height of each municipality represents the actual number of residents.

Or you can look at a working example. This information is based on open data available from the Dutch government. If you look at the source of the example, you can see the json we use for this. For more information on geojson and how to parse it, see the other articles I did on this subject:

- Using d3.js to visualize GIS
- Election site part 1: Basics with Knockout.js, Bootstrap and d3.js

To get this working we'll take the following steps:

- Load the input geo data
- Setup a three.js scene
- Convert the input data to a Three.js path using d3.js
- Set the color and height of the Three.js object
- Render everything

Just a reminder: to see everything working, just look at the example.

Load the input geo data

D3.js has support to load json and directly transform it to an SVG path. Though this is a convenient way, I only needed the path data, not the complete SVG elements. So to load the json I just used jQuery's json support.

    // get the data
    jQuery.getJSON('data/cities.json', function(data, textStatus, jqXHR) {
        ..
    });

This will load the data and pass it in the data object to the supplied function.

Setup a three.js scene

Before we do anything with the data, let's first set up a basic Three.js scene.

    // shared by the scene setup and the render call
    var renderer, camera, scene;

    // Set up the three.js scene. This is the most basic setup without
    // any special stuff
    function initScene() {
        // set the scene size
        var WIDTH = 600, HEIGHT = 600;

        // set some camera attributes
        var VIEW_ANGLE = 45, ASPECT = WIDTH / HEIGHT, NEAR = 0.1, FAR = 10000;

        // create a WebGL renderer, camera, and a scene
        renderer = new THREE.WebGLRenderer({antialias:true});
        camera = new THREE.PerspectiveCamera(VIEW_ANGLE, ASPECT, NEAR, FAR);
        scene = new THREE.Scene();

        // add and position the camera at a fixed position
        scene.add(camera);
        camera.position.z = 550;
        camera.position.x = 0;
        camera.position.y = 550;
        camera.lookAt( scene.position );

        // start the renderer, and black background
        renderer.setSize(WIDTH, HEIGHT);
        renderer.setClearColor(0x000);

        // add the render target to the page
        $("#chart").append(renderer.domElement);

        // add a light at a specific position
        var pointLight = new THREE.PointLight(0xFFFFFF);
        scene.add(pointLight);
        pointLight.position.x = 800;
        pointLight.position.y = 800;
        pointLight.position.z = 800;

        // add a base plane on which we'll render our map
        var planeGeo = new THREE.PlaneGeometry(10000, 10000, 10, 10);
        var planeMat = new THREE.MeshLambertMaterial({color: 0x666699});
        var plane = new THREE.Mesh(planeGeo, planeMat);

        // rotate it to correct position
        plane.rotation.x = -Math.PI/2;
        scene.add(plane);
    }

Nothing too special, the inline comments should nicely explain what we're doing here. Next it gets more interesting.

Convert the input data to a Three.js path using d3.js

What we need to do next is convert our geojson input format to a THREE.Path that we can use in our scene. Three.js itself doesn't support geojson, or SVG for that matter. Luckily though, someone already started work on integrating d3.js with three.js. This project is called "d3-threeD" (sources can be found on github here). With this extension you can automagically render SVG elements in 3D directly from D3.js. Cool stuff, but it didn't allow me any control over how the elements were rendered.
It does however contain a function we can use for our scenario. If you look through the source code of this project you'll find a method called "transformSVGPath". This method converts an SVG path string to a Three.Shape element. Unfortunately this method isn't exposed, but that's quickly solved by adding this to the d3-threeD.js file:

    // at the top
    var transformSVGPathExposed;
    ...
    // within the d3threeD(exports) function
    transformSVGPathExposed = transformSVGPath;

This way we can call this method separately. Now that we have a way to transform an SVG path to a Three.js shape, we only need to convert the geojson to an SVG string and pass it to this function. We can use the geo functionality from D3.js for this:

    geons.geoConfig = function() {
        this.TRANSLATE_0 = appConstants.TRANSLATE_0;
        this.TRANSLATE_1 = appConstants.TRANSLATE_1;
        this.SCALE = appConstants.SCALE;

        this.mercator = d3.geo.mercator();
        this.path = d3.geo.path().projection(this.mercator);

        this.setupGeo = function() {
            var translate = this.mercator.translate();
            translate[0] = this.TRANSLATE_0;
            translate[1] = this.TRANSLATE_1;

            this.mercator.translate(translate);
            this.mercator.scale(this.SCALE);
        };
    };

The path variable from the previous piece of code can now be used like this to convert a geojson element to an SVG path:

    var feature = geo.path(geoFeature);

So how does this look combined?

    // add the loaded gis object (in geojson format) to the map
    function addGeoObject() {
        // keep track of rendered objects
        var meshes = [];
        ...
        // convert to mesh and calculate values
        for (var i = 0 ; i < data.features.length ; i++) {
            var geoFeature = data.features[i];
            var feature = geo.path(geoFeature);
            // we only need to convert it to a three.js path
            var mesh = transformSVGPathExposed(feature);
            // add to array
            meshes.push(mesh);
            ...
        }
    }

As you can see, we iterate over the data.features list (this contains all the geojson representations of the municipalities).
Each municipality is converted to an SVG string, and each SVG string is converted to a mesh. This mesh is a Three.js object that we can render on the scene.

Set the color and height of the Three.js object

Now we just need to set the height and the color of the Three.js shape and add it to the scene. The extended addGeoObject method now looks like this:

    // add the loaded gis object (in geojson format) to the map
    function addGeoObject() {
        // keep track of rendered objects
        var meshes = [];
        var averageValues = [];
        var totalValues = [];

        // keep track of min and max, used to color the objects
        var maxValueAverage = 0;
        var minValueAverage = -1;

        // keep track of max and min of total value
        var maxValueTotal = 0;
        var minValueTotal = -1;

        // convert to mesh and calculate values
        for (var i = 0 ; i < data.features.length ; i++) {
            var geoFeature = data.features[i];
            var feature = geo.path(geoFeature);
            // we only need to convert it to a three.js path
            var mesh = transformSVGPathExposed(feature);
            // add to array
            meshes.push(mesh);

            // we get a property from the json object and use it
            // to determine the color later on
            var value = parseInt(geoFeature.properties.bev_dichth, 10);
            if (value > maxValueAverage) maxValueAverage = value;
            if (value < minValueAverage || minValueAverage == -1) minValueAverage = value;
            averageValues.push(value);

            // and we get the max values to determine height later on.
            value = parseInt(geoFeature.properties.aant_inw, 10);
            if (value > maxValueTotal) maxValueTotal = value;
            if (value < minValueTotal || minValueTotal == -1) minValueTotal = value;
            totalValues.push(value);
        }

        // we've got our paths now extrude them to a height and add a color
        for (var i = 0 ; i < averageValues.length ; i++) {
            // create material color based on average
            var scale = ((averageValues[i] - minValueAverage) / (maxValueAverage - minValueAverage)) * 255;
            var mathColor = gradient(Math.round(scale), 255);
            var material = new THREE.MeshLambertMaterial({ color: mathColor });

            // create extrude based on total
            var extrude = ((totalValues[i] - minValueTotal) / (maxValueTotal - minValueTotal)) * 100;
            var shape3d = meshes[i].extrude({amount: Math.round(extrude), bevelEnabled: false});

            // create a mesh based on material and extruded shape
            var toAdd = new THREE.Mesh(shape3d, material);

            // rotate and position the elements nicely in the center
            toAdd.rotation.x = Math.PI/2;
            toAdd.translateX(-490);
            toAdd.translateZ(50);
            toAdd.translateY(extrude/2);

            // add to scene
            scene.add(toAdd);
        }
    }

    // simple gradient function: 0 maps to green, maxLength maps to red
    function gradient(length, maxLength) {
        var i = (length * 255 / maxLength);
        var r = i;
        var g = 255-(i);
        var b = 0;

        var rgb = b | (g << 8) | (r << 16);
        return rgb;
    }

A big piece of code, but not that complex. What we do here is keep track of two values for each municipality: the population density and the total population. These values are used to calculate the color (using the gradient function) and the height, respectively. The height is used in the Three.js extrude function, which converts our 2D Three.js path to a 3D shape. The color is used to define a material. This shape and material are used to create the Mesh that we add to the scene.

Render everything

All that is left is to render everything.
For this example we're not interested in animations or anything, so we can make a single call to the renderer:

    renderer.render( scene, camera );

And the result is as you saw in the beginning. The following image shows a different example. This time we once again show the population density, but now the height represents the land area of the municipality.

I'm currently creating a new set of geojson data, but this time for the whole of Europe. So in the next couple of weeks expect some articles using maps of Europe.

July 16, 2012

by Jos Dirksen

· 41,254 Views

Apache Thrift with Java Quickstart

Apache Thrift is an RPC framework that was originally developed at Facebook and is now an Apache project. Thrift lets you define data types and service interfaces in a language neutral definition file. That definition file is used as the input for the compiler to generate code for building RPC clients and servers that communicate across different programming languages. You can also refer to the Thrift white paper.

According to the official web site, the Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages.

Installing Apache Thrift in Windows

Installing Thrift can be a tiresome process, but for Windows the compiler is available as a prebuilt exe. Download thrift.exe and add its location to your PATH environment variable.

Writing Thrift definition file (.thrift file)

Writing the Thrift definition file becomes really easy once you get used to it. I found this tutorial quite useful to begin with.

Example definition file (add.thrift)

    namespace java com.eviac.blog.samples.thrift.server  // defines the namespace

    typedef i32 int  // typedefs to get convenient names for your types

    service AdditionService {  // defines the service to add two numbers
        int add(1:int n1, 2:int n2),  // defines a method
    }

Compiling Thrift definition file

To compile the .thrift file use the following command.

    thrift --gen <language> <Thrift filename>

For my example the command is,

    thrift --gen java add.thrift

After running the command, you'll find the generated source code for building RPC clients and servers inside the gen-java directory. In my example it will create a Java file called AdditionService.java.

Writing a service handler

The service handler class is required to implement the AdditionService.Iface interface.
Example service handler (AdditionServiceHandler.java)

    package com.eviac.blog.samples.thrift.server;

    import org.apache.thrift.TException;

    public class AdditionServiceHandler implements AdditionService.Iface {

        // implements the add method declared in add.thrift
        @Override
        public int add(int n1, int n2) throws TException {
            return n1 + n2;
        }
    }

Writing a simple server

Following is example code to start a simple Thrift server. To enable the multithreaded server, uncomment the commented parts of the example code.

Example server (MyServer.java)

    package com.eviac.blog.samples.thrift.server;

    import org.apache.thrift.transport.TServerSocket;
    import org.apache.thrift.transport.TServerTransport;
    import org.apache.thrift.server.TServer;
    import org.apache.thrift.server.TServer.Args;
    import org.apache.thrift.server.TSimpleServer;
    // needed only for the multithreaded variant below
    // import org.apache.thrift.server.TThreadPoolServer;

    public class MyServer {

        public static void StartsimpleServer(AdditionService.Processor processor) {
            try {
                // listen on port 9090
                TServerTransport serverTransport = new TServerSocket(9090);
                TServer server = new TSimpleServer(
                        new Args(serverTransport).processor(processor));

                // Use this for a multithreaded server
                // TServer server = new TThreadPoolServer(new
                // TThreadPoolServer.Args(serverTransport).processor(processor));

                System.out.println("Starting the simple server...");
                server.serve();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            StartsimpleServer(new AdditionService.Processor(new AdditionServiceHandler()));
        }
    }

Writing the client

Following is example Java client code which consumes the service provided by AdditionService.
Example client code (AdditionClient.java)

    package com.eviac.blog.samples.thrift.client;

    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;
    import org.apache.thrift.transport.TTransportException;

    // the generated class lives in the package named by the Thrift namespace
    import com.eviac.blog.samples.thrift.server.AdditionService;

    public class AdditionClient {

        public static void main(String[] args) {
            try {
                // connect to the server started by MyServer
                TTransport transport;
                transport = new TSocket("localhost", 9090);
                transport.open();

                TProtocol protocol = new TBinaryProtocol(transport);
                AdditionService.Client client = new AdditionService.Client(protocol);

                System.out.println(client.add(100, 200));

                transport.close();
            } catch (TTransportException e) {
                e.printStackTrace();
            } catch (TException x) {
                x.printStackTrace();
            }
        }
    }

Run the server code (MyServer.java). It should output the following and will listen for requests.

    Starting the simple server...

Then run the client code (AdditionClient.java). It should output the following.

    300

July 16, 2012

by Pavithra Gunasekara

· 43,523 Views · 2 Likes

JMS With ActiveMQ

Java Message Service is a mechanism for integrating applications in a loosely coupled, flexible manner and delivers data asynchronously across applications.

July 14, 2012

by Pavithra Gunasekara

· 164,885 Views · 13 Likes

Dependency Convergence in Maven

I was running into a problem with a Java project that occurred only in IntelliJ Idea, but not on the command line, when running specific test classes in Maven. The exception stack trace had the following in it:

    Caused by: com.sun.jersey.api.container.ContainerException: No WebApplication provider is present

That seems like an easy problem to fix: it is the exception message that is given when jersey can’t find the provider for JAX-RS. Fixing it is normally just a matter of making sure jersey-core is on the classpath to fulfill SPI requirements for JAX-RS. For some reason, though, this wasn’t happening in IntelliJ Idea. I inspected the log output of the test run and it was quite clear that all of the jersey dependencies were on the classpath. Then it dawned on me to try running mvn dependency:tree from inside of Idea. Here is what I found:

    [INFO] +- org.mule.modules:mule-module-jersey:jar:3.2.1:provided
    [INFO] |  +- com.sun.jersey:jersey-server:jar:1.6:provided
    [INFO] |  +- com.sun.jersey:jersey-json:jar:1.6:provided
    [INFO] |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:provided
    [INFO] |  |  \- org.codehaus.jackson:jackson-xc:jar:1.7.1:provided
    [INFO] |  +- com.sun.jersey:jersey-client:jar:1.6:provided
    [INFO] |  \- org.codehaus.jackson:jackson-jaxrs:jar:1.8.0:provided
    ...
    [INFO] +- org.jclouds.driver:jclouds-sshj:jar:1.4.0-rc.3:compile
    [INFO] |  +- org.jclouds:jclouds-compute:jar:1.4.0-rc.3:compile
    [INFO] |  |  \- org.jclouds:jclouds-scriptbuilder:jar:1.4.0-rc.3:compile
    [INFO] |  +- org.jclouds:jclouds-core:jar:1.4.0-rc.3:compile
    [INFO] |  |  +- net.oauth.core:oauth:jar:20100527:compile
    [INFO] |  |  +- com.sun.jersey:jersey-core:jar:1.11:compile
    [INFO] |  |  +- com.google.inject.extensions:guice-assistedinject:jar:3.0:compile

Notice how I have jersey-core 1.11 coming from jclouds-core but jersey 1.6 everywhere else. That, my friends, is a dependency convergence problem.
Maven with its default set of plugins (read: no maven-enforcer-plugin) does not even warn you if something like this happens. In this case, jclouds-core depends directly on jersey-core, and Maven happens to resolve the dependency to the version that jclouds-core declares before jersey-core can be resolved as a transitive dependency of mule-module-jersey.

To fix the symptom, all I had to do was add the jersey-core dependency explicitly as a top level dependency in my pom:

    <dependency>
        <groupId>com.sun.jersey</groupId>
        <artifactId>jersey-core</artifactId>
        <version>${jersey.version}</version>
        <scope>provided</scope>
    </dependency>

But doing so only fixes the symptom, not the problem. The real problem is that the maven project I’m working on does not presently attempt to detect or resolve dependency convergence problems. This is where the maven-enforcer-plugin comes in handy. You can have the enforcer plugin run the DependencyConvergence rule against your build and have it fail when you have potential conflicts in your transitive dependencies that you haven’t yet resolved through exclusions or by declaring direct dependencies.

Binding the maven-enforcer-plugin to your build would look something like this:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-enforcer-plugin</artifactId>
        <version>1.0.1</version>
        <executions>
            <execution>
                <id>enforce</id>
                <goals>
                    <goal>enforce</goal>
                </goals>
                <phase>validate</phase>
                ...
            </execution>
        </executions>
    </plugin>

I chose to bind to the validate phase since that is the first phase to be run in the maven lifecycle. Now my build fails immediately and contains very useful output that looks like the following:

    Dependency convergence error for org.codehaus.jackson:jackson-jaxrs:1.7.1 paths to dependency are:
    +-com.nodeable:server:1.0-SNAPSHOT
      +-org.mule.modules:mule-module-jersey:3.2.1
        +-com.sun.jersey:jersey-json:1.6
          +-org.codehaus.jackson:jackson-jaxrs:1.7.1
    and
    +-com.nodeable:server:1.0-SNAPSHOT
      +-org.mule.modules:mule-module-jersey:3.2.1
        +-org.codehaus.jackson:jackson-jaxrs:1.8.0

There are many rules you can apply besides DependencyConvergence.
However, if the output from the DependencyConvergence rule looks anything like mine does presently, it might take you a while before you get around to getting your maven build to pass and conform to other rules.
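To make the idea of dependency convergence concrete, here is a minimal sketch in plain Java. It is not how the maven-enforcer-plugin is implemented; the class ConvergenceCheck and its input format (groupId:artifactId:version strings) are assumptions made for illustration. It groups the requested coordinates by artifact and reports every artifact that is requested in more than one version, which is the kind of situation the enforcer output above complains about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy convergence check over "groupId:artifactId:version" strings (illustrative only). */
final class ConvergenceCheck {

    /** Maps each groupId:artifactId to the sorted set of versions requested for it. */
    static Map<String, Set<String>> versionsByArtifact(List<String> coordinates) {
        Map<String, Set<String>> versions = new TreeMap<>();
        for (String coordinate : coordinates) {
            int lastColon = coordinate.lastIndexOf(':');
            String artifact = coordinate.substring(0, lastColon);
            String version = coordinate.substring(lastColon + 1);
            versions.computeIfAbsent(artifact, key -> new TreeSet<>()).add(version);
        }
        return versions;
    }

    /** Returns one entry per artifact that is requested in more than one version. */
    static List<String> conflicts(List<String> coordinates) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : versionsByArtifact(coordinates).entrySet()) {
            if (entry.getValue().size() > 1) {
                result.add(entry.getKey() + " " + entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Coordinates taken from the two paths in the enforcer output above.
        List<String> requested = Arrays.asList(
                "com.sun.jersey:jersey-json:1.6",
                "org.codehaus.jackson:jackson-jaxrs:1.7.1",
                "org.codehaus.jackson:jackson-jaxrs:1.8.0");
        System.out.println(conflicts(requested));
    }
}
```

The real rule works on the full dependency graph and also tells you the path that pulled in each version, which is what makes its output actionable.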

July 11, 2012

by Jason Whaley

· 24,834 Views · 1 Like

Everything You Need To Know About Couchbase Architecture

After receiving a lot of good feedback and comments on my last blog post on MongoDB, I was encouraged to do another deep dive on another popular document oriented db: Couchbase. I have been a long-time fan of CouchDB and wrote a blog post on it many years ago. After its merger with Membase, I am very excited to take a deep look into it again.

Couchbase is the merger of two popular NOSQL technologies:

- Membase, which provides persistence, replication and sharding to the high performance memcached technology
- CouchDB, which pioneered the document oriented model based on JSON

Like other NOSQL technologies, both Membase and CouchDB are built from the ground up on a highly distributed architecture, with data sharded across machines in a cluster. Built around the Memcached protocol, Membase provides an easy migration path for existing Memcached users who want to add persistence, sharding and fault resilience to their familiar Memcached model. On the other hand, CouchDB provides first class support for storing JSON documents as well as a simple RESTful API to access them. Underneath, CouchDB also has a highly tuned storage engine that is optimized for both update transactions and query processing. Taking the best of both technologies, Couchbase is well-positioned in the NOSQL marketplace.

Programming model

Couchbase provides client libraries for different programming languages such as Java / .NET / PHP / Ruby / C / Python / Node.js.

For reads, Couchbase provides a key-based lookup mechanism where the client is expected to provide the key, and only the server hosting the data (with that key) will be contacted.

Couchbase also provides a query mechanism to retrieve data where the client provides a query (for example, a range based on some secondary key) as well as the view (basically the index). The query will be broadcast to all servers in the cluster and the results will be merged and sent back to the client.
For writes, Couchbase provides a key-based update mechanism where the client sends in an updated document with the key (as doc id). When handling a write request, the server returns to the client as soon as the data is stored in RAM on the active server, which offers the lowest latency for write requests.

Following is the core API that Couchbase offers (in an abstract sense):

    # Get a document by key
    doc = get(key)

    # Modify a document, notice the whole document
    # needs to be passed in
    set(key, doc)

    # Modify a document when no one has modified it
    # since my last read
    casVersion = doc.getCas()
    cas(key, casVersion, changedDoc)

    # Create a new document, with an expiration time
    # after which the document will be deleted
    addIfNotExist(key, doc, timeToLive)

    # Delete a document
    delete(key)

    # When the value is an integer, increment the integer
    increment(key)

    # When the value is an integer, decrement the integer
    decrement(key)

    # When the value is an opaque byte array, append more
    # data into existing value
    append(key, newData)

    # Query the data
    results = query(viewName, queryParameters)

In Couchbase, the document is the unit of manipulation. Currently Couchbase doesn't support server-side execution of custom logic. Couchbase server is basically a passive store and, unlike other document oriented DBs, Couchbase doesn't support field-level modification. To modify a document, the client needs to retrieve it by its key, do the modification locally and then send the whole (modified) document back to the server. This design trades network bandwidth (since more data is transferred across the network) for server CPU (the CPU load shifts to the client).

Couchbase currently doesn't support bulk modification based on condition matching. Modification happens only on a per-document basis (the client saves the modified documents one at a time).

Transaction Model

Similar to many NOSQL databases, Couchbase’s transaction model is primitive compared to an RDBMS.
Atomicity is guaranteed for a single document; transactions that span updates to multiple documents are not supported. To provide the necessary isolation for concurrent access, Couchbase provides a CAS (compare and swap) mechanism, which works as follows:

When the client retrieves a document, a CAS ID (equivalent to a revision number) is attached to it.
While the client is manipulating the retrieved document locally, another client may modify the same document. When this happens, the CAS ID of the document on the server is incremented.
When the original client submits its modification to the server, it can attach the original CAS ID to its request. The server compares this ID with the document's current ID. If they differ, the document has been updated in the meantime and the server does not apply the update.
The original client then re-reads the document (which now has a newer ID) and re-submits its modification.

Couchbase also provides a locking mechanism for clients to coordinate their access to documents. A client can request a LOCK on the document it intends to modify, update the document and then release the LOCK. To prevent deadlocks, each LOCK grant has a timeout, so it is released automatically after a period of time.

Deployment Architecture

In a typical setting, a Couchbase DB resides in a server cluster of multiple machines. The client library connects to the appropriate servers to access the data. Each machine runs a number of daemon processes that provide data access as well as management functions.

The data server, written in C/C++, is responsible for handling get/set/delete requests from clients. The management server, written in Erlang, is responsible for handling query traffic from clients, as well as managing the configuration and communicating with the other member nodes in the cluster.
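The compare-and-swap flow described under Transaction Model (read, modify locally, write back with the original CAS ID, re-read on a mismatch) can be sketched as follows. This is a minimal sketch, not Couchbase code: FakeStore is a hypothetical in-memory stand-in for a server, and the names CasMismatch and update_with_retry are assumptions made for the illustration.

```python
class CasMismatch(Exception):
    """Raised when the supplied CAS ID no longer matches the stored one."""

class FakeStore:
    """In-memory stand-in for a server (hypothetical, for illustration only)."""
    def __init__(self):
        self._data = {}  # key -> (cas_id, document)

    def get(self, key):
        cas_id, doc = self._data[key]
        return cas_id, dict(doc)  # hand out a copy, like a network read

    def set(self, key, doc):
        cas_id = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (cas_id, dict(doc))

    def cas(self, key, expected_cas, doc):
        current_cas, _ = self._data[key]
        if current_cas != expected_cas:
            raise CasMismatch(key)  # someone else updated the document
        self._data[key] = (current_cas + 1, dict(doc))

def update_with_retry(store, key, modify, max_attempts=5):
    """Read, modify locally, write back; re-read and retry on a CAS mismatch."""
    for _ in range(max_attempts):
        cas_id, doc = store.get(key)
        modify(doc)
        try:
            store.cas(key, cas_id, doc)
            return doc
        except CasMismatch:
            continue  # document changed in between: start over
    raise RuntimeError("gave up after %d attempts" % max_attempts)

def bump(doc):
    doc["hits"] += 1

store = FakeStore()
store.set("counter", {"hits": 0})
update_with_retry(store, "counter", bump)
print(store.get("counter"))
```

The retry limit is a design choice of the sketch: without one, a heavily contended document could keep a client looping indefinitely.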
Virtual Buckets

The basic unit of data storage in Couchbase DB is a JSON document (or a primitive data type such as an integer or a byte array) associated with a key. The overall key space is partitioned into 1024 logical storage units called "virtual buckets" (or vBuckets). vBuckets are distributed across the machines in the cluster via a map that is shared among the servers in the cluster as well as the client library.

High availability is achieved through data replication at the vBucket level. Currently Couchbase supports one active vBucket and zero or more standby replicas hosted on other machines. Currently the standby servers are idle and do not serve any client requests. In a future version of Couchbase, the standby replicas will be able to serve read requests.

Load balancing in Couchbase is achieved as follows:
Keys are uniformly distributed across vBuckets by the hash function.
When machines are added to or removed from the cluster, the administrator can request a redistribution of vBuckets so that data is evenly spread across the physical machines.

Management Server

The management server performs the management functions and coordinates the other nodes within the cluster. It includes the following monitoring and administration functions:
Heartbeat: A watchdog process periodically communicates with all member nodes within the same cluster to provide Couchbase Server health updates.
Process monitor: This subsystem monitors execution of the local data manager, restarts failed processes as required and provides status information to the heartbeat module.
Configuration manager: Each Couchbase Server node shares a cluster-wide configuration that contains the member nodes within the cluster and a vBucket map. The configuration manager pulls this configuration from other member nodes at boot time.
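The two-step lookup described under Virtual Buckets (key to vBucket, then vBucket to server via the shared map) can be sketched as below. The sketch rests on assumptions: the text only says that a hash function is used, so CRC32 modulo 1024 is a stand-in, and the round-robin placement, the node names and all function names are hypothetical.

```python
import zlib

NUM_VBUCKETS = 1024  # number of vBuckets stated in the text

def vbucket_for_key(key):
    """Hash a key to a vBucket id (CRC32 modulo is an assumed hash)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

def build_vbucket_map(servers):
    """Assign vBuckets to servers round-robin (hypothetical placement)."""
    return [servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)]

def server_for_key(key, vbucket_map):
    """Resolve the server holding the active vBucket for a key."""
    return vbucket_map[vbucket_for_key(key)]

servers = ["node-a", "node-b", "node-c"]
vbucket_map = build_vbucket_map(servers)
print(server_for_key("user::42", vbucket_map))
```

Note that the key-to-vBucket step does not depend on the number of servers. Only the map changes when machines join or leave, which is what lets a rebalance move whole vBuckets without rehashing keys.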
Within a cluster, one node’s management server is elected as the leader, which performs the following cluster-wide management functions:
Controls the distribution of vBuckets among the other nodes and initiates vBucket migration
Orchestrates failover and updates the configuration managers of the member nodes

If the leader node crashes, a new leader is elected from the surviving members of the cluster.

When a machine in the cluster crashes, the leader detects that and notifies the member machines in the cluster that all vBuckets hosted on the crashed machine are dead. After getting this signal, the machines hosting the corresponding vBucket replicas set the vBucket status to “active”. The vBucket/server map is updated and eventually propagated to the client library. Notice that at this moment the replication level of the vBucket is reduced. Couchbase doesn’t automatically re-create new replicas, since that would cause data-copying traffic. The administrator can issue a command to explicitly initiate a data rebalance.

The crashed machine can rejoin the cluster after a reboot. At that moment, all the data it stored previously is completely discarded and the machine is treated as a brand-new, empty machine.

As more machines are put into the cluster (for scaling out), vBuckets should be redistributed to achieve load balance. This is currently triggered by an explicit command from the administrator. On receiving the “rebalance” command, the leader computes a new provisional map with a balanced distribution of vBuckets and sends this provisional map to all members of the cluster.

To compute the vBucket map and migration plan, the leader pursues the following objectives:
Evenly distribute the number of active vBuckets and replica vBuckets among member nodes.
Place the active copy and each replica on physically separate nodes.
Spread the replica vBuckets as widely as possible among the other member nodes.
Minimize the amount of data migration.
Orchestrate the steps of replica redistribution so that no node or network is overwhelmed by the replica migration.

Once the vBucket map is determined, the leader passes the redistribution map to each member of the cluster and coordinates the steps of vBucket migration. The actual data transfer happens directly between the origin node and the destination node.

Notice that since there are generally more vBuckets than machines, the migration workload is distributed evenly and automatically. For example, when new machines are added to the cluster, every existing machine migrates some portion of its vBuckets to the new machines. There is no single bottleneck in the cluster.

Throughout the migration and redistribution of vBuckets among servers, a vBucket on a server is in one of the following life-cycle states:
“Active”: the server hosting the vBucket is ready to handle both read and write requests.
“Replica”: the server is hosting a copy of the vBucket that may be slightly out of date but can serve read requests that tolerate some staleness.
“Pending”: the server is hosting a copy that is in a critical transitional state. The server cannot take either read or write requests at this moment.
“Dead”: the server is no longer responsible for the vBucket and will not take either read or write requests anymore.

Data Server

The data server implements the memcached APIs such as get, set, delete, append, prepend, etc. It contains the following key data structures:
One in-memory hashtable (keyed by doc id) for each vBucket hosted. The hashtable acts as both a metadata store for all documents and a cache for the document content. Maintaining the entry gives a quick way to detect whether the document exists on disk.
To support asynchronous writes, there is a checkpoint linked list per vBucket holding the doc ids of modified documents that haven't been flushed to disk or replicated to the replica.

To handle a "GET" request:
The data server routes the request to the ep-engine responsible for the vBucket.
The ep-engine looks up the document id in the in-memory hashtable. If the document content is found in the cache (stored in the value of the hashtable), it is returned. Otherwise, a background disk fetch task is created and queued into the RO dispatcher queue.
The RO dispatcher then reads the value from the underlying storage engine and populates the corresponding entry in the vBucket hashtable.
Finally, the notification thread notifies the pending memcached connection that the disk fetch has completed, so that the memcached worker thread can revisit the engine to process the get request.

To handle a "SET" request, a success response is returned to the calling client once the updated document has been put into the in-memory hashtable and a write request has been put into the checkpoint buffer. Later on, the Flusher thread picks up the outstanding write requests from each checkpoint buffer, looks up the corresponding document content in the hashtable and writes it out to the storage engine.

Of course, data can be lost if the server crashes before the data has been replicated to another server and/or persisted. If the client requires high data availability across crashes, it can issue a subsequent observe() call, which blocks until the server has persisted the data to disk or has replicated the data to another server (and received its ACK). Overall, the client has various options to trade off data integrity against throughput.

Hashtable Management

To synchronize access to a vBucket hashtable, each incoming thread needs to acquire a lock before accessing a key region of the hashtable.
There are multiple locks per vBucket hashtable, each of which is responsible for controlling exclusive access to a certain key region of that hashtable. The number of regions of a hashtable can grow dynamically as more documents are inserted into it.

To control the memory size of the hashtable, the item pager thread monitors its memory utilization. Once a high water mark is reached, it initiates an eviction process to remove certain document content from the hashtable. Only entries that are not referenced by entries in the checkpoint buffer can be evicted, because otherwise the outstanding update (which exists only in the hashtable and has not been persisted) would be lost.

After eviction, the document's entry still remains in the hashtable; only the document content is removed from memory, while the metadata stays. The eviction process stops once the low water mark is reached. The high and low water marks are determined by the bucket memory quota. By default, the high water mark is set to 75% of the bucket quota and the low water mark to 60% of the bucket quota. These water marks are configurable at runtime.

In Couchbase, every document is associated with an expiration time and is deleted once it has expired. The expiry pager is responsible for tracking and removing expired documents from both the hashtable and the storage engine (by scheduling a delete operation).

Checkpoint Manager

The checkpoint manager is responsible for recycling the checkpoint buffer, which holds the outstanding update requests consumed by the two downstream processes, the Flusher and the TAP replicator. When all the requests in the checkpoint buffer have been processed, the checkpoint buffer is deleted and a new one is created.

TAP Replicator

The TAP replicator is responsible for handling vBucket migration as well as vBucket replication from the active server to the replica server.
It does this by propagating the latest modified documents to the corresponding replica server.

When a replica vBucket is established, the entire vBucket needs to be copied from the active server to the empty destination replica server, as follows:
The in-memory hashtable at the active server is transferred to the replica server. Notice that during this period some data may be updated, so the data set transferred to the replica can be inconsistent (some documents are the latest and some are outdated).
Nevertheless, all updates that happen after the start of the transfer are tracked in the checkpoint buffer.
Therefore, once the in-memory hashtable transfer is completed, the TAP replicator can pick up those updates from the checkpoint buffer. This ensures that the latest versions of changed documents are sent to the replica, which fixes the inconsistency.
However, the hashtable cache doesn’t contain all the document content. Data also needs to be read from the vBucket file and sent to the replica. Notice that during this period, updates to the vBucket will happen on the active server. However, since the file is append-only, subsequent data updates won’t interfere with the vBucket copying process.

After the replica server has caught up, subsequent updates at the active server become available in its checkpoint buffer, from which they are picked up by the TAP replicator and sent to the replica server.

CouchDB Storage Structure

The data server defines an interface through which different storage structures can be plugged in. Currently it supports both a SQLite DB and CouchDB. Here we describe the details of CouchDB, which provides a very high-performance storage mechanism underneath the Couchbase technology.

Under the CouchDB structure, there is one file per vBucket. Data is written to this file in an append-only manner, which enables Couchbase to do mostly sequential writes for updates and provides the most optimized access pattern for disk I/O.
This unique storage structure contributes to Couchbase’s fast on-disk performance for write-intensive applications.

The following diagram illustrates the storage model and how it is modified by 3 batch updates (notice that since updates are asynchronous, they are performed by the "Flusher" thread in batches).

The Flusher thread works as follows:
1) Pick up all pending write requests from the dirty queue and de-duplicate multiple update requests to the same document.
2) Sort the requests (by key) into their corresponding vBuckets and open the corresponding files.
3) Append the following to the vBucket file (in the following contiguous sequence):
All document contents in the write request batch. Each document is written as [length, crc, content], one after another sequentially.
The index that stores the mapping from document id to the document’s position on disk (called the by-id BTree)
The index that stores the mapping from update sequence number to the document’s position on disk (called the by-seq BTree)

The by-id index plays an important role in looking up a document by its id. It is organized as a B-Tree where each node contains a key range. To look up a document by id, we start from the header (which is at the end of the file), move to the root BTree node of the by-id index, and then traverse down to the leaf BTree node that contains the pointer to the actual document position on disk.

During a write, a similar mechanism is used to trace back to the BTree node that contains the id of the modified document. Notice that in the append-only model, updates do not happen in place; instead, we locate the existing node and copy it over by appending. In other words, the modified BTree node needs to be copied, modified and finally appended to the end of the file. Its parent then needs to be modified to point to the new location, which causes the parent to be copied and appended to the end of the file as well.
The same happens to the parent’s parent, and so on all the way up to the root node of the BTree. The number of disk seeks is therefore O(log N).

The by-seq index is used to keep track of the update sequence of live documents and is used for asynchronous catch-up purposes. When a document is created, modified or deleted, a sequence number is added to the by-seq BTree and the previous sequence node is deleted. Therefore, for cross-site replication, view index update and compaction, we can quickly locate all the live documents in the order of their update sequence. When a vBucket replicator asks for the list of updates since a particular time, it provides the last sequence number from the previous update; the system then scans the by-seq BTree nodes to locate all the documents that have a larger sequence number, which effectively includes all the documents that have been modified since the last replication.

As time goes by, certain data becomes garbage (see the greyed-out region above) and becomes unreachable in the file. Therefore, we need a garbage collection mechanism to clean it up. To trigger this process, the by-id and by-seq B-Tree nodes keep track of the data size of the live documents (those that are not garbage) under their subtrees. Therefore, by examining the root BTree node, we can determine the size of all live documents within the vBucket. When the ratio of this live-data size to the vBucket file size falls below a certain threshold, a compaction process is triggered, whose job is to open the vBucket file and copy the surviving data to another file.

Technically, the compaction process opens the file and reads the by-seq BTree at the end of the file. It traces the BTree all the way to the leaf nodes and copies the corresponding document content to the new file. The compaction process happens while the vBucket is being updated.
However, since the file is append-only, new changes are recorded after the BTree root that the compaction has opened, so subsequent data updates won’t interfere with the compaction process. When the compaction is completed, the system needs to copy the data that was appended since the beginning of the compaction over to the new file.

View Index Structure

Unlike most indexing structures, which provide a pointer from the search attribute back to the document, the CouchDB index (called the View Index) is better perceived as a denormalized table with arbitrary keys and values loosely associated with the document. Such a denormalized table is defined by user-provided map() and reduce() functions.

map = function(doc) {
  …
  emit(k1, v1)
  …
  emit(k2, v2)
  …
}

reduce = function(keys, values, isRereduce) {
  if (isRereduce) {
    // Do the re-reduce only on values (keys will be null)
  } else {
    // Do the reduce on keys and values
  }
  // result must be ready for input values to re-reduce
  return result
}

Whenever a document is created, updated or deleted, the corresponding map(doc) function is invoked (asynchronously) to generate a set of key/value pairs. These key/value pairs are stored in a B-Tree structure. All the key/value pairs of each B-Tree node are passed into the reduce() function, which computes an aggregated value for that B-Tree node. Re-reduce also happens in non-leaf B-Tree nodes, which further aggregate the aggregated values of their child B-Tree nodes.

The management server maintains the view index and persists it to a separate file.

Creating a view index is performed by broadcasting the index creation request to all machines in the cluster. The management process of each machine reads its active vBucket files and feeds each surviving document to the map function. The key/value pairs emitted by the map function are stored in a separate BTree index file. When a BTree node is written out, the reduce() function is called with the list of all values in the tree node.
Its return value, a partially reduced value, is attached to the BTree node.

The view index is updated incrementally as documents subsequently come into the system. Periodically, the management process opens the vBucket file and scans all documents changed since the last sequence number. For each document changed since the last sync, it invokes the corresponding map function to determine the key/value pairs to put into the BTree node. The BTree node is split if appropriate.

Underneath, Couchbase uses a back index to keep track of each document together with the keys that it previously emitted. Later, when the document is deleted, Couchbase can look up the back index to determine what those keys are and remove them. When a document is updated, the back index can also be examined; semantically, a modification is equivalent to a delete followed by an insert.

The following diagram illustrates how the view index file is incrementally updated via the append-only mechanism.

Query Processing

A query in Couchbase is made against the view index. A query is composed of the view name, a start key and an end key. If the reduce() function isn’t defined, the query result is the list of values sorted by key within the key range. If the reduce() function is defined, the query result is a single aggregated value over all keys within the key range.

If the view has no reduce() function defined, query processing proceeds as follows:
The client issues a query (with view and start/end key) to the management process of any server (unlike a key-based lookup, there is no need to locate a specific server).
The management process broadcasts the request to the management processes on all servers (including itself) within the cluster.
Each management process (after receiving the broadcast request) does a local search for values within the key range by traversing the BTree nodes of its view file, and starts sending the results (automatically sorted by key) back to the initial server.
The initial server merges the sorted results and streams them back to the client.

However, if the view has a reduce() function defined, query processing involves computing a single aggregated value as follows:
The client issues a query (with view and start/end key) to the management process of any server (unlike a key-based lookup, there is no need to locate a specific server).
The management process broadcasts the request to the management processes on all servers (including itself) within the cluster.
Each management process does a local reduce over the values within the key range by traversing the BTree nodes of its view file to compute the reduced value for the key range. If the key range spans an entire BTree node, the pre-computed value for that sub-range can be used. This way, the reduce function can reuse a lot of partially reduced values and doesn’t need to recompute every value in the key range from scratch.
The original server does a final re-reduce() over the values returned from all the other servers, and then passes the final reduced value back to the client.

To illustrate the re-reduce concept, let's say the query has a key range from A to F. Instead of calling reduce([A,B,C,D,E,F]), the system recognizes that the BTree node containing [B,C,D] has been pre-reduced and that the result P is stored in the BTree node, so it only needs to call reduce(A,P,E,F).

Update View Index as vBucket migrates

Since the view index is synchronized with the vBuckets on the same server, when a vBucket has migrated to a different server, the view index is no longer correct; the key/value pairs that belong to the migrated vBucket should be discarded, and the reduce values can no longer be used.
To keep track of the vBuckets and keys in the view index, each BTree node has a 1024-bit bitmask indicating all the vBuckets that are covered in its subtree (i.e., the subtree contains a key emitted from a document belonging to that vBucket). This bitmask is maintained whenever the BTree node is updated.

At the server level, a global bitmask is used to indicate all the vBuckets that this server is responsible for.

When processing a query on a map-only view, before a key/value pair is returned, an extra check is performed on it to make sure its associated vBucket is one that this server is responsible for.

When processing a query on a view that has a reduce() function, we cannot use the pre-computed reduce value if the BTree node contains a vBucket that the server is not responsible for. In this case, the BTree node’s bitmask is compared with the global bitmask. If they are not aligned, the reduce value needs to be recomputed.

Here is an example to illustrate this process.

Couchbase is one of the popular NoSQL technologies, built on a solid technology foundation designed for high performance. In this post, we have examined a number of its key features:
Load balancing between servers inside a cluster that can grow and shrink according to workload conditions. Data migration can be used to re-establish workload balance.
Asynchronous writes provide the lowest possible latency to the client, as the server returns once the data is stored in memory.
The append-only update model turns most update transactions into sequential disk access, and hence provides extremely high throughput for write-intensive applications.
Automatic compaction ensures that the data layout on disk is kept optimized all the time.
The map function can be used to pre-compute the view index to enable query access. Summary data can be pre-aggregated using the reduce function. Overall, this cuts down the workload of query processing dramatically.
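The bitmask comparison described in the section on vBucket migration can be sketched with plain integers used as bit sets. This is an illustration under assumptions, not Couchbase's implementation: the function names are hypothetical, and "aligned" is interpreted here as "every vBucket covered by the node is owned by this server".

```python
def mask_of(vbucket_ids):
    """Build an integer bitmask with one bit set per vBucket id."""
    mask = 0
    for vb in vbucket_ids:
        mask |= 1 << vb
    return mask

def can_reuse_reduce(node_mask, server_mask):
    """True when every vBucket under the node is owned by this server,
    so the node's pre-computed reduce value is still valid."""
    return (node_mask & ~server_mask) == 0

server_mask = mask_of([0, 1, 2, 3])  # vBuckets this server is responsible for
clean_node = mask_of([1, 3])         # subtree touches only owned vBuckets
stale_node = mask_of([2, 900])       # vBucket 900 now lives elsewhere

print(can_reuse_reduce(clean_node, server_mask))  # True
print(can_reuse_reduce(stale_node, server_mask))  # False
```

A single AND against the inverted server mask answers the question for a whole subtree at once, which is why a per-node bitmask is cheaper than checking every key/value pair under the node.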
For a review of NoSQL architecture in general and some theoretical foundations, I have written a blog post on NoSQL design patterns, as well as one on some fundamental differences between SQL and NoSQL.

For other NoSQL technologies, please read my other blog posts on MongoDB, Cassandra and HBase, and Memcached.

Special thanks to Damien Katz and Frank Weigel from the Couchbase team, who provided a lot of implementation details of Couchbase.

July 7, 2012

by Ricky Ho


JBoss AS 7 is neat, but the documentation is still quite lacking (and the error messages are not as useful as they could be). This post summarizes how you can create your own Java EE-compliant login module for authenticating users of your webapp deployed on JBoss AS. A working elementary username-password module is provided.

Why use Java EE standard authentication?

Java EE security primer

A part of the Java EE specification is security for web and EE applications, which makes it possible both to specify declarative constraints in your web.xml (such as “role X is required to access resources at URLs /protected/*”) and to control it programmatically, i.e. to verify that the user has a particular role (see HttpServletRequest.isUserInRole). It works as follows:

You declare in your web.xml:
Login configuration: primarily whether to use the browser prompt (basic) or a custom login form, and a name for the login realm. The custom form uses “magic” values for the post action and the fields, starting with j_, which are intercepted and processed by the server.
The roles used in your application (typically you’d have something like “user” and perhaps “admin”)
What roles are required for accessing particular URL patterns (default: none)
Whether HTTPS is required for some parts of the application

You tell your application server how to authenticate users for that login realm, usually by associating its name with one of the available login modules in the configuration (the modules range from a simple file-based user list to LDAP and Kerberos support). Only rarely do you need to create your own login module, which is the topic of this post.

If this is new to you, then I strongly recommend reading The Java EE 5 Tutorial, Examples: Securing Web Applications (Form-Based Authentication with a JSP Page incl. security constraint specification, Basic Authentication with JAX-WS, Securing an Enterprise Bean, Using the isCallerInRole and getCallerPrincipal Methods).

Why bother?
Declarative security is nicely decoupled from the business code.
It’s easy to propagate security information between a webapp and, for example, EJBs (where you can protect a complete bean or a particular method declaratively via XML or via annotations such as @RolesAllowed).
It’s easy to switch to a different authentication mechanism such as LDAP, and it’s more likely that SSO will be supported.

Custom login module implementation options

If none of the login modules (part of a security domain) provided out of the box with JBoss, such as UsersRoles, Ldap, Database or Certificate, is sufficient for you, then you can adjust one of them or implement your own. You can:
Extend one of the concrete modules, overriding one or more of its methods to adjust it to your needs; see for example how to override the DatabaseServerLoginModule to specify your own encryption of the stored passwords. This should be your primary choice, if possible.
Subclass UsernamePasswordLoginModule.
Implement javax.security.auth.spi.LoginModule if you need maximal flexibility and portability (this is part of the standard JAAS API and is quite complex).

JBoss EAP 5 Security Guide Ch. 12.2. Custom Modules has an excellent description of the basic modules (AbstractServerLoginModule, UsernamePasswordLoginModule) and how to proceed when subclassing them or any other standard module, including a description of the key methods to implement or override. You must read it. (The guide is still perfectly applicable to JBoss AS 7 in this regard.) The custom JndiUserAndPass module example, extending UsernamePasswordLoginModule, is also worth reading: it uses module options and a JNDI lookup.

Example: Custom UsernamePasswordLoginModule subclass

See the source code of MySimpleUsernamePasswordLoginModule, which extends JBoss’ UsernamePasswordLoginModule.
The abstract UsernamePasswordLoginModule (source code) works by comparing the password provided by the user for equality with the password returned from the method getUsersPassword, implemented by a subclass. You can use the method getUsername to obtain the name of the user attempting to log in.

Implement abstract methods

getUsersPassword()
Implement getUsersPassword() to look up the user’s password wherever you have it. If you do not store passwords in plain text, then read below how to customize the behavior via other methods.

getRoleSets()
Implement getRoleSets() (from AbstractServerLoginModule) to return at least one group named “Roles” containing zero or more roles assigned to the user; see the implementation in the source code for this post. Usually you’d look up the roles for the user somewhere (instead of returning a hardcoded “user_role” role).

Optionally extend initialize(..) to get access to module options etc.

Usually you will also want to extend initialize(Subject subject, CallbackHandler callbackHandler, Map sharedState, Map options) (called for each authentication attempt):
To get the values of the module options declared in the security-domain configuration; see the JBoss 5 custom module example
To do other initialization, such as looking up a data source via JNDI; see the DatabaseServerLoginModule

Optionally override other methods to customize the behavior

If you do not store passwords in plain text (a wise choice!) and your hashing method isn’t supported out of the box, then you can override createPasswordHash(String username, String password, String digestOption) to hash/encrypt the user-supplied password before comparison with the stored password.

Alternatively, you could override validatePassword(String inputPassword, String expectedPassword) to do whatever conversion you need on the password before comparison, or even to do a different type of comparison than equality.
Custom login module deployment options

In JBoss AS you can:
Deploy your login module class in a JAR as a standalone module, independently of the webapp, under /modules/, together with a module.xml; this is described at JBossAS7SecurityCustomLoginModules
Deploy your login module class as a part of your webapp (no module.xml required), either in a JAR inside WEB-INF/lib/ or directly under WEB-INF/classes

In each case you have to declare a corresponding security-domain inside the JBoss configuration (standalone/configuration/standalone.xml or domain/configuration/domain.xml):

The code attribute should contain the fully qualified name of your login module class, and the security-domain’s name must match the declaration in jboss-web.xml:

form-auth true

The code

Download the webapp jboss-custom-login containing the custom login module MySimpleUsernamePasswordLoginModule and follow the deployment instructions in the README.

July 4, 2012

by Jakub Holý


Using the JavaFX AnimationTimer

In retrospect it was probably not a good idea to give the AnimationTimer its name, because it can be used for much more than just animation: measuring the fps rate, collision detection, calculating the steps of a simulation, the main loop of a game, etc. In fact, most of the time I saw AnimationTimer in action, it was not related to animation at all. Nevertheless, there are cases when you want to consider using an AnimationTimer for your animation. This post explains the class and shows an example where AnimationTimer is used to calculate animations.

The AnimationTimer provides an extremely simple, but very useful and flexible feature. It lets you specify a method that will be called in every frame. What this method is used for is not limited and, as already mentioned, does not need to have anything to do with animation. The only requirement is that it has to return quickly, because otherwise it can easily become the bottleneck of a system.

To use it, a developer has to extend AnimationTimer and implement the abstract method handle(). This is the method that will be called in every frame while the AnimationTimer is active. A single parameter is passed to handle(). It contains the current time in nanoseconds, the same as what you would get when calling System.nanoTime().

Why should one use the passed-in value instead of calling System.nanoTime() or its little brother System.currentTimeMillis() oneself? There are several reasons, but probably the most important is that it makes your life a lot easier while debugging. If you have ever tried to debug code that depended on these two methods, you know that you are basically screwed. But the JavaFX runtime goes into a paused state while it is waiting to execute the next step during debugging, and the internal clock does not advance during this pause. In other words, no matter whether you wait two seconds or two hours before you resume a halted program while debugging, the increment of the parameter will be roughly the same!
AnimationTimer has two methods, start() and stop(), to activate and deactivate it. If you override them, it is important that you call the superclass implementations.

The Animation API comes with many feature-rich classes that make defining an animation very simple. There are predefined Transition classes, it is possible to define a key-frame based animation using Timeline, and one can even write a custom Transition easily. But in which cases does it make sense to use an AnimationTimer instead? Almost always you want to use one of the standard classes. But if you want to specify many simple animations, using an AnimationTimer can be the better choice.

The feature richness of the standard animation classes comes at a price. Every single animation requires a whole bunch of variables to be tracked, variables that you often do not need for simple animations. In addition, these classes are optimized for speed, not for a small memory footprint. Some of the variables are stored twice, once in the format the public API requires and once in a format that allows faster calculation while playing.

Below is a simple example that shows a star field. It animates thousands of rectangles flying from the center to the outer edges. Using an AnimationTimer makes it possible to store only the values that are needed. The calculation is extremely simple compared to the calculation within a Timeline, for example, because no advanced features (loops, animation rate, direction, etc.) have to be considered.
    package fxsandbox;

    import java.util.Random;
    import javafx.animation.AnimationTimer;
    import javafx.application.Application;
    import javafx.scene.Group;
    import javafx.scene.Node;
    import javafx.scene.Scene;
    import javafx.scene.paint.Color;
    import javafx.scene.shape.Rectangle;
    import javafx.stage.Stage;

    public class FXSandbox extends Application {

        private static final int STAR_COUNT = 20000;

        private final Rectangle[] nodes = new Rectangle[STAR_COUNT];
        private final double[] angles = new double[STAR_COUNT];
        private final long[] start = new long[STAR_COUNT];

        private final Random random = new Random();

        @Override
        public void start(final Stage primaryStage) {
            for (int i=0; i

June 27, 2012

by Michael Heinrichs

· 18,556 Views

Using Cookies to implement a RememberMe functionality

Some web applications may need a "Remember Me" functionality. This means that, after logging in, a user keeps access to their data from the same machine even after the session has expired. This access remains possible until the user logs out. If you are using Spring and its login form, then you should use the "Remember Me" functionality already implemented inside the framework. Some web frameworks also offer a type of sign-in panel which already has "remember me" built in. But in case you have to implement "Remember Me" functionality on your own, this can easily be achieved using cookies. Java has a cookie class named javax.servlet.http.Cookie.

The algorithm is straightforward:

- Your login panel must contain a "Remember Me" check box.
- After a successful login with the "Remember Me" check box selected, you can create two cookies: one to keep the value for rememberMe and one to keep a token which identifies the logged-in user. For the sake of security, this token must never contain the user name or the user password. The idea is to generate a random id as the token value. The token value, together with the user id, must be saved in your storage (database).
- Whenever a login is needed, you have to check whether there is any cookie saved by you. If so, and your "rememberMe" value is true, you can load the user from storage based on your token and do an automatic login.
- When a logout is done, you have to delete the cookie that keeps the token.

To add a cookie, you have to specify the maximum age of the cookie in seconds:

    HttpServletResponse servletResponse = ...;
    Cookie c = new Cookie(COOKIE_NAME, encodeString(uuid));
    c.setMaxAge(365 * 24 * 60 * 60); // one year
    servletResponse.addCookie(c);

To delete a cookie, you have to find the cookie by name and set its maximum age to 0 before adding it to the servlet response:

    HttpServletRequest servletRequest = ...;
    HttpServletResponse servletResponse = ...;
    Cookie[] cookies = servletRequest.getCookies();
    // getCookies() returns null when the request carries no cookies
    if (cookies != null) {
        for (int i = 0; i < cookies.length; i++) {
            Cookie c = cookies[i];
            if (c.getName().equals(COOKIE_NAME)) {
                c.setMaxAge(0); // a maximum age of 0 tells the browser to delete the cookie
                c.setValue(null);
                servletResponse.addCookie(c);
            }
        }
    }
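The server-side half of the algorithm (generate a random token, store it next to the user id, look it up on the next visit, delete it on logout) could be sketched roughly as follows. This is only an illustration under stated assumptions: RememberMeTokens and its methods are hypothetical names, the in-memory map stands in for the database mentioned above, and the token comes from SecureRandom rather than the uuid used in the cookie snippet.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class RememberMeTokens {

    private final SecureRandom random = new SecureRandom();

    // Stand-in for the database table that maps token -> user id.
    private final Map<String, String> userIdByToken = new ConcurrentHashMap<>();

    /** Creates a random token for the user and remembers the association. */
    public String issue(String userId) {
        byte[] bytes = new byte[32];
        random.nextBytes(bytes);
        // URL-safe Base64 keeps the value free of characters that need escaping in a cookie.
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        userIdByToken.put(token, userId);
        return token;
    }

    /** Returns the user id for a token read from the cookie, if the token is known. */
    public Optional<String> resolve(String token) {
        if (token == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(userIdByToken.get(token));
    }

    /** Forgets the token; call this on logout, together with deleting the cookie. */
    public void revoke(String token) {
        if (token != null) {
            userIdByToken.remove(token);
        }
    }
}
```

The token value returned by issue() is what would go into the cookie. Since it is random and carries neither the user name nor the password, a leaked cookie reveals nothing beyond the token itself, and revoke() makes it useless.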

June 26, 2012

by Mihai Dinca - Panaitescu

· 58,953 Views · 1 Like

JAX-WS Header: Part 1 the Client Side

Manipulating the JAX-WS header on the client side, for example adding a WSS username token or logging the SOAP message.

June 25, 2012

by Slim Ouertani

· 89,768 Views

Managing ActiveMQ with JMX APIs

Here is a quick example of how to programmatically access ActiveMQ MBeans to monitor and manipulate message queues...

First, get a connection to a JMX server (this assumes localhost, port 1099, no auth). Note: always cache the connection for subsequent requests, otherwise it can cause memory utilization issues.

    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url);
    MBeanServerConnection conn = jmxc.getMBeanServerConnection();

Then, you can execute various operations such as addQueue, removeQueue, etc...

    String operationName = "addQueue";
    String parameter = "MyNewQueue";
    ObjectName activeMQ = new ObjectName("org.apache.activemq:BrokerName=localhost,Type=Broker");
    if (parameter != null) {
        // operation with a single String argument
        Object[] params = {parameter};
        String[] sig = {"java.lang.String"};
        conn.invoke(activeMQ, operationName, params, sig);
    } else {
        // operation without arguments
        conn.invoke(activeMQ, operationName, null, null);
    }

Also, you can get an ActiveMQ QueueViewMBean instance for a specified queue name...

    ObjectName activeMQ = new ObjectName("org.apache.activemq:BrokerName=localhost,Type=Broker");
    BrokerViewMBean mbean = (BrokerViewMBean) MBeanServerInvocationHandler.newProxyInstance(conn, activeMQ, BrokerViewMBean.class, true);
    for (ObjectName name : mbean.getQueues()) {
        // use the same connection (conn) that was opened above
        QueueViewMBean queueMbean = (QueueViewMBean) MBeanServerInvocationHandler.newProxyInstance(conn, name, QueueViewMBean.class, true);
        if (queueMbean.getName().equals(queueName)) {
            queueViewBeanCache.put(cacheKey, queueMbean);
            return queueMbean;
        }
    }

Then, execute one of several APIs against the QueueViewMBean instance...

Queue monitoring - getEnqueueCount(), getDequeueCount(), getConsumerCount(), etc...

Queue manipulation - purge(), getMessage(String messageId), removeMessage(String messageId), moveMessageTo(String messageId, String destinationName), copyMessageTo(String messageId, String destinationName), etc...
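The two access styles used above, invoking an operation by name with an explicit signature and calling through a typed proxy, can be tried without a running broker. The following is a minimal sketch under that assumption: DemoBrokerMBean, DemoBroker and the ObjectName "demo.sketch:type=Broker,name=local" are hypothetical stand-ins and not ActiveMQ classes, and the platform MBean server of the current JVM replaces the remote connection. The proxy is created with JMX.newMBeanProxy, a standard alternative to MBeanServerInvocationHandler.newProxyInstance.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.JMX;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class JmxInvokeSketch {

    /** Hypothetical management interface, standing in for a broker MBean. */
    public interface DemoBrokerMBean {
        void addQueue(String name);
        void removeQueue(String name);
        int getQueueCount();
    }

    /** Trivial implementation that only keeps queue names in a list. */
    public static class DemoBroker implements DemoBrokerMBean {
        private final List<String> queues = new CopyOnWriteArrayList<>();
        @Override public void addQueue(String name) { queues.add(name); }
        @Override public void removeQueue(String name) { queues.remove(name); }
        @Override public int getQueueCount() { return queues.size(); }
    }

    /** Registers the demo MBean, calls it in both styles, returns the queue count. */
    public static int run() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name = new ObjectName("demo.sketch:type=Broker,name=local");
            if (server.isRegistered(name)) {
                server.unregisterMBean(name);
            }
            server.registerMBean(new StandardMBean(new DemoBroker(), DemoBrokerMBean.class), name);
            try {
                // Style 1: operation name plus parameter array and signature array.
                server.invoke(name, "addQueue",
                        new Object[] {"MyNewQueue"},
                        new String[] {"java.lang.String"});

                // Style 2: typed proxy, so calls look like plain Java method calls.
                DemoBrokerMBean proxy = JMX.newMBeanProxy(server, name, DemoBrokerMBean.class);
                proxy.addQueue("AnotherQueue");
                return proxy.getQueueCount();
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException("JMX demo failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("queues: " + run());
    }
}
```

The proxy style is usually easier to read and is checked by the compiler, while invoke() is handy when the operation name is only known at runtime, as in the addQueue/removeQueue snippet above.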
Summary

The APIs can easily be used to build a web or command line based tool to support remote ActiveMQ management features. That being said, all of these features are available via the JMX console itself, and ActiveMQ does provide a web console to support some management/monitoring tasks. See these pages for more information...

http://activemq.apache.org/jmx-support.html
http://activemq.apache.org/web-console.html

June 22, 2012

by Ben O'Day

· 32,176 Views · 1 Like

Java Volatile Keyword Explained by Example

Check out an example of the volatile Java keyword.

June 21, 2012

by Thibault Delor

· 260,970 Views · 20 Likes