Data Engineering Resources

The Latest Data Engineering Topics

As you may have seen from my past performance-related articles and HashMap case studies, Java thread safety problems can bring down your Java EE application and the Java EE container fairly easily. One of the most common problems I have observed when troubleshooting Java EE performance problems is infinite looping triggered by non-thread-safe HashMap get() and put() operations. This problem has been known for several years, but recent production problems have forced me to revisit it one more time.

This article revisits this classic thread safety problem and demonstrates, using a simple Java program, the risk of using the plain old java.util.HashMap data structure incorrectly in a concurrent threads context.

This proof of concept exercise will attempt to achieve the following 3 goals:

- Revisit and compare the Java program performance level between the non-thread-safe and thread-safe Map data structure implementations (HashMap, Hashtable, synchronized HashMap, ConcurrentHashMap)
- Replicate and demonstrate the HashMap infinite looping problem using a simple Java program that everybody can compile, run and understand
- Review the usage of the above Map data structures in a real-life and modern Java EE container implementation such as JBoss AS7

For more detail on the ConcurrentHashMap implementation strategy, I highly recommend the great article from Brian Goetz on this subject.
Tools and server specifications

As a starting point, find below the different tools and software used for the exercise:

- Sun/Oracle JDK & JRE 1.7 64-bit
- Eclipse Java EE IDE
- Windows Process Explorer (CPU per Java Thread correlation)
- JVM Thread Dump (stuck thread analysis and CPU per Thread correlation)

The following local computer was used for the problem replication process and performance measurements:

- Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz (2 CPU cores, 4 logical cores)
- 8 GB RAM
- Windows 7 64-bit

* Results and performance of the Java program may vary depending on your workstation or server specifications.

Java program

In order to help us achieve the above goals, a simple Java program was created as per below:

- The main Java program is HashMapInfiniteLoopSimulator.java
- A worker Thread class WorkerThread.java was also created

The program performs the following:

- Initialize different static Map data structures with an initial size of 2
- Assign the chosen Map to the worker threads (you can choose between 4 Map implementations)
- Create a certain number of worker threads (as per the header configuration). 3 worker threads were created for this proof of concept: NB_THREADS = 3;
- Each of these worker threads has the same task: look up and insert a new element in the assigned Map data structure using a random Integer element between 1 and 1 000 000
- Each worker thread performs this task for a total of 500K iterations
- The overall program performs 50 iterations in order to allow enough ramp-up time for the HotSpot JVM
- The concurrent threads context is achieved using the JDK ExecutorService

As you can see, the Java program task is fairly simple but complex enough to meet the following critical criteria:

- Generate concurrency against a shared / static Map data structure
- Use a mix of get() and put() operations in order to attempt to trigger internal locks and / or internal corruption (for the non-thread-safe implementation)
- Use a small Map initial size of 2, forcing the internal HashMap to trigger an internal rehash/resize

Finally, the following parameters can be modified at your convenience:

    ## Number of worker threads
    private static final int NB_THREADS = 3;

    ## Number of Java program iterations
    private static final int NB_TEST_ITERATIONS = 50;

    ## Map data structure assignment. You can choose between 4 structures
    // Plain old HashMap (since JDK 1.2)
    nonThreadSafeMap = new HashMap<String, Integer>(2);

    // Plain old Hashtable (since JDK 1.0)
    threadSafeMap1 = new Hashtable<String, Integer>(2);

    // Fully synchronized HashMap
    threadSafeMap2 = new HashMap<String, Integer>(2);
    threadSafeMap2 = Collections.synchronizedMap(threadSafeMap2);

    // ConcurrentHashMap (since JDK 1.5)
    threadSafeMap3 = new ConcurrentHashMap<String, Integer>(2);

    /*** Assign map at your convenience ****/
    assignedMapForTest = threadSafeMap3;

Now find below the source code of our sample program.
#### HashMapInfiniteLoopSimulator.java

    package org.ph.javaee.training4;

    import java.util.Collections;
    import java.util.Map;
    import java.util.HashMap;
    import java.util.Hashtable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /**
     * HashMapInfiniteLoopSimulator
     * @author Pierre-Hugues Charbonneau
     *
     */
    public class HashMapInfiniteLoopSimulator {

        private static final int NB_THREADS = 3;
        private static final int NB_TEST_ITERATIONS = 50;

        private static Map<String, Integer> assignedMapForTest = null;
        private static Map<String, Integer> nonThreadSafeMap = null;
        private static Map<String, Integer> threadSafeMap1 = null;
        private static Map<String, Integer> threadSafeMap2 = null;
        private static Map<String, Integer> threadSafeMap3 = null;

        /**
         * Main program
         * @param args
         */
        public static void main(String[] args) {

            System.out.println("Infinite Looping HashMap Simulator");
            System.out.println("Author: Pierre-Hugues Charbonneau");
            System.out.println("http://javaeesupportpatterns.blogspot.com");

            for (int i = 0; i < NB_TEST_ITERATIONS; i++) {

                // Plain old HashMap (since JDK 1.2)
                nonThreadSafeMap = new HashMap<String, Integer>(2);

                // Plain old Hashtable (since JDK 1.0)
                threadSafeMap1 = new Hashtable<String, Integer>(2);

                // Fully synchronized HashMap
                threadSafeMap2 = new HashMap<String, Integer>(2);
                threadSafeMap2 = Collections.synchronizedMap(threadSafeMap2);

                // ConcurrentHashMap (since JDK 1.5)
                threadSafeMap3 = new ConcurrentHashMap<String, Integer>(2);

                /*** Assign map at your convenience ****/
                assignedMapForTest = threadSafeMap3;

                long timeBefore = System.currentTimeMillis();
                long timeAfter = 0;
                Float totalProcessingTime = null;

                ExecutorService executor = Executors.newFixedThreadPool(NB_THREADS);

                for (int j = 0; j < NB_THREADS; j++) {
                    /** Assign the Map at your convenience **/
                    Runnable worker = new WorkerThread(assignedMapForTest);
                    executor.execute(worker);
                }

                // This will make the executor accept no new threads
                // and finish all existing threads in the queue
                executor.shutdown();

                // Wait until all threads are finished (busy-wait on the main thread)
                while (!executor.isTerminated()) {
                }

                timeAfter = System.currentTimeMillis();
                totalProcessingTime = new Float((float) (timeAfter - timeBefore) / (float) 1000);

                System.out.println("All threads completed in " + totalProcessingTime + " seconds");
            }
        }
    }

#### WorkerThread.java

    package org.ph.javaee.training4;

    import java.util.Map;

    /**
     * WorkerThread
     *
     * @author Pierre-Hugues Charbonneau
     *
     */
    public class WorkerThread implements Runnable {

        private Map<String, Integer> map = null;

        public WorkerThread(Map<String, Integer> assignedMap) {
            this.map = assignedMap;
        }

        @Override
        public void run() {

            for (int i = 0; i < 500000; i++) {
                // Return 2 integers between 1-1000000 inclusive
                Integer newInteger1 = (int) Math.ceil(Math.random() * 1000000);
                Integer newInteger2 = (int) Math.ceil(Math.random() * 1000000);

                // 1. Attempt to retrieve a random Integer element
                Integer retrievedInteger = map.get(String.valueOf(newInteger1));

                // 2. Attempt to insert a random Integer element
                map.put(String.valueOf(newInteger2), newInteger2);
            }
        }
    }

Performance comparison between thread safe Map implementations

The first goal is to compare the performance level of our program when using different thread-safe Map implementations:

- Plain old Hashtable (since JDK 1.0)
- Fully synchronized HashMap (via Collections.synchronizedMap())
- ConcurrentHashMap (since JDK 1.5)

Find below the graphical results of the execution of the Java program for each iteration along with a sample of the program console output.
# Output when using ConcurrentHashMap

    Infinite Looping HashMap Simulator
    Author: Pierre-Hugues Charbonneau
    http://javaeesupportpatterns.blogspot.com
    All threads completed in 0.984 seconds
    All threads completed in 0.908 seconds
    All threads completed in 0.706 seconds
    All threads completed in 1.068 seconds
    All threads completed in 0.621 seconds
    All threads completed in 0.594 seconds
    All threads completed in 0.569 seconds
    All threads completed in 0.599 seconds
    ………………

As you can see, the ConcurrentHashMap is the clear winner here, taking on average only about 0.6 seconds (after an initial ramp-up) for all 3 worker threads to concurrently read and insert data within a 500K looping statement against the assigned shared Map. Please note that no problem was found with the program execution, e.g. no hang situation.

The performance boost is most likely due to ConcurrentHashMap design improvements such as the non-blocking get() operation.

The performance level of the 2 other Map implementations was fairly similar, with a small advantage for the synchronized HashMap.

HashMap infinite looping problem replication

The next objective is to replicate the HashMap infinite looping problem observed so often in Java EE production environments. In order to do that, you simply need to assign the non-thread-safe HashMap implementation as per the code snippet below:

    /*** Assign map at your convenience ****/
    assignedMapForTest = nonThreadSafeMap;

Running the program as is using the non-thread-safe HashMap should lead to:

- No output other than the program header
- Significant CPU increase observed from the system
- At some point the Java program will hang and you will be forced to kill the Java process

What happened? In order to understand this situation and confirm the problem, we will perform a CPU per Thread analysis from the Windows OS using Process Explorer and JVM Thread Dump.

1 - Run the program again then quickly capture the thread per CPU data from Process Explorer as per below.
Under explore.exe you will need to right click over the javaw.exe and select properties. The threads tab will be displayed. We can see overall 4 threads using almost all the CPU of our system. 2 – Now you have to quickly capture a JVM Thread Dump using the JDK 1.7 jstack utility. For our example, we can see our 3 worker threads which seems busy/stuck performing get() and put() operations. ..\jdk1.7.0\bin>jstack 272 2012-08-29 14:07:26 Full thread dump Java HotSpot(TM) 64-Bit Server VM (21.0-b17 mixed mode): "pool-1-thread-3" prio=6 tid=0x0000000006a3c000 nid=0x18a0 runnable [0x0000000007ebe000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.put(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) "pool-1-thread-2" prio=6 tid=0x0000000006a3b800 nid=0x6d4 runnable [0x000000000805f000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.get(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:29) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) "pool-1-thread-1" prio=6 tid=0x0000000006a3a800 nid=0x2bc runnable [0x0000000007d9e000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.put(Unknown Source) at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) .............. 3 – CPU per thread correlation It is now time to convert the Process Explorer thread ID DECIMAL format to HEXA format as per below. 
The HEXA value allows us to map and identify each thread as per below:

## TID: 1748 (nid=0X6D4)
Thread name: pool-1-thread-2
CPU @25.71%
Task: Worker thread executing a HashMap.get() operation

    at java.util.HashMap.get(Unknown Source)
    at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:29)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

## TID: 700 (nid=0X2BC)
Thread name: pool-1-thread-1
CPU @23.55%
Task: Worker thread executing a HashMap.put() operation

    at java.util.HashMap.put(Unknown Source)
    at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

## TID: 6304 (nid=0X18A0)
Thread name: pool-1-thread-3
CPU @12.02%
Task: Worker thread executing a HashMap.put() operation

    at java.util.HashMap.put(Unknown Source)
    at org.ph.javaee.training4.WorkerThread.run(WorkerThread.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

## TID: 5944 (nid=0X1738)
Thread name: main
CPU @20.88%
Task: Main Java program execution

    "main" prio=6 tid=0x0000000001e2b000 nid=0x1738 runnable [0x00000000029df000]
       java.lang.Thread.State: RUNNABLE
        at org.ph.javaee.training4.HashMapInfiniteLoopSimulator.main(HashMapInfiniteLoopSimulator.java:75)

As you can see, the above correlation and analysis is quite revealing. Our main Java program is in a hang state because our 3 worker threads are using a lot of CPU and not going anywhere (the main thread itself is busy-waiting for them to terminate). The workers may appear "stuck" performing HashMap get() & put(), but in fact they are all involved in an infinite loop condition. This is exactly what we wanted to replicate.
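The decimal-to-hexadecimal conversion used in this correlation step is easy to script. Below is a minimal sketch, assuming you have copied the decimal thread IDs from Process Explorer by hand; the class and method names (ThreadIdCorrelation, toNid) are hypothetical and only serve the illustration.

```java
// Sketch: convert decimal OS thread IDs (as shown by Process Explorer) into the
// hexadecimal "nid" notation printed in a HotSpot thread dump.
final class ThreadIdCorrelation {

    // Hypothetical helper: formats a decimal thread ID as a lowercase "0x..." string.
    static String toNid(long decimalThreadId) {
        return "0x" + Long.toHexString(decimalThreadId);
    }

    public static void main(String[] args) {
        // Thread IDs taken from the correlation in this article.
        long[] threadIds = {1748, 700, 6304, 5944};
        for (long tid : threadIds) {
            System.out.println("TID " + tid + " -> nid=" + toNid(tid));
        }
    }
}
```

Searching the thread dump for the resulting nid value (for example nid=0x6d4) then points directly at the thread that owns the CPU usage.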
HashMap infinite looping deep dive Now let’s push the analysis one step further to better understand this looping condition. For this purpose, we added tracing code within the JDK 1.7 HashMap Java class itself in order to understand what is happening. Similar logging was added for the put() operation and also a trace indicating that the internal & automatic rehash/resize got triggered. The tracing added in get() and put() operations allows us to determine if the for() loop is dealing with circular dependency which would explain the infinite looping condition. #### HashMap.java get() operation public V get(Object key) { if (key == null) return getForNullKey(); int hash = hash(key.hashCode()); /*** P-H add-on- iteration counter ***/ int iterations = 1; for (Entry e = table[indexFor(hash, table.length)]; e != null; e = e.next) { /*** Circular dependency check ***/ Entry currentEntry = e; Entry nextEntry = e.next; Entry nextNextEntry = e.next != null?e.next.next:null; K currentKey = currentEntry.key; K nextNextKey = nextNextEntry != null?(nextNextEntry.key != null?nextNextEntry.key:null):null; System.out.println("HashMap.get() #Iterations : "+iterations++); if (currentKey != null && nextNextKey != null ) { if (currentKey == nextNextKey || currentKey.equals(nextNextKey)) System.out.println(" ** Circular Dependency detected! ["+currentEntry+"]["+nextEntry+"]"+"]["+nextNextEntry+"]"); } /***** END ***/ Object k; if (e.hash == hash && ((k = e.key) == key || key.equals(k))) return e.value; } return null; } HashMap.get() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 1 HashMap.resize() in progress... HashMap.put() #Iterations : 1 HashMap.put() #Iterations : 2 HashMap.resize() in progress... HashMap.resize() in progress... 
    HashMap.put() #Iterations : 1
    HashMap.put() #Iterations : 2
    HashMap.put() #Iterations : 1
    HashMap.get() #Iterations : 1
    HashMap.get() #Iterations : 1
    HashMap.put() #Iterations : 1
    HashMap.get() #Iterations : 1
    HashMap.get() #Iterations : 1
    HashMap.put() #Iterations : 1
    HashMap.get() #Iterations : 1
    HashMap.put() #Iterations : 1
    ** Circular Dependency detected! [362565=362565][333326=333326]][362565=362565]
    HashMap.put() #Iterations : 2
    ** Circular Dependency detected! [333326=333326][362565=362565]][333326=333326]
    HashMap.put() #Iterations : 1
    HashMap.put() #Iterations : 1
    HashMap.get() #Iterations : 1
    HashMap.put() #Iterations : 1
    .............................
    HashMap.put() #Iterations : 56823

Again, the added logging was quite revealing. We can see that after a few internal HashMap.resize() operations the internal structure became corrupted, creating circular dependency conditions and triggering this infinite looping condition (#iterations increasing and increasing...) with no exit condition.

It also shows that the resize() / rehash operation is the most at risk of internal corruption, especially when using the default HashMap size of 16. This means that the initial size of the HashMap appears to be a big factor in the risk & problem replication.

Finally, it is interesting to note that we were able to successfully run the test case with the non-thread-safe HashMap by setting an initial size of 1000000, preventing any resize at all. Find below the merged graph results:

The HashMap was our top performer, but only when preventing an internal resize. Again, this is definitely not a solution to the thread safety risk but just a way to demonstrate that the resize operation is the most at risk, given the manipulation of the entire HashMap performed at that time.

The ConcurrentHashMap, by far, is our overall winner by providing both fast performance and thread safety against that test case.
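For readers who want to try the safe variant in isolation, here is a minimal sketch of the same get()/put() workload against a shared ConcurrentHashMap. It is not the article's program: it assumes Java 8 or later (lambdas), uses ThreadLocalRandom instead of Math.random(), and waits with awaitTermination() rather than a busy loop. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

final class ConcurrentMapWorkload {

    // Runs the lookup/insert mix on the shared map from several threads.
    // Returns true if every worker finished before the timeout elapsed.
    static boolean run(Map<String, Integer> sharedMap, int threads, int iterations, long timeoutSeconds) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            executor.execute(() -> {
                ThreadLocalRandom random = ThreadLocalRandom.current();
                for (int i = 0; i < iterations; i++) {
                    int lookup = random.nextInt(1, 1_000_001); // 1..1000000 inclusive
                    int insert = random.nextInt(1, 1_000_001);
                    sharedMap.get(String.valueOf(lookup));
                    sharedMap.put(String.valueOf(insert), insert);
                }
            });
        }
        executor.shutdown(); // accept no new tasks, let queued ones finish
        try {
            // Blocks without burning CPU, unlike a while(!isTerminated()) loop.
            return executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new ConcurrentHashMap<>(2); // small initial size, as in the article
        long start = System.nanoTime();
        boolean finished = run(map, 3, 500_000, 60);
        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        System.out.println("Finished: " + finished + " in " + seconds + " seconds, entries: " + map.size());
    }
}
```

Swapping the ConcurrentHashMap for a plain HashMap in main() is the experiment the article warns about; with the timeout in place, a hang would surface as a false return value instead of blocking forever, although the stuck worker threads would still keep the JVM alive.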
JBoss AS7 Map data structures usage

We will now conclude this article by looking at the different Map implementations within a modern Java EE container implementation such as JBoss AS 7.1.2. You can obtain the latest source code from the github master branch. Find below the report:

- Total JBoss AS7.1.2 Java files (August 28, 2012 snapshot): 7302
- Total Java classes using java.util.Hashtable: 72
- Total Java classes using java.util.HashMap: 512
- Total Java classes using synchronized HashMap: 18
- Total Java classes using ConcurrentHashMap: 46

Hashtable references were found mainly within the test suite components and in naming and JNDI related implementations. This low usage is not a surprise here.

References to java.util.HashMap were found in 512 Java classes. Again, not a surprise given how common this implementation has been over the last several years. However, it is important to mention that a good proportion of them were either local variables (not shared across threads), synchronized HashMaps or protected by a manual synchronization safeguard, so "technically" thread safe and not exposed to the above infinite looping condition (pending/hidden bugs are still a reality given the complexity of Java concurrency programming… this case study involving Oracle Service Bus 11g is a perfect example).

A low usage of synchronized HashMap was found, with only 18 Java classes from packages such as JMS, EJB3, RMI and clustering.

Finally, find below a breakdown of the ConcurrentHashMap usage, which was our main interest here. As you will see below, this Map implementation is used by critical JBoss component layers such as the Web container, EJB3 implementation etc.

## JBoss Single Sign On
Used to manage internal SSO IDs involving concurrent Thread access.
Total: 1

## JBoss Java EE & Web Container
Not surprising here since a lot of internal Map data structures are used to manage the HTTP session objects, deployment registry, clustering & replication, statistics etc. with heavy concurrent Thread access.
Total: 11

## JBoss JNDI & Security Layer
Used by highly concurrent structures such as internal JNDI security management.
Total: 4

## JBoss domain & managed server management, rollout plans...
Total: 7

## JBoss EJB3
Used by data structures such as File Timer persistence store, application Exception, Entity Bean cache, serialization, passivation...
Total: 8

## JBoss kernel, Thread Pools & protocol management
Used by highly concurrent Map data structures involved in handling and dispatching/processing incoming requests such as HTTP.
Total: 3

## JBoss connectors such as JDBC/XA DataSources...
Total: 2

## Weld (reference implementation of JSR-299: Contexts and Dependency Injection for the Java EE platform)
Used in the context of ClassLoader and concurrent static Map data structures involving concurrent Thread access.
Total: 3

## JBoss Test Suite
Used in some integration testing test cases such as an internal Data Store, ClassLoader testing etc.
Total: 3

Final words

I hope this article has helped you revisit this classic problem and understand one of the common problems and risks associated with incorrect usage of the non-thread-safe HashMap implementation. My main recommendation to you is to be careful when using a HashMap in a concurrent threads context. Unless you are a Java concurrency expert, I recommend that you use ConcurrentHashMap instead, which offers a very good balance between performance and thread safety.

As usual, extra due diligence is always recommended, such as performing cycles of load & performance testing. This will allow you to detect thread safety and / or performance problems before you promote the solution to your client production environment.

Please provide any comments and share your experience with ConcurrentHashMap or HashMap implementations and troubleshooting.

September 7, 2012

by Pierre - Hugues Charbonneau

· 154,881 Views · 5 Likes

Algorithm of the Week: Graphs and Their Representation

Although this post is supposed to be about algorithms I’ll cover more on graphs and their computer representation.

September 4, 2012

by Stoimen Popov

· 59,348 Views · 8 Likes

Manual Test-Driven Development

Test-Driven Development is a code-level practice, based on running automated tests that are written before the production code they exercise. But practices can be applied only in the context where they were developed: when some premises are not present, it is difficult to apply TDD as-is.

Automated specification

For example, consider the premise of assertion automation: it is possible to write a (hopefully) small algorithm that is able to check the result of running production code and return true or false. In the case where the problem is:

    Draw an antialiased circle on this blank canvas. -- Carlo Pescio

it is not immediately clear how to define automated tests for this behavior. We could check that some pixels are still blank inside or outside the circle, or that there is a bounded number of black pixels; or even that they are contiguous.

An opinion I've heard (that I try not to misrepresent) is that we only need to write some looser tests in these cases, checking only a few pixels of the circle. This process will give us a little feedback on the API of our Canvas or Circle object, but not much on the algorithm we are implementing inside it. Are we going in the right direction? Have new test cases been satisfied correctly without a large intervention on the existing code? Are we painting some unrelated pixels due to a hidden bug?

What I argue here instead is that we should change the nature of the feedback mechanism. Speaking in control theory terms, change the block that acquires the output and influences the input to our design process.

Develop in the browser

When I was developing a Couchapp, a kind of web application served directly from a CouchDB database, I was appalled by the difficulty of testing it. While the production code was composed of ~100 lines, it was a complex mix of technologies: HTML and CSS code, client-side JavaScript for managing user events and some server-side JavaScript for the "queries" (actually the server side only consists of the database in Couchapps.)

Some of this logic could be tested in automation, like the result of queries over views. Yet much of it was related to a user interface, and as such required a large time investment to automate.

Instead of waking up my Selenium server and starting to manipulate a browser with code, I noticed that this UI was almost read-only; there were a few cases where a new document would have to be inserted, but a manual test of them was short and did not even require reloading the page. The whole application state was observable.

Summing it up, I performed a frequent manual test that took a few seconds instead of trying to define complex and brittle automation logic for testing the UI. Now that I've been introduced to a simple qualitative ROI model by Carlo Pescio's article, I would do the same, as the only logical conclusion, for every context where:

- a large time investment is needed for automating tests.
- it is possible to perform manual tests quickly.

A word of caution

TDD has many benefits (including catching regressions early), so I'm not prepared to give it up just because something is difficult to test. These are technical scenarios where I have successfully followed TDD by the book:

- multithreaded and multiprocess code
- applications distributed over multiple machines
- computer vision (object recognition and tracking)
- image manipulation code (via comparison testing)
- development of browser bindings for Selenium

And even in the case where the big picture is not easy to test-first (like in the case of image manipulation), we can benefit from applying TDD to the pieces of the solution. For example, in the computer vision case I wasn't able to write a test beforehand for tracking a car inside a movie. But I was able to TDD the objects that the algorithmic solution to the problem called for: Patch, Area, Cluster, Movement, and so on.

End-to-end TDD is not always cheap, but unit-level TDD can often be, if it considers testability as a relevant property (while regression testing even at the end-to-end level is always possible, in the worst case with record and replay.)

End-to-end specifications

If we can't define automated assertions for our "big picture" problem, it doesn't mean that we cannot apply the TDD approach: we can substitute a manual step for the automated assertion.

Going back to the circle problem, I would define manual test cases on an inspection page seen by a human. I've seen this done with layouts and multiple browsers to catch CSS rendering bugs, for example:

It would be very difficult to check these screenshots automatically, as each browser renders pages a bit differently from the others.

The iterative process becomes:

- Define a cheap manual test, automating the arrange and act phases but not the assertion.
- Write only the code necessary to make it pass.
- Refactor.

As long as the number of tests does not increase without limit and the manual check can be performed quickly, this approach does not slow you down with respect to TDD by the book. You'll have to take care of regression by other means; but at least you define a set of manual test cases.

Feedback!

TDD is an instrument of feedback: if feedback cannot be gathered in an automated way, we have to resort to manual checking of the specifications. Here are other examples of manual tools for generating feedback:

- Read-Eval-Print Loops: you can experiment with existing classes and functions, and easily repeat steps thanks to history.
- the browser refresh button: the fastest way to transform a PSD into an HTML and CSS template.
- MongoDB console for learning the database API; other kinds of consoles like Firebug and Chrome's, or Clojure's.

September 3, 2012

by Giorgio Sironi

· 10,256 Views

Idempotent DB Update Scripts

An idempotent function gives the same result even if it is applied several times. That is exactly how a database update script should behave. It shouldn't matter if it is run once or multiple times. The result should be the same.

A database update script should first check the state of the database and then apply the changes needed. If the script is written this way, several operations can be combined into one script that works on several databases, even when those databases start out in different (possibly unknown) states.

For the database schema itself I usually use Visual Studio 2010 database projects that handle updates automatically (in VS2012 the functionality has been changed significantly). Even with the schema updates handled automatically, there are always things that need manual handling. One common case is lookup tables that need initialization.

Lookup Table Init Script

I use a combination of a temp table and a MERGE statement to init lookup tables.

    CREATE TABLE #Colours
    (
      ColourId INT NOT NULL,
      Name NVARCHAR(10) NOT NULL
    )

    INSERT #Colours VALUES
    (1, N'Red'),
    (2, N'Green'),
    (3, N'Blue')

    MERGE Colours dst
    USING #Colours src
    ON (src.ColourId = dst.ColourId)
    WHEN MATCHED THEN
      UPDATE SET dst.Name = src.Name
    WHEN NOT MATCHED THEN
      INSERT VALUES (src.ColourId, src.Name)
    WHEN NOT MATCHED BY SOURCE THEN
      DELETE;

    DROP TABLE #Colours

I think that the temp table approach is great because it gives a clear overview in the script of what the final values will be. It also works regardless of what the current values are.

Sometimes it is relevant to keep old values, which can be done by removing the last two lines of the MERGE statement (keeping the terminating semicolon). It is also possible to flag records as inactive instead of deleting them.

    MERGE...
    ...
    WHEN NOT MATCHED BY SOURCE THEN
      UPDATE SET dst.Active = 0;

Checking Current State

An idempotent script has to be able to check the current state and adapt its behaviour. The lookup table init script uses the MERGE statement for that, checking the actual values. In most cases it is possible to check the current state by inspecting the values of the table or through the sys metadata views.

If that's not possible, a separate table can be used to log the scripts that have been run. This method has the advantage of an easy way to check what scripts have been run. The disadvantage is that it violates the DRY Principle by keeping a separate log, which can get out of sync with the actual database schema. What happens when a script is partially run and then fails before writing the log entry? What will happen the next time the script is run?

This is where a truly idempotent script shines. Whenever there is any doubt about the current state of the database, the entire script can be run again, bringing the database to a known state.
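The idempotence property itself can be illustrated outside the database. The following is a small in-memory sketch in Java (not T-SQL, and not part of the original script): a map stands in for the Colours table, and the hypothetical sync() method mirrors the three branches of the MERGE. Applying it once or several times leaves the same final state.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory analogue of the lookup table MERGE; names are hypothetical.
final class IdempotentLookupSync {

    // After the call, target contains exactly the rows in desired,
    // whatever it contained before.
    static void sync(Map<Integer, String> target, Map<Integer, String> desired) {
        // WHEN NOT MATCHED BY SOURCE THEN DELETE
        target.keySet().retainAll(desired.keySet());
        // WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT
        target.putAll(desired);
    }

    public static void main(String[] args) {
        Map<Integer, String> desired = new TreeMap<>();
        desired.put(1, "Red");
        desired.put(2, "Green");
        desired.put(3, "Blue");

        // A "database" in an unknown starting state: one stale name, one extra row.
        Map<Integer, String> colours = new TreeMap<>();
        colours.put(1, "Rouge");
        colours.put(4, "Purple");

        sync(colours, desired);
        System.out.println(colours); // {1=Red, 2=Green, 3=Blue}

        sync(colours, desired);      // running it again changes nothing
        System.out.println(colours); // {1=Red, 2=Green, 3=Blue}
    }
}
```

The point of the sketch is that the operation is defined in terms of the desired end state rather than as a list of incremental changes, which is what makes reruns harmless.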

September 3, 2012

by Anders Abel

· 11,156 Views

Building A Simple API Proxy Server with PHP

These days I'm playing with Backbone and using public APIs as a data source. The web browser has one horrible feature: it doesn't allow us to fetch resources from a host other than our own, due to the cross-origin restriction. For example, if we have a server at localhost, we cannot perform an AJAX request to any host other than localhost. Nowadays there is a header that allows it: Access-Control-Allow-Origin. The problem is that the remote server must set this header. For example, I was playing with GitHub's API, and GitHub doesn't send this header. If the server is mine, it is pretty straightforward to add this header, but obviously I'm not the sysadmin of GitHub, so I cannot do it.

What's the solution? One possible solution is, for example, to create a proxy server at localhost with PHP. With PHP we can use any remote API with cURL (I wrote about it here and here, for example). It's not difficult, but I asked myself: can we create a dummy proxy server with PHP that handles any request to localhost and redirects it to the real server, instead of creating one proxy for each request?

Let's start. Probably there is an open source solution (tell me if you know of one), but I'm on holidays and I want to code a little bit (I know, it looks insane, but that's me). The idea is:

    ...
    $proxy->register('github', 'https://api.github.com');
    ...

and when I type:

    http://localhost/github/users/gonzalo123

the proxy forwards the request to:

    https://api.github.com/users/gonzalo123

The request method is also important. If we make a POST request to localhost, we want a POST request to GitHub too. This time we're not going to reinvent the wheel, so we will use Symfony components, and Composer to start our project. We create a composer.json file with the dependencies:

    {
        "require": {
            "symfony/class-loader": "dev-master",
            "symfony/http-foundation": "dev-master"
        }
    }

Now we run:

    php composer.phar install

and we can start coding. The script will look like this:

    ...
    $proxy->register('github', 'https://api.github.com');
    $proxy->run();

    // Replay the remote response: headers first, then the body
    foreach ($proxy->getHeaders() as $header) {
        header($header);
    }
    echo $proxy->getContent();

As we can see, we can register as many servers as we want. In this example we only register GitHub. The application has only two classes: RestProxy, which extracts the information from the Request object, and CurlWrapper, which RestProxy uses to call the real server.

    class RestProxy
    {
        private $request;
        private $curl;
        private $map = array();
        private $content;
        private $headers;

        // $request: the Symfony HttpFoundation Request, $curl: the CurlWrapper
        public function __construct($request, $curl)
        {
            $this->request = $request;
            $this->curl    = $curl;
        }

        public function register($name, $url)
        {
            $this->map[$name] = $url;
        }

        public function run()
        {
            // Try every registered server and stop at the first one that matches
            foreach ($this->map as $name => $mapUrl) {
                if ($this->dispatch($name, $mapUrl)) {
                    return;
                }
            }
        }

        // Returns true when the request path starts with "/$name" and was forwarded
        private function dispatch($name, $mapUrl)
        {
            $url = $this->request->getPathInfo();
            if (strpos($url, $name) !== 1) {
                return false;
            }

            $url         = $mapUrl . str_replace("/{$name}", '', $url);
            $queryString = $this->request->getQueryString();

            // Mirror the incoming HTTP method on the outgoing request
            switch ($this->request->getMethod()) {
                case 'GET':
                    $this->content = $this->curl->doGet($url, $queryString);
                    break;
                case 'POST':
                    $this->content = $this->curl->doPost($url, $queryString);
                    break;
                case 'DELETE':
                    $this->content = $this->curl->doDelete($url, $queryString);
                    break;
                case 'PUT':
                    $this->content = $this->curl->doPut($url, $queryString);
                    break;
            }
            $this->headers = $this->curl->getHeaders();

            return true;
        }

        public function getHeaders()
        {
            return $this->headers;
        }

        public function getContent()
        {
            return $this->content;
        }
    }

RestProxy receives two instances in the constructor via dependency injection (CurlWrapper and Request). This architecture helps a lot in the tests, because we can mock both instances. That was very helpful when building RestProxy.

RestProxy is registered on Packagist, so we can install it using Composer. First install Composer:

    curl -s https://getcomposer.org/installer | php

and create a new project:

    php composer.phar create-project gonzalo123/rest-proxy proxy

If we are using PHP 5.4 (if not, what are you waiting for?) we can run the built-in server:

    cd proxy
    php -S localhost:8888 -t www/

Now we only need to open a web browser and type:

    http://localhost:8888/github/users/gonzalo123

The library is very minimal (it's enough for my experiment) and it doesn't support authorization. Of course the full code is available on GitHub.
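The routing idea behind the proxy (take the first path segment, look it up, and glue the rest of the path onto the registered base URL) is independent of PHP. Here is a minimal sketch of just that mapping step in Java. The class name ProxyRouteTable and its methods are made up for illustration and are not part of the rest-proxy library; no HTTP call is made here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: resolves a local request path to the remote URL it should be forwarded to. */
final class ProxyRouteTable {
    // Registration order is kept so the first registered match wins.
    private final Map<String, String> routes = new LinkedHashMap<>();

    /** Registers a short name (e.g. "github") for a remote base URL. */
    void register(String name, String baseUrl) {
        routes.put(name, baseUrl);
    }

    /** Returns the remote URL for a path such as "/github/users/gonzalo123", if any route matches. */
    Optional<String> resolve(String pathInfo) {
        for (Map.Entry<String, String> route : routes.entrySet()) {
            String prefix = "/" + route.getKey();
            // Match "/github" or "/github/..." but not "/githubfoo".
            if (pathInfo.equals(prefix) || pathInfo.startsWith(prefix + "/")) {
                return Optional.of(route.getValue() + pathInfo.substring(prefix.length()));
            }
        }
        return Optional.empty();
    }
}
```

With "github" registered for https://api.github.com, resolving /github/users/gonzalo123 yields https://api.github.com/users/gonzalo123, and an unregistered prefix yields an empty Optional. Matching on a whole path segment is a deliberate choice in this sketch, so that a route named "git" would not accidentally capture "/github/...".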

September 2, 2012

by Gonzalo Ayuso

· 20,291 Views

Password Encryption -- Short Answer: Don't.

First, read this: "Why passwords have never been weaker—and crackers have never been stronger." There are numerous important lessons in this article.

One of the small lessons is that changing your password every sixty or ninety days is farcical. The rainbow table algorithms can crack a badly-done password in minutes. Every 60 days, the cracker has to spend a few minutes breaking your new password. Why bother changing it? It only annoys the haxorz; they'll be using your account within a few minutes. However, that practice is now so ingrained that it's difficult to dislodge from the heads of security consultants.

The big lesson, however, is profound.

Work Experience

Recently, I got a request from a developer on how to encrypt a password. We have a Python back-end and the developer was asking which crypto package to download and how to install it.

"Crypto?" I asked. "Why do we need crypto?"

"To encrypt passwords," they replied.

I spat coffee on my monitor. I felt like hitting Caps Lock in the chat window so I could respond like this: "NEVER ENCRYPT A PASSWORD, YOU DOLT." I didn't, but I felt like it.

Much Confusion

The conversation took hours. Chat can be slow that way. Also, I can be slow because I need to understand what's going on before I reply. I'm a slow thinker. But the developer also needed to try stuff and provide concrete code examples, which takes time.

At the time, I knew that passwords must be hashed with salt. I hadn't read the Ars Technica article cited above, so I didn't know why computationally intensive hash algorithms are best for this.

We had to discuss hash algorithms. We had to discuss algorithms for generating unique salt. We had to discuss random number generators and how to use an entropy source for a seed. We had to discuss http://www.ietf.org/rfc/rfc2617.txt in some depth, since the algorithms in section 3.2.2 show some best practices in creating hash summaries of usernames, passwords, and realms.

All of these were, of course, side topics before we got to the heart of the matter.

What's Been Going On

After several hours, my "why" questions started revealing things. The specific user story, for example, was slow to surface. Why? Partly because I didn't demand it early enough. But also, many technology folks will conceive of a "solution" and pursue that technical concept no matter how difficult or bizarre. In some cases, the concept doesn't really solve the problem.

I call this the "Rat Holes of Lost Time" phenomenon: we chase some concept through numerous little rat-holes before we realize there's a lot of activity but no tangible progress. There's a perceptual narrowing that occurs when we focus on the technology. Often, we're not actually solving the problem. IT people leap past the problem into the solution as naturally as they breathe. It's a hard habit to break.

It turned out that they were creating some additional RESTful web services. They knew that the RESTful requests needed proper authentication. But they were vague on the details of how to secure the new RESTful services. So they were chasing down their concept: encrypt a password and provide this encrypted password with each request.

They were half right here. A secure "token" is required. But an encrypted password is a terrible token.

Use The Framework, Luke

What's most disturbing about this is the developer's blind spot. For some reason, the existence of other web services didn't enter into this developer's head. Why didn't they read the code for the services created on earlier sprints?

We're using Django. We already have a RESTful web services framework with a complete (and high quality) security implementation. Nothing more is required. Use the RESTful authentication already part of Django.

In most cases, HTTPS is used to encrypt at the socket layer. This means that Basic Authentication is all that's required. This is a huge simplification, since all the RESTful frameworks already offer this.

The Django Rest Framework has a nice authentication module. When using Piston, it's easy to work with their Authentication handler.

It's possible to make RESTful requests with Digest Authentication, if SSL is not being used. For example, Akoha handles this. It's easy to extend a framework to add Digest in addition to Basic authentication.

For other customers, I created an authentication handler between Piston and ForgeRock OpenAM so that OpenAM tokens were used with each RESTful request. (This requires some care to create a solution that is testable.)

Bottom Lines

- Don't encrypt passwords. Ever.
- Don't write your own hash and salt algorithm. Use a framework that offers this to you.
- Read the Ars Technica article before doing anything password-related.
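To make "hashed with salt" and "computationally intensive" concrete, here is a minimal sketch in Java that relies on the JDK's built-in PBKDF2 implementation rather than a hand-written algorithm. It is not what Django does internally and it is not a substitute for the framework's authentication. The class name PasswordHasher, the iteration count, and the salt and key sizes are assumptions chosen for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.security.spec.InvalidKeySpecException;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

/** Hypothetical helper: salted, deliberately slow password hashing via the JDK's PBKDF2. */
final class PasswordHasher {
    private static final int ITERATIONS = 100_000;   // assumed work factor; tune for your hardware
    private static final int KEY_LENGTH_BITS = 256;  // size of the derived hash
    private static final int SALT_LENGTH_BYTES = 16;

    /** Generates a fresh random salt; store it next to the hash, one salt per password. */
    static byte[] newSalt() {
        byte[] salt = new byte[SALT_LENGTH_BYTES];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    /** Derives the hash to store. There is no way to decrypt it back into the password. */
    static byte[] hash(char[] password, byte[] salt) {
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_LENGTH_BITS);
        try {
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } catch (NoSuchAlgorithmException | InvalidKeySpecException e) {
            throw new IllegalStateException("PBKDF2 is not available on this JVM", e);
        } finally {
            spec.clearPassword(); // wipe the spec's internal copy of the password
        }
    }

    /** Re-hashes the candidate with the stored salt and compares the two hashes. */
    static boolean verify(char[] candidate, byte[] salt, byte[] expectedHash) {
        return MessageDigest.isEqual(hash(candidate, salt), expectedHash);
    }
}
```

The point of the sketch is the shape of the data: what gets stored is a salt and a hash, and checking a login means hashing again and comparing. Nothing is ever decrypted, which is why "which crypto package do I install?" was the wrong question.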

August 28, 2012

by Steven Lott

· 21,822 Views

Adding Hibernate Entity Level Filtering feature to Spring Data JPA Repository

Original Article: http://borislam.blogspot.hk/2012/07/adding-hibernate-entity-level-filter.html

Those who have used the data filtering features of Hibernate know that they are very powerful. You can define a set of filtering criteria on an entity class or a collection. Spring Data JPA is a very handy library, but it does not have filtering features. In this post, I will demonstrate how to add the Hibernate filter features at entity level. You can use this feature when you are using Hibernate EntityManager. We just define annotations in the repository interface to enable it.

Step 1. Define the filter at entity level as usual. Just use the Hibernate @FilterDef annotation.

    @Entity
    @Table(name = "STUDENT")
    @FilterDef(name = "filterBySchoolAndClass",
               parameters = {
                   @ParamDef(name = "school", type = "string"),
                   // declared as a string so that it accepts values such as "S5" used below
                   @ParamDef(name = "class", type = "string")
               })
    public class Student extends GenericEntity implements Serializable {
        // add your properties ...
    }

Step 2. Define two custom annotations. These two annotations are to be used in your repository interfaces. Through them you can apply the Hibernate filter defined in step 1 to specific queries.

    @Target(ElementType.TYPE)
    @Retention(RetentionPolicy.RUNTIME)
    public @interface EntityFilter {
        FilterQuery[] filterQueries() default {};
    }

    @Retention(RetentionPolicy.RUNTIME)
    public @interface FilterQuery {
        String name() default "";
        String jpql() default "";
    }

Step 3. Add a method to your Spring Data JPA base repository. This method reads the annotations you defined (i.e. @FilterQuery) and applies the Hibernate filter to the query by simply unwrapping the EntityManager. You can specify both the parameters of your Hibernate filter and the parameters of your query in this method.

If you do not know how to add a custom method to your Spring Data JPA base repository, please see my previous article on how to customize your Spring Data JPA base repository. You can see in that article that I intentionally expose the repository interface (i.e. the springDataRepositoryInterface property) in GenericRepositoryImpl. This small trick enables me to access the annotations on the repository interface easily.

    public List doQueryWithFilter(String filterName, String filterQueryName,
                                  Map inFilterParams, Map inQueryParams) {
        if (GenericRepository.class.isAssignableFrom(getSpringDataRepositoryInterface())) {
            Annotation entityFilterAnn =
                getSpringDataRepositoryInterface().getAnnotation(EntityFilter.class);
            if (entityFilterAnn != null) {
                EntityFilter entityFilter = (EntityFilter) entityFilterAnn;
                FilterQuery[] filterQuerys = entityFilter.filterQueries();
                for (FilterQuery fQuery : filterQuerys) {
                    if (StringUtils.equals(filterQueryName, fQuery.name())) {
                        String jpql = fQuery.jpql();
                        // enable the Hibernate filter on the underlying Session
                        Filter filter = em.unwrap(Session.class).enableFilter(filterName);
                        // set filter parameters
                        for (Object key : inFilterParams.keySet()) {
                            String filterParamName = key.toString();
                            Object filterParamValue = inFilterParams.get(key);
                            filter.setParameter(filterParamName, filterParamValue);
                        }
                        // set query parameters
                        Query query = em.createQuery(jpql);
                        for (Object key : inQueryParams.keySet()) {
                            String queryParamName = key.toString();
                            Object queryParamValue = inQueryParams.get(key);
                            query.setParameter(queryParamName, queryParamValue);
                        }
                        return query.getResultList();
                    }
                }
            }
        }
        // no matching @FilterQuery was found on the repository interface
        return null;
    }

Last step: example usage

In your repository, define which queries should have the Hibernate filter applied through your @EntityFilter and @FilterQuery annotations.

    @EntityFilter(
        filterQueries = {
            @FilterQuery(name = "query1",
                         jpql = "SELECT s FROM Student s LEFT JOIN FETCH s.Subject where s.subject = :subject"),
            @FilterQuery(name = "query2",
                         jpql = "SELECT s FROM Student s LEFT JOIN s.TeacherSubject where s.teacher = :teacher")
        }
    )
    public interface StudentRepository extends GenericRepository {
    }

In the service or business class that injects your repository, you can simply call the doQueryWithFilter() method to enable the filtering function.

    @Service
    public class StudentService {
        @Inject
        private StudentRepository studentRepository;

        // "class" is a reserved word in Java, so the third parameter is named schoolClass
        public List searchStudent(String subject, String school, String schoolClass) {
            List studentList;
            // Prepare parameters for the Hibernate filter
            HashMap inFilterParams = new HashMap();
            inFilterParams.put("school", "Hong Kong Secondary School");
            inFilterParams.put("class", "S5");
            // Prepare parameters for the query
            HashMap inParams = new HashMap();
            inParams.put("subject", "Physics");
            studentList = studentRepository.doQueryWithFilter(
                "filterBySchoolAndClass", "query1", inFilterParams, inParams);
            return studentList;
        }
    }

August 24, 2012

by Boris Lam

· 56,831 Views · 1 Like

Spring Data, Spring Security and Envers integration

Learn about pros, cons, and basics of Spring security and data, plus Envers integration.

August 20, 2012

by Nicolas Fränkel

· 25,043 Views · 1 Like

EF Migrations Command Reference

Entity Framework Migrations are handled from the package manager console in Visual Studio. The usage is shown in various tutorials, but I haven't found a complete list of the commands available and their usage, so I created my own. There are four available commands.

- Enable-Migrations: Enables Code First Migrations in a project.
- Add-Migration: Scaffolds a migration script for any pending model changes.
- Update-Database: Applies any pending migrations to the database.
- Get-Migrations: Displays the migrations that have been applied to the target database.

The information here is the output of running get-help command-name -detailed for each of the commands in the package manager console (running EF 4.3.1). I've also added some comments of my own where I think some information is missing. My own comments are placed under the Additional Information heading.

Please note that all commands should be entered on the same line. I've added line breaks to avoid horizontal scrollbars.

Enable-Migrations

Enables Code First Migrations in a project.

Syntax

    Enable-Migrations [-EnableAutomaticMigrations] [[-ProjectName] <String>]
      [-Force] [<CommonParameters>]

Description

Enables Migrations by scaffolding a migrations configuration class in the project. If the target database was created by an initializer, an initial migration will be created (unless automatic migrations are enabled via the EnableAutomaticMigrations parameter).

Parameters

-EnableAutomaticMigrations
    Specifies whether automatic migrations will be enabled in the scaffolded migrations configuration. If omitted, automatic migrations will be disabled.

-ProjectName
    Specifies the project that the scaffolded migrations configuration class will be added to. If omitted, the default project selected in package manager console is used.

-Force
    Specifies that the migrations configuration be overwritten when running more than once for given project.

This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks

To see the examples, type: get-help Enable-Migrations -examples.
For more information, type: get-help Enable-Migrations -detailed.
For technical information, type: get-help Enable-Migrations -full.

Additional Information

The flag for enabling automatic migrations is saved in the Migrations\Configuration.cs file, in the constructor. To change the option later, just change the assignment in the file.

    public Configuration()
    {
        AutomaticMigrationsEnabled = false;
    }

Add-Migration

Scaffolds a migration script for any pending model changes.

Syntax

    Add-Migration [-Name] <String> [-Force] [-ProjectName <String>]
      [-StartUpProjectName <String>] [-ConfigurationTypeName <String>]
      [-ConnectionStringName <String>] [-IgnoreChanges] [<CommonParameters>]

    Add-Migration [-Name] <String> [-Force] [-ProjectName <String>]
      [-StartUpProjectName <String>] [-ConfigurationTypeName <String>]
      -ConnectionString <String> -ConnectionProviderName <String>
      [-IgnoreChanges] [<CommonParameters>]

Description

Scaffolds a new migration script and adds it to the project.

Parameters

-Name
    Specifies the name of the custom script.

-Force
    Specifies that the migration user code be overwritten when re-scaffolding an existing migration.

-ProjectName
    Specifies the project that contains the migration configuration type to be used. If omitted, the default project selected in package manager console is used.

-StartUpProjectName
    Specifies the configuration file to use for named connection strings. If omitted, the specified project's configuration file is used.

-ConfigurationTypeName
    Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project.

-ConnectionStringName
    Specifies the name of a connection string to use from the application's configuration file.

-ConnectionString
    Specifies the connection string to use. If omitted, the context's default connection will be used.

-ConnectionProviderName
    Specifies the provider invariant name of the connection string.

-IgnoreChanges
    Scaffolds an empty migration ignoring any pending changes detected in the current model. This can be used to create an initial, empty migration to enable Migrations for an existing database. N.B. Doing this assumes that the target database schema is compatible with the current model.

This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks

To see the examples, type: get-help Add-Migration -examples.
For more information, type: get-help Add-Migration -detailed.
For technical information, type: get-help Add-Migration -full.

Update-Database

Applies any pending migrations to the database.

Syntax

    Update-Database [-SourceMigration <String>] [-TargetMigration <String>]
      [-Script] [-Force] [-ProjectName <String>] [-StartUpProjectName <String>]
      [-ConfigurationTypeName <String>] [-ConnectionStringName <String>]
      [<CommonParameters>]

    Update-Database [-SourceMigration <String>] [-TargetMigration <String>]
      [-Script] [-Force] [-ProjectName <String>] [-StartUpProjectName <String>]
      [-ConfigurationTypeName <String>] -ConnectionString <String>
      -ConnectionProviderName <String> [<CommonParameters>]

Description

Updates the database to the current model by applying pending migrations.

Parameters

-SourceMigration
    Only valid with -Script. Specifies the name of a particular migration to use as the update's starting point. If omitted, the last applied migration in the database will be used.

-TargetMigration
    Specifies the name of a particular migration to update the database to. If omitted, the current model will be used.

-Script
    Generate a SQL script rather than executing the pending changes directly.

-Force
    Specifies that data loss is acceptable during automatic migration of the database.

-ProjectName
    Specifies the project that contains the migration configuration type to be used. If omitted, the default project selected in package manager console is used.

-StartUpProjectName
    Specifies the configuration file to use for named connection strings. If omitted, the specified project's configuration file is used.

-ConfigurationTypeName
    Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project.

-ConnectionStringName
    Specifies the name of a connection string to use from the application's configuration file.

-ConnectionString
    Specifies the connection string to use. If omitted, the context's default connection will be used.

-ConnectionProviderName
    Specifies the provider invariant name of the connection string.

This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks

To see the examples, type: get-help Update-Database -examples.
For more information, type: get-help Update-Database -detailed.
For technical information, type: get-help Update-Database -full.

Additional Information

The command always runs any pending code-based migrations first. If the database is still incompatible with the model, the additional changes required are applied as a separate automatic migration step if automatic migrations are enabled. If automatic migrations are disabled, an error message is shown.

Get-Migrations

Displays the migrations that have been applied to the target database.

Syntax

    Get-Migrations [-ProjectName <String>] [-StartUpProjectName <String>]
      [-ConfigurationTypeName <String>] [-ConnectionStringName <String>]
      [<CommonParameters>]

    Get-Migrations [-ProjectName <String>] [-StartUpProjectName <String>]
      [-ConfigurationTypeName <String>] -ConnectionString <String>
      -ConnectionProviderName <String> [<CommonParameters>]

Description

Displays the migrations that have been applied to the target database.

Parameters

-ProjectName
    Specifies the project that contains the migration configuration type to be used. If omitted, the default project selected in package manager console is used.

-StartUpProjectName
    Specifies the configuration file to use for named connection strings. If omitted, the specified project's configuration file is used.

-ConfigurationTypeName
    Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project.

-ConnectionStringName
    Specifies the name of a connection string to use from the application's configuration file.

-ConnectionString
    Specifies the connection string to use. If omitted, the context's default connection will be used.

-ConnectionProviderName
    Specifies the provider invariant name of the connection string.

This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks

To see the examples, type: get-help Get-Migrations -examples.
For more information, type: get-help Get-Migrations -detailed.
For technical information, type: get-help Get-Migrations -full.

Additional Information

The PowerShell commands are complex PowerShell functions, located in the tools\EntityFramework.psm1 file of the Entity Framework installation. The PowerShell code is mostly a wrapper around the System.Data.Entity.Migrations.MigrationsCommands found in the tools\EntityFramework\EntityFramework.PowerShell.dll file. First a MigrationsCommands object is instantiated with all configuration parameters. Then there is a public method on the MigrationsCommands object for each of the available commands.

August 20, 2012

by Anders Abel

· 31,378 Views · 1 Like

How to Migrate Drupal to Azure Web Sites

DrupalCon Munich is next week, and I am lucky enough to be going. As part of preparing for the conference, I thought it would be worthwhile to see just how easy (or difficult) it would be to migrate an existing Drupal site to Windows Azure Web Sites. So, in this post, I'll do just that.

Fortunately, because Windows Azure Web Sites supports both PHP and MySQL, the migration process is relatively straightforward. And, because Drupal and PHP run on any platform, the process I'll describe should work for moving Drupal to Windows Azure Web Sites regardless of what platform you are moving from. Of course, Drupal installations can vary widely, so YMMV. I tested the instructions below on a relatively small (and simple) Drupal installation running on CentOS 5. (Unfortunately, I won't be using Drush since it isn't supported on Windows Azure Web Sites.) If you are considering moving a large and complex Drupal application, you may want to consider moving to Windows Azure Cloud Services (more information about that here: Migrating a Drupal Site from LAMP to Windows Azure).

Before getting started, it's worth noting that Windows Azure Web Sites lets you run up to 10 web sites for free in a multitenant environment. And, you can seamlessly upgrade to private, reserved VM instances as your traffic grows. To sign up, try the Windows Azure 90-day free trial.

1. Create a Windows Azure Web Site and MySQL database

There is a step-by-step tutorial on http://www.windowsazure.com that walks you through creating a new website and a MySQL database, so I'll refer you there to get started: Create a PHP-MySQL Windows Azure web site and deploy using Git. If you intend to use Git to publish your Drupal site, then go ahead and follow the instructions for setting up a Git repository. Make sure to follow the instructions in the "Get remote MySQL connection information" section, as you will need that information later. You can ignore the remainder of the tutorial for the purposes of deploying your Drupal site, but if you are new to Windows Azure Web Sites (and to Git), you might find the additional reading informative.

OK, now you have a new website with a MySQL database, you have your MySQL database connection information, and you have (optionally) created a remote Git repository and made note of the Git deployment instructions. Now you are ready to copy your database to MySQL in Windows Azure Web Sites.

2. Copy database to MySQL in Windows Azure Web Sites

I'm sure there is more than one way to copy your Drupal database, but I found the mysqldump tool to be effective and easy to use. To copy from a local machine to Windows Azure Web Sites, here's the command I used:

    mysqldump -u local_username --password=local_password drupal | mysql -h remote_host -u remote_username --password=remote_password remote_db_name

You will, of course, have to provide the username and password for your existing Drupal database, and you will have to provide the hostname, username, password, and database name for the MySQL database you created in step 1. This information is available in the connection string that you should have noted in step 1. That is, you should have a connection string that looks something like this:

    Database=remote_db_name;Data Source=remote_host;User Id=remote_username;Password=remote_password

Depending on the size of your database, the copying process could take several minutes.

Now your Drupal database is live in Windows Azure Web Sites. Before you deploy your Drupal code, you need to modify it so it can connect to the new database.

3. Modify database connection info in settings.php

Here, you will again need your new database connection information. Open the /drupal/sites/default/settings.php file in your favorite text editor, and replace the values of 'database', 'username', 'password', and 'host' in the $databases array with the correct values for your new database. When you are finished, you should have something similar to this:

    $databases = array (
      'default' =>
      array (
        'default' =>
        array (
          'database' => 'remote_db_name',
          'username' => 'remote_username',
          'password' => 'remote_password',
          'host' => 'remote_host',
          'port' => '',
          'driver' => 'mysql',
          'prefix' => '',
        ),
      ),
    );

Be sure to save the settings.php file; then you are ready to deploy.

4. Deploy Drupal code using Git or FTP

The last step is to deploy your code to Windows Azure Web Sites using Git or FTP.

If you are using FTP, you can get the FTP hostname and username from your website's dashboard. Then, use your favorite FTP client to upload your Drupal files to the /site/wwwroot folder of the remote site.

If you are using Git, you need to set up a Git repository in Windows Azure Web Sites (steps for this are in the tutorial mentioned earlier). And, you will need Git installed on your local machine. Then, just follow the instructions provided after you created the repository.

One note about using Git here: depending on your Git settings and your .gitignore file (a hidden file and a sibling to the .git folder created in your local root directory after you executed git commit), some files in your Drupal application may be ignored. In my case, all the files in the sites directory were ignored. If this happens, you will want to edit the .gitignore file so that these files aren't ignored, and redeploy.

After you have deployed Drupal to Windows Azure Web Sites, you can continue to deploy updates via Git or FTP.

Related information

If you are looking for more information about Windows Azure Web Sites, these posts might be helpful:

- Windows Azure Websites: A PHP Perspective
- Windows Azure Websites, Web Roles, and VMs: When to use which
- Configuring PHP in Windows Azure Websites with .user.ini Files

One last thing you might consider, depending on your site, is using the Windows Azure Integration Module to store and serve your site's media files.
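Steps 2 and 3 both come down to pulling the same four values out of the connection string from step 1. If you would rather not copy them by hand, a small helper along these lines can split the string for you. This is only a sketch: the class name ConnectionStringParser is made up, and it assumes the simple Key=Value;Key=Value layout shown above, with no semicolons inside the values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits a "Key=Value;Key=Value" connection string into its fields. */
final class ConnectionStringParser {
    static Map<String, String> parse(String connectionString) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : connectionString.split(";")) {
            int separator = pair.indexOf('=');
            if (separator <= 0) {
                continue; // skip empty or malformed segments
            }
            // Split on the first '=' only, so values may themselves contain '='.
            String key = pair.substring(0, separator).trim();
            String value = pair.substring(separator + 1).trim();
            fields.put(key, value);
        }
        return fields;
    }
}
```

For the example string above, the "Data Source" entry is the value for mysql -h and for 'host' in settings.php, "User Id" maps to -u and 'username', "Password" to --password and 'password', and "Database" to the database name.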

August 19, 2012

by Brian Swan

· 10,246 Views

Machine Learning: Measuring Similarity and Distance

Measuring similarity or distance between two data points is fundamental to many Machine Learning algorithms such as K-Nearest-Neighbor, Clustering ... etc.

August 10, 2012

by Ricky Ho

· 54,305 Views · 6 Likes

Generate a Random Alpha Numeric String

Generate a random alphanumeric string whose length is the number of characters specified. Characters are chosen from the set of upper-case letters and digits. count is the length of the random string to create.

    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public static String randomAlphaNumeric(int count) {
        StringBuilder builder = new StringBuilder();
        // "> 0" rather than "!= 0" so that a negative count yields an empty string
        while (count-- > 0) {
            // pick a random index into the character set
            int character = (int) (Math.random() * ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(character));
        }
        return builder.toString();
    }
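Math.random() is fine for test data, but it is not meant for values that must be hard to guess, such as tokens or temporary passwords. For those cases, a variant built on java.security.SecureRandom is a small change. The sketch below keeps the same character set; the class name SecureAlphaNumeric is made up for illustration.

```java
import java.security.SecureRandom;

/** Hypothetical variant of randomAlphaNumeric that draws from a cryptographically strong source. */
final class SecureAlphaNumeric {
    private static final String ALPHA_NUMERIC_STRING = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    static String randomAlphaNumeric(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative");
        }
        StringBuilder builder = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            // nextInt(bound) returns a uniformly distributed index in [0, bound)
            int index = RANDOM.nextInt(ALPHA_NUMERIC_STRING.length());
            builder.append(ALPHA_NUMERIC_STRING.charAt(index));
        }
        return builder.toString();
    }
}
```

Calling SecureAlphaNumeric.randomAlphaNumeric(12) returns a 12-character string of upper-case letters and digits. Using nextInt(bound) also avoids the floating-point multiplication and cast of the original.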

August 9, 2012

by Kunal Bhatia

· 164,556 Views · 3 Likes

String Memory Internals

This article is based on my answer on StackOverflow. I am trying to explain how String class stores the texts, how interning and constant pool works. The main point to understand here is the distinction between String Java object and its contents - char[] under private value field. String is basically a wrapper around char[] array, encapsulating it and making it impossible to modify so the String can remain immutable. Also the String class remembers which parts of this array is actually used (see below). This all means that you can have two different String objects (quite lightweight) pointing to the same char[]. I will show you few examples, together with hashCode() of each String and hashCode() of internal char[] value field (I will call it text to distinguish it from string). Finally I'll show javap -c -verbose output, together with constant pool for my test class. Please do not confuse class constant pool with string literal pool. They are not quite the same. See also Understanding javap's output for the Constant Pool Prerequisites For the purpose of testing I created such a utility method that breaks String encapsulation: private int showInternalCharArrayHashCode(String s) { final Field value = String.class.getDeclaredField("value"); value.setAccessible(true); return value.get(s).hashCode(); } It will print hashCode() of char[] value, effectively helping us understand whether this particular String points to the same char[] text or not. Two string literals in a class Let's start from the simplest example. Java code String one = "abc"; String two = "abc"; BTW if you simply write "ab" + "c", Java compiler will perform concatenation at compile time and the generated code will be exactly the same. This only works if all strings are known at compile time. Class constant pool Each class has its own constant pool - a list of constant values that can be reused if they occur several times in the source code. It includes common strings, numbers, method names, etc. 
Here are the contents of the constant pool in our example above: const #2 = String #38; // abc //... const #38 = Asciz abc; The important thing to note is the distinction between String constant object (#2) and Unicode encoded text "abc" (#38) that the string points to. Byte code Here is generated byte code. Note that both one and two references are assigned with the same #2 constant pointing to "abc" string: ldc #2; //String abc astore_1 //one ldc #2; //String abc astore_2 //two Output For each example I am printing the following values: System.out.println("one.value: " + showInternalCharArrayHashCode(one)); System.out.println("two.value: " + showInternalCharArrayHashCode(two)); System.out.println("one" + System.identityHashCode(one)); System.out.println("two" + System.identityHashCode(two)); No surprise that both pairs are equal: one.value: 23583040 two.value: 23583040 one: 8918249 two: 8918249 Which means that not only both objects point to the same char[] (the same text underneath) so equals() test will pass. But even more, one and two are the exact same references! So one == two is true as well. Obviously if one and two point to the same object then one.value and two.value must be equal. Literal and new String() Java code Now the example we all waited for - one string literal and one new String using the same literal. How will this work? String one = "abc"; String two = new String("abc"); The fact that "abc" constant is used two times in the source code should give you some hint... Class constant pool Same as above. Byte code ldc #2; //String abc astore_1 //one new #3; //class java/lang/String dup ldc #2; //String abc invokespecial #4; //Method java/lang/String."":(Ljava/lang/String;)V astore_2 //two Look carefully! The first object is created the same way as above, no surprise. It just takes a constant reference to already created String (#2) from the constant pool. However the second object is created via normal constructor call. But! 
The first String is passed as an argument. This can be decompiled to: String two = new String(one); Output The output is a bit surprising. The second pair, representing references to String object is understandable - we created two String objects - one was created for us in the constant pool and the second one was created manually for two. But why, on earth the first pair suggests that both String objects point to the same char[] value array?! one.value: 41771 two.value: 41771 one: 8388097 two: 16585653 It becomes clear when you look at how String(String) constructor works (greatly simplified here): public String(String original) { this.offset = original.offset; this.count = original.count; this.value = original.value; } See? When you are creating new String object based on existing one, it reuses char[] value. Strings are immutable, there is no need to copy data structure that is known to be never modified. Moreover, since new String(someString) creates an exact copy of existing string and strings are immutable, there is clearly no reason for the two to exist at the same time. I think this is the clue of some misunderstandings: even if you have two String objects, they might still point to the same contents. And as you can see the String object itself is quite small. 
Runtime modification and intern() Java code Let's say you initially used two different strings but after some modifications they are all the same: String one = "abc"; String two = "?abc".substring(1); //also two = "abc" The Java compiler (at least mine) is not clever enough to perform such operation at compile time, have a look: Class constant pool Suddenly we ended up with two constant strings pointing to two different constant texts: const #2 = String #44; // abc const #3 = String #45; // ?abc const #44 = Asciz abc; const #45 = Asciz ?abc; Byte Code ldc #2; //String abc astore_1 //one ldc #3; //String ?abc iconst_1 invokevirtual #4; //Method String.substring:(I)Ljava/lang/String; astore_2 //two The fist string is constructed as usual. The second is created by first loading the constant "?abc" string and then calling substring(1) on it. Output No surprise here - we have two different strings, pointing to two different char[] texts in memory: one.value: 27379847 two.value: 7615385 one: 8388097 two: 16585653 Well, the texts aren't really different, equals() method will still yield true. We have two unnecessary copies of the same text. Now we should run two exercises. First, try running: two = two.intern(); before printing hash codes. Not only both one and two point to the same text, but they are the same reference! one.value: 11108810 two.value: 11108810 one: 15184449 two: 15184449 This means both one.equals(two) and one == two tests will pass. Also we saved some memory because "abc" text appears only once in memory (the second copy will be garbage collected). The second exercise is slightly different, check out this: String one = "abc"; String two = "abc".substring(1); Obviously one and two are two different objects, pointing to two different texts. But how come the output suggests that they both point to the same char[] array?!? one.value: 23583040two.value: 23583040one: 11108810two: 8918249 I'll leave the answer to you. 
It'll teach you how substring() works, what the advantages of such an approach are and when it can lead to big trouble.

Lessons learnt

- The String object itself is rather cheap. It's the text it points to that consumes most of the memory
- String is just a thin wrapper around char[] to preserve immutability
- new String("abc") isn't really that expensive as the internal text representation is reused. But still avoid such a construct.
- When a String is concatenated from constant values known at compile time, concatenation is done by the compiler, not by the JVM
- substring() is tricky, but most importantly, it is very cheap, both in terms of used memory and run time (constant in both cases)
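The first intern() exercise can be reproduced without looking at hash codes at all, because reference comparison already tells the story. This is a small sketch under that assumption; the class name InternDemo and the helper runtimeAbc are hypothetical names chosen for the example.

```java
final class InternDemo {

    /** Builds "abc" at run time, so the result is not the pooled literal. */
    static String runtimeAbc() {
        return "?abc".substring(1);
    }

    public static void main(String[] args) {
        String one = "abc";        // literal, lives in the string pool
        String two = runtimeAbc(); // equal text, different object

        System.out.println(one.equals(two)); // true
        System.out.println(one == two);      // false

        // intern() returns the pooled instance for this text,
        // which is exactly the object the literal refers to.
        two = two.intern();
        System.out.println(one == two);      // true
    }
}
```

After the intern() call the run-time copy is no longer referenced and becomes eligible for garbage collection, which matches the memory saving described in the text.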

August 8, 2012

by Tomasz Nurkiewicz


tcpdump: Learning how to read UDP packets

Use tcpdump to capture any UDP packets on port 8125.
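To have something to look at while a capture such as tcpdump -i lo -n udp port 8125 is running, it helps to send a datagram to that port yourself. The sketch below does that from Java. Everything beyond the port number is an assumption made for illustration: the loopback target, the interface name lo in the command above, the class name UdpProbe and the statsd-style payload.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

final class UdpProbe {

    static final int PORT = 8125; // the port the capture is filtering on

    /** Wraps the payload in a datagram addressed to target:port. */
    static DatagramPacket buildPacket(String payload, InetAddress target, int port) {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        return new DatagramPacket(data, data.length, target, port);
    }

    public static void main(String[] args) throws Exception {
        InetAddress target = InetAddress.getLoopbackAddress();
        // UDP is connectionless: the send succeeds even if nothing listens.
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(buildPacket("demo.counter:1|c", target, PORT));
        }
        System.out.println("sent one UDP datagram to port " + PORT);
    }
}
```

With the capture running on the loopback interface, each run of this program should show up as one UDP packet whose payload is the text passed to buildPacket.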

August 7, 2012

by Mark Needham


Using Multiple Versions of JDK and Eclipse in Single Machine

In my office laptop, I have installed two versions of the JDK. For office work, I need JDK6 because the internal framework needs it. I’m using JDK7 for my personal projects and for exploring the latest and greatest in Java. I have two versions of Eclipse too (one for office work and one is the latest Juno). But the tricky thing is to manage these multiple JDKs and IDEs. It’s a piece of cake if I just use Eclipse for compiling my code, because the IDE allows me to configure multiple versions of the Java runtime. Unfortunately (or fortunately), I have to use the command line/shell to build my code. So, it is important that I have the right version of the JDK present in the PATH and other related environment variables (such as JAVA_HOME). Manually modifying the environment variables every time I want to switch between JDKs isn’t a happy task. But, thanks to Windows Powershell, I’m able to write a scriptlet that can do the heavy lifting for me. Basically, what I want to achieve is to set the PATH variable to add the Java bin folder, set the JAVA_HOME environment variable and then launch the correct Eclipse IDE. And I want to do this with a single command. Let’s do it. Open a Windows Powershell. I prefer writing custom Windows scripts in my profile file so that they are available whenever I open the shell. To edit the profile, run this command: notepad.exe $profile - the $profile is a special variable that points to your profile file. Write the below script in the profile file and save it.

function myIDE{
    # Prepend, so this JDK wins over any other java already on the PATH
    $env:Path = "C:\vraa\java\jdk7\bin;" + $env:Path
    $env:JAVA_HOME = "C:\vraa\java\jdk7"
    C:\vraa\ide\eclipse\eclipse
    set-location C:\vraa\workspace\myproject
    play
}

function officeIDE{
    # Prepend, so this JDK wins over any other java already on the PATH
    $env:Path = "C:\vraa\java\jdk6\bin;" + $env:Path
    $env:JAVA_HOME = "C:\vraa\java\jdk6"
    C:\office\eclipse\eclipse
}

Close and restart the Powershell. Now you can issue the command myIDE which will set the proper PATH and environment variables and then launch the Eclipse IDE.
As you can see, there are two functions with different configurations. Just call the function name that you want to launch from the Powershell command line (myIDE or officeIDE).
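After switching, it is worth confirming which JDK the shell actually resolves. One quick way, sketched here with a hypothetical class name WhichJdk, is a tiny program that prints what the running JVM reports about itself next to the JAVA_HOME value it inherited from the shell.

```java
final class WhichJdk {

    /** Summarises the running JVM and the JAVA_HOME seen by this process. */
    static String describe() {
        return "java.version=" + System.getProperty("java.version")
             + ", java.home=" + System.getProperty("java.home")
             + ", JAVA_HOME=" + System.getenv("JAVA_HOME"); // null when unset
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If java.home does not sit under the folder that JAVA_HOME names, some other java executable earlier on the PATH is being picked up instead of the one the function was meant to select.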

August 4, 2012

by Veera Sundar


Spring Data With Cassandra Using JPA

We recently adopted the use of Spring Data. Spring Data provides a nice pattern/API that you can layer on top of JPA to eliminate boiler-plate code. With that adoption, we started looking at the DAO layer we use against Cassandra for some of our operations. Some of the data we store in Cassandra is simple. It does *not* leverage the flexible nature of NoSQL. In other words, we know all the table names and the column names ahead of time, and we don't anticipate them changing all that often. We could have stored this data in an RDBMS, using Hibernate to access it, but standing up another persistence mechanism seemed like overkill. For simplicity's sake, we preferred storing this data in Cassandra. That said, we want the flexibility to move this to an RDBMS if we need to. Enter JPA. JPA would provide us a nice layer of abstraction away from the underlying storage mechanism. Wouldn't it be great if we could annotate the objects with JPA annotations, and persist them to Cassandra? Enter Kundera. Kundera is a JPA implementation that supports Cassandra (among other storage mechanisms). OK -- so JPA is great, and would get us what we want, but we had just adopted the use of Spring Data. Could we use both? The answer is "sort of". I forked off SpringSource's spring-data-cassandra: https://github.com/boneill42/spring-data-cassandra And I started hacking on it. I managed to get an implementation of the PagingAndSortingRepository for which I wrote unit tests that worked, but I was duplicating a lot of what should have come for free in the SimpleJpaRepository. When I tried to substitute my CassandraJpaRepository for the SimpleJpaRepository, I ran into some trouble with Kundera. Specifically, the MetaModel implementation appeared to be incomplete. MetaModelImpl was returning null for all managedTypes(). SimpleJpaRepository wasn't too happy with this. Instead of wrangling with Kundera, we punted. We can achieve enough of the value leveraging JPA directly.
Perhaps more importantly, there is still an impedance mismatch between JPA and NoSQL. In our case, it would have been nice to get at Cassandra through Spring Data using JPA for a few cases in our app, but for the vast majority of the application, a straight up ORM layer whereby we know the tables, rows and column names ahead of time is insufficient. For those cases where we don't know the schema ahead of time, we're going to need to leverage the converters pattern in Spring Data. So, I started hacking on a proper Spring Data layer using Astyanax as the client. Follow along here: https://github.com/boneill42/spring-data-cassandra More to come on that....

July 31, 2012

by Brian O' Neill


Use Lucene’s MMapDirectory on 64bit Platforms, Please!

Don’t be afraid – Some clarification to common misunderstandings Since version 3.1, Apache Lucene and Solr use MMapDirectory by default on 64bit Windows and Solaris systems; since version 3.3 also for 64bit Linux systems. This change led to some confusion among Lucene and Solr users, because suddenly their systems started to behave differently than in previous versions. On the Lucene and Solr mailing lists a lot of posts arrived from users asking why their Java installation is suddenly consuming three times their physical memory, or from system administrators complaining about heavy resource usage. Consultants also started telling people that they should not use MMapDirectory and should change their solrconfig.xml to work instead with the slow SimpleFSDirectory or NIOFSDirectory (which is much slower on Windows, caused by JVM bug #6265734). From the point of view of the Lucene committers, who carefully decided that using MMapDirectory is the best for those platforms, this is rather annoying, because they know that Lucene/Solr can work with much better performance than before. Common misinformation about the background of this change causes suboptimal installations of this great search engine everywhere. In this blog post, I will try to explain the basic operating system facts regarding virtual memory handling in the kernel and how this can be used to largely improve the performance of Lucene (“VIRTUAL MEMORY for DUMMIES”). It will also clarify why the blog and mailing list posts by various people are wrong and contradict the purpose of MMapDirectory. In the second part I will show you some configuration details and settings you should take care of to prevent errors like “mmap failed” and suboptimal performance because of stupid Java heap allocation.
Virtual Memory[1] Let’s start with your operating system’s kernel: The naive approach to doing I/O in software is the way you have done it since the 1970s. The pattern is simple: whenever you have to work with data on disk, you execute a syscall to your operating system kernel, passing a pointer to some buffer (e.g. a byte[] array in Java), and transfer some bytes from/to disk. After that you parse the buffer contents and do your program logic. If you don’t want to do too many syscalls (because those may cost a lot of processing power), you generally use large buffers in your software, so synchronizing the data in the buffer with your disk needs to be done less often. This is one reason why some people suggest loading the whole Lucene index into Java heap memory (e.g., by using RAMDirectory). But all modern operating systems like Linux, Windows (NT+), MacOS X, or Solaris provide a much better approach than this 1970s style of code by using their sophisticated file system caches and memory management features. A feature called “virtual memory” is a good alternative for handling very large and space intensive data structures like a Lucene index. Virtual memory is an integral part of a computer architecture; implementations require hardware support, typically in the form of a memory management unit (MMU) built into the CPU. The way it works is very simple: Every process gets its own virtual address space into which all libraries, heap and stack space are mapped. This address space in most cases also starts at offset zero, which simplifies loading the program code because no relocation of address pointers needs to be done. Every process sees a large unfragmented linear address space it can work on. It is called “virtual memory” because this address space has nothing to do with physical memory, it just looks that way to the process.
Software can then access this large address space as if it were real memory without knowing that there are other processes also consuming memory and having their own virtual address space. The underlying operating system works together with the MMU (memory management unit) in the CPU to map those virtual addresses to real memory once they are accessed for the first time. This is done using so called page tables, which are backed by TLBs located in the MMU hardware (translation lookaside buffers, they cache frequently accessed pages). By this, the operating system is able to distribute all running processes’ memory requirements to the real available memory, completely transparent to the running programs. Schematic drawing of virtual memory (image from Wikipedia [1], http://en.wikipedia.org/wiki/File:Virtual_memory.svg, licensed by CC BY-SA 3.0) By using this virtualization, there is one more thing, the operating system can do: If there is not enough physical memory, it can decide to “swap out” pages no longer used by the processes, freeing physical memory for other processes or caching more important file system operations. Once a process tries to access a virtual address, which was paged out, it is reloaded to main memory and made available to the process. The process does not have to do anything, it is completely transparent. This is a good thing to applications because they don’t need to know anything about the amount of memory available; but also leads to problems for very memory intensive applications like Lucene. Lucene & Virtual Memory Let’s take the example of loading the whole index or large parts of it into “memory” (we already know, it is only virtual memory). If we allocate a RAMDirectory and load all index files into it, we are working against the operating system: The operating system tries to optimize disk accesses, so it caches already all disk I/O in physical memory. 
We copy all these cache contents into our own virtual address space, consuming horrible amounts of physical memory (and we must wait for the copy operation to take place!). As physical memory is limited, the operating system may, of course, decide to swap out our large RAMDirectory and where does it land? – On disk again (in the OS swap file)! In fact, we are fighting against our O/S kernel, which pages out all the stuff we loaded from disk [2]. So RAMDirectory is not a good idea to optimize index loading times! Additionally, RAMDirectory also has more problems related to garbage collection and concurrency. Because the data resides in swap space, Java’s garbage collector has a hard job freeing the memory in its own heap management. This leads to high disk I/O, slow index access times, and minute-long latency in your searching code caused by the garbage collector going crazy. On the other hand, if we don’t use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory, we have to pay another price: Our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk or filesystem cache and our buffers residing in Java heap. This needs to be done on every search request, over and over again. Memory Mapping Files The solution to the above issues is MMapDirectory, which uses virtual memory and a kernel feature called “mmap” [3] to access the disk files. In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does! Basically mmap does the same as handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process.
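The mapping step described above can be tried out with plain java.nio, independent of Lucene. The following is only a sketch of the mechanism, not how MMapDirectory is implemented: the class name MmapSketch, the helper readByteAt and the temporary file are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {

    /** Maps the whole file read-only and reads one byte at the given offset. */
    static byte readByteAt(Path file, int offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Reserves address space only; pages are loaded when first touched.
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // A plain memory access from the program's point of view.
            return mapped.get(offset);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-sketch", ".bin");
        file.toFile().deleteOnExit();
        Files.write(file, "hello mmap".getBytes(StandardCharsets.US_ASCII));

        System.out.println((char) readByteAt(file, 6)); // prints m
    }
}
```

Note that no read() call appears after the mapping is established: the byte is fetched through the buffer, and the kernel decides whether it comes from the file system cache or has to be paged in from disk first.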
We can then access our index file on disk as if it were a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handle all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into the file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in the file system cache. It is now just a native memory access, nothing more! We don’t have to take care of paging in/out of buffers, all this is managed by the O/S kernel. Furthermore, we have no concurrency issue; the only overhead over a standard byte[] array is some wrapping caused by Java’s ByteBuffer interface (it is still slower than a real byte[] array, but that is the only way to use mmap from Java and it is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all Java GC issues described before. What does this all mean to our Lucene/Solr application? We should not work against the operating system anymore, so allocate as little heap space as possible (-Xmx Java option). Remember, our index accesses are passed directly to the O/S cache! This is also very friendly to the Java garbage collector. Free as much physical memory as possible to be available for the O/S kernel as file system cache. Remember, our Lucene code works directly on it, thus reducing the amount of paging/swapping between disk and memory. Allocating too much heap to our Lucene application hurts performance! Lucene does not require it with MMapDirectory. Why does this only work as expected on operating systems and Java virtual machines with 64bit? One limitation of 32bit platforms is the size of pointers: they can refer to any address between 0 and 2^32-1, which is 4 Gigabytes.
Most operating systems limit that address space to 3 Gigabytes because the remaining address space is reserved for use by device hardware and similar things. This means the overall linear address space provided to any process is limited to 3 Gigabytes, so you cannot map any file larger than that into this “small” address space to be available as a big byte[] array. And once you have mapped that one large file, there is no virtual space (address like “house number”) available anymore. As physical memory sizes in current systems have already gone beyond that size, there is no address space available to make use of for mapping files without wasting resources (in our case “address space”, not physical memory!). On 64bit platforms this is different: 2^64-1 is a very large number, a number in excess of 18 quintillion bytes, so there is no real limit in address space. Unfortunately, most hardware (the MMU, CPU’s bus system) and operating systems are limiting this address space to 47 bits for user mode applications (Windows: 43 bits) [4]. But there is still plenty of address space available to map terabytes of data. Common misunderstandings If you have read carefully what I have told you about virtual memory, you can easily verify that the following is true: MMapDirectory does not consume additional memory and the size of mapped index files is not limited by the physical memory available on your server. By mmap()ing files, we only reserve address space, not memory! Remember, address space on 64bit platforms is for free! MMapDirectory will not load the whole index into physical memory. Why should it do this? We just ask the operating system to map the file into address space for easy access, by no means are we requesting more. Java and the O/S optionally provide the option to try loading the whole file into RAM (if enough is available), but Lucene does not use that option (we may add this possibility in a later version).
MMapDirectory does not overload the server when “top” reports horrible amounts of memory. “top” (on Linux) has three columns related to memory: “VIRT”, “RES”, and “SHR”. The first one (VIRT, virtual) is reporting allocated virtual address space (and that one is for free on 64 bit platforms!). This number can be several times your index size or physical memory when merges are running in IndexWriter. If you have only one IndexReader open it should be approximately equal to allocated heap space (-Xmx) plus index size. It does not show physical memory used by the process. The second column (RES, resident) shows how much (physical) memory the process has allocated for operating and should be about the size of your Java heap space. The last column (SHR, shared) shows how much of the allocated virtual address space is shared with other processes. If you have several Java applications using MMapDirectory to access the same index, you will see this number going up. Generally, you will see the space needed by shared system libraries, JAR files, and the process executable itself (which are also mmapped). How to configure my operating system and Java VM to make optimal use of MMapDirectory? First of all, default settings in Linux distributions and Solaris/Windows are perfectly fine. But there are some paranoid system administrators around who want to control everything (with lack of understanding). Those limit the maximum amount of virtual address space that can be allocated by applications. So please check that “ulimit -v” and “ulimit -m” both report “unlimited”, otherwise it may happen that MMapDirectory reports “mmap failed” while opening your index. If this error still happens on systems with lots of very large indexes, each of those with many segments, you may need to tune your kernel parameters in /etc/sysctl.conf: The default value of vm.max_map_count is 65530, you may need to raise it.
I think that for Windows and Solaris systems there are similar settings available, but it is up to the reader to find out how to use them. For configuring your Java VM, you should rethink your memory requirements: Give only the really needed amount of heap space and leave as much as possible to the O/S. As a rule of thumb: Don’t use more than ¼ of your physical memory as heap space for Java running Lucene/Solr, keep the remaining memory free for the operating system cache. If you have more applications running on your server, adjust accordingly. As usual the more physical memory the better, but you don’t need as much physical memory as your index size. The kernel does a good job in paging in frequently used pages from your index. A good way to check that you have configured your system optimally is by looking at both "top" (and correctly interpreting it, see above) and the similar command "iotop" (can be installed, e.g., on Ubuntu Linux by "apt-get install iotop"). If your system does lots of swap in/swap out for the Lucene process, reduce heap size, you possibly used too much. If you see lots of disk I/O, buy more RUM (Simon Willnauer) so mmapped files don't need to be paged in/out all the time, and finally: buy SSDs. Happy mmapping!

Bibliography
[1] http://en.wikipedia.org/wiki/Virtual_memory
[2] https://www.varnish-cache.org/trac/wiki/ArchitectNotes
[3] http://en.wikipedia.org/wiki/Memory-mapped_file
[4] http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details

July 31, 2012

by Uwe Schindler


Managing Camel Routes With JMX APIs

Here is a quick example of how to programmatically access Camel MBeans to monitor and manipulate routes...

First, get a connection to a JMX server (assumes localhost, port 1099, no auth). Note: always cache the connection for subsequent requests (it can cause memory utilization issues otherwise).

JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
JMXConnector jmxc = JMXConnectorFactory.connect(url);
MBeanServerConnection server = jmxc.getMBeanServerConnection();

Use the following to iterate over all routes and retrieve statistics (state, exchanges, etc)...

ObjectName objName = new ObjectName("org.apache.camel:type=routes,*");
List<ObjectName> cacheList = new LinkedList<ObjectName>(server.queryNames(objName, null));
for (Iterator<ObjectName> iter = cacheList.iterator(); iter.hasNext();) {
    objName = iter.next();
    String keyProps = objName.getCanonicalKeyPropertyListString();
    ObjectName objectInfoName = new ObjectName("org.apache.camel:" + keyProps);
    String routeId = (String) server.getAttribute(objectInfoName, "RouteId");
    String description = (String) server.getAttribute(objectInfoName, "Description");
    String state = (String) server.getAttribute(objectInfoName, "State");
    ...
}

Use the following to execute operations against a Camel route (stop, start, etc):

ObjectName objName = new ObjectName("org.apache.camel:type=routes,*");
List<ObjectName> cacheList = new LinkedList<ObjectName>(server.queryNames(objName, null));
for (Iterator<ObjectName> iter = cacheList.iterator(); iter.hasNext();) {
    objName = iter.next();
    String keyProps = objName.getCanonicalKeyPropertyListString();
    // routeID and operationName are supplied by the caller
    if (keyProps.contains(routeID)) {
        ObjectName objectRouteName = new ObjectName("org.apache.camel:" + keyProps);
        Object[] params = {};
        String[] sig = {};
        server.invoke(objectRouteName, operationName, params, sig);
        return;
    }
}

Summary These APIs can easily be used to build a web or command line based tool to support remote Camel management features.
All of these features are available via the JMX console and Camel does provide a web console to support some management/monitoring tasks. See these pages for more information... http://camel.apache.org/camel-jmx.html http://camel.apache.org/web-console.html
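The same query-then-read pattern can be tried without a running Camel instance, because every JVM exposes its own platform MBeans. The sketch below runs against the local platform MBean server and the standard java.lang domain instead of org.apache.camel; the class name JmxQuerySketch and the helper listNames are made up for the example, and nothing here is Camel-specific.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

final class JmxQuerySketch {

    /** Returns the canonical names of all MBeans matching the pattern, sorted. */
    static Set<String> listNames(MBeanServerConnection server, String pattern)
            throws Exception {
        Set<String> names = new TreeSet<>();
        for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
            names.add(name.getCanonicalName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // The in-process server; a remote JMXConnector yields the same interface.
        MBeanServerConnection server = ManagementFactory.getPlatformMBeanServer();

        for (String name : listNames(server, "java.lang:type=*")) {
            System.out.println(name);
        }

        // Attribute access works exactly as for the Camel route MBeans.
        Object uptime = server.getAttribute(
                new ObjectName("java.lang:type=Runtime"), "Uptime");
        System.out.println("Uptime (ms): " + uptime);
    }
}
```

Because the helper only depends on MBeanServerConnection, it can be handed the connection obtained from a JMXConnector and a pattern such as org.apache.camel:type=routes,* to list route MBeans on a remote server.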

July 30, 2012

by Ben O'Day


Understanding Vector Clocks with Riak

Riak is one of the databases that use vector clocks for conflict resolution. I came across two blog posts on Basho.com, the company which develops Riak, and these posts are great at explaining the basics of vector clocks - definitely a must read if you're into distributed systems: "Why vector clocks are easy" and "Why vector clocks are hard". Voldemort DB (by LinkedIn) is another DB that uses vector clocks, as explained below. Not surprisingly, it also takes the idea from Amazon's Dynamo (like Riak): "The redundancy of storage makes the system more resilient to server failure. Since each value is stored N times, you can tolerate as many as N – 1 machine failures without data loss. This causes other problems, though. Since each value is stored in multiple places it is possible that one of these servers will not get updated (say because it is crashed when the update occurs). To help solve this problem Voldemort uses a data versioning mechanism called Vector Clocks that are common in distributed programming. This is an idea we took from Amazon’s Dynamo system. This data versioning allows the servers to detect stale data when it is read and repair it." Voldemort's code in Java can be found on code.google.com. Finally, before I end this post, you may be asking "why complicate things so much?" (if I could get a penny every time I heard that when discussing distributed systems... :-). But in this case, it's a good and typical question: can't we just use a timestamp and let the last one win? The problem, though, is that it requires clocks to be perfectly synchronized - which is very difficult and oftentimes impossible. By using vector clocks, you don't have this requirement on the system.
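To make the idea concrete, here is a minimal, generic vector clock sketch. It is not Riak's or Voldemort's implementation; the class name VectorClockSketch, the node names and the map-based representation are assumptions chosen to keep the example short. Each clock maps a node id to a counter, and two versions conflict when neither clock dominates the other.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VectorClockSketch {

    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    /** Returns a copy of the clock with the given node's counter incremented. */
    static Map<String, Integer> tick(Map<String, Integer> clock, String node) {
        Map<String, Integer> next = new HashMap<>(clock);
        next.merge(node, 1, Integer::sum);
        return next;
    }

    /** Compares two clocks entry by entry; a missing entry counts as 0. */
    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> nodes = new HashSet<>(a.keySet());
        nodes.addAll(b.keySet());
        boolean aAhead = false;
        boolean bAhead = false;
        for (String node : nodes) {
            int ca = a.getOrDefault(node, 0);
            int cb = b.getOrDefault(node, 0);
            if (ca > cb) aAhead = true;
            if (cb > ca) bAhead = true;
        }
        if (aAhead && bAhead) return Order.CONCURRENT; // conflict: needs resolution
        if (aAhead) return Order.AFTER;
        if (bAhead) return Order.BEFORE;
        return Order.EQUAL;
    }

    public static void main(String[] args) {
        Map<String, Integer> base = tick(new HashMap<String, Integer>(), "x"); // {x=1}
        Map<String, Integer> viaY = tick(base, "y"); // {x=1, y=1}
        Map<String, Integer> viaZ = tick(base, "z"); // {x=1, z=1}

        System.out.println(compare(base, viaY)); // BEFORE: viaY descends from base
        System.out.println(compare(viaY, viaZ)); // CONCURRENT: sibling updates
    }
}
```

No wall-clock time appears anywhere: the ordering is derived purely from which nodes have seen which updates, which is why the scheme does not depend on synchronized clocks.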

July 25, 2012

by Rodrigo De Castro


Hadoop Hive Web Interface

I’ve been playing with Hive recently and liking what I’ve found. In theory at least it provides a very nice, simple way of getting into analysing large data sets. To make it even easier to show other people what you’re up to, Hive has a nascent web interface with a little documentation on the wiki. On the one hand it’s rather simple at this point, but that should be easy enough to prettify given a bit of time. The bigger problem was getting it working in the first place. What follows worked for me using the latest cloudera packages on debian testing. I’m assuming you already have Hive and Hadoop installed, the basic packages worked fine for me here. Next up you’ll need the JDK (not just the JRE) as there is some compilation that will go on the first time you run the web interface.

apt-get install ant sun-java6-jdk

Next up I had to modify the installed /etc/hive/conf/hive-site.xml file as follows. I changed this:

<property>
  <name>hive.metastore.uris</name>
  <value>file:///var/lib/hivevar/metastore/metadb/</value>
  <description>Comma separated list of URIs of metastore servers. The first server that can be connected to will be used.</description>
</property>

To this. Note the hivevar path doesn’t exist so I’m not sure if this was a typo in the source.

<property>
  <name>hive.metastore.uris</name>
  <value>file:///var/lib/hive/var/metastore/metadb/</value>
  <description>Comma separated list of URIs of metastore servers. The first server that can be connected to will be used.</description>
</property>

I also changed the following section regarding the metastore name:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/${user.name}_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

To this, with a fixed name. When using the above configuration the file was actually called ${user.name} rather than my username being substituted in. Elsewhere this seems to work fine.

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

I’m not convinced the above two changes are needed but have left them here just in case.
The main tricky part is making sure a load of environment variables are correctly set. The following worked for me: export ANT_LIB=/usr/share/ant/lib export HIVE_HOME=/usr/lib/hive export HADOOP_HOME=/usr/lib/hadoop export PATH=$PATH:$HADOOP_HOME/bin export JAVA_HOME=/usr/lib/jvm/java-6-sun All being well that should allow you to run the hive command with the web interface like so: hive --service hwi That should bring up a webserver on port 9999 where you should see something similar to the screenshot above.

July 25, 2012

by Gareth Rushgrove
