Databases Resources

The Latest Databases Topics

Unit Testing Checklist: Keep Your Tests Useful and Avoid Big Mistakes

A unit test is a set of methods launched frequently to validate your code. It is usually a good idea to write code in this order: Write a class API Write a method to test the API Implement the API Launch the unit tests Why write unit tests? They validate current and future implementations. They measure code quality. They force you to write testable, loosely coupled code. They’re cheaper than manual regression testing. They build confidence in your code. They help teamwork. Why use a checklist? Unit testing can be harder than actual implementation. Unit testing forces you to really think things through. But unit tests should be simple, direct, and easy to read and maintain. You also need to know when to stop writing tests and start writing the implementation. Use this checklist to be sure your tests are really useful and to the point. Remember: the checklist helps you avoid big mistakes, but you need to make sure of the following: □ My test class is testing only one class. o You are testing a class API to be sure the public contract is respected. □ My methods are testing only one method at a time. o Be sure not to test private methods! They are hidden implementation details, not API. □ My variables and method names are explicit. o For example, store an expected value in an expectedFoo variable instead of just foo. If you test many combinations, use composed variable names like inputValue_NotNull, inputValue_ZeroData, inputValue_PastDate, etc. (according to your application’s coding convention). □ My test cases are easy to read by humans. o Future maintainers should be able to read your tests before reading the implementation. This will help them understand a class API before tweaking or debugging it. □ My tests respect the usual clean code standards. o There should be no flow control in a test method (switch, if, etc.). A good test is just a very straightforward sequence of setup/validate instructions. If necessary, use sub-methods to factorize and make your tests easier to read. In case of multiple scenarios, use multiple test methods (one for each case). o For example, a test method should fit on screen without scrolling – 1 to 20 lines long. If the method is longer, consider writing multiple test methods for each case instead of jamming them together. □ My tests are also testing expected exceptions. o In java, use @Test(expected=MyException.class). □ My tests don’t need access to a database. o Or if a test does need database access, then it must be a mocked, “fire and forget” temporary database that you fill with test cases for every new test method (use the Setup/Teardown methods to prepare it). □ My tests don’t need access to network resources. o You can’t rely on third parties like network or device presence to validate a method (use mocks). □ My tests control side effects, limit values (max, min) and null variables (even when they throw exceptions). o You want to make sure these problem cases never occur, even when the test won’t be used during maintenance. □ My tests can be run at any time, at any place without needing configuration or human intervention. □ My tests pass for the current implementation and are easy to evolve. o Tests really exist to support code evolution. If they are too hard to maintain or too light to refine the code, then they are a useless burden. (Many developers avoid unit testing for this reason.) □ My tests are concrete. o In Java: don’t use Date() as input for a method you are testing, but build a concrete date out of Calendar (don’t forget to force the timezone). Other example: use name = “Smith”; instead of name = “name”; or name = “test”; □ My tests use a mock to simulate/stub complex class structure or methods. o Remember to test only one class API at a time. o Never test third-party libraries through your own classes. Libraries should come with their own tests (this is actually a good way to choose a library). □ My tests are never @ignored or commented out. Never. Ever. □ My tests help me validate my architecture. o If you can’t test a method or a class, your design is not agile enough. □ My tests can run on any supported platform, not just the targeted platform. o Don’t expect a particular device or hardware configuration. Otherwise, your tests will make migration tougher and you will be incentivized to disable them. □ My tests are lightning fast! o Slow tests shouldn’t drag you down. Speed encourages you to launch your tests often. It also helps to reduce building time on Continuous Integration systems. o Use a test runner that allows you to launch one test at a time while you are writing it. Use “delay” or “sleep” with caution – i.e., only in some edge cases, like waiting for notifications or clock-based methods.

June 16, 2014

by Jean-baptiste Rieu

· 25,010 Views

Java 8 Friday: 10 Subtle Mistakes When Using the Streams API

at data geekery , we love java. and as we’re really into jooq’s fluent api and query dsl , we’re absolutely thrilled about what java 8 will bring to our ecosystem. java 8 friday every friday, we’re showing you a couple of nice new tutorial-style java 8 features, which take advantage of lambda expressions, extension methods, and other great stuff. you’ll find the source code on github . 10 subtle mistakes when using the streams api we’ve done all the sql mistakes lists: 10 common mistakes java developers make when writing sql 10 more common mistakes java developers make when writing sql yet another 10 common mistakes java developers make when writing sql (you won’t believe the last one) but we haven’t done a top 10 mistakes list with java 8 yet! for today’s occasion ( it’s friday the 13th ), we’ll catch up with what will go wrong in your application when you’re working with java 8. (it won’t happen to us, as we’re stuck with java 6 for another while) 1. accidentally reusing streams wanna bet, this will happen to everyone at least once. like the existing “streams” (e.g. inputstream ), you can consume streams only once. the following code won’t work: intstream stream = intstream.of(1, 2); stream.foreach(system.out::println); // that was fun! let's do it again! stream.foreach(system.out::println); you’ll get a java.lang.illegalstateexception: stream has already been operated upon or closed so be careful when consuming your stream. it can be done only once 2. accidentally creating “infinite” streams you can create infinite streams quite easily without noticing. take the following example: // will run indefinitely intstream.iterate(0, i -> i + 1) .foreach(system.out::println); the whole point of streams is the fact that they can be infinite, if you design them to be. the only problem is, that you might not have wanted that. so, be sure to always put proper limits: // that's better intstream.iterate(0, i -> i + 1) .limit(10) .foreach(system.out::println); 3. accidentally creating “subtle” infinite streams we can’t say this enough. you will eventually create an infinite stream, accidentally. take the following stream, for instance: intstream.iterate(0, i -> ( i + 1) % 2) .distinct() .limit(10) .foreach(system.out::println); so… we generate alternating 0′s and 1′s then we keep only distinct values, i.e. a single 0 and a single 1 then we limit the stream to a size of 10 then we consume it well… the distinct() operation doesn’t know that the function supplied to the iterate() method will produce only two distinct values. it might expect more than that. so it’ll forever consume new values from the stream, and the limit(10) will never be reached. tough luck, your application stalls. 4. accidentally creating “subtle” parallel infinite streams we really need to insist that you might accidentally try to consume an infinite stream. let’s assume you believe that the distinct() operation should be performed in parallel. you might be writing this: intstream.iterate(0, i -> ( i + 1) % 2) .parallel() .distinct() .limit(10) .foreach(system.out::println); now, we’ve already seen that this will turn forever. but previously, at least, you only consumed one cpu on your machine. now, you’ll probably consume four of them, potentially occupying pretty much all of your system with an accidental infinite stream consumption. that’s pretty bad. you can probably hard-reboot your server / development machine after that. have a last look at what my laptop looked like prior to exploding: if i were a laptop, this is how i’d like to go. 5. mixing up the order of operations so, why did we insist on your definitely accidentally creating infinite streams? it’s simple. because you may just accidentally do it. the above stream can be perfectly consumed if you switch the order of limit() and distinct() : intstream.iterate(0, i -> ( i + 1) % 2) .limit(10) .distinct() .foreach(system.out::println); this now yields: 0 1 why? because we first limit the infinite stream to 10 values (0 1 0 1 0 1 0 1 0 1), before we reduce the limited stream to the distinct values contained in it (0 1). of course, this may no longer be semantically correct, because you really wanted the first 10 distinct values from a set of data (you just happened to have “forgotten” that the data is infinite). no one really wants 10 random values, and only then reduce them to be distinct. if you’re coming from a sql background, you might not expect such differences. take sql server 2012, for instance. the following two sql statements are the same: -- using top selectdistincttop10 * fromi orderby.. -- using fetch select* fromi orderby.. offset 0 rows fetchnext10 rowsonly so, as a sql person, you might not be as aware of the importance of the order of streams operations. 6. mixing up the order of operations (again) speaking of sql, if you’re a mysql or postgresql person, you might be used to the limit .. offset clause. sql is full of subtle quirks, and this is one of them. the offset clause is applied first , as suggested in sql server 2012′s (i.e. the sql:2008 standard’s) syntax. if you translate mysql / postgresql’s dialect directly to streams, you’ll probably get it wrong: intstream.iterate(0, i -> i + 1) .limit(10) // limit .skip(5) // offset .foreach(system.out::println); the above yields 5 6 7 8 9 yes. it doesn’t continue after 9 , because the limit() is now applied first , producing (0 1 2 3 4 5 6 7 8 9). skip() is applied after, reducing the stream to (5 6 7 8 9). not what you may have intended. beware of the limit .. offset vs. "offset .. limit" trap! 7. walking the file system with filters we’ve blogged about this before . what appears to be a good idea is to walk the file system using filters: files.walk(paths.get(".")) .filter(p -> !p.tofile().getname().startswith(".")) .foreach(system.out::println); the above stream appears to be walking only through non-hidden directories, i.e. directories that do not start with a dot. unfortunately, you’ve again made mistake #5 and #6. walk() has already produced the whole stream of subdirectories of the current directory. lazily, though, but logically containing all sub-paths. now, the filter will correctly filter out paths whose names start with a dot “.”. e.g. .git or .idea will not be part of the resulting stream. but these paths will be: .\.git\refs , or .\.idea\libraries . not what you intended. now, don’t fix this by writing the following: files.walk(paths.get(".")) .filter(p -> !p.tostring().contains(file.separator + ".")) .foreach(system.out::println); while that will produce the correct output, it will still do so by traversing the complete directory subtree, recursing into all subdirectories of “hidden” directories. i guess you’ll have to resort to good old jdk 1.0 file.list() again. the good news is, filenamefilter and filefilter are both functional interfaces. 8. modifying the backing collection of a stream while you’re iterating a list , you must not modify that same list in the iteration body. that was true before java 8, but it might become more tricky with java 8 streams. consider the following list from 0..9: // of course, we create this list using streams: list list = intstream.range(0, 10) .boxed() .collect(tocollection(arraylist::new)); now, let’s assume that we want to remove each element while consuming it: list.stream() // remove(object), not remove(int)! .peek(list::remove) .foreach(system.out::println); interestingly enough, this will work for some of the elements! the output you might get is this one: 0 2 4 6 8 null null null null null java.util.concurrentmodificationexception if we introspect the list after catching that exception, there’s a funny finding. we’ll get: [1, 3, 5, 7, 9] heh, it “worked” for all the odd numbers. is this a bug? no, it looks like a feature. if you’re delving into the jdk code, you’ll find this comment in arraylist.arralistspliterator : /* * if arraylists were immutable, or structurally immutable (no * adds, removes, etc), we could implement their spliterators * with arrays.spliterator. instead we detect as much * interference during traversal as practical without * sacrificing much performance. we rely primarily on * modcounts. these are not guaranteed to detect concurrency * violations, and are sometimes overly conservative about * within-thread interference, but detect enough problems to * be worthwhile in practice. to carry this out, we (1) lazily * initialize fence and expectedmodcount until the latest * point that we need to commit to the state we are checking * against; thus improving precision. (this doesn't apply to * sublists, that create spliterators with current non-lazy * values). (2) we perform only a single * concurrentmodificationexception check at the end of foreach * (the most performance-sensitive method). when using foreach * (as opposed to iterators), we can normally only detect * interference after actions, not before. further * cme-triggering checks apply to all other possible * violations of assumptions for example null or too-small * elementdata array given its size(), that could only have * occurred due to interference. this allows the inner loop * of foreach to run without any further checks, and * simplifies lambda-resolution. while this does entail a * number of checks, note that in the common case of * list.stream().foreach(a), no checks or other computation * occur anywhere other than inside foreach itself. the other * less-often-used methods cannot take advantage of most of * these streamlinings. */ now, check out what happens when we tell the stream to produce sorted() results: list.stream() .sorted() .peek(list::remove) .foreach(system.out::println); this will now produce the following, “expected” output 0 1 2 3 4 5 6 7 8 9 and the list after stream consumption? it is empty: [] so, all elements are consumed, and removed correctly. the sorted() operation is a “stateful intermediate operation” , which means that subsequent operations no longer operate on the backing collection, but on an internal state. it is now “safe” to remove elements from the list! well… can we really? let’s proceed with parallel() , sorted() removal: list.stream() .sorted() .parallel() .peek(list::remove) .foreach(system.out::println); this now yields: 7 6 2 5 8 4 1 0 9 3 and the list contains [8] eek. we didn’t remove all elements!? free beers ( and jooq stickers ) go to anyone who solves this streams puzzler! this all appears quite random and subtle, we can only suggest that you never actually do modify a backing collection while consuming a stream. it just doesn’t work. 9. forgetting to actually consume the stream what do you think the following stream does? intstream.range(1, 5) .peek(system.out::println) .peek(i -> { if(i == 5) thrownewruntimeexception("bang"); }); when you read this, you might think that it will print (1 2 3 4 5) and then throw an exception. but that’s not correct. it won’t do anything. the stream just sits there, never having been consumed. as with any fluent api or dsl, you might actually forget to call the “terminal” operation. this might be particularly true when you use peek() , as peek() is an aweful lot similar to foreach() . this can happen with jooq just the same, when you forget to call execute() or fetch() : dsl.using(configuration) .update(table) .set(table.col1, 1) .set(table.col2, "abc") .where(table.id.eq(3)); oops. no execute() yes, the “best” way – with 1-2 caveats ;-) 10. parallel stream deadlock this is now a real goodie for the end! all concurrent systems can run into deadlocks, if you don’t properly synchronise things. while finding a real-world example isn’t obvious, finding a forced example is. the following parallel() stream is guaranteed to run into a deadlock: object[] locks = { newobject(), newobject() }; intstream .range(1, 5) .parallel() .peek(unchecked.intconsumer(i -> { synchronized(locks[i % locks.length]) { thread.sleep(100); synchronized(locks[(i + 1) % locks.length]) { thread.sleep(50); } } })) .foreach(system.out::println); note the use of unchecked.intconsumer() , which transforms the functional intconsumer interface into a org.jooq.lambda.fi.util.function.checkedintconsumer , which is allowed to throw checked exceptions. well. tough luck for your machine. those threads will be blocked forever :-) the good news is, it has never been easier to produce a schoolbook example of a deadlock in java! for more details, see also brian goetz’s answer to this question on stack overflow . conclusion with streams and functional thinking, we’ll run into a massive amount of new, subtle bugs. few of these bugs can be prevented, except through practice and staying focused. you have to think about how to order your operations. you have to think about whether your streams may be infinite. streams (and lambdas) are a very powerful tool. but a tool which we need to get a hang of, first.

June 16, 2014

by Lukas Eder

· 10,347 Views · 2 Likes

The Performance Price of Foreign Keys in PostgreSQL

If you want to optimize the performance of your PostgreSQL tables, you might benefit from looking in places you wouldn't ordinarily expect. For example, your foreign keys. According to Shaun Thomas in this recent post, foreign keys can have a considerable impact on performance. Ordinarily they're a good thing - Thomas acknowledges that - but there are certain design choices in PostgreSQL that can complicate them: In PostgreSQL, every foreign key is maintained with an invisible system-level trigger added to the source table in the reference. At least one trigger must go here, as operations that modify the source data must be checked that they do not violate the constraint. Each trigger, Thomas says, requires some overhead, and as the number of triggers increases, so does the cost. Thomas offers an example query to demonstrate where this kind of thing becomes an issue - not really a practical or real-world example, though - but the performance hit is substantial: . . . after merely five foreign keys, performance of our updates drops by 28.5%. By the time we have 20 foreign keys, the updates are 95% slower! This is not to say that you shouldn't use foreign keys, Thomas says, but that you should be careful with them and avoid abusing them. In other words, if you don't need a foreign key, you should probably just leave it out. Check out the Thomas' post for all the details.

June 12, 2014

by Alec Noller

· 17,677 Views

Querying XML CLOB Data Directly in Oracle

Oracle 9i introduced a new datatype - XMLType - specifically designed for handling XML naively in the database. However, we may find we have a database that stores XML directly in a CLOB/NCLOB datatype (either because of legacy data, or because we are supporting multiple database platforms). It can be useful to query this data directly using the XMLType functions available. This post just shows a very simple example of a query to do this. (Note: there may be better ways to do this - but this worked for me when I needed an ad-hoc query quickly!) Our Table Assume we have a table called USER, which has the following columns Our embedded XML (example for Ringo) Our Query Say we want to look at account details in the XML - e.g. find which users all have the same sort code. A query like the following: SELECT ID, NAME XMLTYPE(u.accountxml).EXTRACT('/person/account/sortCode/text()') as SORTCODE FROM USER u; Will produce output like this. Obviously from here you can use all the standard SQL operators to filter, join etc further to get what you need:

June 11, 2014

by Adrian Milne

· 45,621 Views · 1 Like

Building a Simple RESTful API with Java Spark

Disclaimer: This post is about the Java micro web framework named Spark and not about the data processing engine Apache Spark. In this blog post we will see how Spark can be used to build a simple web service. As mentioned in the disclaimer, Spark is a micro web framework for Java inspired by the Ruby framework Sinatra. Spark aims for simplicity and provides only a minimal set of features. However, it provides everything needed to build a web application in a few lines of Java code. Getting Started Let's assume we have a simple domain class with a few properties and a service that provides some basic CRUDfunctionality: public class User { private String id; private String name; private String email; // getter/setter } public class UserService { // returns a list of all users public List getAllUsers() { .. } // returns a single user by id public User getUser(String id) { .. } // creates a new user public User createUser(String name, String email) { .. } // updates an existing user public User updateUser(String id, String name, String email) { .. } } We now want to expose the functionality of UserService as a RESTful API (For simplicity we will skip the hypermedia part of REST ;-)). For accessing, creating and updating user objects we want to use following URL patterns: GET /users Get a list of all users GET /users/ Get a specific user POST /users Create a new user PUT /users/ Update a user The returned data should be in JSON format. To get started with Spark we need the following Maven dependencies: com.sparkjava spark-core 2.0.0 org.slf4j slf4j-simple 1.7.7 Spark uses SLF4J for logging, so we need to a SLF4J binder to see log and error messages. In this example we use the slf4j-simple dependency for this purpose. However, you can also use Log4j or any other binder you like. Having slf4j-simple in the classpath is enough to see log output in the console. We will also use GSON for generating JSON output and JUnit to write a simple integration tests. You can find these dependencies in the complete pom.xml. Returning All Users Now it is time to create a class that is responsible for handling incoming requests. We start by implementing the GET /users request that should return a list of all users. import static spark.Spark.*; public class UserController { public UserController(final UserService userService) { get("/users", new Route() { @Override public Object handle(Request request, Response response) { // process request return userService.getAllUsers(); } }); // more routes } } Note the static import of spark.Spark.* in the first line. This gives us access to various static methods including get(), post(), put() and more. Within the constructor the get() method is used to register aRoute that listens for GET requests on /users. A Route is responsible for processing requests. Whenever aGET /users request is made, the handle() method will be called. Inside handle() we return an object that should be sent to the client (in this case a list of all users). Spark highly benefits from Java 8 Lambda expressions. Route is a functional interface (it contains only one method), so we can implement it using a Java 8 Lambda expression. Using a Lambda expression the Routedefinition from above looks like this: get("/users", (req, res) -> userService.getAllUsers()); To start the application we have to create a simple main() method. Inside main() we create an instance of our service and pass it to our newly created UserController: public class Main { public static void main(String[] args) { new UserController(new UserService()); } } If we now run main(), Spark will start an embedded Jetty server that listens on Port 4567. We can test our first route by initiating a GET http://localhost:4567/users request. In case the service returns a list with two user objects the response body might look like this: [com.mscharhag.sparkdemo.User@449c23fd, com.mscharhag.sparkdemo.User@437b26fe] Obviously this is not the response we want. Spark uses an interface called ResponseTransformer to convert objects returned by routes to an actual HTTP response. ReponseTransformer looks like this: public interface ResponseTransformer { String render(Object model) throws Exception; } ResponseTransformer has a single method that takes an object and returns a String representation of this object. The default implementation of ResponseTransformer simply calls toString() on the passed object (which creates output like shown above). Since we want to return JSON we have to create a ResponseTransformer that converts the passed objects to JSON. We use a small JsonUtil class with two static methods for this: public class JsonUtil { public static String toJson(Object object) { return new Gson().toJson(object); } public static ResponseTransformer json() { return JsonUtil::toJson; } } toJson() is an universal method that converts an object to JSON using GSON. The second method makes use of Java 8 method references to return a ResponseTransformer instance. ResponseTransformer is again a functional interface, so it can be satisfied by providing an appropriate method implementation (toJson()). So whenever we call json() we get a new ResponseTransformer that makes use of our toJson()method. In our UserController we can pass a ResponseTransformer as a third argument to Spark's get()method: import static com.mscharhag.sparkdemo.JsonUtil.*; public class UserController { public UserController(final UserService userService) { get("/users", (req, res) -> userService.getAllUsers(), json()); ... } } Note again the static import of JsonUtil.* in the first line. This gives us the option to create a newResponseTransformer by simply calling json(). Our response looks now like this: [{ "id": "1866d959-4a52-4409-afc8-4f09896f38b2", "name": "john", "email": "[email protected]" },{ "id": "90d965ad-5bdf-455d-9808-c38b72a5181a", "name": "anna", "email": "[email protected]" }] We still have a small problem. The response is returned with the wrong Content-Type. To fix this, we can register a Filter that sets the JSON Content-Type: after((req, res) -> { res.type("application/json"); }); Filter is again a functional interface and can therefore be implemented by a short Lambda expression. After a request is handled by our Route, the filter changes the Content-Type of every response toapplication/json. We can also use before() instead of after() to register a filter. Then, the Filterwould be called before the request is processed by the Route. The GET /users request should be working now :-) Returning a Specific User To return a specific user we simply create a new route in our UserController: get("/users/:id", (req, res) -> { String id = req.params(":id"); User user = userService.getUser(id); if (user != null) { return user; } res.status(400); return new ResponseError("No user with id '%s' found", id); }, json()); With req.params(":id") we can obtain the :id path parameter from the URL. We pass this parameter to our service to get the corresponding user object. We assume the service returns null if no user with the passed id is found. In this case, we change the HTTP status code to 400 (Bad Request) and return an error object. ResponseError is a small helper class we use to convert error messages and exceptions to JSON. It looks like this: public class ResponseError { private String message; public ResponseError(String message, String... args) { this.message = String.format(message, args); } public ResponseError(Exception e) { this.message = e.getMessage(); } public String getMessage() { return this.message; } } We are now able to query for a single user with a request like this: GET /users/5f45a4ff-35a7-47e8-b731-4339c84962be If an user with this id exists we will get a response that looks somehow like this: { "id": "5f45a4ff-35a7-47e8-b731-4339c84962be", "name": "john", "email": "[email protected]" } If we use an invalid user id, a ResponseError object will be created and converted to JSON. In this case the response looks like this: { "message": "No user with id 'foo' found" } Creating and Updating Users Creating and updating users is again very easy. Like returning the list of all users it is done using a single service call: post("/users", (req, res) -> userService.createUser( req.queryParams("name"), req.queryParams("email") ), json()); put("/users/:id", (req, res) -> userService.updateUser( req.params(":id"), req.queryParams("name"), req.queryParams("email") ), json()); To register a route for HTTP POST or PUT requests we simply use the static post() and put() methods of Spark. Inside a Route we can access HTTP POST parameters using req.queryParams(). For simplicity reasons (and to show another Spark feature) we do not do any validation inside the routes. Instead we assume that the service will throw an IllegalArgumentException if we pass in invalid values. Spark gives us the option to register ExceptionHandlers. An ExceptionHandler will be called if anException is thrown while processing a route. ExceptionHandler is another single method interface we can implement using a Java 8 Lambda expression: exception(IllegalArgumentException.class, (e, req, res) -> { res.status(400); res.body(toJson(new ResponseError(e))); }); Here we create an ExceptionHandler that is called if an IllegalArgumentException is thrown. The caught Exception object is passed as the first parameter. We set the response code to 400 and add an error message to the response body. If the service throws an IllegalArgumentException when the email parameter is empty, we might get a response like this: { "message": "Parameter 'email' cannot be empty" } The complete source the controller can be found here. Testing Because of Spark's simple nature it is very easy to write integration tests for our sample application. Let's start with this basic JUnit test setup: public class UserControllerIntegrationTest { @BeforeClass public static void beforeClass() { Main.main(null); } @AfterClass public static void afterClass() { Spark.stop(); } ... } In beforeClass() we start our application by simply running the main() method. After all tests finished we call Spark.stop(). This stops the embedded server that runs our application. After that we can send HTTP requests within test methods and validate that our application returns the correct response. A simple test that sends a request to create a new user can look like this: @Test public void aNewUserShouldBeCreated() { TestResponse res = request("POST", "/users?name=john&[email protected]"); Map json = res.json(); assertEquals(200, res.status); assertEquals("john", json.get("name")); assertEquals("[email protected]", json.get("email")); assertNotNull(json.get("id")); } request() and TestResponse are two small self made test utilities. request() sends a HTTP request to the passed URL and returns a TestResponse instance. TestResponse is just a small wrapper around some HTTP response data. The source of request() and TestResponse is included in the complete test classfound on GitHub. Conclusion Compared to other web frameworks Spark provides only a small amount of features. However, it is so simple you can build small web applications within a few minutes (even if you have not used Spark before). If you want to look into Spark you should clearly use Java 8, which reduces the amount of code you have to write a lot. You can find the complete source of the sample project on GitHub.

June 9, 2014

by Michael Scharhag

· 111,681 Views · 3 Likes

Neo4j & Cypher: UNWIND vs FOREACH

I’ve written a couple of posts about the new UNWIND clause in Neo4j’s cypher query language, but I forgot about my favorite use of UNWIND, which is to get rid of some uses of FOREACH from our queries. Let’s say we’ve created a timetree up front and now have a series of events coming in that we want to create in the database and attach to the appropriate part of the timetree. Before UNWIND existed we might try to write the following query using FOREACH: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events FOREACH (event IN events | CREATE (e:Event {name: event.name}) MATCH (year:Year {year: event.timetree.year }), (year)-[:HAS_MONTH]->(month {month: event.timetree.month }), (month)-[:HAS_DAY]->(day {day: event.timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day)) Unfortunately we can’t use MATCH inside a FOREACH statement so we’ll get the following error: Invalid use of MATCH inside FOREACH (line 5, column 3) " MATCH (year:Year {year: event.timetree.year }), " ^ Neo.ClientError.Statement.InvalidSyntax We can work around this by using MERGE instead in the knowledge that it’s never going to create anything because the timetree already exists: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events FOREACH (event IN events | CREATE (e:Event {name: event.name}) MERGE (year:Year {year: event.timetree.year }) MERGE (year)-[:HAS_MONTH]->(month {month: event.timetree.month }) MERGE (month)-[:HAS_DAY]->(day {day: event.timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day)) If we replace the FOREACH with UNWIND we’d get the following: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events UNWIND events AS event CREATE (e:Event {name: event.name}) WITH e, event.timetree AS timetree MATCH (year:Year {year: timetree.year }), (year)-[:HAS_MONTH]->(month {month: timetree.month }), (month)-[:HAS_DAY]->(day {day: timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day) Although the lines of code has slightly increased the query is now correct and we won’t accidentally correct new parts of our time tree. We could also pass on the event that we created to the next part of the query which wouldn’t be the case when using FOREACH.

June 9, 2014

by Mark Needham

· 12,754 Views

Checking Media Queries With jQuery

With the web being used on so many different devices now it's very important that you can change your design to fit on different screen sizes. The best way of changing your display depending on screen size is to use media queries to find out the size viewport of the screen and allowing you to change the design depending on what screen size is on. You will mainly make these changes in the CSS file as you can define the media query break points to change the design on different devices like this. /* Smartphones (portrait) ----------- */ @media only screen and (max-width : 320px) { } /* Desktops and laptops ----------- */ @media only screen and (min-width : 1224px) { } /* Large screens ----------- */ @media only screen and (min-width : 1824px) { } The above code will allow you to make styling completely different on different devices, but what if you wanted to change the functionality of the site depending on the screen size? What if you needed to use some Javascript code on different screen sizes, for example to create a slide down navigation button. How Do You Use Media Queries With jQuery Media queries will be checking the width of the window to see what the size of the device is so you would think that you can use a method like .width() on the window object like this. if($(window).width() < 767) { // change functionality for smaller screens } else { // change functionality for larger screens } But this will not return the true value of the window as it takes into effect things like body padding and scroll bars on the window. The other option you have when checking the media size is to use a Javascript method of .matchMedia() on the window object. var window_size = window.matchMedia('(max-width: 768px)')); This works the same way as media queries and is supported on many browsers apart from IE9 and lower. Can I Use window.matchMedia To use matchMedia you need to pass in the min or max values you want to check (like media queries) and see if the viewport matches this. if (window.matchMedia('(max-width: 768px)').matches) { // do functionality on screens smaller than 768px } Now you can use this to add a click event on to a sub-menu for screens smaller than 768px. The below code is an example of how you can add some Javascript code which will only be run on screens smaller than 768px. if (window.matchMedia('(max-width: 768px)').matches) { $('.sub-menu-button').on('click', function(e) { var subMenu = $(this).next('.sub-navigation'); if(subMenu.is(':visible')) { subMenu.slideUp(); } else { subMenu.slideDown(); } return false; }); }

June 6, 2014

by Paul Underwood

· 135,780 Views · 4 Likes

How Does Spring @Transactional Really Work?

Wondering how Spring @Transactional works? Learn about what's really going on.

June 5, 2014

by Vasco Cavalheiro

· 882,695 Views · 47 Likes

MapDB: The Agile Java Data Engine

MapDB is a pure Java database, specifically designed for the Java developer. The fundamental concept for MapDB is very clever yet natural to use: provide a reliable, full-featured and “tune-able” database engine using the Java Collections API. MapDB 1.0 has just been released, this is the culmination of years of research and development to get the project to this point. Jan Kotek, the primary developer for MapDB, also worked on predecessor projects (JDBM), starting MapDB as an entire from-scratch rewrite. Jan’s expertise and dedication to low-level debugging has yielded excellent results, producting an easy-to-use database for Java with comparable performance to many C-based engines. What sets MapDB apart is the “map” concept. The idea is to leverage the totally natural Java Collections API – so familiar to Java developers that most of them literally use it daily in their work. For most database interactions with a Java application, some sort of translator is required. There are many Object-Relational Mapping (ORM) tools to name just one category of such components. The goal has always been in the direction of making it natural to code in objects in the Java language, and translate them to a specific database syntax (such as SQL). However, such efforts have always come up short, adding complexity for both the application developer and the data architect. When using MapDB there is no object “translation layer” – developers just access data in familiar structures like Maps, Sets, Queues, etc. There is no change in syntax from typical Java coding, other than a brief initialization syntax and transaction management. A developer can literally transform memory-limited maps into a high-speed persistent store in seconds (typically changing just one line of code). A MapDB Example Here is a simple MapDB example, showing how easy and intuitive it is to use in a Java application: // Initialize a MapDB database DB db = DBMaker.newFileDB(new File("testdb")) .closeOnJvmShutdown() .make(); // Create a Map: Map myMap = db.getTreeMap(“testmap”); // Work with the Map using the normal Map API. myMap.put(“key1”, “value1”); myMap.put(“key2”, “value2”); String value = myMap.get(“key1”); ... That’s all you need to do, now you have a file-backed Map of virtually any size. Note the “builder-style” initialization syntax, enabling MapDB as the agile database choice for Java. There are many builder options that let you tune your database for the specific requirements at hand. Just a small subset of options include: In-memory implementation Enable transactions Configurable caching This means that you can configure your database just for what you need, effectively making MapDB serve the job of many other databases. MapDB comes with a set of powerful configuration options, and you can even extend the product to make your own data implementations if necessary. Another very powerful feature is that MapDB utilizes some of the advanced Java Collections variants, such as ConcurrentNavigableMap. With this type of Map you can go beyond simple key-value semantics, as it is also a sorted Map allowing you to access data in order, and find values near a key. Not many people are aware of this extension to the Collections API, but it is extremely powerful and allows you to do a lot with your MapDB database (I will cover more of these capabilities in a future article). The Agile Aspect of MapDB When I first met Jan and started talking with him about MapDB he said something that made a very important impression: If you know what data structure you want, MapDB allows you to tailor the structure and database characteristics to your exact application needs. In other words, the schema and ways you can structure your data is very flexible. The configuration of the physical data store is just as flexible, making a perfect combination for meeting almost any database need. They key to this capability is inherent in MapDB’s architecture, and how it translates to the MapDB API itself. Here is a simple diagram of the MapDB architecture: As you can see from the diagram, there are 3 tiers in MapDB: Collections API: This is the familiar Java Collections API that every Java developer uses for maintaining application state. It has a simple builder-style extension to allow you to control the exact characteristics of a given database (including its internal format or record structure). Engine: The Engine is the real key to MapDB, this is where the records for a database – including their internal structure, concurrency control, transactional semantics – are controlled. MapDB ships with several engines already, and it is straightforward to add your own Engine if needed for specialized data handling. Volume: This is the physical storage layer (e.g., on-disk or in-memory). MapDB has a few standard Volume implementations, and they should suffice for most projects. The main point is that the development API is completely distinct from the Engine implementation (the heart of MapDB), and both are separate from the actual physical storage layer. This offers a very agile approach, allowing developers to exactly control what type of internal structure is needed for a given database, and what the actual data structure looks like from the top-level Collections API. To make things even more extensible and agile, MapDB uses a concept of Engine Wrappers. An Engine Wrapper allows adding additional features and options on top of a specific engine layer. For example, if the standard Map engine is utilized for creating a B-Tree backed Map, it is feasible to enable (or disable) caching support. This caching feature is done through an Engine Wrapper, and that is what shows up in the builder-style API used to configure a given database. While a whole article could be written just about this, the point here is that this adds to MapDB’s inherent agile nature. By way of example, here is how you configure a pure in-memory database, without transactional capabilities: // Initialize an in-memory MapDB database // without transactions DB db = DBMaker.newMemoryDB() .transactionDisable() .closeOnJvmShutdown() .make(); // Create a Map: Map myMap = db.getTreeMap(“testmap”); // Work with the Map using the normal Map API. myMap.put(“key1”, “value1”); myMap.put(“key2”, “value2”); String value = myMap.get(“key1”); ... That’s it! All that was needed was to change the DBMaker call to add the new options, everything else works exactly the same as in the example shown earlier. Agile Data Model In addition to customizing the features and performance characteristics of a given database instance, MapDB allows you to create an agile data model, with a schema exactly matching your application requirements. This is probably similar to how you write your code when creating standard Java in-memory structures. For example, let’s say you need to lookup a Person object by username, or by personID. Simply create a Person object and two Maps to meet your needs: public class Person { private Integer personID; private String username; ... // Setters and getters go here ... } // Create a Map of Person by username. Map personByUsernameMap = ... // Create a Map of Person by personID. Map personByPersonIDMap = ... This is a very trivial example, but now you can easily write to both maps for each new Person instance, and subsequently retrieve a Person by either key. Another interesting concept with MapDB data structures are some key extensions to the normal Java Collections API. A common requirement in applications is to have a Map with a key/value, and in addition to finding the value for a key to be able to perform the inverse: lookup the key for a given value. We can easily do this using the MapDB extension for bi-directional maps: // Create a primary map HTreeMap map = DBMaker.newTempHashMap(); // Create the inverse mapping for primary map NavigableSet> inverseMapping = new TreeSet>(); // Bind the inverse mapping to primary map, so it is auto-updated each time the primary map gets a new key/value Bind.mapInverse(map, inverseMapping); map.put(10L,"value2"); map.put(1111L,"value"); map.put(1112L,"value"); map.put(11L,"val"); // Now find a key by a given value. Long keyValue = Fun.filter(inverseMapping.get(“value2”); MapDB supports many constructs for the interaction of Maps or other collections, allowing you to create a schema of related structures that can automatically be kept in sync. This avoids a lot of scanning of structures, makes coding fast and convenient, and can keep things very fast. Wrapping it up I have shown a very brief introduction on MapDB and how the product works. As you can see its strengths are its use of the natural Java Collections API, the agile nature of the engine itself, and the support for virtually any type of data model or schema that your application needs. MapDB is freely available for any use under the Apache 2.0 license. To learn more, check out: www.mapdb.org.

June 5, 2014

by Cory Isaacson

· 28,643 Views · 3 Likes

Are There Any Differences Between Table Scan and Index Scan

Introduction As we all know, developers prefer Index Seek in the case of performance of a query, and we all know what Index Seek is and how it improves performance. In this article we are not going to discuss Index Seek. A table scan is performed on a table which does not have an Index upon it (a heap) – it looks at the rows in the table and an Index Scan is performed on an indexed table – the index itself. Here we are trying discuss a specified scenario related to Table Scan and Index Scan. Scenario Description Suppose we have a table object without any Index on it. The name of the Table Object is tbl_WithoutIndex and another Table Object called tbl_WuthIndex -- Heap Table Defination IF OBJECT_ID(N'dbo.tbl_WithoutIndex', N'U') IS NOT NULL BEGIN DROP TABLE [dbo].[tbl_WithoutIndex]; END GO CREATE TABLE [dbo].[tbl_WithoutIndex] ( EMPID INT NOT NULL, EMPNAME VARCHAR(50) NOT NULL ); GO -- Insert Some Records INSERT INTO [dbo].[tbl_WithoutIndex] (EMPID, EMPNAME) VALUES (101, 'Joydeep Das'), (102, 'Sukamal Jana'), (103, 'Ranajit Shinah'); GO -- Table with Index (Clustered Index for primary Key) IF OBJECT_ID(N'dbo.tbl_WithIndex', N'U') IS NOT NULL BEGIN DROP TABLE [dbo].[tbl_WithIndex]; END GO CREATE TABLE [dbo].[tbl_WithIndex] ( EMPID INT NOT NULL PRIMARY KEY, EMPNAME VARCHAR(50) NOT NULL ); GO -- Finding Index Name sp_helpindex tbl_WithIndex; index_name index_description index_keys PK__tbl_With__14CCD97D3AD6B8E2 clustered, unique, primary key located on PRIMARY EMPID -- Insert Some Records INSERT INTO [dbo].[tbl_WithIndex] (EMPID, EMPNAME) VALUES (101, 'Joydeep Das'), (102, 'Sukamal Jana'), (103, 'Ranajit Shinah'); GO Now we are going to compare the execution plan output SELECT * FROM [dbo].[tbl_WithoutIndex]; has no Index (Here I mean the clustered Index) the table is a Heap. So when we put the SELECT statement the entire table scanned. SELECT * FROM [dbo].[tbl_WithIndex]; Here he Table has a PRIMARY KEY, so it has a Clustered Index on it. But here we are not putting the Index columns on the WHERE clause, so the Clustered Index Scan Occurs. Close Observation of Execution Plan Please remember that the table has small number of records. Table Name Estimated IO Cost tbl_WithoutIndex 0.0032035 Tbl_WithIndex 0.003125 If we see the Estimated Operation Cost, it would be same for both the Query (0.0032853). Question in Mind Here we can see the Estimated Operation cost for both the Query is same. So Question is in the mind that, if a Index Scan occur we can drop the index and use the heap (in our case). So is there any other difference between them. How they are Differences Here we understand what the internal difference between Table Scan and Index Scan. When the table scan occurs MS SQL server reads all the Rows and Columns into memory. When the Index Scan occurs, it's going to read all the Rows and only the Columns in the index. Effects in Performance In case of performance the Table Scan and Index Scan both have the same output, if we have use the single table SELECT statement. But it differs in the case of Index Scan when we are going to JOIN table. Hope you like it.

June 5, 2014

by Joydeep Das

· 24,272 Views

Introducing Partitioned Collections for MongoDB Applications

TokuMX 1.5 is around the corner. The big feature will be something we discussed briefly when talking about replication changes in 1.4: partitioned collections. Before introducing the feature, I wanted to mention the following. Although TokuMX 1.5 is not available as of this writing, we would love to hear feedback on partitioned collections, which we think are wonderful for time-series data, as I describe below. If you are interested in trying out the feature, email [email protected] for a pre-release version of TokuMX 1.5. What is a partitioned collection? A partitioned collection is analogous to a partitioned table in relational databases. Oracle, MySQL, SQL Server, and Postgres all support partitioned tables. We are happy to bring this functionality to TokuMX. So, if the remainder of this blog is unclear, and you have friends in the office who are familiar with relational databases, you may want to ask them for more information :). Nevertheless, a partitioned collection is a collection that underneath the covers is broken into (or partitioned into) several individual collections, based on ranges of a “partition key”. From the application developer’s point of view, the collection is just another collection. Queries, inserts, updates, and deletes just work with no syntactical changes. Secondary indexes and replication work as well. But underneath the covers, the data will be broken into several collections, with each collection responsible for all data for a range of the partition key. If you are running TokuMX 1.4, a simple example is the oplog, which is a partitioned collection. Any normal query works just fine on the oplog. However, if you look in your data directory, you will see several .tokumx files named “local_oplog_rs_p…”. These files are the individual partitions that break up the data. Each partition stores a range of _id fields in the oplog. Why should I bother using a partitioned collection? This will be its own post with longer examples, but here is a summary. Partitioned collections have two big advantages: Large chunks of data can be deleted very efficiently by dropping partitions. The cost is that of performing an “rm” on some files in the filesystem. This is really fast and efficient. Queries that include the partition key may be isolated to individual partitions, and therefore run faster. This is similar to “query isolation” for shard keys. So, one scenario you may want a partitioned collection for is where the oldest data gets dropped periodically, and many queries benefit from a time based key. That will be a good fit. In short: time series data. If you have a time-series application where you want to keep a rolling period of data (e.g. the last 6 months worth), then using a partitioned collection will be great, and is preferable to using a TTL index or a capped collection. In a future blog post, I will expand on this. How do I use a partitioned collection in TokuMX 1.5? Basically, just like a normal collection, except with some commands added to create a partitioned collection, add partitions, and drop partitions. Below, I explain the shell commands added for this functionality. Our documentation contains the full commands so that they may be called by any driver’s runCommand method. Ok, so how do I create a partitioned collection? The first thing to consider is what your partition key should be. That is, what key do you want to use ranges of to partition your data? This key has similarities with a shard key. It should be a key that can be used to isolate partitions, the way a shard key is used to isolate shards (as explained here). Also, it should be a key that contains a range of data you would like to delete all at once. With time series data, that key will likely be a timestamp. In TokuMX, the partition key is always the primary key. To create a partitioned collection, “foo”, with a timestamp field, “ts”, used for the partitioning run the following: > db.createCollection("foo", { partitioned : 1 , primaryKey : { ts : 1 , _id : 1 } }) { "ok" : 1 } Note that in TokuMX, the primary key must have the _id field appended to it to ensure uniqueness. As a side note, we do not support hash based partitioning, only range based partitioning. Adding partitions? In TokuMX, partitions can only be appended to the end. Individual partitions cannot be split. So, say we have a collection that partitions on the _id field, where all _id’s happen to be integers. Suppose we have three partitions with the following ranges: _id <= 0 0 < _id <= 1000 _id > 1000 With this collection we cannot create a partition with the range 500 < _id <= 1000, because that would split the second partition. All we can do is add a new partition to the end, and “cap” the current last partition with a new maximum value. This new maximum value must be greater than or equal to the primary key (or in this case, _id) of the last partition’s last document. So, if the last partition’s last document has an _id of 2500, we can only partitions that create a range whose maximum is at least 2500. There are two ways to add a partition. The first method peeks at the last document in the current last partition, caps the partition with the primary key of that last document, and creates a new partition. To do so, one does: > db.foo.addPartition() { "ok" : 1 } In the above example, the partitioned collection would now have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 2500 _id > 2500 Alternatively, we can specify what the new maximum of the existing last partition may be, provided it is greater than the last document in last partition (which in this example is 2500). To do so, we simply pass in the new maximum as a parameter: > db.foo.addPartition({ _id : 3000 }); { "ok" : 1 } This would make the collection have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 3000 _id > 3000 Dropping partitions? Dropping partitions is simple. First, see what the partitions are with the following shell command: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(4), "partitions" : [ { "_id" : NumberLong(0), "max" : { "_id" : 0 }, "createTime" : ISODate("2014-05-29T01:50:15.839Z") }, { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This lists each partition, what the maximum value that each partition may hold (thus defining the range of the partition), and the id of the partition (in the _id field). So, in the example we used for adding partitions, we have four partitions with _ids 0 through 3. To drop a partition, we run the following command and pass the _id of the partition we want to drop. To drop partition 0, we run: > db.foo.dropPartition(0) { "ok" : 1 } Looking at the list of partitions after this operation, we see the partition is dropped: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(3), "partitions" : [ { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This covers how to use partitioned collections. We hope users in the MongoDB ecosystem find this feature as useful as relational database users do. In the comments section below, feel free to leave questions and/or feedback.

June 4, 2014

by Zardosht Kasheff

· 12,504 Views

Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel

Explore different message brokers, and discover how these important web technologies impact a customer's backlog of messages, and cluster/data performance.

June 3, 2014

by Yves Trudeau

· 460,763 Views · 86 Likes

MySQL Transaction Isolation Levels and Locks

Originally written by Lim Han Recently, an application that my team was working on encountered problems with a MySQL deadlock situation and it took us some time to figure out the reasons behind it. This application that we deployed was running on a 2-node cluster and they both are connected to an AWS MySQL database. The MySQL db tables are mostly based on InnoDB which supports transaction (meaning all the usual commit and rollback semantics) as well as row-level locking that MyISAM engine does not provide. So the problem arose when our users, due to some poorly designed user interface, was able to execute the same long running operation twice on the database. As it turned out, due to the fact that we have a dual node cluster, each of the user operation originated from a different web application (which in turn meant 2 different transaction running the same queries). The deadlock query happened to be a “INSERT INTO T… SELECT FROM S WHERE” query that introduced shared locks on the records that were used in the SELECT query. It didn’t help that both T and S in this case happened to be the same table. In effect, both the shared locks and exclusive locks were applied on the same table. An attempt to explain the possible cause of the deadlock on the queries could be explained by the following table. This is based on the assumption that we are using a default REPEATABLE_READ transaction isolation level (I will explain the concept of transaction isolation later) Assuming that we have a table as such RowId Value 1 Collection 1 2 Collection 2 … Collection N 450000 Collection 450000 The following is a sample sequence that could possibly cause a deadlock based on the 2 transactions running an SQL query like “INSERT INTO T SELECT FROM T WHERE … “ : Time Transaction 1 Transaction 2 Comment T1 Statement executed Statement executed. A shared lock is applied to records that are read by selection T2 Read lock s1 on Row 10-20 The lock on the index across a range. InnoDB has a concept of gap locks. T3 Statement executed Transaction 2 statement executed. Similar shared lock to s1 applied by selection T4 Read lock s2 on Row 10-20 Shared read locks allow both transaction to read the records only T5 Insert lock x1 into Row 13 in index wanted Transaction 1 attempts to get exclusive lock on Row 13 for insertion but Transaction 2 is holding a shared lock T6 Insert lock x2 into Row 13 in index wanted Transaction 2 attempts to get exclusive lock on Row 13 for insertion but Transaction 1 is holding a shared lock T7 Deadlock! The above scenario occurs only when we use REPEATABLE_READ (which introduces shared read locks). If we were to lower the transaction isolation level to READ_COMMITTED, we would reduce the chances of a deadlock happening. Of course, this would mean relaxing the consistency of the database records. In the case of our data requirements, we do not have such strict requirements for strong consistency. Thus, it is acceptable for one transaction to read records that are committed by other transactions. So, to delve deeper into the idea of Transaction Isolation, this concept has been defined by ANSI/ISO SQL as the following from highest isolation levels to lowest Serializable This is the highest isolation level and usually requires the use of shared read locks and exclusive write locks (as in the case of MySQL). What this means in essence that any query made will require access to a shared read lock on the records which prevents another transaction’s query to modify these records. Every update statement will require access to an exclusive write lock Also, range-locks must be acquired when a select statement with a WHERE condition is used. This is implemented as a gap lock in MySQL. Repeatable Reads This is the default level used in MySQL. This is mainly similar to Serializable beside the fact that a range lock is not used. However, the way that MySQL implements this level seemed to me a little different. Based on Wikipedia’s article on Transaction Isolation, a range lock is not implemented and so phantom reads can still occur. Phantom reads refer to a possibility that select queries will have additional records when the same query is made within a transaction. However, what I understand from MySQL’s document is that range locks are still used and the same select queries made in the same transaction will always return the same records. Maybe I’m mistaken in my understanding and if there’s any mistakes in my intepretations, I stand ready to be corrected. Read Committed This is an isolation level that will maintain a write lock until the end of the transaction but read locks will be released at the end of the SELECT statement. It does not promise that a SELECT statement will find the same data if it is re-run again in the same transaction. It will, however, guarantee that the data that is read are not “dirty” and has been committed. Read Uncommitted This is an isolation level that I doubt would be useful for most use cases. Basically, it allows a transaction to see all data that has been modified, including “dirty” or uncommitted data. This is the lowest isolation level Having gone through the different transaction isolation levels, we could see how the selection of the Transaction Isolation level determines the kind of database locking mechanism. From a practical standpoint, the default MySQL isolation level (REPEATABLE_READ) might not always be a good choice when you are dealing with a scenario like ours where there is really no need for such strong consistency in the data reads. I believe that by lowering the isolation level, it is likely to reduce chances that your database queries meet with a deadlock. Also, it might even allow a higher concurrent access to your database which improve the performance level of your queries. Of course, this comes with the caveat that you need to understand how important consistent reads are for your application. If you are dealing with data where precision is paramount (e.g. your bank accounts), then it is definitely necessary to impose as much isolation as possible so that you would not read inconsistent information within your transaction.

June 2, 2014

by Anh Tuan Nguyen

· 18,243 Views · 1 Like

Neo4j & Cypher: Rounding a Float Value to Decimal Places

About 6 months ago, Jacqui Read created a github issue explaining how she wanted to round a float value to a number of decimal places but was unable to do so due to the round function not taking the appropriate parameter. I found myself wanting to do the same thing last week where I initially had the following value: RETURN toFloat("12.336666") AS value I wanted to round that to 2 decimal places and Wes suggested multiplying the value before using ROUND and then dividing afterwards to achieve that. For two decimal places we need to multiply and divide by 100: WITH toFloat("12.336666") AS value RETURN round(100 * value) / 100 AS value 12.34 Michael suggested abstracting the number of decimal places like so: WITH 2 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value If we want to round to 4 decimal places we can easily change that: WITH 4 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value 12.3367 The code is available as a graph gist too if you want to play around with it. And you may as well check out the other gists while you’re at it – enjoy!

June 2, 2014

by Mark Needham

· 7,033 Views

Connecting to Cassandra from Java

In this post, I look at the basics of connecting to a Cassandra database from a Java client. I will use the DataStax Java Client JAR in order to do so.

May 30, 2014

by Dustin Marx

· 121,318 Views · 3 Likes

Performance Tuning of Spring/Hibernate Applications

For most typical Spring/Hibernate enterprise applications, the application performance depends almost entirely on the performance of it's persistence layer. This post will go over how to confirm that we are in presence of a 'database-bound' application, and then walk through 7 frequently used 'quick-win' tips that can help improve application performance. How to confirm that an application is 'database-bound' To confirm that an application is 'database-bound', start by doing a typical run in some development environment, using VisualVM for monitoring. VisualVM is a Java profiler shipped with the JDK and launchable via the command line by calling jvisualvm. After launching Visual VM, try the following steps: double click on your running application Select Sampler click on Settings checkbox Choose Profile only packages, and type in the following packages: your.application.packages.* org.hibernate.* org.springframework.* your.database.driver.package, for example oracle.* Click Sample CPU The CPU profiling of a typical 'database-bound' application should look something like this: We can see that the client Java process spends 56% of it's time waiting for the database to return results over the network. This is a good sign that the queries on the database are what's keeping the application slow. The 32.7% in Hibernate reflection calls is normal and nothing much can be done about it. First step for tuning - obtaining a baseline run The first step to do tuning is to define a baseline run for the program. We need to identify a set of functionally valid input data that makes the program go through a typical execution similar to the production run. The main difference is that the baseline run should run in a much shorter period of time, as a guideline an execution time of around 5 to 10 minutes is a good target. What makes a good baseline? A good baseline should have the following characteristics: it's functionally correct the input data is similar to production in it's variety it completes in a short amount of time optimizations in the baseline run can be extrapolated to a full run Getting a good baseline is solving half of the problem. What makes a bad baseline? For example, in a batch run for processing call data records in a telecommunications system, taking the first 10 000 records could be the wrong approach. The reason being, the first 10 000 might be mostly voice calls, but the unknown performance problem is in the processing of SMS traffic. Taking the first records of a large run would lead us to a bad baseline, from which wrong conclusions would be taken. Collecting SQL logs and query timings The SQL queries executed with their execution time can be collected using for example log4jdbc. See this blog post for how to collect SQL queries using log4jdbc - Spring/Hibernate improved SQL logging with log4jdbc. The query execution time is measured from the Java client side, and it includes the network round-trip to the database. The SQL query logs look like this: 16 avr. 2014 11:13:48 | SQL_QUERY /* insert your.package.YourEntity */ insert into YOUR_TABLE (...) values (...) {executed in 13 msec} The prepared statements themselves are also a good source of information - they allow to easily identify frequent query types. They can be logged by following this blog post - Why and where is Hibernate doing this SQL query? What metrics can be extracted from SQL logs The SQL logs can give the answer these questions: What are slowest queries being executed? What are the most frequent queries? What is the amount of time spent generating primary keys? Is there some data that could benefit from caching ? How to parse the SQL logs Probably the only viable option for large log volumes is to use command line tools. This approach has the advantage of being very flexible. At the expense of writing a small script or command, we can extract mostly any metric needed. Any command line tool will work as long as you are comfortable with it. If you are used to the Unix command line, bash might be a good option. Bash can be used also in Windows workstations, using for example Cygwin, or Git that includes a bash command line. Frequently applied Quick-Wins The quick-wins bellow identify common performance problems in Spring/Hibernate applications, and their corresponding solutions. Quick-win Tip 1 - Reduce primary key generation overhead In processes that are 'insert-intensive', the choice of a primary key generation strategy can matter a lot. One common way to generate id's is to use database sequences, usually one per table to avoid contention between inserts on different tables. The problem is that if 50 records are inserted, we want to avoid that 50 network round-trips are made to the database in order to obtain 50 id's, leaving the Java process hanging most of the time. How does Hibernate usually handle this? Hibernate provides new optimized ID generators that avoid this problem. Namely for sequences, a HiLo id generator is used by default. This is how the HiLo sequence generator it works: call a sequence once and get 1000 (the High value) calculate 50 id's like this: 1000 * 50 + 0 = 50000 1000 * 50 + 1 = 50001 ... 1000 * 50 + 49 = 50049, Low value (50) reached call sequence for new High value 1001 ... etc ... So from a single sequence call, 50 keys where generated, reducing the overhead caused my inumerous network round-trips. These new optimized key generators are on by default in Hibernate 4, and can even be turned off if needed by setting hibernate.id.new_generator_mappings to false. Why can primary key generation still be a problem? The problem is, if you declared the key generation strategy as AUTO, the optimized generators are still off, and your application will end up with a huge amount of sequence calls. In order to make sure the new optimized generators are on, make sure to use the SEQUENCE strategy instead of AUTO: @Id @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "your_key_generator") private Long id; With this simple change, an improvement in the range of 10%-20% can be measured in 'insert-intensive' applications, with basically no code changes. Quick-win Tip 2 - Use JDBC batch inserts/updates For batch programs, JDBC drivers usually provide an optimization for reducing network round-trips named 'JDBC batch inserts/updates'. When these are used, inserts/updates are queued at the driver level before being sent to the database. When a threshold is reached, then the whole batch of queued statements is sent to the database in one go. This prevents the driver from sending the statements one by one, which would waist multiple network round-trips. This is the entity manager factory configuration needed to active batch inserts/updates: 100 true true Setting only the JDBC batch size won't work. This is because the JDBC driver will batch the inserts only when receiving insert/updates for the exact same table. If an insert to a new table is received, then the JDBC driver will first flush the batched statements on the previous table, before starting to batch statements on the new table. A similar functionality is implicitly used if using Spring Batch. This optimization can easily buy you 30% to 40% to 'insert intensive' programs, without changing a single line of code. Quick-win Tip 3 - Periodically flush and clear the Hibernate session When adding/modifying data in the database, Hibernate keeps in the session a version of the entities already persisted, just in case they are modified again before the session is closed. But many times we can safely discard entities once the corresponding inserts where done in the database. This releases memory in the Java client process, preventing performance problems caused by long running Hibernate sessions. Such long-running sessions should be avoided as much as possible, but if by some reason they are needed, this is how to contain memory consumption: entityManager.flush(); entityManager.clear(); The flush will trigger the inserts from new entities to be sent to the database. The clear releases the new entities from the session. Quick-win Tip 4 - Reduce Hibernate dirty-checking overhead Hibernate uses internally a mechanism to keep track of modified entities called dirty-checking. This mechanism is not based on the equals and hashcode methods of the entity classes. Hibernate does it's most to keep the performance cost of dirty-checking to a minimum, and to dirty-check only when it needs to, but the mechanism does have a cost, which is more noticeable in tables with a large number of columns. Before applying any optimization, the most important is to measure the cost of dirty-checking using VisualVM. How to avoid dirty-checking? In Spring business methods that we know are read-only, dirty-checking can be turned off like this: @Transactional(readOnly=true) public void someBusinessMethod() { .... } An alternative to avoid dirty-checking is to use the Hibernate Stateless Session, which is detailed in the documentation. Quick-win Tip 5 - Search for 'bad' query plans Check the queries in the slowest queries list to see if they have good query plans. The most usual 'bad' query plans are: Full table scans: they happen when the table is being fully scanned due to usually a missing index or outdated table statistics. Full cartesian joins: This means that the full cartesian product of several tables is being computed. Check for missing join conditions, or if this can be avoided by splitting a step into several. Quick-win Tip 6 - check for wrong commit intervals If you are doing batch processing, the commit interval can make a large difference in the performance results, as in 10 to 100 times faster. Confirm that the commit interval is the one expected (usually around 100-1000 for Spring Batch jobs). It happens often that this parameter is not correctly configured. Quick-win Tip 7 - Use the second-level and query caches If some data is identified as being eligible for caching, then have a look at this blog post for how to setup the Hibernate caching: Pitfalls of the Hibernate Second-Level / Query Caches Conclusions To solve application performance problems, the most important action to take is to collect some metrics that allow to find what the current bottleneck is. Without some metrics it is often not possible to guess in useful time what the correct problem cause is. Also, many but not all of the typical performance pitfalls of a 'database-driven' application can be avoided in the first place by using the Spring Batch framework.

May 29, 2014

by Vasco Cavalheiro

· 38,394 Views · 3 Likes

Implementing Correlation ids in Spring Boot (for Distributed Tracing in SOA/Microservices)

After attending Sam Newman’s microservice talks at Geecon last week I started to think more about what is most likely an essential feature of service-oriented/microservice platforms for monitoring, reporting and diagnostics: correlation ids. Correlation ids allow distributed tracing within complex service oriented platforms, where a single request into the application can often be dealt with by multiple downstream service. Without the ability to correlate downstream service requests it can be very difficult to understand how requests are being handled within your platform. I’ve seen the benefit of correlation ids in several recent SOA projects I have worked on, but as Sam mentioned in his talks, it’s often very easy to think this type of tracing won’t be needed when building the initial version of the application, but then very difficult to retrofit into the application when you do realise the benefits (and the need for!). I’ve not yet found the perfect way to implement correlation ids within a Java/Spring-based application, but after chatting to Sam via email he made several suggestions which I have now turned into a simple project using Spring Boot to demonstrate how this could be implemented. Why? During both of Sam’s Geecon talks he mentioned that in his experience correlation ids were very useful for diagnostic purposes. Correlation ids are essentially an id that is generated and associated with a single (typically user-driven) request into the application that is passed down through the stack and onto dependent services. In SOA or microservice platforms this type of id is very useful, as requests into the application typically are ‘fanned out’ or handled by multiple downstream services, and a correlation id allows all of the downstream requests (from the initial point of request) to be correlated or grouped based on the id. So called ‘distributed tracing’ can then be performed using the correlation ids by combining all the downstream service logs and matching the required id to see the trace of the request throughout your entire application stack (which is very easy if you are using a centralised logging framework such as logstash) The big players in the service-oriented field have been talking about the need for distributed tracing and correlating requests for quite some time, and as such Twitter have created their open source Zipkin framework (which often plugs into their RPC framework Finagle), and Netflix has open-sourced their Karyon web/microservice framework, both of which provide distributed tracing. There are of course commercial offering in this area, one such product being AppDynamics, which is very cool, but has a rather hefty price tag. Creating a proof-of-concept in Spring Boot As great as Zipkin and Karyon are, they are both relatively invasive, in that you have to build your services on top of the (often opinionated) frameworks. This might be fine for some use cases, but no so much for others, especially when you are building microservices. I’ve been enjoying experimenting with Spring Boot of late, and this framework builds on the much known and loved (at least by me :-) ) Spring framework by providing lots of preconfigured sensible defaults. This allows you to build microservices (especially ones that communicate via RESTful interfaces) very rapidly. The remainder of this blog pos explains how I implemented a (hopefully) non-invasive way of implementing correlation ids. Goals Allow a correlation id to be generated for a initial request into the application Enable the correlation id to be passed to downstream services, using as method that is as non-invasive into the code as possible Implementation I have created two projects on GitHub, one containing an implementation where all requests are being handled in a synchronous style (i.e. the traditional Spring approach of handling all request processing on a single thread), and also one for when an asynchronous (non-blocking) style of communication is being used (i.e., using the Servlet 3 asynchronous support combined with Spring’s DeferredResult and Java’s Futures/Callables). The majority of this article describes the asynchronous implementation, as this is more interesting: Spring Boot asynchronous (DeferredResult + Futures) communication correlation id Github repo The main work in both code bases is undertaken by the CorrelationHeaderFilter, which is a standard Java EE Filter that inspects the HttpServletRequest header for the presence of a correlationId. If one is found then we set a ThreadLocal variable in the RequestCorrelation Class (discussed later). If a correlation id is not found then one is generated and added to the RequestCorrelation Class: public class CorrelationHeaderFilter implements Filter { //... @Override public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException { final HttpServletRequest httpServletRequest = (HttpServletRequest) servletRequest; String currentCorrId = httpServletRequest.getHeader(RequestCorrelation.CORRELATION_ID_HEADER); if (!currentRequestIsAsyncDispatcher(httpServletRequest)) { if (currentCorrId == null) { currentCorrId = UUID.randomUUID().toString(); LOGGER.info("No correlationId found in Header. Generated : " + currentCorrId); } else { LOGGER.info("Found correlationId in Header : " + currentCorrId); } RequestCorrelation.setId(currentCorrId); } filterChain.doFilter(httpServletRequest, servletResponse); } //... private boolean currentRequestIsAsyncDispatcher(HttpServletRequest httpServletRequest) { return httpServletRequest.getDispatcherType().equals(DispatcherType.ASYNC); } The only thing is this code that may not instantly be obvious is the conditional checkcurrentRequestIsAsyncDispatcher(httpServletRequest), but this is here to guard against the correlation id code being executed when the Async Dispatcher thread is running to return the results (this is interesting to note, as I initially didn’t expect the Async Dispatcher to trigger the execution of the filter again?) Here is the RequestCorrelation Class, which contains a simple ThreadLocal static variable to hold the correlation id for the current Thread of execution (set via the CorrelationHeaderFilter above) public class RequestCorrelation { public static final String CORRELATION_ID = "correlationId"; private static final ThreadLocal id = new ThreadLocal(); public static String getId() { return id.get(); } public static void setId(String correlationId) { id.set(correlationId); } } Once the correlation id is stored in the RequestCorrelation Class it can be retrieved and added to downstream service requests (or data store access etc) as required by calling the static getId() method within RequestCorrelation. It is probably a good idea to encapsulate this behaviour away from your application services, and you can see an example of how to do this in a RestClient Class I have created, which composes Spring’s RestTemplate and handles the setting of the correlation id within the header transparently from the calling Class. @Component public class CorrelatingRestClient implements RestClient { private RestTemplate restTemplate = new RestTemplate(); @Override public String getForString(String uri) { String correlationId = RequestCorrelation.getId(); HttpHeaders httpHeaders = new HttpHeaders(); httpHeaders.set(RequestCorrelation.CORRELATION_ID, correlationId); LOGGER.info("start REST request to {} with correlationId {}", uri, correlationId); //TODO: error-handling and fault-tolerance in production ResponseEntity response = restTemplate.exchange(uri, HttpMethod.GET, new HttpEntity(httpHeaders), String.class); LOGGER.info("completed REST request to {} with correlationId {}", uri, correlationId); return response.getBody(); } } //... calling Class public String exampleMethod() { RestClient restClient = new CorrelatingRestClient(); return restClient.getForString(URI_LOCATION); //correlation id handling completely abstracted to RestClient impl } Making this work for asynchronous requests… The code included above works fine when you are handling all of your requests synchronously, but it is often a good idea in a SOA/microservice platform to handle requests in a non-blocking asynchronous manner. In Spring this can be achieved by using the DeferredResult Class in combination with the Servlet 3 asynchronous support. The problem with using ThreadLocal variables within the asynchronous approach is that the Thread that initially handles the request (and creates the DeferredResult/Future) will not be the Thread doing the actual processing. Accordingly, a bit of glue code is needed to ensure that the correlation id is propagated across the Threads. This can be achieved by extending Callable with the required functionality: (don’t worry if example Calling Class code doesn’t look intuitive – this adaption between DeferredResults and Futures is a necessary evil within Spring, and the full code including the boilerplate ListenableFutureAdapter is in my GitHub repo): public class CorrelationCallable implements Callable { private String correlationId; private Callable callable; public CorrelationCallable(Callable targetCallable) { correlationId = RequestCorrelation.getId(); callable = targetCallable; } @Override public V call() throws Exception { RequestCorrelation.setId(correlationId); return callable.call(); } } //... Calling Class @RequestMapping("externalNews") public DeferredResult externalNews() { return new ListenableFutureAdapter<>(service.submit(new CorrelationCallable<>(externalNewsService::getNews))); } And there we have it – the propagation of correlation id regardless of the synchronous/asynchronous nature of processing! You can clone the Github report containing my asynchronous example, and execute the application by running mvn spring-boot:run at the command line. If you access http://localhost:8080/externalNewsin your browser (or via curl) you will see something similar to the following in your Spring Boot console, which clearly demonstrates a correlation id being generated on the initial request, and then this being propagated through to a simulated external call (have a look in the ExternalNewsServiceRest Class to see how this has been implemented): [nio-8080-exec-1] u.c.t.e.c.w.f.CorrelationHeaderFilter : No correlationId found in Header. Generated : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : start REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 [nio-8080-exec-2] u.c.t.e.c.w.f.CorrelationHeaderFilter : Found correlationId in Header : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : completed REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 Conclusion I’m quite happy with this simple prototype, and it does meet the two goals I listed above. Future work will include writing some tests for this code (shame on me for not TDDing!), and also extend this functionality to a more realistic example. I would like to say a massive thanks to Sam, not only for sharing his knowledge at the great talks at Geecon, but also for taking time to respond to my emails. If you’re interested in microservices and related work I can highly recommend Sam’s Microservice book which is available in Early Access at O’Reilly. I’ve enjoyed reading the currently available chapters, and having implemented quite a few SOA projects recently I can relate to a lot of the good advice contained within. I’ll be following the development of this book with keen interest! If you have any comments or thoughts then please do share them via the comment below, or feel free to get in touch via the usual mechanisms! References I used Tomasz Nurkiewicz’s excellent blog several times for learning how best to wire up all of the DeferredResult/Future code in Spring: http://www.nurkiewicz.com/2013/03/deferredresult-asynchronous-processing.html

May 29, 2014

by Daniel Bryant

· 24,580 Views · 1 Like

Visualizing XML as a Graph Using Neo4j

neo4j is quite awesome for pretty much everything . here is an extract i found from this blog whenever someone gives you a problem, think graphs . they are the most fundamental and flexible way of representing any kind of a relationship, so it’s about a 50-50 shot that any interesting design problem has a graph involved in it. make absolutely sure you can’t think of a way to solve it using graphs before moving on to other solution types. as a baby step, let us try to visualize an xml as a graph. we can start with a simple xml here, belgian waffles $5.95 two of our famous belgian waffles with plenty of real maple syrup 650 it is a simple ‘breakfast menu’ with just one item, ‘ belgian waffles ‘. neo4j has a neat interface which lets us visualize the graph we created. the graph for the above xml looks slick.. the above xml has 6 tags and we have 6 nodes in our graph. the nested tags/nodes has a ‘child-of’ relationship with the parent tag/node. in the graph above, node numbered 0 is the root node – corresponds to the tag node numbered 1 is the immediate child of the root – corresponds to the tag nodes numbered from 2-4 are the tags which come under tag, viz , , and transforming xml.. let us parse the xml into a data structure which can be easily persisted using neo4j . personally, i would prefer sax parser as there are serious memory constraints when creating a dom object (really painful if you talking big data as xml). additionally sax parsing gives you unlimited freedom to do whatever you want to do with the xml. this is how i represent a node, using the object xmlelement . each xmlelement is identified by an id. this is basically an integer or long. only thing to make sure is that the number has to be unique across all xmlelements. publicclassxmlelement { privatestring tagname; privatestring tagvalue; privatemap attributes = newhashmap(); privatehierarchyidentifier hierarchyidentifier; privateintparentid; privatebooleanpersisted; //setters and getters for the members publicstring getatrributestring(){ objectmapper jsonmapper = newobjectmapper(); try{ returnjsonmapper.writevalueasstring(this.attributes); } catch(jsonprocessingexception e) { logger.severe(e.getmessage()); e.printstacktrace(); } returnnull; } } the xml tag name and value are stored as string and the attributes are stored as a map. also, there is another member object, hierarchyidentifier publicclasshierarchyidentifier { privateintdepth; privateintwidth; privateintid; //getters and setters for member element } the hierarchyidentifier class contains the id class which is used to identify an xmlelement , which translates to identifying an xml tag. each xmlelement has the id of it’s parent stored. and after the parse, the xml should be represented as a map of and each xmlelement identified by the corresponding integer id. you can see the sax parser here .. so we pass the map to the graphwriter object. we create a node with the following properties value – the value of the xml tag id – the id of the current node parent – the id of the parent tag attributes – the string representation of the tag attributes also, the node has the following labels node – for all nodes parent – for the seed ( highest parent ) node xml tag name – say, the node for the tag will have label food currently, there is only a single type of relationship, ‘child_of’ , between immediate nested tags. see the below the example for a bigger xml belgian waffles $5.95 two of our famous belgian waffles with plenty of real maple syrup 650 masala dosa $10.95 south india's famous slim pancake with mashed potatoes 650 25 i have added a second item to the ‘breakfast_menu’.. the legendary masala dosa . also, the second food item has a new child tag, . and the resultant graph looks prettier the sets of tags,relationships and properties can be seen in the neo4j local server interface. the entire codebase can be found here in github .. visualizing as a graphgist an easy representation can be done in graphgist. i have created the graph of this xml. here is the link to the gist . also, the gist script can be found in the github gist here ...

May 29, 2014

by Nikhil Kuriakose

· 25,914 Views

Debunking Myths About the VoltDB In-Memory Database

Originally written by John Hugg I’d like to take a quick moment to address some myths and misconceptions about VoltDB. Many people selling products who view VoltDB as competition seem to be repeating them. As you’ll read, much of what’s said is just plain FUD. VoltDB is an in-memory database that has benchmarked at over 3 million transactions a second on bare metal, and recently crushed previous performance records in the cloud, posting eye-popping YCSB (Yahoo! Cloud Service Benchmark) numbers on AWS, Amazon’s cloud platform. It’s fully transactional, supports full disk durability, and has very low latency and fully native high availability. It’s not surprising competing interests look for FUD. Note the comments below apply to the current shipping version of VoltDB at the time of this posting (4.2), unless otherwise noted. Myth #1: “VoltDB requires stored procedures.” This was true for 1.0, but no one seems to notice it’s been false since we shipped 1.1 in 2010. VoltDB supports unforeseen SQL without any stored procedure use. We have users in production who have never used a single stored procedure. SQL queries can be run through: JDBC Our command line SQL console A web portal built into every server The HTTP/JSON interface Native client drivers written in C++, C#, PHP, Python, Node.js and even Java. Community-provided language drivers ODBC drivers (in development to ship in Q2) There is one restriction: VoltDB doesn’t support external transaction control; VoltDB is permanently in auto-commit mode. Note this offers more transactionality than most NoSQL systems, and is a shared restriction of several NewSQL or “In-Memory” SQL systems. For those systems that do offer external transaction control, I’m not aware of any that publish benchmarks where transactions require multiple round trips to the client. This is because it’s generally understood to be a bad thing in a performance-critical system. In addition to ad-hoc SQL, VoltDB auto-generates optimized procedures for various CRUD operations (create, read, update, delete). For example, to fetch a row by primary key from table FOO, I can call a procedure named “FOO.select” with the key value as a parameter. Myth #2: “VoltDB doesn’t support ad-hoc SQL.” This is just a rephrasing of Myth #1 and is still false. Myth #3: “VoltDB is slow unless I use stored procedures.” Well, no. VoltDB can run faster with stored procedures, but it’s still fast if they are not used. In our internal benchmarks on pretty cheap single-socket hardware, we can run about 50k write statements per second, per host with full durability. That’s a bit lower than we can achieve with stored procedures, so we’re doing some engineering work to close the gap. Still, on a 3-node cluster costing about $3,000 for hardware, we can run well over 100k SQL updates per second with durability and double redundancy. If your current system can run that fast, let us know. Also, the auto-generated CRUD procedures mentioned above run at the full speed of pre-optimized procedures. Using just CRUD, VoltDB is the fastest durable Key-Value store I’m aware of. Myth #4: “I have to know Java to use VoltDB.” As of VoltDB 3.0, released over a year ago, (we’re on V4.2 today), a user can build VoltDB apps and run the server without ever directly interacting with the Java CLI tools or any Java code. Client apps can also be written without Java. As mentioned previously, VoltDB supports client drivers in many languages, with community-contributed drivers in others. If you do want to leverage VoltDB’s ability to perform multi-statement, transactional logic, only the most basic understanding of Java is needed to build stored procedures. We provide example applications in our kit that can be copied from and/or modified. We even support declaring procedures in Groovy right in your DDL. Myth #5: “VoltDB has garbage collection problems because it is written in Java.” The short answer is this is just not true. First, VoltDB uses native C++ code for all of its data storage and SQL execution paths. This is done to avoid putting long-lived data on the Java heap, but also to allow very fine control of memory use. VoltDB’s special native data structures use less memory than competing systems and actively fight allocation fragmentation. What’s left on the Java heap is either near-immortal configuration and setup data, or very short-lived per-transaction data. This is almost the perfect workload for the Java garbage collector, so there is minimal impact to the user’s workload. For more detail, read Ariel Weisberg’s excellent blog post on the topic. Myth #6: “VoltDB’s SQL is very limited.” This misconception is based on historical truth, but is way out of date in 2014. The original H-Store project, upon which VoltDB is based, started with very limited, OLTP-focused SQL, but much has changed in the past years. VoltDB is rapidly approaching SQL-92 functionality for DML and DQL SQL. One of the final holdouts, subselects, shipped in part in 4.2; more functionality will ship in upcoming versions. In addition to this ANSI SQL baseline, we’ve added materialized view support, function-based index support, native JSON functionality and more. We have a lot more coming in 2014. We’re proud to announce all of this is fully Unicode-compliant and is in production in several uses in countries with multi-byte character sets. Myth #7: “Yes, VoltDB supports cross-partition transactions, but they are too slow to use.” Yes we do, and no they aren’t too slow to use. Keep in mind that NoSQL systems often don’t offer cross-partition transactions and many other NewSQL systems have functionality or performance limitations. Few benchmarks stress this functionality, so I’ll be as clear as possible here. Traditionally, VoltDB’s cross-partition transactions have been slower than our partitioned operations. Additional coordination needs to happen to make these transactions ACID at the “serializable” isolation level at which all VoltDB transactions run. Furthermore, these operations don’t scale with cluster size and sometimes can vary in performance, depending on your networking performance. But are they too slow to use? I’ll let you be the judge. Cross-partition writes run on the order of 1,000 per second,regardless of cluster size. This number can vary by an order of magnitude depending on hardware. Cross partition reads run on the order of 50,000 per second, regardless of cluster size. Again, there is some variability, but less than with writes. It’s possible to send writes to each partition individually using the “Send Everywhere” pattern. These writes are transactional at each partition, but individual partitions might fail and rollback independently of others. A common pattern we see is a mix of single-partition writes and cross-partition reads. Cross-partition writes are often used to update infrequently changing lookup tables. Of course at 1000 per second, they can still change fairly frequently. Note that if your read operation is scanning millions of tuples, you won’t see the same throughput, but if it’s populating a schema supported leaderboard or computing a time window statistic using a materialized view, you will. The bottom line is that most of our customers not only use cross-partition transactions, but also these kinds of operations are one of the key benefits VoltDB offers. We are always working to improve the performance of cross-partition operations, and in recent releases we have made a lot of progress. In 4.0 we introduced read optimizations that got us to the 50k/second number. We also have write optimizations in the pipeline. In Conclusion Many of these myths and misconceptions are firmly rooted in the world of traditional RDBMSs. VoltDB is a NewSQL database. When evaluating VoltDB, consider the non-traditional features that set us head and shoulders above traditional RDBMSs, NoSQL and other NewSQL options: Intelligent, asynchronous clients allow tremendous throughput per-client, as well as terrific management. Native export connectors push tuples into downstream systems like message queues, analytic data-warehouses and Hadoop Active-Active intra-cluster replication with sub-millisecond latency ensures perfectly consistent high availability and redundant durability. Seamless node failure and replacement without impacting serializable ACID semantics. Transactional cluster expansion with low performance impact and without impacting serializable ACID semantics provides near-limitless scalability. 1500 active, connected clients per server node with minimal latency impact meets the needs of most high-velocity, high transaction volume businesses. VoltDB is open source (https://github.com/VoltDB/voltdb) If you want to hear more, or if you have a use case you want to talk to us about, contact us. We’d love to hear from you. If you’d like to try VoltDB out yourself, you can download a free trial here.

May 28, 2014

by John Piekos

· 9,907 Views · 1 Like

Implementing Correlation IDs in Spring Boot (for Distributed Tracing in SOA/Microservices)

After attending Sam Newman’s microservice talks at Geecon last week I started to think more about what is most likely an essential feature of service-oriented/microservice platforms for monitoring, reporting and diagnostics: correlation ids. Correlation ids allow distributed tracing within complex service oriented platforms, where a single request into the application can often be dealt with by multiple downstream service. Without the ability to correlate downstream service requests it can be very difficult to understand how requests are being handled within your platform. I’ve seen the benefit of correlation ids in several recent SOA projects I have worked on, but as Sam mentioned in his talks, it’s often very easy to think this type of tracing won’t be needed when building the initial version of the application, but then very difficult to retrofit into the application when you do realise the benefits (and the need for!). I’ve not yet found the perfect way to implement correlation ids within a Java/Spring-based application, but after chatting to Sam via email he made several suggestions which I have now turned into a simple project using Spring Boot to demonstrate how this could be implemented. Why? During both of Sam’s Geecon talks he mentioned that in his experience correlation ids were very useful for diagnostic purposes. Correlation ids are essentially an id that is generated and associated with a single (typically user-driven) request into the application that is passed down through the stack and onto dependent services. In SOA or microservice platforms this type of id is very useful, as requests into the application typically are ‘fanned out’ or handled by multiple downstream services, and a correlation id allows all of the downstream requests (from the initial point of request) to be correlated or grouped based on the id. So called ‘distributed tracing’ can then be performed using the correlation ids by combining all the downstream service logs and matching the required id to see the trace of the request throughout your entire application stack (which is very easy if you are using a centralised logging framework such as logstash) The big players in the service-oriented field have been talking about the need for distributed tracing and correlating requests for quite some time, and as such Twitter have created their open source Zipkin framework (which often plugs into their RPC framework Finagle), and Netflix has open-sourced their Karyon web/microservice framework, both of which provide distributed tracing. There are of course commercial offering in this area, one such product being AppDynamics, which is very cool, but has a rather hefty price tag. Creating a proof-of-concept in Spring Boot As great as Zipkin and Karyon are, they are both relatively invasive, in that you have to build your services on top of the (often opinionated) frameworks. This might be fine for some use cases, but no so much for others, especially when you are building microservices. I’ve been enjoying experimenting with Spring Boot of late, and this framework builds on the much known and loved (at least by me :-) ) Spring framework by providing lots of preconfigured sensible defaults. This allows you to build microservices (especially ones that communicate via RESTful interfaces) very rapidly. The remainder of this blog pos explains how I implemented a (hopefully) non-invasive way of implementing correlation ids. Goals Allow a correlation id to be generated for a initial request into the application Enable the correlation id to be passed to downstream services, using as method that is as non-invasive into the code as possible Implementation I have created two projects on GitHub, one containing an implementation where all requests are being handled in a synchronous style (i.e. the traditional Spring approach of handling all request processing on a single thread), and also one for when an asynchronous (non-blocking) style of communication is being used (i.e., using the Servlet 3 asynchronous support combined with Spring’s DeferredResult and Java’s Futures/Callables). The majority of this article describes the asynchronous implementation, as this is more interesting: Spring Boot asynchronous (DeferredResult + Futures) communication correlation id Github repo The main work in both code bases is undertaken by the CorrelationHeaderFilter, which is a standard Java EE Filter that inspects the HttpServletRequest header for the presence of a correlationId. If one is found then we set a ThreadLocal variable in the RequestCorrelation Class (discussed later). If a correlation id is not found then one is generated and added to the RequestCorrelation Class: public class CorrelationHeaderFilter implements Filter { //... @Override public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException { final HttpServletRequest httpServletRequest = (HttpServletRequest) servletRequest; String currentCorrId = httpServletRequest.getHeader(RequestCorrelation.CORRELATION_ID_HEADER); if (!currentRequestIsAsyncDispatcher(httpServletRequest)) { if (currentCorrId == null) { currentCorrId = UUID.randomUUID().toString(); LOGGER.info("No correlationId found in Header. Generated : " + currentCorrId); } else { LOGGER.info("Found correlationId in Header : " + currentCorrId); } RequestCorrelation.setId(currentCorrId); } filterChain.doFilter(httpServletRequest, servletResponse); } //... private boolean currentRequestIsAsyncDispatcher(HttpServletRequest httpServletRequest) { return httpServletRequest.getDispatcherType().equals(DispatcherType.ASYNC); } The only thing is this code that may not instantly be obvious is the conditional check currentRequestIsAsyncDispatcher(httpServletRequest), but this is here to guard against the correlation id code being executed when the Async Dispatcher thread is running to return the results (this is interesting to note, as I initially didn’t expect the Async Dispatcher to trigger the execution of the filter again?) Here is the RequestCorrelation Class, which contains a simple ThreadLocal static variable to hold the correlation id for the current Thread of execution (set via the CorrelationHeaderFilter above) public class RequestCorrelation { public static final String CORRELATION_ID = "correlationId"; private static final ThreadLocal id = new ThreadLocal(); public static String getId() { return id.get(); } public static void setId(String correlationId) { id.set(correlationId); } } Once the correlation id is stored in the RequestCorrelation Class it can be retrieved and added to downstream service requests (or data store access etc) as required by calling the static getId() method within RequestCorrelation. It is probably a good idea to encapsulate this behaviour away from your application services, and you can see an example of how to do this in a RestClient Class I have created, which composes Spring’s RestTemplate and handles the setting of the correlation id within the header transparently from the calling Class. @Component public class CorrelatingRestClient implements RestClient { private RestTemplate restTemplate = new RestTemplate(); @Override public String getForString(String uri) { String correlationId = RequestCorrelation.getId(); HttpHeaders httpHeaders = new HttpHeaders(); httpHeaders.set(RequestCorrelation.CORRELATION_ID, correlationId); LOGGER.info("start REST request to {} with correlationId {}", uri, correlationId); //TODO: error-handling and fault-tolerance in production ResponseEntity response = restTemplate.exchange(uri, HttpMethod.GET, new HttpEntity(httpHeaders), String.class); LOGGER.info("completed REST request to {} with correlationId {}", uri, correlationId); return response.getBody(); } } //... calling Class public String exampleMethod() { RestClient restClient = new CorrelatingRestClient(); return restClient.getForString(URI_LOCATION); //correlation id handling completely abstracted to RestClient impl } Making this work for asynchronous requests… The code included above works fine when you are handling all of your requests synchronously, but it is often a good idea in a SOA/microservice platform to handle requests in a non-blocking asynchronous manner. In Spring this can be achieved by using the DeferredResult Class in combination with the Servlet 3 asynchronous support. The problem with using ThreadLocal variables within the asynchronous approach is that the Thread that initially handles the request (and creates the DeferredResult/Future) will not be the Thread doing the actual processing. Accordingly, a bit of glue code is needed to ensure that the correlation id is propagated across the Threads. This can be achieved by extending Callable with the required functionality: (don’t worry if example Calling Class code doesn’t look intuitive – this adaption between DeferredResults and Futures is a necessary evil within Spring, and the full code including the boilerplate ListenableFutureAdapter is in my GitHub repo): public class CorrelationCallable implements Callable { private String correlationId; private Callable callable; public CorrelationCallable(Callable targetCallable) { correlationId = RequestCorrelation.getId(); callable = targetCallable; } @Override public V call() throws Exception { RequestCorrelation.setId(correlationId); return callable.call(); } } //... Calling Class @RequestMapping("externalNews") public DeferredResult externalNews() { return new ListenableFutureAdapter<>(service.submit(new CorrelationCallable<>(externalNewsService::getNews))); } And there we have it – the propagation of correlation id regardless of the synchronous/asynchronous nature of processing! You can clone the Github report containing my asynchronous example, and execute the application by running mvn spring-boot:run at the command line. If you access http://localhost:8080/externalNews in your browser (or via curl) you will see something similar to the following in your Spring Boot console, which clearly demonstrates a correlation id being generated on the initial request, and then this being propagated through to a simulated external call (have a look in the ExternalNewsServiceRest Class to see how this has been implemented): [nio-8080-exec-1] u.c.t.e.c.w.f.CorrelationHeaderFilter : No correlationId found in Header. Generated : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : start REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 [nio-8080-exec-2] u.c.t.e.c.w.f.CorrelationHeaderFilter : Found correlationId in Header : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : completed REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 Conclusion I’m quite happy with this simple prototype, and it does meet the two goals I listed above. Future work will include writing some tests for this code (shame on me for not TDDing!), and also extend this functionality to a more realistic example. I would like to say a massive thanks to Sam, not only for sharing his knowledge at the great talks at Geecon, but also for taking time to respond to my emails. If you’re interested in microservices and related work I can highly recommend Sam’s Microservice book which is available in Early Access at O’Reilly. I’ve enjoyed reading the currently available chapters, and having implemented quite a few SOA projects recently I can relate to a lot of the good advice contained within. I’ll be following the development of this book with keen interest! If you have any comments or thoughts then please do share them via the comment below, or feel free to get in touch via the usual mechanisms! References I used Tomasz Nurkiewicz’s excellent blog several times for learning how best to wire up all of the DeferredResult/Future code in Spring: http://www.nurkiewicz.com/2013/03/deferredresult-asynchronous-processing.html

May 28, 2014

by Daniel Bryant

· 73,961 Views · 2 Likes