Databases Resources

The Latest Databases Topics

Jersey: Listing all Resources, Paths, Verbs to Build an Entry Point/Index for an API

I’ve been playing around with Jersey over the past couple of days, and one thing I wanted to do was create an entry point or index that lists all my resources, the available paths and the verbs they accept.

Guido Simone explained a neat way of finding the paths and verbs for a specific resource using Jersey’s IntrospectionModeller:

    AbstractResource resource = IntrospectionModeller.createResource(JacksonResource.class);
    System.out.println("Path is " + resource.getPath().getValue());

    String uriPrefix = resource.getPath().getValue();
    for (AbstractSubResourceMethod srm : resource.getSubResourceMethods()) {
        String uri = uriPrefix + "/" + srm.getPath().getValue();
        System.out.println(srm.getHttpMethod() + " at the path " + uri + " return " + srm.getReturnType().getName());
    }

If we run that against j4-minimal’s JacksonResource class we get the following output:

    Path is /jackson
    GET at the path /jackson/{who} return com.g414.j4.minimal.JacksonResource$Greeting
    GET at the path /jackson/awesome/{who} return javax.ws.rs.core.Response

That’s pretty neat, but I didn’t want to have to list all my resources manually since I’ve already done that using Guice. I needed a way to get hold of them programmatically, and I partially found the way to do this from this post, which suggests using Application.getSingletons().
I actually ended up using Application.getClasses(), which led me to ResourceListingResource:

    @Path("/")
    public class ResourceListingResource {
        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public Response showAll( @Context Application application, @Context HttpServletRequest request ) {
            String basePath = request.getRequestURL().toString();

            ObjectNode root = JsonNodeFactory.instance.objectNode();
            ArrayNode resources = JsonNodeFactory.instance.arrayNode();
            root.put( "resources", resources );

            for ( Class<?> aClass : application.getClasses() ) {
                if ( isAnnotatedResourceClass( aClass ) ) {
                    AbstractResource resource = IntrospectionModeller.createResource( aClass );
                    ObjectNode resourceNode = JsonNodeFactory.instance.objectNode();
                    String uriPrefix = resource.getPath().getValue();

                    // Methods that carry their own @Path annotation.
                    // joinUri is a small helper that is not shown in this post.
                    for ( AbstractSubResourceMethod srm : resource.getSubResourceMethods() ) {
                        String uri = uriPrefix + "/" + srm.getPath().getValue();
                        addTo( resourceNode, uri, srm, joinUri( basePath, uri ) );
                    }

                    // Methods that only use the class-level @Path.
                    for ( AbstractResourceMethod srm : resource.getResourceMethods() ) {
                        addTo( resourceNode, uriPrefix, srm, joinUri( basePath, uriPrefix ) );
                    }

                    resources.add( resourceNode );
                }
            }

            return Response.ok().entity( root ).build();
        }

        // Creates the entry for a URI on first use, then appends the method's HTTP verb to it.
        private void addTo( ObjectNode resourceNode, String uriPrefix, AbstractResourceMethod srm, String path ) {
            if ( resourceNode.get( uriPrefix ) == null ) {
                ObjectNode inner = JsonNodeFactory.instance.objectNode();
                inner.put( "path", path );
                inner.put( "verbs", JsonNodeFactory.instance.arrayNode() );
                resourceNode.put( uriPrefix, inner );
            }

            ((ArrayNode) resourceNode.get( uriPrefix ).get( "verbs" )).add( srm.getHttpMethod() );
        }

        // A class counts as a resource if it, or one of its interfaces, is annotated with @Path.
        private boolean isAnnotatedResourceClass( Class<?> rc ) {
            if ( rc.isAnnotationPresent( Path.class ) ) {
                return true;
            }

            for ( Class<?> i : rc.getInterfaces() ) {
                if ( i.isAnnotationPresent( Path.class ) ) {
                    return true;
                }
            }

            return false;
        }
    }

The only change I’ve made from Guido Simone’s solution is that I also call resource.getResourceMethods(), because resource.getSubResourceMethods() only returns methods which have a @Path annotation. Since we’ll sometimes define our path at the class level and then define different verbs that operate on that resource, relying on getSubResourceMethods() alone misses some methods.

If we run a cURL command (piped through python to make it look nice) against the root we get the following output:

    $ curl http://localhost:8080/ -w "\n" 2>/dev/null | python -mjson.tool
    {
        "resources": [
            {
                "/bench": {
                    "path": "http://localhost:8080/bench",
                    "verbs": [
                        "GET",
                        "POST",
                        "PUT",
                        "DELETE"
                    ]
                }
            },
            {
                "/sample/{who}": {
                    "path": "http://localhost:8080/sample/{who}",
                    "verbs": [
                        "GET"
                    ]
                }
            },
            {
                "/jackson/awesome/{who}": {
                    "path": "http://localhost:8080/jackson/awesome/{who}",
                    "verbs": [
                        "GET"
                    ]
                },
                "/jackson/{who}": {
                    "path": "http://localhost:8080/jackson/{who}",
                    "verbs": [
                        "GET"
                    ]
                }
            },
            {
                "/": {
                    "path": "http://localhost:8080/",
                    "verbs": [
                        "GET"
                    ]
                }
            }
        ]
    }

July 24, 2013

by Mark Needham


Getting started with JPA and Mule

Working with JPA-managed entities in Mule applications can be difficult. Since the JPA session is not propagated between message processors, transformers are typically needed to produce an entity from a message’s payload, pass it to a component for processing, then serialize it back to an un-proxied representation for further processing.

Transactions have been complicated too. It’s difficult to coordinate a transaction between multiple components that are operating with JPA entity payloads. Finally, the lack of support for JPA queries makes it difficult to load objects without working with raw SQL and the JDBC transport.

Mule Support for JPA Entities

The JPA module aims to simplify working with JPA-managed entities in Mule. It provides message processors that map to an EntityManager’s methods. The message processors participate in Mule transactions, making it easy to structure JPA transactions within Mule flows. The JPA module also provides a @PersistenceContext implementation, which allows Mule components to participate in JPA transactions.

Installing the JPA Module

To install the JPA Module, click on “Help” followed by “Install New Software…” in Mule Studio. Select the “MuleStudio Cloud Connectors Update Site” from the “Work With” drop-down list, then find the “Mule Java Persistence API Module Mule Extension.” This is illustrated below:

[Screenshot not shown]

Fetching JPA Entities

JPA query language or criteria queries can be executed using the “query” message processor (MP). Supplying a statement to the processor will execute the given query and return the results to the next message processor, as illustrated in the following Gist:

[Gist not shown]

The queryParameters-ref defines the parameters. In this case, the message’s payload supplies the parameters to the query.
The following query illustrates how a Map payload could be used to populate query parameters:

[Gist not shown]

The query processor also supports criteria queries by setting the queryParameters-ref to an instance of a CriteriaQuery, as illustrated in the functional test snippet below.

    CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder();
    CriteriaQuery<Dog> criteriaQuery = criteriaBuilder.createQuery(Dog.class);
    Root<Dog> from = criteriaQuery.from(Dog.class);

    // Restrict the results to dogs named "Cujo".
    Predicate condition = criteriaBuilder.equal(from.get("name"), "Cujo");
    criteriaQuery.where(condition);

    runFlowWithPayloadAndExpect("testQuery", expectedResults, criteriaQuery);

You can use the “find” MP to load a single object if you know its ID:

[Gist not shown]

Transactions and Entity Operations

The default behavior of most JPA providers, like Hibernate, is to provide proxies on entity relationships to avoid loading full object graphs into memory. When these objects are detached from the JPA session, however, attempts to access relations in the object will often fail because the proxied session is no longer available. This complicates using JPA in Mule applications, because JPA objects pass between message processors and between flows, and the session subsequently becomes unavailable.

The JPA module allows you to avoid this by wrapping your operations in a transactional block. Let’s first look at how to persist an object and then query it within a transaction. The example below assumes the message’s payload is an instance of the Dog domain class.

[Gist not shown]

Now let’s see how we can use the merge processor to attach a JPA object to a new session. This can be useful when passing a JPA entity from one flow to another.

[Gist not shown]

Detaching an entity is just as simple:

[Gist not shown]

Component Operations with JPA

The real power of using JPA with Mule is allowing your business services to participate in Mule-managed JPA transactions.
A @PersistenceContext EntityManager reference in your component class will cause Mule to inject a reference to a transactional flow’s current EntityManager for that method, as illustrated in the following class:

    public class DogServiceImpl {

        // Injected by Mule with the current transactional flow's EntityManager.
        @PersistenceContext
        EntityManager entityManager;

        public Dog groom(Dog dog) {
            return entityManager.merge(dog);
        }
    }

We can now wire the component up in a flow:

[Gist not shown]

Conclusion

JPA is an important part of the JEE ecosystem, and hopefully this module will simplify your use of JPA-managed entities in Mule applications.

July 24, 2013

by John D'Emic


Algorithm of the Week: Spatial Indexing with Quadtrees and Hilbert Curves

Some time ago at Oredev, after the sessions, there was "Birds of a Feather" - a sort of mini-unconference. Anyone could write up a topic on the whiteboard; interested individuals added their names, and each group got allocated a room to chat about the topic. I joined the "spatial indexing" group, and we spent a fascinating hour and a half talking about spatial indexing methods, reminding me of several interesting algorithms and techniques.

Spatial indexing is increasingly important as more and more data and applications are geospatially enabled. Efficiently querying geospatial data, however, is a considerable challenge: because the data is two-dimensional (or sometimes more), you can't use standard indexing techniques to query on position. Spatial indexes solve this through a variety of techniques. In this post, we'll cover several - quadtrees, geohashes (not to be confused with geohashing), and space-filling curves - and reveal how they're all interrelated.

Quadtrees

Quadtrees are a very straightforward spatial indexing technique. In a quadtree, each node represents a bounding box covering some part of the space being indexed, with the root node covering the entire area. Each node is either a leaf node, in which case it contains one or more indexed points and no children, or an internal node, in which case it has exactly four children, one for each quadrant obtained by dividing the area covered in half along both axes - hence the name.

Figure: a representation of how a quadtree divides an indexed area. Source: Wikipedia.

Inserting data into a quadtree is simple: starting at the root, determine which quadrant your point occupies. Recurse to that node and repeat, until you find a leaf node. Then, add your point to that node's list of points. If the list exceeds some pre-determined maximum number of elements, split the node, and move the points into the correct subnodes.

Figure: a representation of how a quadtree is structured internally.
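The insertion procedure described above can be sketched in Python as follows. This is a minimal illustration rather than code from the original post: the `QuadTree` class, its `capacity` and `max_depth` parameters, and the use of half-open bounding boxes are all assumptions made for the example. Children are stored in the order top left, bottom left, top right, bottom right, with y increasing downwards.

```python
class QuadTree:
    """One node of a point quadtree covering the box [x0, x1) x [y0, y1).

    Hypothetical sketch: the class name, capacity and max_depth are
    assumptions made for illustration, not part of the original article.
    """

    def __init__(self, x0, y0, x1, y1, capacity=4, max_depth=16):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.capacity = capacity    # points a leaf may hold before it splits
        self.max_depth = max_depth  # how many more levels may be created below
        self.points = []            # used only while this node is a leaf
        self.children = None        # list of four nodes once this node splits

    def _midpoint(self):
        return (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2

    def _child_for(self, x, y):
        """Return the child whose quadrant contains (x, y)."""
        mid_x, mid_y = self._midpoint()
        x_bit = 1 if x >= mid_x else 0
        y_bit = 1 if y >= mid_y else 0
        return self.children[(x_bit << 1) | y_bit]

    def insert(self, x, y):
        # Walk down from this node until we reach a leaf.
        node = self
        while node.children is not None:
            node = node._child_for(x, y)
        node.points.append((x, y))
        # Split an over-full leaf, unless the depth limit has been reached
        # (the limit stops endless splitting on duplicate points).
        if len(node.points) > node.capacity and node.max_depth > 0:
            node._split()

    def _split(self):
        mid_x, mid_y = self._midpoint()
        args = (self.capacity, self.max_depth - 1)
        self.children = [
            QuadTree(self.x0, self.y0, mid_x, mid_y, *args),  # 00: top left
            QuadTree(self.x0, mid_y, mid_x, self.y1, *args),  # 01: bottom left
            QuadTree(mid_x, self.y0, self.x1, mid_y, *args),  # 10: top right
            QuadTree(mid_x, mid_y, self.x1, self.y1, *args),  # 11: bottom right
        ]
        # Move the points this node held into the correct subnodes.
        points, self.points = self.points, []
        for px, py in points:
            self._child_for(px, py).insert(px, py)


# Usage: with capacity 2, the third point forces the root to split.
tree = QuadTree(0, 0, 8, 8, capacity=2)
for point in [(1, 1), (6, 1), (1, 6)]:
    tree.insert(*point)
```

The depth limit is a design choice of this sketch: without it, inserting more than `capacity` identical points would make a leaf split forever.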
To query a quadtree, starting at the root, examine each child node, and check if it intersects the area being queried for. If it does, recurse into that child node. Whenever you encounter a leaf node, examine each entry to see if it intersects with the query area, and return it if it does.

Note that a quadtree is very regular - it is, in fact, a trie, since the values of the tree nodes do not depend on the data being inserted. A consequence of this is that we can uniquely number our nodes in a straightforward manner: simply number each quadrant in binary (00 for the top left, 10 for the top right, and so forth), and the number for a node is the concatenation of the quadrant numbers for each of its ancestors, starting at the root. Using this system, the bottom right node in the sample image would be numbered 11 01.

If we define a maximum depth for our tree, then, we can calculate a point's node number without reference to the tree - simply normalize the point's coordinates to an appropriate integer range (for example, 32 bits each), and then interleave the bits from the x and y coordinates - each pair of bits specifies a quadrant in the hypothetical quadtree.

Geohashes

This system might seem familiar: it's a geohash! At this point, you can actually throw out the quadtree itself - the node number, or geohash, contains all the information we need about its location in the tree. Each leaf node in a full-height tree is a complete geohash, and each internal node is represented by the range from its smallest leaf node to its largest one. Thus, you can efficiently locate all the points under any internal node by indexing on the geohash and performing a query for everything within the numeric range covered by the desired node.

Querying once we've thrown away the tree itself becomes a little more complex. Instead of refining our search set recursively, we need to construct a search set ahead of time.
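To make the numbering scheme above concrete, here is a small Python sketch of the bit interleaving and of the numeric range covered by an internal node. It is an illustration under stated assumptions, not code from the original post: the function names `interleave` and `prefix_range` and the `bits` parameter are hypothetical, and the result is the plain integer node number described above, not a base-32 geohash string.

```python
def interleave(x, y, bits=16):
    """Interleave the bits of x and y, most significant bit first.

    Each pair of output bits is (x bit, y bit), so 00 is the top left
    quadrant and 10 the top right, matching the numbering in the text.
    Hypothetical helper; x and y must already be integers in [0, 2**bits).
    """
    code = 0
    for i in range(bits - 1, -1, -1):
        x_bit = (x >> i) & 1
        y_bit = (y >> i) & 1
        code = (code << 2) | (x_bit << 1) | y_bit
    return code


def prefix_range(prefix, levels, bits=16):
    """Return (lowest, highest) full-depth code under an internal node.

    `prefix` is the node number built from `levels` quadrant pairs.
    Hypothetical helper for illustration.
    """
    shift = 2 * (bits - levels)          # bits left unspecified below the node
    lowest = prefix << shift             # all remaining bits set to 0
    highest = lowest | ((1 << shift) - 1)  # all remaining bits set to 1
    return lowest, highest


# Usage: on an 8x8 grid (3 bits per axis), the point (5, 2) lies in the
# top right quadrant (10), so its code falls inside that quadrant's range.
code = interleave(5, 2, bits=3)
lowest, highest = prefix_range(0b10, levels=1, bits=3)
print(bin(code), lowest <= code <= highest)
```

A range query for everything under a node then becomes a single "between lowest and highest" lookup on an ordinary one-dimensional index.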
First, find the smallest prefix (or quadtree node) that completely covers the query area. In the worst case, this may be substantially larger than the actual query area - for example, a small shape in the center of the indexed area that intersects all four quadrants would require selecting the root node for this step.

The aim, now, is to construct a set of prefixes that completely covers the query region, while including as little area outside the region as possible. If we had no other constraints, we could simply select the set of leaf nodes that intersect the query area - but that would result in a lot of queries. Another constraint, then, is that we want to minimise the number of distinct ranges we have to query for.

One approach to doing this is to start by setting a maximum number of ranges we're willing to have. Construct a set of ranges, initially populated with the prefix we identified earlier. Pick the node in the set that can be subdivided without exceeding the maximum range count and will remove the most unwanted area from the query region. Repeat this until there are no ranges in the set that can be further subdivided. Finally, examine the resulting set, and join any adjacent ranges, if possible. The diagram below demonstrates how this works for a query on a circular area with a limit of 5 query ranges.

Figure: how a query for a region is broken into a series of geohash prefixes/ranges.

This approach works well, and it allows us to avoid the need to do recursive lookups - the set of range lookups we do execute can all be done in parallel. Since each lookup can be expected to require a disk seek, parallelizing our queries allows us to substantially cut down the time required to return the results.

Still, we can do better. You may notice that all the areas we need to query in the above diagram are adjacent, yet we can only merge two of them (the two in the bottom right of the selected area) into a single range query, requiring us to do 4 separate queries.
This is due in part to the order that our geohashing approach 'visits' subregions, working left to right, then top to bottom in each quad. The discontinuity as we go from the top right to the bottom left quad results in us having to split up some ranges that we could otherwise make contiguous. If we were to visit regions in a different order, perhaps we could minimise or eliminate these discontinuities, resulting in more areas that can be treated as adjacent and fetched with a single query. With an improvement in efficiency like that, we could do fewer queries for the same area covered, or conversely, the same number of queries, but including less extraneous area.

Figure: the order in which the geohashing approach 'visits' each quad.

Hilbert curves

Suppose instead, we visit regions in a 'U' shape. Within each quad, of course, we also visit subquads in the same 'U' shape, but aligned so as to match up with neighbouring quads. If we organise the orientation of these 'U's correctly, we can completely eliminate any discontinuities, and visit the entire area at whatever resolution we choose continuously, fully exploring each region before moving on to the next. Not only does this eliminate discontinuities, but it also improves the overall locality. The pattern we get if we do this may look familiar - it's a Hilbert curve.

Hilbert curves are part of a class of one-dimensional fractals known as space-filling curves, so named because they are one-dimensional lines that nevertheless fill all available space in a fixed area. They're fairly well known, in part thanks to xkcd's use of them for a map of the internet. As you can see, they're also of use for spatial indexing, since they exhibit exactly the locality and continuity required.
For example, if we take another look at the example we used for finding the set of queries required to encompass a circle above, we find that we can reduce the number of queries by one: the small region in the lower left is now contiguous with the region to its right, and whilst the two regions at the bottom are no longer contiguous with each other, the rightmost one is now contiguous with the large area in the upper right.

Figure: the order in which a Hilbert curve 'visits' each quad.

One thing that our elegant new system is lacking, so far, is a way of converting between a pair of (x,y) coordinates and the corresponding position in the Hilbert curve. With geohashing it was easy and obvious - just interleave the x and y coordinates - but there's no obvious way to modify that for a Hilbert curve. Searching the internet, you're likely to come across many descriptions of how Hilbert curves are drawn, but few if any descriptions of how to find the position of an arbitrary point. To figure this out, we need to take a closer look at how the Hilbert curve can be recursively constructed.

The first thing to observe is that although most references to Hilbert curves focus on how to draw the curve, this is a distraction from the essential property of the curve, and its importance to us: it's an ordering for points on a plane. If we express a Hilbert curve in terms of this ordering, drawing the curve itself becomes trivial - simply a matter of connecting the dots. Forget about how to connect adjacent sub-curves, and instead focus on how we can recursively enumerate the points.

Figure: Hilbert curves are all about ordering a set of points on a 2D plane.

At the root level, enumerating the points is simple: pick a direction and a start point, and proceed around the four quadrants, numbering them 0 to 3. The difficulty is introduced when we want to determine the order we visit the sub-quadrants in while maintaining the overall adjacency property.
Examination reveals that each of the sub-quadrants' curves is a simple transformation of the original curve: there are only four possible transformations. Naturally, this applies recursively to sub-sub quadrants, and so forth. The curve we use for a given quadrant is determined by the curve we used for the square it's in, and the quadrant's position. With a little work, we can construct a table that encapsulates this:

[Diagram not shown]

Suppose we want to use this table to determine the position of a point on a third-level Hilbert curve. For the sake of this example, assume our point has coordinates (5,2). Starting with the first square on the diagram, find the quadrant your point is in - in this case, it's the upper right quadrant. The first part of our Hilbert curve position, then, is 3 (11 in binary). Next, we consult the square shown in the inset of square 3 - in this case, it's the second square. Repeat the process: which sub-quadrant does our point fall into? Here, it's the lower left one, meaning the next part of our position is 1 (01 in binary), and the square we should consult next is the second one again. Repeating the process one final time, we find our point falls in the upper right sub-sub-quadrant, so our final coordinate is 3 (11 in binary). Stringing them together, we now know the position of our point on the curve is 110111 binary, or 55.

Let's be a little more methodical, and write methods to convert between x,y coordinates and Hilbert curve positions. First, we need to express our diagram above in terms a computer can understand:

    hilbert_map = {
        'a': {(0, 0): (0, 'd'), (0, 1): (1, 'a'), (1, 0): (3, 'b'), (1, 1): (2, 'a')},
        'b': {(0, 0): (2, 'b'), (0, 1): (1, 'b'), (1, 0): (3, 'a'), (1, 1): (0, 'c')},
        'c': {(0, 0): (2, 'c'), (0, 1): (3, 'd'), (1, 0): (1, 'c'), (1, 1): (0, 'b')},
        'd': {(0, 0): (0, 'a'), (0, 1): (3, 'c'), (1, 0): (1, 'd'), (1, 1): (2, 'd')},
    }

In the snippet above, each element of 'hilbert_map' corresponds to one of the four squares in the diagram above.
To make things easier to follow, I've identified each one with a letter - 'a' is the first square, 'b' the second, and so forth. The value for each square is a dict, mapping x and y coordinates for the (sub-)quadrant to the position along the line (the first part of the value tuple) and the square to use next (the second part of the value tuple). Here's how we can use this to translate x and y coordinates into a Hilbert curve position:

    def point_to_hilbert(x, y, order=16):
        current_square = 'a'
        position = 0
        for i in range(order - 1, -1, -1):
            position <<= 2
            quad_x = 1 if x & (1 << i) else 0
            quad_y = 1 if y & (1 << i) else 0
            quad_position, current_square = hilbert_map[current_square][(quad_x, quad_y)]
            position |= quad_position
        return position

The input to this function is the integer x and y coordinates, and the order of the curve. An order 1 curve fills a 2x2 grid, an order 2 curve fills a 4x4 grid, and so forth. Our x and y coordinates, then, should be normalized to a range of 0 to 2^order - 1. The function works by stepping over each bit of the x and y coordinates, starting with the most significant. For each, it determines which (sub-)quadrant the coordinate lies in, by testing the corresponding bit, then fetches the position along the line and the next square to use from the table we defined earlier. The curve position is set as the least significant 2 bits on the position variable, and at the beginning of the next loop, it's left-shifted to make room for the next set of coordinates.

Let's check that we've written the function correctly by running our example from above through it:

    >>> point_to_hilbert(5, 2, 3)
    55

Presto! For a further test, we can use the function to generate a complete list of ordered points for a Hilbert curve, then use a spreadsheet to graph them and see if we get a Hilbert curve.
Enter the following expressions into an interactive Python interpreter:

    >>> points = [(x, y) for x in range(8) for y in range(8)]
    >>> sorted_points = sorted(points, key=lambda k: point_to_hilbert(k[0], k[1], 3))
    >>> print('\n'.join('%s,%s' % x for x in sorted_points))

Take the resulting text, paste it into a file called 'hilbert.csv', open it in your favorite spreadsheet, and instruct it to generate a scatter plot. The result is, of course, a nicely plotted Hilbert curve!

The inverse of point_to_hilbert is a straightforward reversal of the hilbert_map; implementing it is left as an exercise for the reader.

Conclusion

There you have it - spatial indexing, from quadtrees to geohashes to Hilbert curves. One final observation: if you express the ordered sequence of x,y coordinates required to draw a Hilbert curve in binary, do you notice anything interesting about the ordering? Does it remind you of anything?

Just to wrap up, a caveat: all of the indexing methods I've described today are only well-suited to indexing points. If you want to index lines, polylines, or polygons, you're probably out of luck with these methods - and so far as I'm aware, the only known algorithm for effectively indexing shapes is the R-tree, an entirely different and more complex beast.
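Returning to the exercise mentioned above, here is one possible way to invert point_to_hilbert. It is a sketch, not the author's solution: the names `inverse_hilbert_map` and `hilbert_to_point` are hypothetical, while `hilbert_map` and `point_to_hilbert` are repeated from the article so that the block runs on its own. The idea is to flip each square's table so that it maps a curve position back to the (sub-)quadrant coordinates and the next square.

```python
# Repeated from the article so the example is self-contained.
hilbert_map = {
    'a': {(0, 0): (0, 'd'), (0, 1): (1, 'a'), (1, 0): (3, 'b'), (1, 1): (2, 'a')},
    'b': {(0, 0): (2, 'b'), (0, 1): (1, 'b'), (1, 0): (3, 'a'), (1, 1): (0, 'c')},
    'c': {(0, 0): (2, 'c'), (0, 1): (3, 'd'), (1, 0): (1, 'c'), (1, 1): (0, 'b')},
    'd': {(0, 0): (0, 'a'), (0, 1): (3, 'c'), (1, 0): (1, 'd'), (1, 1): (2, 'd')},
}


def point_to_hilbert(x, y, order=16):
    current_square = 'a'
    position = 0
    for i in range(order - 1, -1, -1):
        position <<= 2
        quad_x = 1 if x & (1 << i) else 0
        quad_y = 1 if y & (1 << i) else 0
        quad_position, current_square = hilbert_map[current_square][(quad_x, quad_y)]
        position |= quad_position
    return position


# Hypothetical name: for each square, map curve position -> ((x, y), next square).
inverse_hilbert_map = {
    square: {pos: (xy, next_square) for xy, (pos, next_square) in table.items()}
    for square, table in hilbert_map.items()
}


def hilbert_to_point(position, order=16):
    """Convert a curve position back to (x, y). Hypothetical helper."""
    current_square = 'a'
    x = y = 0
    for i in range(order - 1, -1, -1):
        # Take the next two bits of the position, most significant pair first.
        quad_position = (position >> (2 * i)) & 3
        (quad_x, quad_y), current_square = inverse_hilbert_map[current_square][quad_position]
        # Append one bit to each coordinate.
        x = (x << 1) | quad_x
        y = (y << 1) | quad_y
    return x, y


# Usage: the worked example from the article, in reverse.
print(hilbert_to_point(55, 3))
```

Because both functions walk through the same sequence of squares, converting a point to a position and back should return the original point for any order.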

July 23, 2013

by Nick Johnson


Log4j 2: Performance close to insane

Recently a respected member of the Apache community tried Log4j 2 and wrote on Twitter:

(Quote from Mark Struberg: @TheASF #log4j2 rocks big times! Performance is close to insane ^^ http://logging.apache.org/log4j/2.x/ )

It happened shortly after Remko Popma contributed something which is now called the “AsyncLoggers”. Some of you might know Log4j 2 has AsyncAppenders already. They are similar to the ones you can find in Log4j 1 and other logging frameworks.

To be honest, I wasn’t so excited about the new feature until I read the tweet on its performance and became curious. Clearly Java logging has many goals. Among them: logging must be as fast as hell. Nobody wants their logging framework to become a bottleneck. Of course you’ll always have a cost when logging. There is some operation the CPU must perform. Something is happening, even when you decide NOT to write a log statement. Logging is expected to be invisible.

Until now, the well-known logging frameworks were similar in speed. Benchmarks are unreliable after all. We have made some benchmarks over at Apache Logging. Sometimes one logging framework wins, sometimes the other. But at the end of the day you can say they are all very good and you can choose whichever you like. Until we got Remko’s contribution and Log4j 2 became “insanely fast”.

Small software projects running one thread might not care about performance so much. When running a SaaS you simply don’t know when your app will get so much attention that you need to scale. Then you suddenly need some extra power.

With Log4j 2, running 64 threads might bring you twelve times more logging throughput than with comparable frameworks. We speak of more than 18,000,000 messages per second, while others do around 1,500,000 or less in the same environment.

I saw the chart, but simply couldn’t believe it. There must be something wrong. I rechecked. I ran the tests myself. It’s like that: Log4j 2 is insanely fast.
Async Performance, last read on July 19, 2013

As of now, we have a logging framework which performs a lot better than every other logging framework out there. As of now, we need to justify our decision when we do not want to use Log4j 2, if speed matters. Anything other than Log4j 2 can become a bottleneck and a risk. With such a fast logging framework you might even consider logging a bit more in production than you did before.

Eventually I wrote Remko an e-mail and asked him what exactly the difference between the old AsyncAppenders and the new Asynchronous Loggers is.

The difference between old AsyncAppenders and new AsyncLoggers

“The Asynchronous Loggers do two things differently than the AsyncAppender,” he told me. “They try to do the minimum amount of work before handing off the log message to another thread, and they use a different mechanism to pass information between the producer and consumer threads. AsyncAppender uses an ArrayBlockingQueue to hand off the messages to the thread that writes to disk, and Asynchronous Loggers use the LMAX Disruptor library. Especially the Disruptor has made a large performance difference.”

In other words, the AsyncAppender uses a first-in-first-out queue to work through messages. But the Async Logger uses something new – the Disruptor. To be honest, I had never heard of it. And furthermore, I never thought much about scaling my logging framework. When somebody said “scale the system”, I thought about the database, the app server and much more, but usually not logging. In production, logging was off. End of story.

But Remko thinks about scaling when it comes to logging.

“Looking at the performance test results for the Asynchronous Loggers, the first thing you notice is that some ways of logging scale much better than others. By scaling better I mean that you get more throughput when you add more threads. If your throughput increases a constant amount with every thread you add, you have linear scalability.
This is very desirable but can be difficult to achieve,” he wrote me.

“Comparing synchronous to asynchronous, you would expect any asynchronous mechanism to scale much better than synchronous logging because you don’t do the I/O in the producing thread any more, and we all know that ‘I/O is slow’ (and I’ll get back to this in a bit).”

Yes, exactly my understanding. I thought it would be enough to send something to a queue, and something else would pick it up and write the message. The app would go on. This is exactly what the old AsyncAppender does, wrote Remko:

“With AsyncAppender, all your application thread needs to do is create a LogEvent object and put it on the ArrayBlockingQueue; the consuming thread will then take these events off the queue and do all the time-consuming work. That is, the work of turning the event into bytes and writing these bytes to the I/O device. Since the application threads do not need to do the I/O, you would expect this to scale better, meaning adding threads will allow you to log more events.”

If you believed that like me, take a seat and a deep breath. We were wrong.

“What may surprise you is that this is not the case,” he wrote. “If you look at the performance numbers for the AsyncAppenders of all logging frameworks, you’ll see that every time you double the number of threads, your throughput per thread roughly halves.”

“So your total throughput remains more or less flat! AsyncAppenders are faster than synchronous logging, but they are similar in the sense that neither of them gives you more total throughput when you add more threads,” he told me.

It hit me like a hammer. Instead of making your logging faster by adding more threads, you gained basically nothing. Appenders, after all, didn’t scale until now.

I asked Remko why this was the case.

“It turns out that queues are not the most optimal data structure to pass information between threads.
The concurrent queues that are part of the standard Java libraries use locks to make sure that values don’t get corrupted and to ensure data visibility between threads.”

LMAX Disruptor?

“The LMAX team did a lot of research on this and found that these queues have a lot of lock contention. An interesting thing they found is that queues are always either full or empty: If your producer is faster, your queue will be full most of the time (and that may be a problem in itself). If your consumer is fast enough, your queue will be empty most of the time. Either way, you will have contention on the head or on the tail of the queue, where both the producer and the consumer thread want to update the same field. To resolve this, the LMAX team came up with the Disruptor library, which is a lock-free data structure for passing messages between threads. Here is a performance comparison between the Disruptor and ArrayBlockingQueue: Performance Comparison.”

Wow. After all these years of Java programming I actually felt a bit like a junior programmer again. I had missed the LMAX Disruptor and had never even considered using a queue to be a performance problem. I wonder what other performance problems I have not discovered so far. I realized I had to re-learn Java.

I asked Remko how he could find a library like the LMAX Disruptor. I mean, nobody writes software, creates an instance of a Queue class, doubts its performance and finally searches the internet for “something better”. Or are there really people of that kind?

“How I found out about the Disruptor? The short answer is, it was all a mistake,” he started. “Okay, perhaps that was a bit too short, so here is the longer answer: a colleague of mine wrote a small logger, essentially adding a time-stamped log message to a queue, with a background thread that took these strings off the queue and wrote them to disk. He did this because he needed better performance than what he could get with log4j-1.x.
I did some testing and found it was faster, I don’t remember exactly by how much. I was quite surprised because I had been using log4j for years and had never thought it would be easily outperformed. Until then I had assumed that the well-known libraries would be fast, because, well… To be honest, I had just assumed. So this was a bit of an eye-opener for me. However, the custom logger was a bit bare-bones in terms of functionality so I started to look around for alternatives.”

“Before I start talking about the Disruptor, I have to confess something. I recently went back to see how much faster the custom logger was than log4j-1.x, but when I measured it, it was actually slower! It turned out that I had been comparing the custom logger to an old beta of log4j-2.0, I think beta3 or beta4. AsyncAppender in those betas still had a performance issue (LOG4J2-153 if you’re curious). If I had compared the custom logger to the AsyncAppender in log4j-1.x, I would have found that log4j-1.x was faster and I would not have thought about it further. But because I made this mistake I started to look for other high-performance logging libraries that were richer in functionality. I did not find such a logging library, but I ran into a whole bunch of other interesting stuff, including the Disruptor. Eventually I decided to try to combine Log4j-2, which has a very nice code base, with the Disruptor. The result of this was eventually accepted into Log4j-2 itself, and the rest, as they say, was history.”

“One thing I came across that I should mention here is Peter Lawrey’s Chronicle library. Chronicle uses memory-mapped files to write tens of millions of messages per second to disk with very low latency. Remember that above I said that ‘we all know that I/O is slow’? Chronicle shows that synchronous I/O can be very, very fast.”

“It was via Peter’s work that I came across the Disruptor. There is a lot of good material out there about the Disruptor.
Just to give you a few pointers:

- Martin Fowler: LMAX
- Trisha Gee on LMAX under the hood (slightly outdated now but the most detailed material I know of)
- …and video presentations like this

The Disruptor Google group is also highly recommended. Recommended readings on Java performance in general are:

- Martin Thompson’s “Mechanical Sympathy”
- Martin Thompson Presentations

Martin Thompson has done a number of articles and presentations on various aspects of high performance computing in Java. He does a great job of making the complex stuff that is going on under the hood accessible.”

My bookmarks folder filled up after reading this e-mail, and I appreciate the many starting points for improving my knowledge of Java performance.

Should I use AsyncLoggers by default?

I was sure I wanted to use the new Async Loggers. This all sounds just fantastic. But on the other hand, I am a bit scared and even a little paranoid about including new dependencies or new technologies like the new Log4j 2 Async Loggers. I asked Remko if he would use the new feature by default or if he would enable it just for a few, limited use cases.

“I use Async Loggers by default, yes,” he wrote me. “One use case when you would _not_ want to use asynchronous logging is when you use logging for audit purposes. In that case a logging error is a problem that your application needs to know about and deal with. I believe that most applications are different, in that they don’t care too much about logging errors. Most applications don’t want to stop if a logging exception occurs; in fact, they don’t even want to know about it. By default, appenders in Log4j-2.0 will suppress exceptions so the application doesn’t need to try/catch every log statement.
If that is your usage, then you will not lose anything by using asynchronous loggers, so you get only the benefits, which is improved performance.”

“One nice little detail I should mention is that both Async Loggers and Async Appenders fix something that has always bothered me in Log4j-1.x, which is that they will flush the buffer after logging the last event in the queue. With Log4j-1.x, if you used buffered I/O, you often could not see the last few log events, as they were still stuck in the memory buffer. Your only option was setting immediateFlush to true, which forces disk I/O on every single log event and has a performance impact. With Async Loggers and Appenders in Log4j-2.0 your log statements are all flushed to disk, so they are always visible, but this happens in a very efficient manner.”

Isn’t it risky to use Log4j’s AsyncLoggers?

But considering that Log4j-1 had serious threading issues and the modern world uses cloud computing and clustering all the time to scale apps, isn’t asynchronous logging some kind of additional risk? Or is it safe? I knew my questions would sound like the questions of a decision maker, not of a developer. But the whole LMAX thing was so new to me, and since I maintain the old and really ugly Log4j 1 code, I simply had to ask.

Remko: “There are a number of questions in there. First, is Log4j-2 safer from a concurrency perspective than Log4j-1.x? I believe so. The Log4j-2 team has put in considerable effort to support multi-threaded applications, and the asynchronous loggers are just a very recent and relatively small addition to the project. Log4j-2 uses more granular locking than log4j-1.x, and is architecturally simpler, which should result in fewer issues, and any issues that do come up will be easier to fix.”

“On the other hand, Log4j-2 is still in beta and is under active development, although recently I think most effort is being spent on fixing things and tying up loose ends rather than adding new features.
I believe it is stable enough for production use. If you are considering using Log4j-2, for performance or other reasons, I’d suggest you do your due diligence and test, just like you would before adopting any other 3rd party library in your project.”

(Sidenote: A stable version of Log4j 2 can be expected soon, most likely autumn 2013.)

Sounded good to me. And yes, I can perfectly agree with that from my own observations on the project, though I personally did not write code in the Log4j 2 repository.

“The other question I see is: Is asynchronous logging riskier than synchronous logging? I don’t think so. In fact, if your application is multi-threaded the opposite may be the case: once the log event has been handed off to the consumer thread that does the I/O, there is only that one thread dealing with the layouts, appenders and all the other logging-related components. So after the hand-off you’re single-threaded and you don’t need to worry about threading issues like deadlock, liveness etc. any more.”

“You can take this one step further and make your business logic completely single-threaded, using the Disruptor for all I/O or communication with external systems. Single-threaded business logic without lock contention can be blazingly fast. The results at LMAX (6 million transactions/sec, with less than 10 ms latency) speak for themselves.”

Reading Remko’s message I learned three things. First, I have to learn more about Java performance. Second, I definitely want to make my applications use Log4j 2. As a first step, I will enable it in my Struts 2 apps, which I use often. Third, a web application framework using the LMAX Disruptor might blow us all away.

I would like to give a big thank you and a hug to Remko Popma for answering my questions and working on this blog post with me. All errors are my own.
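For readers who want to picture the small queue-based logger that Remko's colleague wrote, here is a minimal sketch of the idea: callers put time-stamped messages on an ArrayBlockingQueue and a single background thread takes them off. This is only an illustration under stated assumptions. It is not the colleague's code and not how Log4j 2 Async Loggers work (those use the Disruptor rather than a locking queue). The class name QueueLogger, its methods and the in-memory list standing in for the disk are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration: a bare-bones asynchronous logger built on a locking queue.
final class QueueLogger {
    // Sentinel compared by identity; tells the worker thread to stop.
    private static final String STOP = new String("__STOP__");

    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    // Stand-in for the log file; only the worker thread touches it until shutdown.
    private final List<String> sink = new ArrayList<>();
    private final Thread worker;

    QueueLogger() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String message = queue.take(); // blocks while the queue is empty
                    if (message == STOP) {
                        return;
                    }
                    sink.add(message); // a real logger would write to disk here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    // Called by application threads; blocks only when the queue is full.
    void log(String message) {
        try {
            queue.put(System.currentTimeMillis() + " " + message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while enqueueing", e);
        }
    }

    // Drains the queue, stops the worker and returns everything that was "written".
    List<String> shutdown() {
        try {
            queue.put(STOP);
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during shutdown", e);
        }
        return sink;
    }
}
```

Both the producers and the consumer go through the queue's lock here, which is exactly the contention point the interview describes.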

July 20, 2013

by Christian Grobmeier

· 7,502 Views · 1 Like

Playing with NHibernate - Inverse and Cascade Mapping Attributes

I have to admit that NHibernate provides a really flexible way of handling class inheritance and parent-child relationship.

July 12, 2013

by Mariano Vazquez

· 36,442 Views

JAX RS: Streaming a Response using StreamingOutput

A couple of weeks ago Jim and I were building out a neo4j unmanaged extension from which we wanted to return the results of a traversal which had a lot of paths. Our code initially looked a bit like this:

package com.markandjim;

@Path("/subgraph")
public class ExtractSubGraphResource {
    private final GraphDatabaseService database;

    public ExtractSubGraphResource(@Context GraphDatabaseService database) {
        this.database = database;
    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    @Path("/{nodeId}/{depth}")
    public Response hello(@PathParam("nodeId") long nodeId, @PathParam("depth") int depth) {
        Node node = database.getNodeById(nodeId);

        final Traverser paths = Traversal.description()
                .depthFirst()
                .relationships(DynamicRelationshipType.withName("whatever"))
                .evaluator( Evaluators.toDepth(depth) )
                .traverse(node);

        // Builds the whole response in memory before returning it
        StringBuilder allThePaths = new StringBuilder();
        for (org.neo4j.graphdb.Path path : paths) {
            allThePaths.append(path.toString() + "\n");
        }

        return Response.ok(allThePaths.toString()).build();
    }
}

We then compiled that into a JAR, placed it in ‘plugins’ and added the following line to ‘conf/neo4j-server.properties’:

org.neo4j.server.thirdparty_jaxrs_classes=com.markandjim=/unmanaged

After we’d restarted the neo4j server we were able to call this end point using cURL like so:

$ curl -v http://localhost:7474/unmanaged/subgraph/1000/10

This approach works quite well but Jim pointed out that it was quite inefficient to load all those paths up into memory, so we thought it would be quite cool if we could stream each path as we got to it. Traverser wraps an iterator so we are lazily evaluating the result set in any case. After a bit of searching we came across StreamingOutput, which is exactly what we need.
We adapted our code to use that instead:

package com.markandjim;

@Path("/subgraph")
public class ExtractSubGraphResource {
    private final GraphDatabaseService database;

    public ExtractSubGraphResource(@Context GraphDatabaseService database) {
        this.database = database;
    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    @Path("/{nodeId}/{depth}")
    public Response hello(@PathParam("nodeId") long nodeId, @PathParam("depth") int depth) {
        Node node = database.getNodeById(nodeId);

        final Traverser paths = Traversal.description()
                .depthFirst()
                .relationships(DynamicRelationshipType.withName("whatever"))
                .evaluator( Evaluators.toDepth(depth) )
                .traverse(node);

        // Each path is written to the response as the traversal produces it
        StreamingOutput stream = new StreamingOutput() {
            @Override
            public void write(OutputStream os) throws IOException, WebApplicationException {
                Writer writer = new BufferedWriter(new OutputStreamWriter(os));
                for (org.neo4j.graphdb.Path path : paths) {
                    writer.write(path.toString() + "\n");
                }
                writer.flush();
            }
        };

        return Response.ok(stream).build();
    }
}

As far as I can tell the only discernible difference between the two approaches is that you get an almost immediate response from the streamed approach, whereas the first approach has to put everything in the StringBuilder first. Both approaches make use of chunked transfer encoding, which according to tcpdump seems to have a maximum packet size of 16332 bytes:

00:10:27.361521 IP localhost.7474 > localhost.55473: Flags [.], seq 6098196:6114528, ack 179, win 9175, options [nop,nop,TS val 784819663 ecr 784819662], length 16332
00:10:27.362278 IP localhost.7474 > localhost.55473: Flags [.], seq 6147374:6163706, ack 179, win 9175, options [nop,nop,TS val 784819663 ecr 784819663], length 16332
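The write-as-you-iterate pattern inside the StreamingOutput above can be tried without JAX-RS or Neo4j on the classpath. The sketch below is only an illustration of that pattern using java.io: LineStreamer and its methods are hypothetical names, a plain Iterator of strings stands in for the Traverser, and a ByteArrayOutputStream stands in for the HTTP response stream. One deliberate difference from the post's code is that the charset is passed explicitly instead of relying on the platform default.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

// Hypothetical illustration of streaming a lazily produced sequence to an OutputStream.
final class LineStreamer {

    // Writes each item as it is pulled from the iterator; nothing is collected up front.
    static void stream(Iterator<String> items, OutputStream os) throws IOException {
        Writer writer = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
        while (items.hasNext()) {
            writer.write(items.next());
            writer.write('\n');
        }
        // Push buffered characters out; closing the stream is left to the caller.
        writer.flush();
    }

    // Demo helper: streams into memory and returns the resulting text.
    static String streamToString(Iterator<String> items) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            stream(items, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In a real resource the container supplies the OutputStream, so the helper that buffers into memory would not be used; it exists only to make the sketch easy to run.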

July 10, 2013

by Mark Needham

· 114,329 Views · 4 Likes

Changing MySQL Binary Log Files Location to Another Directory

What is the Default?

Usually in most installations, binary log files are located in the MySQL default directory (/var/lib/mysql), right next to the data files.

Why Should I Move the Binary Logs to Another Directory?

Each data modification (INSERT, UPDATE, DELETE...) and data definition (CREATE, ALTER, DROP...) statement that you perform on your server is recorded in the binary log files. Therefore, each time you run any of these statements, you actually update both your data files and your log files. The result is high IO utilization that is focused on a specific disk area. A common recommendation in the database field is to put these files on two different disks in order to get better performance.

How to Perform it?

1. Change the log-bin variable in my.cnf to log-bin=/path/to/new/directory/mysql-bin
2. Purge as many files as you can (PURGE BINARY LOGS ...) in order to minimize the number of files to move (see step 4).
3. Stop the master (service mysql stop).
4. Move the files to the new directory: mv /var/lib/mysql/mysql-bin.* /path/to/new/directory
5. Start the master again (service mysql start).

Bottom Line

A few steps and your server is ready for more traffic and data.

Keep Performing,
Moshe Kaplan

June 30, 2013

by Moshe Kaplan

· 49,295 Views

Linksheet: Apache Cassandra

Curator's note: This collection of resources, compiled by Tim Spann, will help you get started with Apache Cassandra, the open source distributed database system originally developed at Facebook.

http://hector-client.github.io/hector/build/html/index.html
https://www.openshift.com/blogs/cassandra-on-openshift
https://github.com/shekhargulati/cassandra-openshift-quickstart-java
http://whyjava.wordpress.com/2012/02/23/logging-jaxws-soap-request-and-response-using-a-java-property/
https://dzone.com/articles/creating-your-first-java
https://github.com/boneill42/naughty-or-nice
http://www.slideshare.net/DataStax/college-credit-creating-your-first-app-in-java-with-cassandra

June 30, 2013

by Tim Spann

CORE

· 5,289 Views · 1 Like

Integration of Amazon Redshift Data Warehouse with Talend Data Integration

In this blog post, I will show you how to “ETL” all kinds of data to Amazon’s cloud data warehouse Redshift with Talend’s big data components. Let’s begin with a short introduction to Amazon Redshift (copied from the website):

“Amazon Redshift is [part of Amazon Web Services (AWS) and] a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. With a few clicks in the AWS Management Console, customers can launch a Redshift cluster, starting with a few hundred gigabytes and scaling to a petabyte or more, for under $1,000 per terabyte per year. Traditional data warehouses require significant time and resource to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly.”

Sounds interesting! And indeed, we already see companies using Talend’s Redshift connectors. From Talend’s perspective, it is not much more than just another database. If you have ever used a Talend connector, you can integrate with Redshift within minutes. In the next sections, I will describe all necessary steps and give some hints regarding configuration issues and performance improvements.

Be aware: You need Talend Open Studio for Data Integration (Apache License, open source) or any Talend Enterprise Edition / Platform which contains the Cloud components to see and use the Amazon Redshift connectors. The open source edition offers all connectors and functionality to integrate with Amazon Redshift. However, Enterprise versions offer some more features (e.g. versioning), comfort (e.g. wizards) and commercial support.

Setup Amazon Redshift

Setup of Amazon Redshift is very easy. Just follow Amazon’s getting started guide: http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html.
Like every other AWS guide, it is very easy to understand and use. Be aware that you only have to do steps 1, 2 and 3 of the getting started guide to use it with Talend. Some hints:

- Step 1 (“before you begin”): Just sign up. Client tools and drivers are not necessary because they are already installed within Talend Studio.
- Step 2 (“launch a cluster”): Yes, please start your cluster!
- Step 3 (“authorize access”): If you are not sure what to do here, select Connection Type = CIDR/IP. Find out your IP address (http://whatismyipaddress.com) and enter it with “/32” at the end. Example: “192.168.1.1/32”

Now you can connect to Amazon Redshift from your Talend Studio on your local computer. Step 4 (connect) and step 5 (create table, data, queries) are not necessary; this will be done from Talend Studio. Of course, you should not forget to delete your cluster (step 7) when you are done. Otherwise, you will pay for every hour, even if you do not access your DWH.

Connect to Amazon Redshift from Talend Studio

Create a new connection to the Amazon Redshift database as you do with every other relational database. The easiest way is to use the “DB Connection Wizard” in metadata. Just enter your connection information and check if it works. You get all information about configuration from the Amazon Web Console. The connection string looks something like this: “jdbc:paraccel://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev”

Next, right click on the created connection and select “retrieve schema”. “public” is the default schema which you (have to) use. Now you are ready to use this connection within Talend Jobs to write to Amazon Redshift and read from it.

Create Talend Jobs (Write, Read, Delete)

Amazon Redshift components work like any other Talend (relational) database components. Look at www.help.talend.com for more information if you have not used them before (or just try them out, they are very self-explanatory).
You just have to drag and drop your connection from metadata. Afterwards, you can easily write data (tRedshiftOutput), read data (tRedshiftInput), or run any other queries such as delete or copy (tRedshiftRow).

In the following job, I start by deleting all content in the Amazon Redshift table. Then, I read data from a MySQL table and insert it into an Amazon Redshift table. The table is created automatically (as I have configured it this way). After this subjob is finished, I read the data again and store it in a CSV file (which is also created automatically). Of course, this is no business use case, but it shows how to use different Amazon Redshift components.

Query Data from Amazon Redshift

You can connect to Amazon Redshift directly from Talend Studio to explore and query data of the DWH. Thus, no other database tool is required. Just right click on your Amazon Redshift connection in metadata and select “edit queries”. Here you can define, execute and save SQL queries.

Improve Performance

Write performance of Amazon Redshift is relatively low compared to “classical” relational databases (in your data center) as you have to upload all data into the cloud. Different alternatives exist to improve performance:

- Bulk inserts: “Extended insert” (in advanced settings) improves performance a lot, but still not to hyperspeed… Also, because it works in bulk, you can only do inserts: it is not compatible with “rejects” or “updates”.
- AWS S3 and COPY command: S3 is Amazon’s “simple storage service”, a key-value store (also called NoSQL today) for storing very large objects. You can use Amazon Redshift’s COPY command (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) to transfer data from S3 to Amazon Redshift with good performance. However, you still have to copy the data to S3 first, so the same “cloud problem” applies here. The COPY command can be used with tRedshiftRow, so no problem at all from Talend’s perspective.
To transfer data to S3, you can either use the Talend S3 components from Talendforge, Talend’s open source community (http://www.talendforge.org/exchange), or use camel-s3, an Apache Camel component which is included in Talend ESB. The latter is an option if you use Talend Data Services, which combines Talend DI and Talend ESB in its unified platform.

Summary

You need not be a cloud or DWH expert, or an expert developer, to integrate with Amazon’s cloud data warehouse Redshift. It is very easy with Talend’s integration solutions. Just drag and drop, configure, do some graphical mappings / transformations (if necessary), and that’s it. Code is generated. Job runs. You can integrate Amazon Redshift almost as simply as any other relational database. Just be aware of some cloud-specific security and performance issues. With Talend, you can easily “ETL” all data from different sources to Redshift and store it there for under $1,000 per terabyte per year, even with the open source version!

Best regards,
Kai Wähner (Contact and feedback via @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing)

This is content from my blog: http://www.kai-waehner.de/blog/2013/06/26/integration-of-amazon-redshift-cloud-data-warehouse-aws-saas-dwh-with-talend-data-integration-di-big-data-bd-enterprise-service-bus-esb/

June 27, 2013

by Kai Wähner

CORE

· 20,579 Views · 1 Like

3 Ways to Optimize for Paging in MySQL

Lots and lots of web applications need to page through information, from customer records to the albums in your iTunes collection. So as web developers and architects, it’s important that we do all this efficiently. Start by looking at how you’re fetching information from your MySQL database. We’ve outlined three ways to do just that.

1. Paging without discarding records

Ultimately we’re trying to avoid discarding records. After all, if the server doesn’t fetch them, we save big. How else can we avoid this extra work? How about remembering the last record shown? For example:

SELECT id, name, address, phone FROM customers WHERE id > 990 ORDER BY id LIMIT 10;

Of course such a solution would only work if you were paging by ID. If you page by name, it might get messier as there may be more than one person with the same name. If ID doesn’t work for your application, perhaps returning paged users by USERNAME might work. Those would be unique:

SELECT id, username FROM customers WHERE username > '[email protected]' ORDER BY username LIMIT 10;

Paging queries can be slow with SQL as they often involve the OFFSET keyword, which tells the server you only want a subset. However, the server typically scans, collects, and then discards the skipped rows first. With a deferred join, or by maintaining a place or position column, you can avoid this and speed up your database dramatically.

2. Try using a Deferred Join

This is an interesting trick. Suppose you have pages of customers. Each page displays ten customers. The query will use LIMIT to get ten records, and OFFSET to skip all the previous page results. When you get to the 100th page, it’s doing LIMIT 10 OFFSET 990. So the server has to go and read all those records, then discard them.

SELECT id, name, address, phone FROM customers ORDER BY name LIMIT 10 OFFSET 990;

MySQL is first scanning an index, then retrieving rows in the table by primary key id. So it’s doing double lookups and so forth. Turns out you can make this faster with a tricky thing called a deferred join. The inside piece just uses the primary key. An explain plan shows us “using index”, which we love!

SELECT id FROM customers ORDER BY name LIMIT 10 OFFSET 990;

Now combine this using an INNER JOIN to get the ten rows and data you want:

SELECT id, name, address, phone FROM customers INNER JOIN ( SELECT id FROM customers ORDER BY name LIMIT 10 OFFSET 990) AS my_results USING(id);

That’s pretty cool!

3. Maintain a Page or Place column

Another way to trick the optimizer into not retrieving rows it doesn’t need is to maintain a column for the page, place or position. Yes, you need to update that column whenever you (a) INSERT a row, (b) DELETE a row, or (c) move a row with UPDATE. This could get messy with page, but a straight place or position column might be easier to work with.

SELECT id, name, address, phone FROM customers WHERE page = 100 ORDER BY name;

Or with a place column, something like this:

SELECT id, name, address, phone FROM customers WHERE place BETWEEN 990 AND 999 ORDER BY name;
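The difference between OFFSET paging and the "WHERE id > ..." approach from way 1 can be illustrated without a database. The sketch below is an in-memory analogy only, not MySQL behavior: a TreeMap stands in for the primary key index, pageByOffset walks past the skipped rows the way OFFSET does, and pageAfter seeks straight past the last ID seen. The class KeysetPager and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory analogy for OFFSET paging versus "WHERE id > ?" paging.
final class KeysetPager {
    // Sorted by id, playing the role of the primary key index.
    private final TreeMap<Integer, String> customersById = new TreeMap<>();

    void add(int id, String name) {
        customersById.put(id, name);
    }

    // Like ORDER BY id LIMIT pageSize OFFSET offset: visits and discards 'offset' rows first.
    List<Integer> pageByOffset(int offset, int pageSize) {
        List<Integer> page = new ArrayList<>();
        int skipped = 0;
        for (Integer id : customersById.keySet()) {
            if (skipped < offset) {
                skipped++; // work that is thrown away
                continue;
            }
            if (page.size() == pageSize) {
                break;
            }
            page.add(id);
        }
        return page;
    }

    // Like WHERE id > lastSeenId ORDER BY id LIMIT pageSize: seeks directly to the start of the page.
    List<Integer> pageAfter(int lastSeenId, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (Integer id : customersById.tailMap(lastSeenId, false).keySet()) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(id);
        }
        return page;
    }
}
```

With IDs 1 to 1000 loaded, pageByOffset(990, 10) and pageAfter(990, 10) return the same ten IDs, but only the first one iterates over the 990 rows it ends up discarding.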

June 25, 2013

by Sean Hull

· 21,294 Views

Implementing Memcached a Servlet Filter for Spring MVC-Based RESTful Services

I have a number of Spring MVC based RESTful services that return JSON. In 90% of the cases, the state of the objects these services return will not change within a 24 hour period. This makes them (the JSON objects) perfect candidates for simple caching enabled by memcached.

The idea was to have every request to Spring controllers intercepted, and a cache key generated and checked against the cache. If the key and corresponding value (JSON string) is available (a cache hit), it is returned to the caller as-is without making a full round trip to the database. However, if the cache has no entry for the key and hence no corresponding value (a cache miss), the call is forwarded to the controller, which in turn calls the logic to fetch the desired object from the database and not only returns it to the caller but also updates the cache with the returned content.

Keys are generated using the URL of the service in case of GET requests and the URL concatenated with the POSTed input (as JSON) in case of POST requests. The resultant strings are hashed with MD5 to come up with a 32 character cache key, which is well within the 250 character key length limit of memcached. The performance impact of using MD5 is yet to be evaluated during our load testing cycle.

I started off trying to get hold of the JSON response in the postHandle method of a Spring HandlerInterceptor. However, since we are using the @ResponseBody annotation in our controller, the JSON would be written directly to the stream. The ModelAndView was of course null for this reason. If we removed the annotation and returned ModelAndView from the controller, the intended JSON object got enclosed in a map wrapper. A quick question on Stack Overflow didn’t help, as the only suggestion I got was to extract my original object from the map wrapper. I wanted to keep this option (as discussed here as well) as my last resort.
The solution I eventually came up with involved:

- Replacing the HandlerInterceptor with Servlet Filters
- Using DelegatingFilterProxy to make my filters Spring application context aware
- Using HttpServletRequestWrapper to get control of the POST request body in the filter on the way in
- Using HttpServletResponseWrapper to get control of the response content in the filter on the way out

True, it’s probably a more complex solution than just overriding MappingJacksonJsonView and extracting my JSON object, but it is more generic as it does not assume that all my content will always be JSON.

Let’s first start with the filter definition in the web.xml:

<filter>
    <filter-name>cacheFilter</filter-name>
    <filter-class>org.springframework.web.filter.DelegatingFilterProxy</filter-class>
</filter>
...
<filter-mapping>
    <filter-name>cacheFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

A standard filter configuration except for the fact that the filter class is always going to be org.springframework.web.filter.DelegatingFilterProxy. Where do you specify your own class? As a bean in your Spring context XML. The name of the filter and the name of the bean must be the same for the delegation to happen. Using the DelegatingFilterProxy allowed me to use my filters with Spring. I can inject my dependencies as I would normally.

Next, let’s look at my MemcacheFilter filter.

Memcache Filter Class

public class MemcacheFilter implements Filter {

    private static Logger logger = Logger.getLogger(MemcacheFilter.class);
    private CacheConfig cacheConfig;

    /**
     * Memcached lookup is being performed in this method. Firstly, keys are
     * generated depending on the request method (GET/POST). Then a cache lookup
     * is performed. If a value is obtained, the value is written to the
     * response; otherwise, the actual target (in this case, Spring's Dispatcher
     * Servlet) is called by calling doFilter on the filterChain. The dispatcher
     * servlet calls the controller to produce the desired response, which is
     * intercepted when the doFilter method returns. The response is added to
     * the cache if the response code was 200 (OK).
     *
     * @param request
     * @param response
     * @param filterChain
     * @throws IOException
     * @throws ServletException
     */
    public void doFilter(ServletRequest request, ServletResponse response,
            FilterChain filterChain) throws IOException, ServletException {
        try {
            if ((request instanceof HttpServletRequest)
                    && (response instanceof HttpServletResponse)) {
                // Wrapping the response in HttpServletResponseWrapper
                MemcacheResponseWrapper responseWrap =
                        new MemcacheResponseWrapper((HttpServletResponse) response);
                // Wrapping the request in HttpServletRequestWrapper
                MemcacheRequestWrapper requestWrap =
                        new MemcacheRequestWrapper((HttpServletRequest) request);
                // Get Memcached Client Instance
                MemcachedClient client = cacheConfig.getMemcachedClient();
                Key keyGenerator = getKeyGenerator(requestWrap);
                if (keyGenerator != null) {
                    String key = keyGenerator.getKey(requestWrap, cacheConfig);
                    String value = (String) client.get(key);
                    if (value == null) {
                        // cache miss
                        logger.info("Cache miss for key " + key);
                        // call next filter/actual target for value
                        filterChain.doFilter(requestWrap, responseWrap);
                        if (responseWrap.getStatus() == HttpServletResponse.SC_OK) {
                            // obtaining response content from HttpServletResponseWrapper
                            value = responseWrap.getOutputStream().toString();
                            // adding response to cache
                            client.add(key, 0, value);
                            logger.info("Adding response to cache: "
                                    + (value.length() > 50 ? value.substring(0, 50) + "..." : value));
                        } else {
                            logger.warn("Did not add content to cache as response status is not 200");
                        }
                    } else {
                        // This case is a cache hit
                        logger.info("Cache hit for key " + key);
                        response.getWriter().println(value);
                    }
                } else {
                    logger.warn("Request skipped because no key generator could be found for the request's method");
                    // attempting call to actual target
                    filterChain.doFilter(request, response);
                }
            }
        } catch (Exception ex) {
            logger.info("Cache functionality skipped due to exception", ex);
            // attempting call to actual target
            filterChain.doFilter(request, response);
        }
    }

    /**
     * Factory method that returns KeyGenerator based on the request method.
     *
     * @param httpRequest
     * @return
     */
    private Key getKeyGenerator(HttpServletRequest httpRequest) {
        Key keyGenerator = null;
        if (httpRequest.getMethod().equalsIgnoreCase("GET")) {
            keyGenerator = new GetRequestKey();
        } else if (httpRequest.getMethod().equalsIgnoreCase("POST")) {
            keyGenerator = new PostRequestKey();
        }
        return keyGenerator;
    }

    public void init(FilterConfig arg0) throws ServletException {
        logger.debug("init");
    }

    public CacheConfig getCacheConfig() {
        return cacheConfig;
    }

    public void setCacheConfig(CacheConfig cacheConfig) {
        this.cacheConfig = cacheConfig;
    }

    public void destroy() {
        logger.debug("destroy");
    }
}

1. I first wrap my request and response objects in the following statements. I have had to create the wrappers as well. I will get to those later.

// Wrapping the response in HttpServletResponseWrapper
MemcacheResponseWrapper responseWrap = new MemcacheResponseWrapper((HttpServletResponse) response);
// Wrapping the request in HttpServletRequestWrapper
MemcacheRequestWrapper requestWrap = new MemcacheRequestWrapper((HttpServletRequest) request);

2. Next, I have one of my injected classes, CacheConfig, provide me with a memcache client which I will use later to look up the cache.

// Get Memcached Client Instance
MemcachedClient client = cacheConfig.getMemcachedClient();

3. I make a call to a function that tells me which key generator I should use, a GET one or a POST one, depending on the request method.

Key keyGenerator = getKeyGenerator(requestWrap);

/**
 * Factory method that returns KeyGenerator based on the request method.
 *
 * @param httpRequest
 * @return
 */
private Key getKeyGenerator(HttpServletRequest httpRequest) {
    Key keyGenerator = null;
    if (httpRequest.getMethod().equalsIgnoreCase("GET")) {
        keyGenerator = new GetRequestKey();
    } else if (httpRequest.getMethod().equalsIgnoreCase("POST")) {
        keyGenerator = new PostRequestKey();
    }
    return keyGenerator;
}

4. Check for a cache hit using the key returned by the key generator. If it’s a miss, call the next filter or target to compute the actual value, get the value from the response wrapper, and add it to the cache.

if (keyGenerator != null) {
    String key = keyGenerator.getKey(requestWrap, cacheConfig);
    String value = (String) client.get(key);
    if (value == null) {
        // cache miss
        logger.info("Cache miss for key " + key);
        // call next filter/actual target for value
        filterChain.doFilter(requestWrap, responseWrap);
        if (responseWrap.getStatus() == HttpServletResponse.SC_OK) {
            // obtaining response content from HttpServletResponseWrapper
            value = responseWrap.getOutputStream().toString();
            // adding response to cache
            client.add(key, 0, value);
            logger.info("Adding response to cache: "
                    + (value.length() > 50 ? value.substring(0, 50) + "..." : value));
        }

5. If it’s a cache hit, just return the cached value.

else {
    // This case is a cache hit
    logger.info("Cache hit for key " + key);
    response.getWriter().println(value);
}

Let’s take a look at each of the wrappers. I am not going into a lot of detail about how each of these works.

Request Wrapper Class

On the way in, the original POST content is extracted from the request and put in a StringBuffer. To the filter, this content is returned via the toString() method of the WrappedInputStream class, whereas the subsequently called controller calls the read method.
public class MemcacheRequestWrapper extends HttpServletRequestWrapper { protected ServletInputStream stream; protected HttpServletRequest origRequest = null; protected BufferedReader reader = null; public MemcacheRequestWrapper(HttpServletRequest request) throws IOException { super(request); origRequest = request; } public ServletInputStream createInputStream() throws IOException { return (new WrappedInputStream(origRequest)); } @Override public ServletInputStream getInputStream() throws IOException { if (reader != null) { throw new IllegalStateException("getReader() has already been called for this request"); } if (stream == null) { stream = createInputStream(); } return stream; } @Override public BufferedReader getReader() throws IOException { if (reader != null) { return reader; } if (stream != null) { throw new IllegalStateException("getReader() has already been called for this request"); } stream = createInputStream(); reader = new BufferedReader(new InputStreamReader(stream)); return reader; } private class WrappedInputStream extends ServletInputStream { private StringBuffer originalInput = new StringBuffer(); private HttpServletRequest originalRequest; private ByteArrayInputStream byteArrayInputStream; public WrappedInputStream(HttpServletRequest request) throws IOException { this.originalRequest = request; BufferedReader bufferedReader = null; try { InputStream inputStream = request.getInputStream(); if (inputStream != null) { bufferedReader = new BufferedReader(new InputStreamReader(inputStream)); char[] charBuffer = new char[128]; int bytesRead = -1; while ((bytesRead = bufferedReader.read(charBuffer)) > 0) { originalInput.append(charBuffer, 0, bytesRead); } } byteArrayInputStream = new ByteArrayInputStream(originalInput.toString().getBytes()); } catch (IOException ex) { throw ex; } finally { if (bufferedReader != null) { try { bufferedReader.close(); } catch (IOException ex) { throw ex; } } } } @Override public String toString() { return 
this.originalInput.toString(); } @Override public int read() throws IOException { return byteArrayInputStream.read(); } } } Response Wrapper Class The response wrapper is similar to the request wrapper. Instead of the read method, there is a write method, called by the controller when its writing JSON content. This is stored in the wrapper and called in the filter. public class MemcacheResponseWrapper extends HttpServletResponseWrapper { protected ServletOutputStream stream; protected PrintWriter writer = null; protected HttpServletResponse origResponse = null; private int httpStatus = 200; public MemcacheResponseWrapper(HttpServletResponse response) { super(response); response.setContentType("application/json"); origResponse = response; } public ServletOutputStream createOutputStream() throws IOException { return (new WrappedOutputStream(origResponse)); } public ServletOutputStream getOutputStream() throws IOException { if (writer != null) { throw new IllegalStateException("getWriter() has already been called for this response"); } if (stream == null) { stream = createOutputStream(); } return stream; } public PrintWriter getWriter() throws IOException { if (writer != null) { return writer; } if (stream != null) { throw new IllegalStateException("getOutputStream() has already been called for this response"); } stream = createOutputStream(); writer = new PrintWriter(stream); return writer; } @Override public void sendError(int sc) throws IOException { httpStatus = sc; super.sendError(sc); } @Override public void sendError(int sc, String msg) throws IOException { httpStatus = sc; super.sendError(sc, msg); } @Override public void setStatus(int sc) { httpStatus = sc; super.setStatus(sc); } public int getStatus() { return httpStatus; } private class WrappedOutputStream extends ServletOutputStream { private StringBuffer originalOutput = new StringBuffer(); private HttpServletResponse originalResponse; public WrappedOutputStream(HttpServletResponse response) { 
this.originalResponse = response; } @Override public String toString() { return this.originalOutput.toString(); } @Override public void write(int arg0) throws IOException { originalOutput.append((char) arg0); originalResponse.getOutputStream().write(arg0); } } }
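The cache-aside flow in steps 3 to 5 can be sketched without the servlet API. This is only an illustration: the class name CacheAsideSketch, the key scheme in keyFor, and the in-memory map standing in for the memcached client are all assumptions, since the article does not show how GetRequestKey and PostRequestKey build their keys.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, servlet-free sketch of the filter's cache-aside flow.
final class CacheAsideSketch {
    // Stand-in for the memcached client used in the article.
    private final Map<String, String> cache = new HashMap<>();
    // Counts how often the wrapped target (the "controller") really ran.
    int targetCalls = 0;

    // Assumed key scheme: GET keys on the URI, POST keys on the URI plus a body hash.
    static String keyFor(String method, String uri, String body) {
        if ("GET".equalsIgnoreCase(method)) {
            return "GET:" + uri;
        }
        if ("POST".equalsIgnoreCase(method)) {
            return "POST:" + uri + ":" + Integer.toHexString(String.valueOf(body).hashCode());
        }
        return null; // other methods bypass the cache
    }

    String handle(String method, String uri, String body, Supplier<String> target) {
        String key = keyFor(method, uri, body);
        if (key == null) {
            targetCalls++;
            return target.get();
        }
        String value = cache.get(key);
        if (value == null) {          // cache miss: compute the value and store it
            targetCalls++;
            value = target.get();
            cache.put(key, value);
        }
        return value;                 // cache hit, or the freshly stored value
    }

    public static void main(String[] args) {
        CacheAsideSketch filter = new CacheAsideSketch();
        filter.handle("GET", "/users/1", null, () -> "{\"id\":1}");
        filter.handle("GET", "/users/1", null, () -> "{\"id\":1}");
        System.out.println("target calls: " + filter.targetCalls);
    }
}
```

A plain hashCode() can collide for different POST bodies, so a real key generator would probably want a stronger digest; the sketch only shows where the key fits into the hit/miss decision.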

June 25, 2013

by Faheem Sohail

· 22,608 Views · 1 Like

Resolving SOAPFaultException caused by com.ctc.wstx.exc.WstxUnexpectedCharException

If you’re using a web services toolkit such as Axis2 or CXF that internally uses the Woodstox XML processor (wstx), and you’re getting an exception like this during web service calls: javax.xml.ws.soap.SOAPFaultException: Error reading XMLStreamReader. at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:...) ... Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ... at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:...) at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:...) at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:...) at com.ctc.wstx.sr.BasicStreamReader.nextTag(BasicStreamReader.java:...) then the problem is that the wstx tokenizer/parser encountered an unexpected (but not necessarily invalid per se) character, that is, a character that is not legal in the current context. According to the API docs (http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/exc/WstxUnexpectedCharException.html), this could happen, for example, if white space was missing between an attribute value and the name of the next attribute. In practice this means that you’re receiving ill-formed SOAP XML as the response. You need to check the code that constructs the SOAP response at the other end of the connection.
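One way to confirm the diagnosis is to capture the raw response body and run it through a StAX parser on its own. The sketch below is an assumption-laden illustration rather than part of the original post: the class WellFormedCheck and its firstError method are hypothetical, and it uses whichever StAX implementation XMLInputFactory.newInstance() finds (the JDK's built-in one, or Woodstox if it is on the classpath).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reports the first well-formedness error in an XML string.
final class WellFormedCheck {
    // Returns null if the whole document parses, otherwise the parser's message.
    static String firstError(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // walk every event so errors anywhere are surfaced
            }
            reader.close();
            return null;
        } catch (XMLStreamException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Well-formed: attributes separated by white space.
        System.out.println(firstError("<a x=\"1\" y=\"2\"/>"));
        // Ill-formed: no white space between the attribute value and the next name.
        System.out.println(firstError("<a x=\"1\"y=\"2\"/>"));
        // Not XML at all, e.g. a plain-text error page returned instead of a SOAP envelope.
        System.out.println(firstError("Service unavailable"));
    }
}
```

The nextFromProlog frame in the stack trace suggests the parser failed before reaching the root element, so the third case (a response that is not XML at all) is worth ruling out first.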

June 24, 2013

by Singaram Subramanian

· 21,082 Views

Automating Nginx Reverse Proxy Configuration

It’s really nice if you can decouple your external API from the details of application segregation and deployment. In a previous post I explained some of the benefits of using a reverse proxy. On my current project we’re building a distributed service-oriented architecture that also exposes an HTTP API, and we’re using a reverse proxy to route requests addressed to our API to individual components. We have chosen the excellent Nginx web server to serve as our reverse proxy; it’s fast, reliable and easy to configure. We use it to aggregate multiple services exposing HTTP APIs into a single URL space. So, for example, when you type: http://api.example.com/product/pinstripe_suit it gets routed to: http://10.0.1.101:8001/product/pinstripe_suit But when you go to: http://api.example.com/customer/103474783 it gets routed to: http://10.0.1.104:8003/customer/103474783 To the consumer of the API it appears that they are exploring a single URL space (http://api.example.com/blah/blah), but behind the scenes the different top-level segments of the URL route to different back-end servers. /product/… routes to 10.0.1.101:8001, but /customer/… routes to 10.0.1.104:8003. We also want this to be self-configuring. So, say I want to create a new component of the system that records stock levels. Rather than extending an existing component, I want to be able to write a stand-alone executable or service that exposes an HTTP endpoint, have it be automatically deployed to one of the hosts in my cloud infrastructure, and have Nginx automatically route requests addressed to http://api.example.com/stock/whatever to my new component. We also want to load balance these back-end services. We might want to deploy several instances of our new stock API and have Nginx automatically round-robin between them. We call each top-level segment ( /stock, /product, /customer ) a claim. A component publishes an ‘AddApiClaim’ message over RabbitMQ when it comes online. 
This message has 3 fields: ‘Claim', ‘ipAddress’, and ‘PortNumber’. We have a special component, ProxyAutomation, that subscribes to these messages and rewrites the Nginx configuration as required. It uses SSH and SCP to log into the Nginx server, transfer the various configuration files, and instruct Nginx to reload its configuration. We use the excellent SSH.NET library to automate this. A really nice thing about Nginx configuration is wildcard includes. Take a look at our top level configuration file: ... http { include /etc/nginx/mime.types; default_type application/octet-stream; log_format main '$remote_addr - $remote_user [$time_local] "$request" ' '$status $body_bytes_sent "$http_referer" ' '"$http_user_agent" "$http_x_forwarded_for"'; access_log /var/log/nginx/access.log main; sendfile on; keepalive_timeout 65; include /etc/nginx/conf.d/*.conf; } Line 16 says, take any *.conf file in the conf.d directory and add it here. Inside conf.d is a single file for all api.example.com requests: include /etc/nginx/conf.d/api.example.com.conf.d/upstream.*.conf; server { listen 80; server_name api.example.com; include /etc/nginx/conf.d/api.example.com.conf.d/location.*.conf; location / { root /usr/share/nginx/api.example.com; index index.html index.htm; } } This is basically saying listen on port 80 for any requests with a host header ‘api.example.com’. This has two includes. The first one at line 1, I’ll talk about later. At line 7 it says ‘take any file named location.*.conf in the subdirectory ‘api.example.com.conf.d’ and add it to the configuration. Our proxy automation component adds new components (AKA API claims) by dropping new location.*.conf files in this directory. For example, for our stock component it might create a file, ‘location.stock.conf’, like this: location /stock/ { proxy_pass http://stock; } This simply tells Nginx to proxy all requests addressed to api.example.com/stock/… to the upstream servers defined at ‘stock’. 
This is where the other include mentioned above comes in, ‘upstream.*.conf’. The proxy automation component also drops in a file named upstream.stock.conf that looks something like this: upstream stock { server 10.0.0.23:8001; server 10.0.0.23:8002; } This tells Nginx to round-robin all requests to api.example.com/stock/ to the given sockets. In this example it’s two components on the same machine (10.0.0.23), one on port 8001 and the other on port 8002. As instances of the stock component get deployed, new entries are added to upstream.stock.conf. Similarly, when components get uninstalled, the entry is removed. When the last entry is removed, the whole file is also deleted. This infrastructure allows us to decouple infrastructure configuration from component deployment. We can scale the application up and down by simply adding new component instances as required. As a component developer, I don’t need to do any proxy configuration, just make sure my component publishes add and remove API claim messages and I’m good to go.
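The two generated files are simple enough that rendering them is plain string building. The sketch below is only an illustration of that step: the original ProxyAutomation component is a .NET service using SSH.NET, whereas the class NginxClaimConfig and its method names here are hypothetical, and the transfer to the Nginx host and the reload are left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical renderer for the per-claim Nginx files described above.
final class NginxClaimConfig {
    // Contents of location.<claim>.conf: proxy /<claim>/ to the upstream of the same name.
    static String locationConf(String claim) {
        return "location /" + claim + "/ {\n"
             + "    proxy_pass http://" + claim + ";\n"
             + "}\n";
    }

    // Contents of upstream.<claim>.conf: one server line per registered host:port.
    static String upstreamConf(String claim, List<String> sockets) {
        StringBuilder sb = new StringBuilder("upstream " + claim + " {\n");
        for (String socket : sockets) {
            sb.append("    server ").append(socket).append(";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(locationConf("stock"));
        System.out.print(upstreamConf("stock", Arrays.asList("10.0.0.23:8001", "10.0.0.23:8002")));
    }
}
```

Because each claim lives in its own pair of files, adding or removing an instance only rewrites upstream.<claim>.conf, which is what makes the wildcard includes convenient.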

June 19, 2013

by Mike Hadlow

· 59,153 Views

Relations with not-found="ignore"

NHibernate has a lot of interesting and specific options for mapping entities that can cover nearly every scenario you have in mind, but you need to be aware of the performance implications of each advanced option. Suppose you are in a legacy-database scenario where Entity A references Entity B, but someone outside the control of NHibernate can delete records from the table used by Entity B without clearing the corresponding referencing field on Entity A. You end up with a database with broken references, where rows in Table A reference, by id, a record in Table B that no longer exists. When this happens, if you load an entity of type A that references an entity of type B that was deleted, NHibernate throws an exception when you try to access the navigation property, because it cannot find the related entity in the database. If you know NHibernate, you can use the not-found="ignore" mapping option, which tells NHibernate to ignore a broken reference: if Entity A references an Entity B that was already deleted from the database, the reference is ignored, the navigation property is set to null, and no exception occurs. This solution is not without side effects. First of all, every time you load an entity of type A, another query is issued to the database to verify that the related Entity B is really there. This effectively disables lazy loading, because the related entity is always selected. This is not an optimal scenario, because you end up with a lot of extra queries, and it happens because not-found="ignore" is only a way to sidestep the real problem: you have broken foreign keys in your database. My suggestion is to fix the data, keep the database free of broken foreign keys, and remove every not-found="ignore" mapping option unless you really have no other solution. Please remember that even if you are using NHibernate, you should not forget SQL capabilities. 
As an example, SQL Server (like almost every relational database on the market) lets you set up rules for foreign keys, e.g. ON DELETE SET NULL, which automatically sets a foreign key to null when the related record is deleted. Such a feature prevents broken foreign keys, even if some legacy process manipulates the database and deletes records without updating the related foreign keys. See more at: http://www.codewrecks.com/blog/index.php/2013/06/18/relations-with-not-foundignore-disable-lazy-load-and-impact-on-performances/

June 19, 2013

by Ricci Gian Maria

· 5,148 Views

How to Optimize MySQL UNION for High Speed

There are two ways to speed up UNIONs in a MySQL database. First, use UNION ALL if at all possible, and second, try to push down your conditions. 1. UNION ALL is much faster than UNION How does a UNION work? Imagine you have two tables for shirts. The short_sleeve table looks like this: blue, green, gray, black. And the long_sleeve table looks like this: red, green, yellow, blue. If you UNION those two tables, MySQL will first sort the combined set into a temp table like this: black, blue, blue, gray, green, green, red, yellow. Once it has done this sort, it can easily remove the duplicate blue and the duplicate green to produce this result set: black, blue, gray, green, red, yellow. Why does it do this? UNION is defined that way in SQL. Duplicates must be removed, and this is an efficient way for the MySQL engine to remove them: combine results, sort, remove duplicates and return the set. Queries with UNION can be accelerated in two ways. Switch to UNION ALL, or try to push ORDER BY, LIMIT and WHERE conditions inside each subquery. You’ll be glad you did! What if we did UNION ALL? The result would look like this: blue, green, gray, black, red, green, yellow, blue. It doesn’t have to sort, and doesn’t have to remove duplicates. If you imagine combining two 10 million row tables without having to sort, this speedup can be HUGE. 2. Use Push-down Conditions to Speed Up UNION in MySQL Imagine that in our example above the shirts have a design date, the year they were released. Yes, we’re keeping this example very simple to illustrate the concept. 
Here is the short_sleeve table: blue 2013, green 2013, green 2012, gray 2011, black 2009, black 2011. And the long_sleeve table looks like this: red 2012, red 2013, green 2011, yellow 2010, blue 2011. To find the 2013 designs, you could combine them like this (release is backtick-quoted because RELEASE is a reserved word in MySQL): SELECT type, `release` FROM ((SELECT type, `release` FROM short_sleeve) UNION (SELECT type, `release` FROM long_sleeve)) AS shirts WHERE `release` >= 2013; Here the WHERE clause works on this 11 record temp table: black 2009, black 2011, blue 2011, blue 2013, gray 2011, green 2013, green 2012, green 2011, red 2012, red 2013, yellow 2010. But it would be much faster to move the WHERE inside each subquery like this: (SELECT type, `release` FROM short_sleeve WHERE `release` >= 2013) UNION (SELECT type, `release` FROM long_sleeve WHERE `release` >= 2013); That would be operating on a combined 3 record table, which is faster to sort and deduplicate. Smaller result sets cache better too, so the saving keeps paying off on later queries. That’s what performance optimization is all about! Remember that multi-million row sets in each part of this query will quickly illustrate the optimization. We’re using very small results to make visualizing easier. You can also use this optimization for ORDER BY and for LIMIT conditions. By reducing the number of records returned by EACH PART of the UNION, you reduce the work that happens at the stage where they are all combined. If you’re seeing some UNION queries in your slow query log, I suggest you try this optimization out and see if you can tweak it.
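The row counts in the shirt example can be checked with a small in-memory model. This is not MySQL and says nothing about its actual execution plan: the class UnionSketch is hypothetical, rows are plain "type year" strings, and a TreeSet stands in for the sort-and-deduplicate step the article describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical in-memory model of UNION, UNION ALL and a pushed-down WHERE.
final class UnionSketch {
    static final List<String> SHORT_SLEEVE = Arrays.asList(
            "blue 2013", "green 2013", "green 2012", "gray 2011", "black 2009", "black 2011");
    static final List<String> LONG_SLEEVE = Arrays.asList(
            "red 2012", "red 2013", "green 2011", "yellow 2010", "blue 2011");

    // UNION ALL: plain concatenation, no sorting, duplicates kept.
    static List<String> unionAll(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // UNION: concatenate, then sort and drop duplicates (modelled with a TreeSet).
    static List<String> union(List<String> a, List<String> b) {
        return new ArrayList<>(new TreeSet<>(unionAll(a, b)));
    }

    // WHERE release >= year, applied to "type year" rows.
    static List<String> since(List<String> rows, int year) {
        return rows.stream()
                .filter(r -> Integer.parseInt(r.substring(r.lastIndexOf(' ') + 1)) >= year)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Filter after the UNION: the dedup step sees every row from both tables.
        List<String> filteredLate = since(union(SHORT_SLEEVE, LONG_SLEEVE), 2013);
        // Pushed-down filter: the dedup step only sees rows that can survive.
        List<String> pushedDown = union(since(SHORT_SLEEVE, 2013), since(LONG_SLEEVE, 2013));

        System.out.println("same result: " + filteredLate.equals(pushedDown));
        System.out.println("rows combined without push-down: "
                + unionAll(SHORT_SLEEVE, LONG_SLEEVE).size());
        System.out.println("rows combined with push-down: "
                + unionAll(since(SHORT_SLEEVE, 2013), since(LONG_SLEEVE, 2013)).size());
    }
}
```

Both variants return the same rows; the difference is only in how many rows reach the combine step, which is the part that grows expensive on large tables.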

June 17, 2013

by Sean Hull

· 24,121 Views

Searchable Documents? Yes You Can. Another Reason to Choose AsciiDoc

Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine for the cloud, based on Apache Lucene, which provides full-text search capabilities. It is document oriented and schema free. Asciidoctor is a pure Ruby processor for converting AsciiDoc source files and strings into HTML 5, DocBook 4.5 and other formats. Apart from the Ruby part of Asciidoctor, there is an Asciidoctor-java-integration project which lets us call Asciidoctor functions from Java without noticing that Ruby code is being executed. In this post we are going to see how we can use Elasticsearch over AsciiDoc documents to make them searchable by their header information or by their content. Let's add the required dependencies (as Maven coordinates): junit:junit:4.11 (test scope), com.googlecode.lambdaj:lambdaj:2.3.3, org.elasticsearch:elasticsearch:0.90.1, org.asciidoctor:asciidoctor-java-integration:0.1.3. The Lambdaj library is used to convert AsciiDoc files to JSON documents. Now we can start an Elasticsearch instance, which in our case is going to be an embedded instance. node = nodeBuilder().local(true).node(); The next step is to parse the AsciiDoc document header, read its content and convert them into a JSON document. An example of a JSON document stored in Elasticsearch can be: { "title":"Asciidoctor Maven plugin 0.1.2 released!", "authors":[ { "author":"Jason Porter", "email":"[email protected]" } ], "version":null, "content":"= Asciidoctor Maven plugin 0.1.2 released!.....", "tags":[ "release", "plugin" ] } And for converting an AsciiDoc file to a JSON document we are going to use the XContentBuilder class, which is provided by the Elasticsearch Java API to create JSON documents programmatically. 
package com.lordofthejars.asciidoctor; import static org.elasticsearch.common.xcontent.XContentFactory.*; import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.IOException; import java.util.List; import org.asciidoctor.Asciidoctor; import org.asciidoctor.Author; import org.asciidoctor.DocumentHeader; import org.asciidoctor.internal.IOUtils; import org.elasticsearch.common.xcontent.XContentBuilder; import ch.lambdaj.function.convert.Converter; public class AsciidoctorFileJsonConverter implements Converter { private Asciidoctor asciidoctor; public AsciidoctorFileJsonConverter() { this.asciidoctor = Asciidoctor.Factory.create(); } public XContentBuilder convert(File asciidoctor) { DocumentHeader documentHeader = this.asciidoctor.readDocumentHeader(asciidoctor); XContentBuilder jsonContent = null; try { jsonContent = jsonBuilder() .startObject() .field("title", documentHeader.getDocumentTitle()) .startArray("authors"); Author mainAuthor = documentHeader.getAuthor(); jsonContent.startObject() .field("author", mainAuthor.getFullName()) .field("email", mainAuthor.getEmail()) .endObject(); List authors = documentHeader.getAuthors(); for (Author author : authors) { jsonContent.startObject() .field("author", author.getFullName()) .field("email", author.getEmail()) .endObject(); } jsonContent.endArray() .field("version", documentHeader.getRevisionInfo().getNumber()) .field("content", readContent(asciidoctor)) .array("tags", parseTags((String)documentHeader.getAttributes().get("tags"))) .endObject(); } catch (IOException e) { throw new IllegalArgumentException(e); } return jsonContent; } private String[] parseTags(String tags) { tags = tags.substring(1, tags.length()-1); return tags.split(", "); } private String readContent(File content) throws FileNotFoundException { return IOUtils.readFull(new FileInputStream(content)); } } Basically we are building the json document by calling startObject methods to start a new object, 
field method to add new fields, and startArray to start an array. Then this builder will be used to render the equivalent object in json format. Notice that we are using readDocumentHeader method from Asciidoctor class which returns header attributes from AsciiDoc file without reading and rendering the whole document. And finally content field is set with all document content. And now we are ready to start indexing documents. Note that populateData method receives as parameter a Client object. This object is from Elasticsearch Java APIand represents a connection to Elasticsearch database. import static ch.lambdaj.Lambda.convert; //.... private void populateData(Client client) throws IOException { List asciidoctorFiles = new ArrayList() {{ add(new File("target/test-classes/java_release.adoc")); add(new File("target/test-classes/maven_release.adoc")); }; List jsonDocuments = convertAsciidoctorFilesToJson(asciidoctorFiles); for (int i=0; i < jsonDocuments.size(); i++) { client.prepareIndex("docs", "asciidoctor", Integer.toString(i)).setSource(jsonDocuments.get(i)).execute().actionGet(); } client.admin().indices().refresh(new RefreshRequest("docs")).actionGet(); } private List convertAsciidoctorFilesToJson(List asciidoctorFiles) { return convert(asciidoctorFiles, new AsciidoctorFileJsonConverter()); } It is important to note that the first part of the algorithm is converting all our AsciiDocfiles (in our case two) to XContentBuilder instances by using previous converter class and the method convert of Lambdaj project. If you want you can take a look to both documents used in this example in https://github.com/asciidoctor/asciidoctor.github.com/blob/develop/news/asciidoctor-java-integration-0-1-3-released.adoc and https://github.com/asciidoctor/asciidoctor.github.com/blob/develop/news/asciidoctor-maven-plugin-0-1-2-released.adoc. Next part is inserting documents inside one index. 
This is done by using the prepareIndex method, which requires an index name (docs), an index type (asciidoctor), and the id of the document being inserted. Then we call the setSource method, which transforms the XContentBuilder object to JSON, and finally, by calling execute().actionGet(), the data is sent to the database. The final step, which refreshes the indexes by calling the refresh method, is only required because we are using an embedded instance of Elasticsearch (in production this part should not be required). After that point we can start querying Elasticsearch to retrieve information from our AsciiDoc documents. Let's start with a very simple example, which returns all documents inserted: SearchResponse response = client.prepareSearch().execute().actionGet(); Next we are going to search for all documents that have been written by Alex Soto, which in our case is one. import static org.elasticsearch.index.query.QueryBuilders.matchQuery; //.... QueryBuilder matchQuery = matchQuery("author", "Alex Soto"); or, alternatively: QueryBuilder matchQuery = matchQuery("author", "Alexander Soto"); Note that I am searching the field author for the string Alex Soto, which returns only one document. The other document is written by Jason. It is interesting that if you search for Alexander Soto, the same document is returned. This is most likely because a match query analyzes the text into terms and, by default, matches documents containing any of them, so the shared term Soto is enough. More queries: how about finding documents written by someone who is called Alex, but not Soto? import static org.elasticsearch.index.query.QueryBuilders.fieldQuery; //.... QueryBuilder matchQuery = fieldQuery("author", "+Alex -Soto"); And of course no results are returned in this case. Note that here we are using a field query instead of a match query, and we use the + and - symbols to include and exclude words. You can also find all documents which contain the word released in the title. import static org.elasticsearch.index.query.QueryBuilders.matchQuery; //.... 
QueryBuilder matchQuery = matchQuery("title", "released"); And finally let's find all documents that talks about 0.1.2 release, in this case only one document talks about it, the other one talks about 0.1.3. QueryBuilder matchQuery = matchQuery("content", "0.1.2"); Now we only have to send the query to Elasticsearch database, which is done by using prepareSearch method. SearchResponse response = client.prepareSearch("docs") .setTypes("asciidoctor") .setQuery(matchQuery) .execute() .actionGet(); SearchHits hits = response.getHits(); for (SearchHit searchHit : hits) { System.out.println(searchHit.getSource().get("content")); } Note that in this case we are printing the AsciiDoc content through console, but you could use asciidoctor.render(String content, Options options) method to render the content into required format. So in this post we have seen how to index documents using Elasticsearch, how to get some important information from AsciiDoc files using Asciidoctor-java-integration project, and finally how to execute some queries to inserted documents. Of course there are more kind of queries in Elasticsearch, but the intend of this post wasn't to explore all possibilities of Elasticsearch. Also as corollary, note how important it is using AsciiDoc format for writing your documents. Without much effort you can build a search engine for your documentation. On the other side, imagine all code that would be required to implement the same using any proprietary binary format like Microsoft Word. So we have shown another reason to use AsciiDoc instead of other formats.
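One fragile spot in the converter shown earlier is parseTags: it calls substring on the tags attribute directly, so a document without a tags attribute would fail with a NullPointerException. A more defensive variant might look like the sketch below. The class name TagParser is hypothetical, and the "[a, b]" input format is inferred from the original helper rather than from Asciidoctor documentation.

```java
import java.util.Arrays;

// Hypothetical, null-safe variant of the converter's parseTags helper.
final class TagParser {
    // Accepts "[release, plugin]", "release,plugin", "[]" or null.
    static String[] parseTags(String raw) {
        if (raw == null) {
            return new String[0];              // document has no tags attribute
        }
        String s = raw.trim();
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1); // strip the surrounding brackets
        }
        s = s.trim();
        if (s.isEmpty()) {
            return new String[0];
        }
        return s.split("\\s*,\\s*");           // tolerate any spacing around commas
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseTags("[release, plugin]")));
        System.out.println(Arrays.toString(parseTags(null)));
    }
}
```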

June 10, 2013

by Alex Soto

· 4,843 Views

Mapping Enums Done Right With @Convert in JPA 2.1

If you have ever worked with Java enums in JPA, you are definitely aware of their limitations and traps. Using an enum as a property of your @Entity is often a very good choice; however, JPA prior to 2.1 didn’t handle them very well. It gave you 2+1 choices: @Enumerated(EnumType.ORDINAL) (the default) maps enum values using Enum.ordinal(). Basically, the first enumerated value is mapped to 0 in the database column, the second to 1, etc. This is very compact and works great up to the point when you want to modify your enum. Removing or adding a value in the middle, or rearranging values, will totally break existing records. Ouch! To make matters worse, unit and integration tests often work on a clean database, so they won’t catch discrepancies in old data. @Enumerated(EnumType.STRING) is much safer because it stores the string representation of the enum. You can now safely add new values and move them around. However, renaming an enum value in Java code will still break existing records in the DB. Even more important, such a representation is very verbose, unnecessarily consuming database resources. You can also use a raw representation (e.g. a single char or int) and map it manually back and forth in @PostLoad/@PrePersist/@PreUpdate events. This is the most flexible and safe option from the database perspective, but quite ugly. Luckily, Java Persistence API 2.1 (JSR 338), released a few days ago, provides a standardized mechanism of pluggable data converters. Such an API has been present for ages in proprietary forms and it’s not really rocket science, but having it as part of JPA is a big improvement. To my knowledge EclipseLink is the only JPA 2.1 implementation available to date, so we will use it to experiment a bit. We will start from the sample Spring application developed as part of the “Poor man’s CRUD: jqGrid, REST, AJAX, and Spring MVC in one house” article. That version had no persistence, so we will add a thin DAO layer on top of Spring Data JPA backed by EclipseLink. 
The only entity so far is Book: @Entity public class Book { @Id @GeneratedValue(strategy = IDENTITY) private Integer id; //... private Cover cover; //... } Where Cover is an enum: public enum Cover { PAPERBACK, HARDCOVER, DUST_JACKET } Neither ORDINAL nor STRING is a good choice here. The former because rearranging the first three values in any way will break loading of existing records. The latter is too verbose. Here is where custom converters in JPA come into play: import javax.persistence.AttributeConverter; import javax.persistence.Converter; @Converter public class CoverConverter implements AttributeConverter<Cover, String> { @Override public String convertToDatabaseColumn(Cover attribute) { switch (attribute) { case DUST_JACKET: return "D"; case HARDCOVER: return "H"; case PAPERBACK: return "P"; default: throw new IllegalArgumentException("Unknown " + attribute); } } @Override public Cover convertToEntityAttribute(String dbData) { switch (dbData) { case "D": return Cover.DUST_JACKET; case "H": return Cover.HARDCOVER; case "P": return Cover.PAPERBACK; default: throw new IllegalArgumentException("Unknown " + dbData); } } } OK, I won’t insult you, my dear reader, by explaining this. It converts the enum to whatever will be stored in the relational database and vice versa. Theoretically, the JPA provider should apply converters automatically if they are declared with: @Converter(autoApply = true) It didn’t work for me. Moreover, declaring them explicitly instead of @Enumerated in the @Entity class didn’t work either: import javax.persistence.Convert; //... @Convert(converter = CoverConverter.class) private Cover cover; Resulting in exception: Exception Description: The converter class [com.blogspot.nurkiewicz.CoverConverter] specified on the mapping attribute [cover] from the class [com.blogspot.nurkiewicz.Book] was not found. Please ensure the converter class name is correct and exists with the persistence unit definition. Bug or feature, I had to mention the converter in orm.xml: And it flies! 
I have the freedom to modify my Cover enum (adding, rearranging, renaming) without affecting existing records. One tip I would like to share with you is related to maintainability. Every time you have a piece of code mapping from or to an enum, make sure it’s tested properly. And I don’t mean testing every possible existing value manually. I am more after a test making sure that new enum values are reflected in the mapping code. Hint: the code below will fail (by throwing IllegalArgumentException) if you add a new enum value but forget to add mapping code for it: for (Cover cover : Cover.values()) { new CoverConverter().convertToDatabaseColumn(cover); } Custom converters in JPA 2.1 are much more useful than what we saw. If you combine JPA with Scala, you can use @Converter to map database columns directly to scala.math.BigDecimal, scala.Option or a small case class. In Java there will finally be a portable way of mapping Joda-Time types. Last but not least, if you like a (very) strongly typed domain, you may wish to have a PhoneNumber class (with isInternational(), getCountryCode() and custom validation logic) instead of String or long. This small addition in JPA 2.1 will surely improve the quality of domain objects significantly. If you wish to play a bit with this feature, the sample Spring web application is available on GitHub.
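The test hint above only exercises one direction of the mapping. A slightly stricter check is a round trip through both directions, which also catches a database code that was never mapped back. The sketch below assumes nothing from JPA so that it runs on its own: the class CoverCodes and its methods are hypothetical stand-ins for the two methods of CoverConverter.

```java
// Hypothetical, JPA-free stand-in for CoverConverter with a round-trip check.
final class CoverCodes {
    enum Cover { PAPERBACK, HARDCOVER, DUST_JACKET }

    // Same mapping as convertToDatabaseColumn in the article.
    static String toColumn(Cover cover) {
        switch (cover) {
            case DUST_JACKET: return "D";
            case HARDCOVER:   return "H";
            case PAPERBACK:   return "P";
            default: throw new IllegalArgumentException("Unknown " + cover);
        }
    }

    // Same mapping as convertToEntityAttribute in the article.
    static Cover fromColumn(String code) {
        switch (code) {
            case "D": return Cover.DUST_JACKET;
            case "H": return Cover.HARDCOVER;
            case "P": return Cover.PAPERBACK;
            default: throw new IllegalArgumentException("Unknown " + code);
        }
    }

    // Fails if any enum value is unmapped in either direction or maps back to a different value.
    static void checkRoundTrip() {
        for (Cover cover : Cover.values()) {
            if (fromColumn(toColumn(cover)) != cover) {
                throw new IllegalStateException("Round trip failed for " + cover);
            }
        }
    }

    public static void main(String[] args) {
        checkRoundTrip();
        System.out.println("all Cover values round-trip");
    }
}
```

A duplicated code, where two enum values share one letter, also surfaces here, because one of the two values cannot map back to itself.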

June 6, 2013

by Tomasz Nurkiewicz

· 69,469 Views · 6 Likes

Serialization and injection

Serialization is a form of persistence: serialized data survives the process and the RAM where it was created and can be reconstituted inside different processes and machines that live in a different time or place. Sometimes serialization is a poor form of persistence in fact, one that confuses the boundary between the different schemas the data can fit in. However, what I found useful in the last years of development is to institute a strict separation: serialize Value Objects, Entities, and everything that represents the state of the application. Meanwhile, use Dependency Injection over services that are part of a larger object graph and never serialize this second kind of objects. In the discussion that follows, I make the assumption that serialization and deserialization occur on the same machine (e.g. like for web-oriented sessions.) The problem with serialization, which work transparently most of the time, is the need to serialize service objects instead of limiting the procedure to data structures. How can you store such objects? Not options Some options to solve this problems are really not options. Serialization by itself will fail because of the staleness of the references contained in these objects. For example, in PHP trying to serialize a database connections composed by a Repository or DAO object will rightly fail with an exception. Whenever an object represents a resource of the current machine, it cannot usually be serialized except in the case when the only resource involved is RAM. If the resource is disk space or other running processes such as a database daemon, the reconstitution of the object in another place and time will fail and it's best to just stop the developer immediately during storage. Quasi-options Some solutions to the problem try to avoid the staleness problem by serializing objects without their resources, and make them regrab a new version of them on deserialization. 
In PHP for example, this can be done with the __sleep() and __wakeup() magic methods, called automatically during serialization and deserializaton respectively. This deserialization mechanism introduces a dependency from the serialized Entity to external services: such a dependency is already in place when building the object the first time (passing the XService in the constructor) but it is aggravated when deserializing (depending on a XServiceFactory instead of just an XService). An improvement, from the dependencies point of view, is to reattach collaborators to deserialized objects like you would for other persistence-related tasks. For example, EntityRepository can inject the missing pieces of Entity every time its find() method is called. However, there is still another option, which is the most resilient from the modelling point of view and not only that of dependency management: injecting non-serializable collaborators through the stack. Objects can collaborate even without keeping field references to each other, and injecting dependencies as parameters move the dependency starting point from the server to the client object (which may or may not be desirable). What is most important is that Entities are relieved of having to manage external references in any context, not only that of persistence and in particular serialization. The metaphor for the 3rd option Misko Hevery likes to say: have you ever seen a credit card able to charge itself? If a CreditCard is an Entity in your domain, it would be very strange to keeping a wire attached to your wallet wherever you go. With the first option, you have the card spring a wire when it is taken out of the wallet, like in horror movies. This intelligent cable tries as its best to attach to the nearest Point of Sale (a bad case of bluetooth I think). With Repositories in mind, you're not dealing with automated wires anymore, but you're still attaching cables between cards and fixed devices. 
In reality, cards collaborate with the PoS in a quick process that lasts no more than a few seconds. Actually, sometimes they don't touch it at all, as in all Internet-based purchases. Keeping services around to deal with external dependencies does not mean the API of your Domain Model has to be biased towards service objects:

    pos.charge(creditCard);
    // can equivalently be:
    creditCard.chargeOn(pos);

This is a form of Double Dispatch, since there are two objects collaborating and you can dispatch (send messages) to both, gaining polymorphism by substituting either object. The sequence of calls is:

    client -> creditCard -> pos

The client object still sees CreditCard as a behaviorally complete object, but it is clear which dependency is needed to run each use case (each CreditCard method). You can persist a CreditCard easily and send it over the wire to caches or databases. When the time comes to charge, it is the client that has to bring forward a service able to connect to a bank.
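As a rough sketch of this third option, here is a Python version in which the entity holds only state and receives the service as a method argument. The class names mirror the metaphor; `charged_total` and `log` are hypothetical fields added only to make the effect observable.

```python
import pickle


class CreditCard:
    """Entity: plain state only, so it serializes without special hooks."""

    def __init__(self, number):
        self.number = number
        self.charged_total = 0

    def charge_on(self, pos, amount):
        # The service arrives through the stack; no field reference is kept.
        return pos.charge(self, amount)


class PointOfSale:
    """Service: hypothetical stand-in for something connected to a bank."""

    def __init__(self):
        self.log = []

    def charge(self, card, amount):
        card.charged_total += amount
        self.log.append((card.number, amount))
        return card.charged_total


# The entity survives a serialization round trip on its own.
card = pickle.loads(pickle.dumps(CreditCard("4111")))

# The client brings the service forward only when it is time to charge.
pos = PointOfSale()
card.charge_on(pos, 25)  # client -> creditCard -> pos
pos.charge(card, 10)     # the equivalent service-first call
```

Either entry point ends in the same `PointOfSale.charge` call, so a test can substitute a fake point of sale without touching the card.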

June 5, 2013

by Giorgio Sironi


Write CSV Data into Hive and Python

Apache Hive is a high-level SQL-like interface to Hadoop. It lets you execute mostly unadulterated SQL, like this:

    CREATE TABLE test_table(key string, stats map<string, int>);

The map column type is the only thing that doesn’t look like vanilla SQL here. Hive can use different backends for a given table, and map is used to interface with column-oriented backends like HBase. Essentially, because we won’t know ahead of time all the column names that could be in the HBase table, Hive just returns them all as a key/value dictionary. There are then helpers to access individual columns by key, or even to pivot the map into one key per logical row.

As part of the Hadoop family, Hive is focused on bulk loading and processing. So it’s no surprise that Hive does not support inserting raw values with SQL like the following:

    INSERT INTO suppliers (supplier_id, supplier_name) VALUES (24553, 'IBM');

However, for unit testing Hive scripts, it would be nice to be able to insert a few records manually. Then you could run your map reduce HQL and validate the output. Luckily, Hive can load CSV files, so it’s relatively easy to insert a handful of records that way.

    CREATE TABLE foobar(key string, stats map<string, int>)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':';

    LOAD DATA LOCAL INPATH '/tmp/foobar.csv' INTO TABLE foobar;

This will load a CSV file with the following data, where c4ca4-0000001-79879483-000000000124 is the key, and comments and likes are columns in a map.

    c4ca4-0000001-79879483-000000000124,comments:0|likes:0
    c4ca4-0000001-79879483-000000000124,comments:0|likes:0

Because I’ve been doing this quite a bit in my unit tests, I wrote a quick Python helper to dump a list of key/map tuples to a temporary CSV file, and then load it into Hive. This uses hiver to talk to Hive over Thrift.
    import hiver
    from django.conf import settings
    from django.core.files.temp import NamedTemporaryFile


    def _hql(hql):
        # Open a connection, run a single statement, and always close it.
        client = hiver.connect(settings.HIVE_HOST, settings.HIVE_PORT)
        try:
            client.execute(hql)
        finally:
            client.shutdown()


    def insert(table_name, rows):
        ''' cannot insert single rows via hive, need to save to a temp file and bulk load that '''
        # Text mode so that str rows can be written (required on Python 3).
        csv_file = NamedTemporaryFile(mode='w', delete=True)
        for row in rows:
            # row is a (key, dict) tuple; the dict becomes "name:value|name:value".
            map_repr = '|'.join('%s:%s' % (key, value) for key, value in row[1].items())
            csv_file.write(row[0] + "," + map_repr + "\n")
        csv_file.flush()

        try:
            _hql('DROP TABLE IF EXISTS %s' % table_name)
            _hql("""
                CREATE TABLE %s (
                    key string,
                    stats map<string, int>
                )
                ROW FORMAT DELIMITED
                FIELDS TERMINATED BY ','
                COLLECTION ITEMS TERMINATED BY '|'
                MAP KEYS TERMINATED BY ':'
                """ % (table_name))
            _hql("""
                LOAD DATA LOCAL INPATH '%s' INTO TABLE %s
                """ % (csv_file.name, table_name))
        finally:
            # Closing the file also deletes it, because delete=True.
            csv_file.close()

You can call it like this:

    insert('test_table', [
        ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
        ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
        ('c4ca4-0000001-79879496-000000000124', {'comments': 1, 'likes': 2}),
        ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
        ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
    ])
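The row-formatting step can be checked on its own, without a Hive server or hiver installed. The sketch below is only an illustration of the delimiter scheme described above (',' between fields, '|' between map items, ':' between map keys and values). The helper names `format_row` and `parse_row` are hypothetical and are not part of Hive or hiver, and the parser assumes integer map values as in the sample data.

```python
def format_row(key, stats):
    """Render one (key, dict) tuple as a line in the delimited layout."""
    map_repr = "|".join(f"{name}:{value}" for name, value in stats.items())
    return f"{key},{map_repr}"


def parse_row(line):
    """Inverse of format_row, assuming integer values in the map."""
    key, _, map_repr = line.partition(",")
    stats = {}
    if map_repr:
        for item in map_repr.split("|"):
            name, _, value = item.partition(":")
            stats[name] = int(value)
    return key, stats


rows = [
    ("c4ca4-0000001-79879483-000000000124", {"comments": 1, "likes": 2}),
    ("b4aed-0000002-79879783-000000000768", {"comments": 0, "likes": 5}),
]
lines = [format_row(key, stats) for key, stats in rows]
round_trip = [parse_row(line) for line in lines]
```

Note that nothing here escapes the delimiters, so a key or map entry containing ',', '|' or ':' would produce a line that no longer splits into the intended fields. For test fixtures with controlled keys this is usually acceptable.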

June 5, 2013

by Chase Seibert


Hadoop REST API - WebHDFS

Hadoop provides a native Java API to support file system operations...

June 3, 2013

by Istvan Szegedi
