Data Engineering Resources

The Latest Data Engineering Topics

I've seen a lot of discussion about how to monitor Camel-based applications. Most people are looking for the following features: the ability to view services (contexts, endpoints, routes), to view performance statistics (route throughput, etc.) and to perform basic operations (start/stop routes, send messages, etc.). This post will break down the options (that I know of) that are available today (as of Camel 2.8). If you have used other approaches or know of other ongoing development in this area, please let me know.

JMX APIs

Camel uses JMX to provide a standardized way to access metadata about the contexts/routes/endpoints defined in a given application. You can also use JMX to interact with these components (start/stop routes, etc.) in some interesting ways.

I recently had some very specific Camel/ActiveMQ monitoring requests from a client. After looking at the options, we ended up building a standalone Tomcat web app that used JSPs, jQuery, Ajax and JMX APIs to view route/endpoint statistics, manage Camel routes (stop, start, etc.) and monitor/manipulate ActiveMQ queues. It provided some much needed visibility and management features for our Camel/ActiveMQ based message processing application.

CamelContext

If you have a handle to the CamelContext, there are various APIs that can help describe and manage routes and endpoints. These are used by the existing Camel Web Console and can be used to build a custom interface that retrieves and uses this information in various ways. Here are some of the notable APIs:

getRouteDefinitions()
getEndpoints()
getEndpointsMap()
getRouteStatus(routeId)
startRoute(routeId)
stopRoute(routeId)
removeRoute(routeId)
addRoutes(routeBuilder)
suspendRoute(routeId)
resumeRoute(routeId)

With a little creativity, you can use these APIs to manage, monitor and re-wire a Camel application dynamically.

Camel Web Console

This console provides web and REST interfaces to Camel contexts/routes/endpoints and allows you to view/manage endpoints/routes, send messages to endpoints, view route statistics, etc.

That being said, using this web console with an existing Camel application is tricky at the moment. It's currently deployed as a war file that only has access to the CamelContext defined in its embedded Spring XML file, though the entire camel-web project can be embedded and customized in your application if you desire (and know Scalate). Given my recent client requirements, I opted to build my own basic app using JSPs/JMX as described above.

There has been some recent support for deploying this console in OSGi, where it should be able to view any CamelContexts deployed in the container, etc. However, I have yet to see this work...more on this later.

Using Camel APIs

There are also a number of Camel technologies/patterns that can be used to add monitoring to existing routes.

wire tap - can add message logging (to a file or JMS queue/topic, etc.) or other inline processing
adviceWith - can be used to modify existing routes to apply before/after operations or add/remove operations in a route
intercept - can be used to intercept Exchanges while they are en route; can apply to all endpoints, certain endpoints or just starting endpoints
BrowsableEndpoint - an interface which Endpoints may implement to support browsing of the exchanges which are pending or have been sent on it

That being said, it takes some creativity to use these effectively, and caution to not adversely affect the routes you are trying to monitor.

Hyperic HQ

You can use this tool to monitor ServiceMix (or any process), but it is more geared towards system monitoring and JVM stats. I didn't find it useful for any Camel-specific monitoring.

jConsole/VisualVM

These are standard JMX-based consoles. They aren't web based and can't be customized (easily, anyway) to provide anything more than a tree-like view of JMX MBeans. If you know where to look, though, you can do a lot with them.

Summary

These are just some quick notes at this point. As I learn about other ways of monitoring Camel, I'll update this list and give a more detailed comparison. Any comments are welcome...
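As a rough sketch of the JMX approach described above, the following plain-JDK code looks up route MBeans, reads one statistic and stops a route. The `org.apache.camel:type=routes,*` name pattern, the `ExchangesCompleted` attribute and the `stop` operation are assumptions about how Camel 2.x exposes routes over JMX, so check them in jConsole against your own application. `CamelJmxSketch` is a hypothetical helper name.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

final class CamelJmxSketch {
    // Assumed naming scheme for Camel route MBeans; verify in jConsole.
    static final String ROUTE_PATTERN = "org.apache.camel:type=routes,*";

    /** Returns the names of all MBeans matching the assumed route pattern. */
    static Set<ObjectName> findRoutes(MBeanServerConnection conn) {
        try {
            return conn.queryNames(new ObjectName(ROUTE_PATTERN), null);
        } catch (IOException | JMException e) {
            throw new IllegalStateException("JMX route lookup failed", e);
        }
    }

    /** Same lookup against the MBean server of the current JVM. */
    static Set<ObjectName> findLocalRoutes() {
        return findRoutes(ManagementFactory.getPlatformMBeanServer());
    }

    /** Reads the (assumed) ExchangesCompleted counter of one route. */
    static Object exchangesCompleted(MBeanServerConnection conn, ObjectName route) {
        try {
            return conn.getAttribute(route, "ExchangesCompleted");
        } catch (IOException | JMException e) {
            throw new IllegalStateException("Could not read route statistics", e);
        }
    }

    /** Invokes the (assumed) no-argument stop operation on one route. */
    static void stopRoute(MBeanServerConnection conn, ObjectName route) {
        try {
            conn.invoke(route, "stop", new Object[0], new String[0]);
        } catch (IOException | JMException e) {
            throw new IllegalStateException("Could not stop route", e);
        }
    }
}
```

In a JVM with no Camel context running, `findLocalRoutes()` simply returns an empty set. For a remote application you would pass in an `MBeanServerConnection` obtained from a `JMXConnector` instead.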

June 27, 2012

by Ben O'Day

· 20,124 Views

Using Cookies to implement a RememberMe functionality

Some web applications may need "Remember Me" functionality. This means that, after logging in, a user keeps access to all of their data from the same machine even after the session has expired. This access remains possible until the user logs out.

If you are using Spring and its login form, then you should use the "Remember Me" functionality already implemented inside the framework. Some web frameworks also offer a sign-in panel which already has "remember me" built in. But in case you have to implement "Remember Me" functionality on your own, this can be easily achieved using cookies. Java has a cookie class named javax.servlet.http.Cookie.

The algorithm is straightforward:

- your login panel must contain a "Remember Me" check box
- after a successful login with the "Remember Me" check box selected, you can create two cookies: one to keep the value for rememberMe and one to keep a token which identifies the logged-in user. For the sake of security, this token must never contain the user name or the user password. The idea is to generate a random id as the token value. The token value, along with the user id, must be saved in your storage (database)
- whenever a login is needed, you have to look for a cookie saved by you; if there is one and your "rememberMe" value is true, you can take the user from storage based on your token and do an automatic login
- when a logout is done, you have to delete the cookie that keeps the token

To add a cookie, you have to specify the maximum age of the cookie in seconds:

    HttpServletResponse servletResponse = ...;
    Cookie c = new Cookie(COOKIE_NAME, encodeString(uuid));
    c.setMaxAge(365 * 24 * 60 * 60); // one year
    servletResponse.addCookie(c);

To delete a cookie, you have to find the cookie by name and set its maximum age to 0 before adding it to the servlet response:

    HttpServletRequest servletRequest = ...;
    HttpServletResponse servletResponse = ...;
    Cookie[] cookies = servletRequest.getCookies(); // null when the request carries no cookies
    if (cookies != null) {
        for (Cookie c : cookies) {
            if (c.getName().equals(COOKIE_NAME)) {
                c.setMaxAge(0); // a maximum age of 0 tells the browser to delete the cookie
                c.setValue(null);
                servletResponse.addCookie(c);
            }
        }
    }
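The token handling described in the algorithm can be sketched as follows. This is a minimal illustration, not a complete implementation: `RememberMeTokens` is a hypothetical class, the in-memory map stands in for the database table, and the 32-byte token length is an arbitrary choice. The value returned by `issue` is what would go into the token cookie.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class RememberMeTokens {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Stand-in for the storage (database) that maps token -> user id.
    private final Map<String, String> tokenToUser = new ConcurrentHashMap<>();

    /** Creates a random token that contains neither user name nor password. */
    String issue(String userId) {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        tokenToUser.put(token, userId);
        return token;
    }

    /** Looks up the user for a token read from the cookie, if any. */
    Optional<String> userFor(String token) {
        if (token == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(tokenToUser.get(token));
    }

    /** Called on logout, together with deleting the cookie. */
    void revoke(String token) {
        if (token != null) {
            tokenToUser.remove(token);
        }
    }
}
```

Keeping the lookup on the server side means the cookie never has to carry anything that identifies the user directly.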

June 26, 2012

by Mihai Dinca - Panaitescu

· 58,976 Views · 1 Like

How to Find the Most Connected Neo4j Node Using Cypher

As I mentioned in another post about a month ago, I’ve been playing around with a neo4j graph in which people are linked to each other by ‘colleagues’ relationships.

One thing I wanted to do was work out which node is the most connected on the graph, which would tell me who’s worked with the most people.

I started off with the following cypher query:

    query = " START n = node(*)"
    query << " MATCH n-[r:colleagues]->c"
    query << " WHERE n.type? = 'person' and has(n.name)"
    query << " RETURN n.name, count(r) AS connections"
    query << " ORDER BY connections DESC"

I can then execute that via the neo4j console or through irb using the neography gem like so:

    > require 'rubygems'
    > require 'neography'
    > neo = Neography::Rest.new(:port => 7476)
    > neo.execute_query query
    # cut for brevity
    {"data"=>[["Carlos Villela", 283], ["Mark Needham", 221]], "columns"=>["n.name", "connections"]}

That shows me each person and the number of people they’ve worked with, but I wanted to be able to see the most connected person in each office.

Each person is assigned to an office while they’re working out of that office, but people tend to move around so they’ll have links to multiple offices.

I put ‘start_date’ and ‘end_date’ properties on the ‘member_of’ relationship, and we can work out a person’s current office by finding the ‘member_of’ relationship which doesn’t have an end date defined:

    query = " START n = node(*)"
    query << " MATCH n-[r:colleagues]->c, n-[r2:member_of]->office"
    query << " WHERE n.type? = 'person' and has(n.name) and not(has(r2.end_date))"
    query << " RETURN n.name, count(r) AS connections, office.name"
    query << " ORDER BY connections DESC"

And now our results look more like this:

    {"data"=>[["Carlos Villela", 283, "Porto Alegre - Brazil"], ["Mark Needham", 221, "London - UK South"]], "columns"=>["n.name", "connections"]}

If we want to restrict that to return just the people for a specific office, we can do that as well:

    query = " START n = node(*)"
    query << " MATCH n-[r:colleagues]->c, n-[r2:member_of]->office"
    query << " WHERE n.type? = 'person' and has(n.name) and (not(has(r2.end_date))) and office.name = 'London - UK South'"
    query << " RETURN n.name, count(r) AS connections, office.name"
    query << " ORDER BY connections DESC"

    {"data"=>[["Mark Needham", 221, "London - UK South"]], "columns"=>["n.name", "connections"]}

In the current version of cypher we need to put brackets around the not expression, otherwise it will apply the not to the rest of the where clause. Another way to get around that would be to put the not part of the where clause at the end of the line.

While I am able to work out the most connected person by using these queries, I’m not sure that it actually tells you who the most connected person is, because it’s heavily biased towards people who have worked on big teams.

Some ways to try and account for this are to bias the connectivity in favour of those you have worked with for longer, and also to give less weight to big teams, since you’re less likely to have a strong connection with everyone than you might in a smaller team.

I haven’t got onto that yet though!
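To make the aggregation concrete, here is a small in-memory sketch of what `count(r) AS connections ... ORDER BY connections DESC` computes: the number of outgoing ‘colleagues’ relationships per person, sorted in descending order. It does not talk to neo4j at all; `MostConnected` and the edge list format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MostConnected {
    /**
     * Each edge is a {from, to} pair, mirroring n-[:colleagues]->c.
     * Returns (person, connection count) pairs, most connected first.
     */
    static List<Map.Entry<String, Integer>> rank(List<String[]> edges) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] edge : edges) {
            // Count one connection for the start node of every relationship.
            counts.merge(edge[0], 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return ranked;
    }
}
```

Weighting by time worked together or by team size, as suggested above, would amount to replacing the constant `1` in the `merge` call with a per-relationship weight.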

June 26, 2012

by Mark Needham

· 13,517 Views

When Should I Use An ORM?

I think, like everyone, I go through the same journey whenever I find out about a new technology:

Huh? -> This is really cool -> I use it everywhere -> Hmm, sometimes it’s not so great

Remember when people were writing websites with XSLT transforms? Yes, exactly. XML is great for storing a data structure as a string, but you really don’t want to be coding your application’s business logic with it.

I’ve gone through a similar journey with Object Relational Mapping tools. After hand-coding my DALs, then code generating them, ORMs seemed like the answer to all my problems. I became an enthusiastic user of NHibernate through a number of large enterprise application builds. Even today I would still use an ORM for most classes of enterprise application. However, there are some scenarios where ORMs are best avoided.

Let me introduce my easy to use ‘when to use an ORM’ chart. It’s got two axes, ‘Model Complexity’ and ‘Throughput’.

The X-axis, Model Complexity, describes the complexity of your domain model: how many entities you have and how complex their relationships are. ORMs excel at mapping complex models between your domain and your database. If you have this kind of model, using an ORM can significantly speed up and simplify your development and you’d be a fool not to use one.

The problem with ORMs is that they are a leaky abstraction. You can’t really use them and not understand how they are communicating with your relational model. The mapping can be complex, and you have to have a good grasp of your relational database, how it responds to SQL requests, and how your ORM comes to generate both the relational schema and the SQL that talks to it. Thinking of ORMs as a way to avoid getting to grips with SQL, tables, and indexes will only lead to pain and suffering. Their benefit is that they automate the grunt work and save you the boring task of writing all that tedious CRUD column-to-property mapping code.

The Y-axis in the chart, Throughput, describes the transactional throughput of your system. At very high levels, hundreds of transactions per second, you need hard-core DBA foo to get out of the deadlocked hell where you will inevitably find yourself. When you need this kind of scalability you can’t treat your ORM as anything other than a very leaky abstraction. You will have to tweak both the schema and the SQL it generates. At very high levels you’ll need Ayende-level NHibernate skills to avoid grinding to a halt.

If you have a simple model but very high throughput, experience tells me that an ORM is more trouble than it’s worth. You’ll end up spending so much time fine-tuning your relational model and your SQL that it simply acts as an unwanted obfuscation layer. In fact, at the top end of scalability you should question the choice of a relational ACID model entirely and consider an eventually-consistent, event-based architecture.

Similarly, if your model is simple and you don’t have high throughput, you might be better off using a simple data mapper like SimpleData.

So, to sum up, ORMs are great, but think twice before using one where you have a simple model and high throughput.

June 25, 2012

by Mike Hadlow

· 19,331 Views

C# – Generic Serialization Methods

Short blog post today. These are a couple of generic serialize and deserialize methods that can be easily used when needing to serialize and deserialize classes. The methods work with any .NET type that XmlSerializer supports. That includes built-in .NET types and custom classes that you might create yourself.

These methods will only serialize PUBLIC properties of a class. Also, the XML will be human-readable instead of one long line of text.

Serialize Method

    /// <summary>
    /// Serializes the data in the object to the designated file path
    /// </summary>
    /// <typeparam name="T">Type of Object to serialize</typeparam>
    /// <param name="dataToSerialize">Object to serialize</param>
    /// <param name="filePath">FilePath for the XML file</param>
    public static void Serialize<T>(T dataToSerialize, string filePath)
    {
        using (Stream stream = File.Open(filePath, FileMode.Create, FileAccess.ReadWrite))
        {
            XmlSerializer serializer = new XmlSerializer(typeof(T));
            XmlTextWriter writer = new XmlTextWriter(stream, Encoding.Default);
            writer.Formatting = Formatting.Indented; // indent the output so it is human-readable
            serializer.Serialize(writer, dataToSerialize);
            writer.Close();
        }
    }

Deserialize Method

    /// <summary>
    /// Deserializes the data in the XML file into an object
    /// </summary>
    /// <typeparam name="T">Type of object to deserialize</typeparam>
    /// <param name="filePath">FilePath to XML file</param>
    /// <returns>Object containing deserialized data</returns>
    public static T Deserialize<T>(string filePath)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        using (Stream stream = File.Open(filePath, FileMode.Open, FileAccess.Read))
        {
            return (T)serializer.Deserialize(stream);
        }
    }

Here is some sample code to show the methods in action.

    Person p = new Person() { Name = "John Doe", Age = 42 };
    XmlHelper.Serialize(p, @"D:\text.xml");

    Person p2 = XmlHelper.Deserialize<Person>(@"D:\text.xml");
    Console.WriteLine("Name: {0}", p2.Name);
    Console.WriteLine("Age: {0}", p2.Age);
    Console.Read();

June 24, 2012

by Ryan Alford

· 28,694 Views

Handling HTTP 404 Error in ASP.NET Web API

Introduction:

Building modern HTTP/RESTful/RPC services has become very easy with the new ASP.NET Web API framework. Using the ASP.NET Web API framework, you can create HTTP services which can be accessed from browsers, machines, mobile devices and other clients. Developing HTTP services is now quite easy for ASP.NET MVC developers because ASP.NET Web API is now included in ASP.NET MVC. In addition to developing HTTP services, it is also important to return a meaningful response to the client if a resource (uri) is not found (HTTP 404) for some reason (for example, a mistyped resource uri). It is also important to centralize this response so you can configure all of the 'HTTP 404 Not Found' handling in one place. In this article, I will show you how to handle 'HTTP 404 Not Found' in one place.

Description:

Let's say that you are developing an HTTP RESTful application using the ASP.NET Web API framework. In this application you need to handle HTTP 404 errors in a centralized location. From the ASP.NET Web API point of view, you need to handle these situations:

- No route matched.
- A route is matched but no {controller} has been found on the route.
- No type with the {controller} name has been found.
- No matching action method has been found in the selected controller, because no action method starts with the request's HTTP method verb, no action method carries an attribute implementing IActionHttpMethodProvider, or no method with the matching {action} name has been found.

Now, let's create an ErrorController with a Handle404 action method. This action method will be used in all of the above cases for sending an HTTP 404 response message to the client.

    public class ErrorController : ApiController
    {
        [HttpGet, HttpPost, HttpPut, HttpDelete, HttpHead, HttpOptions, AcceptVerbs("PATCH")]
        public HttpResponseMessage Handle404()
        {
            var responseMessage = new HttpResponseMessage(HttpStatusCode.NotFound);
            responseMessage.ReasonPhrase = "The requested resource is not found";
            return responseMessage;
        }
    }

You can easily change the above action method to send some other specific HTTP 404 error response. If a client of your HTTP service sends a request to a resource (uri) and no route on the server matches this uri, then you can route the request to the above Handle404 method using a custom route. Put this route at the very bottom of the route configuration:

    routes.MapHttpRoute(
        name: "Error404",
        routeTemplate: "{*url}",
        defaults: new { controller = "Error", action = "Handle404" }
    );

Now you need to handle the case when there is no {controller} in the matching route or when no type with the {controller} name is found. You can easily handle this case and route the request to the above Handle404 method using a custom IHttpControllerSelector. Here is the definition of a custom IHttpControllerSelector:

    public class HttpNotFoundAwareDefaultHttpControllerSelector : DefaultHttpControllerSelector
    {
        public HttpNotFoundAwareDefaultHttpControllerSelector(HttpConfiguration configuration)
            : base(configuration)
        {
        }

        public override HttpControllerDescriptor SelectController(HttpRequestMessage request)
        {
            HttpControllerDescriptor decriptor = null;
            try
            {
                decriptor = base.SelectController(request);
            }
            catch (HttpResponseException ex)
            {
                var code = ex.Response.StatusCode;
                if (code != HttpStatusCode.NotFound)
                    throw;
                // Redirect the request to ErrorController.Handle404
                var routeValues = request.GetRouteData().Values;
                routeValues["controller"] = "Error";
                routeValues["action"] = "Handle404";
                decriptor = base.SelectController(request);
            }
            return decriptor;
        }
    }

Next, it is also required to pass the request to the above Handle404 method if no matching action method is found in the selected controller for the reasons discussed above. This situation can also be easily handled through a custom IHttpActionSelector. Here is the source of the custom IHttpActionSelector:

    public class HttpNotFoundAwareControllerActionSelector : ApiControllerActionSelector
    {
        public HttpNotFoundAwareControllerActionSelector()
        {
        }

        public override HttpActionDescriptor SelectAction(HttpControllerContext controllerContext)
        {
            HttpActionDescriptor decriptor = null;
            try
            {
                decriptor = base.SelectAction(controllerContext);
            }
            catch (HttpResponseException ex)
            {
                var code = ex.Response.StatusCode;
                if (code != HttpStatusCode.NotFound && code != HttpStatusCode.MethodNotAllowed)
                    throw;
                // Swap in ErrorController and select its Handle404 action
                var routeData = controllerContext.RouteData;
                routeData.Values["action"] = "Handle404";
                IHttpController httpController = new ErrorController();
                controllerContext.Controller = httpController;
                controllerContext.ControllerDescriptor = new HttpControllerDescriptor(controllerContext.Configuration, "Error", httpController.GetType());
                decriptor = base.SelectAction(controllerContext);
            }
            return decriptor;
        }
    }

Finally, we need to register the custom IHttpControllerSelector and IHttpActionSelector. Open the global.asax.cs file and add these lines:

    configuration.Services.Replace(typeof(IHttpControllerSelector), new HttpNotFoundAwareDefaultHttpControllerSelector(configuration));
    configuration.Services.Replace(typeof(IHttpActionSelector), new HttpNotFoundAwareControllerActionSelector());

Summary:

In addition to building an application for HTTP services, it is also important to send meaningful, centralized information in the response when something goes wrong, for example an 'HTTP 404 Not Found' error. In this article, I showed you how to handle the 'HTTP 404 Not Found' error in a centralized location. Hopefully you will enjoy this article too.

June 22, 2012

by Imran Baloch

· 51,410 Views

Managing ActiveMQ with JMX APIs

Here is a quick example of how to programmatically access ActiveMQ MBeans to monitor and manipulate message queues...

First, get a connection to a JMX server (this assumes localhost, port 1099, no auth). Note: always cache the connection for subsequent requests (it can cause memory utilization issues otherwise).

    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url);
    MBeanServerConnection conn = jmxc.getMBeanServerConnection();

Then, you can execute various operations such as addQueue, removeQueue, etc...

    String operationName = "addQueue";
    String parameter = "MyNewQueue";
    ObjectName activeMQ = new ObjectName("org.apache.activemq:BrokerName=localhost,Type=Broker");
    if (parameter != null) {
        Object[] params = {parameter};
        String[] sig = {"java.lang.String"};
        conn.invoke(activeMQ, operationName, params, sig);
    } else {
        conn.invoke(activeMQ, operationName, null, null);
    }

Also, you can get an ActiveMQ QueueViewMBean instance for a specified queue name...

    ObjectName activeMQ = new ObjectName("org.apache.activemq:BrokerName=localhost,Type=Broker");
    BrokerViewMBean mbean = (BrokerViewMBean) MBeanServerInvocationHandler.newProxyInstance(conn, activeMQ, BrokerViewMBean.class, true);
    for (ObjectName name : mbean.getQueues()) {
        QueueViewMBean queueMbean = (QueueViewMBean) MBeanServerInvocationHandler.newProxyInstance(conn, name, QueueViewMBean.class, true);
        if (queueMbean.getName().equals(queueName)) {
            // cache the proxy so later requests can reuse it
            queueViewBeanCache.put(cacheKey, queueMbean);
            return queueMbean;
        }
    }

Then, execute one of several APIs against the QueueViewMBean instance...

Queue monitoring - getEnqueueCount(), getDequeueCount(), getConsumerCount(), etc...
Queue manipulation - purge(), getMessage(String messageId), removeMessage(String messageId), moveMessageTo(String messageId, String destinationName), copyMessageTo(String messageId, String destinationName), etc...

Summary

The APIs can easily be used to build a web or command line based tool to support remote ActiveMQ management features. That being said, all of these features are available via the JMX console itself, and ActiveMQ does provide a web console to support some management/monitoring tasks. See these pages for more information...

http://activemq.apache.org/jmx-support.html
http://activemq.apache.org/web-console.html
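The advice to cache the JMX connection can be sketched like this. It is one possible shape under stated assumptions, not the approach used in the original application: `JmxConnectionCache` is a hypothetical class that keeps one connection per service URL, and the opener function is injectable so the caching logic can be exercised without a running broker.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

final class JmxConnectionCache<C> {
    private final Map<String, C> connections = new ConcurrentHashMap<>();
    private final Function<String, C> opener;

    JmxConnectionCache(Function<String, C> opener) {
        this.opener = opener;
    }

    /** Opens a connection on first use and reuses it afterwards. */
    C get(String serviceUrl) {
        return connections.computeIfAbsent(serviceUrl, opener);
    }

    /** Cache backed by real JMX connectors. */
    static JmxConnectionCache<JMXConnector> forJmx() {
        return new JmxConnectionCache<>(url -> {
            try {
                return JMXConnectorFactory.connect(new JMXServiceURL(url));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

With `JmxConnectionCache.forJmx()`, repeated calls to `get("service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi")` share a single `JMXConnector`. A production version would also need to close connectors and drop entries whose connection has failed.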

June 22, 2012

by Ben O'Day

· 32,198 Views · 1 Like

Wrapping Begin/End Asynchronous API into C#5 Tasks

Microsoft has offered programmers several different ways of dealing with asynchronous programming since .NET 1.0. The first model was the Asynchronous Programming Model, or APM for short. The pattern is implemented with two methods named BeginOperation and EndOperation. .NET 4 introduced a new pattern, the Task Asynchronous Pattern, and with the introduction of .NET 4.5 Microsoft added language support for a language-integrated asynchronous coding style. You can check MSDN for more samples and information. I will assume that you are familiar with it and have written code using it.

You can wrap an existing APM pattern into the TPL pattern using the Task.Factory.FromAsync methods. For example:

    public static Task<IEnumerable<T>> ExecuteAsync<T>(this DataServiceQuery<T> query, object state)
    {
        return Task.Factory.FromAsync<IEnumerable<T>>(query.BeginExecute, query.EndExecute, state);
    }

It is easy to wrap most asynchronous functions this way, but some cannot be wrapped, since the wrapper functions assume that the last two parameters to BeginOperation are AsyncCallback and object, and there are some versions of asynchronous operations that have different specifications. Examples:

1. Extra parameters after the object state parameter:

    IAsyncResult DataServiceContext.BeginExecuteBatch(AsyncCallback callback, object state, params DataServiceRequest[] queries);

2. Missing the expected object state parameter and a different return type:

    ICancelableAsyncResult BeginQuery(AsyncCallback callBack);
    WorkItemCollection EndQuery(ICancelableAsyncResult car);

Short solution for the first example

The short and elegant way of wrapping the first example is to provide the following wrapper:

    public static Task<DataServiceResponse> ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries)
    {
        if (context == null)
            throw new ArgumentNullException("context");
        return Task.Factory.FromAsync<DataServiceResponse>(
            context.BeginExecuteBatch(null, state, queries),
            context.EndExecuteBatch);
    }

We simply call the Begin method ourselves and then wrap it using another overload of the FromAsync function.

The longer way

However, we can fully wrap it ourselves by simulating what the FromAsync wrapper does. The complete code is listed below.

    public static Task<DataServiceResponse> ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries)
    {
        // this will be our sentry that will know when our async operation is completed
        var tcs = new TaskCompletionSource<DataServiceResponse>();
        try
        {
            context.BeginExecuteBatch((iar) =>
            {
                try
                {
                    var result = context.EndExecuteBatch(iar);
                    tcs.TrySetResult(result);
                }
                catch (OperationCanceledException)
                {
                    // if the inner operation was canceled, this task is cancelled too
                    tcs.TrySetCanceled();
                }
                catch (Exception ex)
                {
                    // any other exception faults the task
                    tcs.TrySetException(ex);
                }
            }, state, queries);
        }
        catch
        {
            tcs.TrySetResult(default(DataServiceResponse));
            // propagate exceptions to the outside
            throw;
        }
        return tcs.Task;
    }

Besides the educational benefits, writing the full wrapper code allows us to add cancellation, logging and diagnostic information. Once we understand how to wrap the APM pattern, we can tackle the second problem easily.

Handling the BeginQuery/EndQuery

We will first create our own wrapper function in the spirit of the above code, with the notable difference that we use the ICancelableAsyncResult interface instead of IAsyncResult.

    public static class TaskEx
    {
        public static Task<TResult> FromAsync<TResult>(Func<AsyncCallback, ICancelableAsyncResult> beginMethod, Func<ICancelableAsyncResult, TResult> endMethod)
        {
            if (beginMethod == null)
                throw new ArgumentNullException("beginMethod");
            if (endMethod == null)
                throw new ArgumentNullException("endMethod");
            var tcs = new TaskCompletionSource<TResult>();
            try
            {
                beginMethod((iar) =>
                {
                    try
                    {
                        var result = endMethod(iar as ICancelableAsyncResult);
                        tcs.TrySetResult(result);
                    }
                    catch (OperationCanceledException)
                    {
                        tcs.TrySetCanceled();
                    }
                    catch (Exception ex)
                    {
                        tcs.TrySetException(ex);
                    }
                });
            }
            catch
            {
                tcs.TrySetResult(default(TResult));
                throw;
            }
            return tcs.Task;
        }
    }

The code is pretty self-explanatory and we can go ahead with the wrapping. There are four different operations that are exposed in both synchronous and asynchronous versions: Query, LinkQuery, CountOnlyQuery and RegularQuery. The extension methods are short since we have already created our generic wrapper above:

    public static Task<WorkItemCollection> RunQueryAsync(this Query query)
    {
        return TaskEx.FromAsync(query.BeginQuery, query.EndQuery);
    }

    public static Task RunLinkQueryAsync(this Query query)
    {
        return TaskEx.FromAsync(query.BeginLinkQuery, query.EndLinkQuery);
    }

    public static Task RunCountOnlyQueryAsync(this Query query)
    {
        return TaskEx.FromAsync(query.BeginCountOnlyQuery, query.EndCountOnlyQuery);
    }

    public static Task RunRegularQueryAsync(this Query query)
    {
        return TaskEx.FromAsync(query.BeginRegularQuery, query.EndRegularQuery);
    }

That is it for today. You can easily write your own handy extensions for the APM functions out there.

June 21, 2012

by Toni Petrina

· 10,650 Views

Top 10 Causes of Java EE Enterprise Performance Problems

Performance problems are one of the biggest challenges to expect when designing and implementing Java EE related technologies.

June 20, 2012

by Pierre - Hugues Charbonneau

· 274,121 Views · 20 Likes

Fast Index Creation with InnoDB

InnoDB has been able to build indexes by sort since the InnoDB Plugin for MySQL 5.1, which is a lot faster than building them through insertion. Especially for tables much larger than memory and large uncorrelated indexes, you might be looking at a 10x difference or more. Yet for some reason the InnoDB team has chosen to use a very small (just 1MB), hard-coded buffer for this operation, which means almost any such index build operation has to use excessive sort merge passes, significantly slowing down the index build process.

Mark Callaghan and the Facebook team fixed this in their tree back in early 2011 by adding the innodb_merge_sort_block_size variable, and I was thinking this small patch would be merged into MySQL 5.5 promptly, yet it has not happened to date.

Here is an example of the gains you can expect (courtesy of Alexey Kopytov), using a 1 million row Sysbench table.

    Buffer Length | alter table sbtest add key(c)
    1MB           | 34 sec
    8MB           | 26 sec
    100MB         | 21 sec
    128MB         | 17 sec
    REBUILD       | 37 sec

REBUILD in this table is using "fast_index_creation=0", which disables fast index creation in Percona Server and forces the complete table to be rebuilt instead.

Looking at this data, we can see that even for such a small table it is possible to improve index creation time 2x by using a large buffer. We can also see that performance improves substantially even when increasing the buffer from 1MB to 8MB, which might be a sensible default, as even small systems should be able to allocate 8MB to do an alter table.

You may be wondering why in this case the table rebuild is so close in performance to building the index by sort with a small buffer. This comes from building an index on a long character field holding very short values: InnoDB uses fixed-size records for the sort space, which results in a lot more work than you would otherwise need. Having some optimization to better deal with this case would also be nice. The table also fit in the buffer pool completely in this case, which means the table rebuild could be done fast too.

Results are from Percona Server 5.5.24.
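The "2x" claim can be checked directly against the timings in the table. The short sketch below only redoes that arithmetic, taking the 1MB run (34 sec) as the baseline; `IndexBuildSpeedup` is a hypothetical helper name and the numbers are copied from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class IndexBuildSpeedup {
    /** Speedup of a run relative to a baseline run (values above 1 are faster). */
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    /** Speedup of every row in the table relative to the 1MB buffer. */
    static Map<String, Double> versusOneMegabyte() {
        Map<String, Double> timings = new LinkedHashMap<>();
        timings.put("1MB", 34.0);
        timings.put("8MB", 26.0);
        timings.put("100MB", 21.0);
        timings.put("128MB", 17.0);
        timings.put("REBUILD", 37.0);

        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> row : timings.entrySet()) {
            result.put(row.getKey(), speedup(34.0, row.getValue()));
        }
        return result;
    }
}
```

This gives 34 / 17 = 2.0 for the 128MB buffer and roughly 1.3 for 8MB, while the full rebuild at 37 sec comes out slightly slower than the 1MB sort (about 0.92).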

June 19, 2012

by Peter Zaitsev

· 4,522 Views

How to Identify and Resolve Hibernate N+1 SELECT's Problems

Let’s assume that you’re writing code that tracks the price of mobile phones. Now, let’s say you have a collection of objects representing different mobile phone vendors (MobileVendor), and each vendor has a collection of objects representing the PhoneModels they offer. To put it simply, there exists a one-to-many relationship between MobileVendor and PhoneModel.

MobileVendor Class

    class MobileVendor {
        long vendor_id;
        PhoneModel[] phoneModels;
        ...
    }

Okay, so you want to print out all the details of the phone models. A naive O/R implementation would SELECT all mobile vendors and then do N additional SELECTs to get the PhoneModel information for each vendor.

    -- Get all Mobile Vendors
    SELECT * FROM MobileVendor;

    -- For each MobileVendor, get PhoneModel details
    SELECT * FROM PhoneModel WHERE vendor_id = ?

As you see, the N+1 problem can happen if the first query populates the primary objects and a second query is then run once for each primary object returned to populate its child objects.

Resolve N+1 SELECTs problem

(i) HQL fetch join

    "from MobileVendor mobileVendor left join fetch mobileVendor.phoneModels phoneModels"

The corresponding SQL would be (assuming tables as follows: t_mobile_vendor for MobileVendor and t_phone_model for PhoneModel):

    SELECT * FROM t_mobile_vendor vendor LEFT OUTER JOIN t_phone_model model ON model.vendor_id = vendor.vendor_id

(ii) Criteria query

    Criteria criteria = session.createCriteria(MobileVendor.class);
    criteria.setFetchMode("phoneModels", FetchMode.EAGER);

In both cases, our query returns a list of MobileVendor objects with the phoneModels initialized. Only one query needs to be run to return all the PhoneModel and MobileVendor information required.
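The difference in query counts can be illustrated without Hibernate or a database. The sketch below is a toy model only: `NPlusOneDemo` is a hypothetical class whose methods stand in for SQL statements and simply count how often they are called.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NPlusOneDemo {
    // In-memory stand-in for the two tables: vendor id -> phone model names.
    private final Map<Long, List<String>> modelsByVendor = new LinkedHashMap<>();

    /** Number of simulated SELECT statements executed so far. */
    int queries = 0;

    NPlusOneDemo(Map<Long, List<String>> data) {
        modelsByVendor.putAll(data);
    }

    /** Simulates: SELECT * FROM MobileVendor */
    List<Long> selectVendors() {
        queries++;
        return new ArrayList<>(modelsByVendor.keySet());
    }

    /** Simulates: SELECT * FROM PhoneModel WHERE vendor_id = ? */
    List<String> selectModels(long vendorId) {
        queries++;
        return modelsByVendor.get(vendorId);
    }

    /** Simulates the single joined query produced by a fetch join. */
    Map<Long, List<String>> selectVendorsJoinModels() {
        queries++;
        return new LinkedHashMap<>(modelsByVendor);
    }

    /** Naive loading: one query for the vendors plus one per vendor. */
    Map<Long, List<String>> loadNaively() {
        Map<Long, List<String>> result = new LinkedHashMap<>();
        for (long vendorId : selectVendors()) {
            result.put(vendorId, selectModels(vendorId));
        }
        return result;
    }
}
```

With three vendors, `loadNaively()` issues 1 + 3 = 4 simulated queries, while `selectVendorsJoinModels()` returns the same data with one.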

June 13, 2012

by Singaram Subramanian

· 201,687 Views · 13 Likes

Every Programmer Should Know These Latency Numbers

This is interesting stuff: Jonas Bonér organized some general latency data by Peter Norvig as a Gist, and others expanded on it. What's interesting is how scaling time up by a billion converts a CPU instruction cycle into approximately one heartbeat, and yields a disk seek time of "a semester in university".

### Latency numbers every programmer should know

    L1 cache reference ......................... 0.5 ns
    Branch mispredict ............................ 5 ns
    L2 cache reference ........................... 7 ns
    Mutex lock/unlock ........................... 25 ns
    Main memory reference ...................... 100 ns
    Compress 1K bytes with Zippy ............. 3,000 ns = 3 µs
    Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs
    SSD random read ........................ 150,000 ns = 150 µs
    Read 1 MB sequentially from memory ..... 250,000 ns = 250 µs
    Round trip within same datacenter ...... 500,000 ns = 0.5 ms
    Read 1 MB sequentially from SSD* ..... 1,000,000 ns = 1 ms
    Disk seek ........................... 10,000,000 ns = 10 ms
    Read 1 MB sequentially from disk .... 20,000,000 ns = 20 ms
    Send packet CA->Netherlands->CA .... 150,000,000 ns = 150 ms

    * Assuming ~1GB/sec SSD

![Visual representation of latencies](http://i.imgur.com/k0t1e.png)

Visual chart provided by [ayshen](https://gist.github.com/ayshen)
Data by [Jeff Dean](http://research.google.com/people/jeff/)
Originally by [Peter Norvig](http://norvig.com/21-days.html#answers)

Let's multiply all these durations by a billion:

Magnitudes:

### Minute:
    L1 cache reference                  0.5 s       One heart beat (0.5 s)
    Branch mispredict                   5 s         Yawn
    L2 cache reference                  7 s         Long yawn
    Mutex lock/unlock                   25 s        Making a coffee

### Hour:
    Main memory reference               100 s       Brushing your teeth
    Compress 1K bytes with Zippy        50 min      One episode of a TV show (including ad breaks)

### Day:
    Send 2K bytes over 1 Gbps network   5.5 hr      From lunch to end of work day

### Week
    SSD random read                     1.7 days    A normal weekend
    Read 1 MB sequentially from memory  2.9 days    A long weekend
    Round trip within same datacenter   5.8 days    A medium vacation
    Read 1 MB sequentially from SSD     11.6 days   Waiting for almost 2 weeks for a delivery

### Year
    Disk seek                           16.5 weeks  A semester in university
    Read 1 MB sequentially from disk    7.8 months  Almost producing a new human being
    The above 2 together                1 year

### Decade
    Send packet CA->Netherlands->CA     4.8 years   Average time it takes to complete a bachelor's degree
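The "multiply by a billion" trick is simple arithmetic: one nanosecond becomes one second, so a latency of N ns reads as N seconds on a human scale. Below is a small Java sketch of that conversion. The class name, the unit thresholds, and the rounding to one decimal are my own assumptions, so a few values come out in different units or rounded slightly differently than in the gist (for example, it reports weeks where the gist uses months).

```java
import java.util.Locale;

class LatencyScale {
    private static final double MINUTE = 60, HOUR = 3_600, DAY = 86_400,
            WEEK = 7 * DAY, YEAR = 365 * DAY;

    /** Scales a latency given in nanoseconds by one billion and formats it in a readable unit. */
    static String humanize(double nanos) {
        double seconds = nanos; // 1 ns * 1,000,000,000 = 1 s, so the number itself is unchanged
        if (seconds < MINUTE) return fmt(seconds, "s");
        if (seconds < HOUR) return fmt(seconds / MINUTE, "min");
        if (seconds < DAY) return fmt(seconds / HOUR, "hr");
        if (seconds < 2 * WEEK) return fmt(seconds / DAY, "days");
        if (seconds < YEAR) return fmt(seconds / WEEK, "weeks");
        return fmt(seconds / YEAR, "years");
    }

    private static String fmt(double value, String unit) {
        return String.format(Locale.ROOT, "%.1f %s", value, unit);
    }

    public static void main(String[] args) {
        System.out.println("L1 cache reference:              " + humanize(0.5));
        System.out.println("Compress 1K bytes with Zippy:    " + humanize(3_000));
        System.out.println("Read 1 MB sequentially from SSD: " + humanize(1_000_000));
        System.out.println("Disk seek:                       " + humanize(10_000_000));
        System.out.println("Send packet CA->Netherlands->CA: " + humanize(150_000_000));
    }
}
```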

June 12, 2012

by Howard Lewis Ship

· 137,503 Views

How to Get the JPQL/SQL String From a CriteriaQuery in JPA ?

I.T. is full of complex things that should (and sometimes could) be simple. Getting the JPQL/SQL String representation for a JPA 2.0 CriteriaQuery is one of them. By now you all know the JPA 2.0 Criteria API: a type-safe way to write a JPQL query. This API is clever in that you don’t use Strings to build your query, but it is quite verbose… and sometimes you get lost in dozens of lines of Java code, just to write a simple query. You get lost in your CriteriaQuery, you don’t know why your query doesn’t work, and you would love to debug it. But how do you debug it? Well, one way would be to just display the JPQL and/or SQL representation. Simple, isn’t it? Yes, but the JPA 2.0 javax.persistence.Query interface doesn’t have an API to do this. You then need to rely on the implementation… meaning the code is different if you use EclipseLink, Hibernate or OpenJPA.

The CriteriaQuery we want to debug

Let’s say you have a simple Book entity and you want to retrieve all the books sorted by their id. Something like SELECT b FROM Book b ORDER BY b.id DESC. How would you write this with the CriteriaQuery? Well, something like these 5 lines of Java code:

    CriteriaBuilder cb = em.getCriteriaBuilder();
    CriteriaQuery<Book> q = cb.createQuery(Book.class);
    Root<Book> b = q.from(Book.class);
    q.select(b).orderBy(cb.desc(b.get("id")));
    TypedQuery<Book> findAllBooks = em.createQuery(q);

So imagine when you have more complex ones. Sometimes you just get lost, it gets buggy, and you would appreciate having the JPQL and/or SQL String representation to find out what’s happening. You could then even unit test it.

Getting the JPQL/SQL String Representations for a Criteria Query

So let’s use an API to get the JPQL/SQL String representations of a CriteriaQuery (to be more precise, of the TypedQuery created from a CriteriaQuery). The bad news is that there is no standard JPA 2.0 API to do this.
You need to use the implementation API, hoping the implementation allows it (thank god that’s (nearly) the case for the 3 main JPA ORM frameworks). The good news is that the Query interface (and therefore TypedQuery) has an unwrap method. This method returns the provider’s query API implementation. Let’s see how you can use it with EclipseLink, Hibernate and OpenJPA.

EclipseLink

EclipseLink’s Query representation is the org.eclipse.persistence.jpa.JpaQuery interface and the org.eclipse.persistence.internal.jpa.EJBQueryImpl implementation. This interface gives you the wrapped native query (org.eclipse.persistence.queries.DatabaseQuery) with two very handy methods: getJPQLString() and getSQLString(). Unfortunately, the getJPQLString() method will not translate a CriteriaQuery into JPQL; it only works for queries originally written in JPQL (dynamic or named queries). The getSQLString() method relies on the query being “prepared”, meaning you have to run the query once before getting the SQL String representation.

    findAllBooks.unwrap(JpaQuery.class).getDatabaseQuery().getJPQLString(); // doesn't work for CriteriaQuery
    findAllBooks.unwrap(JpaQuery.class).getDatabaseQuery().getSQLString();

Hibernate

Hibernate’s Query representation is org.hibernate.Query. This interface has several implementations and the very useful method that returns the SQL query string: getQueryString(). I couldn’t find a method that returns the JPQL representation; if I’ve missed something, please let me know.

    findAllBooks.unwrap(org.hibernate.Query.class).getQueryString();

OpenJPA

OpenJPA’s Query representation is org.apache.openjpa.persistence.QueryImpl and it also has a getQueryString() method, which returns the JPQL (not the SQL). It delegates the call to the internal org.apache.openjpa.kernel.Query interface. I couldn’t find a method that returns the SQL representation; if I’ve missed something, please let me know.
    findAllBooks.unwrap(org.apache.openjpa.persistence.QueryImpl.class).getQueryString();

Unit testing

Once you get your query String, why not unit test it? Hey, but I don’t want to test my ORM, why would I do that? Well, it happens that I discovered a bug in the new releases of OpenJPA by unit testing a query… so there is a use case for that. Anyway, this is how you could do it:

    assertEquals("SELECT b FROM Book b ORDER BY b.id DESC",
        findAllBooks.unwrap(org.apache.openjpa.persistence.QueryImpl.class).getQueryString());

Conclusion

As you can see, it’s not that simple to get a String representation for a TypedQuery. Here is a digest of the three main ORMs:

ORM Framework | Query implementation | How to get the JPQL String | How to get the SQL String
EclipseLink | JpaQuery | getDatabaseQuery().getJPQLString()* | getDatabaseQuery().getSQLString()**
Hibernate | Query | N/A | getQueryString()
OpenJPA | QueryImpl | getQueryString() | N/A

(*) Only possible on a dynamic or named query. Not possible on a CriteriaQuery.
(**) You need to execute the query first; if not, the value is null.

To illustrate all that, I’ve written simple test cases using EclipseLink, Hibernate and OpenJPA that you can download from GitHub. Give it a try and let me know.

And what about having an API in JPA 2.1?

From a developer’s point of view it would be great to have two methods in the javax.persistence.Query (and therefore javax.persistence.TypedQuery) interface that would easily return the JPQL and SQL String representations, e.g. Query.getJPQLString() and Query.getSQLString(). Hey, this would be the perfect time to get it into JPA 2.1, which will be shipped in less than a year. Now, for an implementer this might be tricky to do; I would love to hear your point of view on this.
Anyway, I’m going to post an email to the JPA 2.1 Expert Group… just in case we can have this in the next version of JPA ;o)

References

http://efreedom.com/Question/1-6412774/Get-SQL-String-JPQLQuery
http://old.nabble.com/Cannot-get-the-JPQL—SQL-String-of-a-CriteriaQuery-td33882629.html
http://paddyweblog.blogspot.fr/2010/04/some-examples-of-criteria-api-jpa-20.html
http://www.altuure.com/2010/09/23/jpa-criteria-api-by-samples-part-i/
http://www.altuure.com/2010/09/23/jpa-criteria-api-by-samples-%E2%80%93-part-ii/
http://www.jumpingbean.co.za/blogs/jpa2-criteria-api
http://wiki.eclipse.org/EclipseLink/FAQ/JPA#How_to_get_the_SQL_for_a_Query.3F

June 5, 2012

by Antonio Goncalves

· 60,970 Views · 1 Like

Database unit testing with DBUnit, Spring and TestNG

I really like Spring, so I tend to use its features to the fullest. However, in some dark corners of its philosophy, I tend to disagree with some of its assumptions. One such assumption is the way database testing should work. In this article, I will explain how to configure your projects to make Spring Test and DBUnit play nicely together in a multi-developer environment.

Context

My basic need is to be able to test some complex queries: before integration tests, I have to validate that those queries get me the right results. These are not unit tests per se, but let's treat them as such. In order to achieve this, I have been using a framework named DBUnit for a while. Although it has not been maintained since late 2010, I haven't yet found a replacement (be my guest for proposals). I also have some constraints:

- I want to use TestNG for all my test classes, so that new developers don't have to think about which test framework to use
- I want to be able to use Spring Test, so that I can inject my test dependencies directly into the test class
- I want to be able to see for myself the database state at the end of any of my tests, so that if something goes wrong, I can execute my own queries to discover why
- I want every developer to have their own isolated database instance/schema

Regarding the last point, our organization lets us benefit from a single Oracle schema per developer for those "unit tests".

Basic set up

Spring provides the AbstractTestNGSpringContextTests class out-of-the-box. In turn, this means we can apply TestNG annotations as well as @Autowired on child classes. It also means we have access to the underlying applicationContext, but I prefer not to use it (and don't need to in any case). The structure of such a test would look like this:

    @ContextConfiguration(locations = "classpath:persistence-beans.xml")
    public class MyDaoTest extends AbstractTestNGSpringContextTests {

        @Autowired
        private MyDao myDao;

        @Test
        public void whenXYZThenTUV() {
            ...
        }
    }

Readers familiar with Spring and TestNG shouldn't be surprised here.

Bringing in DBunit

"DbUnit is a JUnit extension targeted at database-driven projects that, among other things, puts your database into a known state between test runs. [...] DbUnit has the ability to export and import your database data to and from XML datasets. Since version 2.0, DbUnit can also work with very large datasets when used in streaming mode. DbUnit can also help you to verify that your database data match an expected set of values."

DBunit being a JUnit extension, test classes are expected to extend the provided parent class org.dbunit.DBTestCase. In my context, I have to redefine some setup and teardown operations in order to use the Spring inheritance hierarchy instead. Luckily, the DBUnit developers thought about that and offer relevant documentation. Among the different strategies available, my tastes tend toward the CLEAN_INSERT and NONE operations on setup and teardown respectively. This way, I can check the database state directly if my test fails. This updates my test class like so:

    @ContextConfiguration(locations = {"classpath:persistence-beans.xml", "classpath:test-beans.xml"})
    public class MyDaoTest extends AbstractTestNGSpringContextTests {

        @Autowired
        private MyDao myDao;

        @Autowired
        private IDatabaseTester databaseTester;

        @BeforeMethod
        protected void setUp() throws Exception {
            // Get the XML and set it on the databaseTester
            // Optional: get the DTD and set it on the databaseTester
            databaseTester.setSetUpOperation(DatabaseOperation.CLEAN_INSERT);
            databaseTester.setTearDownOperation(DatabaseOperation.NONE);
            databaseTester.onSetup();
        }

        @Test
        public void whenXYZThenTUV() {
            ...
        }
    }

Per-user configuration with Spring

Of course, we need to have a specific Spring configuration file to inject the databaseTester. As an example, here is one:

However, there's more than meets the eye. Notice the databaseTester has to be fed a datasource.
Since a requirement is to have a database per developer, there are basically two options: either use an in-memory database, or use the same database as in production and provide one such database schema per developer. I tend toward the latter solution (when possible) since it tends to decrease differences between the testing environment and the production environment. Thus, in order for each developer to use their own schema, I use Spring's ability to replace Java system properties at runtime: each developer is characterized by a different user.name. Then, I configure a PropertyPlaceholderConfigurer that looks for a ${user.name}.database.properties file, which looks like so:

    db.username=myusername1
    db.password=mypassword1
    db.schema=myschema1

This lets me achieve my goal of each developer using their own instance of Oracle. If you want to use this strategy, do not forget to provide a specific database.properties for the Continuous Integration server.

Huh oh?

Finally, the whole testing chain is configured up to the database tier. Yet, when the previous test is run, everything is fine (or not), but when checking the database, it looks untouched. Strangely enough, if you did load some XML dataset and assert against it during the test, it does behave accordingly: this bears all the symptoms of a transaction issue. In fact, when you look closely at Spring's documentation, everything becomes clear. Spring's vision is that the database should be left untouched by running tests, in complete contradiction to DBUnit's. This is achieved by simply rolling back all changes at the end of the test by default. In order to change this behavior, the only thing to do is annotate the test class with @TransactionConfiguration(defaultRollback=false). Note that this doesn't prevent us from specifying, on a case-by-case basis, methods that shouldn't affect the database state, using the @Rollback annotation.
The test class becomes:

    @ContextConfiguration(locations = {"classpath:persistence-beans.xml", "classpath:test-beans.xml"})
    @TransactionConfiguration(defaultRollback=false)
    public class MyDaoTest extends AbstractTestNGSpringContextTests {

        @Autowired
        private MyDao myDao;

        @Autowired
        private IDatabaseTester databaseTester;

        @BeforeMethod
        protected void setUp() throws Exception {
            // Get the XML and set it on the databaseTester
            // Optional: get the DTD and set it on the databaseTester
            databaseTester.setSetUpOperation(DatabaseOperation.CLEAN_INSERT);
            databaseTester.setTearDownOperation(DatabaseOperation.NONE);
            databaseTester.onSetup();
        }

        @Test
        public void whenXYZThenTUV() {
            ...
        }
    }

Conclusion

Though Spring's and DBUnit's views on database testing are opposed, Spring's configuration versatility lets us make it fit our needs (and benefit from DI). Of course, other improvements are possible: pushing up common code into a parent test class, etc.

To go further:

- Spring Test documentation
- DBUnit site
- Database data verification
- Database testing best practices
- Generating DTD from your database schema
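As a rough illustration of the per-developer lookup described under "Per-user configuration with Spring", the sketch below derives the properties file name from the user.name system property and reads the three keys shown earlier. It is plain Java rather than Spring configuration, and the class and method names (PerUserDbConfig, fileNameFor, parse) are hypothetical; in the article this resolution is done by Spring's placeholder mechanism, not by hand.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PerUserDbConfig {

    /** Builds the per-developer file name, e.g. "alice.database.properties". */
    static String fileNameFor(String userName) {
        return userName + ".database.properties";
    }

    /** Parses properties content; a real setup would read the file from the classpath instead. */
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // user.name is a standard JVM system property holding the OS account name.
        String file = fileNameFor(System.getProperty("user.name"));
        System.out.println("Would load: " + file);

        Properties props = parse("db.username=myusername1\ndb.password=mypassword1\ndb.schema=myschema1\n");
        System.out.println("Schema: " + props.getProperty("db.schema"));
    }
}
```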

June 4, 2012

by Nicolas Fränkel

· 59,697 Views

Spring Integration Gateways - Null Handling & Timeouts

Spring Integration (SI) Gateways

Spring Integration Gateways provide a semantically rich interface to message sub-systems. Gateways are specified using namespace constructs; these reference a specific Java interface that is backed by an object dynamically implemented at run-time by the Spring Integration framework. Furthermore, these Java interfaces can, if you so wish, be defined entirely independently of any Spring artefacts - that's both code and configuration.

One of the primary advantages of using the SI gateway as an interface to message sub-systems is that it's possible to automatically adopt the benefit of rich, default and customisable gateway configuration. One such configuration attribute deserves further scrutiny and discussion, primarily because it's easy to misunderstand and misconfigure: default-reply-timeout.

Primary Motivator for Gateway Analysis

During recent consulting engagements, I've encountered a number of deployments that use Spring Integration Gateway specifications that may, in some circumstances, lead to production operational instability. This has often been in high-pressure environments or those where technology support is not backed by adequate training, testing, review or technology mentoring.

How do gateways behave in Spring Integration (R2.0.5)

One of the key sections regarding gateways in the Spring Integration manual clearly explains gateway semantics. Below is a 2-dimensional table of possible non-standard gateway returns for each of the scenarios that the SI Manual (r2.0.5) refers to.

Gateway Non-standard Responses

Runtime Events | default-reply-timeout=x, Single-threaded | default-reply-timeout=x, Multi-threaded | default-reply-timeout=null, Single-threaded | default-reply-timeout=null, Multi-threaded
1. Long Running Process | Thread Parked | null returned | Thread Parked | Thread Parked
2. Null Returned Downstream | null returned | null returned | Thread Parked | Thread Parked
3. void method Downstream | null returned | null returned | Thread Parked | Thread Parked
4. Runtime Exception | Error handler invoked or exception thrown. | Error handler invoked or exception thrown. | Error handler invoked or exception thrown. | Error handler invoked or exception thrown.

The key parts of this table are the conditions that lead to invoking threads being parked (noted in red), nulls returned (noted in orange) and exceptions (noted in green). Each contributor consists of configuration that is under the developer's control, deployed code that is under the developer's control, and conditions that are usually not under the developer's control.

The column headings in the table above combine two gateway configuration aspects. The first is default-reply-timeout, which is set by the SI configurer and is the amount of time that a client call is willing to wait for a response from the gateway. The second is the threading model: synchronous flows are represented by Single-threaded flows, asynchronous by Multi-threaded flows.

A synchronous, or single-threaded, flow is one such as the following:

The implicit input channel (gateway-request-channel) has no associated dispatcher configured.

An asynchronous, or multi-threaded, flow is one such as the following:

The explicit input channel has a dispatcher configured ("taskExecutor"). This task executor specifies a thread pool that supplies threads for execution and whose configuration as above marks a thread boundary. Note: this is not the only way of making channels asynchronous.

The other configuration attribute referenced is default-reply-timeout; this is set on the gateway namespace configuration, as in the example above. Note that both of these runtime aspects are set by the configurer during SI flow design and implementation. They are entirely under developer control.

The 'Runtime Events' column indicates gateway-relevant runtime events that have to be considered during gateway configuration - these are obviously not under developer control.
Trigger conditions for these events are not as unusual as one may hope.

1. Long Running Processes

It's not uncommon for thread pools to become exhausted because all pooled threads are waiting for an external resource accessed through a socket; this may be a long-running database query, a firewall keeping a connection open despite the server terminating, etc. There is significant potential for these types of trigger. Some long-running processes terminate naturally; some never complete, and an application restart is required.

2. Null returned downstream

A null may be returned from a downstream SI construct such as a Transformer, Service Activator or Gateway. A Gateway may return null in some circumstances, such as following a gateway timeout event.

3. Void method downstream

Any custom code invoked during an SI flow may use a void method signature. This can also be caused by configuration in circumstances where flows are determined dynamically at runtime.

4. Runtime Exception

RuntimeExceptions can be triggered during normal operation and are generally handled by catching them at the gateway or allowing them to propagate through. The reason that they are coloured green in the table above is that they are generally much easier to handle than timeouts.

Gateway Timeout Handling Strategies

There are four possible outcomes from invoking a gateway with a request message, all of them the result of specific runtime events: a) an ordinary message response, b) an exception message, c) a null or d) no response. Ordinary business responses and exceptions are straightforward to understand and will not be covered further in this article. The two significant outcomes that will be explored further are strategies for dealing with nulls and no response. Generally speaking, long-running processes either terminate or not.
Long-running processes that terminate may eventually return a message through the invoked gateway, or the gateway may time out first depending on the timeout configuration, in which case a null may be returned. The severity of this as a problem depends on throughput volume, the length of the long-running process and system resources (thread-pool size).

Configuration exists for default-reply-timeout

In the case where a long-running process event is underway and a default-reply-timeout has been set, as long as the long-running process completes before the default-reply-timeout expires, there is no problem to deal with. However, if the long-running process does not complete before that timeout expires, one of three outcomes will apply.

Firstly, if the long-running process terminates subsequent to the reply timeout expiry, the gateway will have already returned null to the invoker, so the null response needs handling by the invoker. The thread handling the long-running process will be returned to the pool.

Secondly, if the long-running process does not terminate and a reply timeout has been set, the gateway will return null to the gateway invoker but the thread executing the long-running process will not get returned to the pool.

Thirdly, and most significantly, if a default-reply-timeout has been configured but the long-running process is running on the same thread as the invoker, i.e. synchronous channels supply messages to that process, the thread will not return; the default-reply-timeout has no effect.

Assuming the most common processing scenario, a long-running process completes either before or after the reply timeout expiry. When a null is returned by the gateway, the invoker is forced to deal with a null response. It's often unacceptable to force gateway consumers to deal with null responses, and it is not necessary: with a little additional configuration, this can be avoided.
Absent Configuration for default-reply-timeout

The most significant danger exists around gateways that have no default-reply-timeout configuration set. A long-running process or a null returned from downstream will mean that the invoking thread is parked. This is true for both synchronous and asynchronous flows and may ultimately force an application to be restarted, because the invoker thread pool is likely to start on a depletion course if this continues to occur.

Spring Integration Timeout Handling Design Strategies

For those Spring Integration configuration designers who are comfortable with gateway invokers dealing with null responses and exceptions, and who set default-reply-timeouts on gateways, there's no need to read further. However, if you wish to provide clients of your gateway with a more predictable response, a couple of strategies exist for handling null responses from gateways so that invokers are protected from having to deal with them.

Firstly, the simplest solution is to wrap the gateway with a service activator. The gateway must have the default-reply-timeout attribute value set in order to avoid unnecessary parking of threads. In order to avoid the consequences of long-running threads, it's also very prudent to use a dispatcher soon after entry to the gateway - this breaks the thread boundary.

Whilst this is a valid technical approach, the impact is that we have forced a different entry point to our message sub-system. Entry is now via a Service Activator rather than a Gateway. A side effect of this change is that the testing entry point changes. Integration tests that would normally reference a gateway to send a message now have to locate the backing implementation for the Service Activator, which is not ideal.

An alternative approach toward solving this problem would be to configure two gateways with a Service Activator between them. Only one of the gateways would be exposed to invokers: the outer one. Both Gateways would reference the same service interface.
The outer gateway specification would not specify the default-reply-timeout but would specify the input and output channels in the same way that a single gateway would. The Service Activator between the Gateways would handle null gateway responses, and possibly any exceptions if that is preferred to the gateway error handler approach. An example is as follows:

The Service Activator bean (enrollmentServiceGatewayHandler) deals with both null and exception responses from the adapted gateway (enrollmentServiceAdaptedGateway); in the situation where these are generated, a business response detailing the error is generated.

Spring Integration R2.1 Changes

async-executor on gateway spec

May 26, 2012

by Matt Vickery

· 24,405 Views · 1 Like

The Limited Usefulness of AsyncContext.start()

Some time ago I came across the question "What's the purpose of AsyncContext.start(...) in Servlet 3.0?". Quoting the Javadoc of the aforementioned method: "Causes the container to dispatch a thread, possibly from a managed thread pool, to run the specified Runnable."

To remind all of you, AsyncContext is a standard way, defined in the Servlet 3.0 specification, to handle HTTP requests asynchronously. Basically, an HTTP request is no longer tied to an HTTP thread, allowing us to handle it later, possibly using fewer threads. It turned out that the specification provides an API to handle asynchronous threads in a different thread pool out of the box. First we will see how this feature is completely broken and useless in Tomcat and Jetty - and then we will discuss why its usefulness is questionable in general.

Our test servlet will simply sleep for a given amount of time. This is a scalability killer in normal circumstances: even though a sleeping servlet is not consuming CPU, the sleeping HTTP thread tied to that particular request consumes memory - and no other incoming request can use that thread. In our test setup I limited the number of HTTP worker threads to 10, which means only 10 concurrent requests completely block the application (it is unresponsive from the outside) even though the application itself is almost completely idle. So clearly sleeping is an enemy of scalability.

    @WebServlet(urlPatterns = Array("/*"))
    class SlowServlet extends HttpServlet with Logging {

      protected override def doGet(req: HttpServletRequest, resp: HttpServletResponse) {
        logger.info("Request received")
        val sleepParam = Option(req.getParameter("sleep")) map {_.toLong}
        TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10)
        logger.info("Request done")
      }
    }

Benchmarking this code reveals that the average response times are close to the sleep parameter as long as the number of concurrent connections is below the number of HTTP threads.
Unsurprisingly, the response times begin to grow the moment we exceed the HTTP thread count. The eleventh connection has to wait for some other request to finish and release a worker thread. When the concurrency level exceeds 100, Tomcat begins to drop connections - too many clients are already queued.

So what about the fancy AsyncContext.start() method (not to be confused with ServletRequest.startAsync())? According to the JavaDoc I can submit any Runnable and the container will use some managed thread pool to handle it. This will help partially, as I no longer block HTTP worker threads (but still another thread somewhere in the servlet container is used). Quickly switching to an asynchronous servlet:

    @WebServlet(urlPatterns = Array("/*"), asyncSupported = true)
    class SlowServlet extends HttpServlet with Logging {

      protected override def doGet(req: HttpServletRequest, resp: HttpServletResponse) {
        logger.info("Request received")
        val asyncContext = req.startAsync()
        asyncContext.setTimeout(TimeUnit.MINUTES.toMillis(10))
        asyncContext.start(new Runnable() {
          def run() {
            logger.info("Handling request")
            val sleepParam = Option(req.getParameter("sleep")) map {_.toLong}
            TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10)
            logger.info("Request done")
            asyncContext.complete()
          }
        })
      }
    }

We first enable asynchronous processing and then simply move sleep() into a Runnable and, hopefully, a different thread pool, releasing the HTTP thread pool. A quick stress test reveals slightly unexpected results (here: response times vs. number of concurrent connections):

Guess what, the response times are exactly the same as with no asynchronous support at all (!) After closer examination I discovered that when AsyncContext.start() is called, Tomcat submits the given task back to... the HTTP worker thread pool, the same one that is used for all HTTP requests! This basically means that we have released one HTTP thread just to utilize another one milliseconds later (maybe even the same one).
There is absolutely no benefit in calling AsyncContext.start() in Tomcat. I have no idea whether this is a bug or a feature. On one hand this is clearly not what the API designers intended. The servlet container was supposed to manage a separate, independent thread pool so that the HTTP worker thread pool is still usable. I mean, the whole point of asynchronous processing is to escape the HTTP pool. Tomcat pretends to delegate our work to another thread, while it still uses the original worker thread pool. So why do I consider this to be a feature? Because Jetty is "broken" in exactly the same way...

No matter whether this works as designed or is only a poor API implementation, using AsyncContext.start() in Tomcat and Jetty is pointless and only unnecessarily complicates the code. It won't give you anything; the application works exactly the same under high load as if there was no asynchronous logic at all.

But what about using this API feature on correct implementations like IBM WAS? It is better, but still the API as is doesn't give us much in terms of scalability. To explain again: the whole point of asynchronous processing is the ability to decouple an HTTP request from an underlying thread, preferably by handling several connections using the same thread. AsyncContext.start() will run the provided Runnable in a separate thread pool. Your application is still responsive and can handle ordinary requests while the long-running requests that you decided to handle asynchronously are processed in a separate thread pool. It is better, but unfortunately the thread pool and the thread-per-connection idiom are still a bottleneck. For the JVM it doesn't matter what type of threads are started - they still occupy memory. So we are no longer blocking HTTP worker threads, but our application is no more scalable in terms of the number of concurrent long-running tasks we can support.
In this simple and unrealistic example with a sleeping servlet, we can actually support thousands of concurrent (waiting) connections using Servlet 3.0 asynchronous support with only one extra thread - and without AsyncContext.start(). Do you know how? Hint: ScheduledExecutorService.

Postscriptum: Scala goodness

I almost forgot. Even though the examples were written in Scala, I haven't used any cool language features yet. Here is one: implicit conversions. Make this available in your scope:

    implicit def blockToRunnable[T](block: => T) = new Runnable {
      def run() {
        block
      }
    }

And suddenly you can use a code block instead of instantiating Runnable manually and explicitly:

    asyncContext start {
      logger.info("Handling request")
      val sleepParam = Option(req.getParameter("sleep")) map { _.toLong}
      TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10)
      logger.info("Request done")
      asyncContext.complete()
    }

Sweet!
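Going back to the ScheduledExecutorService hint: here is one way to read it, sketched in plain Java so that it runs without a servlet container. Everything named here is an assumption made for illustration: DelayedCompletionDemo and runSimulation are hypothetical, and CountDownLatch.countDown() merely stands in for the asyncContext.complete() call a real servlet would schedule after req.startAsync(). The point it tries to show is that a delayed completion is just a timer entry, so a single scheduler thread can serve many waiting "requests" without any of them occupying a sleeping thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the sleeping servlet: no servlet API involved.
final class DelayedCompletionDemo {

    // Schedules `requests` delayed completions on ONE scheduler thread and
    // returns how many of them completed within a 30 second safety timeout.
    static int runSimulation(int requests, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch completions = new CountDownLatch(requests);
        try {
            for (int i = 0; i < requests; i++) {
                // In a servlet this task would call asyncContext.complete().
                // Nothing sleeps here; the scheduler only stores a timer entry.
                scheduler.schedule(completions::countDown, delayMillis, TimeUnit.MILLISECONDS);
            }
            completions.await(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return requests - (int) completions.getCount();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int completed = runSimulation(1000, 100);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(completed + " simulated requests completed in about " + elapsedMs + " ms");
    }
}
```

With a thread-per-request sleep and a pool of ten threads, a thousand 100 ms requests would need roughly ten seconds; with scheduling, all of them become due at about the same time and the one thread only has to run the short completion tasks. This only works because the example "work" is pure waiting. A request that actually burns CPU or blocks on I/O would still need a thread while it does so.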

May 22, 2012

by Tomasz Nurkiewicz

· 17,547 Views · 1 Like

Lucene Setup on OracleDB in 5 Minutes

This tutorial is for people who want to run an Apache Lucene example with OracleDB in just five minutes.

May 19, 2012

by Mohammad Juma

· 31,358 Views · 4 Likes

Virtualization in WPF with VirtualizingStackPanel

First blogged about this on my previous blog site here: http://consultingblogs.emc.com/merrickchaffer/archive/2011/02/14/virtualization-in-wpf-with-virtualizingstackpanel.aspx However, having come across this again today on a project, I thought it was important enough to re-blog!

I finally managed to figure out how to get virtualization to actually behave itself in a WPF ListBox control. It turns out that in order for virtualization to work, you need five things satisfied.

1. Use a control that supports virtualization (e.g. list box or list view). (See the Controls That Implement Performance Features section at the bottom of this page for more info: http://msdn.microsoft.com/en-us/library/cc716879.aspx#Controls )
2. Ensure that the ScrollViewer.CanContentScroll attached property is set to True on the containing list box / list view control.
3. Ensure that either the list box has a height set, or that it is contained within a parent Grid row, where that row definition has a height set (Height="*" will do if you want it to occupy the client window height). Note: Do not use Height="Auto", as this instructs WPF to simply size the row to the height needed to fit all the items of the list box, so the vertical scroll bar never appears.
4. Ensure that there is no wrapping ScrollViewer control around the list box, as this will prevent virtualization from occurring.
5. Ensure that you use a VirtualizingStackPanel in the ItemsPanelTemplate for the ListBox.ItemsPanel.

Example

May 14, 2012

by Merrick Chaffer

· 28,302 Views

EasyNetQ, a simple .NET API for RabbitMQ

After pondering the results of our message queue shootout, we decided to run with Rabbit MQ. Rabbit ticks all of the boxes: it’s supported (by Spring Source and then VMware ultimately), scales and has the features and performance we need. The RabbitMQ.Client provided by Spring Source is a thin wrapper that quite faithfully exposes the AMQP protocol, so it expects messages as byte arrays.

For the shootout tests spraying byte arrays around was fine, but in the real world, we want our messages to be .NET types. I also wanted to provide developers with a very simple API that abstracted away the Exchange/Binding/Queue model of AMQP and instead provides a simple publish/subscribe and request/response model. My inspiration was the excellent work done by Dru Sellers and Chris Patterson with MassTransit (the new V2.0 beta is just out).

The code is on GitHub here: https://github.com/mikehadlow/EasyNetQ

The API centres around an IBus interface that looks like this:

    /// <summary>
    /// Provides a simple Publish/Subscribe and Request/Response API for a message bus.
    /// </summary>
    public interface IBus : IDisposable
    {
        /// <summary>
        /// Publishes a message.
        /// </summary>
        /// <typeparam name="T">The message type</typeparam>
        /// <param name="message">The message to publish</param>
        void Publish<T>(T message);

        /// <summary>
        /// Subscribes to a stream of messages that match a .NET type.
        /// </summary>
        /// <typeparam name="T">The type to subscribe to</typeparam>
        /// <param name="subscriptionId">
        /// A unique identifier for the subscription. Two subscriptions with the same subscriptionId
        /// and type will get messages delivered in turn. This is useful if you want multiple subscribers
        /// to load balance a subscription in a round-robin fashion.
        /// </param>
        /// <param name="onMessage">
        /// The action to run when a message arrives.
        /// </param>
        void Subscribe<T>(string subscriptionId, Action<T> onMessage);

        /// <summary>
        /// Makes an RPC style asynchronous request.
        /// </summary>
        /// <typeparam name="TRequest">The request type.</typeparam>
        /// <typeparam name="TResponse">The response type.</typeparam>
        /// <param name="request">The request message.</param>
        /// <param name="onResponse">The action to run when the response is received.</param>
        void Request<TRequest, TResponse>(TRequest request, Action<TResponse> onResponse);

        /// <summary>
        /// Responds to an RPC request.
        /// </summary>
        /// <typeparam name="TRequest">The request type.</typeparam>
        /// <typeparam name="TResponse">The response type.</typeparam>
        /// <param name="responder">
        /// A function to run when the request is received. It should return the response.
        /// </param>
        void Respond<TRequest, TResponse>(Func<TRequest, TResponse> responder);
    }

To create a bus, just use a RabbitHutch, sorry I couldn’t resist it :)

    var bus = RabbitHutch.CreateRabbitBus("localhost");

You can just pass in the name of the server to use the default Rabbit virtual host ‘/’, or you can specify a named virtual host like this:

    var bus = RabbitHutch.CreateRabbitBus("localhost/myVirtualHost");

The first messaging pattern I wanted to support was publish/subscribe. Once you’ve got a bus instance, you can publish a message like this:

    var message = new MyMessage {Text = "Hello!"};
    bus.Publish(message);

This publishes the message to an exchange named by the message type. You subscribe to a message like this:

    bus.Subscribe<MyMessage>("test", message => Console.WriteLine(message.Text));

This creates a queue named ‘test_’ and binds it to the message type’s exchange. When a message is received it is passed to the Action delegate. If there is more than one subscriber to the same message type named ‘test’, Rabbit will hand out the messages in a round-robin fashion, so you get simple load balancing out of the box. Subscribers to the same message type, but with different names, will each get a copy of the message, as you’d expect.

The second messaging pattern is an asynchronous RPC. You can call a remote service like this:

    var request = new TestRequestMessage {Text = "Hello from the client! "};
    bus.Request<TestRequestMessage, TestResponseMessage>(request, response =>
        Console.WriteLine("Got response: '{0}'", response.Text));

This first creates a new temporary queue for the TestResponseMessage. It then publishes the TestRequestMessage with a return address to the temporary queue. When the TestResponseMessage is received, it passes it to the Action delegate. RabbitMQ happily creates temporary queues and provides a return address header, so this was very easy to implement.

To write an RPC server, simply use the Respond method like this:

    bus.Respond<TestRequestMessage, TestResponseMessage>(request =>
        new TestResponseMessage { Text = request.Text + " all done!" });

This creates a subscription for the TestRequestMessage. When a message is received, the Func delegate is passed the request and returns the response. The response message is then published to the temporary client queue. Once again, scaling RPC servers is simply a question of running up new instances. Rabbit will automatically distribute messages to them.

The features of AMQP (and Rabbit) make creating this kind of API a breeze. Check it out and let me know what you think.
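The subscription rules described above (same subscription name: messages handed out in turn; different names: each gets a copy) can be illustrated with a small in-memory model. This is only a toy sketch in Java, not EasyNetQ or RabbitMQ code: ToyBus, subscribe and publish are hypothetical names, there is a single message type (String), and the "queues" are just lists of callbacks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy in-memory model of the delivery rules; hypothetical, no broker involved.
final class ToyBus {
    // One "queue" per subscriptionId, holding the consumers that share that id.
    private final Map<String, List<Consumer<String>>> queues = new LinkedHashMap<>();
    // Index of the consumer that receives the next message, per subscriptionId.
    private final Map<String, Integer> nextConsumer = new LinkedHashMap<>();

    void subscribe(String subscriptionId, Consumer<String> onMessage) {
        queues.computeIfAbsent(subscriptionId, id -> new ArrayList<>()).add(onMessage);
        nextConsumer.putIfAbsent(subscriptionId, 0);
    }

    void publish(String message) {
        // Every queue receives a copy; within one queue the consumers take turns.
        for (Map.Entry<String, List<Consumer<String>>> queue : queues.entrySet()) {
            List<Consumer<String>> consumers = queue.getValue();
            int index = nextConsumer.get(queue.getKey());
            consumers.get(index).accept(message);
            nextConsumer.put(queue.getKey(), (index + 1) % consumers.size());
        }
    }

    public static void main(String[] args) {
        ToyBus bus = new ToyBus();
        bus.subscribe("test", m -> System.out.println("test/A got " + m));
        bus.subscribe("test", m -> System.out.println("test/B got " + m));
        bus.subscribe("audit", m -> System.out.println("audit got " + m));
        bus.publish("one");
        bus.publish("two");
    }
}
```

In this run the two "test" subscribers alternate (A receives "one", B receives "two"), while the "audit" subscriber sees both messages. In the real system that bookkeeping is done by the broker, which is why adding another instance of a subscriber is enough to share the load.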

May 13, 2012

by Mike Hadlow

· 11,288 Views

Martin Fowler on ORM Hate

While I was at the QCon conference in London a couple of months ago, it seemed that every talk included some snarky remarks about Object/Relational mapping (ORM) tools. I guess I should read the conference emails sent to speakers more carefully, doubtless there was something in there telling us all to heap scorn upon ORMs at least once every 45 minutes. But as you can tell, I want to push back a bit against this ORM hate - because I think a lot of it is unwarranted.

The charges against them can be summarized in that they are complex, and provide only a leaky abstraction over a relational data store. Their complexity implies a grueling learning curve and often systems using an ORM perform badly - often due to naive interactions with the underlying database.

There is a lot of truth to these charges, but such charges miss a vital piece of context. The object/relational mapping problem is hard. Essentially what you are doing is synchronizing between two quite different representations of data, one in the relational database, and the other in-memory. Although this is usually referred to as object-relational mapping, there is really nothing to do with objects here. By rights it should be referred to as the in-memory/relational mapping problem, because it's true of mapping RDBMSs to any in-memory data structure. In-memory data structures offer much more flexibility than relational models, so to program effectively most people want to use the more varied in-memory structures and thus are faced with mapping that back to relations for the database.

The mapping is further complicated because you can make changes on either side that have to be mapped to the other. More complication arrives since you can have multiple people accessing and modifying the database simultaneously. The ORM has to handle this concurrency because you can't just rely on transactions - in most cases, you can't hold transactions open while you fiddle with the data in-memory.
I think that if you're going to dump on something in the way many people do about ORMs, you have to state the alternative. What do you do instead of an ORM? The cheap shots I usually hear ignore this, because this is where it gets messy. Basically it boils down to two strategies, solve the problem differently (and better), or avoid the problem. Both of these have significant flaws.

A better solution

Listening to some critics, you'd think that the best thing for a modern software developer to do is roll their own ORM. The implication is that tools like Hibernate and Active Record have just become bloatware, so you should come up with your own lightweight alternative. Now I've spent many an hour griping at bloatware, but ORMs really don't fit the bill - and I say this with bitter memory. For much of the 90's I saw project after project deal with the object/relational mapping problem by writing their own framework - it was always much tougher than people imagined. Usually you'd get enough early success to commit deeply to the framework and only after a while did you realize you were in a quagmire - this is where I sympathize greatly with Ted Neward's famous quote that object-relational mapping is the Vietnam of Computer Science [1].

The widely available open source ORMs (such as iBatis, Hibernate, and Active Record) did a great deal to remove this problem [2]. Certainly they are not trivial tools to use, as I said the underlying problem is hard, but you don't have to deal with the full experience of writing that stuff (the horror, the horror). However much you may hate using an ORM, take my word for it - you're better off.

I've often felt that much of the frustration with ORMs is about inflated expectations. Many people treat the relational database "like a crazy aunt who's shut up in an attic and whom nobody wants to talk about" [3]. In this world-view they just want to deal with in-memory data-structures and let the ORM deal with the database.
This way of thinking can work for small applications and loads, but it soon falls apart once the going gets tough. Essentially the ORM can handle about 80-90% of the mapping problems, but that last chunk always needs careful work by somebody who really understands how a relational database works.

This is where the criticism comes that ORM is a leaky abstraction. This is true, but isn't necessarily a reason to avoid them. Mapping to a relational database involves lots of repetitive, boiler-plate code. A framework that allows me to avoid 80% of that is worthwhile even if it is only 80%. The problem is in me for pretending it's 100% when it isn't. David Heinemeier Hansson, of Active Record fame, has always argued that if you are writing an application backed by a relational database you should damn well know how a relational database works. Active Record is designed with that in mind, it takes care of boring stuff, but provides manholes so you can get down with the SQL when you have to. That's a far better approach to thinking about the role an ORM should play.

There's a consequence to this more limited expectation of what an ORM should do. I often hear people complain that they are forced to compromise their object model to make it more relational in order to please the ORM. Actually I think this is an inevitable consequence of using a relational database - you either have to make your in-memory model more relational, or you complicate your mapping code. I think it's perfectly reasonable to have a more relational domain model in order to simplify your object-relational mapping. That doesn't mean you should always follow the relational model exactly, but it does mean that you take into account the mapping complexity as part of your domain model design.

So am I saying that you should always use an existing ORM rather than doing something yourself? Well I've learned to always avoid saying "always".
One exception that comes to mind is when you're only reading from the database. ORMs are complex because they have to handle a bi-directional mapping. A uni-directional problem is much easier to work with, particularly if your needs aren't too complex and you are comfortable with SQL. This is one of the arguments for CQRS.

So most of the time the mapping is a complicated problem, and you're better off using an admittedly complicated tool than starting a land war in Asia. But then there is the second alternative I mentioned earlier - can you avoid the problem?

Avoiding the problem

To avoid the mapping problem you have two alternatives. Either you use the relational model in memory, or you don't use it in the database.

To use a relational model in memory basically means programming in terms of relations, right the way through your application. In many ways this is what the 90's CRUD tools gave you. They work very well for applications where you're just pushing data to the screen and back, or for applications where your logic is well expressed in terms of SQL queries. Some problems are well suited for this approach, so if you can do this, you should. But its flaw is that often you can't.

When it comes to not using relational databases on the disk, there rises a whole bunch of new champions and old memories. In the 90's many of us (yes including me) thought that object databases would solve the problem by eliminating relations on the disk. We all know how that worked out. But there is now the new crew of NoSQL databases - will these allow us to finesse the ORM quagmire and allow us to shock-and-awe our data storage?

As you might have gathered, I think NoSQL is technology to be taken very seriously. If you have an application problem that maps well to a NoSQL data model - such as aggregates or graphs - then you can avoid the nastiness of mapping completely. Indeed this is often a reason I've heard teams go with a NoSQL solution.
This is, I think, a viable route to go - hence my interest in increasing our understanding of NoSQL systems. But even so it only works when the fit between the application model and the NoSQL data model is good. Not all problems are technically suitable for a NoSQL database. And of course there are many situations where you're stuck with a relational model anyway. Maybe it's a corporate standard that you can't jump over, maybe you can't persuade your colleagues to accept the risks of an immature technology. In this case you can't avoid the mapping problem.

So ORMs help us deal with a very real problem for most enterprise applications. It's true they are often misused, and sometimes the underlying problem can be avoided. They aren't pretty tools, but then the problem they tackle isn't exactly cuddly either. I think they deserve a little more respect and a lot more understanding.

1: I have to confess a deep sense of conflict with the Vietnam analogy. At one level it seems like a case of the pathetic overblowing of software development's problems to compare a tricky technology to war. Nasty the programming may be, but you're still in a relatively comfy chair, usually with air conditioning, and bug-hunting doesn't involve bullets coming at you. But on another level, the phrase certainly resonates with the feeling of being sucked into a quagmire.

2: There were also commercial ORMs, such as TopLink and Kodo. But the approachability of open source tools meant they became dominant.

3: I like this phrase so much I feel compelled to subject it to re-use.

May 9, 2012

by Martin Fowler

· 115,501 Views · 4 Likes