Data Resources

The Latest Data Topics

How to Compress and Uncompress a Java Byte Array Using Deflater/Inflater

Here is a small code snippet which shows an utility class that offers two methods to compress and extract a Java byte array.

February 6, 2013

by Ralf Quebbemann

· 135,217 Views · 2 Likes

Repository Pattern, Done Right

the repository pattern has been discussed a lot lately. especially about it’s usefulness since the introduction of or/m libraries. this post (which is the third in a series about the data layer) aims to explain why it’s still a great choice. let’s start with the definition : a repository mediates between the domain and data mapping layers, acting like an in-memory domain object collection. client objects construct query specifications declaratively and submit them to repository for satisfaction. objects can be added to and removed from the repository, as they can from a simple collection of objects, and the mapping code encapsulated by the repository will carry out the appropriate operations behind the scenes the repository pattern is used to create an abstraction between your domain and data layer. that is, when you use the repository you should not have to have any knowledge about the underlying data source or the data layer (i.e. entity framework, nhibernate or similar). why do we need it? read the abstractions part of my data layer article. it explains the basics to why we should use repositories or similar abstractions. but let’s also examine some simple business logic: var brokentrucks = _session.query().where(x => x.state == 1); foreach (var truck in brokentrucks) { if (truck.calculatereponsetime().totaldays > 30) sendemailtomanager(truck); } what does that give us? broken trucks? well. no. the statement was copied from another place in the code and the developer had forgot to update the query. any unit tests would likely just check that some trucks are returned and that they are emailed to the manager. so we basically have two problems here: a) most developers will likely just check the name of the variable and not on the query. b) any unit tests are against the business logic and not the query. both those problems would have been fixed with repositories. since if we create repositories we also have unit tests which targets the data layer only. implementations here are some different implementations with descriptions. base classes these classes can be reused for all different implementations. unitofwork the unit of work represents a transaction when used in data layers. typically the unit of work will roll back the transaction if savechanges() has not been invoked before being disposed. public interface iunitofwork : idisposable { void savechanges(); } paging we also need to have page results. public class pagedresult { ienumerable _items; int _totalcount; public pagedresult(ienumerable items, int totalcount) { _items = items; _totalcount = totalcount; } public ienumerable items { get { return _items; } } public int totalcount { get { return _totalcount; } } } we can with the help of that create methods like: public class userrepository { public pagedresult find(int pagenumber, int pagesize) { } } sorting finally we prefer to do sorting and page items, right? var constraints = new queryconstraints() .sortby("firstname") .page(1, 20); var page = repository.find("jon", constraints); do note that i used the property name, but i could also have written constraints.sortby(x => x.firstname) . however, that is a bit hard to write in web applications where we get the sort property as a string. the class is a bit big, but you can find it at github . in our repository we can apply the constraints as (if it supports linq): public class userrepository { public pagedresult find(string text, queryconstraints constraints) { var query = _dbcontext.users.where(x => x.firstname.startswith(text) || x.lastname.startswith(text)); var count = query.count(); //easy var items = constraints.applyto(query).tolist(); return new pagedresult(items, count); } } the extension methods are also available at github . basic contract i usually start use a small definition for the repository, since it makes my other contracts less verbose. do note that some of my repository contracts do not implement this interface (for instance if any of the methods do not apply). public interface irepository where tentity : class { tentity getbyid(tkey id); void create(tentity entity); void update(tentity entity); void delete(tentity entity); } i then specialize it per domain model: public interface itruckrepository : irepository { ienumerable findbrokentrucks(); ienumerable find(string text); } that specialization is important. it keeps the contract simple. only create methods that you know that you need. entity framework do note that the repository pattern is only useful if you have pocos which are mapped using code first. otherwise you’ll just break the abstraction using the entities. the repository pattern isn’t very useful then. what i mean is that if you use the model designer you’ll always get a perfect representation of the database (but as classes). the problem is that those classes might not be a perfect representation of your domain model. hence you got to cut corners in the domain model to be able to use your generated db classes. if you on the other hand uses code first you can modify the models to be a perfect representation of your domain model (if the db is reasonable similar to it). you don’t have to worry about your changes being overwritten as they would have been by the model designer. you can follow this article if you want to get a foundation generated for you. base class public class entityframeworkrepository where tentity : class { private readonly dbcontext _dbcontext; public entityframeworkrepository(dbcontext dbcontext) { if (dbcontext == null) throw new argumentnullexception("dbcontext"); _dbcontext = dbcontext; } protected dbcontext dbcontext { get { return _dbcontext; } } public void create(tentity entity) { if (entity == null) throw new argumentnullexception("entity"); dbcontext.set().add(entity); } public tentity getbyid(tkey id) { return _dbcontext.set().find(id); } public void delete(tentity entity) { if (entity == null) throw new argumentnullexception("entity"); dbcontext.set().attach(entity); dbcontext.set().remove(entity); } public void update(tentity entity) { if (entity == null) throw new argumentnullexception("entity"); dbcontext.set().attach(entity); dbcontext.entry(entity).state = entitystate.modified; } } then i go about and do the implementation: public class truckrepository : entityframeworkrepository, itruckrepository { private readonly truckerdbcontext _dbcontext; public truckrepository(truckerdbcontext dbcontext) { _dbcontext = dbcontext; } public ienumerable findbrokentrucks() { //compare having this statement in a business class compared //to invoking the repository methods. which says more? return _dbcontext.trucks.where(x => x.state == 3).tolist(); } public ienumerable find(string text) { return _dbcontext.trucks.where(x => x.modelname.startswith(text)).tolist(); } } unit of work the unit of work implementation is simple for entity framework: public class entityframeworkunitofwork : iunitofwork { private readonly dbcontext _context; public entityframeworkunitofwork(dbcontext context) { _context = context; } public void dispose() { } public void savechanges() { _context.savechanges(); } } nhibernate i usually use fluent nhibernate to map my entities. imho it got a much nicer syntax than the built in code mappings. you can use nhibernate mapping generator to get a foundation created for you. but you do most often have to clean up the generated files a bit. base class public class nhibernaterepository where tentity : class { isession _session; public nhibernaterepository(isession session) { _session = session; } protected isession session { get { return _session; } } public tentity getbyid(string id) { return _session.get(id); } public void create(tentity entity) { _session.saveorupdate(entity); } public void update(tentity entity) { _session.saveorupdate(entity); } public void delete(tentity entity) { _session.delete(entity); } } implementation public class truckrepository : nhibernaterepository, itruckrepository { public truckrepository(isession session) : base(session) { } public ienumerable findbrokentrucks() { return _session.query().where(x => x.state == 3).tolist(); } public ienumerable find(string text) { return _session.query().where(x => x.modelname.startswith(text)).tolist(); } } unit of work public class nhibernateunitofwork : iunitofwork { private readonly isession _session; private itransaction _transaction; public nhibernateunitofwork(isession session) { _session = session; _transaction = _session.begintransaction(); } public void dispose() { if (_transaction != null) _transaction.rollback(); } public void savechanges() { if (_transaction == null) throw new invalidoperationexception("unitofwork have already been saved."); _transaction.commit(); _transaction = null; } } typical mistakes here are some mistakes which can be stumbled upon when using or/ms. do not expose linq methods let’s get it straight. there are no complete linq to sql implementations. they all are either missing features or implement things like eager/lazy loading in their own way. that means that they all are leaky abstractions. so if you expose linq outside your repository you get a leaky abstraction. you could really stop using the repository pattern then and use the or/m directly. public interface irepository { iqueryable query(); // [...] } those repositories really do not serve any purpose. they are just lipstick on a pig (yay, my favorite) those who use them probably don’t want to face the truth: or are just not reading very good: learn about lazy loading lazy loading can be great. but it’s a curse for all which are not aware of it. if you don’t know what it is, google . if you are not careful you could get 101 executed queries instead of 1 if you traverse a list of 100 items. invoke tolist() before returning the query is not executed in the database until you invoke tolist() , firstordefault() etc. so if you want to be able to keep all data related exceptions in the repositories you have to invoke those methods. get is not the same as search there are to types of reads which are made in the database. the first one is to search after items. i.e. the user want to identify the items that he/she like to work with. the second one is when the user has identified the item and want to work with it. those queries are different. in the first one, the user only want’s to get the most relevant information. in the second one, the user likely want’s to get all information. hence in the former one you should probably return userlistitem or similar while the other case returns user . that also helps you to avoid the lazy loading problems. i usually let search methods start with findxxxx() while those getting the entire item starts with getxxxx() . also don’t be afraid of creating specialized pocos for the searches. two searches doesn’t necessarily have to return the same kind of entity information. summary don’t be lazy and try to make too generic repositories. it gives you no upsides compared to using the or/m directly. if you want to use the repository pattern, make sure that you do it properly.

February 4, 2013

by Jonas Gauffin

· 12,299 Views

Building SOLID Databases: Single Responsibility and Normalization

Introduction This instalment will cover the single responsibility principle in object-relational design, and its relationship both to data normalization and object-oriented application programming. While single responsibility is a fairly easy object-oriented principle to apply here, I think it is critical to explore in depth because it helps provide a clearer framework to address object-relational design. As in later instalments I will be using snippets of code developed elsewhere for other areas. These will not be full versions of what was written, but versions sufficient to show the basics of data structure and interface. Relations and Classes: Similarities Objects and classes, in the surface, look deceptively similar, to the point where one can look at relations as sets of classes, and in fact this equivalence is the basis of object-relational database design. Objects are data structures used to store state, which have identity and are tightly bound to interface. Relations are data structures which store state, and if they meet second normal form, have identity in the form of a primary key. Object-relational databases then provide interface and thus in an object-relational database, relations are classes, and contain sets of objects of a certain class. Relations and Classes: Differences So similar are objects and classes in structure that a very common mistake is to simply use a relational database management system as a simple object store. This tends to result in brittle database leading to a relatively brittle application. Consequently many of us see this approach as something of an anti-pattern. The reason why this doesn't work terribly well is not that the basic equivalence is bad but that relations and classes are used in very different ways. On the application layer, classes are used to model (and control) behavior, while in the database, relations and tuples are used to model information. Thus tying database structures to application classes in this way essentially overloads the data structures, turning the structures into reporting objects as well as behavior objects. Relations thus need to be seen not only as classes but as specialized classes used for persistent storage and reporting. Thus they have fundamentally different requirements than the behavior classes in the application and thus they have different reasons to change. An application class typically changes when there is a need for a change in behavior, while a relation should only change when there is a change in data retention and reporting. Relations have traditionally tended to be divorced from interface and this provides a great deal of power. While classes tend to be fairly opaque, relations tend to be very transparent. The reason here is that while both represent state information whether it is application state or other facts, objects traditionally encapsulate behavior (and thus act as building blocks of behavior), relations always encapsulate information and are building blocks of information. Thus the data structures of relations must be transparent while object-oriented design tends to push for less transparency and more abstraction. It is worth noting then that because these systems are designed to do different things, there are many DBA's who suggest encapsulating the entire database behind an API, defined by stored procedures. The typical problem with this approach is that loose coupling of the application to the interface is difficult (but see one of the first posts on this blog for a solution). When the db interface is tightly coupled to the application, then typically one ends up with problems on several levels, and it tends to sacrifice good application design for good relational database design. Single Responsibility Principle in Application Programming The single responsibility principle holds that every class should have a single responsibility which it should entirely encapsulate. A responsibility is defined as a reason to change. The canonical example of a violation of this principle is a class which might format and print a report. Because both data changes and cosmetic changes may require a change to the class, this would be a violation of the principle at issue. In an ideal world, we'd separate out the printing and the formatting so that cosmetic changes do not require changes when data changes are made and vice versa. The problem of course with the canonical example is that it is not self-contained. If you change the data in the report, it will almost certainly require cosmetic changes. You can try to automate those changes but only within limits, and you can abstract interfaces (dependency inversion) but in the end if you change the data in the report enough, cosmetic changes will become necessary. Additionally a "reason to change" is epistemologically problematic. Reasons foreseen are rarely if ever atomic, and so there is a real question as far as how far one pushes this. In terms of formatting a report, do we want to break out the class that adjusts to paper size so that if we want to move from US Letter to A4 we no longer have to change the rest of the cosmetic layout? Perfect separation of responsibilities in that example is thus impossible, as it probably always is --- you can only change business rules to a certain point before interfaces must change, and when that happens the cascading flow of necessary changes can be quite significant. The database is, however, quite different in that that responsibility of database-level code (including DDL and DML) is limited to the proposition that we should construct answers from known facts. This makes a huge difference in terms of single responsibility, and it is possible to develop mathematical definitions for single responsibility. Not only is this possible but it has been done. All of the normal forms from third on up address single responsibility. The Definition of Third Normal Form Quoting Wikipedia, Codd's definition states that a table is in 3NF if and only if both of the following conditions hold: The relation R (table) is in second normal form (2NF) Every non-prime attribute of R is non-transitively dependent (i.e. directly dependent) on every superkey of R. A non-prime attribute is an attribute not part of a superkey. In essence what third normal form states is that every relation must contain a superkey and values functionally and directly dependent on that superkey. This will become more important as we look at how data anomilies dovetail with single responsibility. Normalization and Single Responsibility The process of database normalization is an attempt to create relational databases where data anomalies do not exist. Data anomalies occur where modifying data either requires modifying other data to maintain accuracy (where no independent fact changes are recorded), or where existing data may project current or historical facts not in existence (join anomilies). This process occurs by breaking down keys and superkeys, and their dependencies, such that data is tracked in smaller, self-contained units. Beginning at third normal form, one can see relations are forming single responsibilities of managing data directly dependent on their superkeys. From this point forward, relations' structures would change (assuming no decision to further decompose a relation into a higher normal form) if and only if a change is made to what data is tracked that is directly dependent on a superkey. The responsibility of the database layer is the storage of facts and the synthesis of answers. Since the storage relations themselves handle the first, normalization is a prerequisite to good object-relational design. The one major caveat here however is that first normal form's atomicity requirement must be interpreted slightly differently in object-relational setups because more complex data structures can be atomic compared to a purely relational design. In a purely relational database, the data types that can be used are relatively minor and therefore facts must be maximally decomposed. For example we might store an IP address plus network mask as 4 ints for the address and an int for the network mask, or we might store as a single 32-bit int plus another int for the network mask but the latter poses problems of display that the former does not. In an object-relational database, however, we might store the address as an array of 4 ints for IP v4 or, if we need better performance we might build a custom type. If storage is not a problem but ease of maintenance is, we might even define relations, domains, and such to hold IP addresses, and then store the tuple in a column with appropriate functional interfaces. None of these approaches necessarily violate first normal form, as long as the data type involved properly and completely encapsulates the required behavior. Where such encapsulation is problematic, however, they would violate 1NF because they can no longer be treated as atomic values. In all cases, the specific value has a 1:1 correspondence to an IP address. Additionally where the needs are different, storage, application interface, and reporting classes should be different (this can be handled with updateable views, object-relational interfaces, and the like). Object-Relational Interfaces and Single Responsibility For purely relational databases, normalization is sufficient to address single responsibility. Object-relational designs bring some additional complexity because some behavior may be encapsulated in the object interfaces. There are two fundamental cases where this may make a difference, namely in terms of compositional patterns and in terms of encapsulated data within columns. A compositional pattern in PostgreSQL typically would occur when we use table inheritance to manage commonly co-occuring fields which occur in ways which are functionally dependent on many other fields in a database. For example, we might have a notes abstract table, and then have various tables which inherit this, possibly as part of other larger tables. A common case where composition makes a big difference is in managing notes. People may want to attach notes to all kinds of other data in a database, and so one cannot say that the text or subject of a note is mutually dependent, A typical purely relational approach is to either have many independently managed notes tables or have a single global note table which stores notes for everything, and then have multiple join tables to add join dependencies. The problem with this is that the note data is usually dependent logically, if not mathematically, on the join dependency, and so there is no reasonable way of expressing this without a great deal of complexity in the database design. An object-relational approach might be to have multiple notes tables, but have them inherit the table structure of a common notes table. This table can then be expanded, interfaces added as needed, and it should fill the single responsibility principle even though we might not be able to say that there is a natural functional dependency within the table itself. The second case has to do with storing complex information in columns. Here stability and robustness of code is especially important, and traditional approaches of the single responsibility principle apply directly to the contained data type. Example: Machine Configuration Database and SMTP configuration One of my current projects is building a network configuration database for a LedgerSMBhosting business I am helping to found (more on this soon). For competitive reasons I cannot display my whole code here. However, what I would like to do is show a very abbreviated version here as I used to solve a very specific issue. One of the basic challenges in a network configuration database is that the direct functional dependencies for a given machine may become quite complex when we assume that a given piece of network software is not likely to be running more than once on a given machine. Additionally we often want to ensure that certain sorts of software are set to be configured for certain types of machines, and so constraints can exist that force wider tables. The width and complexity of some configuration tables can possibly pose a management problem over time for the reason that they may not be obviously broken into manageable chunks of columns. One possible solution is to decompose the storage class into smaller mix-ins, each of which expresses a set of functional dependencies on a specific key, fully encapsulating a single responsibility. The overall storage class then exists to manage cross-mixin constraints and handle the actual physical storage. The data can then be presented as a unified table, or as multiple joined tables (and this works even where views would add significant complexity). In this way the smaller sub-tables can be given the responsibility of managing the configuration of specific pieces of software. We might therefore have tables like: -- abstract table, contains no data CREATE TABLE mi_smtp_config ( mi_id bigint, smtp_hostname text, smtp_forward_to text ); CREATE TABLE machine_instance ( mi_id bigserial, mi_name text not null, inservice_date date not null, retire_date date. .... ) INHERITS (mi_smtp_config, ...); The major advantage to this approach is that we can easily check and add which fields are set up to configure which software, without looking through a much larger, wider table. This also provides additional interfaces for related data, and the like. For example, "select * from mi_smtp_config" is directly equivalent of "select (mi::mi_smtp_config).* from machine_instance mi; Conclusions When we think of relations as specialized "fact classes" as opposed to "behavior classes" in the application world, the idea of the single responsibility principle works quite well with relational databases, particularly when paired with other encapsulation processes like stored procedures and views. In object-relational designs, the principle can be used as a general guide for further decomposing relations into mix-in classes, or creating intelligent data types for attributes, and it becomes possible to solve a number of problems in this regard without breaking normalization rules.

January 25, 2013

by Chris Travers

· 7,936 Views

OAuth 2.0 Bearer Token Profile Vs MAC Token Profile

Almost all the implementation I see today are based on OAuth 2.0 Bearer Token Profile. Of course its an RFC proposed standard today. OAuth 2.0 Bearer Token profile brings a simplified scheme for authentication. This specification describes how to use bearer tokens in HTTP requests to access OAuth 2.0 protected resources. Any party in possession of a bearer token (a "bearer") can use it to get access to the associated resources (without demonstrating possession of a cryptographic key). To prevent misuse, bearer tokens need to be protected from disclosure in storage and in transport. Before dig in to the OAuth 2.0 MAC profile lets have quick high-level overview of OAuth 2.0 message flow. OAuth 2.0 has mainly three phases. 1. Requesting an Authorization Grant. 2. Exchanging the Authorization Grant for an Access Token. 3. Access the resources with the Access Token. Where does the token type come in to action ? OAuth 2.0 core specification does not mandate any token type. At the same time at any point token requester - client - cannot decide which token type it needs. It's purely up to the Authorization Server to decide which token type to be returned in the Access Token response. So, the token type comes in to action in phase-2 when Authorization Server returning back the OAuth 2.0 Access Token. The access token type provides the client with the information required to successfully utilize the access token to make a protected resource request (along with type-specific attributes). The client must not use an access token if it does not understand the token type. Each access token type definition specifies the additional attributes (if any) sent to the client together with the "access_token" response parameter. It also defines the HTTP authentication method used to include the access token when making a protected resource request. For example following is what you get for Access Token response irrespective of which grant type you use. HTTP/1.1 200 OK Content-Type: application/json;charset=UTF-8 Cache-Control: no-store Pragma: no-cache { "access_token":"mF_9.B5f-4.1JqM", "token_type":"Bearer", "expires_in":3600, "refresh_token":"tGzv3JOkF0XG5Qx2TlKWIA" } The above is for Bearer - following is for MAC. HTTP/1.1 200 OK Content-Type: application/json Cache-Control: no-store { "access_token":"SlAV32hkKG", "token_type":"mac", "expires_in":3600, "refresh_token":"8xLOxBtZp8", "mac_key":"adijq39jdlaska9asud", "mac_algorithm":"hmac-sha-256" } Here you can see MAC Access Token response has two additional attributes. mac_key and the mac_algorithm. Let me rephrase this - "Each access token type definition specifies the additional attributes (if any) sent to the client together with the "access_token" response parameter". This MAC Token Profile defines the HTTP MAC access authentication scheme, providing a method for making authenticated HTTP requests with partial cryptographic verification of the request, covering the HTTP method, request URI, and host. In the above response access_token is the MAC key identifier. Unlike in Bearer, MAC token profile never passes it's top secret over the wire. The access_token or the MAC key identifier is a string identifying the MAC key used to calculate the request MAC. The string is usually opaque to the client. The server typically assigns a specific scope and lifetime to each set of MAC credentials. The identifier may denote a unique value used to retrieve the authorization information (e.g. from a database), or self-contain the authorization information in a verifiable manner (i.e. a string consisting of some data and a signature). The mac_key is a shared symmetric secret used as the MAC algorithm key. The server will not reissue a previously issued MAC key and MAC key identifier combination. Now let's see what happens in phase-3. Following shows how the Authorization HTTP header looks like when Bearer Token been used. Authorization: Bearer mF_9.B5f-4.1JqM This adds very low overhead on client side. It simply needs to pass the exact access_token it got from the Authorization Server in phase-2. Under MAC token profile, this is how it looks like. Authorization: MAC id="h480djs93hd8", ts="1336363200", nonce="dj83hs9s", mac="bhCQXTVyfj5cmA9uKkPFx1zeOXM=" This needs bit more attention. id is the MAC key identifier or the access_token from the phase-2. ts the request timestamp. The value is a positive integer set by the client when making each request to the number of seconds elapsed from a fixed point in time (e.g. January 1, 1970 00:00:00 GMT). This value is unique across all requests with the same timestamp and MAC key identifier combination. nonce is a unique string generated by the client. The value is unique across all requests with the same timestamp and MAC key identifier combination. The client uses the MAC algorithm and the MAC key to calculate the request mac. This is how you derive the normalized string to generate the HMAC. The normalized request string is a consistent, reproducible concatenation of several of the HTTP request elements into a single string. By normalizing the request into a reproducible string, the client and server can both calculate the request MAC over the exact same value. The string is constructed by concatenating together, in order, the following HTTP request elements, each followed by a new line character (%x0A): 1. The timestamp value calculated for the request. 2. The nonce value generated for the request. 3. The HTTP request method in upper case. For example: "HEAD", "GET", "POST", etc. 4. The HTTP request-URI as defined by [RFC2616] section 5.1.2. 5. The hostname included in the HTTP request using the "Host" request header field in lower case. 6. The port as included in the HTTP request using the "Host" request header field. If the header field does not include a port, the default value for the scheme MUST be used (e.g. 80 for HTTP and 443 for HTTPS). 7. The value of the "ext" "Authorization" request header field attribute if one was included in the request (this is optional), otherwise, an empty string. Each element is followed by a new line character (%x0A) including the last element and even when an element value is an empty string. Either you use Bearer of MAC - the end user or the resource owner is identified using the access_token. Authorization, throttling, monitoring or any other quality of service operations can be carried out against the access_token irrespective of which token profile you use.

January 24, 2013

by Prabath Siriwardena

· 37,173 Views

How to Publish Maven Site Docs to BitBucket or GitHub Pages

In this post we will Utilize GitHub and/or BitBucket's static web page hosting capabilities to publish our project's Maven 3 Site Documentation. Each of the two SCM providers offer a slightly different solution to host static pages. The approach spelled out in this post would also be a viable solution to "backup" your site documentation in a supported SCM like Git or SVN. This solution does not directly cover site documentation deployment covered by the maven-site-plugin and the Wagon library (scp, WebDAV or FTP). There is one main project hosted on GitHub that I have posted with the full solution. The project URL is https://github.com/mike-ensor/clickconcepts-master-pom/. The POM has been pushed to Maven Central and will continue to be updated and maintained. com.clickconcepts.project master-site-pom 0.16 GitHub Pages GitHub hosts static pages by using a special branch "gh-pages" available to each GitHub project. This special branch can host any HTML and local resources like JavaScript, images and CSS. There is no server side development. To navigate to your static pages, the URL structure is as follows: http://.github.com/ An example of the project I am using in this blog post: http://mike-ensor.github.com/clickconcepts-master-pom/ where the first bold URL segment is a username and the second bold URL segment is the project. GitHub does allow you to create a base static hosted static site for your username by creating a repository with your username.github.com. The contents would be all of your HTML and associated static resources. This is not required to post documentation for your project, unlike the BitBucket solution. There is a GitHub Site plugin that publishes site documentation via GitHub's object API but this is outside the scope of this blog post because it does not provide a single solution for GitHub and BitBucket projects using Maven 3. BitBucket BitBucket provides a similar service to GitHub in that it hosts static HTML pages and their associated static resources. However, there is one large difference in how those pages are stored. Unlike GitHub, BitBucket requires you to create a new repository with a name fitting the convention. The files will be located on the master branch and each project will need to be a directory off of the root. mikeensor.bitbucket.org/ /some-project +index.html +... /css /img /some-other-project +index.html +... /css /img index.html .git .gitignore The naming convention is as follows: .bitbucket.org An example of a BitBucket static pages repository for me would be: http://mikeensor.bitbucket.org/. The structure does not require that you create an index.html page at the root of the project, but it would be advisable to avoid 404s. Generating Site Documentation Maven provides the ability to post documentation for your project by using the maven-site-plugin. This plugin is difficult to use due to the many configuration options that oftentimes are not well documented. There are many blog posts that can help you write your documentation including my post on maven site documentation. I did not mention how to use "xdoc", "apt" or other templating technologies to create documentation pages, but not to fear, I have provided this in my GitHub project. Putting it all Together The Maven SCM Publish plugin (http://maven.apache.org/plugins/maven-scm-publish-plugin/ publishes site documentation to a supported SCM. In our case, we are going to use Git through BitBucket or GitHub. Maven SCM Plugin does allow you to publish multi-module site documentation through the various properties, but the scope of this blog post is to cover single/mono module projects and the process is a bit painful. Take a moment to look at the POM file located in the clickconcepts-master-pom project. This master POM is rather comprehensive and the site documentation is only one portion of the project, but we will focus on the site documentation. There are a few things to point out here, first, the scm-publish plugin and the idiosyncronies when implementing the plugin. In order to create the site documentation, the "site" plugin must first be run. This is accomplished by running site:site. The plugin will generate the documentation into the "target/site" folder by default. The SCM Publish Plugin, by default, looks for the site documents to be in "target/staging" and is controlled by the content parameter. As you can see, there is a mismatch between folders. NOTE: My first approach was to run the site:stage command which is supposed to put the site documents into the "target/staging" folder. This is not entirely correct, the site plugin combines with the distributionManagement.site.url property to stage the documents, but there is very strange behavior and it is not documented well. In order to get the site plugin's site documents and the SCM Publish's location to match up, use the content property and set that to the location of the Site Plugin output (). If you are using GitHub, there is no modification to the siteOutputDirectory needed, however, if you are using BitBucket, you will need to modify the property to add in a directory layer into the site documentation generation (see above for differences between GitHub and BitBucket pages). The second property will tell the SCM Publish Plugin to look at the root "site" folder so that when the files are copied into the repository, the project folder will be the containing folder. The property will look like: ${project.build.directory}/site/ ${project.artifactId} ${project.build.directory} /site Next we will take a look at the custom properties defined in the master POM and used by the SCM Publish Plugin above. Each project will need to define several properties to use the Master POM that are used within the plugins during the site publishing. Fill in the variables with your own settings. BitBucket ... ... master scm:git:[email protected]:mikeensor/mikeensor.bitbucket.org.git ${project.build.directory}/site/${project.artifactId} ${project.build.directory}/site ${changelog.bitbucket.fileUri} ${changelog.revision.bitbucket.fileUri} ... ... GitHub ... ... gh-pages scm:git:[email protected]:mikeensor/clickconcepts-master-pom.git ${changelog.github.fileUri} ${changelog.revision.github.fileUri} ... ... NOTE: changelog parameters are required to use the Master POM and are not directly related to publishing site docs to GitHub or BitBucket How to Generate If you are using the Master POM (or have abstracted out the Site Plugin and the SCM Plugin) then to generate and publish the documentation is simple. mvn clean site:site scm-publish:publish-scm mvn clean site:site scm-publish:publish-scm -Dscmpublish.dryRun=true Gotchas In the SCM Publish Plugin documentation's "tips" they recommend creating a location to place the repository so that the repo is not cloned each time. There is a risk here in that if there is a git repository already in the folder, the plugin will overwrite the repository with the new site documentation. This was discovered by publishing two different projects and having my root repository wiped out by documentation from the second project. There are ways to mitigate this by adding in another folder layer, but make sure you test often! Another gotcha is to use the -Dscmpublish.dryRun=true to test out the site documentation process without making the SCM commit and push Project and Documentation URLs Here is a list of the fully working projects used to create this blog post: Master POM with Site and SCM Publish plugins &ndash https://github.com/mike-ensor/clickconcepts-master-pom. Documentation URL: http://mike-ensor.github.com/clickconcepts-master-pom/ Child Project using Master Pom &ndash http://mikeensor.bitbucket.org/fest-expected-exception. Documentation URL: http://mikeensor.bitbucket.org/fest-expected-exception/

January 23, 2013

by Mike Ensor

· 13,443 Views

Spring Data JDBC Generic DAO Implementation: Most Lightweight ORM Ever

I am thrilled to announce first version of my Spring Data JDBC repository project. The purpose of this open source library is to provide generic, lightweight and easy to use DAO implementation for relational databases based on JdbcTemplate from Spring framework, compatible with Spring Data umbrella of projects. Design objectives Lightweight, fast and low-overhead. Only a handful of classes, no XML, annotations, reflection This is not full-blown ORM. No relationship handling, lazy loading, dirty checking, caching CRUD implemented in seconds For small applications where JPA is an overkill Use when simplicity is needed or when future migration e.g. to JPA is considered Minimalistic support for database dialect differences (e.g. transparent paging of results) Features Each DAO provides built-in support for: Mapping to/from domain objects through RowMapper abstraction Generated and user-defined primary keys Extracting generated key Compound (multi-column) primary keys Immutable domain objects Paging (requesting subset of results) Sorting over several columns (database agnostic) Optional support for many-to-one relationships Supported databases (continuously tested): MySQL PostgreSQL H2 HSQLDB Derby ...and most likely most of the others Easily extendable to other database dialects via SqlGenerator class. Easy retrieval of records by ID API Compatible with Spring Data PagingAndSortingRepository abstraction, all these methods are implemented for you: public interface PagingAndSortingRepository extends CrudRepository { T save(T entity); Iterable save(Iterable entities); T findOne(ID id); boolean exists(ID id); Iterable findAll(); long count(); void delete(ID id); void delete(T entity); void delete(Iterable entities); void deleteAll(); Iterable findAll(Sort sort); Page findAll(Pageable pageable); } Pageable and Sort parameters are also fully supported, which means you get paging and sorting by arbitrary properties for free. For example say you have userRepository extending PagingAndSortingRepository interface (implemented for you by the library) and you request 5th page of USERS table, 10 per page, after applying some sorting: Page page = userRepository.findAll( new PageRequest( 5, 10, new Sort( new Order(DESC, "reputation"), new Order(ASC, "user_name") ) ) ); Spring Data JDBC repository library will translate this call into (PostgreSQL syntax): SELECT * FROM USERS ORDER BY reputation DESC, user_name ASC LIMIT 50 OFFSET 10 ...or even (Derby syntax): SELECT * FROM ( SELECT ROW_NUMBER() OVER () AS ROW_NUM, t.* FROM ( SELECT * FROM USERS ORDER BY reputation DESC, user_name ASC ) AS t ) AS a WHERE ROW_NUM BETWEEN 51 AND 60 No matter which database you use, you'll get Page object in return (you still have to provide RowMapper yourself to translate from ResultSet to domain object. If you don't know Spring Data project yet, Page is a wonderful abstraction, not only encapsulating List , but also providing metadata such as total number of records, on which page we currently are, etc. Reasons to use You consider migration to JPA or even some NoSQL database in the future. Since your code will rely only on methods defined in PagingAndSortingRepository and CrudRepository from Spring Data Commons umbrella project you are free to switch from JdbcRepository implementation (from this project) to: JpaRepository, MongoRepository, GemfireRepository or GraphRepository. They all implement the same common API. Of course don't expect that switching from JDBC to JPA or MongoDB will be as simple as switching imported JAR dependencies - but at least you minimize the impact by using same DAO API. You need a fast, simple JDBC wrapper library. JPA or even MyBatis is an overkill You want to have full control over generated SQL if needed You want to work with objects, but don't need lazy loading, relationship handling, multi-level caching, dirty checking... You need CRUD and not much more You want to by DRY You are already using Spring or maybe even JdbcTemplate, but still feel like there is too much manual work You have very few database tables Getting started For more examples and working code don't forget to examine project tests. Prerequisites Maven coordinates: com.blogspot.nurkiewicz jdbcrepository 0.1 Unfortunately the project is not yet in maven central repository. For the time being you can install the library in your local repository by cloning it: $ git clone git://github.com/nurkiewicz/spring-data-jdbc-repository.git $ git checkout 0.1 $ mvn javadoc:jar source:jar install In order to start your project must have DataSource bean present and transaction management enabled. Here is a minimal MySQL configuration: @EnableTransactionManagement @Configuration public class MinimalConfig { @Bean public PlatformTransactionManager transactionManager() { return new DataSourceTransactionManager(dataSource()); } @Bean public DataSource dataSource() { MysqlConnectionPoolDataSource ds = new MysqlConnectionPoolDataSource(); ds.setUser("user"); ds.setPassword("secret"); ds.setDatabaseName("db_name"); return ds; } } Entity with auto-generated key Say you have a following database table with auto-generated key (MySQL syntax): CREATE TABLE COMMENTS ( id INT AUTO_INCREMENT, user_name varchar(256), contents varchar(1000), created_time TIMESTAMP NOT NULL, PRIMARY KEY (id) ); First you need to create domain object User mapping to that table (just like in any other ORM): public class Comment implements Persistable { private Integer id; private String userName; private String contents; private Date createdTime; @Override public Integer getId() { return id; } @Override public boolean isNew() { return id == null; } //getters/setters/constructors/... } Apart from standard Java boilerplate you should notice implementing Persistable where Integer is the type of primary key. Persistable is an interface coming from Spring Data project and it's the only requirement we place on your domain object. Finally we are ready to create our CommentRepository DAO: @Repository public class CommentRepository extends JdbcRepository { public CommentRepository() { super(ROW_MAPPER, ROW_UNMAPPER, "COMMENTS"); } public static final RowMapper ROW_MAPPER = //see below private static final RowUnmapper ROW_UNMAPPER = //see below @Override protected Comment postCreate(Comment entity, Number generatedId) { entity.setId(generatedId.intValue()); return entity; } } First of all we use @Repository annotation to mark DAO bean. It enables persistence exception translation. Also such annotated beans are discovered by CLASSPATH scanning. As you can see we extend JdbcRepository which is the central class of this library, providing implementations of all PagingAndSortingRepository methods. Its constructor has three required dependencies: RowMapper , RowUnmapper and table name. You may also provide ID column name, otherwise default "id" is used. If you ever used JdbcTemplate from Spring, you should be familiar with RowMapper interface. We need to somehow extract columns from ResultSet into an object. After all we don't want to work with raw JDBC results. It's quite straightforward: public static final RowMapper ROW_MAPPER = new RowMapper () { @Override public Comment mapRow(ResultSet rs, int rowNum) throws SQLException { return new Comment( rs.getInt("id"), rs.getString("user_name"), rs.getString("contents"), rs.getTimestamp("created_time") ); } }; RowUnmapper comes from this library and it's essentially the opposite of RowMapper : takes an object and turns it into a Map . This map is later used by the library to construct SQL CREATE / UPDATE queries: private static final RowUnmapper ROW_UNMAPPER = new RowUnmapper () { @Override public Map mapColumns(Comment comment) { Map mapping = new LinkedHashMap (); mapping.put("id", comment.getId()); mapping.put("user_name", comment.getUserName()); mapping.put("contents", comment.getContents()); mapping.put("created_time", new java.sql.Timestamp(comment.getCreatedTime().getTime())); return mapping; } }; If you never update your database table (just reading some reference data inserted elsewhere) you may skip RowUnmapper parameter or use MissingRowUnmapper. Last piece of the puzzle is the postCreate() callback method which is called after an object was inserted. You can use it to retrieve generated primary key and update your domain object (or return new one if your domain objects are immutable). If you don't need it, just don't override postCreate() . Check out JdbcRepositoryGeneratedKeyTest for a working code based on this example. By now you might have a feeling that, compared to JPA or Hibernate, there is quite a lot of manual work. However various JPA implementations and other ORM frameworks are notoriously known for introducing significant overhead and manifesting some learning curve. This tiny library intentionally leaves some responsibilities to the user in order to avoid complex mappings, reflection, annotations... all the implicitness that is not always desired. This project is not intending to replace mature and stable ORM frameworks. Instead it tries to fill in a niche between raw JDBC and ORM where simplicity and low overhead are key features. Entity with manually assigned key In this example we'll see how entities with user-defined primary keys are handled. Let's start from database model: CREATE TABLE USERS ( user_name varchar(255), date_of_birth TIMESTAMP NOT NULL, enabled BIT(1) NOT NULL, PRIMARY KEY (user_name) ); ...and User domain model: public class User implements Persistable { private transient boolean persisted; private String userName; private Date dateOfBirth; private boolean enabled; @Override public String getId() { return userName; } @Override public boolean isNew() { return !persisted; } public User withPersisted(boolean persisted) { this.persisted = persisted; return this; } //getters/setters/constructors/... } Notice that special persisted transient flag was added. Contract of CrudRepository.save() from Spring Data project requires that an entity knows whether it was already saved or not ( isNew() ) method - there are no separate create() and update() methods. Implementing isNew() is simple for auto-generated keys (see Comment above) but in this case we need an extra transient field. If you hate this workaround and you only insert data and never update, you'll get away with return true all the time from isNew() . And finally our DAO, UserRepository bean: @Repository public class UserRepository extends JdbcRepository { public UserRepository() { super(ROW_MAPPER, ROW_UNMAPPER, "USERS", "user_name"); } public static final RowMapper ROW_MAPPER = //... public static final RowUnmapper ROW_UNMAPPER = //... @Override protected User postUpdate(User entity) { return entity.withPersisted(true); } @Override protected User postCreate(User entity, Number generatedId) { return entity.withPersisted(true); } } "USERS" and "user_name" parameters designate table name and primary key column name. I'll leave the details of mapper and unmapper (see source code). But please notice postUpdate() and postCreate() methods. They ensure that once object was persisted, persisted flag is set so that subsequent calls to save() will update existing entity rather than trying to reinsert it. Check out JdbcRepositoryManualKeyTest for a working code based on this example. Compound primary key We also support compound primary keys (primary keys consisting of several columns). Take this table as an example: CREATE TABLE BOARDING_PASS ( flight_no VARCHAR(8) NOT NULL, seq_no INT NOT NULL, passenger VARCHAR(1000), seat CHAR(3), PRIMARY KEY (flight_no, seq_no) ); I would like you to notice the type of primary key in Peristable : public class BoardingPass implements Persistable { private transient boolean persisted; private String flightNo; private int seqNo; private String passenger; private String seat; @Override public Object[] getId() { return pk(flightNo, seqNo); } @Override public boolean isNew() { return !persisted; } //getters/setters/constructors/... } Unfortunately we don't support small value classes encapsulating all ID values in one object (like JPA does with @IdClass), so you have to live with Object[] array. Defining DAO class is similar to what we've already seen: public class BoardingPassRepository extends JdbcRepository { public BoardingPassRepository() { this("BOARDING_PASS"); } public BoardingPassRepository(String tableName) { super(MAPPER, UNMAPPER, new TableDescription(tableName, null, "flight_no", "seq_no") ); } public static final RowMapper ROW_MAPPER = //... public static final RowUnmapper UNMAPPER = //... } Two things to notice: we extend JdbcRepository and we provide two ID column names just as expected: "flight_no", "seq_no" . We query such DAO by providing both flight_no and seq_no (necessarily in that order) values wrapped by Object[] : BoardingPass pass = repository.findOne(new Object[] {"FOO-1022", 42}); No doubts, this is cumbersome in practice, so we provide tiny helper method which you can statically import: import static com.blogspot.nurkiewicz.jdbcrepository.JdbcRepository.pk; //... BoardingPass foundFlight = repository.findOne(pk("FOO-1022", 42)); Check out JdbcRepositoryCompoundPkTest for a working code based on this example. Transactions This library is completely orthogonal to transaction management. Every method of each repository requires running transaction and it's up to you to set it up. Typically you would place @Transactional on service layer (calling DAO beans). I don't recommend placing @Transactional over every DAO bean. Caching Spring Data JDBC repository library is not providing any caching abstraction or support. However adding @Cacheable layer on top of your DAOs or services using caching abstraction in Spring is quite straightforward. See also: @Cacheable overhead in Spring. Contributions ..are always welcome. Don't hesitate to submit bug reports and pull requests. Biggest missing feature now is support for MSSQL and Oracle databases. It would be terrific if someone could have a look at it. Testing This library is continuously tested using Travis (). Test suite consists of 265 tests (53 distinct tests each run against 5 different databases: MySQL, PostgreSQL, H2, HSQLDB and Derby. When filling bug reports or submitting new features please try including supporting test cases. Each pull request is automatically tested on a separate branch. Building After forking the official repository building is as simple as running: $ mvn install You'll notice plenty of exceptions during JUnit test execution. This is normal. Some of the tests run against MySQL and PostgreSQL available only on Travis CI server. When these database servers are unavailable, whole test is simply skipped: Results : Tests run: 265, Failures: 0, Errors: 0, Skipped: 106 Exception stack traces come from root AbstractIntegrationTest. Design Library consists of only a handful of classes, highlighted in the diagram below: JdbcRepository is the most important class that implements all PagingAndSortingRepository methods. Each user repository has to extend this class. Also each such repository must at least implement RowMapper and RowUnmapper (only if you want to modify table data). SQL generation is delegated to SqlGenerator. PostgreSqlGenerator. and DerbySqlGenerator are provided for databases that don't work with standard generator. License This project is released under version 2.0 of the Apache License (same as Spring framework).

January 22, 2013

by Tomasz Nurkiewicz

· 76,689 Views · 2 Likes

ActiveMQ: KahaDB Journal Files - More Than Just Message Content Bits

I recently came across an issue where the ActiveMQ KahaDB journal files were continually rolling despite the fact that only a small number of small persistent messages were occasionally being stored by the broker. This behavior seemed very strange being that the message sizes being persisted were only a couple of kilobytes and there was a relatively small amount of messages actually on a queue. In this scenario something was filling up the 32MB journal files, but I wasn't quite sure what it could be? Were there other messages somewhere in the broker? Did an index get corrupted that was actually causing messages to be written across multiple journal files? It was pretty strange behavior but it can be explained fairly easily. This post describes the actual cause of this behavior and I have created it to remind myself in the future that there is more in the journal file than just the message content bits. The KahaDB journal files are used to store persistent messages that have been sent to the broker. In addition to storing the message content, the journal files also store information on KahaDB commands and transactional information. There are several commands for which information is stored; KahaAddMessageCommand, KahaCommitCommand, KahaPrepareCommand, KahaProducerAuditCommand, KahaRemoveDestinationCommand, KahaRemoveMessageCommand, KahaRollbackCommand, KahaSubscriptionCommand, and KahaTraceCommand. In this particular case, it was the KahaProducerAuditCommand which was responsible for the behavior that was observed. This command stores information about producer ids and message ids which is used for duplicate detection. In this case information is stored in a map object which over time grows. This information is then stored in the journal file each time a checkpoint is run, which by default is every 5 seconds. Over time, this can begin to use up the space allocated by the journal file causing low volume smaller messages to roll to the next journal file which in turn prevents the broker from cleaning up journal files which still have referenced messages. Eventually this situation can lead to Producer Flow Control being trigger by the broker's store limit which prevents producers from sending new messages into the broker. This behavior can occur under the following conditions: Persistent messages are being sent to a queue The messages are not being consumed on a regular basis The rate of messages being sent to the broker is low These conditions allow for this behavior to be observed fairly easily. As the persistent messages do not get consumed they remain referenced and prevent the journal files from being cleaned up. The low message rate allows time to pass between each new message being stored in the journal file and in the meantime checkpoints are being run which cause KahaProducerAuditCommand information to space out the actual messages within the journal file. For this use case you can disable the duplicate detection by essentially limiting the growth of the KahaProducerAuditCommand using the following configuration on the persistent adapter in the broker configuration: This is something to think about when designing your system. Under normal circumstances, if you have consumers available to consume the persistent messages, this condition would probably never occur as the journal files roll and messages are consumed, the broker can begin to clean up old journal files. There is currently an enhancement request at Apache which will also help resolve this issue. AMQ-3833 has been opened to enhance the broker so it will only write the KahaProducerAuditCommand if a change has occurred since the last checkpoint. This will help reduce the amount of data that is written to the journal files in between message storage.

January 18, 2013

by Jason Sherman

· 19,833 Views · 1 Like

Why is MongoDB Wildly Popular? It’s a Data Structure Thing.

Curator's Note: The content of this article was originally published over at the MongoLab blog . “Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious.” - Eric Raymond, in The Cathedral and the Bazaar, 1997 Linguistic innovation The fundamental task of programming is telling a computer how to do something. Because of this, much of the innovation in the field of software development has been linguistic innovation; that is, innovation in the ease and effectiveness with which a programmer is able to instruct a computer system. While machines operate in binary, we don’t talk to them that way. Every decade has introduced higher-level programming languages, and with each, an advancement in the ability of programmers to express themselves. These advancements include improvements in how we express data structures as well as how we express algorithms. The Object-Relational impedance mismatch Almost all modern programming languages support OO, and when we model entities in our code, we usually model them using a composition of primitive types (ints, strings, etc…), arrays, and objects. While each language might handle the details differently, the idea of nested object structures has become our universal language for describing ‘things’. The data structures we use to persist data have not evolved at the same rate. For the past 30 years the primary data structure for persistent data has been the Table – a set of Rows comprised of Columns containing scalar values (ints, strings, etc…). This is the world of the relational database, popularized in the 1980′s by its transactionality, speedy queries, space efficiency over other contemporary database systems, and a meat-eating ORCL salesforce. The difference between the way we model things in code, via objects, and the way they are represented in persistent storage, via tables, has been the source of much difficulty for programmers. Millennia of man-effort have been put against solving the problem of changing the shape of data from the object form to the relational form and back. Tools called Object-Relational Mapping systems (ORMs) exist for every object-oriented language in existence, and even with these tools, almost any programmer will complain that doing O/R mapping in any meaningful way is a time-consuming chore. Ted Neward hit it spot on when he said: “Object-Relational mapping is the Vietnam of our industry” There were attempts made at object databases in the 90s, but there was no technology that ever became a real alternative to the relational database. The document database, and in particular MongoDB, is the first successful Web-era object store, and because of that, represents the first big linguistic innovation in persistent data structures in a very long time. Instead of flat, two-dimensional tables of records, we have collections of rich, recursive, N-dimensional objects (a.k.a. documents) for records. An Example: the Blog Post Consider the blog post. Most likely you would have a class / object structure for modeling blog posts in your code, but if you are using a relational database to store your blog data, each entry would be spread across a handful of tables. As a developer you, need to get know how to convert the each ‘BlogPost’ object to and from the set of tables that house them in the relational model. A different approach Using MongoDB, your blog posts can be stored in a single collection, with each entry looking like this: { _id: 1234, author: { name: "Bob Davis", email : "[email protected]" }, post: "In these troubled times I like to …", date: { $date: "2010-07-12 13:23UTC" }, location: [ -121.2322, 42.1223222 ], rating: 2.2, comments: [ { user: "[email protected]", upVotes: 22, downVotes: 14, text: "Great point! I agree" }, { user: "[email protected]", upVotes: 421, downVotes: 22, text: "You are a moron" } ], tags: [ "Politics", "Virginia" ] } With a document database your data is stored almost exactly as it is represented in your program. There is no complex mapping exercise (although one often chooses to bind objects to instances of particular classes in code). What’s MongoDB good for? MongoDB is great for modeling many of the entities that back most modern web-apps, either consumer or enterprise: Account and user profiles: can store arrays of addresses with ease CMS: the flexible schema of MongoDB is great for heterogeneous collections of content types Form data: MongoDB makes it easy to evolve structure of form data over time Blogs / user-generated content: can keep data with complex relationships together in one object Messaging: vary message meta-data easily per message or message type without needing to maintain separate collections or schemas System configuration: just a nice object graph of configuration values, which is very natural in MongoDB Log data of any kind: structured log data is the future Graphs: just objects and pointers – a perfect fit Location based data: MongoDB understands geo-spatial coordinates and natively supports geo-spatial indexing Looking forward: the data is the interface There is a famous quote by Eric Raymond, in The Cathedral and the Bazaar (rephrasing an earlier quote by Fred Brooks from the famous The Mythical Man-Month): “Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious.” Data structures embody the essence of our programs and our ideas. Therefore, as programmers, we are constantly inviting innovation in the ease with which we can define expressive data structures to model our application domain. People often ask me why MongoDB is so wildly popular. I tell them it’s a data structure thing. While MongoDB may have ridden onto the scene under the banner of scalability with the rest of the NoSQL database technologies, the disproportionate success of MongoDB is largely based on its innovation as a data structure store that lets us more easily and expressively model the ‘things’ at the heart of our applications. For this reason MongoDB, or something very like it, will become the dominant database paradigm for operational data storage, with relational databases filling the role of a specialized tool. Having the same basic data model in our code and in the database is the superior method for most use-cases, as it dramatically simplifies the task of application development, and eliminates the layers of complex mapping code that are otherwise required. While a JSON-based document database may in retrospect seem obvious (if it doesn’t yet, it will), doing it right, as the folks at 10gen have, represents a major innovation.

January 16, 2013

by Eric Genesky

· 6,705 Views

How Many Queues Are Best For Max Performance? RabbitMQ

A generally useful question posed by Charming asks how many queues one should use in RabbitMQ for the maximum message passing throughput/performance. I thought I'd distill the answers by Brian Kelly and RobotEyes here for anyone who's worried about their performance with RabbitMQ. And I'll bet that some of these pointers are relevant to other message queues as well: From the RabbitMQ blog: RabbitMQ's queues are fastest when they're empty. When a queue is empty, and it has consumers ready to receive messages, then as soon as a message is received by the queue, it goes straight out to the consumer. In the case of a persistent message in a durable queue, yes, it will also go to disk, but that's done in an asynchronous manner and is buffered heavily. The main point is that very little book-keeping needs to be done, very few data structures are modified, and very little additional memory needs allocating. From the rabbitmq-discuss mailing group: Use a larger prefetch count. Small values hurt performance. A topic exchange is slower than a direct or a fanout exchange. Make sure queues stay short. Longer queues impose more processing overhead. If you care about latency and message rates then use smaller messages. Use an efficient format (e.g. avoid XML) or compress the payload. Experiment with HiPE, which helps performance. Avoid transactions and persistence. Also avoid publishing in immediate or mandatory mode. Avoid HA. Clustering can also impact performance. You will achieve better throughput on a multi-core system if you have multiple queues and consumers. Use at least v2.8.1, which introduces flow control. Make sure the memory and disk space alarms never trigger. Virtualisation can impose a small performance penalty. Tune your OS and network stack. Make sure you provide more than enough RAM. Provide fast cores and RAM. Hope that gives you some ideas for improving your queueing performance.

January 16, 2013

by Mitch Pronschinske

· 19,103 Views

@Cacheable overhead in Spring

Spring 3.1 introduced great caching abstraction layer. Finally we can abandon all home-grown aspects, decorators and code polluting our business logic related to caching. Since then we can simply annotate heavyweight methods and let Spring and AOP machinery do the work: @Cacheable("books") public Book findBook(ISBN isbn) {...} "books" is a cache name, isbn parameter becomes cache key and returned Book object will be placed under that key. The meaning of cache name is dependant on the underlying cache manager (EhCache, concurrent map, etc.) - Spring makes it easy to plug different caching providers. But this post won't be about caching feature in Spring... Some time ago my teammate was optimizing quite low-level code and discovered an opportunity for caching. He quickly applied @Cacheable just to discover that the code performed worse then it used to. He got rid of the annotation and implemented caching himself manually, using good old java.util.ConcurrentHashMap. The performance was much better. He blamed @Cacheable and Spring AOP overhead and complexity. I couldn't believe that a caching layer can perform so poorly until I had to debug Spring caching aspects few times myself (some nasty bug in my code, you know, cache invalidation is one of the two hardest things in CS). Well, the caching abstraction code is much more complex than one would expect (after all it's just get and put!), but it doesn't necessarily mean it must be that slow? In science we don't believe and trust, we measure and benchmark. So I wrote a benchmark to precisely measure the overhead of @Cacheable layer. Caching abstraction layer in Spring is implemented on top of Spring AOP, which can further be implemented on top of Java proxies, CGLIB generated subclasses or AspectJ instrumentation. Thus I'll test the following configurations: no caching at all - to measure how fast the code is with no intermediate layer manual cache handling using ConcurrentHashMap in business code @Cacheable with CGLIB implementing AOP @Cacheable with java.lang.reflect.Proxy implementing AOP @Cacheable with AspectJ compile time weaving (as similar benchmark shows, CTW is slightly faster than LTW) Home-grown AspectJ caching aspect - something between manual caching in business code and Spring abstraction Let me reiterate: we are not measuring the performance gain of caching and we are not comparing various cache providers. That's why our test method is as fast as it can be and I will be using simplest ConcurrentMapCacheManager from Spring. So here is a method in question: public interface Calculator { int identity(int x); } public class PlainCalculator implements Calculator { @Cacheable("identity") @Override public int identity(int x) { return x; } } I know, I know there is no point in caching such a method. But I want to measure the overhead of caching layer (during cache hit to be specific). Each caching configuration will have its own ApplicationContext as you can't mix different proxying modes in one context: public abstract class BaseConfig { @Bean public Calculator calculator() { return new PlainCalculator(); } } @Configuration class NoCachingConfig extends BaseConfig {} @Configuration class ManualCachingConfig extends BaseConfig { @Bean @Override public Calculator calculator() { return new CachingCalculatorDecorator(super.calculator()); } } @Configuration abstract class CacheManagerConfig extends BaseConfig { @Bean public CacheManager cacheManager() { return new ConcurrentMapCacheManager(); } } @Configuration @EnableCaching(proxyTargetClass = true) class CacheableCglibConfig extends CacheManagerConfig {} @Configuration @EnableCaching(proxyTargetClass = false) class CacheableJdkProxyConfig extends CacheManagerConfig {} @Configuration @EnableCaching(mode = AdviceMode.ASPECTJ) class CacheableAspectJWeaving extends CacheManagerConfig { @Bean @Override public Calculator calculator() { return new SpringInstrumentedCalculator(); } } @Configuration @EnableCaching(mode = AdviceMode.ASPECTJ) class AspectJCustomAspect extends CacheManagerConfig { @Bean @Override public Calculator calculator() { return new ManuallyInstrumentedCalculator(); } } Each @Configuration class represents one application context. CachingCalculatorDecorator is a decorator around real calculator that does the caching (welcome to the 1990s): public class CachingCalculatorDecorator implements Calculator { private final Map cache = new java.util.concurrent.ConcurrentHashMap(); private final Calculator target; public CachingCalculatorDecorator(Calculator target) { this.target = target; } @Override public int identity(int x) { final Integer existing = cache.get(x); if (existing != null) { return existing; } final int newValue = target.identity(x); cache.put(x, newValue); return newValue; } } SpringInstrumentedCalculator and ManuallyInstrumentedCalculator are exactly the same as PlainCalculator but they are instrumented by AspectJ compile-time weaver with Spring and custom aspect accordingly. My custom caching aspect looks like this: public aspect ManualCachingAspect { private final Map cache = new ConcurrentHashMap(); pointcut cacheMethodExecution(int x): execution(int com.blogspot.nurkiewicz.cacheable.calculator.ManuallyInstrumentedCalculator.identity(int)) && args(x); Object around(int x): cacheMethodExecution(x) { final Integer existing = cache.get(x); if (existing != null) { return existing; } final Object newValue = proceed(x); cache.put(x, (Integer)newValue); return newValue; } } After all this preparation we can finally write the benchmark itself. At the beginning I start all the application contexts and fetch Calculator instances. Each instance is different. For example noCaching is a PlainCalculator instance with no wrappers, cacheableCglib is a CGLIB generated subclass while aspectJCustom is an instance of ManuallyInstrumentedCalculator with my custom aspect woven. private final Calculator noCaching = fromSpringContext(NoCachingConfig.class); private final Calculator manualCaching = fromSpringContext(ManualCachingConfig.class); private final Calculator cacheableCglib = fromSpringContext(CacheableCglibConfig.class); private final Calculator cacheableJdkProxy = fromSpringContext(CacheableJdkProxyConfig.class); private final Calculator cacheableAspectJ = fromSpringContext(CacheableAspectJWeaving.class); private final Calculator aspectJCustom = fromSpringContext(AspectJCustomAspect.class); private static Calculator fromSpringContext(Class config) { return new AnnotationConfigApplicationContext(config).getBean(Calculator.class); } I'm going to exercise each Calculator instance with the following test. The additional accumulator is necessary, otherwise JVM might optimize away the whole loop (!): private int benchmarkWith(Calculator calculator, int reps) { int accum = 0; for (int i = 0; i < reps; ++i) { accum += calculator.identity(i % 16); } return accum; } Here is the full caliper test without parts already discussed: public class CacheableBenchmark extends SimpleBenchmark { //... public int timeNoCaching(int reps) { return benchmarkWith(noCaching, reps); } public int timeManualCaching(int reps) { return benchmarkWith(manualCaching, reps); } public int timeCacheableWithCglib(int reps) { return benchmarkWith(cacheableCglib, reps); } public int timeCacheableWithJdkProxy(int reps) { return benchmarkWith(cacheableJdkProxy, reps); } public int timeCacheableWithAspectJWeaving(int reps) { return benchmarkWith(cacheableAspectJ, reps); } public int timeAspectJCustom(int reps) { return benchmarkWith(aspectJCustom, reps); } } I hope you are still following our experiment. We are now going to execute Calculate.identity() millions of times and see which caching configuration performs best. Since we only call identity() with 16 different arguments, we hardly ever touch the method itself as we always get cache hit. Curious to see the results? benchmark ns linear runtime NoCaching 1.77 = ManualCaching 23.84 = CacheableWithCglib 1576.42 ============================== CacheableWithJdkProxy 1551.03 ============================= CacheableWithAspectJWeaving 1514.83 ============================ AspectJCustom 22.98 = Interpretation Let's go step by step. First of all calling a method in Java is pretty darn fast! 1.77 nanoseconds, we are talking here about 3 CPU cycles on my Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz! If this doesn't convince you that Java is fast, I don't know what will. But back to our test. Hand-made caching decorator is also pretty fast. Of course it's slower by an order of magnitude compared to pure function call, but still blazingly fast compared to all @Scheduled benchmarks. We see a drop by 3 orders of magnitude, from 1.8 ns to 1.5 μs. I'm especially disappointed by the @Cacheable backed by AspectJ. After all caching aspect is precompiled directly into my Java .class file, I would expect it to be much faster compared to dynamic proxies and CGLIB. But that doesn't seem to be the case. All three Spring AOP techniques are similar. The greatest surprise is my custom AspectJ aspect. It's even faster than CachingCalculatorDecorator! maybe it's due to polymorphic call in the decorator? I strongly encourage you to clone this benchmark on GitHub and run it (mvn clean test, takes around 2 minutes) to compare your results. Conclusions You might be wondering why Spring abstraction layer is so slow? Well, first of all, check out the core implementation in CacheAspectSupport - it's actually quite complex. Secondly, is it really that slow? Do the math - you typically use Spring in business applications where database, network and external APIs are the bottleneck. What latencies do you typically see? Milliseconds? Tens or hundreds of milliseconds? Now add an overhead of 2 μs (worst case scenario). For caching database queries or REST calls this is completely negligible. It doesn't matter which technique you choose. But if you are caching very low-level methods close to the metal, like CPU-intensive, in-memory computations, Spring abstraction layer might be an overkill. The bottom line: measure! PS: both benchmark and contents of this article in Markdown format are freely available.

January 15, 2013

by Tomasz Nurkiewicz

· 29,799 Views · 2 Likes

Chunk Oriented Processing in Spring Batch

Big Data Sets’ Processing is one of the most important problem in the software world. Spring Batch is a lightweight and robust batch framework to process the data sets. Spring Batch Framework offers ‘TaskletStep Oriented’ and ‘Chunk Oriented’ processing style. In this article, Chunk Oriented Processing Model is explained. Also, TaskletStep Oriented Processing in Spring Batch Article is definitely suggested to investigate how to develop TaskletStep Oriented Processing in Spring Batch. Chunk Oriented Processing Feature has come with Spring Batch v2.0. It refers to reading the data one at a time, and creating ‘chunks’ that will be written out, within a transaction boundary. One item is read from an ItemReader, handed to an ItemProcessor, and written. Once the number of items read equals the commit interval, the entire chunk is written out via the ItemWriter, and then the transaction is committed. Basically, this feature should be used if at least one data item’ s reading and writing is required. Otherwise, TaskletStep Oriented processing can be used if the data item’ s only reading or writing is required. Chunk Oriented Processing model exposes three important interface as ItemReader, ItemProcessor and ItemWriter via org.springframework.batch.item package. ItemReader : This interface is used for providing the data. It reads the data which will be processed. ItemProcessor : This interface is used for item transformation. It processes input object and transforms to output object. ItemWriter : This interface is used for generic output operations. It writes the datas which are transformed by ItemProcessor. For example, the datas can be written to database, memory or outputstream (etc). In this sample application, we will write to database. Let us take a look how to develop Chunk Oriented Processing Model. Used Technologies : JDK 1.7.0_09 Spring 3.1.3 Spring Batch 2.1.9 Hibernate 4.1.8 Tomcat JDBC 7.0.27 MySQL 5.5.8 MySQL Connector 5.1.17 Maven 3.0.4 Step 1 : Create Maven Project A maven project is created as below. (It can be created by using Maven or IDE Plug-in). Step 2: Libraries A new USER Table is created by executing below script: CREATE TABLE ONLINETECHVISION.USER ( id int(11) NOT NULL AUTO_INCREMENT, name varchar(45) NOT NULL, surname varchar(45) NOT NULL, PRIMARY KEY (`id`) ); Step 3: Libraries Firstly, dependencies are added to Maven’ s pom.xml. 3.1.3.RELEASE 2.1.9.RELEASE org.springframework spring-core ${spring.version} org.springframework spring-context ${spring.version} org.springframework spring-tx ${spring.version} org.springframework spring-orm ${spring.version} org.springframework.batch spring-batch-core ${spring-batch.version} org.hibernate hibernate-core 4.1.8.Final org.apache.tomcat tomcat-jdbc 7.0.27 mysql mysql-connector-java 5.1.17 log4j log4j 1.2.16 maven-compiler-plugin(Maven Plugin) is used to compile the project with JDK 1.7 org.apache.maven.plugins maven-compiler-plugin 3.0 1.7 1.7 The following Maven plugin can be used to create runnable-jar, org.apache.maven.plugins maven-shade-plugin 2.0 package shade 1.7 1.7 com.onlinetechvision.exe.Application META-INF/spring.handlers META-INF/spring.schemas Step 4 : Create User Entity User Entity is created. This entity will be stored after processing. package com.onlinetechvision.user; import javax.persistence.Column; import javax.persistence.Entity; import javax.persistence.GeneratedValue; import javax.persistence.GenerationType; import javax.persistence.Id; import javax.persistence.Table; /** * User Entity * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ @Entity @Table(name="USER") public class User { private int id; private String name; private String surname; @Id @GeneratedValue(strategy=GenerationType.AUTO) @Column(name="ID", unique = true, nullable = false) public int getId() { return id; } public void setId(int id) { this.id = id; } @Column(name="NAME", unique = true, nullable = false) public String getName() { return name; } public void setName(String name) { this.name = name; } @Column(name="SURNAME", unique = true, nullable = false) public String getSurname() { return surname; } public void setSurname(String surname) { this.surname = surname; } @Override public String toString() { StringBuffer strBuff = new StringBuffer(); strBuff.append("id : ").append(getId()); strBuff.append(", name : ").append(getName()); strBuff.append(", surname : ").append(getSurname()); return strBuff.toString(); } } Step 5 : Create IUserDAO Interface IUserDAO Interface is created to expose data access functionality. package com.onlinetechvision.user.dao; import java.util.List; import com.onlinetechvision.user.User; /** * User DAO Interface * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public interface IUserDAO { /** * Adds User * * @param User user */ void addUser(User user); /** * Gets User List * */ List getUsers(); } Step 6 : Create UserDAO IMPL UserDAO Class is created by implementing IUserDAO Interface. package com.onlinetechvision.user.dao; import java.util.List; import org.hibernate.SessionFactory; import com.onlinetechvision.user.User; /** * User DAO * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class UserDAO implements IUserDAO { private SessionFactory sessionFactory; /** * Gets Hibernate Session Factory * * @return SessionFactory - Hibernate Session Factory */ public SessionFactory getSessionFactory() { return sessionFactory; } /** * Sets Hibernate Session Factory * * @param SessionFactory - Hibernate Session Factory */ public void setSessionFactory(SessionFactory sessionFactory) { this.sessionFactory = sessionFactory; } /** * Adds User * * @param User user */ @Override public void addUser(User user) { getSessionFactory().getCurrentSession().save(user); } /** * Gets User List * * @return List - User list */ @SuppressWarnings({ "unchecked" }) @Override public List getUsers() { List list = getSessionFactory().getCurrentSession().createQuery("from User").list(); return list; } } Step 7 : Create IUserService Interface IUserService Interface is created for service layer. package com.onlinetechvision.user.service; import java.util.List; import com.onlinetechvision.user.User; /** * * User Service Interface * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public interface IUserService { /** * Adds User * * @param User user */ void addUser(User user); /** * Gets User List * * @return List - User list */ List getUsers(); } Step 8 : Create UserService IMPL UserService Class is created by implementing IUserService Interface. package com.onlinetechvision.user.service; import java.util.List; import org.springframework.transaction.annotation.Transactional; import com.onlinetechvision.user.User; import com.onlinetechvision.user.dao.IUserDAO; /** * * User Service * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ @Transactional(readOnly = true) public class UserService implements IUserService { IUserDAO userDAO; /** * Adds User * * @param User user */ @Transactional(readOnly = false) @Override public void addUser(User user) { getUserDAO().addUser(user); } /** * Gets User List * */ @Override public List getUsers() { return getUserDAO().getUsers(); } public IUserDAO getUserDAO() { return userDAO; } public void setUserDAO(IUserDAO userDAO) { this.userDAO = userDAO; } } Step 9 : Create TestReader IMPL TestReader Class is created by implementing ItemReader Interface. This class is called in order to read items. When read method returns null, reading operation is completed. The following steps explains with details how to be executed firstJob. The commit-interval value of firstjob is 2 and the following steps are executed : 1) firstTestReader is called to read first item(firstname_0, firstsurname_0) 2) firstTestReader is called again to read second item(firstname_1, firstsurname_1) 3) testProcessor is called to process first item(FIRSTNAME_0, FIRSTSURNAME_0) 4) testProcessor is called to process second item(FIRSTNAME_1, FIRSTSURNAME_1) 5) testWriter is called to write first item(FIRSTNAME_0, FIRSTSURNAME_0) to database 6) testWriter is called to write second item(FIRSTNAME_1, FIRSTSURNAME_1) to database 7) first and second items are committed and the transaction is closed. 8) firstTestReader is called to read third item(firstname_2, firstsurname_2) 9) maxIndex value of firstTestReader is 3. read method returns null and item reading operation is completed. 10) testProcessor is called to process third item(FIRSTNAME_2, FIRSTSURNAME_2) 11) testWriter is called to write first item(FIRSTNAME_2, FIRSTSURNAME_2) to database 12) third item is committed and the transaction is closed. firstStep is completed with COMPLETED status and secondStep is started. secondJob and thirdJob are executed in the same way. package com.onlinetechvision.item; import org.springframework.batch.item.ItemReader; import org.springframework.batch.item.NonTransientResourceException; import org.springframework.batch.item.ParseException; import org.springframework.batch.item.UnexpectedInputException; import com.onlinetechvision.user.User; /** * TestReader Class is created to read items which will be processed * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class TestReader implements ItemReader { private int index; private int maxIndex; private String namePrefix; private String surnamePrefix; /** * Reads items one by one * * @return User * * @throws Exception * @throws UnexpectedInputException * @throws ParseException * @throws NonTransientResourceException * */ @Override public User read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException { User user = new User(); user.setName(getNamePrefix() + "_" + index); user.setSurname(getSurnamePrefix() + "_" + index); if(index > getMaxIndex()) { return null; } incrementIndex(); return user; } /** * Increments index which defines read-count * * @return int * */ private int incrementIndex() { return index++; } public int getMaxIndex() { return maxIndex; } public void setMaxIndex(int maxIndex) { this.maxIndex = maxIndex; } public String getNamePrefix() { return namePrefix; } public void setNamePrefix(String namePrefix) { this.namePrefix = namePrefix; } public String getSurnamePrefix() { return surnamePrefix; } public void setSurnamePrefix(String surnamePrefix) { this.surnamePrefix = surnamePrefix; } } Step 10 : Create FailedCaseTestReader IMPL FailedCaseTestReader Class is created in order to simulate the failed job status. In this sample application, when thirdJob is processed at fifthStep, failedCaseTestReader is called and exception is thrown so its status will be FAILED. package com.onlinetechvision.item; import org.springframework.batch.item.ItemReader; import org.springframework.batch.item.NonTransientResourceException; import org.springframework.batch.item.ParseException; import org.springframework.batch.item.UnexpectedInputException; import com.onlinetechvision.user.User; /** * FailedCaseTestReader Class is created in order to simulate the failed job status. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class FailedCaseTestReader implements ItemReader { private int index; private int maxIndex; private String namePrefix; private String surnamePrefix; /** * Reads items one by one * * @return User * * @throws Exception * @throws UnexpectedInputException * @throws ParseException * @throws NonTransientResourceException * */ @Override public User read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException { User user = new User(); user.setName(getNamePrefix() + "_" + index); user.setSurname(getSurnamePrefix() + "_" + index); if(index >= getMaxIndex()) { throw new Exception("Unexpected Error!"); } incrementIndex(); return user; } /** * Increments index which defines read-count * * @return int * */ private int incrementIndex() { return index++; } public int getMaxIndex() { return maxIndex; } public void setMaxIndex(int maxIndex) { this.maxIndex = maxIndex; } public String getNamePrefix() { return namePrefix; } public void setNamePrefix(String namePrefix) { this.namePrefix = namePrefix; } public String getSurnamePrefix() { return surnamePrefix; } public void setSurnamePrefix(String surnamePrefix) { this.surnamePrefix = surnamePrefix; } } Step 11 : Create TestProcessor IMPL TestProcessor Class is created by implementing ItemProcessor Interface. This class is called to process items. User item is received from TestReader, processed and returned to TestWriter. package com.onlinetechvision.item; import java.util.Locale; import org.springframework.batch.item.ItemProcessor; import com.onlinetechvision.user.User; /** * TestProcessor Class is created to process items. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class TestProcessor implements ItemProcessor { /** * Processes items one by one * * @param User user * @return User * @throws Exception * */ @Override public User process(User user) throws Exception { user.setName(user.getName().toUpperCase(Locale.ENGLISH)); user.setSurname(user.getSurname().toUpperCase(Locale.ENGLISH)); return user; } } Step 12 : Create TestWriter IMPL TestWriter Class is created by implementing ItemWriter Interface. This class is called to write items to DB, memory etc… package com.onlinetechvision.item; import java.util.List; import org.springframework.batch.item.ItemWriter; import com.onlinetechvision.user.User; import com.onlinetechvision.user.service.IUserService; /** * TestWriter Class is created to write items to DB, memory etc... * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class TestWriter implements ItemWriter { private IUserService userService; /** * Writes items via list * * @throws Exception * */ @Override public void write(List userList) throws Exception { for(User user : userList) { getUserService().addUser(user); } System.out.println("User List : " + getUserService().getUsers()); } public IUserService getUserService() { return userService; } public void setUserService(IUserService userService) { this.userService = userService; } } Step 13 : Create FailedStepTasklet Class FailedStepTasklet is created by implementing Tasklet Interface. It illustrates business logic in failed step. package com.onlinetechvision.tasklet; import org.apache.log4j.Logger; import org.springframework.batch.core.StepContribution; import org.springframework.batch.core.scope.context.ChunkContext; import org.springframework.batch.core.step.tasklet.Tasklet; import org.springframework.batch.repeat.RepeatStatus; /** * FailedStepTasklet Class illustrates a failed job. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class FailedStepTasklet implements Tasklet { private static final Logger logger = Logger.getLogger(FailedStepTasklet.class); private String taskResult; /** * Executes FailedStepTasklet * * @param StepContribution stepContribution * @param ChunkContext chunkContext * @return RepeatStatus * @throws Exception * */ public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext) throws Exception { logger.debug("Task Result : " + getTaskResult()); throw new Exception("Error occurred!"); } public String getTaskResult() { return taskResult; } public void setTaskResult(String taskResult) { this.taskResult = taskResult; } } Step 14 : Create BatchProcessStarter Class BatchProcessStarter Class is created to launch the jobs. Also, it logs their execution results. package com.onlinetechvision.spring.batch; import org.apache.log4j.Logger; import org.springframework.batch.core.Job; import org.springframework.batch.core.JobExecution; import org.springframework.batch.core.JobParametersBuilder; import org.springframework.batch.core.JobParametersInvalidException; import org.springframework.batch.core.launch.JobLauncher; import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException; import org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException; import org.springframework.batch.core.repository.JobRepository; import org.springframework.batch.core.repository.JobRestartException; /** * BatchProcessStarter Class launches the jobs and logs their execution results. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class BatchProcessStarter { private static final Logger logger = Logger.getLogger(BatchProcessStarter.class); private Job firstJob; private Job secondJob; private Job thirdJob; private JobLauncher jobLauncher; private JobRepository jobRepository; /** * Starts the jobs and logs their execution results. * */ public void start() { JobExecution jobExecution = null; JobParametersBuilder builder = new JobParametersBuilder(); try { getJobLauncher().run(getFirstJob(), builder.toJobParameters()); jobExecution = getJobRepository().getLastJobExecution(getFirstJob().getName(), builder.toJobParameters()); logger.debug(jobExecution.toString()); getJobLauncher().run(getSecondJob(), builder.toJobParameters()); jobExecution = getJobRepository().getLastJobExecution(getSecondJob().getName(), builder.toJobParameters()); logger.debug(jobExecution.toString()); getJobLauncher().run(getThirdJob(), builder.toJobParameters()); jobExecution = getJobRepository().getLastJobExecution(getThirdJob().getName(), builder.toJobParameters()); logger.debug(jobExecution.toString()); } catch (JobExecutionAlreadyRunningException | JobRestartException | JobInstanceAlreadyCompleteException | JobParametersInvalidException e) { logger.error(e); } } public Job getFirstJob() { return firstJob; } public void setFirstJob(Job firstJob) { this.firstJob = firstJob; } public Job getSecondJob() { return secondJob; } public void setSecondJob(Job secondJob) { this.secondJob = secondJob; } public Job getThirdJob() { return thirdJob; } public void setThirdJob(Job thirdJob) { this.thirdJob = thirdJob; } public JobLauncher getJobLauncher() { return jobLauncher; } public void setJobLauncher(JobLauncher jobLauncher) { this.jobLauncher = jobLauncher; } public JobRepository getJobRepository() { return jobRepository; } public void setJobRepository(JobRepository jobRepository) { this.jobRepository = jobRepository; } } Step 15 : Create dataContext.xml jdbc.properties, is created. It defines data-source informations and is read via dataContext.xml jdbc.db.driverClassName=com.mysql.jdbc.Driver jdbc.db.url=jdbc:mysql://localhost:3306/onlinetechvision jdbc.db.username=root jdbc.db.password=root jdbc.db.initialSize=10 jdbc.db.minIdle=3 jdbc.db.maxIdle=10 jdbc.db.maxActive=10 jdbc.db.testWhileIdle=true jdbc.db.testOnBorrow=true jdbc.db.testOnReturn=true jdbc.db.initSQL=SELECT 1 FROM DUAL jdbc.db.validationQuery=SELECT 1 FROM DUAL jdbc.db.timeBetweenEvictionRunsMillis=30000 Step 16 : Create dataContext.xml Spring Configuration file, dataContext.xml, is created. It covers dataSource, sessionFactory and transactionManager definitions. com.onlinetechvision.user.User org.hibernate.dialect.MySQLDialect true Step 17 : Create jobContext.xml Spring Configuration file, jobContext.xml, is created. It covers jobRepository, jobLauncher, item reader, item processor, item writer, tasklet and job definitions. Step 18 : Create applicationContext.xml Spring Configuration file, applicationContext.xml, is created. It covers bean definitions. Step 19 : Create Application Class Application Class is created to run the application. package com.onlinetechvision.exe; import org.springframework.context.ApplicationContext; import org.springframework.context.support.ClassPathXmlApplicationContext; import com.onlinetechvision.spring.batch.BatchProcessStarter; /** * Application Class starts the application. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class Application { /** * Starts the application * * @param String[] args * */ public static void main(String[] args) { ApplicationContext appContext = new ClassPathXmlApplicationContext("applicationContext.xml"); BatchProcessStarter batchProcessStarter = (BatchProcessStarter)appContext.getBean("batchProcessStarter"); batchProcessStarter.start(); } } Step 20 : Build Project After OTV_SpringBatch_Chunk_Oriented_Processing Project is built, OTV_SpringBatch_Chunk_Oriented_Processing-0.0.1-SNAPSHOT.jar will be created. STEP 21 : RUN PROJECT After created OTV_SpringBatch_Chunk_Oriented_Processing-0.0.1-SNAPSHOT.jar file is run, the following database and console output logs will be shown : Database screenshot : First Job’ s console output : 16.12.2012 19:30:41 INFO (SimpleJobLauncher.java:118) - Job: [FlowJob: [name=firstJob]] launched with the following parameters: [{}] 16.12.2012 19:30:41 DEBUG (AbstractJob.java:278) - Job execution starting: JobExecution: id=0, version=0, startTime=null, endTime=null, lastUpdated=Sun Dec 16 19:30:41 GMT 2012, status=STARTING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=0, version=0, JobParameters=[{}], Job=[firstJob]] User List : [id : 181, name : FIRSTNAME_0, surname : FIRSTSURNAME_0, id : 182, name : FIRSTNAME_1, surname : FIRSTSURNAME_1, id : 183, name : FIRSTNAME_2, surname : FIRSTSURNAME_2, id : 184, name : SECONDNAME_0, surname : SECONDSURNAME_0, id : 185, name : SECONDNAME_1, surname : SECONDSURNAME_1, id : 186, name : SECONDNAME_2, surname : SECONDSURNAME_2] 16.12.2012 19:30:42 DEBUG (BatchProcessStarter.java:43) - JobExecution: id=0, version=2, startTime=Sun Dec 16 19:30:41 GMT 2012, endTime=Sun Dec 16 19:30:42 GMT 2012, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=COMPLETED, exitStatus=exitCode=COMPLETED;exitDescription=, job=[JobInstance: id=0, version=0, JobParameters=[{}], Job=[firstJob]] Second Job’ s console output : 16.12.2012 19:30:42 INFO (SimpleJobLauncher.java:118) - Job: [FlowJob: [name=secondJob]] launched with the following parameters: [{}] 16.12.2012 19:30:42 DEBUG (AbstractJob.java:278) - Job execution starting: JobExecution: id=1, version=0, startTime=null, endTime=null, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=STARTING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=1, version=0, JobParameters=[{}], Job=[secondJob]] User List : [id : 181, name : FIRSTNAME_0, surname : FIRSTSURNAME_0, id : 182, name : FIRSTNAME_1, surname : FIRSTSURNAME_1, id : 183, name : FIRSTNAME_2, surname : FIRSTSURNAME_2, id : 184, name : SECONDNAME_0, surname : SECONDSURNAME_0, id : 185, name : SECONDNAME_1, surname : SECONDSURNAME_1, id : 186, name : SECONDNAME_2, surname : SECONDSURNAME_2, id : 187, name : THIRDNAME_0, surname : THIRDSURNAME_0, id : 188, name : THIRDNAME_1, surname : THIRDSURNAME_1, id : 189, name : THIRDNAME_2, surname : THIRDSURNAME_2, id : 190, name : THIRDNAME_3, surname : THIRDSURNAME_3, id : 191, name : FOURTHNAME_0, surname : FOURTHSURNAME_0, id : 192, name : FOURTHNAME_1, surname : FOURTHSURNAME_1, id : 193, name : FOURTHNAME_2, surname : FOURTHSURNAME_2, id : 194, name : FOURTHNAME_3, surname : FOURTHSURNAME_3] 16.12.2012 19:30:42 DEBUG (BatchProcessStarter.java:47) - JobExecution: id=1, version=2, startTime=Sun Dec 16 19:30:42 GMT 2012, endTime=Sun Dec 16 19:30:42 GMT 2012, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=COMPLETED, exitStatus=exitCode=COMPLETED;exitDescription=, job=[JobInstance: id=1, version=0, JobParameters=[{}], Job=[secondJob]] Third Job’ s console output : 16.12.2012 19:30:42 INFO (SimpleJobLauncher.java:118) - Job: [FlowJob: [name=thirdJob]] launched with the following parameters: [{}] 16.12.2012 19:30:42 DEBUG (AbstractJob.java:278) - Job execution starting: JobExecution: id=2, version=0, startTime=null, endTime=null, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=STARTING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=2, version=0, JobParameters=[{}], Job=[thirdJob]] 16.12.2012 19:30:42 DEBUG (TransactionTemplate.java:159) - Initiating transaction rollback on application exception org.springframework.batch.repeat.RepeatException: Exception in batch process; nested exception is java.lang.Exception: Unexpected Error! ... 16.12.2012 19:30:43 DEBUG (BatchProcessStarter.java:51) - JobExecution: id=2, version=2, startTime=Sun Dec 16 19:30:42 GMT 2012, endTime=Sun Dec 16 19:30:43 GMT 2012, lastUpdated=Sun Dec 16 19:30:43 GMT 2012, status=FAILED, exitStatus=exitCode=FAILED;exitDescription=, job=[JobInstance: id=2, version=0, JobParameters=[{}], Job=[thirdJob]] Step 22 : Download https://github.com/erenavsarogullari/OTV_SpringBatch_Chunk_Oriented_Processing REFERENCES : Chunk Oriented Processing in Spring Batch

January 3, 2013

by Eren Avsarogullari

· 153,246 Views · 7 Likes

UML2 Class Diagram in Java

Learn all about this Unified Modeling Language diagram in Java.

December 31, 2012

by Mainak Goswami

· 200,485 Views · 10 Likes

Google Guava Cache with regular expression patterns

Hi! Merry Christmas everyone :) Quite recently I've seen a nice presentation about Google Guava and we came to the conclusion in our project that it could be really interesting to use the its Cache functionallity. Let us take a look at the regexp Pattern class and its compile function. Quite often in the code we can see that each time a regular expression is being used a programmer is repeatidly calling the aforementioned Pattern.compile() function with the same argument thus compiling the same regular expression over and over again. What could be done however is to cache the result of such compilations - let us take a look at the RegexpUtils utility class: RegexpUtils.java package pl.grzejszczak.marcin.guava.cache.utils; import com.google.common.cache.CacheBuilder; import com.google.common.cache.CacheLoader; import com.google.common.cache.LoadingCache; import java.util.concurrent.ExecutionException; import java.util.regex.Matcher; import java.util.regex.Pattern; import static java.lang.String.format; public final class RegexpUtils { private RegexpUtils() { throw new UnsupportedOperationException("RegexpUtils is a utility class - don't instantiate it!"); } private static final LoadingCache COMPILED_PATTERNS = CacheBuilder.newBuilder().build(new CacheLoader() { @Override public Pattern load(String regexp) throws Exception { return Pattern.compile(regexp); } }); public static Pattern getPattern(String regexp) { try { return COMPILED_PATTERNS.get(regexp); } catch (ExecutionException e) { throw new RuntimeException(format("Error when getting a pattern [%s] from cache", regexp), e); } } public static boolean matches(String stringToCheck, String regexp) { return doGetMatcher(stringToCheck, regexp).matches(); } public static Matcher getMatcher(String stringToCheck, String regexp) { return doGetMatcher(stringToCheck, regexp); } private static Matcher doGetMatcher(String stringToCheck, String regexp) { Pattern pattern = getPattern(regexp); return pattern.matcher(stringToCheck); } } As you can see the Guava's LoadingCache with the CacheBuilder is being used to populate a cache with a new compiled pattern if one is not found. Due to caching the compiled pattern if a compilation has already taken place it will not be repeated ever again (in our case since we dno't have any expiry set). Now a simple test GuavaCache.java package pl.grzejszczak.marcin.guava.cache; import com.google.common.base.Stopwatch; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import pl.grzejszczak.marcin.guava.cache.utils.RegexpUtils; import java.util.regex.Pattern; import static java.lang.String.format; public class GuavaCache { private static final Logger LOGGER = LoggerFactory.getLogger(GuavaCache.class); public static final String STRING_TO_MATCH = "something"; public static void main(String[] args) { runTestForManualCompilationAndOneUsingCache(1); runTestForManualCompilationAndOneUsingCache(10); runTestForManualCompilationAndOneUsingCache(100); runTestForManualCompilationAndOneUsingCache(1000); runTestForManualCompilationAndOneUsingCache(10000); runTestForManualCompilationAndOneUsingCache(100000); runTestForManualCompilationAndOneUsingCache(1000000); } private static void runTestForManualCompilationAndOneUsingCache(int firstNoOfRepetitions) { repeatManualCompilation(firstNoOfRepetitions); repeatCompilationWithCache(firstNoOfRepetitions); } private static void repeatManualCompilation(int noOfRepetitions) { Stopwatch stopwatch = new Stopwatch().start(); compileAndMatchPatternManually(noOfRepetitions); LOGGER.debug(format("Time needed to compile and check regexp expression [%d] ms, no of iterations [%d]", stopwatch.elapsedMillis(), noOfRepetitions)); } private static void repeatCompilationWithCache(int noOfRepetitions) { Stopwatch stopwatch = new Stopwatch().start(); compileAndMatchPatternUsingCache(noOfRepetitions); LOGGER.debug(format("Time needed to compile and check regexp expression using Cache [%d] ms, no of iterations [%d]", stopwatch.elapsedMillis(), noOfRepetitions)); } private static void compileAndMatchPatternManually(int limit) { for (int i = 0; i < limit; i++) { Pattern.compile("something").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something1").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something2").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something3").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something4").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something5").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something6").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something7").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something8").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something9").matcher(STRING_TO_MATCH).matches(); } } private static void compileAndMatchPatternUsingCache(int limit) { for (int i = 0; i < limit; i++) { RegexpUtils.matches(STRING_TO_MATCH, "something"); RegexpUtils.matches(STRING_TO_MATCH, "something1"); RegexpUtils.matches(STRING_TO_MATCH, "something2"); RegexpUtils.matches(STRING_TO_MATCH, "something3"); RegexpUtils.matches(STRING_TO_MATCH, "something4"); RegexpUtils.matches(STRING_TO_MATCH, "something5"); RegexpUtils.matches(STRING_TO_MATCH, "something6"); RegexpUtils.matches(STRING_TO_MATCH, "something7"); RegexpUtils.matches(STRING_TO_MATCH, "something8"); RegexpUtils.matches(STRING_TO_MATCH, "something9"); } } } We are running a series of tests and checking the time of their execution. Note that the results of these tests are not precise due to the fact that the application is not being run in isolation so numerous conditions can affect the time of the execution. We are interested in showing some degree of the problem rather than showing the precise execution time. For a given number of iterations (1,10,100,1000,10000,100000,1000000) we are either compiling 10 regular expressions or using a Guava's cache to retrieve the compiled Pattern and then we match them against a string to match. These are the logs: pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [1] ms, no of iterations [1] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [35] ms, no of iterations [1] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [1] ms, no of iterations [10] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [0] ms, no of iterations [10] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [8] ms, no of iterations [100] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [3] ms, no of iterations [100] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [10] ms, no of iterations [1000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [10] ms, no of iterations [1000] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [83] ms, no of iterations [10000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [33] ms, no of iterations [10000] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [800] ms, no of iterations [100000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [279] ms, no of iterations [100000] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [7562] ms, no of iterations [1000000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [3067] ms, no of iterations [1000000] You can find the sources over here under the Guava/Cache directory or go to the url https://bitbucket.org/gregorin1987/too-much-coding/src

December 28, 2012

by Marcin Grzejszczak

· 9,213 Views

HashMap.get High CPU – Case Study

This article will describe the complete root cause analysis and solution of a HashMap High CPU problem (infinite looping) affecting a Weblogic 10.0 environment running on the Java HotSpot VM 1.5. This case study will again demonstrate this importance of mastering Thread Dump analysis skill and CPU correlation techniques such as Solaris prstat. Environment specifications Java EE server: Oracle Weblogic Portal 10.0 Middleware OS: Solaris 10 Java VM: Java HotSpot VM 1.5 Platform type: Portal application Monitoring and troubleshooting tools JVM Thread Dump (HotSpot format) Solaris prstat (CPU contributors analysis) Problem overview Problem type: High CPUobserved from our Weblogic production environment A high CPU problem was observed from our Solaris physical servers hosting a Weblogic Portal 10 environment. Users also reporting major slowdown of the portal application. Gathering and validation of facts As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derived other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause: What is the client impact? HIGH Recent change of the affected platform? No Any recent traffic increase to the affected platform? Yes How does this high CPU manifest itself? A sudden CPU increase was observed and is not going down; even after load goes down e.g. near zero level. Did an Oracle OSB recycle resolve the problem? Yes, but problem is returning after few hours or few days (unpredictable pattern) - Conclusion #1: The high CPU problem appears to be intermittent vs. pure correlation with load - Conclusion #2: Since high CPU remains after load goes down, this typically indicates either the presence of some infinite looping or heavy processing Threads Solaris CPU analysis using prstat Solaris prstat is a powerful OS command allowing you to obtain the CPU per process but more importantly CPU per Thread within a process. As you can see below from our case study, the CPU utilization was confirmed to go up as high as 100% utilization (saturation level). ## PRSTAT (CPU per Java Thread analysis) prstat -L -p 8223 1 1 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID 8223 bea10 2809M 2592M sleep 59 0 14:52:59 38.6%java/494 8223 bea10 2809M 2592M sleep 57 0 12:28:05 22.3% java/325 8223 bea10 2809M 2592M sleep 59 0 11:52:02 28.3% java/412 8223 bea10 2809M 2592M sleep 59 0 5:50:00 0.3% java/84 8223 bea10 2809M 2592M sleep 58 0 2:27:20 0.2% java/43 8223 bea10 2809M 2592M sleep 59 0 1:39:42 0.2% java/41287 8223 bea10 2809M 2592M sleep 59 0 4:41:44 0.2% java/30503 8223 bea10 2809M 2592M sleep 59 0 5:58:32 0.2% java/36116 …………………………………………………………………………………… As you can see from above data, 3 Java Threads were found using together close to 100% of the CPU utilization. For our root cause analysis, we did focus on Thread #494 (decimal format) corresponding to 0x1ee (HEXA format). Thread Dump analysis and PRSTAT correlation Once the culprit Threads were identified, the next step was to correlate this data with the Thread Dump data (which was captured exactly at the same time as prstat). A quick search within the generated Thread Dump file did reveal the Thread Stack Trace (Weblogic Stuck Thread #125) for 0x1ee as per below. "[STUCK] ExecuteThread: '125' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=1 tid=0x014c5030 nid=0x1ee runnable [0x536fb000..0x536ffc70] at java.util.HashMap.get(HashMap.java:346) at org.apache.axis.encoding.TypeMappingImpl.getClassForQName(TypeMappingImpl.java:715) at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:170) at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:160) at org.apache.axis.encoding.TypeMappingImpl.getDeserializer(TypeMappingImpl.java:454) at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:108) at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:102) at org.apache.axis.encoding.DeserializationContext.getDeserializer(DeserializationContext.java:457) at org.apache.axis.encoding.DeserializationContext.getDeserializerForType(DeserializationContext.java:547) at org.apache.axis.encoding.ser.BeanDeserializer.getDeserializer(BeanDeserializer.java:514) at org.apache.axis.encoding.ser.BeanDeserializer.onStartChild(BeanDeserializer.java:286) at org.apache.axis.encoding.DeserializationContext.startElement(DeserializationContext.java:1035) at org.apache.axis.message.SAX2EventRecorder.replay(SAX2EventRecorder.java:165) at org.apache.axis.message.MessageElement.publishToHandler(MessageElement.java:1141) at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:236) at org.apache.axis.message.RPCElement.getParams(RPCElement.java:384) at org.apache.axis.client.Call.invoke(Call.java:2467) at org.apache.axis.client.Call.invoke(Call.java:2366) at org.apache.axis.client.Call.invoke(Call.java:1812) Thread Dump analysis – HashMap.get() infinite loop condition! As you can see from the above Thread Stack Trace, the Thread is currently stuck in an infinite loop over a java.util.HashMap that originates from the Apache Axis TypeMappingImpl Java class. This finding was quite revealing. The 2 others Threads using high CPU also did reveal infinite looping condition within the same Apache Axis HashMap Object. Root cause: non Thread safe HashMap in Apache Axis 1.4 Additional research did reveal this known defect affecting Apache Axis 1.4; which is the version that our application was using. As you may already know, usage of non Thread safe / non synchronized HashMap under concurrent Threads condition is very dangerous and can easily lead to internal HashMap index corruption and / or infinite looping. This is also a golden rule for any middleware software such as Oracle Weblogic, IBM WAS, Red Hat JBoss which rely heavily on HashMap data structures from various Java EE and caching services. Such best practice is also applicable for any Open Source third party API such as Apache Axis. The most common solution is to use the ConcurrentHashMap data structure which is designed for that type of concurrent Thread execution context. Solution Our team did apply the proposed patch from Apache (synchronize the non Thread safe HashMap) which did resolve the problem. We are also currently looking at upgrading our application to a newer version of Apache Axis. Conclusion I hope this case study has helped you understand how to pinpoint the root cause of high CPU Threads and the importance of proper Thread safe data structure for high concurrent Thread / processing applications. Please don’t hesitate to post any comment or question. Find office supplies promo codes to save money for your department's bottom line, so you can spend more on the latest software.

December 27, 2012

by Pierre - Hugues Charbonneau

· 8,498 Views · 1 Like

MongoDB and the Concept of Identity in NoSQL Databases

in this article i deal with a different nosql database called mongodb a mature nosql engine born outside the .net world to clarify the concept of id in a typical no sql database. installation of mongo is really simple, just download, uncompress, locate the bin folder, and type this from an administrator console prompt to install mongo as service 1: mongod --install --logpath c:xxxx --dbpath c:yyyy you can find plenty of installation guide on the internet, but with the above install command you create a windows service that will automatically start mongodb on your machine using specified datafolder. now you should download the c# driver to connect from .net code, but if you like using linq you can install fluent mongo directly with nuget. figure 1: install fluent mongo with nuget. fluent mongo is a library that gave little linq capability over standard drivers, but adding a nuget reference to fluentmongo you get automatically a reference to the official drivers. now you are ready to insert your first record in mongodb with the above code. mongoserver server = mongoserver.create(); mongodatabase databasetest = server.getdatabase("test"); var untyped = databasetest.getcollection("untyped"); untyped.save(new bsondocument { { "name", "untyped1" } }); bsondocument seconddocument = bsondocument.parse("{name: 'untyped2', blabla: 'bla bla value'}"); untyped.save(seconddocument); in line1 i create a connection to mongodb server passing no parameter to connect to local mongodb server, then i obtain a reference to a mongodatabase object called “test” with mongoserver::getdatabase() method and finally i get a reference to a collection named “untyped” with the mongngodatabase::getcollection () method. this is quite similar to a sql server or other sql database, you have a server, the server contains several databases, and each database is composed by tables; in the same way mongo is divided into server/database/collection where a collection contains document. mongodb stores data in json format and to insert data inside a collection you can simply create a bsondocument, an object defined by c# driver assembly that is capable to represent a document composed by a series of key-value pair. to initialize a bsondocument you can pass a icollection (line 4) or if you feel more confortable with string json representation, you can user bsondocument.parse() to specify the document directly with a json string. after you inserted the above documents you can use mongovue to see what is contained inside the database. figure 2: use mongovue to see what is inside the database the interesting aspect is that each document has an unique id, even if i did not specified any special property in the code. this is a standard behavior for nosql databases, if you did not specify any id property the database engine will create unique id on his own to idendify the docuemnt. the id is a key factor for mongo and other nosql storage, if you try to store a document directly inside the collection specifying json content you will get an error. untyped.save("{name: 'json', attribute:'attribute content'}"); the mongocollection object contains a save object that accepts a string, but the above call will fail with the error subclass must implement getdocumentid . previous code works because one of the specific functionality that a bsondocument implements is the ability to manage id generation, but plain json does not have this capability. if you need to know the id generated by the database, you can query the bsondocument for its unique id after it was saved in a mongo collection . (remember that the id is not available until you save the document). bsondocument seconddocument = bsondocument.parse("{name: 'untyped2', blabla: 'bla bla value'}"); object id; type idtype; iidgenerator generator; untyped.save(seconddocument); seconddocument.getdocumentid(out id, out idtype, out generator); basically you are asking to your bsondocument to return you the generated id, as well as the type of the id and the generator that mongo used to generate that specific id. the result is represented into this snippet figure 3: the three object that you got with a call to getdocumentid: id, idtype and the generator. as you can see the id is an instance of type mongodb.bson.objectid, based on bsonvalue base class and the generator is an instance of objectidgenerator. this type of id is specific to mongo, and the documentation states that an objectid is a bson objectid is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. note that the timestamp and counter fields must be stored big endian unlike the rest of bson. this is because they are compared byte-by-byte and we want to ensure a mostly increasing order. if you want to have a generator that creates integer id, like identity column in sql server , you will find that it is simply not available out of the box, because an int value is not guarantee to be unique if you use sharding. sharding is a technique that permits to partition data into different physical instances, so each instance should generates ids that are unique across all instances and this prevents the use of a simple int32 id. clearly in .net world a guid is guarantee to be unique and is more .net oriented, so mongo db has a guid id generator, that can be specified with the above snippet of code.. bsondocument thirddocument = bsondocument.parse("{name: 'untyped3', anotherproperty: 'xxxxxxxxxxxxxxxxxxxxxxx'}"); var id2 = mongodb.bson.serialization.idgenerators.guidgenerator.instance.generateid(untyped, thirddocument); thirddocument.setdocumentid(id2); the key is using the guidgenerator (in the mongodb.bson.serialization.idgenerators namespace) to generate a valid mongoid guid value, then call the setdocumentid method of bsondocument to manually set the id and not relay on automatic id generation. if you look at the db you will find that the document with guid id has really a different id type. figure 4: the document with guid id is represented in a different way in mongovue, but as you can verify there is no problem in having documents with different id types in the same collection. this demonstrates that a no sql database has a concept of document id that is similar to the concept of id of a standard sql server, you can use a native id generation of the engine that generates a valid id during insertion or you can assign your own id to the document, but basically the whole concept of id is more engine-related and has no business meaning, so i strongly discourage to use anything that has a business meaning as id of a document. noone prevents you to insert in the document a property called “myid” or something else that has a business meaning and can be used as logical id and let the engine handle the internal id by itself.

December 26, 2012

by Ricci Gian Maria

· 8,232 Views

When you should and should NOT use ENUM data type

ENUM is a new enumerated data type introduced in CUBRID 9.0. Like in all programming languages, the ENUM type is a data type composed of a set of static, ordered values. Users can define numeric and string values for ENUM columns. Working with ENUM types Creating an ENUM column is done by specifying a static list of possible values: CREATE TABLE person( name VARCHAR(255), gender ENUM('Male', 'Female') ); CUBRID understands the ENUM type as an ordered set of constants which, in the above example, is a set of {NULL: NULL, 1: 'Male', 2: 'Female”}. To assign a value to the gender column, users may either use the index of the value ({NULL, 1, 2}) or the actual constant literal ({NULL}, {'Male'}, {'Female'}). CUBRID restricts the values that can be assigned to this column to only values from this set + NULL. Moreover, ENUM column is case-sensitive, i.e. it will raise an error if you try to enter 'female' in lower case. Also, an empty string is allowed if it is defined as one of the elements of the ENUM column. In our examples, it is not allowed. csql> INSERT INTO person(name, gender) VALUES('Eugene', 'Male'); 1 row affected. 1 command(s) successfully processed. csql> INSERT INTO person(name, gender) VALUES('Anne', 2); 1 row affected. 1 command(s) successfully processed. csql> SELECT * FROM person; === === name gender ============================================ 'Anne' 'Female' 'Eugene' 'Male' 2 rows selected. Any attempt to insert a value outside of the defined set will result in a coercion error. In the below case, trying to insert an empty string raises an error because it is not in the set of allowed values defined in the person table. csql> INSERT INTO person(name, gender) VALUES('John', 'N/A'); IN line 1, COLUMN 44, ERROR: before ' ); ' Cannot coerce 'N/A' TO type enum. 0 command(s) successfully processed. csql> INSERT INTO person(name, gender) VALUES('John', 4); IN line 1, COLUMN 45, ERROR: before ' ); ' Cannot coerce 4 TO type enum. 0 command(s) successfully processed. csql> INSERT INTO person(name, gender) VALUES('John', ''); IN line 1, COLUMN 44, ERROR: before ' ); ' Cannot coerce '' TO type enum. 0 command(s) successfully processed. Why you should use the ENUM type There are three important reasons for which you should consider using the ENUM type: Reduce storage space. Reduce join complexity. Create cheap values constraints. Storage Space CUBRID uses only 1 byte per tuple when 255 or less ENUM elements are defined or 2 bytes for 256~65535 elements. This is because, rather that storing the constant literal of the value, CUBRID stores the index in the ordered set of that value. For very large tables, this might prove to be a significant storage space save. Take, for example, a table with 1,000,000,000 records which has an ENUM column defined as ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'). If you use a VARCHAR type instead of the ENUM type to store these values, the column would require anywhere between 5GB and 9GB of storage space. Using the ENUM type, you can reduce the required space to 2 bytes per tuple, adding up to a total of 2GB. Reduce join complexity JOIN way The same effect of the ENUM type can be achieved by creating a one to many relationship on two or more tables. Considering the example above, you can store values for days of the week like this: CREATE TABLE days_of_week( id SHORT PRIMARY KEY, name VARCHAR(9) ); CREATE TABLE opening_hours( week_day SHORT, opening_time TIME, closing_time TIME, FOREIGN KEY fk_dow (week_day) REFERENCES days_of_week(id) ); Then, when you wish to display the name of the week day, you would execute a query like: SELECT d.name day_name, o.opening_time, o.closing_time FROM days_of_week d, opening_hours o WHERE d.id = o.week_day ORDER BY d.id; === === day_name opening_time closing_time ================================================== 'Monday' 09:00:00 AM 06:00:00 PM 'Tuesday' 09:00:00 AM 06:00:00 PM 'Wednesday' 09:00:00 AM 06:00:00 PM ... ENUM way You can achieve the same effect using an ENUM column: CREATE TABLE opening_hours( week_day ENUM ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'), opening_time TIME, closing_time TIME ); And there’s no JOIN required to select opening hours: SELECT week_day, opening_time, closing_time FROM opening_hours ORDER BY week_day; === === week_day opening_time closing_time ================================================== 'Monday' 09:00:00 AM 06:00:00 PM 'Tuesday' 09:00:00 AM 06:00:00 PM 'Wednesday' 09:00:00 AM 06:00:00 PM ... This can prove to be very useful, especially if your queries join several tables. Value constraints ENUM columns behave like foreign key relationships in the sense that values from an ENUM column are restricted to the values specified in the column definition. For a short list of values, this is more efficient than creating foreign key relationships. While foreign key relationships use index scans to enforce the restriction, ENUM columns just have to go through a list of predefined values which is faster even for small indexes. Why/When you should NOT use the ENUM type Even though ENUM is a great feature, there are cases when you’d better not use it. For example: When ENUM type is not fixed When ENUM type has a long list of values When your application does not know the list of ENUM values ENUM type is not reusable Portability is a concern When ENUM type is not fixed If you’re not sure if the ENUM type holds all possible values for that column, you should consider using a one to many relationship instead. The only way in which an ENUM column can be changed to handle more values is by using an ALTER statement. This is a very expensive operation in any RDBMS and requires administrator rights. Also, ALTER statements are maintenance operations and should, as much as possible, be performed offline. When ENUM type has a long list of values ENUM types should not be used if you cannot limit a set of possible values to a few elements. When your application does not know the list of ENUM values There are only two ways of getting a list of values you have defined for an ENUM type: parsing the output of SHOW CREATE TABLE statement: csql> SHOW CREATE TABLE opening_hours; === === TABLE CREATE TABLE ============================================ 'opening_hours' 'CREATE TABLE [opening_hours] ([week_day] ENUM('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'), [opening_time] TIME, [closing_time] TIME) selecting information from CUBRID system tables: csql> SELECT d.enumeration FROM _db_domain d, _db_attribute a WHERE a.attr_name = 'week_day' AND a.class_of.class_name = 'opening_hours' AND d IN a.domains; === === enumeration ====================== {'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'} Both might require complex coding and selecting from system tables requires administrator privileges. ENUM type is not reusable If you have several tables which require the names of week days, you will have to create an ENUM type for each of them. If you create a table to hold week days names, you can join this table with whichever other table that requires this information. Portability is a concern The ENUM type is only supported by a few RDBMSs and each one has its own idea as to how ENUM type is supposed to work. Below is a list of a few notable differences between CUBRID, MySQL and PostgreSQL: CUBRID PostgreSQL MySQL Inserting out of range value Throws error Throws error Inserts special value index 0 Comparing to char literals Compare as strings Compare as ENUM elements Compare as strings Comparing to values outside of the ENUM domain Compare as strings Throws error Compare as strings These subtle differences will most probably break your application in interesting and hard to understand ways. If you’re migrating from PostgreSQL to CUBRID for example, and you expect comparisons with char literals to be performed as ENUM comparisons, you’ll have a hard time understanding why your query returns weird results.

December 19, 2012

by Esen Sagynov

· 60,370 Views

Checking DB Connection Using Groovy

Here is a simple Groovy script to verify Oracle database connection using JDBC. @GrabConfig(systemClassLoader=true) @Grab('com.oracle:ojdbc6:11g') url= "jdbc:oracle:thin:@localhost:1521:XE" username = "system" password = "mypassword123" driver = "oracle.jdbc.driver.OracleDriver" // Groovy Sql connection test import groovy.sql.* sql = Sql.newInstance(url, username, password, driver) try { sql.eachRow('select sysdate from dual'){ row -> println row } } finally { sql.close() } This script should let you test connection and perform any quick ad hoc queries programmatically. However, when you first run it, it would likely failed without finding the Maven dependency for JDBC driver jar. In this case, you would need to first install the Oracle JDBC jar into maven local repository. This is due to Oracle has not publish their JDBC jar into any public Maven repository. So we are left with manually steps by installing it. Here are the onetime setup steps: 1. Download Oracle JDBC jar from their site: http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html. 2. Unzip the file into C:/ojdbc directory. 3. Now you can install the jar file into Maven local repository using Cygwin. bash> cd /cygdrive/c/ojdbc bash> mvn install:install-file -DgroupId=com.oracle -DartifactId=ojdbc6 -Dversion=11g -Dpackaging=jar -Dfile=ojdbc6-11g.jar That should make your script run successfully. The Groovy way of using Sql has many sugarcoated methods that you let you quickly query and see data on screens. You can see more Groovy feature by studying their API doc. Note that you would need systemClassLoader=true to make Groovy load the JDBC jar into classpath and use it properly. Oh, BTW, if you are using Oracle DB production, you will likely using a RAC configuration. The JDBC url connection string for that should look something like this: jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=MY_DB))) Update: 12/07/2012 It appears that the groovy.sql.Sql class has a static withInstance method. This let you run onetime DB work without writing try/finally block. See this example: @GrabConfig(systemClassLoader=true) @Grab('com.oracle:ojdbc6:11g') url= "jdbc:oracle:thin:@localhost:1521:XE" username = "system" password = "mypassword123" driver = "oracle.jdbc.driver.OracleDriver" import groovy.sql.* Sql.withInstance(url, username, password, driver) { sql -> sql.eachRow('select sysdate from dual'){ row -> println row } } It's much more compact. But be aware of performance if you run it multiple times, because you will open and close the a java.sql.Connection per each call! I have also collected couple other popular databases connection test examples. These should have their driver jars already in Maven central, so Groovy Grab should able to grab them just fine. // MySQL database test @GrabConfig(systemClassLoader=true) @Grab('mysql:mysql-connector-java:5.1.22') import groovy.sql.* Sql.withInstance("jdbc:mysql://localhost:3306/mysql", "root", "mypassword123", "com.mysql.jdbc.Driver"){ sql -> sql.eachRow('SELECT * FROM USER'){ row -> println row } } // H2Database @GrabConfig(systemClassLoader=true) @Grab('com.h2database:h2:1.3.170') import groovy.sql.* Sql.withInstance("jdbc:h2:~/test", "sa", "", "org.h2.Driver"){ sql -> sql.eachRow('SELECT * FROM INFORMATION_SCHEMA.TABLES'){ row -> println row } }

December 12, 2012

by Zemian Deng

· 29,413 Views

Using YAML for Java Application Configuration

YAML is well-known format within Ruby community, quite widely used for a long time now. But we as Java developers mostly deal with property files and XMLs in case we need some configuration for our apps. How many times we needed to express complicated configuration by inventing our own XML schema or imposing property names convention? Though JSON is becoming a popular format for web applications, using JSON files to describe the configuration is a bit cumbersome and, in my opinion, is not as expressive as YAML. Let's see what YAML can do for us to make our life easier. For sure, let's start with the problem. In order for our application to function properly, we need to feed it following data somehow: version and release date database connection parameters list of supported protocols list of users with their passwords This list of parameters sounds a bit weird, but the purpose is to demonstrate different data types in work: strings, numbers, dates, lists and maps. The Java model consists of two simple classes: Connection package com.example.yaml; public final class Connection { private String url; private int poolSize; public String getUrl() { return url; } public void setUrl(String url) { this.url = url; } public int getPoolSize() { return poolSize; } public void setPoolSize(int poolSize) { this.poolSize = poolSize; } @Override public String toString() { return String.format( "'%s' with pool of %d", getUrl(), getPoolSize() ); } } and Configuration, both are typical Java POJOs, verbose because of property setters and getters (we get used to it, right?). package com.example.yaml; import static java.lang.String.format; import java.util.Date; import java.util.List; import java.util.Map; public final class Configuration { private Date released; private String version; private Connection connection; private List< String > protocols; private Map< String, String > users; public Date getReleased() { return released; } public String getVersion() { return version; } public void setReleased(Date released) { this.released = released; } public void setVersion(String version) { this.version = version; } public Connection getConnection() { return connection; } public void setConnection(Connection connection) { this.connection = connection; } public List< String > getProtocols() { return protocols; } public void setProtocols(List< String > protocols) { this.protocols = protocols; } public Map< String, String > getUsers() { return users; } public void setUsers(Map< String, String > users) { this.users = users; } @Override public String toString() { return new StringBuilder() .append( format( "Version: %s\n", version ) ) .append( format( "Released: %s\n", released ) ) .append( format( "Connecting to database: %s\n", connection ) ) .append( format( "Supported protocols: %s\n", protocols ) ) .append( format( "Users: %s\n", users ) ) .toString(); } } ow, as model is quite clear, let us try to express it as the human being normally does it. Looking back to our list of required configuration, let's try to write it down one by one. 1. version and release date version: 1.0 released: 2012-11-30 2. database connection parameters connection: url: jdbc:mysql://localhost:3306/db poolSize: 5 3. list of supported protocols protocols: - http - https 4. list of users with their passwords users: tom: passwd bob: passwd And this is it, our configuration expressed in YAML syntax is completed! The whole file sample.yml looks like this: version: 1.0 released: 2012-11-30 # Connection parameters connection: url: jdbc:mysql://localhost:3306/db poolSize: 5 # Protocols protocols: - http - https # Users users: tom: passwd bob: passwd To make it work in Java, we just need to use the awesome library called snakeyml, respectively the Maven POM file is quite simple: 4.0.0 com.example yaml 0.0.1-SNAPSHOT jar UTF-8 org.yaml snakeyaml 1.11 org.apache.maven.plugins maven-compiler-plugin 2.3.1 1.7 1.7 Please notice the usage of Java 1.7, the language extensions and additional libraries simplify a lot of regular tasks as we could see looking into YamlConfigRunner: package com.example.yaml; import java.io.IOException; import java.io.InputStream; import java.nio.file.Files; import java.nio.file.Paths; import org.yaml.snakeyaml.Yaml; public class YamlConfigRunner { public static void main(String[] args) throws IOException { if( args.length != 1 ) { System.out.println( "Usage: " ); return; } Yaml yaml = new Yaml(); try( InputStream in = Files.newInputStream( Paths.get( args[ 0 ] ) ) ) { Configuration config = yaml.loadAs( in, Configuration.class ); System.out.println( config.toString() ); } } } The code snippet here loads the configuration from file (args[ 0 ]), tries to parse it and fill up the Configuration class with meaningful data using JavaBeans conventions, converting to the declared types where possible. Running this class with sample.yml as an argument generates the following output: Version: 1.0 Released: Thu Nov 29 19:00:00 EST 2012 Connecting to database: 'jdbc:mysql://localhost:3306/db' with pool of 5 Supported protocols: [http, https] Users: {tom=passwd, bob=passwd} Totally identical to the values we have configured!

December 10, 2012

by Andriy Redko

· 240,389 Views · 6 Likes

C++ benchmark – std::vector VS std::list

a updated version of this article is available: c++ benchmark – std::vector vs std::list vs std::deque in c++, the two most used data structures are the std::vector and the std::list. in this article, we will compare the performance in practice of these two data structures on several different workloads. in this article, when i talk about a list it is the std::list implementation and vector refers to the std::vector implementation. it is generally said that a list should be used when random insert and remove will be performed (performed in o(1) versus o(n) for a vector). if we look only at the complexity, search in both data structures should be roughly equivalent, complexity being in o(n). when random insert/replace operations are performed on a vector, all the subsequent data needs to be moved so each element will be copied. that is why the size of the data type is an important factor when comparing those two data structures. however, in practice, there is a huge difference in the usage of the memory caches. all the data in a vector is contiguous where the std::list allocates memory separately for each element. how does that change the results in practice ? keep in mind that all the tests performed are made on vector and list even if other data structures could be better suited to the given workload. in the graphs and in the text, n is used to refer to the number of elements of the collection. all the tests performed have been performed on an intel core i7 q 820 @ 1.73ghz. the code has been compiled in 64 bits with gcc 4.7.2 with -02 and -march=native. the code has been compiled with c++11 support (-std=c++11). fill the first test is to fill the data structures by adding elements to the back of the container. two variations of vector are used, vector_pre being a std::vector with the size passed in parameters to the constructor, resulting in only one allocation of memory. fill (8 bytes)vector_prevectorlist100010000100000100000003006009001,200milliseconds fill (1024 bytes)vector_prevectorlist100010000100000100000006,00012,00018,00024,000milliseconds all data structures are impacted the same way when the data size increases because there will be more memory to allocate. the vector_pre is clearly the winner of this test, being one order of magnitude faster than a list and about twice as fast as a vector without pre-allocation. the results are directly linked to the allocations that have to be performed, allocation being slow. whatever the data size is, push_back to a vector will always be faster than to a list. this is logical because vector allocates more memory than necessary and so does not need to allocate memory for each element. but this test is not very interesting, generally building the data structure is not critical. what is critical is the operations that are performed on the data structure. that will be tested in the coming sections. random find the first operation is that is tested is the search. the container is filled with all the numbers in [0, n] and shuffled. then, each number in [0,n] is searched in the container with std::find that performs a simple linear search. yes, vector is represented in the graph, its line is the same as the x line ! performing a linear search in a vector is several orders of magnitude faster than in a list . the only reason is the usage of the cache line. when a data is accessed, the data is fetched from the main memory to the cache. not only the accessed data is accessed, but a whole cacheline is fetched. as the elements in a vector are contiguous, when you access an element, the next element is automatically in the cache. as the main memory is orders of magnitude slower than the cache, this makes a huge difference. in the list case, the processor spends its whole time waiting for data being fetched from memory to the cache. if we augment the size of the data type to 1kb, the results remain the same, but slower: random insert in the case of random insert, in theory, the list should be much faster, its insert operation being in o(1) versus o(n) for a vector. the container is filled with all the numbers in [0, n] and shuffled. then, 1000 random values are inserted at a random position in the container. the random position is found by linear search. in both cases, the complexity of the search is o(n), the only difference comes from the insert that follow the search. when, the vector should be slower than the list, it is almost an order of magnitude faster. again, this is because finding the position in a list is much slower than copying a lot of small elements. if we increase the size: the two lines are getting closer, but vector is still faster. increase it to 1kb: this time, list outperforms vector by an order of magnitude ! the performance of random insert in a list are not impacted much by the size of the data type, where vector suffers a lot when big sizes are used. we can also see that list doesn’t seem to care about the size of the collection. it is because the size of the collection only impact the search and not the insertion and as few search are performed, it does not change the results a lot. if the iterator was already known (no need for linear search), it would be faster to insert into a list than into the vector. random remove in theory, random remove is the same case than random insert. now that we’ve seen the results with random insert, we could expect the same behavior for random remove. the container is filled with all the numbers in [0, n] and shuffled. then, 1000 random values are removed from a random position in the container. again, vector is several times faster and looks to scale better. again, this is because it is very cheap to copy small elements. let’s increase it directly to 1kb element. the two lines have been reversed ! the behavior of random remove is the same as the behavior of random insert, for the same reasons. push front the next operation that we will compare is inserting elements in front of the collection. this is the worst case for vector, because after each insertion, all the previously inserted will be moved and copied. for a list, it does not make a difference compared to pushing to the back. the results are crystal-clear and as expected. vector is very bad at inserting elements to the front. this does not need further explanations. there is no need to change the data size, it will only make vector much slower. sort the next operation that is tested is the performance of sorting a vector or a list. for a vector std::sort is used and for a list the member function sort is used. we can see that sorting a list is several times slower. it comes from the poor usage of the cache. if we increase the size of the element to 1kb: this time the list is faster than the vector. it is not very clear on the graph, but the values for the list are almost the same as for the previous results. that is because std::list::sort() does not perform any copy, only pointers to the elements are changed. on the other hand, swapping two elements in a vector involves at least three copies, so the cost of sorting will increase as the cost of copying increases. number crunching finally, we can also test a number crunching operation. here, random elements are inserted into the container that is kept sorted. it means, that the position where the element has to be inserted is first searched by iterating through elements and the inserted. as we talk about number crunching, only 8 bytes elements are tested. we can clearly see that vector is more than an order of magnitude faster than list and this will only be more as the size of the collection increase. this is because traversing the list is much more expensive than copying the elements of the vector. conclusion to conclude, we can get some facts about each data structure: std::vector is insanely faster than std::list to find an element std::vector always performs faster than std::list with very small data std::vector is always faster to push elements at the back than std::list std::list handles large elements very well, especially for sorting or inserting in the front these are the simple conclusions on usage of each data structure: for number crunching : use std::vector for linear search : use std::vector for random insert/remove : use std::list (if data size very small (< 64b on my computer), use std::vector) for big data size : use std::list (not if intended for searching) if you have the time, in practice, the best way to decide is always to benchmark both versions, or even to try other data structures. i hope that you found this article interesting. if you have any comment or have an idea about another workload that you would like to test, don’t hesitate to post a comment if you have a question on results, don’t hesitate as well. the code source of the benchmark is available online: https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp

December 6, 2012

by Baptiste Wicht

· 45,065 Views

Enterprise-ready Tool Support for Apache Camel

apache camel is my favorite integration framework on the java platform due to great dsls, a huge community, and so many different components. camel is used by many developers from different companies all over the world. however, most guys are not aware that some really cool and – more important – enterprise-ready tooling is available for camel, too. many people ask me about camel tooling when i do talks at conferences. this is the reason for this short blog post about camel tooling. [fyi: i work for talend (one of the vendors).] ide support camel consists of a set of normal java libraries and is therefore usable with any java ide (such as eclipse, netbeans or intellij idea) or even a classic text editor. programming dsls are available for java, groovy, and scala. even a kotlin dsl is in the works, thanks to camel’s founder james strachan. all familiar ide features such as code completion or javadoc view are available for these dsls. in the spring xml dsl, the eclipse-based springsource tool suite (sts) should be emphasized, which provides the best support for the spring framework and xml configurations. camel-specific tooling besides classical ide support, further products are available to provide additional functionality. integration problems can be modeled with the help of enterprise integration patterns (eip, http://www.eaipatterns.com/). eips are implemented by camel. visual designers are available to help modeling integration problems with these eips. these tools even generate the corresponding source code automatically. ideally, developers do not have to write any source code by hand. camel tooling is offered by talend with talend esb (http://de.talend.com/products/esb) and jboss, formerly fusesource, with fuse ide (http://fusesource.com/products/fuse-ide). both companies also provide full-time committers for the apache camel project. let’s take a short look at these two products in the following. open studio for talend esb talend esb is an eclipse-based integration platform within the talend unified platform. the familiar “look and feel” and the intuitive use of eclipse remain. the esb is open source and freely available. the paid enterprise version offers additional features and support. the esb can be used independently or in combination with other parts of the talend unified platform, such as BPM, big data, or master data management. the great benefit is that everything can be done within one suite using the same gui and concepts, based on eclipse. the entire talend unified platform is based on the “zero-coding” approach. this way, a very efficient implementation of integration problems is possible using the eips and components. routes are modeled and configured with intuitive tool support, all source code is generated. of course, custom integration logic can still be written and included, for example, pojos, spring beans, scripts in different languages, or own camel components. plenty of other components besides camel’s ones are available for talend esb – for example connectors to alfresco, jasper, sap, salesforce, or host systems. figure 1: visual designer of talend’s esb fuse ide the fuse ide is an eclipse plugin, which is installed from the eclipse update site. the visual designer (see figure 2) generates camel routes as xml code using the spring xml dsl. the generated code is editable vice-versa, i.e. the developer can change the source code. the graphical model applies changes automatically. fuse ide is intuitive to use for creating camel routes. fusesource offers some other products, which can be used in combination with fuse ide – such as management console or fuse mq for messaging. under fusesource, fuse ide was a proprietary product. however, fusesource was recently taken over by redhat (http://www.redhat.com/about/news/press-archive/2012/6/red-hat-to-acquire-fusesource) and now belongs to the jboss division. in the new roadmap, the fuse ide is still included. it will probably be integrated into the jboss enterprise soa platform and become “open sourced”. the integration of fusesource will take at least a few more months time to complete (http://www.redhat.com/promo/jboss_integration_week/). jboss now “owns” three esb products (jboss esb, switchyard and fuse esb). probably, these will be merged into one product in the end (switchyard is also based on camel). nevertheless, the fusesource products will also be supported for some time – primarily in order to satisfy existing customers (my guess). figure 2: visual designer of fuse ide (jboss, former fusesource) enterprise-ready tooling is already available for apache camel! the bottom line is that enterprise-ready tooling is already available for apache camel. it is great to see different companies working on tooling for apache camel. the winner definitely is apache camel… and there is no loser! talend esb and fuse ide are two different approaches for different kinds of projects. if you like the „zero-coding“ approach, then take a closer look at talend’s esb. it is really easy and efficient to realize integration projects without writing source code – nevertheless, there is enough flexibility for customization and adding own source code. the combination with bpm, mdm or big data (based on hadoop) is also supported within the unified platform using the same open source and „zero-coding“ concepts. if you „insist“ on writing and refactoring all source code by yourself within the text editor of an ide, then take a look at fuse ide. your best would be to try out both and see which one fits best into your next enterprise integration project. if you know any other cool camel tooling (no matter if it is enterprise-ready or not), or if you have any other feedback, please write a comment. thank you. best regards, kai wähner (twitter: @kaiwaehner) content from my blog: http://www.kai-waehner.de/blog/2012/11/23/enterprise-ready-tool-support-for-apache-camel/

November 26, 2012

by Kai Wähner

CORE

· 15,596 Views