Data Engineering Resources

The Latest Data Engineering Topics

This post comes from Stephane Combaudon at the MySQL Performance Blog. Last time I wrote about a few tips that can make you more efficient when using the command line on Unix. Today I want to focus more on pager. The most common usage of pager is to set it to a Unix pager such as less. It can be very useful to view the result of a command spanning over many lines (for instance SHOW ENGINE INNODB STATUS): mysql> pager less PAGER set to 'less' mysql> show engine innodb status\G [...] Now you are inside less and you can easily navigate through the result set (use q to quit, space to scroll down, etc). Reminder: if you want to leave your custom pager, this is easy, just run pager: mysql> pager Default pager wasn't set, using stdout. Or \n: mysql> \n PAGER set to stdout But the pager command is not restricted to such basic usage! You can pass the output of queries to most Unix programs that are able to work on text. We have discussed the topic, but here are a few more examples. Discarding the result set Sometimes you don’t care about the result set, you only want to see timing information. This can be true if you are trying different execution plans for a query by changing indexes. Discarding the result is possible with pager: mysql> pager cat > /dev/null PAGER set to 'cat > /dev/null' # Trying an execution plan mysql> SELECT ... 1000 rows in set (0.91 sec) # Another execution plan mysql> SELECT ... 1000 rows in set (1.63 sec) Now it’s much easier to see all the timing information on one screen. Comparing result sets Let’s say you are rewriting a query and you want to check if the result set is the same before and after rewrite. Unfortunately, it has a lot of rows: mysql> SELECT ... [..] 989 rows in set (0.42 sec) Instead of manually comparing each row, you can calculate a checksum and only compare the checksum: mysql> pager md5sum PAGER set to 'md5sum' # Original query mysql> SELECT ... 32a1894d773c9b85172969c659175d2d - 1 row in set (0.40 sec) # Rewritten query - wrong mysql> SELECT ... fdb94521558684afedc8148ca724f578 - 1 row in set (0.16 sec) Hmmm, checksums don’t match, something is wrong. Let’s retry: # Rewritten query - correct mysql> SELECT ... 32a1894d773c9b85172969c659175d2d - 1 row in set (0.17 sec) Checksums are identical, the rewritten query is much likely to produce the same result as the original one. Cleaning up SHOW PROCESSLIST If you have lots of connections on your MySQL, it’s very difficult to read the output of SHOW PROCESSLIST. For instance, if you have several hundreds of connections and you want to know how many connections are sleeping, manually counting the rows from the output of SHOW PROCESSLIST is probably not the best solution. With pager, it is straightforward: mysql> pager grep Sleep | wc -l PAGER set to 'grep Sleep | wc -l' mysql> show processlist; 337 346 rows in set (0.00 sec) This should be read as ’337 out 346 connections are sleeping’. Slightly more complicated now: you want to know the number of connections for each status: mysql> pager awk -F '|' '{print $6}' | sort | uniq -c | sort -r PAGER set to 'awk -F '|' '{print $6}' | sort | uniq -c | sort -r' mysql> show processlist; 309 Sleep 3 2 Query 2 Binlog Dump 1 Command Astute readers will have noticed that these questions could have been solved by querying INFORMATION_SCHEMA. For instance, counting the number of sleeping connections can be done with: mysql> SELECT COUNT(*) FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND='Sleep'; +----------+ | COUNT(*) | +----------+ | 320 | +----------+ and counting the number of connection for each status can be done with: mysql> SELECT COMMAND,COUNT(*) TOTAL FROM INFORMATION_SCHEMA.PROCESSLIST GROUP BY COMMAND ORDER BY TOTAL DESC; +-------------+-------+ | COMMAND | TOTAL | +-------------+-------+ | Sleep | 344 | | Query | 5 | | Binlog Dump | 2 | +-------------+-------+ True, but: It’s nice to know several ways to get the same result Some of you may feel more comfortable with writing SQL queries, while others will prefer command line tools Conclusion As you can see, pager is your friend! It’s very easy to use and it can solve problems in an elegant and very efficient way. You can even write your custom script (if it is too complicated to fit in a single line) and pass it to the pager.

January 24, 2013

by Peter Zaitsev

· 14,269 Views

OAuth 2.0 Bearer Token Profile Vs MAC Token Profile

Almost all the implementation I see today are based on OAuth 2.0 Bearer Token Profile. Of course its an RFC proposed standard today. OAuth 2.0 Bearer Token profile brings a simplified scheme for authentication. This specification describes how to use bearer tokens in HTTP requests to access OAuth 2.0 protected resources. Any party in possession of a bearer token (a "bearer") can use it to get access to the associated resources (without demonstrating possession of a cryptographic key). To prevent misuse, bearer tokens need to be protected from disclosure in storage and in transport. Before dig in to the OAuth 2.0 MAC profile lets have quick high-level overview of OAuth 2.0 message flow. OAuth 2.0 has mainly three phases. 1. Requesting an Authorization Grant. 2. Exchanging the Authorization Grant for an Access Token. 3. Access the resources with the Access Token. Where does the token type come in to action ? OAuth 2.0 core specification does not mandate any token type. At the same time at any point token requester - client - cannot decide which token type it needs. It's purely up to the Authorization Server to decide which token type to be returned in the Access Token response. So, the token type comes in to action in phase-2 when Authorization Server returning back the OAuth 2.0 Access Token. The access token type provides the client with the information required to successfully utilize the access token to make a protected resource request (along with type-specific attributes). The client must not use an access token if it does not understand the token type. Each access token type definition specifies the additional attributes (if any) sent to the client together with the "access_token" response parameter. It also defines the HTTP authentication method used to include the access token when making a protected resource request. For example following is what you get for Access Token response irrespective of which grant type you use. HTTP/1.1 200 OK Content-Type: application/json;charset=UTF-8 Cache-Control: no-store Pragma: no-cache { "access_token":"mF_9.B5f-4.1JqM", "token_type":"Bearer", "expires_in":3600, "refresh_token":"tGzv3JOkF0XG5Qx2TlKWIA" } The above is for Bearer - following is for MAC. HTTP/1.1 200 OK Content-Type: application/json Cache-Control: no-store { "access_token":"SlAV32hkKG", "token_type":"mac", "expires_in":3600, "refresh_token":"8xLOxBtZp8", "mac_key":"adijq39jdlaska9asud", "mac_algorithm":"hmac-sha-256" } Here you can see MAC Access Token response has two additional attributes. mac_key and the mac_algorithm. Let me rephrase this - "Each access token type definition specifies the additional attributes (if any) sent to the client together with the "access_token" response parameter". This MAC Token Profile defines the HTTP MAC access authentication scheme, providing a method for making authenticated HTTP requests with partial cryptographic verification of the request, covering the HTTP method, request URI, and host. In the above response access_token is the MAC key identifier. Unlike in Bearer, MAC token profile never passes it's top secret over the wire. The access_token or the MAC key identifier is a string identifying the MAC key used to calculate the request MAC. The string is usually opaque to the client. The server typically assigns a specific scope and lifetime to each set of MAC credentials. The identifier may denote a unique value used to retrieve the authorization information (e.g. from a database), or self-contain the authorization information in a verifiable manner (i.e. a string consisting of some data and a signature). The mac_key is a shared symmetric secret used as the MAC algorithm key. The server will not reissue a previously issued MAC key and MAC key identifier combination. Now let's see what happens in phase-3. Following shows how the Authorization HTTP header looks like when Bearer Token been used. Authorization: Bearer mF_9.B5f-4.1JqM This adds very low overhead on client side. It simply needs to pass the exact access_token it got from the Authorization Server in phase-2. Under MAC token profile, this is how it looks like. Authorization: MAC id="h480djs93hd8", ts="1336363200", nonce="dj83hs9s", mac="bhCQXTVyfj5cmA9uKkPFx1zeOXM=" This needs bit more attention. id is the MAC key identifier or the access_token from the phase-2. ts the request timestamp. The value is a positive integer set by the client when making each request to the number of seconds elapsed from a fixed point in time (e.g. January 1, 1970 00:00:00 GMT). This value is unique across all requests with the same timestamp and MAC key identifier combination. nonce is a unique string generated by the client. The value is unique across all requests with the same timestamp and MAC key identifier combination. The client uses the MAC algorithm and the MAC key to calculate the request mac. This is how you derive the normalized string to generate the HMAC. The normalized request string is a consistent, reproducible concatenation of several of the HTTP request elements into a single string. By normalizing the request into a reproducible string, the client and server can both calculate the request MAC over the exact same value. The string is constructed by concatenating together, in order, the following HTTP request elements, each followed by a new line character (%x0A): 1. The timestamp value calculated for the request. 2. The nonce value generated for the request. 3. The HTTP request method in upper case. For example: "HEAD", "GET", "POST", etc. 4. The HTTP request-URI as defined by [RFC2616] section 5.1.2. 5. The hostname included in the HTTP request using the "Host" request header field in lower case. 6. The port as included in the HTTP request using the "Host" request header field. If the header field does not include a port, the default value for the scheme MUST be used (e.g. 80 for HTTP and 443 for HTTPS). 7. The value of the "ext" "Authorization" request header field attribute if one was included in the request (this is optional), otherwise, an empty string. Each element is followed by a new line character (%x0A) including the last element and even when an element value is an empty string. Either you use Bearer of MAC - the end user or the resource owner is identified using the access_token. Authorization, throttling, monitoring or any other quality of service operations can be carried out against the access_token irrespective of which token profile you use.

January 24, 2013

by Prabath Siriwardena

· 37,143 Views

How to Publish Maven Site Docs to BitBucket or GitHub Pages

In this post we will Utilize GitHub and/or BitBucket's static web page hosting capabilities to publish our project's Maven 3 Site Documentation. Each of the two SCM providers offer a slightly different solution to host static pages. The approach spelled out in this post would also be a viable solution to "backup" your site documentation in a supported SCM like Git or SVN. This solution does not directly cover site documentation deployment covered by the maven-site-plugin and the Wagon library (scp, WebDAV or FTP). There is one main project hosted on GitHub that I have posted with the full solution. The project URL is https://github.com/mike-ensor/clickconcepts-master-pom/. The POM has been pushed to Maven Central and will continue to be updated and maintained. com.clickconcepts.project master-site-pom 0.16 GitHub Pages GitHub hosts static pages by using a special branch "gh-pages" available to each GitHub project. This special branch can host any HTML and local resources like JavaScript, images and CSS. There is no server side development. To navigate to your static pages, the URL structure is as follows: http://.github.com/ An example of the project I am using in this blog post: http://mike-ensor.github.com/clickconcepts-master-pom/ where the first bold URL segment is a username and the second bold URL segment is the project. GitHub does allow you to create a base static hosted static site for your username by creating a repository with your username.github.com. The contents would be all of your HTML and associated static resources. This is not required to post documentation for your project, unlike the BitBucket solution. There is a GitHub Site plugin that publishes site documentation via GitHub's object API but this is outside the scope of this blog post because it does not provide a single solution for GitHub and BitBucket projects using Maven 3. BitBucket BitBucket provides a similar service to GitHub in that it hosts static HTML pages and their associated static resources. However, there is one large difference in how those pages are stored. Unlike GitHub, BitBucket requires you to create a new repository with a name fitting the convention. The files will be located on the master branch and each project will need to be a directory off of the root. mikeensor.bitbucket.org/ /some-project +index.html +... /css /img /some-other-project +index.html +... /css /img index.html .git .gitignore The naming convention is as follows: .bitbucket.org An example of a BitBucket static pages repository for me would be: http://mikeensor.bitbucket.org/. The structure does not require that you create an index.html page at the root of the project, but it would be advisable to avoid 404s. Generating Site Documentation Maven provides the ability to post documentation for your project by using the maven-site-plugin. This plugin is difficult to use due to the many configuration options that oftentimes are not well documented. There are many blog posts that can help you write your documentation including my post on maven site documentation. I did not mention how to use "xdoc", "apt" or other templating technologies to create documentation pages, but not to fear, I have provided this in my GitHub project. Putting it all Together The Maven SCM Publish plugin (http://maven.apache.org/plugins/maven-scm-publish-plugin/ publishes site documentation to a supported SCM. In our case, we are going to use Git through BitBucket or GitHub. Maven SCM Plugin does allow you to publish multi-module site documentation through the various properties, but the scope of this blog post is to cover single/mono module projects and the process is a bit painful. Take a moment to look at the POM file located in the clickconcepts-master-pom project. This master POM is rather comprehensive and the site documentation is only one portion of the project, but we will focus on the site documentation. There are a few things to point out here, first, the scm-publish plugin and the idiosyncronies when implementing the plugin. In order to create the site documentation, the "site" plugin must first be run. This is accomplished by running site:site. The plugin will generate the documentation into the "target/site" folder by default. The SCM Publish Plugin, by default, looks for the site documents to be in "target/staging" and is controlled by the content parameter. As you can see, there is a mismatch between folders. NOTE: My first approach was to run the site:stage command which is supposed to put the site documents into the "target/staging" folder. This is not entirely correct, the site plugin combines with the distributionManagement.site.url property to stage the documents, but there is very strange behavior and it is not documented well. In order to get the site plugin's site documents and the SCM Publish's location to match up, use the content property and set that to the location of the Site Plugin output (). If you are using GitHub, there is no modification to the siteOutputDirectory needed, however, if you are using BitBucket, you will need to modify the property to add in a directory layer into the site documentation generation (see above for differences between GitHub and BitBucket pages). The second property will tell the SCM Publish Plugin to look at the root "site" folder so that when the files are copied into the repository, the project folder will be the containing folder. The property will look like: ${project.build.directory}/site/ ${project.artifactId} ${project.build.directory} /site Next we will take a look at the custom properties defined in the master POM and used by the SCM Publish Plugin above. Each project will need to define several properties to use the Master POM that are used within the plugins during the site publishing. Fill in the variables with your own settings. BitBucket ... ... master scm:git:[email protected]:mikeensor/mikeensor.bitbucket.org.git ${project.build.directory}/site/${project.artifactId} ${project.build.directory}/site ${changelog.bitbucket.fileUri} ${changelog.revision.bitbucket.fileUri} ... ... GitHub ... ... gh-pages scm:git:[email protected]:mikeensor/clickconcepts-master-pom.git ${changelog.github.fileUri} ${changelog.revision.github.fileUri} ... ... NOTE: changelog parameters are required to use the Master POM and are not directly related to publishing site docs to GitHub or BitBucket How to Generate If you are using the Master POM (or have abstracted out the Site Plugin and the SCM Plugin) then to generate and publish the documentation is simple. mvn clean site:site scm-publish:publish-scm mvn clean site:site scm-publish:publish-scm -Dscmpublish.dryRun=true Gotchas In the SCM Publish Plugin documentation's "tips" they recommend creating a location to place the repository so that the repo is not cloned each time. There is a risk here in that if there is a git repository already in the folder, the plugin will overwrite the repository with the new site documentation. This was discovered by publishing two different projects and having my root repository wiped out by documentation from the second project. There are ways to mitigate this by adding in another folder layer, but make sure you test often! Another gotcha is to use the -Dscmpublish.dryRun=true to test out the site documentation process without making the SCM commit and push Project and Documentation URLs Here is a list of the fully working projects used to create this blog post: Master POM with Site and SCM Publish plugins &ndash https://github.com/mike-ensor/clickconcepts-master-pom. Documentation URL: http://mike-ensor.github.com/clickconcepts-master-pom/ Child Project using Master Pom &ndash http://mikeensor.bitbucket.org/fest-expected-exception. Documentation URL: http://mikeensor.bitbucket.org/fest-expected-exception/

January 23, 2013

by Mike Ensor

· 13,424 Views

Spring Data JDBC Generic DAO Implementation: Most Lightweight ORM Ever

I am thrilled to announce first version of my Spring Data JDBC repository project. The purpose of this open source library is to provide generic, lightweight and easy to use DAO implementation for relational databases based on JdbcTemplate from Spring framework, compatible with Spring Data umbrella of projects. Design objectives Lightweight, fast and low-overhead. Only a handful of classes, no XML, annotations, reflection This is not full-blown ORM. No relationship handling, lazy loading, dirty checking, caching CRUD implemented in seconds For small applications where JPA is an overkill Use when simplicity is needed or when future migration e.g. to JPA is considered Minimalistic support for database dialect differences (e.g. transparent paging of results) Features Each DAO provides built-in support for: Mapping to/from domain objects through RowMapper abstraction Generated and user-defined primary keys Extracting generated key Compound (multi-column) primary keys Immutable domain objects Paging (requesting subset of results) Sorting over several columns (database agnostic) Optional support for many-to-one relationships Supported databases (continuously tested): MySQL PostgreSQL H2 HSQLDB Derby ...and most likely most of the others Easily extendable to other database dialects via SqlGenerator class. Easy retrieval of records by ID API Compatible with Spring Data PagingAndSortingRepository abstraction, all these methods are implemented for you: public interface PagingAndSortingRepository extends CrudRepository { T save(T entity); Iterable save(Iterable entities); T findOne(ID id); boolean exists(ID id); Iterable findAll(); long count(); void delete(ID id); void delete(T entity); void delete(Iterable entities); void deleteAll(); Iterable findAll(Sort sort); Page findAll(Pageable pageable); } Pageable and Sort parameters are also fully supported, which means you get paging and sorting by arbitrary properties for free. For example say you have userRepository extending PagingAndSortingRepository interface (implemented for you by the library) and you request 5th page of USERS table, 10 per page, after applying some sorting: Page page = userRepository.findAll( new PageRequest( 5, 10, new Sort( new Order(DESC, "reputation"), new Order(ASC, "user_name") ) ) ); Spring Data JDBC repository library will translate this call into (PostgreSQL syntax): SELECT * FROM USERS ORDER BY reputation DESC, user_name ASC LIMIT 50 OFFSET 10 ...or even (Derby syntax): SELECT * FROM ( SELECT ROW_NUMBER() OVER () AS ROW_NUM, t.* FROM ( SELECT * FROM USERS ORDER BY reputation DESC, user_name ASC ) AS t ) AS a WHERE ROW_NUM BETWEEN 51 AND 60 No matter which database you use, you'll get Page object in return (you still have to provide RowMapper yourself to translate from ResultSet to domain object. If you don't know Spring Data project yet, Page is a wonderful abstraction, not only encapsulating List , but also providing metadata such as total number of records, on which page we currently are, etc. Reasons to use You consider migration to JPA or even some NoSQL database in the future. Since your code will rely only on methods defined in PagingAndSortingRepository and CrudRepository from Spring Data Commons umbrella project you are free to switch from JdbcRepository implementation (from this project) to: JpaRepository, MongoRepository, GemfireRepository or GraphRepository. They all implement the same common API. Of course don't expect that switching from JDBC to JPA or MongoDB will be as simple as switching imported JAR dependencies - but at least you minimize the impact by using same DAO API. You need a fast, simple JDBC wrapper library. JPA or even MyBatis is an overkill You want to have full control over generated SQL if needed You want to work with objects, but don't need lazy loading, relationship handling, multi-level caching, dirty checking... You need CRUD and not much more You want to by DRY You are already using Spring or maybe even JdbcTemplate, but still feel like there is too much manual work You have very few database tables Getting started For more examples and working code don't forget to examine project tests. Prerequisites Maven coordinates: com.blogspot.nurkiewicz jdbcrepository 0.1 Unfortunately the project is not yet in maven central repository. For the time being you can install the library in your local repository by cloning it: $ git clone git://github.com/nurkiewicz/spring-data-jdbc-repository.git $ git checkout 0.1 $ mvn javadoc:jar source:jar install In order to start your project must have DataSource bean present and transaction management enabled. Here is a minimal MySQL configuration: @EnableTransactionManagement @Configuration public class MinimalConfig { @Bean public PlatformTransactionManager transactionManager() { return new DataSourceTransactionManager(dataSource()); } @Bean public DataSource dataSource() { MysqlConnectionPoolDataSource ds = new MysqlConnectionPoolDataSource(); ds.setUser("user"); ds.setPassword("secret"); ds.setDatabaseName("db_name"); return ds; } } Entity with auto-generated key Say you have a following database table with auto-generated key (MySQL syntax): CREATE TABLE COMMENTS ( id INT AUTO_INCREMENT, user_name varchar(256), contents varchar(1000), created_time TIMESTAMP NOT NULL, PRIMARY KEY (id) ); First you need to create domain object User mapping to that table (just like in any other ORM): public class Comment implements Persistable { private Integer id; private String userName; private String contents; private Date createdTime; @Override public Integer getId() { return id; } @Override public boolean isNew() { return id == null; } //getters/setters/constructors/... } Apart from standard Java boilerplate you should notice implementing Persistable where Integer is the type of primary key. Persistable is an interface coming from Spring Data project and it's the only requirement we place on your domain object. Finally we are ready to create our CommentRepository DAO: @Repository public class CommentRepository extends JdbcRepository { public CommentRepository() { super(ROW_MAPPER, ROW_UNMAPPER, "COMMENTS"); } public static final RowMapper ROW_MAPPER = //see below private static final RowUnmapper ROW_UNMAPPER = //see below @Override protected Comment postCreate(Comment entity, Number generatedId) { entity.setId(generatedId.intValue()); return entity; } } First of all we use @Repository annotation to mark DAO bean. It enables persistence exception translation. Also such annotated beans are discovered by CLASSPATH scanning. As you can see we extend JdbcRepository which is the central class of this library, providing implementations of all PagingAndSortingRepository methods. Its constructor has three required dependencies: RowMapper , RowUnmapper and table name. You may also provide ID column name, otherwise default "id" is used. If you ever used JdbcTemplate from Spring, you should be familiar with RowMapper interface. We need to somehow extract columns from ResultSet into an object. After all we don't want to work with raw JDBC results. It's quite straightforward: public static final RowMapper ROW_MAPPER = new RowMapper () { @Override public Comment mapRow(ResultSet rs, int rowNum) throws SQLException { return new Comment( rs.getInt("id"), rs.getString("user_name"), rs.getString("contents"), rs.getTimestamp("created_time") ); } }; RowUnmapper comes from this library and it's essentially the opposite of RowMapper : takes an object and turns it into a Map . This map is later used by the library to construct SQL CREATE / UPDATE queries: private static final RowUnmapper ROW_UNMAPPER = new RowUnmapper () { @Override public Map mapColumns(Comment comment) { Map mapping = new LinkedHashMap (); mapping.put("id", comment.getId()); mapping.put("user_name", comment.getUserName()); mapping.put("contents", comment.getContents()); mapping.put("created_time", new java.sql.Timestamp(comment.getCreatedTime().getTime())); return mapping; } }; If you never update your database table (just reading some reference data inserted elsewhere) you may skip RowUnmapper parameter or use MissingRowUnmapper. Last piece of the puzzle is the postCreate() callback method which is called after an object was inserted. You can use it to retrieve generated primary key and update your domain object (or return new one if your domain objects are immutable). If you don't need it, just don't override postCreate() . Check out JdbcRepositoryGeneratedKeyTest for a working code based on this example. By now you might have a feeling that, compared to JPA or Hibernate, there is quite a lot of manual work. However various JPA implementations and other ORM frameworks are notoriously known for introducing significant overhead and manifesting some learning curve. This tiny library intentionally leaves some responsibilities to the user in order to avoid complex mappings, reflection, annotations... all the implicitness that is not always desired. This project is not intending to replace mature and stable ORM frameworks. Instead it tries to fill in a niche between raw JDBC and ORM where simplicity and low overhead are key features. Entity with manually assigned key In this example we'll see how entities with user-defined primary keys are handled. Let's start from database model: CREATE TABLE USERS ( user_name varchar(255), date_of_birth TIMESTAMP NOT NULL, enabled BIT(1) NOT NULL, PRIMARY KEY (user_name) ); ...and User domain model: public class User implements Persistable { private transient boolean persisted; private String userName; private Date dateOfBirth; private boolean enabled; @Override public String getId() { return userName; } @Override public boolean isNew() { return !persisted; } public User withPersisted(boolean persisted) { this.persisted = persisted; return this; } //getters/setters/constructors/... } Notice that special persisted transient flag was added. Contract of CrudRepository.save() from Spring Data project requires that an entity knows whether it was already saved or not ( isNew() ) method - there are no separate create() and update() methods. Implementing isNew() is simple for auto-generated keys (see Comment above) but in this case we need an extra transient field. If you hate this workaround and you only insert data and never update, you'll get away with return true all the time from isNew() . And finally our DAO, UserRepository bean: @Repository public class UserRepository extends JdbcRepository { public UserRepository() { super(ROW_MAPPER, ROW_UNMAPPER, "USERS", "user_name"); } public static final RowMapper ROW_MAPPER = //... public static final RowUnmapper ROW_UNMAPPER = //... @Override protected User postUpdate(User entity) { return entity.withPersisted(true); } @Override protected User postCreate(User entity, Number generatedId) { return entity.withPersisted(true); } } "USERS" and "user_name" parameters designate table name and primary key column name. I'll leave the details of mapper and unmapper (see source code). But please notice postUpdate() and postCreate() methods. They ensure that once object was persisted, persisted flag is set so that subsequent calls to save() will update existing entity rather than trying to reinsert it. Check out JdbcRepositoryManualKeyTest for a working code based on this example. Compound primary key We also support compound primary keys (primary keys consisting of several columns). Take this table as an example: CREATE TABLE BOARDING_PASS ( flight_no VARCHAR(8) NOT NULL, seq_no INT NOT NULL, passenger VARCHAR(1000), seat CHAR(3), PRIMARY KEY (flight_no, seq_no) ); I would like you to notice the type of primary key in Peristable : public class BoardingPass implements Persistable { private transient boolean persisted; private String flightNo; private int seqNo; private String passenger; private String seat; @Override public Object[] getId() { return pk(flightNo, seqNo); } @Override public boolean isNew() { return !persisted; } //getters/setters/constructors/... } Unfortunately we don't support small value classes encapsulating all ID values in one object (like JPA does with @IdClass), so you have to live with Object[] array. Defining DAO class is similar to what we've already seen: public class BoardingPassRepository extends JdbcRepository { public BoardingPassRepository() { this("BOARDING_PASS"); } public BoardingPassRepository(String tableName) { super(MAPPER, UNMAPPER, new TableDescription(tableName, null, "flight_no", "seq_no") ); } public static final RowMapper ROW_MAPPER = //... public static final RowUnmapper UNMAPPER = //... } Two things to notice: we extend JdbcRepository and we provide two ID column names just as expected: "flight_no", "seq_no" . We query such DAO by providing both flight_no and seq_no (necessarily in that order) values wrapped by Object[] : BoardingPass pass = repository.findOne(new Object[] {"FOO-1022", 42}); No doubts, this is cumbersome in practice, so we provide tiny helper method which you can statically import: import static com.blogspot.nurkiewicz.jdbcrepository.JdbcRepository.pk; //... BoardingPass foundFlight = repository.findOne(pk("FOO-1022", 42)); Check out JdbcRepositoryCompoundPkTest for a working code based on this example. Transactions This library is completely orthogonal to transaction management. Every method of each repository requires running transaction and it's up to you to set it up. Typically you would place @Transactional on service layer (calling DAO beans). I don't recommend placing @Transactional over every DAO bean. Caching Spring Data JDBC repository library is not providing any caching abstraction or support. However adding @Cacheable layer on top of your DAOs or services using caching abstraction in Spring is quite straightforward. See also: @Cacheable overhead in Spring. Contributions ..are always welcome. Don't hesitate to submit bug reports and pull requests. Biggest missing feature now is support for MSSQL and Oracle databases. It would be terrific if someone could have a look at it. Testing This library is continuously tested using Travis (). Test suite consists of 265 tests (53 distinct tests each run against 5 different databases: MySQL, PostgreSQL, H2, HSQLDB and Derby. When filling bug reports or submitting new features please try including supporting test cases. Each pull request is automatically tested on a separate branch. Building After forking the official repository building is as simple as running: $ mvn install You'll notice plenty of exceptions during JUnit test execution. This is normal. Some of the tests run against MySQL and PostgreSQL available only on Travis CI server. When these database servers are unavailable, whole test is simply skipped: Results : Tests run: 265, Failures: 0, Errors: 0, Skipped: 106 Exception stack traces come from root AbstractIntegrationTest. Design Library consists of only a handful of classes, highlighted in the diagram below: JdbcRepository is the most important class that implements all PagingAndSortingRepository methods. Each user repository has to extend this class. Also each such repository must at least implement RowMapper and RowUnmapper (only if you want to modify table data). SQL generation is delegated to SqlGenerator. PostgreSqlGenerator. and DerbySqlGenerator are provided for databases that don't work with standard generator. License This project is released under version 2.0 of the Apache License (same as Spring framework).

January 22, 2013

by Tomasz Nurkiewicz

· 76,633 Views · 2 Likes

ActiveMQ: Securing the ActiveMQ Web Console in Tomcat

This post will demonstrate how to secure the ActiveMQ WebConsole with a username and password when deployed in the Apache Tomcat web server. The Apache ActiveMQ documentation on the Web Console provides a good example of how this is done for Jetty, which is the default web server shipped with ActiveMQ, and this post will show how this is done when deploying the web console in Tomcat. To demonstrate, the first thing you will need to do is grab the latest distribution of ActiveMQ. For the purpose of this demonstration I will be using the 5.5.1-fuse-09-16 release which can be obtained via the Red Hat Support Portal or via the FuseSource repository: Red Hat Support Portal ActiveMQ ActiveMQ Web Console Tomcat mysql-connector-java-5.1.18-bin.jar Once you have the distributions, extract and start the broker. If you don't already have Tomcat installed you can grab it from the link above as well. I am using Tomcat 6.0.36 in this demonstration. Next, create a directory called activemq-console in the Tomcat webapps directory and extract the ActiveMQ Web Console war by using the jar -xf command. With all the binaries installed and our broker running we can begin configuring our web app and Tomcat to secure the Web Console. First, open the ActiveMQ Web Console's web descriptor, this can be found in the following location: activemq-console/WEB-INF/web.xml, and add the following configuration: Authenticate entire app /* GET POST activemq NONE BASIC This configuration enables the security constraint on the entire application as noted with /* url-pattern. Another point to notice is the auth-constraint which has been set to the activemq role, we will define this shortly. And lastly, note that this is configured for basic authentication. This means the username password are base64 encoded but not truly encrypted. To improve the security further you could enable a secure transport such as SSL. Now lets configure the Tomcat server to validate our activemq role we just specified in the web app. Out-of-the-box Tomcat is configured to use the UserDataBaseRealm. This is configured in [TOMCAT_HOME]/conf/server.xml. This instructs the web server to validate against the tomcat-users.xml file which can be found in [TOMCAT_HOME]/conf as well. Open the tomcat-users.xml file and add the following: This defines our activemq role and configures a user with that role. The last thing we need to do before starting our Tomcat server is add the required configuration to communicate with the broker. First, copy the activemq-all jar into the Tomcat lib directory. Next, open the catalina.sh/catalina.bat startup script and add the following configuration to initialize the JAVA_OPTS variable: JAVA_OPTS="-Dwebconsole.jms.url=tcp://localhost:61616 -Dwebconsole.jmx.url=service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi -Dwebconsole.jmx.user= -Dwebconsole.jmx.password=" Now we are ready to start the Tomcat server. Once started, you should be able to access the ActiveMQ Web Console at the following URL: http://localhost:8080/activemq-console. You should be prompted with something similar to this dialog: Once you enter the user name and password you should get logged into the ActiveMQ Web Console. As I mentioned before the user name and password are base64 encoded and each request is authenticated against the UserDataBaseRealm. The browser will retain your username and password in memory so you will need to exit the browser to end the session. What you have seen so far is a simple authentication using the UserDataBaseRealm which contains a list of users in a text file. Next we will look at configuring the ActiveMQ Web Console to use a JDBCRealm which will authenticate against users stored in a database. Lets first create a new database as follows using a MySQL database: mysql> CREATE DATABASE tomcat_users; Query OK, 1 row affected (0.00 sec) mysql> Provide the appropriate permissions for this database to a database user: mysql> GRANT ALL ON tomcat_users.* TO 'activemq'@'localhost'; Query OK, 0 rows affected (0.02 sec) mysql> Then you can login to the database and create the following tables: mysql> USE tomcat_users; Database changed mysql> CREATE TABLE tomcat_users ( -> user_name varchar(20) NOT NULL PRIMARY KEY, -> password varchar(32) NOT NULL -> ); Query OK, 0 rows affected (0.10 sec) mysql> CREATE TABLE tomcat_roles ( -> role_name varchar(20) NOT NULL PRIMARY KEY -> ); Query OK, 0 rows affected (0.05 sec) mysql> CREATE TABLE tomcat_users_roles ( -> user_name varchar(20) NOT NULL, -> role_name varchar(20) NOT NULL, -> PRIMARY KEY (user_name, role_name), -> CONSTRAINT tomcat_users_roles_foreign_key_1 FOREIGN KEY (user_name) REFERENCES tomcat_users (user_name), -> CONSTRAINT tomcat_users_roles_foreign_key_2 FOREIGN KEY (role_name) REFERENCES tomcat_roles (role_name) -> ); Query OK, 0 rows affected (0.06 sec) mysql> Next seed the tables with the user and role information: mysql> INSERT INTO tomcat_users (user_name, password) VALUES ('admin', 'dbpass'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO tomcat_roles (role_name) VALUES ('activemq'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO tomcat_users_roles (user_name, role_name) VALUES ('admin', 'activemq'); Query OK, 1 row affected (0.00 sec) mysql> Now we can verify the information in our database: mysql> select * from tomcat_users; +-----------+----------+ | user_name | password | +-----------+----------+ | admin | dbpass | +-----------+----------+ 1 row in set (0.00 sec) mysql> select * from tomcat_users_roles; +-----------+-----------+ | user_name | role_name | +-----------+-----------+ | admin | activemq | +-----------+-----------+ 1 row in set (0.00 sec) mysql> If you left the Tomcat server running from the first part of this demonstration shut it down at this time so we can change the configuration to use the JDBCRealm. In the server.xml file, located in [TOMCAT_HOME]/conf, we need to comment out the existing UserDataBaseRealm and add the JDBCRealm: Looking at the JDBCRealm, you can see we are using the mysql JDBC driver, the connection URL is configured to connect to the tomcat_users database using the specified credentials, and the table and column names used in our database have been specified. Now the Tomcat server can be started again. This time when you login to the ActiveMQ Web Console use the username and password specified when loading the database tables. That's all there is to it, you now know how to configure the ActiveMQ Web Console to use Tomcat's UserDatabaseRealm and JDBCRealm. The following sites were helpful in gathering this information: http://activemq.apache.org/web-console.html http://www.avajava.com/tutorials/lessons/how-do-i-use-a-jdbc-realm-with-tomcat-and-mysql.html?page=1 http://oreilly.com/pub/a/java/archive/tomcat-tips.html?page=1

January 21, 2013

by Jason Sherman

· 12,241 Views

Assign a Fixed IP to an AWS EC2 Instance

as described in my previous post the ip (and dns) of your running ec2 ami will change after a reboot of that instance. of course this makes it very hard to make your applications on that machine available for the outside world, like in this case our wordpress blog. that is where elastic ip comes to the rescue. with this feature you can assign a static ip to your instance. assign one to your application as follows: click on the elastic ips link in the aws console allocate a new address associate the address with a running instance right click to associate the ip with an instance: pick the instance to assign this ip to: note the ip being assigned to your instance if you go to the ip address you were assigned then you see the home page of your server: and the nicest thing is that if you stop and start your instance you will receive a new public dns but your instance is still assigned to the elastic ip address: one important note: as long as an elastic ip address is associated with a running instance, there is no charge for it. however an address that is not associated with a running instance costs $0.01/hour. this prevents users from ‘reserving’ addresses while they are not being used.

January 20, 2013

by Eric Genesky

· 22,940 Views

ActiveMQ: KahaDB Journal Files - More Than Just Message Content Bits

I recently came across an issue where the ActiveMQ KahaDB journal files were continually rolling despite the fact that only a small number of small persistent messages were occasionally being stored by the broker. This behavior seemed very strange being that the message sizes being persisted were only a couple of kilobytes and there was a relatively small amount of messages actually on a queue. In this scenario something was filling up the 32MB journal files, but I wasn't quite sure what it could be? Were there other messages somewhere in the broker? Did an index get corrupted that was actually causing messages to be written across multiple journal files? It was pretty strange behavior but it can be explained fairly easily. This post describes the actual cause of this behavior and I have created it to remind myself in the future that there is more in the journal file than just the message content bits. The KahaDB journal files are used to store persistent messages that have been sent to the broker. In addition to storing the message content, the journal files also store information on KahaDB commands and transactional information. There are several commands for which information is stored; KahaAddMessageCommand, KahaCommitCommand, KahaPrepareCommand, KahaProducerAuditCommand, KahaRemoveDestinationCommand, KahaRemoveMessageCommand, KahaRollbackCommand, KahaSubscriptionCommand, and KahaTraceCommand. In this particular case, it was the KahaProducerAuditCommand which was responsible for the behavior that was observed. This command stores information about producer ids and message ids which is used for duplicate detection. In this case information is stored in a map object which over time grows. This information is then stored in the journal file each time a checkpoint is run, which by default is every 5 seconds. Over time, this can begin to use up the space allocated by the journal file causing low volume smaller messages to roll to the next journal file which in turn prevents the broker from cleaning up journal files which still have referenced messages. Eventually this situation can lead to Producer Flow Control being trigger by the broker's store limit which prevents producers from sending new messages into the broker. This behavior can occur under the following conditions: Persistent messages are being sent to a queue The messages are not being consumed on a regular basis The rate of messages being sent to the broker is low These conditions allow for this behavior to be observed fairly easily. As the persistent messages do not get consumed they remain referenced and prevent the journal files from being cleaned up. The low message rate allows time to pass between each new message being stored in the journal file and in the meantime checkpoints are being run which cause KahaProducerAuditCommand information to space out the actual messages within the journal file. For this use case you can disable the duplicate detection by essentially limiting the growth of the KahaProducerAuditCommand using the following configuration on the persistent adapter in the broker configuration: This is something to think about when designing your system. Under normal circumstances, if you have consumers available to consume the persistent messages, this condition would probably never occur as the journal files roll and messages are consumed, the broker can begin to clean up old journal files. There is currently an enhancement request at Apache which will also help resolve this issue. AMQ-3833 has been opened to enhance the broker so it will only write the KahaProducerAuditCommand if a change has occurred since the last checkpoint. This will help reduce the amount of data that is written to the journal files in between message storage.

January 18, 2013

by Jason Sherman

· 19,800 Views · 1 Like

Why is MongoDB Wildly Popular? It’s a Data Structure Thing.

Curator's Note: The content of this article was originally published over at the MongoLab blog . “Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious.” - Eric Raymond, in The Cathedral and the Bazaar, 1997 Linguistic innovation The fundamental task of programming is telling a computer how to do something. Because of this, much of the innovation in the field of software development has been linguistic innovation; that is, innovation in the ease and effectiveness with which a programmer is able to instruct a computer system. While machines operate in binary, we don’t talk to them that way. Every decade has introduced higher-level programming languages, and with each, an advancement in the ability of programmers to express themselves. These advancements include improvements in how we express data structures as well as how we express algorithms. The Object-Relational impedance mismatch Almost all modern programming languages support OO, and when we model entities in our code, we usually model them using a composition of primitive types (ints, strings, etc…), arrays, and objects. While each language might handle the details differently, the idea of nested object structures has become our universal language for describing ‘things’. The data structures we use to persist data have not evolved at the same rate. For the past 30 years the primary data structure for persistent data has been the Table – a set of Rows comprised of Columns containing scalar values (ints, strings, etc…). This is the world of the relational database, popularized in the 1980′s by its transactionality, speedy queries, space efficiency over other contemporary database systems, and a meat-eating ORCL salesforce. The difference between the way we model things in code, via objects, and the way they are represented in persistent storage, via tables, has been the source of much difficulty for programmers. Millennia of man-effort have been put against solving the problem of changing the shape of data from the object form to the relational form and back. Tools called Object-Relational Mapping systems (ORMs) exist for every object-oriented language in existence, and even with these tools, almost any programmer will complain that doing O/R mapping in any meaningful way is a time-consuming chore. Ted Neward hit it spot on when he said: “Object-Relational mapping is the Vietnam of our industry” There were attempts made at object databases in the 90s, but there was no technology that ever became a real alternative to the relational database. The document database, and in particular MongoDB, is the first successful Web-era object store, and because of that, represents the first big linguistic innovation in persistent data structures in a very long time. Instead of flat, two-dimensional tables of records, we have collections of rich, recursive, N-dimensional objects (a.k.a. documents) for records. An Example: the Blog Post Consider the blog post. Most likely you would have a class / object structure for modeling blog posts in your code, but if you are using a relational database to store your blog data, each entry would be spread across a handful of tables. As a developer you, need to get know how to convert the each ‘BlogPost’ object to and from the set of tables that house them in the relational model. A different approach Using MongoDB, your blog posts can be stored in a single collection, with each entry looking like this: { _id: 1234, author: { name: "Bob Davis", email : "[email protected]" }, post: "In these troubled times I like to …", date: { $date: "2010-07-12 13:23UTC" }, location: [ -121.2322, 42.1223222 ], rating: 2.2, comments: [ { user: "[email protected]", upVotes: 22, downVotes: 14, text: "Great point! I agree" }, { user: "[email protected]", upVotes: 421, downVotes: 22, text: "You are a moron" } ], tags: [ "Politics", "Virginia" ] } With a document database your data is stored almost exactly as it is represented in your program. There is no complex mapping exercise (although one often chooses to bind objects to instances of particular classes in code). What’s MongoDB good for? MongoDB is great for modeling many of the entities that back most modern web-apps, either consumer or enterprise: Account and user profiles: can store arrays of addresses with ease CMS: the flexible schema of MongoDB is great for heterogeneous collections of content types Form data: MongoDB makes it easy to evolve structure of form data over time Blogs / user-generated content: can keep data with complex relationships together in one object Messaging: vary message meta-data easily per message or message type without needing to maintain separate collections or schemas System configuration: just a nice object graph of configuration values, which is very natural in MongoDB Log data of any kind: structured log data is the future Graphs: just objects and pointers – a perfect fit Location based data: MongoDB understands geo-spatial coordinates and natively supports geo-spatial indexing Looking forward: the data is the interface There is a famous quote by Eric Raymond, in The Cathedral and the Bazaar (rephrasing an earlier quote by Fred Brooks from the famous The Mythical Man-Month): “Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won’t usually need your code; it’ll be obvious.” Data structures embody the essence of our programs and our ideas. Therefore, as programmers, we are constantly inviting innovation in the ease with which we can define expressive data structures to model our application domain. People often ask me why MongoDB is so wildly popular. I tell them it’s a data structure thing. While MongoDB may have ridden onto the scene under the banner of scalability with the rest of the NoSQL database technologies, the disproportionate success of MongoDB is largely based on its innovation as a data structure store that lets us more easily and expressively model the ‘things’ at the heart of our applications. For this reason MongoDB, or something very like it, will become the dominant database paradigm for operational data storage, with relational databases filling the role of a specialized tool. Having the same basic data model in our code and in the database is the superior method for most use-cases, as it dramatically simplifies the task of application development, and eliminates the layers of complex mapping code that are otherwise required. While a JSON-based document database may in retrospect seem obvious (if it doesn’t yet, it will), doing it right, as the folks at 10gen have, represents a major innovation.

January 16, 2013

by Eric Genesky

· 6,686 Views

How Many Queues Are Best For Max Performance? RabbitMQ

A generally useful question posed by Charming asks how many queues one should use in RabbitMQ for the maximum message passing throughput/performance. I thought I'd distill the answers by Brian Kelly and RobotEyes here for anyone who's worried about their performance with RabbitMQ. And I'll bet that some of these pointers are relevant to other message queues as well: From the RabbitMQ blog: RabbitMQ's queues are fastest when they're empty. When a queue is empty, and it has consumers ready to receive messages, then as soon as a message is received by the queue, it goes straight out to the consumer. In the case of a persistent message in a durable queue, yes, it will also go to disk, but that's done in an asynchronous manner and is buffered heavily. The main point is that very little book-keeping needs to be done, very few data structures are modified, and very little additional memory needs allocating. From the rabbitmq-discuss mailing group: Use a larger prefetch count. Small values hurt performance. A topic exchange is slower than a direct or a fanout exchange. Make sure queues stay short. Longer queues impose more processing overhead. If you care about latency and message rates then use smaller messages. Use an efficient format (e.g. avoid XML) or compress the payload. Experiment with HiPE, which helps performance. Avoid transactions and persistence. Also avoid publishing in immediate or mandatory mode. Avoid HA. Clustering can also impact performance. You will achieve better throughput on a multi-core system if you have multiple queues and consumers. Use at least v2.8.1, which introduces flow control. Make sure the memory and disk space alarms never trigger. Virtualisation can impose a small performance penalty. Tune your OS and network stack. Make sure you provide more than enough RAM. Provide fast cores and RAM. Hope that gives you some ideas for improving your queueing performance.

January 16, 2013

by Mitch Pronschinske

· 19,063 Views

@Cacheable overhead in Spring

Spring 3.1 introduced great caching abstraction layer. Finally we can abandon all home-grown aspects, decorators and code polluting our business logic related to caching. Since then we can simply annotate heavyweight methods and let Spring and AOP machinery do the work: @Cacheable("books") public Book findBook(ISBN isbn) {...} "books" is a cache name, isbn parameter becomes cache key and returned Book object will be placed under that key. The meaning of cache name is dependant on the underlying cache manager (EhCache, concurrent map, etc.) - Spring makes it easy to plug different caching providers. But this post won't be about caching feature in Spring... Some time ago my teammate was optimizing quite low-level code and discovered an opportunity for caching. He quickly applied @Cacheable just to discover that the code performed worse then it used to. He got rid of the annotation and implemented caching himself manually, using good old java.util.ConcurrentHashMap. The performance was much better. He blamed @Cacheable and Spring AOP overhead and complexity. I couldn't believe that a caching layer can perform so poorly until I had to debug Spring caching aspects few times myself (some nasty bug in my code, you know, cache invalidation is one of the two hardest things in CS). Well, the caching abstraction code is much more complex than one would expect (after all it's just get and put!), but it doesn't necessarily mean it must be that slow? In science we don't believe and trust, we measure and benchmark. So I wrote a benchmark to precisely measure the overhead of @Cacheable layer. Caching abstraction layer in Spring is implemented on top of Spring AOP, which can further be implemented on top of Java proxies, CGLIB generated subclasses or AspectJ instrumentation. Thus I'll test the following configurations: no caching at all - to measure how fast the code is with no intermediate layer manual cache handling using ConcurrentHashMap in business code @Cacheable with CGLIB implementing AOP @Cacheable with java.lang.reflect.Proxy implementing AOP @Cacheable with AspectJ compile time weaving (as similar benchmark shows, CTW is slightly faster than LTW) Home-grown AspectJ caching aspect - something between manual caching in business code and Spring abstraction Let me reiterate: we are not measuring the performance gain of caching and we are not comparing various cache providers. That's why our test method is as fast as it can be and I will be using simplest ConcurrentMapCacheManager from Spring. So here is a method in question: public interface Calculator { int identity(int x); } public class PlainCalculator implements Calculator { @Cacheable("identity") @Override public int identity(int x) { return x; } } I know, I know there is no point in caching such a method. But I want to measure the overhead of caching layer (during cache hit to be specific). Each caching configuration will have its own ApplicationContext as you can't mix different proxying modes in one context: public abstract class BaseConfig { @Bean public Calculator calculator() { return new PlainCalculator(); } } @Configuration class NoCachingConfig extends BaseConfig {} @Configuration class ManualCachingConfig extends BaseConfig { @Bean @Override public Calculator calculator() { return new CachingCalculatorDecorator(super.calculator()); } } @Configuration abstract class CacheManagerConfig extends BaseConfig { @Bean public CacheManager cacheManager() { return new ConcurrentMapCacheManager(); } } @Configuration @EnableCaching(proxyTargetClass = true) class CacheableCglibConfig extends CacheManagerConfig {} @Configuration @EnableCaching(proxyTargetClass = false) class CacheableJdkProxyConfig extends CacheManagerConfig {} @Configuration @EnableCaching(mode = AdviceMode.ASPECTJ) class CacheableAspectJWeaving extends CacheManagerConfig { @Bean @Override public Calculator calculator() { return new SpringInstrumentedCalculator(); } } @Configuration @EnableCaching(mode = AdviceMode.ASPECTJ) class AspectJCustomAspect extends CacheManagerConfig { @Bean @Override public Calculator calculator() { return new ManuallyInstrumentedCalculator(); } } Each @Configuration class represents one application context. CachingCalculatorDecorator is a decorator around real calculator that does the caching (welcome to the 1990s): public class CachingCalculatorDecorator implements Calculator { private final Map cache = new java.util.concurrent.ConcurrentHashMap(); private final Calculator target; public CachingCalculatorDecorator(Calculator target) { this.target = target; } @Override public int identity(int x) { final Integer existing = cache.get(x); if (existing != null) { return existing; } final int newValue = target.identity(x); cache.put(x, newValue); return newValue; } } SpringInstrumentedCalculator and ManuallyInstrumentedCalculator are exactly the same as PlainCalculator but they are instrumented by AspectJ compile-time weaver with Spring and custom aspect accordingly. My custom caching aspect looks like this: public aspect ManualCachingAspect { private final Map cache = new ConcurrentHashMap(); pointcut cacheMethodExecution(int x): execution(int com.blogspot.nurkiewicz.cacheable.calculator.ManuallyInstrumentedCalculator.identity(int)) && args(x); Object around(int x): cacheMethodExecution(x) { final Integer existing = cache.get(x); if (existing != null) { return existing; } final Object newValue = proceed(x); cache.put(x, (Integer)newValue); return newValue; } } After all this preparation we can finally write the benchmark itself. At the beginning I start all the application contexts and fetch Calculator instances. Each instance is different. For example noCaching is a PlainCalculator instance with no wrappers, cacheableCglib is a CGLIB generated subclass while aspectJCustom is an instance of ManuallyInstrumentedCalculator with my custom aspect woven. private final Calculator noCaching = fromSpringContext(NoCachingConfig.class); private final Calculator manualCaching = fromSpringContext(ManualCachingConfig.class); private final Calculator cacheableCglib = fromSpringContext(CacheableCglibConfig.class); private final Calculator cacheableJdkProxy = fromSpringContext(CacheableJdkProxyConfig.class); private final Calculator cacheableAspectJ = fromSpringContext(CacheableAspectJWeaving.class); private final Calculator aspectJCustom = fromSpringContext(AspectJCustomAspect.class); private static Calculator fromSpringContext(Class config) { return new AnnotationConfigApplicationContext(config).getBean(Calculator.class); } I'm going to exercise each Calculator instance with the following test. The additional accumulator is necessary, otherwise JVM might optimize away the whole loop (!): private int benchmarkWith(Calculator calculator, int reps) { int accum = 0; for (int i = 0; i < reps; ++i) { accum += calculator.identity(i % 16); } return accum; } Here is the full caliper test without parts already discussed: public class CacheableBenchmark extends SimpleBenchmark { //... public int timeNoCaching(int reps) { return benchmarkWith(noCaching, reps); } public int timeManualCaching(int reps) { return benchmarkWith(manualCaching, reps); } public int timeCacheableWithCglib(int reps) { return benchmarkWith(cacheableCglib, reps); } public int timeCacheableWithJdkProxy(int reps) { return benchmarkWith(cacheableJdkProxy, reps); } public int timeCacheableWithAspectJWeaving(int reps) { return benchmarkWith(cacheableAspectJ, reps); } public int timeAspectJCustom(int reps) { return benchmarkWith(aspectJCustom, reps); } } I hope you are still following our experiment. We are now going to execute Calculate.identity() millions of times and see which caching configuration performs best. Since we only call identity() with 16 different arguments, we hardly ever touch the method itself as we always get cache hit. Curious to see the results? benchmark ns linear runtime NoCaching 1.77 = ManualCaching 23.84 = CacheableWithCglib 1576.42 ============================== CacheableWithJdkProxy 1551.03 ============================= CacheableWithAspectJWeaving 1514.83 ============================ AspectJCustom 22.98 = Interpretation Let's go step by step. First of all calling a method in Java is pretty darn fast! 1.77 nanoseconds, we are talking here about 3 CPU cycles on my Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz! If this doesn't convince you that Java is fast, I don't know what will. But back to our test. Hand-made caching decorator is also pretty fast. Of course it's slower by an order of magnitude compared to pure function call, but still blazingly fast compared to all @Scheduled benchmarks. We see a drop by 3 orders of magnitude, from 1.8 ns to 1.5 μs. I'm especially disappointed by the @Cacheable backed by AspectJ. After all caching aspect is precompiled directly into my Java .class file, I would expect it to be much faster compared to dynamic proxies and CGLIB. But that doesn't seem to be the case. All three Spring AOP techniques are similar. The greatest surprise is my custom AspectJ aspect. It's even faster than CachingCalculatorDecorator! maybe it's due to polymorphic call in the decorator? I strongly encourage you to clone this benchmark on GitHub and run it (mvn clean test, takes around 2 minutes) to compare your results. Conclusions You might be wondering why Spring abstraction layer is so slow? Well, first of all, check out the core implementation in CacheAspectSupport - it's actually quite complex. Secondly, is it really that slow? Do the math - you typically use Spring in business applications where database, network and external APIs are the bottleneck. What latencies do you typically see? Milliseconds? Tens or hundreds of milliseconds? Now add an overhead of 2 μs (worst case scenario). For caching database queries or REST calls this is completely negligible. It doesn't matter which technique you choose. But if you are caching very low-level methods close to the metal, like CPU-intensive, in-memory computations, Spring abstraction layer might be an overkill. The bottom line: measure! PS: both benchmark and contents of this article in Markdown format are freely available.

January 15, 2013

by Tomasz Nurkiewicz

· 29,755 Views · 2 Likes

Reading Hive Tables from MapReduce

This article is by Stephen Mouring Jr, appearing courtesy of Scott Leberknight. This is part two of a two part blog series on how to read/write Apache Hive data from MapReduce jos. Part one (Writing Hive Tables from MapReduce) is here. So just as sometimes you need to write data to Hive with a custom MapReduce job, sometimes you need to read that data back from Hive with a custom MapReduce job. As covered in part one, Hive is a layer that sits on HDFS and imposes a standard convention on the structure of the files so it can interpret them as columns and rows. Reading data out of Hive is just a matter of parsing the files correctly. Recall that files processed by MapReduce (and by extension, Hive) are output as key value pairs. Hive ignores the keys (read as a BytesWritable with a value of null) and reads/writes the values as Text objects. The value of the Text object for each row is the concatenation of all the column values delimited by the delimiter of the table (which Hive defaults to the "char 1" ASCII character). Seems like a simple problem, so my first thought was to just using String.split() in the map() method of the MapReduce job. String SEPARATOR_FIELD = new String(new char[] {1}); String[] rowColumns = new String (rowTextObject.getBytes()).split(SEPARATOR_FIELD); In theory this should have worked perfectly, but unfortunately I have found that String.split() actually consumes repeated delimiters. This is a problem if any of the values in the row are blank, since split() will shift the positions of your columns and you will be unable to match up what values belong with which columns. An alternative would be to create a String from the Text object and iterate through it using indexOf(). This approach however requires extra object creation and depending on the scale of your MapReduce job and the size of your rows, may slow you down needlessly. So an alternative is to use the Text object's find() method. String SEPARATOR_FIELD = new String(new char[] {1}); String[] rowColumns = new String[NUMBER_OF_COLUMNS_IN_YOUR_HIVE_TABLE]; int start = 0; int end = 0; for (int i = 0; i < rowColumns.length; ++i) { end = rowTextObject.find(SEPARATOR_FIELD, start); if (end == -1) { end = rowString.getLength(); } rowColumns[i] = new String(rowTextObject.getBytes(), start, end-start); start = end + 1; } This will parse out each value into the appropriately index of the rowColumns array. Blank values will also be handled correctly and result in blank strings being inserted into the rowColumns array.

January 11, 2013

by Scott Leberknight

· 6,617 Views · 1 Like

Bash Magic: List Hive Table Sizes in GB

How to list the sizes of Hive tables in Hadoop in GBs.

January 10, 2013

by Jakub Holý

· 42,853 Views · 3 Likes

Chunk Oriented Processing in Spring Batch

Big Data Sets’ Processing is one of the most important problem in the software world. Spring Batch is a lightweight and robust batch framework to process the data sets. Spring Batch Framework offers ‘TaskletStep Oriented’ and ‘Chunk Oriented’ processing style. In this article, Chunk Oriented Processing Model is explained. Also, TaskletStep Oriented Processing in Spring Batch Article is definitely suggested to investigate how to develop TaskletStep Oriented Processing in Spring Batch. Chunk Oriented Processing Feature has come with Spring Batch v2.0. It refers to reading the data one at a time, and creating ‘chunks’ that will be written out, within a transaction boundary. One item is read from an ItemReader, handed to an ItemProcessor, and written. Once the number of items read equals the commit interval, the entire chunk is written out via the ItemWriter, and then the transaction is committed. Basically, this feature should be used if at least one data item’ s reading and writing is required. Otherwise, TaskletStep Oriented processing can be used if the data item’ s only reading or writing is required. Chunk Oriented Processing model exposes three important interface as ItemReader, ItemProcessor and ItemWriter via org.springframework.batch.item package. ItemReader : This interface is used for providing the data. It reads the data which will be processed. ItemProcessor : This interface is used for item transformation. It processes input object and transforms to output object. ItemWriter : This interface is used for generic output operations. It writes the datas which are transformed by ItemProcessor. For example, the datas can be written to database, memory or outputstream (etc). In this sample application, we will write to database. Let us take a look how to develop Chunk Oriented Processing Model. Used Technologies : JDK 1.7.0_09 Spring 3.1.3 Spring Batch 2.1.9 Hibernate 4.1.8 Tomcat JDBC 7.0.27 MySQL 5.5.8 MySQL Connector 5.1.17 Maven 3.0.4 Step 1 : Create Maven Project A maven project is created as below. (It can be created by using Maven or IDE Plug-in). Step 2: Libraries A new USER Table is created by executing below script: CREATE TABLE ONLINETECHVISION.USER ( id int(11) NOT NULL AUTO_INCREMENT, name varchar(45) NOT NULL, surname varchar(45) NOT NULL, PRIMARY KEY (`id`) ); Step 3: Libraries Firstly, dependencies are added to Maven’ s pom.xml. 3.1.3.RELEASE 2.1.9.RELEASE org.springframework spring-core ${spring.version} org.springframework spring-context ${spring.version} org.springframework spring-tx ${spring.version} org.springframework spring-orm ${spring.version} org.springframework.batch spring-batch-core ${spring-batch.version} org.hibernate hibernate-core 4.1.8.Final org.apache.tomcat tomcat-jdbc 7.0.27 mysql mysql-connector-java 5.1.17 log4j log4j 1.2.16 maven-compiler-plugin(Maven Plugin) is used to compile the project with JDK 1.7 org.apache.maven.plugins maven-compiler-plugin 3.0 1.7 1.7 The following Maven plugin can be used to create runnable-jar, org.apache.maven.plugins maven-shade-plugin 2.0 package shade 1.7 1.7 com.onlinetechvision.exe.Application META-INF/spring.handlers META-INF/spring.schemas Step 4 : Create User Entity User Entity is created. This entity will be stored after processing. package com.onlinetechvision.user; import javax.persistence.Column; import javax.persistence.Entity; import javax.persistence.GeneratedValue; import javax.persistence.GenerationType; import javax.persistence.Id; import javax.persistence.Table; /** * User Entity * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ @Entity @Table(name="USER") public class User { private int id; private String name; private String surname; @Id @GeneratedValue(strategy=GenerationType.AUTO) @Column(name="ID", unique = true, nullable = false) public int getId() { return id; } public void setId(int id) { this.id = id; } @Column(name="NAME", unique = true, nullable = false) public String getName() { return name; } public void setName(String name) { this.name = name; } @Column(name="SURNAME", unique = true, nullable = false) public String getSurname() { return surname; } public void setSurname(String surname) { this.surname = surname; } @Override public String toString() { StringBuffer strBuff = new StringBuffer(); strBuff.append("id : ").append(getId()); strBuff.append(", name : ").append(getName()); strBuff.append(", surname : ").append(getSurname()); return strBuff.toString(); } } Step 5 : Create IUserDAO Interface IUserDAO Interface is created to expose data access functionality. package com.onlinetechvision.user.dao; import java.util.List; import com.onlinetechvision.user.User; /** * User DAO Interface * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public interface IUserDAO { /** * Adds User * * @param User user */ void addUser(User user); /** * Gets User List * */ List getUsers(); } Step 6 : Create UserDAO IMPL UserDAO Class is created by implementing IUserDAO Interface. package com.onlinetechvision.user.dao; import java.util.List; import org.hibernate.SessionFactory; import com.onlinetechvision.user.User; /** * User DAO * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class UserDAO implements IUserDAO { private SessionFactory sessionFactory; /** * Gets Hibernate Session Factory * * @return SessionFactory - Hibernate Session Factory */ public SessionFactory getSessionFactory() { return sessionFactory; } /** * Sets Hibernate Session Factory * * @param SessionFactory - Hibernate Session Factory */ public void setSessionFactory(SessionFactory sessionFactory) { this.sessionFactory = sessionFactory; } /** * Adds User * * @param User user */ @Override public void addUser(User user) { getSessionFactory().getCurrentSession().save(user); } /** * Gets User List * * @return List - User list */ @SuppressWarnings({ "unchecked" }) @Override public List getUsers() { List list = getSessionFactory().getCurrentSession().createQuery("from User").list(); return list; } } Step 7 : Create IUserService Interface IUserService Interface is created for service layer. package com.onlinetechvision.user.service; import java.util.List; import com.onlinetechvision.user.User; /** * * User Service Interface * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public interface IUserService { /** * Adds User * * @param User user */ void addUser(User user); /** * Gets User List * * @return List - User list */ List getUsers(); } Step 8 : Create UserService IMPL UserService Class is created by implementing IUserService Interface. package com.onlinetechvision.user.service; import java.util.List; import org.springframework.transaction.annotation.Transactional; import com.onlinetechvision.user.User; import com.onlinetechvision.user.dao.IUserDAO; /** * * User Service * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ @Transactional(readOnly = true) public class UserService implements IUserService { IUserDAO userDAO; /** * Adds User * * @param User user */ @Transactional(readOnly = false) @Override public void addUser(User user) { getUserDAO().addUser(user); } /** * Gets User List * */ @Override public List getUsers() { return getUserDAO().getUsers(); } public IUserDAO getUserDAO() { return userDAO; } public void setUserDAO(IUserDAO userDAO) { this.userDAO = userDAO; } } Step 9 : Create TestReader IMPL TestReader Class is created by implementing ItemReader Interface. This class is called in order to read items. When read method returns null, reading operation is completed. The following steps explains with details how to be executed firstJob. The commit-interval value of firstjob is 2 and the following steps are executed : 1) firstTestReader is called to read first item(firstname_0, firstsurname_0) 2) firstTestReader is called again to read second item(firstname_1, firstsurname_1) 3) testProcessor is called to process first item(FIRSTNAME_0, FIRSTSURNAME_0) 4) testProcessor is called to process second item(FIRSTNAME_1, FIRSTSURNAME_1) 5) testWriter is called to write first item(FIRSTNAME_0, FIRSTSURNAME_0) to database 6) testWriter is called to write second item(FIRSTNAME_1, FIRSTSURNAME_1) to database 7) first and second items are committed and the transaction is closed. 8) firstTestReader is called to read third item(firstname_2, firstsurname_2) 9) maxIndex value of firstTestReader is 3. read method returns null and item reading operation is completed. 10) testProcessor is called to process third item(FIRSTNAME_2, FIRSTSURNAME_2) 11) testWriter is called to write first item(FIRSTNAME_2, FIRSTSURNAME_2) to database 12) third item is committed and the transaction is closed. firstStep is completed with COMPLETED status and secondStep is started. secondJob and thirdJob are executed in the same way. package com.onlinetechvision.item; import org.springframework.batch.item.ItemReader; import org.springframework.batch.item.NonTransientResourceException; import org.springframework.batch.item.ParseException; import org.springframework.batch.item.UnexpectedInputException; import com.onlinetechvision.user.User; /** * TestReader Class is created to read items which will be processed * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class TestReader implements ItemReader { private int index; private int maxIndex; private String namePrefix; private String surnamePrefix; /** * Reads items one by one * * @return User * * @throws Exception * @throws UnexpectedInputException * @throws ParseException * @throws NonTransientResourceException * */ @Override public User read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException { User user = new User(); user.setName(getNamePrefix() + "_" + index); user.setSurname(getSurnamePrefix() + "_" + index); if(index > getMaxIndex()) { return null; } incrementIndex(); return user; } /** * Increments index which defines read-count * * @return int * */ private int incrementIndex() { return index++; } public int getMaxIndex() { return maxIndex; } public void setMaxIndex(int maxIndex) { this.maxIndex = maxIndex; } public String getNamePrefix() { return namePrefix; } public void setNamePrefix(String namePrefix) { this.namePrefix = namePrefix; } public String getSurnamePrefix() { return surnamePrefix; } public void setSurnamePrefix(String surnamePrefix) { this.surnamePrefix = surnamePrefix; } } Step 10 : Create FailedCaseTestReader IMPL FailedCaseTestReader Class is created in order to simulate the failed job status. In this sample application, when thirdJob is processed at fifthStep, failedCaseTestReader is called and exception is thrown so its status will be FAILED. package com.onlinetechvision.item; import org.springframework.batch.item.ItemReader; import org.springframework.batch.item.NonTransientResourceException; import org.springframework.batch.item.ParseException; import org.springframework.batch.item.UnexpectedInputException; import com.onlinetechvision.user.User; /** * FailedCaseTestReader Class is created in order to simulate the failed job status. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class FailedCaseTestReader implements ItemReader { private int index; private int maxIndex; private String namePrefix; private String surnamePrefix; /** * Reads items one by one * * @return User * * @throws Exception * @throws UnexpectedInputException * @throws ParseException * @throws NonTransientResourceException * */ @Override public User read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException { User user = new User(); user.setName(getNamePrefix() + "_" + index); user.setSurname(getSurnamePrefix() + "_" + index); if(index >= getMaxIndex()) { throw new Exception("Unexpected Error!"); } incrementIndex(); return user; } /** * Increments index which defines read-count * * @return int * */ private int incrementIndex() { return index++; } public int getMaxIndex() { return maxIndex; } public void setMaxIndex(int maxIndex) { this.maxIndex = maxIndex; } public String getNamePrefix() { return namePrefix; } public void setNamePrefix(String namePrefix) { this.namePrefix = namePrefix; } public String getSurnamePrefix() { return surnamePrefix; } public void setSurnamePrefix(String surnamePrefix) { this.surnamePrefix = surnamePrefix; } } Step 11 : Create TestProcessor IMPL TestProcessor Class is created by implementing ItemProcessor Interface. This class is called to process items. User item is received from TestReader, processed and returned to TestWriter. package com.onlinetechvision.item; import java.util.Locale; import org.springframework.batch.item.ItemProcessor; import com.onlinetechvision.user.User; /** * TestProcessor Class is created to process items. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class TestProcessor implements ItemProcessor { /** * Processes items one by one * * @param User user * @return User * @throws Exception * */ @Override public User process(User user) throws Exception { user.setName(user.getName().toUpperCase(Locale.ENGLISH)); user.setSurname(user.getSurname().toUpperCase(Locale.ENGLISH)); return user; } } Step 12 : Create TestWriter IMPL TestWriter Class is created by implementing ItemWriter Interface. This class is called to write items to DB, memory etc… package com.onlinetechvision.item; import java.util.List; import org.springframework.batch.item.ItemWriter; import com.onlinetechvision.user.User; import com.onlinetechvision.user.service.IUserService; /** * TestWriter Class is created to write items to DB, memory etc... * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class TestWriter implements ItemWriter { private IUserService userService; /** * Writes items via list * * @throws Exception * */ @Override public void write(List userList) throws Exception { for(User user : userList) { getUserService().addUser(user); } System.out.println("User List : " + getUserService().getUsers()); } public IUserService getUserService() { return userService; } public void setUserService(IUserService userService) { this.userService = userService; } } Step 13 : Create FailedStepTasklet Class FailedStepTasklet is created by implementing Tasklet Interface. It illustrates business logic in failed step. package com.onlinetechvision.tasklet; import org.apache.log4j.Logger; import org.springframework.batch.core.StepContribution; import org.springframework.batch.core.scope.context.ChunkContext; import org.springframework.batch.core.step.tasklet.Tasklet; import org.springframework.batch.repeat.RepeatStatus; /** * FailedStepTasklet Class illustrates a failed job. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class FailedStepTasklet implements Tasklet { private static final Logger logger = Logger.getLogger(FailedStepTasklet.class); private String taskResult; /** * Executes FailedStepTasklet * * @param StepContribution stepContribution * @param ChunkContext chunkContext * @return RepeatStatus * @throws Exception * */ public RepeatStatus execute(StepContribution stepContribution, ChunkContext chunkContext) throws Exception { logger.debug("Task Result : " + getTaskResult()); throw new Exception("Error occurred!"); } public String getTaskResult() { return taskResult; } public void setTaskResult(String taskResult) { this.taskResult = taskResult; } } Step 14 : Create BatchProcessStarter Class BatchProcessStarter Class is created to launch the jobs. Also, it logs their execution results. package com.onlinetechvision.spring.batch; import org.apache.log4j.Logger; import org.springframework.batch.core.Job; import org.springframework.batch.core.JobExecution; import org.springframework.batch.core.JobParametersBuilder; import org.springframework.batch.core.JobParametersInvalidException; import org.springframework.batch.core.launch.JobLauncher; import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException; import org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException; import org.springframework.batch.core.repository.JobRepository; import org.springframework.batch.core.repository.JobRestartException; /** * BatchProcessStarter Class launches the jobs and logs their execution results. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class BatchProcessStarter { private static final Logger logger = Logger.getLogger(BatchProcessStarter.class); private Job firstJob; private Job secondJob; private Job thirdJob; private JobLauncher jobLauncher; private JobRepository jobRepository; /** * Starts the jobs and logs their execution results. * */ public void start() { JobExecution jobExecution = null; JobParametersBuilder builder = new JobParametersBuilder(); try { getJobLauncher().run(getFirstJob(), builder.toJobParameters()); jobExecution = getJobRepository().getLastJobExecution(getFirstJob().getName(), builder.toJobParameters()); logger.debug(jobExecution.toString()); getJobLauncher().run(getSecondJob(), builder.toJobParameters()); jobExecution = getJobRepository().getLastJobExecution(getSecondJob().getName(), builder.toJobParameters()); logger.debug(jobExecution.toString()); getJobLauncher().run(getThirdJob(), builder.toJobParameters()); jobExecution = getJobRepository().getLastJobExecution(getThirdJob().getName(), builder.toJobParameters()); logger.debug(jobExecution.toString()); } catch (JobExecutionAlreadyRunningException | JobRestartException | JobInstanceAlreadyCompleteException | JobParametersInvalidException e) { logger.error(e); } } public Job getFirstJob() { return firstJob; } public void setFirstJob(Job firstJob) { this.firstJob = firstJob; } public Job getSecondJob() { return secondJob; } public void setSecondJob(Job secondJob) { this.secondJob = secondJob; } public Job getThirdJob() { return thirdJob; } public void setThirdJob(Job thirdJob) { this.thirdJob = thirdJob; } public JobLauncher getJobLauncher() { return jobLauncher; } public void setJobLauncher(JobLauncher jobLauncher) { this.jobLauncher = jobLauncher; } public JobRepository getJobRepository() { return jobRepository; } public void setJobRepository(JobRepository jobRepository) { this.jobRepository = jobRepository; } } Step 15 : Create dataContext.xml jdbc.properties, is created. It defines data-source informations and is read via dataContext.xml jdbc.db.driverClassName=com.mysql.jdbc.Driver jdbc.db.url=jdbc:mysql://localhost:3306/onlinetechvision jdbc.db.username=root jdbc.db.password=root jdbc.db.initialSize=10 jdbc.db.minIdle=3 jdbc.db.maxIdle=10 jdbc.db.maxActive=10 jdbc.db.testWhileIdle=true jdbc.db.testOnBorrow=true jdbc.db.testOnReturn=true jdbc.db.initSQL=SELECT 1 FROM DUAL jdbc.db.validationQuery=SELECT 1 FROM DUAL jdbc.db.timeBetweenEvictionRunsMillis=30000 Step 16 : Create dataContext.xml Spring Configuration file, dataContext.xml, is created. It covers dataSource, sessionFactory and transactionManager definitions. com.onlinetechvision.user.User org.hibernate.dialect.MySQLDialect true Step 17 : Create jobContext.xml Spring Configuration file, jobContext.xml, is created. It covers jobRepository, jobLauncher, item reader, item processor, item writer, tasklet and job definitions. Step 18 : Create applicationContext.xml Spring Configuration file, applicationContext.xml, is created. It covers bean definitions. Step 19 : Create Application Class Application Class is created to run the application. package com.onlinetechvision.exe; import org.springframework.context.ApplicationContext; import org.springframework.context.support.ClassPathXmlApplicationContext; import com.onlinetechvision.spring.batch.BatchProcessStarter; /** * Application Class starts the application. * * @author onlinetechvision.com * @since 10 Dec 2012 * @version 1.0.0 * */ public class Application { /** * Starts the application * * @param String[] args * */ public static void main(String[] args) { ApplicationContext appContext = new ClassPathXmlApplicationContext("applicationContext.xml"); BatchProcessStarter batchProcessStarter = (BatchProcessStarter)appContext.getBean("batchProcessStarter"); batchProcessStarter.start(); } } Step 20 : Build Project After OTV_SpringBatch_Chunk_Oriented_Processing Project is built, OTV_SpringBatch_Chunk_Oriented_Processing-0.0.1-SNAPSHOT.jar will be created. STEP 21 : RUN PROJECT After created OTV_SpringBatch_Chunk_Oriented_Processing-0.0.1-SNAPSHOT.jar file is run, the following database and console output logs will be shown : Database screenshot : First Job’ s console output : 16.12.2012 19:30:41 INFO (SimpleJobLauncher.java:118) - Job: [FlowJob: [name=firstJob]] launched with the following parameters: [{}] 16.12.2012 19:30:41 DEBUG (AbstractJob.java:278) - Job execution starting: JobExecution: id=0, version=0, startTime=null, endTime=null, lastUpdated=Sun Dec 16 19:30:41 GMT 2012, status=STARTING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=0, version=0, JobParameters=[{}], Job=[firstJob]] User List : [id : 181, name : FIRSTNAME_0, surname : FIRSTSURNAME_0, id : 182, name : FIRSTNAME_1, surname : FIRSTSURNAME_1, id : 183, name : FIRSTNAME_2, surname : FIRSTSURNAME_2, id : 184, name : SECONDNAME_0, surname : SECONDSURNAME_0, id : 185, name : SECONDNAME_1, surname : SECONDSURNAME_1, id : 186, name : SECONDNAME_2, surname : SECONDSURNAME_2] 16.12.2012 19:30:42 DEBUG (BatchProcessStarter.java:43) - JobExecution: id=0, version=2, startTime=Sun Dec 16 19:30:41 GMT 2012, endTime=Sun Dec 16 19:30:42 GMT 2012, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=COMPLETED, exitStatus=exitCode=COMPLETED;exitDescription=, job=[JobInstance: id=0, version=0, JobParameters=[{}], Job=[firstJob]] Second Job’ s console output : 16.12.2012 19:30:42 INFO (SimpleJobLauncher.java:118) - Job: [FlowJob: [name=secondJob]] launched with the following parameters: [{}] 16.12.2012 19:30:42 DEBUG (AbstractJob.java:278) - Job execution starting: JobExecution: id=1, version=0, startTime=null, endTime=null, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=STARTING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=1, version=0, JobParameters=[{}], Job=[secondJob]] User List : [id : 181, name : FIRSTNAME_0, surname : FIRSTSURNAME_0, id : 182, name : FIRSTNAME_1, surname : FIRSTSURNAME_1, id : 183, name : FIRSTNAME_2, surname : FIRSTSURNAME_2, id : 184, name : SECONDNAME_0, surname : SECONDSURNAME_0, id : 185, name : SECONDNAME_1, surname : SECONDSURNAME_1, id : 186, name : SECONDNAME_2, surname : SECONDSURNAME_2, id : 187, name : THIRDNAME_0, surname : THIRDSURNAME_0, id : 188, name : THIRDNAME_1, surname : THIRDSURNAME_1, id : 189, name : THIRDNAME_2, surname : THIRDSURNAME_2, id : 190, name : THIRDNAME_3, surname : THIRDSURNAME_3, id : 191, name : FOURTHNAME_0, surname : FOURTHSURNAME_0, id : 192, name : FOURTHNAME_1, surname : FOURTHSURNAME_1, id : 193, name : FOURTHNAME_2, surname : FOURTHSURNAME_2, id : 194, name : FOURTHNAME_3, surname : FOURTHSURNAME_3] 16.12.2012 19:30:42 DEBUG (BatchProcessStarter.java:47) - JobExecution: id=1, version=2, startTime=Sun Dec 16 19:30:42 GMT 2012, endTime=Sun Dec 16 19:30:42 GMT 2012, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=COMPLETED, exitStatus=exitCode=COMPLETED;exitDescription=, job=[JobInstance: id=1, version=0, JobParameters=[{}], Job=[secondJob]] Third Job’ s console output : 16.12.2012 19:30:42 INFO (SimpleJobLauncher.java:118) - Job: [FlowJob: [name=thirdJob]] launched with the following parameters: [{}] 16.12.2012 19:30:42 DEBUG (AbstractJob.java:278) - Job execution starting: JobExecution: id=2, version=0, startTime=null, endTime=null, lastUpdated=Sun Dec 16 19:30:42 GMT 2012, status=STARTING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=2, version=0, JobParameters=[{}], Job=[thirdJob]] 16.12.2012 19:30:42 DEBUG (TransactionTemplate.java:159) - Initiating transaction rollback on application exception org.springframework.batch.repeat.RepeatException: Exception in batch process; nested exception is java.lang.Exception: Unexpected Error! ... 16.12.2012 19:30:43 DEBUG (BatchProcessStarter.java:51) - JobExecution: id=2, version=2, startTime=Sun Dec 16 19:30:42 GMT 2012, endTime=Sun Dec 16 19:30:43 GMT 2012, lastUpdated=Sun Dec 16 19:30:43 GMT 2012, status=FAILED, exitStatus=exitCode=FAILED;exitDescription=, job=[JobInstance: id=2, version=0, JobParameters=[{}], Job=[thirdJob]] Step 22 : Download https://github.com/erenavsarogullari/OTV_SpringBatch_Chunk_Oriented_Processing REFERENCES : Chunk Oriented Processing in Spring Batch

January 3, 2013

by Eren Avsarogullari

· 153,141 Views · 7 Likes

This is a recurring question made by our MySQL Support customers: How can I audit the login attempts in MySQL? Logging all the attempts or just the failed ones is a very important task on some scenarios. Unfortunately there are not too many audit capabilities in MySQL Community so the first option to audit MySQL’s authentication process is to get all the information we need from logs. General Query Log The first option is the General Query Log. Let’s see an example: Enable the log: general_log_file = /var/log/mysql/mysql.log general_log = 1 User correctly authenticated: 121227 8:31:49 38 Connect root@localhost on 38 Query select @@version_comment limit 1 User not correctly authenticated: 121227 8:32:18 39 Connect root@localhost on 39 Connect Access denied for user 'root'@'localhost' (using password: YES) The problem of the General Query Log is that it will log everything so it can cause performance degradation and you will have to deal with very large files on high loaded servers. general_log variable is dynamic so a solution could be enabling and disabling the log just when it’s needed. Error log If you only care about failed attempts to login then there is another different and less problematic approach. From 5.5 it’s possible to log access denied messages to the error log. We just need to enable log_warnings with a value greater than 1: log_warnings = 2 Then check the error log: 121227 8:44:21 [Warning] Access denied for user 'root'@'localhost' (using password: YES) User Statistics If you are using Percona Server then there is a third option to get information about our users, the User Statistics. As with the previous options we can get the number of connections and failed connections made by a particular user but not the date and time of those attempts. Besides that information we can get other statistics that can be very useful if MySQL is running on a multi-tenant environment or we need to control how resources are used. Let’s seen an example, first we enable User Statistics in my.cnf: 5.5 userstat = 1 5.1 userstat_running = 1 Then we get the information about a particular user: mysql> select * from user_statistics where user='root'\G *************************** 1. row *************************** USER: root TOTAL_CONNECTIONS: 25 CONCURRENT_CONNECTIONS: 0 CONNECTED_TIME: 464 BUSY_TIME: 96 CPU_TIME: 19 BYTES_RECEIVED: 62869617 BYTES_SENT: 14520 BINLOG_BYTES_WRITTEN: 0 ROWS_FETCHED: 783051 ROWS_UPDATED: 1017714 TABLE_ROWS_READ: 1484751 SELECT_COMMANDS: 14 UPDATE_COMMANDS: 103 OTHER_COMMANDS: 3556 COMMIT_TRANSACTIONS: 0 ROLLBACK_TRANSACTIONS: 0 DENIED_CONNECTIONS: 2 LOST_CONNECTIONS: 16 ACCESS_DENIED: 0 EMPTY_QUERIES: 0 TOTAL_SSL_CONNECTIONS: 0 Here we can see that root has done 25 total connections. Two denied connections (bad password) and 16 lost connections (not closed properly). Apart from that information we get the connection time, bytes received and sent, rows accessed, commands executed and so on. Very valuable information. It is important to mention that these tables are stored in INFORMATION_SCHEMA and that means that after a mysqld restart all the information will be lost. So if you really need that information you should copy it to another table or export to a csv for further analysis. Conclusion We don’t have too many audit capabilities in MySQL Community so logging all events and then filter them with custom-made scripts is the best solution we have nowadays. If you are using Percona Server you can get more detailed information about what a particular user is doing. All options can be combined to meet your needs.

January 2, 2013

by Peter Zaitsev

· 18,138 Views

How ACID is MongoDB?

Find out how MongoDB ranks with atomicity, consistency, isolation and durability (ACID).

January 1, 2013

by Giorgio Sironi

· 81,959 Views · 5 Likes

UML2 Class Diagram in Java

Learn all about this Unified Modeling Language diagram in Java.

December 31, 2012

by Mainak Goswami

· 200,466 Views · 10 Likes

Google Guava Cache with regular expression patterns

Hi! Merry Christmas everyone :) Quite recently I've seen a nice presentation about Google Guava and we came to the conclusion in our project that it could be really interesting to use the its Cache functionallity. Let us take a look at the regexp Pattern class and its compile function. Quite often in the code we can see that each time a regular expression is being used a programmer is repeatidly calling the aforementioned Pattern.compile() function with the same argument thus compiling the same regular expression over and over again. What could be done however is to cache the result of such compilations - let us take a look at the RegexpUtils utility class: RegexpUtils.java package pl.grzejszczak.marcin.guava.cache.utils; import com.google.common.cache.CacheBuilder; import com.google.common.cache.CacheLoader; import com.google.common.cache.LoadingCache; import java.util.concurrent.ExecutionException; import java.util.regex.Matcher; import java.util.regex.Pattern; import static java.lang.String.format; public final class RegexpUtils { private RegexpUtils() { throw new UnsupportedOperationException("RegexpUtils is a utility class - don't instantiate it!"); } private static final LoadingCache COMPILED_PATTERNS = CacheBuilder.newBuilder().build(new CacheLoader() { @Override public Pattern load(String regexp) throws Exception { return Pattern.compile(regexp); } }); public static Pattern getPattern(String regexp) { try { return COMPILED_PATTERNS.get(regexp); } catch (ExecutionException e) { throw new RuntimeException(format("Error when getting a pattern [%s] from cache", regexp), e); } } public static boolean matches(String stringToCheck, String regexp) { return doGetMatcher(stringToCheck, regexp).matches(); } public static Matcher getMatcher(String stringToCheck, String regexp) { return doGetMatcher(stringToCheck, regexp); } private static Matcher doGetMatcher(String stringToCheck, String regexp) { Pattern pattern = getPattern(regexp); return pattern.matcher(stringToCheck); } } As you can see the Guava's LoadingCache with the CacheBuilder is being used to populate a cache with a new compiled pattern if one is not found. Due to caching the compiled pattern if a compilation has already taken place it will not be repeated ever again (in our case since we dno't have any expiry set). Now a simple test GuavaCache.java package pl.grzejszczak.marcin.guava.cache; import com.google.common.base.Stopwatch; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import pl.grzejszczak.marcin.guava.cache.utils.RegexpUtils; import java.util.regex.Pattern; import static java.lang.String.format; public class GuavaCache { private static final Logger LOGGER = LoggerFactory.getLogger(GuavaCache.class); public static final String STRING_TO_MATCH = "something"; public static void main(String[] args) { runTestForManualCompilationAndOneUsingCache(1); runTestForManualCompilationAndOneUsingCache(10); runTestForManualCompilationAndOneUsingCache(100); runTestForManualCompilationAndOneUsingCache(1000); runTestForManualCompilationAndOneUsingCache(10000); runTestForManualCompilationAndOneUsingCache(100000); runTestForManualCompilationAndOneUsingCache(1000000); } private static void runTestForManualCompilationAndOneUsingCache(int firstNoOfRepetitions) { repeatManualCompilation(firstNoOfRepetitions); repeatCompilationWithCache(firstNoOfRepetitions); } private static void repeatManualCompilation(int noOfRepetitions) { Stopwatch stopwatch = new Stopwatch().start(); compileAndMatchPatternManually(noOfRepetitions); LOGGER.debug(format("Time needed to compile and check regexp expression [%d] ms, no of iterations [%d]", stopwatch.elapsedMillis(), noOfRepetitions)); } private static void repeatCompilationWithCache(int noOfRepetitions) { Stopwatch stopwatch = new Stopwatch().start(); compileAndMatchPatternUsingCache(noOfRepetitions); LOGGER.debug(format("Time needed to compile and check regexp expression using Cache [%d] ms, no of iterations [%d]", stopwatch.elapsedMillis(), noOfRepetitions)); } private static void compileAndMatchPatternManually(int limit) { for (int i = 0; i < limit; i++) { Pattern.compile("something").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something1").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something2").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something3").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something4").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something5").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something6").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something7").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something8").matcher(STRING_TO_MATCH).matches(); Pattern.compile("something9").matcher(STRING_TO_MATCH).matches(); } } private static void compileAndMatchPatternUsingCache(int limit) { for (int i = 0; i < limit; i++) { RegexpUtils.matches(STRING_TO_MATCH, "something"); RegexpUtils.matches(STRING_TO_MATCH, "something1"); RegexpUtils.matches(STRING_TO_MATCH, "something2"); RegexpUtils.matches(STRING_TO_MATCH, "something3"); RegexpUtils.matches(STRING_TO_MATCH, "something4"); RegexpUtils.matches(STRING_TO_MATCH, "something5"); RegexpUtils.matches(STRING_TO_MATCH, "something6"); RegexpUtils.matches(STRING_TO_MATCH, "something7"); RegexpUtils.matches(STRING_TO_MATCH, "something8"); RegexpUtils.matches(STRING_TO_MATCH, "something9"); } } } We are running a series of tests and checking the time of their execution. Note that the results of these tests are not precise due to the fact that the application is not being run in isolation so numerous conditions can affect the time of the execution. We are interested in showing some degree of the problem rather than showing the precise execution time. For a given number of iterations (1,10,100,1000,10000,100000,1000000) we are either compiling 10 regular expressions or using a Guava's cache to retrieve the compiled Pattern and then we match them against a string to match. These are the logs: pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [1] ms, no of iterations [1] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [35] ms, no of iterations [1] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [1] ms, no of iterations [10] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [0] ms, no of iterations [10] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [8] ms, no of iterations [100] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [3] ms, no of iterations [100] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [10] ms, no of iterations [1000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [10] ms, no of iterations [1000] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [83] ms, no of iterations [10000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [33] ms, no of iterations [10000] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [800] ms, no of iterations [100000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [279] ms, no of iterations [100000] pl.grzejszczak.marcin.guava.cache.GuavaCache:34 Time needed to compile and check regexp expression [7562] ms, no of iterations [1000000] pl.grzejszczak.marcin.guava.cache.GuavaCache:40 Time needed to compile and check regexp expression using Cache [3067] ms, no of iterations [1000000] You can find the sources over here under the Guava/Cache directory or go to the url https://bitbucket.org/gregorin1987/too-much-coding/src

December 28, 2012

by Marcin Grzejszczak

· 9,183 Views

HashMap.get High CPU – Case Study

This article will describe the complete root cause analysis and solution of a HashMap High CPU problem (infinite looping) affecting a Weblogic 10.0 environment running on the Java HotSpot VM 1.5. This case study will again demonstrate this importance of mastering Thread Dump analysis skill and CPU correlation techniques such as Solaris prstat. Environment specifications Java EE server: Oracle Weblogic Portal 10.0 Middleware OS: Solaris 10 Java VM: Java HotSpot VM 1.5 Platform type: Portal application Monitoring and troubleshooting tools JVM Thread Dump (HotSpot format) Solaris prstat (CPU contributors analysis) Problem overview Problem type: High CPUobserved from our Weblogic production environment A high CPU problem was observed from our Solaris physical servers hosting a Weblogic Portal 10 environment. Users also reporting major slowdown of the portal application. Gathering and validation of facts As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derived other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause: What is the client impact? HIGH Recent change of the affected platform? No Any recent traffic increase to the affected platform? Yes How does this high CPU manifest itself? A sudden CPU increase was observed and is not going down; even after load goes down e.g. near zero level. Did an Oracle OSB recycle resolve the problem? Yes, but problem is returning after few hours or few days (unpredictable pattern) - Conclusion #1: The high CPU problem appears to be intermittent vs. pure correlation with load - Conclusion #2: Since high CPU remains after load goes down, this typically indicates either the presence of some infinite looping or heavy processing Threads Solaris CPU analysis using prstat Solaris prstat is a powerful OS command allowing you to obtain the CPU per process but more importantly CPU per Thread within a process. As you can see below from our case study, the CPU utilization was confirmed to go up as high as 100% utilization (saturation level). ## PRSTAT (CPU per Java Thread analysis) prstat -L -p 8223 1 1 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID 8223 bea10 2809M 2592M sleep 59 0 14:52:59 38.6%java/494 8223 bea10 2809M 2592M sleep 57 0 12:28:05 22.3% java/325 8223 bea10 2809M 2592M sleep 59 0 11:52:02 28.3% java/412 8223 bea10 2809M 2592M sleep 59 0 5:50:00 0.3% java/84 8223 bea10 2809M 2592M sleep 58 0 2:27:20 0.2% java/43 8223 bea10 2809M 2592M sleep 59 0 1:39:42 0.2% java/41287 8223 bea10 2809M 2592M sleep 59 0 4:41:44 0.2% java/30503 8223 bea10 2809M 2592M sleep 59 0 5:58:32 0.2% java/36116 …………………………………………………………………………………… As you can see from above data, 3 Java Threads were found using together close to 100% of the CPU utilization. For our root cause analysis, we did focus on Thread #494 (decimal format) corresponding to 0x1ee (HEXA format). Thread Dump analysis and PRSTAT correlation Once the culprit Threads were identified, the next step was to correlate this data with the Thread Dump data (which was captured exactly at the same time as prstat). A quick search within the generated Thread Dump file did reveal the Thread Stack Trace (Weblogic Stuck Thread #125) for 0x1ee as per below. "[STUCK] ExecuteThread: '125' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=1 tid=0x014c5030 nid=0x1ee runnable [0x536fb000..0x536ffc70] at java.util.HashMap.get(HashMap.java:346) at org.apache.axis.encoding.TypeMappingImpl.getClassForQName(TypeMappingImpl.java:715) at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:170) at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:160) at org.apache.axis.encoding.TypeMappingImpl.getDeserializer(TypeMappingImpl.java:454) at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:108) at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:102) at org.apache.axis.encoding.DeserializationContext.getDeserializer(DeserializationContext.java:457) at org.apache.axis.encoding.DeserializationContext.getDeserializerForType(DeserializationContext.java:547) at org.apache.axis.encoding.ser.BeanDeserializer.getDeserializer(BeanDeserializer.java:514) at org.apache.axis.encoding.ser.BeanDeserializer.onStartChild(BeanDeserializer.java:286) at org.apache.axis.encoding.DeserializationContext.startElement(DeserializationContext.java:1035) at org.apache.axis.message.SAX2EventRecorder.replay(SAX2EventRecorder.java:165) at org.apache.axis.message.MessageElement.publishToHandler(MessageElement.java:1141) at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:236) at org.apache.axis.message.RPCElement.getParams(RPCElement.java:384) at org.apache.axis.client.Call.invoke(Call.java:2467) at org.apache.axis.client.Call.invoke(Call.java:2366) at org.apache.axis.client.Call.invoke(Call.java:1812) Thread Dump analysis – HashMap.get() infinite loop condition! As you can see from the above Thread Stack Trace, the Thread is currently stuck in an infinite loop over a java.util.HashMap that originates from the Apache Axis TypeMappingImpl Java class. This finding was quite revealing. The 2 others Threads using high CPU also did reveal infinite looping condition within the same Apache Axis HashMap Object. Root cause: non Thread safe HashMap in Apache Axis 1.4 Additional research did reveal this known defect affecting Apache Axis 1.4; which is the version that our application was using. As you may already know, usage of non Thread safe / non synchronized HashMap under concurrent Threads condition is very dangerous and can easily lead to internal HashMap index corruption and / or infinite looping. This is also a golden rule for any middleware software such as Oracle Weblogic, IBM WAS, Red Hat JBoss which rely heavily on HashMap data structures from various Java EE and caching services. Such best practice is also applicable for any Open Source third party API such as Apache Axis. The most common solution is to use the ConcurrentHashMap data structure which is designed for that type of concurrent Thread execution context. Solution Our team did apply the proposed patch from Apache (synchronize the non Thread safe HashMap) which did resolve the problem. We are also currently looking at upgrading our application to a newer version of Apache Axis. Conclusion I hope this case study has helped you understand how to pinpoint the root cause of high CPU Threads and the importance of proper Thread safe data structure for high concurrent Thread / processing applications. Please don’t hesitate to post any comment or question. Find office supplies promo codes to save money for your department's bottom line, so you can spend more on the latest software.

December 27, 2012

by Pierre - Hugues Charbonneau

· 8,498 Views · 1 Like

MongoDB and the Concept of Identity in NoSQL Databases

in this article i deal with a different nosql database called mongodb a mature nosql engine born outside the .net world to clarify the concept of id in a typical no sql database. installation of mongo is really simple, just download, uncompress, locate the bin folder, and type this from an administrator console prompt to install mongo as service 1: mongod --install --logpath c:xxxx --dbpath c:yyyy you can find plenty of installation guide on the internet, but with the above install command you create a windows service that will automatically start mongodb on your machine using specified datafolder. now you should download the c# driver to connect from .net code, but if you like using linq you can install fluent mongo directly with nuget. figure 1: install fluent mongo with nuget. fluent mongo is a library that gave little linq capability over standard drivers, but adding a nuget reference to fluentmongo you get automatically a reference to the official drivers. now you are ready to insert your first record in mongodb with the above code. mongoserver server = mongoserver.create(); mongodatabase databasetest = server.getdatabase("test"); var untyped = databasetest.getcollection("untyped"); untyped.save(new bsondocument { { "name", "untyped1" } }); bsondocument seconddocument = bsondocument.parse("{name: 'untyped2', blabla: 'bla bla value'}"); untyped.save(seconddocument); in line1 i create a connection to mongodb server passing no parameter to connect to local mongodb server, then i obtain a reference to a mongodatabase object called “test” with mongoserver::getdatabase() method and finally i get a reference to a collection named “untyped” with the mongngodatabase::getcollection () method. this is quite similar to a sql server or other sql database, you have a server, the server contains several databases, and each database is composed by tables; in the same way mongo is divided into server/database/collection where a collection contains document. mongodb stores data in json format and to insert data inside a collection you can simply create a bsondocument, an object defined by c# driver assembly that is capable to represent a document composed by a series of key-value pair. to initialize a bsondocument you can pass a icollection (line 4) or if you feel more confortable with string json representation, you can user bsondocument.parse() to specify the document directly with a json string. after you inserted the above documents you can use mongovue to see what is contained inside the database. figure 2: use mongovue to see what is inside the database the interesting aspect is that each document has an unique id, even if i did not specified any special property in the code. this is a standard behavior for nosql databases, if you did not specify any id property the database engine will create unique id on his own to idendify the docuemnt. the id is a key factor for mongo and other nosql storage, if you try to store a document directly inside the collection specifying json content you will get an error. untyped.save("{name: 'json', attribute:'attribute content'}"); the mongocollection object contains a save object that accepts a string, but the above call will fail with the error subclass must implement getdocumentid . previous code works because one of the specific functionality that a bsondocument implements is the ability to manage id generation, but plain json does not have this capability. if you need to know the id generated by the database, you can query the bsondocument for its unique id after it was saved in a mongo collection . (remember that the id is not available until you save the document). bsondocument seconddocument = bsondocument.parse("{name: 'untyped2', blabla: 'bla bla value'}"); object id; type idtype; iidgenerator generator; untyped.save(seconddocument); seconddocument.getdocumentid(out id, out idtype, out generator); basically you are asking to your bsondocument to return you the generated id, as well as the type of the id and the generator that mongo used to generate that specific id. the result is represented into this snippet figure 3: the three object that you got with a call to getdocumentid: id, idtype and the generator. as you can see the id is an instance of type mongodb.bson.objectid, based on bsonvalue base class and the generator is an instance of objectidgenerator. this type of id is specific to mongo, and the documentation states that an objectid is a bson objectid is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. note that the timestamp and counter fields must be stored big endian unlike the rest of bson. this is because they are compared byte-by-byte and we want to ensure a mostly increasing order. if you want to have a generator that creates integer id, like identity column in sql server , you will find that it is simply not available out of the box, because an int value is not guarantee to be unique if you use sharding. sharding is a technique that permits to partition data into different physical instances, so each instance should generates ids that are unique across all instances and this prevents the use of a simple int32 id. clearly in .net world a guid is guarantee to be unique and is more .net oriented, so mongo db has a guid id generator, that can be specified with the above snippet of code.. bsondocument thirddocument = bsondocument.parse("{name: 'untyped3', anotherproperty: 'xxxxxxxxxxxxxxxxxxxxxxx'}"); var id2 = mongodb.bson.serialization.idgenerators.guidgenerator.instance.generateid(untyped, thirddocument); thirddocument.setdocumentid(id2); the key is using the guidgenerator (in the mongodb.bson.serialization.idgenerators namespace) to generate a valid mongoid guid value, then call the setdocumentid method of bsondocument to manually set the id and not relay on automatic id generation. if you look at the db you will find that the document with guid id has really a different id type. figure 4: the document with guid id is represented in a different way in mongovue, but as you can verify there is no problem in having documents with different id types in the same collection. this demonstrates that a no sql database has a concept of document id that is similar to the concept of id of a standard sql server, you can use a native id generation of the engine that generates a valid id during insertion or you can assign your own id to the document, but basically the whole concept of id is more engine-related and has no business meaning, so i strongly discourage to use anything that has a business meaning as id of a document. noone prevents you to insert in the document a property called “myid” or something else that has a business meaning and can be used as logical id and let the engine handle the internal id by itself.

December 26, 2012

by Ricci Gian Maria

· 8,209 Views

Getting Started with Quartz Scheduler on MySQL Database

Here are some simple steps to get you fully started with Quartz Scheduler on MySQL database using Groovy. The script below will allow you to quickly experiment different Quartz configuration settings using an external file. First step is to setup the database with tables. Assuming you already have installed MySQL and have access to create database and tables. bash> mysql -u root -p sql> create database quartz2; sql> create user 'quartz2'@'localhost' identified by 'quartz2123'; sql> grant all privileges on quartz2.* to 'quartz2'@'localhost'; sql> exit; bash> mysql -u root -p quartz2 < /path/to/quartz-dist/docs/dbTables/tables_mysql.sql The tables_mysql.sql can be found from Quartz distribution download, or directly from their source here. Once the database is up, you need to write some code to start up the Quartz Scheduler. Here is a simply Groovy script quartzServer.groovy that will run as a tiny scheduler server. // Run Quartz Scheduler as a server // Author: Author: Zemian Deng, Date: 2012-12-15_16:46:09 @GrabConfig(systemClassLoader=true) @Grab('mysql:mysql-connector-java:5.1.22') @Grab('org.slf4j:slf4j-simple:1.7.1') @Grab('org.quartz-scheduler:quartz:2.1.6') import org.quartz.* import org.quartz.impl.* import org.quartz.jobs.* config = args.length > 0 ? args[0] : "quartz.properties" scheduler = new StdSchedulerFactory(config).getScheduler() scheduler.start() // Register shutdown addShutdownHook { scheduler.shutdown() } // Quartz has its own thread, so now put this script thread to sleep until // user hit CTRL+C while (!scheduler.isShutdown()) { Thread.sleep(Long.MAX_VALUE) } And now you just need a config file quartz-mysql.properties that looks like this: # Main Quartz configuration org.quartz.scheduler.skipUpdateCheck = true org.quartz.scheduler.instanceName = DatabaseScheduler org.quartz.scheduler.instanceId = NON_CLUSTERED org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate org.quartz.jobStore.dataSource = quartzDataSource org.quartz.jobStore.tablePrefix = QRTZ_ org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool org.quartz.threadPool.threadCount = 5 # JobStore: JDBC jobStoreTX org.quartz.dataSource.quartzDataSource.driver = com.mysql.jdbc.Driver org.quartz.dataSource.quartzDataSource.URL = jdbc:mysql://localhost:3306/quartz2 org.quartz.dataSource.quartzDataSource.user = quartz2 org.quartz.dataSource.quartzDataSource.password = quartz2123 org.quartz.dataSource.quartzDataSource.maxConnections = 8 You can run the Groovy script as usual bash> groovy quartzServer.groovy quartz-mysql.properties Dec 15, 2012 6:20:26 PM com.mchange.v2.log.MLog INFO: MLog clients using java 1.4+ standard logging. Dec 15, 2012 6:20:27 PM com.mchange.v2.c3p0.C3P0Registry banner INFO: Initializing c3p0-0.9.1.1 [built 15-March-2007 01:32:31; debug? true; trace:10] [main] INFO org.quartz.impl.StdSchedulerFactory - Using default implementation for ThreadExecutor [main] INFO org.quartz.core.SchedulerSignalerImpl - Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl [main] INFO org.quartz.core.QuartzScheduler - Quartz Scheduler v.2.1.6 created. [main] INFO org.quartz.core.QuartzScheduler - JobFactory set to: org.quartz.simpl.SimpleJobFactory@1a40247 [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - Using thread monitor-based data access locking (synchronization). [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - JobStoreTX initialized. [main] INFO org.quartz.core.QuartzScheduler - Scheduler meta-data: Quartz Scheduler (v2.1.6) 'DatabaseScheduler' with instanceId 'NON_CLUSTERED' Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally. NOT STARTED. Currently in standby mode. Number of jobs executed: 0 Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 5 threads. Using job-store 'org.quartz.impl.jdbcjobstore.JobStoreTX' - which supports persistence. and is not clustered. [main] INFO org.quartz.impl.StdSchedulerFactory - Quartz scheduler 'DatabaseScheduler' initialized from the specified file : 'quartz-mysql.properties' from the class resource path. [main] INFO org.quartz.impl.StdSchedulerFactory - Quartz scheduler version: 2.1.6 Dec 15, 2012 6:20:27 PM com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource getPoolManager INFO: Initializing c3p0 pool... com.mchange.v2.c3p0.ComboPooledDataSource [ acquireIncrement -> 3, acquireRetryAttempts -> 30, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null, connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, dataSourceName -> 1hge16k8r18mveoq1iqtotg|1486306, debugUnreturnedConnectionStackTraces -> fals e, description -> null, driverClass -> com.mysql.jdbc.Driver, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false, identityToken -> 1hge16k8r18mveoq1iqtotg|1486306, idleConnectionTestPeriod -> 0, initialPoolSize -> 3, jdbcUrl -> jdbc:mysql://localhost:3306/quartz2, lastAcquisitionFailureDefaultUser -> null, maxAdministrativeTaskTime -> 0 , maxConnectionAge -> 0, maxIdleTime -> 0, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 8, maxStatements -> 0, maxStatementsPerConnection -> 120, minPoolSize -> 1, numHelperThreads -> 3, numThreadsAwaitingCheckoutDefaultUser -> 0, pref erredTestQuery -> null, properties -> {user=******, password=******}, propertyCycle -> 0, testConnectionOnCheckin -> false, testConnectionOnCheckout -> false, unreturnedConnectionTimeout -> 0, usesTraditionalReflectiveProxies -> false ] [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - Freed 0 triggers from 'acquired' / 'blocked' state.[main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - Recovering 0 jobs that were in-progress at the time of the last shut-down. [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - Recovery complete. [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - Removed 0 'complete' triggers. [main] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - Removed 0 stale fired job entries. [main] INFO org.quartz.core.QuartzScheduler - Scheduler DatabaseScheduler_$_NON_CLUSTERED started. ... CTRL+C [Thread-6] INFO org.quartz.core.QuartzScheduler - Scheduler DatabaseScheduler_$_NON_CLUSTERED shutting down. [Thread-6] INFO org.quartz.core.QuartzScheduler - Scheduler DatabaseScheduler_$_NON_CLUSTERED paused. [Thread-6] INFO org.quartz.core.QuartzScheduler - Scheduler DatabaseScheduler_$_NON_CLUSTERED shutdown complete. That's a full run of above setup. Go ahead and play with different config. Read http://quartz-scheduler.org/documentation/quartz-2.1.x/configuration for more details. Here I will post couple more easy config that will get you started in a commonly used config set. A MySQL cluster enabled configuration. With this, you can start one or more shell terminal and run different instance of quartzServer.groovy with the same config. All the quartz scheduler instances should cluster themselve and distribute your jobs evenly. # Main Quartz configuration org.quartz.scheduler.skipUpdateCheck = true org.quartz.scheduler.instanceName = DatabaseClusteredScheduler org.quartz.scheduler.instanceId = AUTO org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate org.quartz.jobStore.dataSource = quartzDataSource org.quartz.jobStore.tablePrefix = QRTZ_ org.quartz.jobStore.isClustered = true org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool org.quartz.threadPool.threadCount = 5 # JobStore: JDBC jobStoreTX org.quartz.dataSource.quartzDataSource.driver = com.mysql.jdbc.Driver org.quartz.dataSource.quartzDataSource.URL = jdbc:mysql://localhost:3306/quartz2 org.quartz.dataSource.quartzDataSource.user = quartz2 org.quartz.dataSource.quartzDataSource.password = quartz2123 org.quartz.dataSource.quartzDataSource.maxConnections = 8 Here is another config set for a simple in-memory scheduler. # Main Quartz configuration org.quartz.scheduler.skipUpdateCheck = true org.quartz.scheduler.instanceName = InMemoryScheduler org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool org.quartz.threadPool.threadCount = 5 Now, if you need more fancy UI management of Quartz, give MySchedule a try.

December 21, 2012

by Zemian Deng

· 50,007 Views · 2 Likes