DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest DevOps and CI/CD Topics

article thumbnail
Set up a Nightly Build Process with Jenkins, SVN and Nexus
we wanted to set up a nightly integration build with our projects so that we could run unit and integration tests on the latest version of our applications and their underlying libraries. we have a number of libraries that are shared across multiple projects and we wanted this build to run every night and use the latest versions of those libraries even if our applications had a specific release version defined in their maven pom file. in this way we would be alerted early if someone added a change to one of the dependency libraries that could potentially break an application when the developer upgraded the dependent library in a future version of the application. the chart below illustrates our dependencies between our libraries and our applications. updating versions nightly both the crossdock-shared and messaging-shared libraries depend on the siesta framework library. the crossdock web service and crossdockmessaging applications both depend on the crossdock-shared and messaging-shared libraries. because of the dependency structure, we wanted the siestaframework library built first. the crossdock-shared and messaging-shared libraries could be built in parallel, but we didn’t want the builds for the crossdock web service and crossdockmessaging applications to begin until all the libraries had finished building. we also wanted the nightly build to tag a subversion with the build date as well as upload the artifact to our nexus “nightly build” repository. the resulting artifact would look something like siestaframework-20120720.jar also as i had mentioned, even though the crossdockmessaging app may specify in its pom file it depends on version 5.0.4 of the siestaframework library. for the purposes of the nightly build, we wanted it to use the freshly built siestaframework-nightly-20120720.jar version of the library. the first problem to tackle was getting the current date into the project’s version number. for this i started with the jenkins zentimestamp plugin . with this plugin the format of jenkin’s build_id timestamp can be changed. i used this to specify using the format of yyyymmdd for the timestamp. the next step was to get the timestamp into the version number of the project. i was able to accomplish this by using the maven versions plugin. one thing the versions plugin can do is allow you to dynamically override the version number in the pom file at build time. the code snippet from the siestaframework’s pom file is below. org.codehaus.mojo versions-maven-plugin 1.3.1 at this point the jenkins job can be configured to invoke the “versions;set” goal, passing in the new version string to use. the ${build_id} jenkins variable will have the newly formatted date string. this will produce an artifact with the name siestaframework-nightly-20120720.jar uploading artifacts to a nightly repository since this job needed to upload the artifact to a different repository from our release repository that's defined in our project pom files, the “altdeploymentrepository” property was used to pass in the location of the nightly repository. the deployment portion of the siestaframework job specifies the location of the nightly repository where ${lynden_nightly_repo} is a jenkins variable containing the nightly repo url. tagging subversion finally, the jenkins subversion tagging plugin was used to tag svn if the project was successfully built. the plugin provides a post-build action for the job with the configuration section shown below. dynamically updating dependencies so now that the main project is set up, the dependent projects are set up in a similar way, but need to be configured to use the siestaframework-nightly-20120720 of the dependency rather than whatever version they currently have specified in their pom file. this can be accomplished by changing the pom to use a property for the version number of the dependency. for example, if the snippet below was the original pom file— com.lynden siestaframework 5.0.1 —changing it to the following would allow the siestaframework version to be set dynamically: 5.0.1 com.lynden siestaframework ${siesta.version} this version can then be overriden by the jenkins job. the example below shows the jenkins configuration for the crossdock-shared build. enforcing build order the final step in this process is setting up a structure to enforce the build order of the projects. the dependencies are set up in such a way that siestaframework needs to be built first, and the crossdock-shared and messaging-shared libraries can be run concurrently once siestaframework finishes. the crossdock web service and crossdockmessaging application jobs can be run concurrently, too, but not until after both shared libraries have finished. setting up the crossdock-shared and messaging-shared jobs to be built after the siestaframework finishes is pretty straightforward. in the jenkins job configuration for both the shared libraries, the following build trigger is added: to satisfy the requirement that the apps build only after all libraries have built, i enlisted the help of the join plugin . the join plugin can be used to execute a job once all “downstream” jobs have completed. what does this mean exactly? looking at the diagram below, the crossdock-shared and the messaging-shared jobs are “downstream” from the siestaframework job. once both of these jobs complete, a join trigger can be used to start other jobs. in this case, rather than having the join trigger kick off other app jobs directly, i created a dummy join job. in this way, as we add more application builds, we don’t need to keep modifying the siestaframework job with the new application job we just added. to illustrate the configuration, siestaframework has a new post-build action (below): join-build is a jenkins job i configured that does not do anything when executed. then our crossdock web service and crossdockmessaging applications define their builds to trigger as soon as join-build has completed. in this way we are able to run builds each night that will update to the latest version of our dependencies as well as tag svn and archive the binaries to nexus. i’d love to hear feedback from anyone who is handling nightly builds via jenkins, and how they have handled the configuration and build issues.
July 25, 2012
by Rob Terpilowski
· 22,837 Views
article thumbnail
The Most Pressed Keys in Various Programming Languages
i switch between programming languages quite a bit; i often wondered what happens when having to deal with the different syntaxes, does the syntax allow you to be more expressive or faster at coding in one language or another. i don't really know about that; but what i do know what keys are pressed when writing with different programming languages. this might be something interesting for people who are deciding to select a programming language might look into, here is a post on the answer to the aged question of: which programming language should i learn? as far as i can tell languages with a wider focused spread across the keyboard are usually syntaxes we usually associate with ugly languages (ugly to read and code). ex. shell and perl. you might argue that the variables names being used will alter the results, but as most languages programming have conventions for naming but we can assume a decent spread for variable names. i don’t offer conclusions, just poorly layout the facts. although the heat map does miss out on things like shift and caps. ex. in perl with the dollar sign. ($) whitespace hasn’t been taken into consideration (tabs and spaces) which would have been a cool thing to see. the data that was used to gather this information was spread amongst various popular github projects. javascript shell java c c++ ruby python php perl objc lisp lisp code here was written by paul graham. references heatmap.js http://www.patrick-wied.at/projects/heatmap-keyboard/
July 12, 2012
by Mahdi Yusuf
· 39,213 Views
article thumbnail
Everything You Need To Know About Couchbase Architecture
After receiving a lot of good feedback and comment on my last blog on MongoDb, I was encouraged to do another deep dive on another popular document oriented db; Couchbase. I have been a long-time fan CouchDb and has wrote a blog on it many years ago. After it merges with Membase, I am very excited to take a deep look into it again. Couchbase is the merge of two popular NOSQL technologies: Membase, which provides persistence, replication, sharding to the high performance memcached technology CouchDB, which pioneers the document oriented model based on JSON Like other NOSQL technologies, both Membase and CouchDB are built from the ground up on a highly distributed architecture, with data shard across machines in a cluster. Built around the Memcached protocol, Membase provides an easy migration to existing Memcached users who want to add persistence, sharding and fault resilience on their familiar Memcached model. On the other hand, CouchDB provides first class support for storing JSON documents as well as a simple RESTful API to access them. Underneath, CouchDB also has a highly tuned storage engine that is optimized for both update transaction as well as query processing. Taking the best of both technologies, Membase is well-positioned in the NOSQL marketplace. Programming model Couchbase provides client libraries for different programming languages such as Java / .NET / PHP / Ruby / C / Python / Node.js For read, Couchbase provides a key-based lookup mechanism where the client is expected to provide the key, and only the server hosting the data (with that key) will be contacted. Couchbase also provides a query mechanism to retrieve data where the client provides a query (for example, range based on some secondary key) as well as the view (basically the index). The query will be broadcasted to all servers in the cluster and the result will be merged and sent back to the client. For write, Couchbase provides a key-based update mechanism where the client sends in an updated document with the key (as doc id). When handling write request, the server will return to client’s write request as soon as the data is stored in RAM on the active server, which offers the lowest latency for write requests. Following is the core API that Couchbase offers. (in an abstract sense) # Get a document by key doc = get(key) # Modify a document, notice the whole document # need to be passed in set(key, doc) # Modify a document when no one has modified it # since my last read casVersion = doc.getCas() cas(key, casVersion, changedDoc) # Create a new document, with an expiration time # after which the document will be deleted addIfNotExist(key, doc, timeToLive) # Delete a document delete(key) # When the value is an integer, increment the integer increment(key) # When the value is an integer, decrement the integer decrement(key) # When the value is an opaque byte array, append more # data into existing value append(key, newData) # Query the data results = query(viewName, queryParameters) In Couchbase, document is the unit of manipulation. Currently Couchbase doesn't support server-side execution of custom logic. Couchbase server is basically a passive store and unlike other document oriented DB, Couchbase doesn't support field-level modification. In case of modifying documents, client need to retrieve documents by its key, do the modification locally and then send back the whole (modified) document back to the server. This design tradeoff network bandwidth (since more data will be transferred across the network) for CPU (now CPU load shift to client). Couchbase currently doesn't support bulk modification based on a condition matching. Modification happens only in a per document basis. (client will save the modified document one at a time). Transaction Model Similar to many NOSQL databases, Couchbase’s transaction model is primitive as compared to RDBMS. Atomicity is guaranteed at a single document and transactions that span update of multiple documents are unsupported. To provide necessary isolation for concurrent access, Couchbase provides a CAS (compare and swap) mechanism which works as follows … When the client retrieves a document, a CAS ID (equivalent to a revision number) is attached to it. While the client is manipulating the retrieved document locally, another client may modify this document. When this happens, the CAS ID of the document at the server will be incremented. Now, when the original client submits its modification to the server, it can attach the original CAS ID in its request. The server will verify this ID with the actual ID in the server. If they differ, the document has been updated in between and the server will not apply the update. The original client will re-read the document (which now has a newer ID) and re-submit its modification. Couchbase also provides a locking mechanism for clients to coordinate their access to documents. Clients can request a LOCK on the document it intends to modify, update the documents and then releases the LOCK. To prevent a deadlock situation, each LOCK grant has a timeout so it will automatically be released after a period of time. Deployment Architecture In a typical setting, a Couchbase DB resides in a server clusters involving multiple machines. Client library will connect to the appropriate servers to access the data. Each machine contains a number of daemon processes which provides data access as well as management functions. The data server, written in C/C++, is responsible to handle get/set/delete request from client. The Management server, written in Erlang, is responsible to handle the query traffic from client, as well as manage the configuration and communicate with other member nodes in the cluster. Virtual Buckets The basic unit of data storage in Couchbase DB is a JSON document (or primitive data type such as int and byte array) which is associated with a key. The overall key space is partitioned into 1024 logical storage unit called "virtual buckets" (or vBucket). vBucket are distributed across machines within the cluster via a map that is shared among servers in the cluster as well as the client library. High availability is achieved through data replication at the vBucket level. Currently Couchbase supports one active vBucket zero or more standby replicas hosted in other machines. Curremtly the standby server are idle and not serving any client request. In future version of Couchbase, the standby replica will be able to serve read request. Load balancing in Couchbase is achieved as follows: Keys are uniformly distributed based on the hash function When machines are added and removed in the cluster. The administrator can request a redistribution of vBucket so that data are evenly spread across physical machines. Management Server Management server performs the management function and co-ordinate the other nodes within the cluster. It includes the following monitoring and administration functions Heartbeat: A watchdog process periodically communicates with all member nodes within the same cluster to provide Couchbase Server health updates. Process monitor: This subsystem monitors execution of the local data manager, restarting failed processes as required and provide status information to the heartbeat module. Configuration manager: Each Couchbase Server node shares a cluster-wide configuration which contains the member nodes within the cluster, a vBucket map. The configuration manager pull this config from other member nodes at bootup time. Within a cluster, one node’s Management Server will be elected as the leader which performs the following cluster-wide management function Controls the distribution of vBuckets among other nodes and initiate vBucket migration Orchestrates the failover and update the configuration manager of member nodes If the leader node crashes, a new leader will be elected from surviving members in the cluster. When a machine in the cluster has crashed, the leader will detect that and notify member machines in the cluster that all vBuckets hosted in the crashed machine is dead. After getting this signal, machines hosting the corresponding vBucket replica will set the vBucket status as “active”. The vBucket/server map is updated and eventually propagated to the client lib. Notice that at this moment, the replication level of the vBucket will be reduced. Couchbase doesn’t automatically re-create new replicas which will cause data copying traffic. Administrator can issue a command to explicitly initiate a data rebalancing. The crashed machine, after reboot can rejoin the cluster. At this moment, all the data it stores previously will be completely discard and the machine will be treated as a brand new empty machine. As more machines are put into the cluster (for scaling out), vBucket should be redistributed to achieve a load balance. This is currently triggered by an explicit command from the administrator. Once receive the “rebalance” command, the leader will compute the new provisional map which has the balanced distribution of vBuckets and send this provisional map to all members of the cluster. To compute the vBucket map and migration plan, the leader attempts the following objectives: Evenly distribute the number of active vBuckets and replica vBuckets among member nodes. Place the active copy and each replicas in physically separated nodes. Spread the replica vBucket as wide as possible among other member nodes. Minimize the amount of data migration Orchestrate the steps of replica redistribution so no node or network will be overwhelmed by the replica migration. Once the vBucket maps is determined, the leader will pass the redistribution map to each member in the cluster and coordinate the steps of vBucket migration. The actual data transfer happens directly between the origination node to the destination node. Notice that since we have generally more vBuckets than machines. The workload of migration will be evenly distributed automatically. For example, when new machines are added into the clusters, all existing machines will migrate some portion of its vBucket to the new machines. There is no single bottleneck in the cluster. Throughput the migration and redistribution of vBucket among servers, the life cycle of a vBucket in a server will be in one of the following states “Active”: means the server is hosting the vBucket is ready to handle both read and write request “Replica”: means the server is hosting the a copy of the vBucket that may be slightly out of date but can take read request that can tolerate some degree of outdate. “Pending”: means the server is hosting a copy that is in a critical transitional state. The server cannot take either read or write request at this moment. “Dead”: means the server is no longer responsible for the vBucket and will not take either read or write request anymore. Data Server Data server implements the memcached APIs such as get, set, delete, append, prepend, etc. It contains the following key datastructure: One in-memory hashtable (key by doc id) for the corresponding vBucket hosted. The hashtable acts as both a metadata for all documents as well as a cache for the document content. Maintain the entry gives a quick way to detect whether the document exists on disk. To support async write, there is a checkpoint linkedlist per vBucket holding the doc id of modified documents that hasn't been flushed to disk or replicated to the replica. To handle a "GET" request Data server routes the request to the corresponding ep-engine responsible for the vBucket. The ep-engine will lookup the document id from the in-memory hastable. If the document content is found in cache (stored in the value of the hashtable), it will be returned. Otherwise, a background disk fetch task will be created and queued into the RO dispatcher queue. The RO dispatcher then reads the value from the underlying storage engine and populates the corresponding entry in the vbucket hash table. Finally, the notification thread notifies the disk fetch completion to the memcached pending connection, so that the memcached worker thread can revisit the engine to process a get request. To handle a "SET" request, a success response will be returned to the calling client once the updated document has been put into the in-memory hashtable with a write request put into the checkpoint buffer. Later on the Flusher thread will pickup the outstanding write request from each checkpoint buffer, lookup the corresponding document content from the hashtable and write it out to the storage engine. Of course, data can be lost if the server crashes before the data has been replicated to another server and/or persisted. If the client requires a high data availability across different crashes, it can issue a subsequent observe() call which blocks on the condition that the server persist data on disk, or the server has replicated the data to another server (and get its ACK). Overall speaking, the client has various options to tradeoff data integrity with throughput. Hashtable Management To synchronize accesses to a vbucket hash table, each incoming thread needs to acquire a lock before accessing a key region of the hash table. There are multiple locks per vbucket hash table, each of which is responsible for controlling exclusive accesses to a certain ket region on that hash table. The number of regions of a hash table can grow dynamically as more documents are inserted into the hash table. To control the memory size of the hashtable, Item pager thread will monitor the memory utilization of the hashtable. Once a high watermark is reached, it will initiate an eviction process to remove certain document content from the hashtable. Only entries that is not referenced by entries in the checkpoint buffer can be evicted because otherwise the outstanding update (which only exists in hashtable but not persisted) will be lost. After eviction, the entry of the document still remains in the hashtable; only the document content of the document will be removed from memory but the metadata is still there. The eviction process stops after reaching the low watermark. The high / low water mark is determined by the bucket memory quota. By default, the high water mark is set to 75% of bucket quota, while the low water mark is set to 60% of bucket quota. These water marks can be configurable at runtime. In CouchDb, every document is associated with an expiration time and will be deleted once it is expired. Expiry pager is responsible for tracking and removing expired document from both the hashtable as well as the storage engine (by scheduling a delete operation). Checkpoint Manager Checkpoint manager is responsible to recycle the checkpoint buffer, which holds the outstanding update request, consumed by the two downstream processes, Flusher and TAP replicator. When all the request in the checkpoint buffer has been processed, the checkpoint buffer will be deleted and a new one will be created. TAP Replicator TAP replicator is responsible to handle vBucket migration as well as vBucket replication from active server to replica server. It does this by propagating the latest modified document to the corresponding replica server. At the time a replica vBucket is established, the entire vBucket need to be copied from the active server to the empty destination replica server as follows The in-memory hashtable at the active server will be transferred to the replica server. Notice that during this period, some data may be updated and therefore the data set transfered to the replica can be inconsistent (some are the latest and some are outdated). Nevertheless, all updates happen after the start of transfer is tracked in the checkpoint buffer. Therefore, after the in-memory hashtable transferred is completed, the TAP replicator can pickup those updates from the checkpoint buffer. This ensures the latest versioned of changed documents are sent to the replica, and hence fix the inconsistency. However the hashtable cache doesn’t contain all the document content. Data also need to be read from the vBucket file and send to the replica. Notice that during this period, update of vBucket will happen in active server. However, since the file is appended only, subsequent data update won’t interfere the vBucket copying process. After the replica server has caught up, subsequent update at the active server will be available at its checkpoint buffer which will be pickup by the TAP replicator and send to the replica server. CouchDB Storage Structure Data server defines an interface where different storage structure can be plugged-in. Currently it supports both a SQLite DB as well as CouchDB. Here we describe the details of CouchDb, which provides a super high performance storage mechanism underneath the Couchbase technology. Under the CouchDB structure, there will be one file per vBucket. Data are written to this file in an append-only manner, which enables Couchbase to do mostly sequential writes for update, and provide the most optimized access patterns for disk I/O. This unique storage structure attributes to Couchbase’s fast on-disk performance for write-intensive applications. The following diagram illustrate the storage model and how it is modified by 3 batch updates (notice that since updates are asynchronous, it is perform by "Flusher" thread in batches). The Flusher thread works as follows: 1) Pick up all pending write request from the dirty queue and de-duplicate multiple update request to the same document. 2) Sort each request (by key) into corresponding vBucket and open the corresponding file 3) Append the following into the vBucket file (in the following contiguous sequence) All document contents in such write request batch. Each document will be written as [length, crc, content] one after one sequentially. The index that stores the mapping from document id to the document’s position on disk (called the BTree by-id) The index that stores the mapping from update sequence number to the document’s position on disk. (called the BTree by-seq) The by-id index plays an important role for looking up the document by its id. It is organized as a B-Tree where each node contains a key range. To lookup a document by id, we just need to start from the header (which is the end of the file), transfer to the root BTree node of the by-id index, and then further traverse to the leaf BTree node that contains the pointer to the actual document position on disk. During the write, the similar mechanism is used to trace back to the corresponding BTree node that contains the id of the modified documents. Notice that in the append-only model, update is not happening in-place, instead we located the existing location and copy it over by appending. In other words, the modified BTree node will be need to be copied over and modified and finally paste to the end of file, and then its parent need to be modified to point to the new location, which triggers the parents to be copied over and paste to the end of file. Same happens to its parents’ parent and eventually all the way to the root node of the BTree. The disk seek can be at the O(logN) complexity. The by-seq index is used to keep track of the update sequence of lived documents and is used for asynchronous catchup purposes. When a document is created, modified or deleted, a sequence number is added to the by-seq btree and the previous seq node will be deleted. Therefore, for cross-site replication, view index update and compaction, we can quickly locate all the lived documents in the order of their update sequence. When a vBucket replicator asks for the list of update since a particular time, it provides the last sequence number in previous update, the system will then scan through the by-seq BTree node to locate all the document that has sequence number larger than that, which effectively includes all the document that has been modified since the last replication. As time goes by, certain data becomes garbage (see the grey-out region above) and become unreachable in the file. Therefore, we need a garbage collection mechanism to clean up the garbage. To trigger this process, the by-id and by-seq B-Tree node will keep track of the data size of lived documents (those that is not garbage) under its substree. Therefore, by examining the root BTree node, we can determine the size of all lived documents within the vBucket. When the ratio of actual size and vBucket file size fall below a certain threshold, a compaction process will be triggered whose job is to open the vBucket file and copy the survived data to another file. Technically, the compaction process opens the file and read the by-seq BTree at the end of the file. It traces the Btree all the way to the leaf node and copy the corresponding document content to the new file. The compaction process happens while the vBucket is being updated. However, since the file is appended only, new changes are recorded after the BTree root that the compaction has opened, so subsequent data update won’t interfere with the compaction process. When the compaction is completed, the system need to copy over the data that was appended since the beginning of the compaction to the new file. View Index Structure Unlike most indexing structure which provide a pointer from the search attribute back to the document. The CouchDb index (called View Index) is better perceived as a denormalized table with arbitrary keys and values loosely associated to the document. Such denormalized table is defined by a user-provided map() and reduce() function. map = function(doc) { … emit(k1, v1) … emit(k2, v2) … } reduce = function(keys, values, isRereduce) { if (isRereduce) { // Do the re-reduce only on values (keys will be null) } else { // Do the reduce on keys and values } // result must be ready for input values to re-reduce return result } Whenever a document is created, updated, deleted, the corresponding map(doc) function will be invoked (in an asynchronous manner) to generate a set of key/value pairs. Such key/value will be stored in a B-Tree structure. All the key/values pairs of each B-Tree node will be passed into the reduce() function, which compute an aggregated value within that B-Tree node. Re-reduce also happens in non-leaf B-Tree nodes which further aggregate the aggregated value of child B-Tree nodes. The management server maintains the view index and persisted it to a separate file. Create a view index is perform by broadcast the index creation request to all machines in the cluster. The management process of each machine will read its active vBucket file and feed each surviving document to the Map function. The key/value pairs emitted by the Map function will be stored in a separated BTree index file. When writing out the BTree node, the reduce() function will be called with the list of all values in the tree node. Its return result represent a partially reduced value is attached to the BTree node. The view index will be updated incrementally as documents are subsequently getting into the system. Periodically, the management process will open the vBucket file and scan all documents since the last sequence number. For each changed document since the last sync, it invokes the corresponding map function to determine the corresponding key/value into the BTree node. The BTree node will be split if appropriate. Underlying, Couchbase use a back index to keep track of the document with the keys that it previously emitted. Later when the document is deleted, it can look up the back index to determine what those key are and remove them. In case the document is updated, the back index can also be examined; semantically a modification is equivalent to a delete followed by an insert. The following diagram illustrates how the view index file will be incrementally updated via the append-only mechanism. Query Processing Query in Couchbase is made against the view index. A query is composed of the view name, a start key and end key. If the reduce() function isn’t defined, the query result will be the list of values sorted by the keys within the key range. In case the reduce() function is defined, the query result will be a single aggregated value of all keys within the key range. If the view has no reduce() function defined, the query processing proceeds as follows: Client issue a query (with view, start/end key) to the management process of any server (unlike a key based lookup, there is no need to locate a specific server). The management process will broadcast the request to other management process on all servers (include itself) within the cluster. Each management process (after receiving the broadcast request) do a local search for value within the key range by traversing the BTree node of its view file, and start sending back the result (automatically sorted by the key) to the initial server. The initial server will merge the sorted result and stream them back to the client. However, if the view has reduce() function defined, the query processing will involve computing a single aggregated value as follows: Client issue a query (with view, start/end key) to the management process of any server (unlike a key based lookup, there is no need to locate a specific server). The management process will broadcast the request to other management process on all servers (include itself) within the cluster. Each management process do a local reduce for value within the key range by traversing the BTree node of its view file to compute the reduce value of the key range. If the key range span across a BTree node, the pre-computed of the sub-range can be used. This way, the reduce function can reuse a lot of partially reduced values and doesn’t need to recomputed every value of the key range from scratch. The original server will do a final re-reduce() in all the return value from each other servers, and then passed back the final reduced value to the client. To illustrate the re-reduce concept, lets say the query has its key range from A to F. Instead of calling reduce([A,B,C,D,E,F]), the system recognize the BTree node that contains [B,C,D] has been pre-reduced and the result P is stored in the BTree node, so it only need to call reduce(A,P,E,F). Update View Index as vBucket migrates Since the view index is synchronized with the vBuckets in the same server, when the vBucket has migrated to a different server, the view index is no longer correct; those key/value that belong to a migrated vBucket should be discarded and the reduce value cannot be used anymore. To keep track of the vBucket and key in the view index, each bTree node has a 1024-bitmask indicating all the vBuckets that is covered in the subtree (ie: it contains a key emitted from a document belonging to the vBucket). Such bit-mask is maintained whenever the bTree node is updated. At the server-level, a global bitmask is used to indicate all the vBuckets that this server is responsible for. In processing the query of the map-only view, before the key/value pair is returned, an extra check will be perform for each key/value pair to make sure its associated vBucket is what this server is responsible for. When processing the query of a view that has a reduce() function, we cannot use the pre-computed reduce value if the bTree node contains a vBucket that the server is not responsible for. In this case, the bTree node’s bit mask is compared with the global bit mask. In case if they are not aligned, then the reduce value need to be recomputed. Here is an example to illustrate this process Couchbase is one of the popular NOSQL technology built on a solid technology foundation designed for high performance. In this post, we have examined a number of such key features: Load balancing between servers inside a cluster that can grow and shrink according to workload conditions. Data migration can be used to re-achieve workload balance. Asynchronous write provides lowest possible latency to client as it returns once the data is store in memory. Append-only update model pushes most update transaction into sequential disk access, hence provide extremely high throughput for write intensive applications. Automatic compaction ensures the data lay out on disk are kept optimized all the time. Map function can be used to pre-compute view index to enable query access. Summary data can be pre-aggregated using the reduce function. Overall, this cut down the workload of query processing dramatically. For a review on NOSQL architecture in general and some theoretical foundation, I have wrote a NOSQL design pattern blog, as well as some fundamental difference between SQL and NOSQL. For other NOSQL technologies, please read my other blog on MongoDb, Cassandra and HBase, Memcached Special thanks to Damien Katz and Frank Weigel from Couchbase team who provide a lot of implementation details of Couchbase.
July 7, 2012
by Ricky Ho
· 84,698 Views · 5 Likes
article thumbnail
20 Subjects Every Software Engineer Should Know
Here are the most important subjects for software engineering, with brief explanations: 1.Object oriented analysis & design: For better maintainability, reusability and faster development, the most well accepted approach, shortly OOAD and its SOLID principals are very important for software engineering. 2.Software quality factors: Software engineering depends on some very important quality factors. Understanding and applying them is crucial. 3.Data structures & algorithms: Basic data structures like array, list, stack, tree, map, set etc. and useful algorithms are vital for software development. Their logical structure should be known. 4. Big-O notation: Big-O notation indicates the performance of an algorithm/code section. Understanding it is very important for comparing performances. 5.UML notation: UML is the universal and complete language for software design & analysis. If there is lack of UML in a development process, it feels there is no engineering. 6.Software processes and metrics: Software enginnering is not a random process. It requires a high level of systematic and some numbers to monitor those techniques. So, processes and metrics are essential. 7.Design patterns: Design patterns are standard and most effective solutions for specific problems. If you don't want to reinvent the wheel, you should learn them. 8.Operating systems basics: Learning OS basics is very important because all applications runs on it. By learning it, we can have better vision, viewpoints and performance for our applications. 9.Computer organization basics: All applications including OS requires a hardware for physical interaction. So, learning computer organization basics is vital again for better vision, viewpoints and performance. 10.Network basics: Network is related with computer organization, OS and the whole information transfer process. In any case we will face it while software development. So, it is important to learn network basics. 11.Requirement analysis: Requirement analysis is the starting point and one of the most important parts of software engineering. Performing it correctly and practically needs experience but it is very essential. 12.Software testing: Testing is another important part of software engineering. Unit testing, its best practices and techniques like black box, white box, mocking, TDD, integration testing etc. are subjects which must be known. 13.Dependency management: Library (JAR, DLL etc.) management, and widely known tools (Maven, Ant, Ivy etc.) are essential for large projects. Otherwise, antipatterns like Jar Hell are inevitable. 14.Continuous integration: Continuous integration brings easiness and automaticity for testing large modules, components and also performs auto-versioning. Its aim and tools (like Hudson etc.) should be known. 15.ORM (Object relational mapping): ORM and its widely known implementation Hibernate framework is an important technique for mapping objects into database tables. It reduces code length and maintenance time. 16.DI (Dependency Injection): DI or IoC (Inversion of Control) and its widely known implementation Spring framework makes life easy for object creation and lifetime management on big enterprise applications. 17.Version controlling systems: VCS tools (SVN, TFS, CVS etc.) are very important by saving so much time for collaborative works and versioning. Their logical viewpoint and standard cammands should be known. 18.Internationalization (i18n): i18n by extracting strings into external files is the best way of supporting multiple languages in our applications. Its practices on different IDEs and technologies must be known. 19.Architectural patterns: Understanding architectural design patterns (like MVC, MVP, MVVM etc.) is essential for producing a maintainable, clean, extendable and testable source code. 20.Writing clean code: Working code is not enough, it must be readable and maintainable also. So, code formatting and readable code development techniques are needed to be known and applied.
July 2, 2012
by Cagdas Basaraner
· 108,559 Views · 5 Likes
article thumbnail
Spring Integration - Robust Splitter Aggregator
A Robust Splitter Aggregator Design Strategy - Messaging Gateway Adapter Pattern What do we mean by robust? In the context of this article, robustness refers to an ability to manage exception conditions within a flow without immediately returning to the caller. In some processing scenarios n of m responses is good enough to proceed to conclusion. Example processing scenarios that typically have these tendencies are: Quotations for finance, insurance and booking systems. Fan-out publishing systems. Why do we need Robust Splitter Aggregator Designs? First and foremost an introduction to a typical Splitter Aggregator pattern maybe necessary. The Splitter is an EIP pattern that describes a mechanism for breaking composite messages into parts in order that they can be processed individually. A Router is an EIP pattern that describes routing messages into channels - aiming them at specific messaging endpoints. The Aggregator is an EIP pattern that collates and stores a set of messages that belong to a group, and releases them when that group is complete. Together, those three EIP constructs form a powerful mechanism for dividing processing into distinct units of work. Spring Integration (SI) uses the same pattern terminology as EIP and so readers of that methodology will be quite comfortable with Spring Integration Framework constructs. The SI Framework allows significant customisations of all three of those constructs and furthermore, by simply using asynchronous channels as you would in any other multi-threaded configuration, allows those units of work to be executed in parallel. An interesting challenge working with SI Splitter Aggregator designs is building appropriately robust flows that operate predictably in a number of invocation scenarios. A simple splitter aggregator design can be used in many circumstances and operate without heavy customisation of the SI constructs. However, some service requirements demand a more robust processing strategy and therefore more complex configuration. The following sections describe and show what a Simple Splitter Aggregator design actually looks like, the type of processing your design must be able to deal with and then goes on to suggest candidate solutions for more robust processing. A Simple Splitter Aggregator Design The following Splitter Aggregator design shows a simple flow that receives document request messages into messaging gateway, splits the message into two processing routes and then aggregates the response. Note that the diagram has been built from EIP constructs in OmniGraffle rather than being an Integration Graph view from within STS; the channels are missing from the diagram for the sake of brevity. SI Constructs in detail: Messaging Gateways - there are three messaging gateways. A number of configurations are available for gateway specifications but significantly can return business objects, exceptions and nulls (following a timeout). The gateway to the far left is the service gateway for which we are defining the flow. The other two gateways, between the Router and Aggregator, are external systems that will be providing responses to business questions that our flow generates. The Splitter - a single splitter exists and is responsible for consuming the document message and producing a collection of messages for onward processing. The Java signature for the, most often, custom Splitter specifies a single object argument and a collection for return. The Recipient List Router - a single router exists, any appropriate router can be used, chose the one that closely matches your requirements - you can easily route by expression or payload type. The primary purpose of the router is route a collection of messages supplied by the splitter. This is a pretty typical Splitter Aggregator configuration. Aggregator - a single construct that is responsible for collecting messages together in a group in order that further processing can take place on the gateway responses. Although the Aggregator can be configured with attributes and bean definitions to provide alternative grouping and release strategies, most often the default aggregation strategy suffices. Interesting Aspects of Splitter Aggregator Operation Gateway - the inbound gateway, the one on the far left, may or may not have an error handling bean reference defined on it. If it does then that bean will have an opportunity to handle an exceptions thrown within the flow to the right of that gateway. If not, any exception will be thrown straight out of the gateway. Gateway - an optional default-reply-timeout can be set on each of the gateways, there are significant implications for setting this value, ensure that they're well understood. An expired timeout will result in a null being returned from the gateway. This is the very same condition that can lead to a thread getting parked if an upstream gateway also has no default-reply-timeout set. Splitter Input Channel - this can be a simple direct channel or a direct channel with a dispatcher defined on it. If the channel has a dispatcher specified the flow downstream of this point will be asynchronous, multi-threaded. This also changes the upstream gateway semantics as it usually means that an otherwise impotent default-reply-timeout becomes active. Splitter - the splitter must return a single object. The single object returned by the splitter is a collection, a java.util.List. The SI framework will take each member of that list and feed it into the output-channel of the Splitter - as with this example, usually straight into a router. The contract for Splitter List returns is as its use in Java - it may contain zero, one or more elements. If the splitter returns an empty list it's unlikely that the router will have any work to do and so the flow invocation will be complete. However, if the List contains one item, the SI framework will extract that item from the list and push it into the router, if this gets routed successfully, the flow will continue. Router - the router will simply route messages into one of two gateways in this example. Gateways - the two gateways that are used between the Splitter and Aggregator are interesting. In this example I have used the generic gateway EIP pattern to represent a message sub-system but not defined it explicitly - we could use an HTTP outbound gateway, another SI flow or any other external system. Of course, for each of those sub-systems, a number of responses is possible. Depending on the protocol and external system, the message request may fail to send, the response fail to arrive, a long running process invoked, a network error or timeout or a general processing exception. Aggregator - the single aggregator will wait for a number of responses depending on what's been created by the Splitter. In the case where the splitter return list is empty the Aggregator will not get invoked. In the case where the Splitter return list has one entry, the aggregator will be waiting for one gateway response to complete the group. In the case where the Splitter list has n entries the Aggregator will be waiting for n entries to complete the group. Custom correlation strategies, release strategies and message stores can be injected amongst a set of rich configuration aspects. Interesting Aspects of Simple Splitter Aggregator Operation The primary deciding factor for establishing whether this type of simple gateway is adequate for requirements is to understand what happens in the event of failure. If any exception occurring in your SI flow results in the flow invocation being abandoned and that suits your requirements, there's no need to read any further. If, however, you need to continue processing following failure in one of the gateways the remainder of this article may be of more interest. Exceptions, from any source, generated between the splitter and aggregator, will result in an empty or partial group being discarded by the Aggregator. The exception will propagate back to the closest upstream gateway for either handling by a custom bean or re-throwing by the gateway. Note that a custom release strategy on the Aggregator is difficult to use and especially so alongside timeouts but would not help in this case as the exception will propagate back to the leftmost gateway before the aggregator is invoked. It's also possible to configure exception handlers on the innermost gateways, the exception message could be caught but how do you route messages from a custom exception handler into the aggregator to complete the group, inject the aggregator channel definition into the custom exception handler? This is a poor approach and would involve unpacking an exception message payload, copying the original message headers into a new SI message and then adding the original payload - only four or five lines of code, but dirty it is. Following exception generation, exception messages (without modification) cannot be routed into an Aggregator to complete the group. The original message, the one that contains the correlation and sequence ids for the group and group position are buried inside the SI messages exception payload. If processing needs to continue following exception generation, it should be clear that in order to continue processing, the following must take place: the aggregation group needs to be completed, any exceptions must be caught and handled before getting back to the closet upstream gateway, the correlation and sequence identifiers that allow group completion in the aggregator are buried within the exception message payload and will require extraction and setting on the message that's bound for the aggregator A More Robust Solution - Messaging Gateway Adapter Pattern Dealing with exceptions and null returns from gateways naturally leads to a design that implements a wrapper around the messaging gateway. This affords a level of control that would otherwise be very difficult to establish. This adapter technique allows all returns from messaging gateways to be caught and processed as the messaging gateway is injected into the Service Activator and called directly from that. The messaging gateway no longer responds to the aggregator directly, it responds to a custom Java code Spring bean configured in the Service Activator namespace definition. As expected, processing that does not undergo exception will continue as normal. Those flows that experience exception conditions or unexpected or missing responses from messaging gateways need to process messages in such as way that message groups bound for aggregation can be completed. If the Service Activator were to allow the exception to be propagated outside of it's backing bean, the group would not complete. The same applies not just for exceptions but any return object that does not carry the prerequisite group correlation id and sequence headers - this is where the adaptation is applied. Exception messages or null responses from messaging gateways are caught and handled as shown in the following example code: import com.l8mdv.sample.*; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.integration.Message; import org.springframework.integration.MessageHeaders; import org.springframework.integration.support.MessageBuilder; import org.springframework.util.Assert; public class AvsServiceImpl implements AvsService { private static final Logger logger = LoggerFactory.getLogger(AvsServiceImpl.class); public static final String MISSING_MANDATORY_ARG = "Mandatory argument is missing."; private AvsGateway avsGateway; public AvsServiceImpl(final AvsGateway avsGateway) { this.avsGateway = avsGateway; } public Message service(Message message) { Assert.notNull(message, MISSING_MANDATORY_ARG); Assert.notNull(message.getPayload(), MISSING_MANDATORY_ARG); MessageHeaders requestMessageHeaders = message.getHeaders(); Message responseMessage = null; try { logger.debug("Entering AVS Gateway"); responseMessage = avsGateway.send(message); if (responseMessage == null) responseMessage = buildNewResponse(requestMessageHeaders, AvsResponseType.NULL_RESULT); logger.debug("Exited AVS Gateway"); return responseMessage; } catch (Exception e) { return buildNewResponse(responseMessage, requestMessageHeaders, AvsResponseType.EXCEPTION_RESULT, e); } } private Message buildNewResponse(MessageHeaders requestMessageHeaders, AvsResponseType avsResponseType) { Assert.notNull(requestMessageHeaders, MISSING_MANDATORY_ARG); Assert.notNull(avsResponseType, MISSING_MANDATORY_ARG); AvsResponse avsResponse = new AvsResponse(); avsResponse.setError(avsResponseType); return MessageBuilder.withPayload(avsResponse) .copyHeadersIfAbsent(requestMessageHeaders).build(); } private Message buildNewResponse(Message responseMessage, MessageHeaders requestMessageHeaders, AvsResponseType avsResponseType, Exception e) { Assert.notNull(responseMessage, MISSING_MANDATORY_ARG); Assert.notNull(responseMessage.getPayload(), MISSING_MANDATORY_ARG); Assert.notNull(requestMessageHeaders, MISSING_MANDATORY_ARG); Assert.notNull(avsResponseType, MISSING_MANDATORY_ARG); Assert.notNull(e, MISSING_MANDATORY_ARG); AvsResponse avsResponse = new AvsResponse(); avsResponse.setError(avsResponseType, responseMessage.getPayload(), e); return MessageBuilder.withPayload(avsResponse) .copyHeadersIfAbsent(requestMessageHeaders).build(); } } Notice the last line of the catch clause of the exception handling block. This line of code copies the correlation and sequence headers into the response message, this is mandatory if the aggregation group is going to be allowed to complete and will always be necessary following an exception as shown here. Consequences of using this technique There's no doubt that introducing a Messaging Gateway Adapter into SI config makes the configuration more complex to read and follow. The key factor here is that there is no longer a linear progression through the configuration file. This because the Service Activator must forward reference a Gateway or a Gateway defined before it's adapting Service Activator - in both cases the result is the same. Resources Note:- The design for the software that drove creation of this meta-pattern was based on a requirement that a number of external risk assessment services would be accessed by a single, central Risk Assessment Service. In order to satisfy clients of the service, invocation had to take place in parallel and continue despite failure in any one of those external services. This requirement lead to the design of the Messaging Gateway Adapter Pattern for the project. Spring Integration Reference Manual The solution approach for this problem was discussed directly with Mark Fisher (SpringSource) in the context of building Risk Assessment flows for a large US financial institution. Although the configuration and code is protected by NDA and copyright, it's acceptable to express the design intention and similar code in this article.
June 3, 2012
by Matt Vickery
· 23,382 Views
article thumbnail
Spring Integration Gateways - Null Handling & Timeouts
Spring Integration (SI) Gateways Spring Integration Gateways () provide a semantically rich interface to message sub-systems. Gateways are specified using namespace constructs, these reference a specific Java interface () that is backed by an object dynamically implemented at run-time by the Spring Integration framework. Furthermore, these Java interfaces can, if you so wish, be defined entirely independent of any Spring artefacts - that's both code and configuration. One of the primary advantages of using the SI gateway as an interface to message sub-systems is that it's possible to automatically adopt the benefit of rich, default and customisable, gateway configuration. One such configuration attribute deserves further scrutiny and discussion primarily because it's easy to misunderstand and misconfigure around - default-reply-timeout. Primary Motivator for Gateway Analysis During recent consulting engagements, I've encountered a number of deployments that use Spring Integration Gateway specifications that may, in some circumstances, lead to production operational instability. This has often been in high-pressure environments or those where technology support is not backed by adequate training, testing, review or technology mentoring. How do gateways behave in Spring Integration (R2.0.5) One of the key sections, regarding gateways, in the Spring Integration manual clearly explains gateway semantics. Below is a 2-dimensional table of possible non-standard gateway returns for each of the scenarios that the SI Manual (r2.0.5) refers to. Gateway Non-standard Responses Runtime Events default-reply-timeout=x Single-threaded default-reply-timeout=x Multi-threaded default-reply-timeout=null Single-threaded default-reply-timeout=null Multi-threaded 1. Long Running Process Thread Parked null returned Thread Parked Thread Parked 2. Null Returned Downstream null returned null returned Thread Parked Thread Parked 3. void method Downstream null returned null returned Thread Parked Thread Parked 4. Runtime Exception Error handler invoked or exception thrown. Error handler invoked or exception thrown. Error handler invoked or exception thrown. Error handler invoked or exception thrown. The key parts of this table are the conditions that lead to invoking threads being parked (noted in red), nulls returned (noted in orange) and exceptions (noted in green). Each contributor consists of configuration that is under the developers control, deployed code that is under developers control and conditions that are usually not under developers control. Clearly, the column headings in the table above are divided into two sections; two gateway configuration attributes. The default-reply-timeout is set by the SI configured and is the amount of time that a client call is wiling to wait for a response from the gateway. Secondly, synchronous flows are represented by Single-threaded flows, asynchronous by Multi-threaded flows. A synchronous, or single-threaded flow, is one such as the following: The implicit input channel (gateway-request-channel) has no associated dispatcher configured. An asynchronous, or multi-threaded flow, is one such as the following: The explicit input channel has a dispatcher configured ("taskExecutor"). This task executor specifies a thread pool that supplies threads for execution and whose configuration as above marks a thread boundary. Note: This is not the only way of making channels asynchronous The other configuration attribute referenced is default-reply-timeout, this is set on the gateway namespace configuration such as the example above. Note that both of these runtime aspects are set by the configurer during SI flow design and implementation. They are entirely under developer control. The 'Runtime Events' column indicates gateway relevant runtime events that have to be considered during gateway configuration - these are obviously not under developer control. Trigger conditions for these events are not as unusual as one may hope. 1. Long Running Processes It's not uncommon for thread pools to become exhausted because all pooled threads are waiting for an external resource accessed through a socket, this may be a long running database query, a firewall keeping a connection open despite the server terminating etc. There is significant potential for these types of trigger. Some long-running processes terminate naturally, sometimes they never completed - an application restart is required. 2. Null returned downstream A null may be returned from a downstream SI construct such as a Transformer, Service Activator or Gateway. A Gateway may return null in some circumstances such as following a gateway timeout event. 3. Void method downstream Any custom code invoked during an SI flow may use a void method signature. This can also be caused by configuration in circumstances where flows are determined dynamically at runtime. 4. Runtime Exception RuntimeException's can be triggered during normal operation and are generally handled by catching them at the gateway or allowing them to propagate through. The reason that they are coloured green in the table above is that they are generally much easier to handle than timeouts. Gateway Timeout Handling Strategies There are four possible outcomes from invoking a gateway with a request message, all of these as a result of specific runtime events: a) an ordinary message response, b) an exception message, c) a null or d) no-response. Ordinary business responses and exceptions are straight forward to understand and will not be covered further in this article. The two significant outcomes that will be explored further are strategies for dealing with nulls and no-response. Generally speaking, long running processes either terminate or not. Long running processes that terminate may eventually return a message through the invoked gateway or timeout depending on timeout configuration, in which case a null may be returned. The severity of this as a problem depends on throughput volume, length of long running process and system resources (thread-pool size). Configuration exists for default-reply-timeout In the case where a long running process event is underway and a default-reply-timeout has been set, as long as the long running process completes before the default-reply-timeout expires, there is no problem to deal with. However, if the long running process does not complete before that timeout expires one of three outcomes will apply. Firstly, if the long running process terminates subsequent to the reply timeout expiry, the gateway will have already returned null to the invoker so the null response needs handling by the invoker. The thread handling the long-running process will be returned to the pool. Secondly, if the long running process does not terminate and a reply timeout has been set, the gateway will return null to the gateway invoker but the thread executing the long-running process will not get returned to the pool. Thirdly, and most significantly, if a default-reply-timeout has been configured but the long running process is running on the same thread as the invoker, i.e. synchronous channels supply messages to that process, the thread will not return, the default-reply-timeout has no affect. Assuming the most common processing scenario, a long running process completes either before or after the reply timeout expiry. When a null is returned by the gateway, the invoker is forced to deal with a null response. It's often unacceptable to force gateway consumers to deal with null responses and is not necessary as with a little additional configuration, this can be avoided. Absent Configuration for default-reply-timeout The most significant danger exists around gateways that have no default-reply-timeout configuration set. A long running process or a null returned from downstream will mean that the invoking thread is parked. This is true for both synchronous and asynchronous flows and may ultimately force an application to be restarted because the invoker thread pool is likely to start on a depletion course if this continues to occur. Spring Integration Timeout Handling Design Strategies For those Spring Integration configuration designers that are comfortable with gateway invokers dealing with null responses, exceptions and set default-reply-timeouts on gateways, there's no need to read further. However, if you wish to provide clients of your gateway a more predictable response, a couple of strategies exist for handling null responses from gateways in order that invokers are protected from having to deal with them. Firstly, the simpliest solution is to wrap the gateway with a service activator. The gateway must have the default-reply-timeout attribute value set in order to avoid unnecessary parking of threads. In order to avoid the consequence of long-running threads it's also very prudent to use a dispatcher soon after entry to the gateway - this breaks the thread boundary. Whilst this is a valid technical approach, the impact is that we have forced a different entry point to our message sub-system. Entry is now via a Service Activator rather than a Gateway. A side affect of this change is that the testing entry point changes. Integration tests that would normally reference a gateway to send a message now have to locate the backing implementation for the Service Activator, not ideal. An alternative approach toward solving this problem would be to configure two gateways with a Service Activator between them. Only one of the gateways would be exposed to invokers, the outer one. Both Gateways would reference the same service interface. The outer gateway specification would not specify the default-reply-timeout but would specify the input and output channels in the same way that a single gateway would. The Service Activator between the Gateways would handle null gateway responses and possibly any exceptions if preferred to the gateway error handler approach. An example is as follows: The Service Activator bean (enrollmentServiceGatewayHandler) deals with both null and exception responses from the adapted gateway (enrollmentServiceAdaptedGateway), in the situation where these are generated a business response detailing the error is generated. Spring Integration R2.1 Changes async-executor on gateway spec
May 26, 2012
by Matt Vickery
· 24,382 Views · 1 Like
article thumbnail
The Limited Usefulness of AsyncContext.start()
Some time ago I came across What's the purpose of AsyncContext.start(...) in Servlet 3.0? question. Quoting the Javadoc of aforementioned method: Causes the container to dispatch a thread, possibly from a managed thread pool, to run the specified Runnable. To remind all of you, AsyncContext is a standard way defined in Servlet 3.0 specification to handle HTTP requests asynchronously. Basically HTTP request is no longer tied to an HTTP thread, allowing us to handle it later, possibly using fewer threads. It turned out that the specification provides an API to handle asynchronous threads in a different thread pool out of the box. First we will see how this feature is completely broken and useless in Tomcat and Jetty - and then we will discuss why the usefulness of it is questionable in general. Our test servlet will simply sleep for given amount of time. This is a scalability killer in normal circumstances because even though sleeping servlet is not consuming CPU, but sleeping HTTP thread tied to that particular request consumes memory - and no other incoming request can use that thread. In our test setup I limited the number of HTTP worker threads to 10 which means only 10 concurrent requests are completely blocking the application (it is unresponsive from the outside) even though the application itself is almost completely idle. So clearly sleeping is an enemy of scalability. @WebServlet(urlPatterns = Array("/*")) class SlowServlet extends HttpServlet with Logging { protected override def doGet(req: HttpServletRequest, resp: HttpServletResponse) { logger.info("Request received") val sleepParam = Option(req.getParameter("sleep")) map {_.toLong} TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10) logger.info("Request done") } } Benchmarking this code reveals that the average response times are close to sleep parameter as long as the number of concurrent connections is below the number of HTTP threads. Unsurprisingly the response times begin to grow the moment we exceed the HTTP threads count. Eleventh connection has to wait for any other request to finish and release worker thread. When the concurrency level exceeds 100, Tomcat begins to drop connections - too many clients are already queued. So what about the the fancy AsyncContext.start() method (do not confuse with ServletRequest.startAsync())? According to the JavaDoc I can submit any Runnable and the container will use some managed thread pool to handle it. This will help partially as I no longer block HTTP worker threads (but still another thread somewhere in the servlet container is used). Quickly switching to asynchronous servlet: @WebServlet(urlPatterns = Array("/*"), asyncSupported = true) class SlowServlet extends HttpServlet with Logging { protected override def doGet(req: HttpServletRequest, resp: HttpServletResponse) { logger.info("Request received") val asyncContext = req.startAsync() asyncContext.setTimeout(TimeUnit.MINUTES.toMillis(10)) asyncContext.start(new Runnable() { def run() { logger.info("Handling request") val sleepParam = Option(req.getParameter("sleep")) map {_.toLong} TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10) logger.info("Request done") asyncContext.complete() } }) } } We are first enabling the asynchronous processing and then simply moving sleep() into a Runnable and hopefully a different thread pool, releasing the HTTP thread pool. Quick stress test reveals slightly unexpected results (here: response times vs. number of concurrent connections): Guess what, the response times are exactly the same as with no asynchronous support at all (!) After closer examination I discovered that when AsyncContext.start() is called Tomcat submits given task back to... HTTP worker thread pool, the same one that is used for all HTTP requests! This basically means that we have released one HTTP thread just to utilize another one milliseconds later (maybe even the same one). There is absolutely no benefit of calling AsyncContext.start() in Tomcat. I have no idea whether this is a bug or a feature. On one hand this is clearly not what the API designers intended. The servlet container was suppose to manage separate, independent thread pool so that HTTP worker thread pool is still usable. I mean, the whole point of asynchronous processing is to escape the HTTP pool. Tomcat pretends to delegate our work to another thread, while it still uses the original worker thread pool. So why I consider this to be a feature? Because Jetty is "broken" in exactly same way... No matter whether this works as designed or is only a poor API implementation, using AsyncContext.start() in Tomcat and Jetty is pointless and only unnecessarily complicates the code. It won't give you anything, the application works exactly the same under high load as if there was no asynchronous logic at all. But what about using this API feature on correct implementations like IBM WAS? It is better, but still the API as is doesn't give us much in terms of scalability. To explain again: the whole point of asynchronous processing is the ability to decouple HTTP request from an underlying thread, preferably by handling several connections using the same thread. AsyncContext.start() will run the provided Runnable in a separate thread pool. Your application is still responsive and can handle ordinary requests while long-running request that you decided to handle asynchronously are processed in a separate thread pool. It is better, unfortunately the thread pool and thread per connection idiom is still a bottle-neck. For the JVM it doesn't matter what type of threads are started - they still occupy memory. So we are no longer blocking HTTP worker threads, but our application is not more scalable in terms of concurrent long-running tasks we can support. In this simple and unrealistic example with sleeping servlet we can actually support thousand of concurrent (waiting) connections using Servlet 3.0 asynchronous support with only one extra thread - and without AsyncContext.start(). Do you know how? Hint: ScheduledExecutorService. Postscriptum: Scala goodness I almost forgot. Even though examples were written in Scala, I haven't used any cool language features yet. Here is one: implicit conversions. Make this available in your scope: implicit def blockToRunnable[T](block: => T) = new Runnable { def run() { block } } And suddenly you can use code block instead of instantiating Runnable manually and explicitly: asyncContext start { logger.info("Handling request") val sleepParam = Option(req.getParameter("sleep")) map { _.toLong} TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10) logger.info("Request done") asyncContext.complete() } Sweet!
May 22, 2012
by Tomasz Nurkiewicz
· 17,534 Views · 1 Like
article thumbnail
Spring Integration: Splitter-Aggregator
Within Spring Integration, one form of EIP scatter-gather is provided by the splitter and aggregator constructs.
May 18, 2012
by Matt Vickery
· 47,622 Views · 2 Likes
article thumbnail
Continuous Delivery vs. Traditional Agile
in working with development teams at organizations which are adopting continuous delivery , i have found there can be friction over practices that many developers have come to consider as the right way for agile teams to work. i believe the root of conflicts between what i’ve come to think of as traditional agile and cd is the approach to making software “ready for release”. evolution of software delivery a usefully simplistic view of the evolution of ideas about making software ready for release is this: waterfall believes a team should only start making its software ready for release when all of the functionality for the release has been developed (i.e. when it is “feature complete”). agile introduces the idea that the team should get their software ready for release throughout development. many variations of agile (which i refer to as “traditional agile” in this post) believe this should be done at periodic intervals. continuous delivery is another subset of agile which in which the team keeps its software ready for release at all times during development. it is different from “traditional” agile in that it does not involve stopping and making a special effort to create a releasable build. continuous delivery is not about shorter cycles going from traditional agile development to continuous delivery is not about adopting a shorter cycle for making the software ready for release. making releasable builds every night is still not continuous delivery. cd is about moving away from making the software ready as a separate activity, and instead developing in a way that means the software is always ready for release. ready for release does not mean actually releasing a common misunderstanding is that continuous delivery means releasing into production very frequently. this confusion is made worse by the use of organizations that release software multiple times every day as poster children for cd. continuous delivery doesn’t require frequent releases, it only requires ensuring software could be released with very little effort at any point during development. (see jez humble’s article on continuous delivery vs. continuous deployment .) although developing this capability opens opportunities which may encourage the organization to release more often, many teams find more than enough benefit from cd practices to justify using it even when releases are fairly infrequent. friction points between continuous delivery and traditional agile as i mentioned, there are sometimes conflicts between continuous delivery and practices that development teams take for granted as being “proper” agile. friction point: software with unfinished work can still be releasable one of these points of friction is the requirement that the codebase not include incomplete stories or bugfixes at the end of the iteration. i explored this in my previous post on iterations . this requirement comes from the idea that the end of the iteration is the point where the team stops and does the extra work needed to prepare the software for release. but when a team adopts continuous delivery, there is no additional work needed to make the software releasable. more to the point, the cd team ensures that their code could be released to production even when they have work in progress, using techniques such as feature toggles . this in turn means that the team can meet the requirement that they be ready for release at the end of the iteration even with unfinished stories. this can be a bit difficult for people to swallow. the team can certainly still require all work to be complete at the iteration boundary, but this starts to feel like an arbitrary constraint that breaks the team’s flow. continuous delivery doesn’t require non-timeboxed iterations, but the two practices are complementary. friction point: snapshot/release builds many development teams divide software builds into two types, “snapshot” builds and “release” builds. this is not specific to agile, but has become strongly embedded in the java world due to the rise of maven, which puts the snapshot/build concept at the core of its design. this approach divides the development cycle into two phases, with snapshots being used while software is in development, and a release build being created only when the software is deemed ready for release. this division of the release cycle clearly conflicts with the continuous delivery philosophy that software should always be ready for release. the way cd is typically implemented involves only creating a build once, and then promoting it through multiple stages of a pipeline for testing and validation activities, which doesn’t work if software is built in two different ways as with maven. it’s entirely possible to use maven with continuous delivery, for example by creating a release build for every build in the pipeline. however this leads to friction with maven tools and infrastructure that assume release builds are infrequent and intended for production deployment. for example, artefact repositories such as nexus and artefactory have housekeeping features to delete old snapshot builds, but don’t allow release builds to be deleted. so an active cd team, which may produce dozens of builds a day, can easily chew through gigabytes and terabytes of disk space on the repository. friction point: heavier focus on testing deployability a standard practice with continuous delivery is automatically deploying every build that passes basic continuous integration to an environment that emulates production as closely as possible, using the same deployment process and tooling. this is essential to proving whether the code is ready for release on every commit, but this is more rigorous than many development teams are used to having in their ci. for example, pre-cd continuous integration might run automated functional tests against the application by deploying it to an embedded application server using a build tool like ant or maven. this is easier for developers to use and maintain, but is probably not how the application will be deployed in production. so a cd team will typically add an automated deployment to an environment will more fully replicates production, including separated web/app/data tiers, and deployment tooling that will be used in production. however this more production-like deployment stage is more likely to fail due to its added complexity, and may be may be more difficult for developers to maintain and fix since it uses tooling more familiar to system administrators than to developers. this can be an opportunity to work more closely with the operations team to create a more reliable, easily supported deployment process. but it is likely to be a steep curve to implement and stabilize this process, which may impact development productivity. is cd worth it? given these friction points, what benefit is there to moving from traditional agile to continuous delivery worthwhile, especially for a team that is unlikely to actually release into production more often than every iteration? decrease risk by uncovering deployment issues earlier, increase flexibility by giving the organization the option to release at any point with minimal added cost or risk, involves everyone involved in production releases - such as qa, operations, etc. - in making the full process more efficient. the entire organization must identify difficult areas of the process and find ways to fix them, through automation, better collaboration, and improved working practices, by continuously rehearsing the release process, the organization becomes more competent at doing it, so that releasing becomes autonomic, like breathing, rather than traumatic, like giving birth, improves the quality of the software, by forcing the team to fix problems as they are found rather than being able to leave things for later. dealing with the friction the friction points i’ve described seem to come up fairly often when continuous delivery is being introduced. my hope is that understanding the source of this friction will be helpful in discussing it when it comes up, and working through the issues. if developers who are initially uncomfortable with breaking with the “proper” way of doing things, or find a cd pipeline overly complex or difficult understand the aims and value of these practices, hopefully they will be more open to giving them a chance. once these practices become embedded and mature in an organization, team members often find it’s difficult to go back to the old ways of doing them. edit: i’ve rephrased the definition of the “traditional agile” approach to making software ready for release. this definition is not meant to apply to all agile practices, but rather applies to what seems to me to be a fairly mainstream belief that agile means stopping work to make the software releasable.
May 9, 2012
by Kief Morris
· 54,137 Views · 7 Likes
article thumbnail
Bridging between JMS and RabbitMQ (AMQP) using Spring Integration
An old customer recently asked me if I had a solution for how to integrate between their existing JMS infrastructure on Websphere MQ with RabbitMQ. Although I know that RabbitMQ has the shovel plugin which can bridge between Rabbit instances I've yet not found a good plugin for JMS <-> AMQP forwarding. The first thing that came to my mind was to utilize a Spring Integration mediation as SI has excellent support for both JMS and Rabbit. Curious as I am I started a PoC and this is the result. It takes messages of a JMS queue and forwards to an AMQP exchange that is bound to a queue the consumer application is supposed to listen to. I used an external HornetQ instance in JBoss 6.1 as the JMS Provider, but I am 100% secure that the same setup would work for Websphere MQ as they both implement JMS. Be aware that I've done no performance tweaking or QoS setup yet as this is just a proof-of-concept. For a real setup you'd probably have to think about delivery guarantees versus performance and etc... The code will be available at a GitHub repository near you soon.. SpringContext in XML: org.jnp.interfaces.NamingContextFactory jnp://localhost:1099 org.jnp.interfaces:org.jboss.naming ConnectionFactory Maven POM: 4.0.0 org.rl si.jmstorabbit 0.0.1-SNAPSHOT jar si.jmstorabbit http://maven.apache.org UTF-8 2.2.5.Final 2.1.0.RELEASE springsource-release http://repository.springsource.com/maven/bundles/release false springsource-external http://repository.springsource.com/maven/bundles/external false org.springframework.integration spring-integration-core ${spring.integration.version} org.springframework.integration spring-integration-file ${spring.integration.version} org.springframework.integration spring-integration-amqp ${spring.integration.version} org.springframework.integration spring-integration-jms ${spring.integration.version} junit junit 3.8.1 test org.springframework spring-context 3.0.7.RELEASE jboss jnp-client 4.2.2.GA org.hornetq hornetq-core-client ${hornet.version} org.hornetq hornetq-jms-client ${hornet.version} org.hornetq hornetq-jms ${hornet.version} jboss jboss-common-client 3.2.3 org.jboss.netty netty 3.2.7.Final javax.jms jms 1.1
April 24, 2012
by Billy Sjöberg
· 30,068 Views
article thumbnail
How to Use Sigma.js with Neo4j
i’ve done a few posts recently using d3.js and now i want to show you how to use two other great javascript libraries to visualize your graphs. we’ll start with sigma.js and soon i’ll do another post with three.js . we’re going to create our graph and group our nodes into five clusters. you’ll notice later on that we’re going to give our clustered nodes colors using rgb values so we’ll be able to see them move around until they find their right place in our layout. we’ll be using two sigma.js plugins, the gefx (graph exchange xml format) parser and the forceatlas2 layout. you can see what a gefx file looks like below. notice it comes from gephi which is an interactive visualization and exploration platform, which runs on all major operating systems, is open source, and is free. ... ... in order to build this file, we will need to get the nodes and edges from the graph and create an xml file. get '/graph.xml' do @nodes = nodes @edges = edges builder :graph end we’ll use cypher to get our nodes and edges: def nodes neo = neography::rest.new cypher_query = " start node = node:nodes_index(type='user')" cypher_query << " return id(node), node" neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0]}.merge(n[1]["data"])} end we need the node and relationship ids, so notice i’m using the id() function in both cases. def edges neo = neography::rest.new cypher_query = " start source = node:nodes_index(type='user')" cypher_query << " match source -[rel]-> target" cypher_query << " return id(rel), id(source), id(target)" neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0], "source" => n[1], "target" => n[2]} } end so far we have seen graphs represented as json, and we’ve built these manually. today we’ll take advantage of the builder ruby gem to build our graph in xml. xml.instruct! :xml xml.gexf 'xmlns' => "http://www.gephi.org/gexf", 'xmlns:viz' => "http://www.gephi.org/gexf/viz" do xml.graph 'defaultedgetype' => "directed", 'idtype' => "string", 'type' => "static" do xml.nodes :count => @nodes.size do @nodes.each do |n| xml.node :id => n["id"], :label => n["name"] do xml.tag!("viz:size", :value => n["size"]) xml.tag!("viz:color", :b => n["b"], :g => n["g"], :r => n["r"]) xml.tag!("viz:position", :x => n["x"], :y => n["y"]) end end end xml.edges :count => @edges.size do @edges.each do |e| xml.edge:id => e["id"], :source => e["source"], :target => e["target"] end end end end you can get the code on github as usual and see it running live on heroku. you will want to see it live on heroku so you can see the nodes in random positions and then move to form clusters. use your mouse wheel to zoom in, and click and drag to move around. credit goes out to alexis jacomy and mathieu jacomy . you’ve seen me create numerous random graphs, but for completeness here is the code for this graph. notice how i create 5 clusters and for each node i assign half its relationships to other nodes in their cluster and half to random nodes? this is so the forceatlas2 layout plugin clusters our nodes neatly. def create_graph neo = neography::rest.new graph_exists = neo.get_node_properties(1) return if graph_exists && graph_exists['name'] names = 500.times.collect{|x| generate_text} clusters = 5.times.collect{|x| {:r => rand(256), :g => rand(256), :b => rand(256)} } commands = [] names.each_index do |n| cluster = clusters[n % clusters.size] commands << [:create_node, {:name => names[n], :size => 5.0 + rand(20.0), :r => cluster[:r], :g => cluster[:g], :b => cluster[:b], :x => rand(600) - 300, :y => rand(150) - 150 }] end names.each_index do |from| commands << [:add_node_to_index, "nodes_index", "type", "user", "{#{from}"] connected = [] # create clustered relationships members = 20.times.collect{|x| x * 10 + (from % clusters.size)} members.delete(from) rels = 3 rels.times do |x| to = members[x] connected << to commands << [:create_relationship, "follows", "{#{from}", "{#{to}"] unless to == from end # create random relationships rels = 3 rels.times do |x| to = rand(names.size) commands << [:create_relationship, "follows", "{#{from}", "{#{to}"] unless (to == from) || connected.include?(to) end end batch_result = neo.batch *commands end
April 12, 2012
by Max De Marzi
· 15,380 Views
article thumbnail
All about JMS messages
JMS providers like ActiveMQ are based on the concept of passing one-directional messages between nodes and brokers asynchronously. A thorough knowledge of the type of messages that can be sent through a JMS middleware can simplify a lot your work in mapping the communication patterns to real code. The basic Message interface Some object members are shared by all messages: header fields, used to identify univocally a message and to route it to the right brokers and consumers. A dynamic map of properties which can be read programmatically by JMS brokers in order to filter or to route messages. A body, which is differentiated in the various implementations we'll see. Header fields The set of getJMS*() methods on the Message interface defines the available headers. Two of them are oriented to message identification: getJMSMessageID() contains a generated ID for identifying a message, unique at least for the current broker. All generated IDs start with the prefix 'ID:', but you can override it with the corresponding setter. getJMSCorrelationID() (and getJMSCorrelationID() as bytes) can link a message with another, usually one that has been sent previously. For example, a reply can carry the ID of the original message when put in another queue. Two to sender and recipient identification: getJMSDestination() returns a Destination object (a Topic or a Queue, or their temporary version) describing where the message was directed. getJMSReplyTo() is a Destination object where replies should be sent; it can be null of course. Three tune the delivery mechanism: getJMSDeliveryMode() can be DeliveryMode.NON_PERSISTENT or DeliveryMode.PERSISTENT; only persistent messages guarantee delivery in case of a crash of the brokers that transport it. getJMSExpiration() returns a timestamp indicating the expiration time of the message; it can be 0 on a message without a defined expiration. getJMSPriority() returns a 0-9 integer value (higher is better) defining the priority for delivery. It is only a best-effort value. While the remaining ones contain metadata: getJMSRedelivered() returns a boolean indicating if the message is being delivered again after a delivery which was not acknowledge. getJMSTimestamp() returns a long indicating the time of sending. getJMSType() defines a field for provider-specific or application-specific message types. Of these headers, only JMSCorrelationID, JMSReplyTo and JMSType have to be set when needed. The others are generated or managed by the send() and publish() methods if not specified (there are setters available for each of these headers.) Properties Generic properties with a String name can be added to messages and read with getBooleanProperty(), getStringProperty() and similar methods. The corresponding setters setBooleanProperty(), setStringProperty(), ... can be used for their addition. The reason for keeping some properties out of the content of the message (which a MapMessage can contain) is so they could be read before reaching the destination, for example in a JMS broker. The use case for this access to message properties is routing and filtering: downstream brokers and consumers may define a filter such as "I am interested only on messages on this Topic that have property X = 'value' and Y = 'value2'". Bodies All the subinterfaces of javax.jms.Message defined by the API provide different types of message bodies (while actual classes are defined by the providers and are not part of the API). Actual instantiation is then handled by the Session, which implements an Abstract Factory pattern. On the receival side, a cast is necessary for any message type (at least in the Java JMS Api), since only a generic Message is read. BytesMessage is the most basic type: it contains a sequence of uninterpreted bytes. Hence, it can in theory contain anything, but the generation and interpretation of the content is the client's job. BytesMessage m = session.createBytesMessage(); m.writeByte(65); m.writeBytes(new byte[] { 66, 68, 70 }); // on receival (cast shown only here) BytesMessage m = (BytesMessage) genericMessage; byte[] content = new byte[4]; m.readBytes(content); MapMessage defines a message containing an (unordered) set of key/value pairs, also called a map or dictionary or hash. However, the keys are String objects, while the values are primitives or Strings; since they are primitives, they shouldn't be null. MapMessage = session.createMapMessage(); m.setString('key', 'value'); // or m.setObject('key', 'value') to avoid specifying a type // on receival m.getString('key'); // or m.getObject('key') ObjectMessage wraps a generic Object for transmission. The Object should be Serializable. ObjectMessage m = session.createObjectMessage(); m.setObject(new ValueObject('field1', 42)); // on receival ValueObject vo = (ValueObject) m.getObject(); StreamMessage wraps a stream of primitive values of indefinite length. StreamMessage m = session.createStreamMessage(); m.writeBoolean(true); m.writeBoolean(false); m.writeBoolean(true); // receival System.out.println(m.readBoolean()); System.out.println(m.readBoolean()); System.out.println(m.readBoolean()); // prints true, false, true TextMessage wraps a String of any length. TextMessage m = session.createTextMessage("Contents"); // or use m.setText() afterwards // receival String text = m.getText(); Usually all messages are in a read-only phase after receival, so only getters can be called on them.
March 12, 2012
by Giorgio Sironi
· 61,863 Views · 1 Like
article thumbnail
Why Having "DevOps" in a Job Title Makes Sense
We’ve been trying to grow our team for a few months now and the title we’re hiring for is Devops Engineer. One of the candidates our recruiters reached out to, let’s call him John, came back to us with a bunch of questions including: How do you feel about hiring someone with a devops title? It’s a very legittimate question, Devops is a cultural and professional movement, so how could it be a job title? What I argued in my reply to this fella is that Devops isn’t the job title, Devops Engineer is, and in this sense Devops is just a qualifier and I strongly believe a very useful one. I really sympathise with those that are fighting hard to keep Devops real and avoid the same faith that some refer to as the sad commercialisation of Agile. My campaign to make of devops a job title isn’t a campaign to come up with a set of bullet points that define Devops as a job so that I can put it on a resume or build it into a product. My argument here is that the guy I’m trying to hire, John, I want him to be a certain kind of guy and the best way I have to describe what I want is Devops Engineer. I’m looking for an operations guy , but I want him to be open to developers, consider engineering and the company as a whole, be focused on delivering value and not rathole into fights about technology or claim root access only on principle. I want that guy to have great communication skills and the interest to explore what’s besides his infrastructure, to be wanting to borrow as much good he can find in other disciplines across the organisation. And then of course there is the practical part, the desire to automate and escape a boring manual routine, the familiarity with cloud that willing or not has powered the movement, and even more specific things like configuration management. You may argue that this is just a good engineer or what systems engineers are becoming, in other words nothing new under the sun. And you may be right, but job titles are in many ways just another way to communicate, to broadcast an intent and a need. So you know what I told John about hiring Devops Engineers? That I felt pretty damn proud about it. The true ones, not the ones slapping it on their CV to get a job, are fantastic engineers and I can’t but encourage them to start to respond to that qualifier. Likewise the companies and individuals seeking them out are likely the ones building great groups those people will want to be members of. Yes, the moment it becomes a keyword recruiters start to match against we’re likely to see a spur of fakes trying to land a job, but that’s nothing new under the sun. Signed, a Devops manager Source: http://www.spikelab.org/devops-job-title/
March 5, 2012
by Spike Morelli
· 10,695 Views
article thumbnail
Why You Need a Git Pre-Commit Hook and Why Most Are Wrong
a pre-commit hook is a piece of code that runs before every commit and determines whether or not the commit should be accepted. think of it as the gatekeeper to your codebase. want to ensure you didn’t accidentally leave any pdb s in your code? pre-commit hook. want to make sure your javascript is jshint approved? pre-commit hook. want to guarantee clean, readable pep8 -compliant code? pre-commit hook. want to pipe all of the comments in your codebase through strunk & white ? please don’t. the pre-commit hook is just an executable file that runs before every commit. if it exits with zero status, the commit is accepted. if it exits with a non-zero status, the commit is rejected. (note: a pre-commit hook can be bypassed by passing the --no-verify argument.) along with the pre-commit hook there are numerous other git hooks that are available: post-commit, post-merge, pre-receive, and others that can be found here . why most pre-commit hooks are wrong be wary of the above’s example as the majority of pre-commit hooks you’ll see on the web are wrong. most test against whatever files are currently on disk, not what is in the staging area (the files actually being committed). we avoid this in our hook by stashing all changes that are not part of the staging area before running our checks and then popping the changes afterwards. this is very important because a file could be fine on disk while the changes that are being committed are wrong. the code below is the pre-commit hook we use at yipit. our hook is simply a set of checks to be run against any files that have been modified in this commit. each check can be configured to include/exclude particular types of files. it is designed for a django environment, but should be adaptable to other environments with minor changes. note that you need git 1.7.7+ #!/usr/bin/env python import os import re import subprocess import sys modified = re.compile('^(?:m|a)(\s+)(?p.*)') checks = [ { 'output': 'checking for pdbs...', 'command': 'grep -n "import pdb" %s', 'ignore_files': ['.*pre-commit'], 'print_filename': true, }, { 'output': 'checking for ipdbs...', 'command': 'grep -n "import ipdb" %s', 'ignore_files': ['.*pre-commit'], 'print_filename': true, }, { 'output': 'checking for print statements...', 'command': 'grep -n print %s', 'match_files': ['.*\.py$'], 'ignore_files': ['.*migrations.*', '.*management/commands.*', '.*manage.py', '.*/scripts/.*'], 'print_filename': true, }, { 'output': 'checking for console.log()...', 'command': 'grep -n console.log %s', 'match_files': ['.*yipit/.*\.js$'], 'print_filename': true, }, { 'output': 'checking for debugger...', 'command': 'grep -n debugger %s', 'match_files': ['.*\.js$'], 'print_filename': true, }, { 'output': 'running jshint...', # by default, jshint prints 'lint free!' upon success. we want to filter this out. 'command': 'jshint %s | grep -v "lint free!"', 'match_files': ['.*yipit/.*\.js$'], 'print_filename': false, }, { 'output': 'running pyflakes...', 'command': 'pyflakes %s', 'match_files': ['.*\.py$'], 'ignore_files': ['.*settings/.*', '.*manage.py', '.*migrations.*', '.*/terrain/.*'], 'print_filename': false, }, { 'output': 'running pep8...', 'command': 'pep8 -r --ignore=e501,w293 %s', 'match_files': ['.*\.py$'], 'ignore_files': ['.*migrations.*'], 'print_filename': false, }, { 'output': 'checking for sass changes...', 'command': 'sass --quiet --update %s', 'match_files': ['.*\.scss$'], 'print_filename': true, }, ] def matches_file(file_name, match_files): return any(re.compile(match_file).match(file_name) for match_file in match_files) def check_files(files, check): result = 0 print check['output'] for file_name in files: if not 'match_files' in check or matches_file(file_name, check['match_files']): if not 'ignore_files' in check or not matches_file(file_name, check['ignore_files']): process = subprocess.popen(check['command'] % file_name, stdout=subprocess.pipe, stderr=subprocess.pipe, shell=true) out, err = process.communicate() if out or err: if check['print_filename']: prefix = '\t%s:' % file_name else: prefix = '\t' output_lines = ['%s%s' % (prefix, line) for line in out.splitlines()] print '\n'.join(output_lines) if err: print err result = 1 return result def main(all_files): # stash any changes to the working tree that are not going to be committed subprocess.call(['git', 'stash', '-u', '--keep-index'], stdout=subprocess.pipe) files = [] if all_files: for root, dirs, file_names in os.walk('.'): for file_name in file_names: files.append(os.path.join(root, file_name)) else: p = subprocess.popen(['git', 'status', '--porcelain'], stdout=subprocess.pipe) out, err = p.communicate() for line in out.splitlines(): match = modified.match(line) if match: files.append(match.group('name')) result = 0 print 'running django code validator...' return_code = subprocess.call('$virtual_env/bin/python manage.py validate', shell=true) result = return_code or result for check in checks: result = check_files(files, check) or result # unstash changes to the working tree that we had stashed subprocess.call(['git', 'reset', '--hard'], stdout=subprocess.pipe, stderr=subprocess.pipe) subprocess.call(['git', 'stash', 'pop', '-q'], stdout=subprocess.pipe, stderr=subprocess.pipe) sys.exit(result) if __name__ == '__main__': all_files = false if len(sys.argv) > 1 and sys.argv[1] == '--all-files': all_files = true main(all_files) to use this hook or a hook that you create yourself, simply copy the file to .git/hooks/pre-commit inside of your project and make sure that it is executable or add in to your git repo and setup a symlink.
March 3, 2012
by Steve Pulec
· 23,485 Views · 1 Like
article thumbnail
Creating a build pipeline using Maven, Jenkins, Subversion and Nexus.
for a while now, we had been operating in the wild west when it comes to building our applications and deploying to production. builds were typically done straight from the developer’s ide and manually deployed to one of our app servers. we had a manual process in place, where the developer would do the following steps. check all project code into subversion and tag build the application. archive the application binary to a network drive deploy to production update our deployment wiki with the date and version number of the app that was just deployed. the problem is that there were occasionally times where one of these steps were missed, and it always seemed to be at a time when we needed to either rollback to the previous version, or branch from the tag to do a bugfix. sometimes the previous version had not been archived to the network, or the developer forgot to tag svn. we were already using jenkins to perform automated builds, so we wanted to look at extending it further to perform release builds. the maven release plug-in provides a good starting point for creating an automated release process. we have also just started using the nexus maven repository and wanted to incorporate that as well to archive our binaries to, rather than archiving them to a network drive. the first step is to set up the project’s pom file with the deploy plugin as well as include configuration information about our nexus and subversion repositories. org.apache.maven.plugins maven-release-plugin 2.2.2 http://mks:8080/svn/jrepo/tags/frameworks/siestaframework the release plugin configuration is pretty straightforward. the configuration takes the subversion url of the location where the tags will reside for this project. the next step is to configure the svn location where the code will be checked out from. scm:svn:http://mks:8080/svn/jrepo/trunk/frameworks/siestaframework http://mks:8080/svn the last step in configuring the project is to set up the location where the binaries will be archived to. in our case, the nexus repository. lynden-java-release lynden release repository http://cisunwk:8081/nexus/content/repositories/lynden-java-release the project is now ready to use the maven release plug-in. the release plugin provides a number of useful goals. release:clean – cleans the workspace in the event the last release process was not successful. release: prepare – performs a number of operations checks to make sure that there are no uncommitted changes. ensures that there are no snapshot dependencies in the pom file, changes the version of the application and removes snapshot from the version. ie 1.0.3-snapshot becomes 1.0.3 run project tests against modified poms commit the modified pom tag the code in subersion increment the version number and append snapshot. ie 1.0.3 becomes 1.0.4-snapshot commit modified pom release: perform – performs the release process checks out the code using the previously defined tag runs the deploy maven goal to move the resulting binary to the repository. putting it all together the last step in this process is to configure jenkins to allow release builds on-demand, meaning we want the user to have to explicitly kick off a release build for this process to take place. we have download and installed the release jenkins plug-in in order to allow developers to kick off release builds from jenkins. the release plug-in will execute tasks after the normal build has finished. below is a screenshot of the configuration of one of our projects. the release build option for the project is enabled by selecting the “configure release build” option in the “build environment” section. the maven release plug-in is activated by adding the goals to the “after successful release build” section. (the –b option enables batch mode so that the release plug-in will not ask the user for input, but use defaults instead.) once the release option has been configured for a project there will be a “release” icon on the left navigation menu for the project. selecting this will kick off a build and then the maven release process, assuming the build succeeds. finally a look at svn and nexus verifies that the build for version 1.0.4 of the siesta-framework project has been tagged in svn and uploaded to nexus. the next steps for this project will be to generate release notes for release builds, and also to automate a deployment pipeline, so that developers can deploy to our test, staging and production servers via jenkins rather than manually from their development workstations. twitter: @robterp blog: http://rterp.wordpress.com
February 29, 2012
by Rob Terpilowski
· 86,815 Views
article thumbnail
How to Add an Existing Project to TFS
this is a super basic and easy question, but i found quite often people asking me how to add an existing project to a tfs team project. it turns out that there are more than one way of doing this, but i usually suggests this simple path that is quite simple and is understandable from people that comes from the subversion world. first of all be sure that you have a valid workspace in your computer that maps the folder in the team project where you want to put your existing project and issue a getlatest . if the team project is new you can simply create a workspace going to menu file –> source control –> workspaces, press add and create a workspace that maps the source to a folder on your pc. figure 1: creating a workspace that maps the whole folder of the source control. now simply go to the local folder and move the existing solution from the original location to the mapped folder, then go to the team explorer –> source control and manually add all the files to tfs source control. in figure2 i represented the process to accomplish this task, first of all select source control, press the add button then visual studio presents you all the files that are in your local folder but are not present in the source control (they are the candidate to be added to the source control). figure 2: adding existing file to source control now you are presented with a list of all the files that will be added to tfs source control, as seen in figure 3 figure 3: list of files that gets added to the source control. as you can see some of the files are excluded (22 in figure 3), this happens because visual studio already knows that certain types of file should be excluded from source control, like all the bin and debug folders and all *.dll files . if you have some lib folder where you store third party library, you can go to the excluded items tab in figure 3 to add manually all the excluded files you want to be added to the source control. since this window is usually cluttered with all the bin and obj directory, i found simpler to: 1) in a first pass add all the file suggested by visual studio 2) browse to lib directory (or whatever folder contains third party library) and add explicitly all the files in the directory. now all files were in pending-add, this means that they will be sent to the source control with a check-in, but before the check-in phase you should open solution file from the source control explorer windows. after the solution is opened visual studio should tell you that this solution is in a source control monitored folder, but source control is not enabled this basically means that, source files are correctly linked to the source control system, but the visual studio integration is not enabled, you can simply press yes and let visual studio actually do binding between projects and source control . if the binding does not happens automatically you will get the windows of figure 4, (you can always open this windows with the menu file –> source control –> change source control. in this window you can simply select one project at a time and press the bind button to perform the bind, until all the project files are in connected status. figure 4: binding windows where you can bind solution and projects file to tfs now you can check-in everything and usually you should create another workspace or ask to another member of the team to do a get-latest of the just-inserted solution, to verify that all the needed files were correctly added to the source control. gian maria. source: http://www.codewrecks.com/blog/index.php/2012/01/27/add-existing-project-to-tfs
February 9, 2012
by Ricci Gian Maria
· 117,232 Views · 1 Like
article thumbnail
Separating Integration and Unit Tests with Maven, Sonar, Failsafe, and JaCoCo
Execute the slow integration tests separately from unit tests and show as much information about them as possible in Sonar.
February 8, 2012
by Jakub Holý
· 57,001 Views · 1 Like
article thumbnail
Java.lang.VerifyError: Expecting a stackmap frame at branch target – JDK 7
Right now, when I try to persist an object in Google App Engine, I’m facing the error “Java.lang.VerifyError: Expecting a stackmap frame at branch target“. I’m using JDK 7 and it seems like the problem lies with this JDK. After googling a bit, I found that there seems to be two solutions to fix this problem. Solution 1: Change to JDK 6 As simple as is, change your JDK to version 6 and you won’t be bugged by this exception anymore. Well, in my case, I have to use JDK 7. So, moving on to the solution 2. Solution 2: Configure JVM Go to Windows -> Preferences -> Installed JREs. Select the default JVM and click edit. Then add this parameter as VM argument “-XX:-UseSplitVerifier” as seen below. This should solve the issue. From http://veerasundar.com/blog/2012/01/java-lang-verifyerror-expecting-a-stackmap-frame-at-branch-target-jdk-7/
February 6, 2012
by Veera Sundar
· 47,672 Views
article thumbnail
Which Integration Framework Should You Use – Spring Integration, Mule ESB or Apache Camel?
Data exchanges between companies are increasing a lot. The number of applications that must be integrated is increasing, too. The interfaces use different technologies, protocols and data formats. Nevertheless, the integration of these applications must be modeled in a standardized way, realized efficiently and supported by automatic tests. Three integration frameworks are available in the JVM environment, which fulfil these requirements: Spring Integration, Mule ESB and Apache Camel. They implement the well-known Enteprise Integration Patterns (EIP, http://www.eaipatterns.com) and therefore offer a standardized, domain-specific language to integrate applications. These integration frameworks can be used in almost every integration project within the JVM environment – no matter which technologies, transport protocols or data formats are used. All integration projects can be realized in a consistent way without redundant boilerplate code. This article compares all three alternatives and discusses their pros and cons. If you want to know, when to use a more powerful Enterprise Service Bus (ESB) instead of one of these lightweight integration frameworks, then you should read this blog post: http://www.kai-waehner.de/blog/2011/06/02/when-to-use-apache-camel/ (it explains when to use Apache Camel, but the title could also be „When to use a lightweight integration framework“). Comparison Criteria Several criteria can be used to compare these three integration frameworks: Open source Basic concepts / architecture Testability Deployment Popularity Commercial support IDE-Support Errorhandling Monitoring Enterprise readiness Domain specific language (DSL) Number of components for interfaces, technologies and protocols Expandability Similarities All three frameworks have many similarities. Therefore, many of the above comparison criteria are even! All implement the EIPs and offer a consistent model and messaging architecture to integrate several technologies. No matter which technologies you have to use, you always do it the same way, i.e. same syntax, same API, same automatic tests. The only difference is the the configuration of each endpoint (e.g. JMS needs a queue name while JDBC needs a database connection url). IMO, this is the most significant feature. Each framework uses different names, but the idea is the same. For instance, „Camel routes“ are equivalent to „Mule flows“, „Camel components“ are called „adapters“ in Spring Integration. Besides, several other similarities exists, which differ from heavyweight ESBs. You just have to add some libraries to your classpath. Therefore, you can use each framework everywhere in the JVM environment. No matter if your project is a Java SE standalone application, or if you want to deploy it to a web container (e.g. Tomcat), JEE application server (e.g. Glassfish), OSGi container or even to the cloud. Just add the libraries, do some simple configuration, and you are done. Then you can start implementing your integration stuff (routing, transformation, and so on). All three frameworks are open source and offer familiar, public features such as source code, forums, mailing lists, issue tracking and voting for new features. Good communities write documentation, blogs and tutorials (IMO Apache Camel has the most noticeable community). Only the number of released books could be better for all three. Commercial support is available via different vendors: Spring Integration: SpringSource (http://www.springsource.com) Mule ESB: MuleSoft (http://www.mulesoft.org) Apache Camel: FuseSource (http://fusesource.com) and Talend (http://www.talend.com) IDE support is very good, even visual designers are available for all three alternatives to model integration problems (and let them generate the code). Each of the frameworks is enterprise ready, because all offer required features such as error handling, automatic testing, transactions, multithreading, scalability and monitoring. Differences If you know one of these frameworks, you can learn the others very easily due to their same concepts and many other similarities. Next, let’s discuss their differences to be able to decide when to use which one. The two most important differences are the number of supported technologies and the used DSL(s). Thus, I will concentrate especially on these two criteria in the following. I will use code snippets implementing the well-known EIP „Content-based Router“ in all examples. Judge for yourself, which one you prefer. Spring Integration Spring Integration is based on the well-known Spring project and extends the programming model with integration support. You can use Spring features such as dependency injection, transactions or security as you do in other Spring projects. Spring Integration is awesome, if you already have got a Spring project and need to add some integration stuff. It is almost no effort to learn Spring Integration if you know Spring itself. Nevertheless, Spring Integration only offers very rudimenary support for technologies – just „basic stuff“ such as File, FTP, JMS, TCP, HTTP or Web Services. Mule and Apache Camel offer many, many further components! Integrations are implemented by writing a lot of XML code (without a real DSL), as you can see in the following code snippet: You can also use Java code and annotations for some stuff, but in the end, you need a lot of XML. Honestly, I do not like too much XML declaration. It is fine for configuration (such as JMS connection factories), but not for complex integration logic. At least, it should be a DSL with better readability, but more complex Spring Integration examples are really tough to read. Besides, the visual designer for Eclipse (called integration graph) is ok, but not as good and intuitive as its competitors. Therefore, I would only use Spring Integration if I already have got an existing Spring project and must just add some integration logic requiring only „basic technologies“ such as File, FTP, JMS or JDBC. Mule ESB Mule ESB is – as the name suggests – a full ESB including several additional features instead of just an integration framework (you can compare it to Apache ServiceMix which is an ESB based on Apache Camel). Nevertheless, Mule can be use as lightweight integration framework, too – by just not adding and using any additional features besides the EIP integration stuff. As Spring Integration, Mule only offers a XML DSL. At least, it is much easier to read than Spring Integration, in my opinion. Mule Studio offers a very good and intuitive visual designer. Compare the following code snippet to the Spring integration code from above. It is more like a DSL than Spring Integration. This matters if the integration logic is more complex. The major advantage of Mule is some very interesting connectors to important proprietary interfaces such as SAP, Tibco Rendevous, Oracle Siebel CRM, Paypal or IBM’s CICS Transaction Gateway. If your integration project requires some of these connectors, then I would probably choose Mule! A disadvantage for some projects might be that Mule says no to OSGi: http://blogs.mulesoft.org/osgi-no-thanks/ Apache Camel Apache Camel is almost identical to Mule. It offers many, many components (even more than Mule) for almost every technology you could think of. If there is no component available, you can create your own component very easily starting with a Maven archetype! If you are a Spring guy: Camel has awesome Spring integration, too. As the other two, it offers a XML DSL: ${in.header.type} is ‘com.kw.DvdOrder’ ${in.header.type} is ‘com.kw.VideogameOrder’ Readability is better than Spring Integration and almost identical to Mule. Besides, a very good (but commercial) visual designer called Fuse IDE is available by FuseSource – generating XML DSL code. Nevertheless, it is a lot of XML, no matter if you use a visual designer or just your xml editor. Personally, I do not like this. Therefore, let’s show you another awesome feature: Apache Camel also offers DSLs for Java, Groovy and Scala. You do not have to write so much ugly XML. Personally, I prefer using one of these fluent DSLs instead XML for integration logic. I only do configuration stuff such as JMS connection factories or JDBC properties using XML. Here you can see the same example using a Java DSL code snippet: from(“file:incomingOrders “) .choice() .when(body().isInstanceOf(com.kw.DvdOrder.class)) .to(“file:incoming/dvdOrders”) .when(body().isInstanceOf(com.kw.VideogameOrder.class)) .to(“jms:videogameOrdersQueue “) .otherwise() .to(“mock:OtherOrders “); The fluent programming DSLs are very easy to read (even in more complex examples). Besides, these programming DSLs have better IDE support than XML (code completion, refactoring, etc.). Due to these awesome fluent DSLs, I would always use Apache Camel, if I do not need some of Mule’s excellent connectors to proprietary products. Due to its very good integration to Spring, I would even prefer Apache Camel to Spring Integration in most use cases. By the way: Talend offers a visual designer generating Java DSL code, but it generates a lot of boilerplate code and does not allow vice-versa editing (i.e. you cannot edit the generated code). This is a no-go criteria and has to be fixed soon (hopefully)! And the winner is… … all three integration frameworks, because they are all lightweight and easy to use – even for complex integration projects. It is awesome to integrate several different technologies by always using the same syntax and concepts – including very good testing support. My personal favorite is Apache Camel due to its awesome Java, Groovy and Scala DSLs, combined with many supported technologies. I would only use Mule if I need some of its unique connectors to proprietary products. I would only use Spring Integration in an existing Spring project and if I only need to integrate „basic technologies“ such as FTP or JMS. Nevertheless: No matter which of these lightweight integration frameworks you choose, you will have much fun realizing complex integration projects easily with low efforts. Remember: Often, a fat ESB has too much functionality, and therefore too much, unnecessary complexity and efforts. Use the right tool for the right job! Best regards, Kai Wähner (Twitter: @KaiWaehner) http://www.kai-waehner.de/blog/2012/01/10/spoilt-for-choice-which-integration-framework-to-use-spring-integration-mule-esb-or-apache-camel/
January 19, 2012
by Kai Wähner DZone Core CORE
· 110,870 Views · 9 Likes
article thumbnail
Maven + JavaScript Unit Test: Running in a Continuous Integration Environment
So you're still interested in unit testing JavaScript (good). This post is an extension of my much more indepth first posting on how to unit test JavaScript using JS Test Driver. Please check it out here. Recap Last Posting In the last posting we successfully unit tested JavaScript using Maven and JsTest Driver. This allowes us to test JavaScript when on an environment that has a modern browser installed and can be run. Problem with typical CI environments So what happens when the test are passing on your local box, but you go to check in your code and the Continuous Integration (CI) server pukes on the new tests becasue there is no "screen" to run chrome or firefox? As of this posting, none of the top-tier browsers have a "headless" or an in-memory only browser window. There are alternatives to running JavaScript in a browser, such as rhino.js, env.js or HtmlUnit, however, these are just ports of browsers and the JavaScript and DOM representation are not 100% accurate which can lead to problems with your code when rendered in a client's browser. Approach What we need to do is to run JSTestDriver's browser in a Virtual X Framebuffer (Xvfb) which is possible on nearly all Linux based systems. The example below uses a Solaris version of Linux, however, Debian and RedHat linux distrubutions come with the simplified bash script to easily run an appliation in a virtual framebuffer. This solution was derived from one posted solution on the JS Test Driver wiki. The given example is also a full working example that is in use at my current client. Here is the quick list of what we will accomplish. Note, several of these steps are discussed in depth in the previous post and are not covered in depth here. Create a profile to run Js Unit-Tests Copy JsTestDriver library to a known location for Maven to use Copy JavaScript main and test files to known locations Use ANT to start JsTestDriver and pipe the screen into xvfb Here is a sample profile to use. You will need to adjust the properties at the top of the profile to match your system. ci-jstests /opt/swf/bin/firefox 1.3.2 /opt/X11R6/xvfb-run org.apache.maven.plugins maven-dependency-plugin 2.1 copy generate-resources copy com.google.jstestdriver jstestdriver ${js-test-driver.version} jar true jsTestDriver.jar ${project.build.directory}/jstestdriver false true maven-resources-plugin 2.4.3 copy-main-files generate-test-resources copy-resources ${project.build.directory}/test-classes/main-js src/main/webapp/scripts false copy-test-files generate-test-resources copy-resources ${project.build.directory}/test-classes/test-js src/test/webapp/scripts false org.apache.maven.plugins maven-antrun-plugin 1.6 test run Possible problems Although I cannot predict or fix all problems, I can share the one major problem I ran into with Solaris and the script used to fix that. In Solaris (and could happen to other distros) the xvfb-run script was not available and several of the other libraries did not exist. I first had to download the latest X libraries and place them in their appropriate locations on the CI server. Next, I had to re-engineer the xvfb-run script. Here is a copy of my script (NOTE: This is the solution for my server and this may not work for you) I created a script that contains: /usr/openwin/bin/Xvfb :1 screen 0 1280x1024x8 pixdepths 8 24 fbdir /tmp/.X11-vbf & From http://www.ensor.cc/2011/08/maven-javascript-unit-test-running-in.html
December 23, 2011
by Mike Ensor
· 12,249 Views
  • Previous
  • ...
  • 327
  • 328
  • 329
  • 330
  • 331
  • 332
  • 333
  • 334
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×