Microservices Resources

The Latest Microservices Topics

A Webapp Makeover with Spring 4 and Spring Boot

A typical Maven and Spring web application has a fair amount of XML and verbosity to it. Add in Jersey and Spring Security and you can have hundreds of lines of XML before you even start to write your Java code. As part of a recent project, I was tasked with upgrading a webapp like this to use Spring 4 and Spring Boot. I also figured I'd try to minimize the XML. This is my story on how I upgraded to Spring 4, Jersey 2, Java 8 and Spring Boot 0.5.0 M6. When I started, the app was using Spring 3.2.5, Spring Security 3.1.4 and Jersey 1.18. The pom.xml had four Jersey dependencies, three Spring dependencies and three Spring Security dependencies, along with a number of exclusions for "jersey-spring". Upgrading to Spring 4 Upgrading to Spring 4 was easy, I changed the version property to 4.0.0.RC2 and added the new Spring bill of materials to my pom.xml. I also add the Spring milestone repo since Spring 4 won't be released to Maven central until tomorrow. org.springframework spring-framework-bom ${spring.framework.version} pom import spring-milestones http://repo.spring.io/milestone true Next, I removed all the references to ${spring.framework.version} in dependencies since it'd be controlled by Maven's dependency management feature. org.springframework spring-web - ${spring.framework.version} I also changed to use Maven 3's wildcard syntax to exclude multiple dependencies. com.sun.jersey.contribs jersey-spring org.springframework - spring - - - org.springframework - spring-core - - - org.springframework - spring-web - - - org.springframework - spring-beans - - - org.springframework - spring-context + * I confirmed the upgrade worked by running "mvn dependency:tree | grep spring", followed by "mvn jetty:run" and viewing the app in my browser. Upgrading to Jersey 2 The next item I tackled was upgrading to Jersey 2.4.1. I changed the version number in my pom.xml, then added the Jersey BOM. org.glassfish.jersey jersey-bom ${jersey.version} pom import You might ask "why Jersey?" if we already have Spring MVC and its REST support? You might also ask why not Play or Grails instead of a Java + Spring stack? For this particular project, I recommended technology options, and these were certainly among them. However, the team chose differently and I support their decision. The project is creating an iOS app, as well as a responsive HTML5 mobile/desktop app. We figured we had enough risk with new technologies on the front-end that we should play it a bit safer on the backend. To make the backend work a bit sexier, we've decided to allow Spring 4, Java 8 and possibly some reactive principles. Next, I changed from the old com.sun.jersey dependencies to org.glassfish.jersey and removed jersey-spring. org.glassfish.jersey.containers jersey-container-servlet org.glassfish.jersey.media jersey-media-json-jackson The last thing I needed to do was change the servlet-class and param-name in web.xml: jersey-servlet org.glassfish.jersey.servlet.ServletContainer jersey.config.server.provider.packages com.raibledesigns.boot.service 1 Requiring Java 8 Requiring Java 8 to compile was easy enough. I added the maven-compiler-plugin to enforce a minimum version. maven-compiler-plugin 3.1 1.8 1.8 I downloaded the latest Java 8 SDK and installed it. Then I set my JAVA_HOME to use it. export JAVA_HOME=`/usr/libexec/java_home -v 1.8` Integrating Spring Boot I learned about Spring Boot a few weeks ago at Devoxx. Josh Long gave me a 3-minute demo at the speaker's dinner and showed me enough to peak my interest. To integrate it into my project, I started with the Quick Start. I added the boot-parent, dependencies for web, security and actuator (logging, metrics, etc.) and the Maven plugin. I removed all the Spring and Spring Security dependencies. org.springframework.boot spring-boot-starter-parent 0.5.0.M6 ... spring-milestones http://repo.spring.io/milestone ... org.springframework.boot spring-boot-starter-web org.springframework.boot spring-boot-starter-security org.springframework.boot spring-boot-starter-actuator ... org.springframework.boot spring-boot-maven-plugin Upon restarting my app, I got an error about spring-security.xml using a 3.1 XSD. I fixed it by changing to 3.2. Next, I wanted to eliminate web.xml. First of all, I created an ApplicationInitializer so the WAR could be started from the command line. package com.raibledesigns.boot.config; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.EnableAutoConfiguration; import org.springframework.boot.builder.SpringApplicationBuilder; import org.springframework.boot.web.SpringBootServletInitializer; import org.springframework.context.annotation.ComponentScan; import org.springframework.context.annotation.Configuration; @Configuration @EnableAutoConfiguration @ComponentScan public class ApplicationInitializer extends SpringBootServletInitializer { @Override protected SpringApplicationBuilder configure(SpringApplicationBuilder application) { return application.sources(ApplicationInitializer.class); } public static void main(String[] args) { SpringApplication.run(ApplicationInitializer.class, args); } } However, after adding this, I received the following error on startup: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.springframework.boot.context.properties.ConfigurationPropertiesBindingPostProcessor': Invocation of init method failed; nested exception is java.lang.AbstractMethodError: org.hibernate.validator.internal.engine.ConfigurationImpl .getDefaultParameterNameProvider()Ljavax/validation/ParameterNameProvider; Adding hibernate-validator as a dependency solved this problem: org.hibernate hibernate-validator To configure Spring Security without web.xml and spring-security.xml, I created WebSecurityConfig.java: package com.raibledesigns.boot.config; import org.springframework.context.annotation.Configuration; import org.springframework.core.Ordered; import org.springframework.core.annotation.Order; import org.springframework.security.config.annotation.authentication.builders.AuthenticationManagerBuilder; import org.springframework.security.config.annotation.web.builders.HttpSecurity; import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity; import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter; @Configuration @EnableWebSecurity @Order(Ordered.LOWEST_PRECEDENCE - 6) public class WebSecurityConfig extends WebSecurityConfigurerAdapter { @Override protected void configure(HttpSecurity http) throws Exception { http.authorizeRequests() .antMatchers("/", "/home").permitAll() .antMatchers("/v1.0/**").hasRole("USER") .anyRequest().authenticated(); http.httpBasic().realmName("My API"); } @Override protected void configure(AuthenticationManagerBuilder authManagerBuilder) throws Exception { authManagerBuilder.inMemoryAuthentication() .withUser("test").password("test123").roles("USER"); } } To configure Jersey without web.xml, I created a JerseyConfig class: package com.raibledesigns.boot.config; import org.glassfish.jersey.filter.LoggingFilter; import org.glassfish.jersey.jackson.JacksonFeature; import org.glassfish.jersey.server.ResourceConfig; import org.glassfish.jersey.server.ServerProperties; import javax.ws.rs.ApplicationPath; @ApplicationPath("/v1.0") public class JerseyConfig extends ResourceConfig { public JerseyConfig() { packages("com.raibledesigns.boot.service"); property(ServerProperties.BV_SEND_ERROR_IN_RESPONSE, true); property(ServerProperties.JSON_PROCESSING_FEATURE_DISABLE, false); property(ServerProperties.MOXY_JSON_FEATURE_DISABLE, true); property(ServerProperties.WADL_FEATURE_DISABLE, true); register(LoggingFilter.class); register(JacksonFeature.class); } } Finally, I created MvcConfig.java to set the welcome page. package com.raibledesigns.boot.config; import org.springframework.context.annotation.Configuration; import org.springframework.web.servlet.config.annotation.ViewControllerRegistry; import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter; @Configuration public class MvcConfig extends WebMvcConfigurerAdapter { @Override public void addViewControllers(ViewControllerRegistry registry) { registry.addViewController("/").setViewName("index"); } } To cleanup, I deleted src/main/webapp/WEB-INF and created src/main/resources/logback.xml: Since Boot doesn't support JSP out-of-the-box, I renamed my index.jsp file to index.html and changed the URL in it to point to "/v1.0/hello". I was pleased to see that everything worked nicely. I learned shortly after that I could remove the Spring BOM since Spring Boot uses a property to control its Spring version. The only issue I found is when started the app with "mvn package && java -jar target/app.war", it failed to initialize Jersey. I tried adding a @Bean for the servlet: @Bean public ServletRegistrationBean jerseyServlet() { ServletRegistrationBean registration = new ServletRegistrationBean(new ServletContainer(), "/v1.0/*"); registration.addInitParameter(ServletProperties.JAXRS_APPLICATION_CLASS, JerseyConfig.class.getName()); return registration; } Unfortunately, when running it using "java -jar", I get the following error: org.glassfish.hk2.api.MultiException: A MultiException has 1 exceptions. They are: 1. org.glassfish.jersey.server.internal.scanning.ResourceFinderException: java.io.FileNotFoundException: /.../target/app.war!/WEB-INF/classes (No such file or directory) at org.jvnet.hk2.internal.Utilities.justCreate(Utilities.java:869) at org.jvnet.hk2.internal.ServiceLocatorImpl.create(ServiceLocatorImpl.java:814) at org.jvnet.hk2.internal.ServiceLocatorImpl.createAndInitialize(ServiceLocatorImpl.java:906) at org.jvnet.hk2.internal.ServiceLocatorImpl.createAndInitialize(ServiceLocatorImpl.java:898) at org.glassfish.jersey.server.ApplicationHandler.createApplication(ApplicationHandler.java:300) at org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:279) at org.glassfish.jersey.servlet.WebComponent.(WebComponent.java:302) This seems strange since there is a WEB-INF/classes in my WAR. Regardless, this is not a Boot problem per se, but more of a Jersey issue. From one of the Boot developers: The whole idea with Boot is that servlets are just a transport - they are a means to an end, and hopefully not the only one - the "container" is Spring, not the servlet container. We probably could add some form of support for SCI but only by hacking the containers since the spec really doesn't allow for much control of their lifecycle. It hasn't been a priority so far. Summary I hope this article is useful to see how you to upgrade your Java webapps to use Spring 4 and Spring Boot. I've created a boot-makeover project on GitHub with all the code mentioned. You can also view the commits for each step.

December 13, 2013

by Matt Raible

· 26,715 Views

Populate Your Maven Repo With Mule ESB Libraries

when you build applications based on mule ee (enterprise edition) and you are using maven to build your projects, you will notice you have dependencies to libraries that are not available in the public maven repos. to add these libraries to your local maven repo the mule distribution comes with a script ‘populate_m2_repo’ which is described here how to use it. now that is okay if you are the only developer and you are running your continuous integration on your local machine. in my case we are using artifactory as our company maven repository and also our build server is using it as the maven repo. so what i wanted was not to populate my local repository but the artifactory instance with all mule libraries. to do so i did two things: first make sure that maven is authorised to add libraries to artifactory. you can do this by adding the following to your settings.xml: artifactory admin password second step is to modify the original ‘populate_m2_repo.groovy’ script. replace the following line: mvn(["install:install-file", "-dgroupid=${project.groupid}", "-dartifactid=${project.artifactid}", "-dversion=${version}", "-dpackaging=pom", "-dfile=${localpom.canonicalpath}"]) with mvn(["deploy:deploy-file", "-dgroupid=${project.groupid}", "-dartifactid=${project.artifactid}", "-dversion=${version}", "-dpackaging=pom", "-dfile=${localpom.canonicalpath}", "-drepositoryid=arti", "-durl=http://localhost:8080/artifactory/libs-release-local" ]) and do the same for the line: def args = ["install:install-file", "-dgroupid=${pomprops.groupid}", "-dartifactid=${pomprops.artifactid}", "-dversion=${pomprops.version}", "-dpackaging=jar", "-dfile=${f.canonicalpath}", "-dpomfile=${localpom.canonicalpath}"] by replacing it with: def args = ["deploy:deploy-file", "-dgroupid=${pomprops.groupid}", "-dartifactid=${pomprops.artifactid}", "-dversion=${pomprops.version}", "-dpackaging=jar", "-dfile=${f.canonicalpath}", "-dpomfile=${localpom.canonicalpath}", "-drepositoryid=arti", "-durl=http://localhost:8080/artifactory/libs-release-local" ] now you can run the script with: ./populate_m2_repo bla as you can see it doesn’t really matter what you supply as m2_repo_home here because the libraries are uploaded to artifactory anyway. if you want you can replace the hardcoded url for artifactory in the script with the supplied parameter but in my case this solution was sufficient

December 4, 2013

by $$anonymous$$

· 10,836 Views

Customization in Saas using Plug and Play (PnP) architecture

there are a lot of design patterns, architectures and design concepts that can be applied to technical aspects of implementing a product. for exapmle, we have mvc architecture that isolates the view, controller and model of the application. we have a factory pattern that defines how to create objects and so on. yet, these all limit themselves to the technical aspects of the product, while there exists no definite pre-defined architecture or patterns to design and build "business functions". the plug and play architecture i want to describe in this article is a "functional architecture", that defines a pattern to design the business functionality to get the most out of it. modular architecture is a design technique where functionality of a program is separated into independent, interchangeable modules, such that each module contains everything necessary to execute one aspect of the desired functionality. typically, in modular architecture, separation is done based on technical aspects. for eg., a module is created for database interaction, another is created for logging and so on. the advantage of a modular architecture is that you can easily replace or add components without affecting the rest of the application. the modules clearly define the interfaces that are used to interact with the module and these are tied at compile time to other modules. what if we could extend this architecture to "business functionality" rather than just limiting it to technical aspects of a product. we already design "business functionality" in a modular manner. but, we do not consciously look at it as modules with clear-cut interfaces for inputs and outputs. defining the plug and play architecture the plug and play architecture extends the techniques of modular architecture to business functions. for eg., a module can be created to encompass all functions related to order, while another can be created to encompass all functions related to quote. i will call these independent modules implementing business functions as flows. an application, in this architecture, is defined as a "collection of flows working together". the same advantage of modular architecture applies here, i.e., you can replace, add or remove flows without affecting the rest of the application. an added advantage to extending modules to business functions, is that the flows can be combined in different paths to get different business functionality. further, extending this definition to be able to add, remove or replace modules at "runtime" rather than tying them together at compile time gives us the flexibility to customize the application at runtime without "in-built flags and if clauses". the beauty of this extension of the architecture is that "it gives us the flexibility to create a set of loosely coupled flows that can be bound together at runtime as opposed to coding one tightly integrated application that is rigid and not malleable". to further see how this architecture works, let us consider a set of loosely coupled flows coded as per the diagram: there are 5 flows coded and deployed. each flow exposes an output and accepts a set of inputs. for eg., the "product listing" flow encapsulates the following business functions create products manually by admin import products by admin provide services to view, search and list products expose an output, "product" data that can be linked to other flows another eg., is the cart flow. this encapsulates the following business functions create a new cart add items to the cart check out the cart expose an output, "cart" data that can be linked to other flows accept an input, "cart item" that can be linked to the output from other flows we can connect the output "product" data from "product listing" flow to the "cart item" input of the cart flow. this is not tightly coupled at compile time and is left as an open gate or an input for the cart flow. at runtime the gate is closed or left open as required by the customer. from the diagram we see that we have 5 gates that can be closed. leading us to create different applications at runtime by closing the correct gates. for eg., we can just enable "product listing" for a customer which provides just the basic features of product listing. gates 4 and 6 can be closed to create an application with the features create, list products add products to a cart send an enquiry for the products added to the cart in another variation instead of gates 4 and 6, gate 7 can be closed. the application now has the features create, list products send an enquiry for single products by plugging the "inputs" of flows with different "outputs" from other flows, varied applications can be created. this is called a plug and play architecture. smart is implemented in this architecture to provide a highly flexible container to build saas products. need for plug and play architecture when a product is exposed as a multi-tenanted saas product, it cannot be created with "one size fits all" tenet. to break this tenet we need a flexible, customizable runtime environment where features can be varied based on a customer requirement without affecting other customers serviced by the same application. the plug and play architecture provides this flexibility to the saas.

November 3, 2013

by Raji Sankar

· 6,488 Views

Securing Docker’s Remote API

One piece to Docker that is interesting AMAZING is the Remote API that can be used to programatically interact with docker. I recently had a situation where I wanted to run many containers on a host with a single container managing the other containers through the API. But the problem I soon discovered is that at the moment when you turn networking on it is an all or nothing type of thing… you can’t turn networking off selectively on a container by container basis. You can disable IPv4 forwarding, but you can still reach the docker remote API on the machine if you can guess the IP address of it. One solution I came up with for this is to use nginx to expose the unix socket for docker over HTTPS and utilize client-side ssl certificates to only allow trusted containers to have access. I liked this setup a lot so I thought I would share how it’s done. Disclaimer: assumes some knowledge of docker! Generate The SSL Certificates We’ll use openssl to generate and self-sign the certs. Since this is for an internal service we’ll just sign it ourselves. We also remove the password from the keys so that we aren’t prompted for it each time we start nginx. # Create the CA Key and Certificate for signing Client Certs openssl genrsa -des3 -out ca.key 4096 openssl rsa -in ca.key -out ca.key # remove password! openssl req -new -x509 -days 365 -key ca.key -out ca.crt # Create the Server Key, CSR, and Certificate openssl genrsa -des3 -out server.key 1024 openssl rsa -in server.key -out server.key # remove password! openssl req -new -key server.key -out server.csr # We're self signing our own server cert here. This is a no-no in production. openssl x509 -req -days 365 -in server.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out server.crt # Create the Client Key and CSR openssl genrsa -des3 -out client.key 1024 openssl rsa -in client.key -out client.key # no password! openssl req -new -key client.key -out client.csr # Sign the client certificate with our CA cert. Unlike signing our own server cert, this is what we want to do. openssl x509 -req -days 365 -in client.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out client.crt Another option may be to leave the passphrase in and provide it as an environment variable when running a docker container or through some other means as an extra layer of security. We’ll move ca.crt, server.key and server.crt to /etc/nginx/certs. Setup Nginx The nginx setup for this is pretty straightforward. We just listen for traffic on localhost on port 4242. We require client-side ssl certificate validation and reference the certificates we generated in the previous step. And most important of all, set up an upstream proxy to the docker unix socket. I simply overwrote what was already in /etc/nginx/sites-enabled/default. upstream docker { server unix:/var/run/docker.sock fail_timeout=0; } server { listen 4242; server localhost; ssl on; ssl_certificate /etc/nginx/certs/server.crt; ssl_certificate_key /etc/nginx/certs/server.key; ssl_client_certificate /etc/nginx/certs/ca.crt; ssl_verify_client on; access_log on; error_log /dev/null; location / { proxy_pass http://docker; proxy_redirect off; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; client_max_body_size 10m; client_body_buffer_size 128k; proxy_connect_timeout 90; proxy_send_timeout 120; proxy_read_timeout 120; proxy_buffer_size 4k; proxy_buffers 4 32k; proxy_busy_buffers_size 64k; proxy_temp_file_write_size 64k; } } One important piece to make this work is you should add the user nginx runs as to the docker group so that it can read from the socket. This could be www-data, nginx, or something else! Hack It Up! With this setup and nginx restarted, let’s first run a curl command to make sure that this setup correctly. First we’ll make a call without the client cert to double check that we get denied access then a proper one. # Is normal http traffic denied? curl -v http://localhost:4242/info # How about https, sans client cert and key? curl -v -s -k https://localhost:4242/info # And the final good request! curl -v -s -k --key client.key --cert client.crt https://localhost:4242/info For the first two we should get some run of the mill 400 http response codes before we get a proper JSON response from the final command! Woot! But wait there’s more… let’s build a container that can call the service to launch other containers! For this example we’ll simply build two containers: one that has the client certificate and key and one that doesn’t. The code for these examples are pretty straightforward and to save space I’ll leave the untrusted container out. You can view the untrusted container on github (although it is nothing exciting). First, the node.js application that will connect and display information: https = require 'https' fs = require 'fs' options = host: 172.42.1.62 port: 4242 method: 'GET' path: '/containers/json' key: fs.readFileSync('ssl/client.key') cert: fs.readFileSync('ssl/client.crt') headers: { 'Accept': 'application/json'} # not required, but being semantic here! req = https.request options, (res) -> console.log res req.end() And the Dockerfile used to build the container. Notice we add the client.crt and client.key as part of building it! FROM shykes/nodejs MAINTAINER James R. Carr ADD ssl/client* /srv/app/ssl ADD package.json /srv/app/package.json ADD app.coffee /srv/app/app.coffee RUN cd /srv/app && npm install . CMD cd /srv/app && npm start That’s about it. Run docker build . and docker run -n >IMAGE ID< and we should see a json dump to the console of the actively running containers. Doing the same in the untrusted directory should present us with some 400 error about not providing a client ssl certificate. I’ve shared a project with all this code plus a vagrant file on github for your own prusual. Enjoy!

October 31, 2013

by James Carr

· 14,313 Views

Sparse and Memory-mapped Files

One of the problems with memory-mapped files is that you can’t actually map beyond the end of the file. So you can’t use that to extend your file. I had a thought about and set out to check out what happens when I create a sparse file, a file that only take space when you write to it, and at the same time, map it. As it turns out, this actually works pretty well in practice. You can do so without any issues. Here is how it works: using (var f = File.Create(path)) { int bytesReturned = 0; var nativeOverlapped = new NativeOverlapped(); if (!NativeMethod.DeviceIoControl(f.SafeFileHandle, EIoControlCode.FsctlSetSparse, IntPtr.Zero, 0, IntPtr.Zero, 0, ref bytesReturned, ref nativeOverlapped)) { throw new Win32Exception(); } f.SetLength(1024*1024*1024*64L); } This creates a sparse file that is 64 GB in size. Then we can map it normally: using (var mmf = MemoryMappedFile.CreateFromFile(path)) using (var memoryMappedViewAccessor = mmf.CreateViewAccessor(0, 1024*1024*1024*64L)) { for (long i = 0; i < memoryMappedViewAccessor.Capacity; i += buffer.Length) { memoryMappedViewAccessor.WriteArray(i, buffer, 0, buffer.Length); } } And then we can do stuff to it. And that includes writing to yet-unallocated parts of the file. This also means that you don’t have to worry about writing past the end of the file, the OS will take care of all of that for you. Happy happy, joy joy, etc. There is one problem with this method, however. It means that you have a 64 GB file, but you don’t have that much allocated. What that means in turn is that you might not have that much space available for the file. Which brings up an interesting question, what happens when you are trying to commit a new page, and the disk is out of space? Using file I/O you would get an I/O error with the right code. But when using memory mapped files, the error would actually turn up during access, which can happen pretty much anywhere. It also means that it is a Standard Exception Handling error in Windows, which requires special treatment. To test this out, I wrote the following so it would write to a disk that had only about 50 GB free. I wanted to know what would happen when it ran out of space. That is actually something that happens, and we need to be able to address this issue robustly. The kicker is that this might actually happen at any time, so that would really result is some… interesting behavior with regards to robustness. In other words, I don’t think that this is a viable option, it is a really cool trick, but I don’t think it is a very well thought out option. By the way, the result of my experiment was that we had an effectively a frozen process. No errors, nothing, just a hung. Also, I am pretty sure that WriteArray() is really slow, but I’ll check this out at another pointer in time.

October 1, 2013

by Oren Eini

· 8,160 Views

ElasticSearch: Java API

ElasticSearch provides Java API, thus it executes all operations asynchronously by using client object.

September 30, 2013

by Hüseyin Akdoğan

CORE

· 137,598 Views · 4 Likes

API Gateway and API Portal - The pillars of API Management and the evolution of SOA

API Management solutions must combine an API Portal (for signing up developers) with an API Gateway (to link back to the enterprise). But where do these come from, and what is the relationship with SOA? To answer these questions, first let's look at a bit of history: In the 2000's, we had the SOA Gateway and the SOA Registry, working hand-in-hand. This was "SOA Governance". The SOA Registry (with a Repository) was intended to be the "central store of truth" for information about Web Services. It was often the public face of SOA Governance, the part which people could see. Usually the services in the registry took the form of heavyweight SOAP services, defined by WSDLs. The problem was that developers were often forced to register their SOAP services in the registry, rather than feeling that it was something beneficial to them. Browsing the registry was also a chore, involving the use of UDDI, also a heavyweight protocol (in fact, it was built on SOAP). Fast-forward to the current decade, and we find that the SOA Registry has been replaced by the API Portal. An API portal is also the "central store of truth", but now it includes REST APIs definitions (usually expressed using a Swagger-type format) as well as SOAP services. The API Portal is designed to be useful and helpful to developers who wish to build apps, rather than feeling like a chore to use. The lesson of SOA was that an attitude of "If we build it, they will come" (or "If we put it in the SOA Registry, people will use it") does not work. You have to make it into a pleasant experience for developers. API portals work for the very reason that SOA registries did not work: usability. Just like the SOA Gateway worked with the SOA Registry, so the API Gateway works hand-in-hand with the API Portal. Together, the combination of the API Portal with the API Gateway constitutes "API Management". The API Portal is for developers to sign up to use APIs, receive API Keys and quotas, and the API Gateway operates at runtime, managing the API Key usage and enforcing the API usage quotas. The API Gateway also performs the very important task of bridging from the technologies used by API clients (REST, OAuth) to the technologies used in the enterprise (Kerberos, SAML, or proprietary identity tokens such as CA SiteMinder smsession tokens). For more on this bridging, check out my webinar with Jason Cardinal from Identica tomorrow on "Bridging APIs to Enterprise Infrastructure". Gartner defines the combination of SOA Governance and API Management as "Application Services Governance". I'm proud to say that Axway (which acquired Vordel in 2012) is recognized by Gartner as a Leader in the category of Application Services Governance. We've seen an evolution of technologies (SOAP to REST) and approach (the UDDI registry to the web-based API Portal) in the journey from SOA Governance to API Management. From 30,000 feet, SOA Governance and API Management might look similar, but the new approach of API Management has already outshone SOA. The API Gateway and API Portal are key to this.

September 3, 2013

by Mitch Pronschinske

· 7,874 Views

OpenStack Savanna: Fast Hadoop Cluster Provisioning on OpenStack

introduction openstack is one of the most popular open source cloud computing projects to provide infrastructure as a service solution. its key components are compute (nova), networking (neutron, formerly known as quantum), storage (object and block storage, swift and cinder, respectively), openstack dashboard (horizon), identity service (keystone) and image service (glance). there are other official incubated projects like metering (celiometer) and orchestration and service definition (heat). savanna is a hadoop as a service for openstack introduced by mirantis . it is still in an early phase (version .02 was released in summer 2013) and according to its roadmap version 1.0 is targeted for official openstack incubation. in principle, heat also could be used for hadoop cluster provisioning but savanna is especially tuned for providing hadoop-specific api functionality while heat is meant to be used for generic purposes. savanna architecture savanna is integrated with the core openstack components such as keystone, nova, glance, swift and horizon. it has a rest api that supports the hadoop cluster provisioning steps. savanna api is implemented as a wsgi server that, by default, listens to port 8386. in addition, savanna can also be integrated with horizon, the openstack dashboard to create a hadoop cluster from the management console. savanna also comes with a vanilla plugin that deploys a hadoop cluster image. the standard out-of-the-box vanilla plugin supports hadoop 1.1.2 version. installing savanna the simplest option to try out savanna is to use devstack in a virtual machine. i was using an ubuntu 12.04 virtual instance in my tests. in that environment we need to execute the following commands to install devstack and savanna api: $ sudo apt-get install git-core $ git clone https://github.com/openstack-dev/devstack.git $ vi localrc # edit localrc admin_password=nova mysql_password=nova rabbit_password=nova service_password=$admin_password service_token=nova # enable swift enabled_services+=,swift swift_hash=66a3d6b56c1f479c8b4e70ab5c2000f5 swift_replicas=1 swift_data_dir=$dest/data # force checkout prerequsites # force_prereq=1 # keystone is now configured by default to use pki as the token format which produces huge tokens. # set uuid as keystone token format which is much shorter and easier to work with. keystone_token_format=uuid # change the floating_range to whatever ips vm is working in. # in nat mode it is subnet vmware fusion provides, in bridged mode it is your local network. floating_range=192.168.55.224/27 # enable auto assignment of floating ips. by default savanna expects this setting to be enabled extra_opts=(auto_assign_floating_ip=true) # enable logging screen_logdir=$dest/logs/screen $ ./stack.sh # this will take a while to execute $ sudo apt-get install python-setuptools python-virtualenv python-dev $ virtualenv savanna-venv $ savanna-venv/bin/pip install savanna $ mkdir savanna-venv/etc $ cp savanna-venv/share/savanna/savanna.conf.sample savanna-venv/etc/savanna.conf # to start savanna api: $ savanna-venv/bin/python savanna-venv/bin/savanna-api --config-file savanna-venv/etc/savanna.conf to install savanna ui integrated with horizon, we need to run the following commands: $ sudo pip install savanna-dashboard $ cd /opt/stack/horizon/openstack-dashboard $ vi settings.py horizon_config = { 'dashboards': ('nova', 'syspanel', 'settings', 'savanna'), installed_apps = ( 'savannadashboard', .... $ cd /opt/stack/horizon/openstack-dashboard/local $ vi local_settings.py savanna_url = 'http://localhost:8386/v1.0' $ sudo service apache2 restart provisioning a hadoop cluster as a first step, we need to configure keystone-related environment variables to get the authentication token: ubuntu@ip-10-59-33-68:~$ vi .bashrc $ export os_auth_url=http://127.0.0.1:5000/v2.0/ $ export os_tenant_name=admin $ export os_username=admin $ export os_password=nova ubuntu@ip-10-59-33-68:~$ source .bashrc ubuntu@ip-10-59-33-68:~$ ubuntu@ip-10-59-33-68:~$ env | grep os os_password=nova os_auth_url=http://127.0.0.1:5000/v2.0/ os_username=admin os_tenant_name=admin ubuntu@ip-10-59-33-68:~$ keystone token-get +-----------+----------------------------------+ | property | value | +-----------+----------------------------------+ | expires | 2013-08-09t20:31:12z | | id | bdb582c836e3474f979c5aa8f844c000 | | tenant_id | 2f46e214984f4990b9c39d9c6222f572 | | user_id | 077311b0a8304c8e86dc0dc168a67091 | +-----------+----------------------------------+ $ export auth_token="bdb582c836e3474f979c5aa8f844c000" $ export tenant_id="2f46e214984f4990b9c39d9c6222f572" then we need to create the glance image that we want to use for our hadoop cluster. in our example we have used mirantis's vanilla image but we can also build our own image: $ wget http://savanna-files.mirantis.com/savanna-0.2-vanilla-1.1.2-ubuntu-12.10.qcow2 $ glance image-create --name=savanna-0.2-vanilla-hadoop-ubuntu.qcow2 --disk-format=qcow2 --container-format=bare < ./savanna-0.2-vanilla-1.1.2-ubuntu-12.10.qcow2 ubuntu@ip-10-59-33-68:~/devstack$ glance image-list +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ | id | name | disk format | container format | size | status | +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ | d0d64f5c-9c15-4e7b-ad4c-13859eafa7b8 | cirros-0.3.1-x86_64-uec | ami | ami | 25165824 | active | | fee679ee-e0c0-447e-8ebd-028050b54af9 | cirros-0.3.1-x86_64-uec-kernel | aki | aki | 4955792 | active | | 1e52089b-930a-4dfc-b707-89b568d92e7e | cirros-0.3.1-x86_64-uec-ramdisk | ari | ari | 3714968 | active | | d28051e2-9ddd-45f0-9edc-8923db46fdf9 | savanna-0.2-vanilla-hadoop-ubuntu.qcow2 | qcow2 | bare | 551699456 | active | +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ $ export image_id=d28051e2-9ddd-45f0-9edc-8923db46fdf9 then we have installed httpie , an open source http client that can be used to send rest requests to savanna api: $ sudo pip install httpie from now on we will use httpie to send savanna commands. we need to register the image with savanna: $ export savanna_url="http://localhost:8386/v1.0/$tenant_id" $ http post $savanna_url/images/$image_id x-auth-token:$auth_token username=ubuntu http/1.1 202 accepted content-length: 411 content-type: application/json date: thu, 08 aug 2013 21:28:07 gmt { "image": { "os-ext-img-size:size": 551699456, "created": "2013-08-08t21:05:55z", "description": "none", "id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "metadata": { "_savanna_description": "none", "_savanna_username": "ubuntu" }, "mindisk": 0, "minram": 0, "name": "savanna-0.2-vanilla-hadoop-ubuntu.qcow2", "progress": 100, "status": "active", "tags": [], "updated": "2013-08-08t21:28:07z", "username": "ubuntu" } } $ http $savanna_url/images/$image_id/tag x-auth-token:$auth_token tags:='["vanilla", "1.1.2", "ubuntu"]' http/1.1 202 accepted content-length: 532 content-type: application/json date: thu, 08 aug 2013 21:29:25 gmt { "image": { "os-ext-img-size:size": 551699456, "created": "2013-08-08t21:05:55z", "description": "none", "id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "metadata": { "_savanna_description": "none", "_savanna_tag_1.1.2": "true", "_savanna_tag_ubuntu": "true", "_savanna_tag_vanilla": "true", "_savanna_username": "ubuntu" }, "mindisk": 0, "minram": 0, "name": "savanna-0.2-vanilla-hadoop-ubuntu.qcow2", "progress": 100, "status": "active", "tags": [ "vanilla", "ubuntu", "1.1.2" ], "updated": "2013-08-08t21:29:25z", "username": "ubuntu" } } then we need to create a nodegroup templates (json files) that will be sent to savanna. there is one template for the master nodes ( namenode , jobtracker ) and another template for the worker nodes such as datanode and tasktracker . the hadoop version is 1.1.2. $ vi ng_master_template_create.json { "name": "test-master-tmpl", "flavor_id": "2", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_processes": ["jobtracker", "namenode"] } $ vi ng_worker_template_create.json { "name": "test-worker-tmpl", "flavor_id": "2", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_processes": ["tasktracker", "datanode"] } $ http $savanna_url/node-group-templates x-auth-token:$auth_token < ng_master_template_create.json http/1.1 202 accepted content-length: 387 content-type: application/json date: thu, 08 aug 2013 21:58:00 gmt { "node_group_template": { "created": "2013-08-08t21:58:00", "flavor_id": "2", "hadoop_version": "1.1.2", "id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "name": "test-master-tmpl", "node_configs": {}, "node_processes": [ "jobtracker", "namenode" ], "plugin_name": "vanilla", "updated": "2013-08-08t21:58:00", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } } $ http $savanna_url/node-group-templates x-auth-token:$auth_token < ng_worker_template_create.json http/1.1 202 accepted content-length: 388 content-type: application/json date: thu, 08 aug 2013 21:59:41 gmt { "node_group_template": { "created": "2013-08-08t21:59:41", "flavor_id": "2", "hadoop_version": "1.1.2", "id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "name": "test-worker-tmpl", "node_configs": {}, "node_processes": [ "tasktracker", "datanode" ], "plugin_name": "vanilla", "updated": "2013-08-08t21:59:41", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } } the next step is to define the cluster template: $ vi cluster_template_create.json { "name": "demo-cluster-template", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_groups": [ { "name": "master", "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "count": 1 }, { "name": "workers", "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "count": 2 } ] } $ http $savanna_url/cluster-templates x-auth-token:$auth_token < cluster_template_create.json http/1.1 202 accepted content-length: 815 content-type: application/json date: fri, 09 aug 2013 07:04:24 gmt { "cluster_template": { "anti_affinity": [], "cluster_configs": {}, "created": "2013-08-09t07:04:24", "hadoop_version": "1.1.2", "id": "{ "name": "cluster-1", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "cluster_template_id" : "64c4117b-acee-4da7-937b-cb964f0471a9", "user_keypair_id": "stack", "default_image_id": "3f9fc974-b484-4756-82a4-bff9e116919b" }", "name": "demo-cluster-template", "node_groups": [ { "count": 1, "flavor_id": "2", "name": "master", "node_configs": {}, "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "node_processes": [ "jobtracker", "namenode" ], "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 }, { "count": 2, "flavor_id": "2", "name": "workers", "node_configs": {}, "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "node_processes": [ "tasktracker", "datanode" ], "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } ], "plugin_name": "vanilla", "updated": "2013-08-09t07:04:24" } } now we are ready to create the hadoop cluster: $ vi cluster_create.json { "name": "cluster-1", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "cluster_template_id" : "64c4117b-acee-4da7-937b-cb964f0471a9", "user_keypair_id": "savanna", "default_image_id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9" } $ http $savanna_url/clusters x-auth-token:$auth_token < cluster_create.json http/1.1 202 accepted content-length: 1153 content-type: application/json date: fri, 09 aug 2013 07:28:14 gmt { "cluster": { "anti_affinity": [], "cluster_configs": {}, "cluster_template_id": "64c4117b-acee-4da7-937b-cb964f0471a9", "created": "2013-08-09t07:28:14", "default_image_id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "hadoop_version": "1.1.2", "id": "d919f1db-522f-45ab-aadd-c078ba3bb4e3", "info": {}, "name": "cluster-1", "node_groups": [ { "count": 1, "created": "2013-08-09t07:28:14", "flavor_id": "2", "instances": [], "name": "master", "node_configs": {}, "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "node_processes": [ "jobtracker", "namenode" ], "updated": "2013-08-09t07:28:14", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 }, { "count": 2, "created": "2013-08-09t07:28:14", "flavor_id": "2", "instances": [], "name": "workers", "node_configs": {}, "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "node_processes": [ "tasktracker", "datanode" ], "updated": "2013-08-09t07:28:14", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } ], "plugin_name": "vanilla", "status": "validating", "updated": "2013-08-09t07:28:14", "user_keypair_id": "savanna" } } after a while we can run the nova command to check if the instances are created and running: $ nova list +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ | id | name | status | task state | power state | networks | +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ | 1a9f43bf-cddb-4556-877b-cc993730da88 | cluster-1-master-001 | active | none | running | private=10.0.0.2, 192.168.55.227 | | bb55f881-1f96-4669-a94a-58cbf4d88f39 | cluster-1-workers-001 | active | none | running | private=10.0.0.3, 192.168.55.226 | | 012a24e2-fa33-49f3-b051-9ee2690864df | cluster-1-workers-002 | active | none | running | private=10.0.0.4, 192.168.55.225 | +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ now we can log in to the hadoop master instance and run the required hadoop commands: $ ssh -i savanna.pem [email protected] $ sudo chmod 777 /usr/share/hadoop $ sudo su hadoop $ cd /usr/share/hadoop $ hadoop jar hadoop-example-1.1.2.jar pi 10 100 savanna ui via horizon in order to create nodegroup templates, cluster templates and the cluster itself we used a command line tool -- httpie -- to send rest api calls. the same functionality is also available via horizon, the standard openstack dashboard. first we need to register the image with savanna: then we need to create the nodegroup templates: after that we have to create the cluster template: and finally we have to create the cluster:

August 20, 2013

by Istvan Szegedi

· 9,497 Views

Apache Camel / Talend ESB: Your Management and Monitoring Options

A question every customer asks me is: How can you manage and monitor integration routes implemented with Apache Camel (http://camel.apache.org/) and / or Talend ESB (http://en.talend.com/products/esb) (which is based on Apache Camel and also available as open source version)? This blog post will show different alternatives to answer this question. The good news first: As Apache Camel and Talend ESB are based on open standards, you can use your own frameworks and tools if tooling of the product is not sufficient. So, I will not talk just about features of Apache Camel or Talend ESB, but also about additional options. Except for Talend Administration Center (user interface on top of Talend ESB’s service monitoring and clustering features), all tools are open source. Talend Administration Center is only available in Talend’s enterprise editions. Management of Apache Camel and Talend ESB Let’s begin with management of Apache Camel and Talend ESB routes. I define management as “controlling server instances, services and integration routes”. Besides Talend Administration Center, a Talend-specific web application, several other options exist. I will show jconsole and hawtio. However, as Apache Camel and Talend ESB are standards-based, you can use any other management tool, which supports the Java Management Extensions (JMX) standard, e.g. Hyperic, Nagios or IBM Tivoli. Talend Administration Center Talend Administration Center (http://www.talendforge.org/tutorials/tutorial.php?idTuto=54) is a web application to configure and monitor different server instances. Besides, you can install, start and stop Camel routes and SOAP / REST web services. Provisioning (i.e. deploying and managing routes and services on different server instances) can be automated easily because Talend ESB container supports Apache Karaf Cellar, an OSGi addon which is built exactly for this reason. Therefore, you can create so-called “virtual servers” within Talend Administration Center. Service locator is another feature, which is based on Apache Zookeeper to implement load balancing and high availability under the hood. Administrators can monitor the state of server instances, Camel routes and REST / SOAP web services within Talend Administration Center. Since Talend ESB 5.3, Talend Administration Center contains other important enterprise features, especially management of Talend’s new service registry (based on Apache Jackrabbit) for service discovery and policy enforcement, and its new identity management (based on Apache Syncope) for authorization. Let’s now talk about two other alternatives for managing Camel and Talend ESB via JMX standard interface. jconsole jconsole is a graphical monitoring tool to monitor Java Virtual Machine (JVM) and java applications both on a local or remote machine. jconsole uses underlying features of JVM to provide information on performance and resource consumption of applications running on the Java platform using JMX technology. jconsole comes as part of Java Development Kit (JDK) and the graphical console can be started using “jconsole” command. Apache Camel and Talend ESB offer several JMX interfaces to monitor and manage integration routes, endpoints, error handlers, etc. jconsole has a very simple user interface, but is sufficient for some projects. hawtio hawtio (http://hawt.io) is a highly modular single-page JavaScript application that fetches JMX metric data using Jolokia, a JMX-HTTP bridge. hawtio is a relatively new tool “on the market”. Nevertheless, it already contains plugins for several different technologies and frameworks such as Apache Camel (integration framework), Apache Karaf (OSGi container), Apache ActiveMQ (JMS messaging) and Apache Tomcat (Java EE web container). As Talend ESB uses respectively supports all of these technologies and frameworks, hawtio is a perfect match, too! hawtio is not “just another jconsole”, but offers some very cool additional features, for example a graphical presentation of Camel routes including real-time monitoring. hawtio was created by James Strachan (whom else?!) who is the creator of Apache Camel and some other great frameworks. Therefore, at the moment, most contributions come from JBoss (where James works after JBoss’ acquisition of FuseSource last year). In the future, I hope (and expect) to see other companies supporting this great open source project, too. Monitoring of Apache Camel and Talend ESB Let’s continue with some monitoring alternatives for Apache Camel and Talend ESB. I define monitoring as “searching and (visually) analyzing state of integration routes and services at runtime”. I will show you Talend’s Service Activity Monitoring for SOAP and REST web services. Besides, you will learn about logstash and Kibana, two awesome open source tools (based on ElasticSearch) for analysis and visualization of any (distributed) log files. Service Activity Monitoring (Part of Talend Administration Center) Service Activity Monitoring (SAM, http://www.liquid-reality.de/display/liquid/Talend+ESB+Service+Activity+Monitoring) is part of Talend Administration Center and allows for logging and monitoring SOAP and REST web service calls. For example, SAM could be used for collecting usage statistics and fault monitoring. SAM is a part of Talend’s OSGi container and can be installed via a single command. It can be activated for SOAP and REST web service providers and consumers within Talend Studio via a single click. So, without any additional efforts, you can monitor your SOAP and REST web services. You can sort by several different parameters and step into details within Talend Administration Center, i.e. you can look at every incoming request message and outgoing response message. logstash logstash (http://logstash.net) is a tool for managing events and logs. You can use it to collect different (distributed) logs, parse them, and store them for later use. Speaking of searching, logstash comes with a simple, but fine web interface for searching and drilling into all of your logs. It uses ElasticSearch under the hood. So, you can easily query through all your logs for specific errors or business analytics (e.g. searching for all lines matching an unique order id). I never used logstash before. I really love it now! However, documentation is not perfect for getting started. So, two hints for newbies: 1) logstash uses ElasticSearch under the hood. However, you do not have to install it separately. It is included and can be used as embedded server. 2) I thought logstash can process every log file by default. However, your log files have to contain a specific structure or you have to configure logstash to know your custom log structure. Especially, if logstash does not know the correct structure of your timestamp, most features will not work. This was not obvious to me. It took me some time to find out that nothing is happening due to wrong log structures (no errors or exceptions came up, just nothing happened at all). Apache Camel and Talend ESB do not offer support for logstash implicitly. One solution is to use Enterprise Integration Patterns (http://www.eaipatterns.com) such as Wire Tap or Message Store to implement logging. Afterwards, logstash can access and analyse these logs. logstash is really awesome, but its real power can be reached by combining it with Kibana. So let’s go to the next section. Kibana Kibana (http://three.kibana.org ) is a browser based analytics and search interface to logstash and other timestamped data sets stored in ElasticSearch. Kibana strives to be easy to get started with, while also being flexible and powerful, just like logstash and ElasticSearch. Main difference to logstash is a much more powerful HTML5 based web interface. You can use multiple concurrent search inputs highlight to drill down bar charts create line charts, stacked, unstacked, filled or unfilled, with or without points create Pie and donut charts that compare top terms or the results of multiple queries create custom dashboards with multiple charts and much more… Again, some hints for newbies (as documentation for installation and usage is very, very short without many important details, unfortunately): 1) Kibana uses logstash and ElasticSearch under the hood. Both are not included, but must be installed separately. As logstash contains an embedded ElasticSearch server, you can get started quickly by just installing logstash in advance. 2) Kibana installation is tricky! If you google, go to Kibana website and navigate to its installation guide (http://kibana.org/intro.html), you will have to install it via Ruby. As always on my Mac, I had some trouble installing Ruby software (version and compiler problems, etc.). But finally I made it… Just to find out that this is an old version. Accidentally, I came to www.three.kibana.org/intro.html, which is the real up-to-date version. Kibana 3 !!! Here, you do NOT need to install Kibana via Ruby. Just download the web application and deploy it to your web container (e.g. Jetty, Tomcat) or Java EE container (e.g. Glassfish, WebSphere). This is so simple and just works with no further configuration. You can use Kibana to analyse and monitor your data as you do with logstash, i.e. you still have to use Enterprise Integration Patterns to log your data first. Afterwards, you can use Kibana’s web interface, which is much more powerful and comfortable than logstash. Finally, great news for Talend users: Kibana (including ElasticSearch) will be integrated into Talend 5.4 for event logging, complex event processing and visual monitoring. I am really looking forward to this awesome feature! Summary Your mission-critical projects need management and monitoring. Today, this is not just possible with complex and expensive tools of large vendors, but also with lightweight frameworks and products such as Apache Camel or Talend ESB. Management and monitoring of Apache Camel and Talend ESB integration routes and SOAP / REST web services is very easy. Several great alternatives are available. Once again, vendor-independent standards such as JXM show their benefits. You can use tooling which is available implicitly in the framework respectively product, or you can integrate your existing tooling very easily. Content from my blog (www.kai-waehner.de/blog). Feedback and other experiences or alternatives appreciated via comments or directly via @KaiWaehner / [email protected] / LinkedIn / Xing. Best regards, Kai Wähner

July 16, 2013

by Kai Wähner

CORE

· 26,484 Views

Akka vs Storm

I was recently working a bit with Twitter’s Storm, and it got me wondering, how does it compare to another high-performance, concurrent-data-processing framework, Akka. WHAT’S AKKA AND STORM? Let’s start with a short description of both systems. Storm is a distributed, real-time computation system. On a Storm cluster, you execute topologies, which process streams of tuples (data). Each topology is a graph consisting of spouts (which produce tuples) and bolts (which transform tuples). Storm takes care of cluster communication, fail-over and distributing topologies across cluster nodes. Akka is a toolkit for building distributed, concurrent, fault-tolerant applications. In an Akka application, the basic construct is an actor; actors process messages asynchronously, and each actor instance is guaranteed to be run using at most one thread at a time, making concurrency much easier. Actors can also be deployed remotely. There’s a clustering module coming, which will handle automatic fail-over and distribution of actors across cluster nodes. Both systems scale very well and can handle large amounts of data. But when to use one, and when to use the other? There’s another good blog post on the subject, but I wanted to take the comparison a bit further: let’s see how elementary constructs in Storm compare to elementary constructs in Akka. COMPARING THE BASICS Firstly, the basic unit of data in Storm is a tuple. A tuple can have any number of elements, and each tuple element can be any object, as long as there’s a serializer for it. In Akka, the basic unit is amessage, which can be any object, but it should be serializable as well (for sending it to remote actors). So here the concepts are almost equivalent. Let’s take a look at the basic unit of computation. In Storm, we have components: bolts andsprouts. A bolt can be any piece of code, which does arbitrary processing on the incoming tuples. It can also store some mutable data, e.g. to accumulate results. Moreover, bolts run in a single thread, so unless you start additional threads in your bolts, you don’t have to worry about concurrent access to the bolt’s data. This is very similar to an actor, isn’t it? Hence a Storm bolt/sprout corresponds to an Akka actor. How do these two compare in detail? Actors can receive arbitrary messages; bolts can receive arbitrary tuples. Both are expected to do some processing basing on the data received. Both have internal state, which is private and protected from concurrent thread access. ACTORS & BOLTS: DIFFERENCES One crucial difference is how actors and bolts communicate. An actor can send a message to any other actor, as long as it has the ActorRef (and if not, an actor can be looked up by-name). It can also send back a reply to the sender of the message that is being handled. Storm, on the other hand is one-way. You cannot send back messages; you also can’t send messages to arbitrary bolts. You can also send a tuple to a named channel (stream), which will cause the tuple (message) to be broadcast to all listeners, defined in the topology. (Bolts also ack messages, which is also a form of communication, to the ackers.) In Storm, multiple copies of a bolt’s/sprout’s code can be run in parallel (depending on theparallelism setting). So this corresponds to a set of (potentially remote) actors, with a load-balancer actor in front of them; a concept well-known from Akka’s routing. There are a couple of choices on how tuples are routed to bolt instances in Storm (random, consistent hashing on a field), and this roughly corresponds to the various router options in Akka (round robin, consistent hashing on the message). There’s also a difference in the “weight” of a bolt and an actor. In Akka, it is normal to have lots of actors (up to millions). In Storm, the expected number of bolts is significantly smaller; this isn’t in any case a downside of Storm, but rather a design decision. Also, Akka actors typically share threads, while each bolt instance tends to have a dedicated thread. OTHER FEATURES Storm also has one crucial feature which isn’t implemented in Akka out-of-the-box: guaranteed message delivery. Storm tracks the whole tree of tuples that originate from any tuple produced by a sprout. If all tuples aren’t acknowledged, the tuple will be replayed. Also the cluster management of Storm is more advanced (automatic fail-over, automatic balancing of workers across the cluster; based on Zookeeper); however the upcoming Akka clustering module should address that. Finally, the layout of the communication in Storm – the topology – is static and defined upfront. In Akka, the communication patterns can change over time and can be totally dynamic; actors can send messages to any other actors, or can even send addresses (ActorRefs). So overall, Storm implements a specific range of usages very well, while Akka is more of a general-purpose toolkit. It would be possible to build a Storm-like system on top of Akka, but not the other way round (at least it would be very hard).

June 26, 2013

by Adam Warski

· 21,306 Views

Automating Nginx Reverse Proxy Configuration

It’s really nice if you can decouple your external API from the details of application segregation and deployment. In a previous post I explained some of the benefits of using a reverse proxy. On my current project we’ve building a distributed service oriented architecture that also exposes an HTTP API, and we’re using a reverse proxy to route requests addressed to our API to individual components. We have chosen the excellent Nginx web server to serve as our reverse proxy; it’s fast, reliable and easy to configure. We use it to aggregate multiple services exposing HTTP APIs into a single URL space. So, for example, when you type: http://api.example.com/product/pinstripe_suit It gets routed to: http://10.0.1.101:8001/product/pinstripe_suit But when you go to: http://api.example.com/customer/103474783 It gets routed to http://10.0.1.104:8003/customer/103474783 To the consumer of the API it appears that they are exploring a single URL space (http://api.example.com/blah/blah), but behind the scenes the different top level segments of the URL route to different back end servers. /product/… routes to 10.0.1.101:8001, but /customer/… routes to 10.0.1.104:8003. We also want this to be self-configuring. So, say I want to create a new component of the system that records stock levels. Rather than extending an existing component, I want to be able to write a stand-alone executable or service that exposes an HTTP endpoint, have it be automatically deployed to one of the hosts in my cloud infrastructure, and have Nginx automatically route requests addressed http://api.example.com/stock/whatever to my new component. We also want to load balance these back end services. We might want to deploy several instances of our new stock API and have Nginx automatically round robin between them. We call each top level segment ( /stock, /product, /customer ) a claim. A component publishes an ‘AddApiClaim’ message over RabbitMQ when it comes on line. This message has 3 fields: ‘Claim', ‘ipAddress’, and ‘PortNumber’. We have a special component, ProxyAutomation, that subscribes to these messages and rewrites the Nginx configuration as required. It uses SSH and SCP to log into the Nginx server, transfer the various configuration files, and instruct Nginx to reload its configuration. We use the excellent SSH.NET library to automate this. A really nice thing about Nginx configuration is wildcard includes. Take a look at our top level configuration file: ... http { include /etc/nginx/mime.types; default_type application/octet-stream; log_format main '$remote_addr - $remote_user [$time_local] "$request" ' '$status $body_bytes_sent "$http_referer" ' '"$http_user_agent" "$http_x_forwarded_for"'; access_log /var/log/nginx/access.log main; sendfile on; keepalive_timeout 65; include /etc/nginx/conf.d/*.conf; } Line 16 says, take any *.conf file in the conf.d directory and add it here. Inside conf.d is a single file for all api.example.com requests: include /etc/nginx/conf.d/api.example.com.conf.d/upstream.*.conf; server { listen 80; server_name api.example.com; include /etc/nginx/conf.d/api.example.com.conf.d/location.*.conf; location / { root /usr/share/nginx/api.example.com; index index.html index.htm; } } This is basically saying listen on port 80 for any requests with a host header ‘api.example.com’. This has two includes. The first one at line 1, I’ll talk about later. At line 7 it says ‘take any file named location.*.conf in the subdirectory ‘api.example.com.conf.d’ and add it to the configuration. Our proxy automation component adds new components (AKA API claims) by dropping new location.*.conf files in this directory. For example, for our stock component it might create a file, ‘location.stock.conf’, like this: location /stock/ { proxy_pass http://stock; } This simply tells Nginx to proxy all requests addressed to api.example.com/stock/… to the upstream servers defined at ‘stock’. This is where the other include mentioned above comes in, ‘upstream.*.conf’. The proxy automation component also drops in a file named upstream.stock.conf that looks something like this: upstream stock { server 10.0.0.23:8001; server 10.0.0.23:8002; } This tells Nginx to round-robin all requests to api.example.com/stock/ to the given sockets. In this example it’s two components on the same machine (10.0.0.23), one on port 8001 and the other on port 8002. As instances of the stock component get deployed, new entries are added to upstream.stock.conf. Similarly, when components get uninstalled, the entry is removed. When the last entry is removed, the whole file is also deleted. This infrastructure allows us to decouple infrastructure configuration from component deployment. We can scale the application up and down by simply adding new component instances as required. As a component developer, I don’t need to do any proxy configuration, just make sure my component publishes add and remove API claim messages and I’m good to go.

June 19, 2013

by Mike Hadlow

· 59,153 Views

Create a Couchbase Cluster with Ansible

[This blog was syndicated from http://blog.grallandco.com] Introduction When I was looking for a more effective way to create my cluster I asked some sysadmins which tools I should use to do it. The answer I got during OSDC was not Puppet, nor Chef, but wasAnsible. This article shows you how you can easily configure and create a Couchbase cluster deployed and many linux boxes...and the only thing you need on these boxes is an SSH Server! Thanks to Jan-Piet Mens that was one of the person that convinced me to use Ansible and answered questions I had about Ansible. You can watch the demonstration below, and/or look at all the details in the next paragraph. Ansible Ansible is an open-source software that allows administrator to configure and manage many computers over SSH. I won't go in all the details about the installation, just follow the steps documented in the Getting Started Guide. As you can see from this guide, you just need Python and few other libraries and clone Ansible project from Github. So I am expecting that you have Ansible working with your various servers on which you want to deploy Couchbase. Also for this first scripts I am using root on my server to do all the operations. So be sure you have register the root ssh keys to your administration server, from where you are running the Ansible scripts. Create a Couchbase Cluster So before going into the details of the Ansible script it is interesting to explain how you create a Couchbase Cluster. So here are the 5 steps to create and configure a cluster: Install Couchbase on each nodes of the cluster, as documented here. Take one of the node and "initialize" the cluster, using cluster-init command. Add the other nodes to the cluster, using server-add command. Rebalance, using rebalance command. Create a Bucket, using bucket-create command. So the goal now is to create an Ansible Playbook that executes these steps for you. Ansible Playbook for Couchbase The first think you need is to have the list of hosts you want to target, so I have create a hosts file that contains all my server organized in 2 groups: [couchbase-main] vm1.grallandco.com [couchbase-nodes] vm2.grallandco.com vm3.grallandco.com The group [couchbase-main] group is just one of the node that will drive the installation and configuration, as you probably already know, Couchbase does not have any master... All nodes in the cluster are identical. To ease the configuration of the cluster, I have create another file that contains all parameters that must be sent to all the various commands. This file is located in the group_vars/all see the section Splitting Out Host and Group Specific Data in the documentation. # Adminisrator user and password admin_user: Administrator admin_password: password # ram quota for the cluster cluster_ram_quota: 1024 # bucket and replicas bucket_name: ansible bucket_ram_quota: 512 num_replicas: 2 Use this file to configure your cluster. Let's describe the playbook file : - name: Couchbase Installation hosts: all user: root tasks: - name: download Couchbase package get_url: url=http://packages.couchbase.com/releases/2.0.1/couchbase-server-enterprise_x86_64_2.0.1.deb dest=~/. - name: Install dependencies apt: pkg=libssl0.9.8 state=present - name: Install Couchbase .deb file on all machines shell: dpkg -i ~/couchbase-server-enterprise_x86_64_2.0.1.deb As expected, the installation has to be done on all servers as root then we need to execute 3 tasks: Download the product, the get_url command will only download the file if not already present Install the dependencies with the apt command, the state=present allows the system to only install this package if not already present Install Couchbase with a simple shell command. (here I am not checking if Couchbase is already installed) So we have now installed Couchbase on all the nodes. Let's now configure the first node and add the others: - name: Initialize the cluster and add the nodes to the cluster hosts: couchbase-main user: root tasks: - name: Configure main node shell: /opt/couchbase/bin/couchbase-cli cluster-init -c 127.0.0.1:8091 --cluster-init-username=${admin_user} --cluster-init-password=${admin_password} --cluster-init-port=8091 --cluster-init-ramsize=${cluster_ram_quota} - name: Create shell script for configuring main node action: template src=couchbase-add-node.j2 dest=/tmp/addnodes.sh mode=750 - name: Launch config script action: shell /tmp/addnodes.sh - name: Rebalance the cluster shell: /opt/couchbase/bin/couchbase-cli rebalance -c 127.0.0.1:8091 -u ${admin_user} -p ${admin_password} - name: create bucket ${bucket_name} with ${num_replicas} replicas shell: /opt/couchbase/bin/couchbase-cli bucket-create -c 127.0.0.1:8091 --bucket=${bucket_name} --bucket-type=couchbase --bucket-port=11211 --bucket-ramsize=${bucket_ram_quota} --bucket-replica=${num_replicas} -u ${admin_user} -p ${admin_password} Now we need to execute specific taks on the "main" server: Initialization of the cluster using the Couchbase CLI, on line 06 and 07 Then the system needs to ask all other server to join the cluster. For this the system needs to get the various IP and for each IP address execute the add-server command with the IP address. As far as I know it is not possible to get the IP address from the main playbook YAML file, so I ask the system to generate a shell script to add each node and execute the script. This is done from the line 09 to 13. To generate the shell script, I use Ansible Template, the template is available in the couchbase-add-node.j2 file. {% for host in groups['couchbase-nodes'] %} /opt/couchbase/bin/couchbase-cli server-add -c 127.0.0.1:8091 -u ${admin_user} -p ${admin_password} --server-add={{ hostvars[host]['ansible_eth0']['ipv4']['address'] }:8091 --server-add-username=${admin_user} --server-add-password=${admin_password} {% endfor %} As you can see this script loop on each server in the [couchbase-nodes] group and use its IP address to add the node to the cluster. Finally the script rebalance the cluster (line 16) and add a new bucket (line 19). You are now ready to execute the playbook using the following command : ./bin/ansible-playbook -i ./couchbase/hosts ./couchbase/couchbase.yml -vv I am adding the -vv parameter to allow you to see more information about what's happening during the execution of the script. This will execute all the commands described in the playbook, and after few seconds you will have a new cluster ready to be used! You can for example open a browser and go to the Couchase Administration Console and check that your cluster is configured as expected. As you can see it is really easy and fast to create a new cluster using Ansible. I have also create a script to uninstall properly the cluster.. just launch ./bin/ansible-playbook -i ./couchbase/hosts ./couchbase/couchbase-uninstall.yml

June 3, 2013

by Don Pinto

· 5,156 Views · 1 Like

Avro's Built-In Sorting

avro has a little-known gem of a feature which allows you to control which fields in an avro record are used for partitioning , sorting and grouping in mapreduce. the following figure gives a quick refresher as to what these terms mean. oh, and don’t take the placement of the “sorting” literally - sorting actually occurs on both the map and reduce side - but it’s always performed in the context of a specific partition (i.e. for a specific reducer). by default all the fields in an avro map output key are used for partitioning, sorting and grouping in mapreduce. let’s walk through an example and see how this works. you’ll begin with a simple schema github source : {"type": "record", "name": "com.alexholmes.avro.weathernoignore", "doc": "a weather reading.", "fields": [ {"name": "station", "type": "string"}, {"name": "time", "type": "long"}, {"name": "temp", "type": "int"}, {"name": "counter", "type": "int", "default": 0} ] } we’re going to see what happens when we run this code against a small sample data set, which we’ll generate using avro code github source : file input = tmpfolder.newfile("input.txt"); avrofiles.createfile(input, weathernoignore.schema$, arrays.aslist( weathernoignore.newbuilder().setstation("sfo").settime(1).settemp(3).build(), weathernoignore.newbuilder().setstation("iad").settime(1).settemp(1).build(), weathernoignore.newbuilder().setstation("sfo").settime(2).settemp(1).build(), weathernoignore.newbuilder().setstation("sfo").settime(1).settemp(2).build(), weathernoignore.newbuilder().setstation("sfo").settime(1).settemp(1).build() ).toarray()); to understand how avro is partitioning, sorting and grouping the data, we’ll write an identity mapper and reducer, with a small enhancement to the reducer to increment the counter field for each record we see in an individual reducer instance github source : package com.alexholmes.avro.sort.basic; import com.alexholmes.avro.weathernoignore; import org.apache.avro.mapred.avrokey; import org.apache.avro.mapred.avrovalue; import org.apache.avro.mapreduce.avrojob; import org.apache.avro.mapreduce.avrokeyinputformat; import org.apache.avro.mapreduce.avrokeyoutputformat; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.mapper; import org.apache.hadoop.mapreduce.reducer; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.output.fileoutputformat; import java.io.ioexception; public class avrosort { private static class sortmapper extends mapper, nullwritable, avrokey, avrovalue> { @override protected void map(avrokey key, nullwritable value, context context) throws ioexception, interruptedexception { context.write(key, new avrovalue(key.datum())); } } private static class sortreducer extends reducer, avrovalue, avrokey, nullwritable> { @override protected void reduce(avrokey key, iterable> values, context context) throws ioexception, interruptedexception { int counter = 1; for (avrovalue weathernoignore : values) { weathernoignore.datum().setcounter(counter++); context.write(new avrokey(weathernoignore.datum()), nullwritable.get()); } } } public boolean runmapreduce(final job job, path inputpath, path outputpath) throws exception { fileinputformat.setinputpaths(job, inputpath); job.setinputformatclass(avrokeyinputformat.class); avrojob.setinputkeyschema(job, weathernoignore.schema$); job.setmapperclass(sortmapper.class); avrojob.setmapoutputkeyschema(job, weathernoignore.schema$); avrojob.setmapoutputvalueschema(job, weathernoignore.schema$); job.setreducerclass(sortreducer.class); avrojob.setoutputkeyschema(job, weathernoignore.schema$); job.setoutputformatclass(avrokeyoutputformat.class); fileoutputformat.setoutputpath(job, outputpath); return job.waitforcompletion(true); } } if you look at the output of the job below, you’ll see that the output is sorted across all the fields, and that the sorting is in field ordinal order. what this means is that when mapreduce is sorting these records, it compares the station field first, then the time field second, and so on according to the ordering of the fields in the avro schema. this is pretty much what you’d expect if you write your own complex writable type, and your comparator compared all the fields in order. {"station": "iad", "time": 1, "temp": 1, "counter": 1} {"station": "sfo", "time": 1, "temp": 1, "counter": 1} {"station": "sfo", "time": 1, "temp": 2, "counter": 1} {"station": "sfo", "time": 1, "temp": 3, "counter": 1} {"station": "sfo", "time": 2, "temp": 1, "counter": 1} oh, and before we move on notice that the value for the counter field is always 1 , meaning that each reducer was only fed a single key/vaue pair, which makes sense since our identity mapper only emitted a single value for each key, the keys are unique, and the mapreduce partitioner, sorter and grouper were using all the fields in the record. excluding fields for sorting avro gives us the ability to indicate that specific fields should be ignored when performing ordering functions. in mapreduce these fields are ignored for sorting/partitioning and grouping in mapreduce, which basically means that we have the ability to perform secondary sorting. let’s examine the following schema github source : {"type": "record", "name": "com.alexholmes.avro.weather", "doc": "a weather reading.", "fields": [ {"name": "station", "type": "string"}, {"name": "time", "type": "long"}, {"name": "temp", "type": "int", "order": "ignore"}, {"name": "counter", "type": "int", "order": "ignore", "default": 0} ] } it’s pretty much identical to the first schema, the only difference being that the last two fields are flagged as being “ignored” for sorting/partitioning/grouping. let’s run the same (other than modified to work with the different schema) mapreduce code github source as above against this new schema and examine the outputs. {"station": "iad", "time": 1, "temp": 1, "counter": 1} {"station": "sfo", "time": 1, "temp": 3, "counter": 1} {"station": "sfo", "time": 1, "temp": 2, "counter": 2} {"station": "sfo", "time": 1, "temp": 1, "counter": 3} {"station": "sfo", "time": 2, "temp": 1, "counter": 1} there are a couple of notable differences between this output, and the output from the previous schema which didn’t have any ignored fields. first, it’s clear that the temp field isn’t being used in the sorting, which makes sense since we specified that it should be ignored in the schema. however, more interestingly, note the value of the counter field. all records that had identical station and time values went to the same reducer invocation, evidenced by the increasing value of counter . this is essentially secondary sort! now, all of this greatness isn’t without some limitations: you can’t support two mapreduce jobs that use the same avro key, but have different sorting/partitioning/grouping requirements. although it’s conceivable that you could create a new instance of the avro schema and set the ignored flags for these fields yourself. the partitioner, sorter and grouping functions in mapreduce all work off of the same fields (i.e. they all ignore fields that set as ignored in the schema). this means that your options for secondary sorting are limited. for example, you wouldn’t be able to partition all stations to the same reducer, and then group by station and time. ordering uses a field’s ordinal position to determine its order within the overall set of fields to be ordered. in other words, in a two-field record, the first field is always compared before the second. there’s no way to change this behavior other than flipping the order of the fields in the record. having said all of that - the “ignoring fields” feature for sorting is pretty awesome, and something that will no doubt come in handy in my future mapreduce work.

May 29, 2013

by Alex Holmes

· 8,148 Views

Accessing An Artifact’s Maven And SCM Versions At Runtime

You can easily tell Maven to include the version of the artifact and its Git/SVN/… revision in the JAR manifest file and then access that information at runtime via getClass().getPackage.getImplementationVersion(). (All credit goes to Markus Krüger and other colleagues.) Include Maven artifact version in the manifest (Note: You will actually not want to use it, if you also want to include a SCM revision; see below.) pom.xml: ... org.apache.maven.plugins maven-jar-plugin ... true true ... ... The resulting MANIFEST.MF of the JAR file will then include the following entries, with values from the indicated properties: Built-By: ${user.name} Build-Jdk: ${java.version} Specification-Title: ${project.name} Specification-Version: ${project.version} Specification-Vendor: ${project.organization.name Implementation-Title: ${project.name} Implementation-Version: ${project.version} Implementation-Vendor-Id: ${project.groupId} Implementation-Vendor: ${project.organization.name} (Specification-Vendor and Implementation-Vendor come from the POM’s organization/name.) Include SCM revision For this you can either use the Build Number Maven plugin that produces the property ${buildNumber}, or retrieve it from environment variables passed by Jenkinsor Hudson (SVN_REVISION for Subversion, GIT_COMMIT for Git). For git alone, you could also use the maven-git-commit-id-plugin that can either replace strings such as ${git.commit.id} in existing resource files (using maven’s resource filtering, which you must enable) with the actual values or output all of them into a git.properties file. Let’s use the buildnumber-maven-plugin and create the manifest entries explicitely, containing the build number (i.e. revision) org.codehaus.mojo buildnumber-maven-plugin 1.2 validate create false false org.apache.maven.plugins maven-jar-plugin 2.4 ${project.name} ${project.version} ${buildNumber} ... Accessing the version & revision As mentioned above, you can access the manifest entries from your code via getClass().getPackage.getImplementationVersion() andgetClass().getPackage.getImplementationTitle(). References SO: How to get Maven Artifact version at runtime? Maven Archiver documentation

May 28, 2013

by Jakub Holý

· 12,760 Views

Using Avro's code generation from Maven

Avro has the ability to generate Java code from Avro schema, IDL and protocol files. Avro also has a plugin which allows you to generate these Java sources directly from Maven, which is a good idea as it avoids issues that can arise if your schema/protocol files stray from the checked-in code generated equivalents. Today I created a simple GitHub project called avro-maven because I had to fiddle a bit to get Avro and Maven to play nice. The GitHub project is self-contained and also has a README which goes over the basics. In this post I’ll go over how to use Maven to generate code for schema, IDL and protocol files. pom.xml updates to support the Avro plugin Avro schema files only define types, whereas IDL and protocol files model types as well as RPC semantics such as messages. The only difference between IDL and protocol files is that IDL files are Avro’s DSL for specifying RPC, versus protocol files are the same in JSON form. Each type of file has an entry that can be used in the goals element as can be seen below. All three can be used together, or if you only have schema files you can safely remove the protocol and idl-protocol entries (and vice-versa). org.apache.avro avro-maven-plugin ${avro.version} generate-sources schema protocol idl-protocol ... org.apache.avro avro ${avro.version} org.apache.avro avro-maven-plugin ${avro.version} org.apache.avro avro-compiler ${avro.version} org.apache.avro avro-ipc ${avro.version} By default the plugin assumes that your Avro sources are located in ${basedir}/src/main/avro, and that you want your generated sources to be written to ${project.build.directory}/generated-sources/avro, where ${project.build.directory} is typically the target directory. Keep reading if you want to change any of these settings. Avro configurables Luckily Avro’s Maven plugin offers the ability to customize various code generation settings. The following table shows the configurables that can be used for any of the schema, IDL and protocol code generators. Configurable Default value Description sourceDirectory ${basedir}/src/main/avro The Avro source directory for schema, protocol and IDL files. outputDirectory ${project.build.directory}/generated-sources/avro The directory where Avro writes code-generated sources. testSourceDirectory ${basedir}/src/test/avro The input directory containing any Avro files used in testing. testOutputDirectory ${project.build.directory}/generated-test-sources/avro The output directory where Avro writes code-generated files for your testing purposes. fieldVisibility PUBLIC_DEPRECATED Determines the accessibility of fields (e.g. whether they are public or private). Must be one of PUBLIC, PUBLIC_DEPRECATED or PRIVATE. PUBLIC_DEPRECATED merely adds a deprecated annotation to each field, e.g. "@Deprecated public long time". In addition, the includes and testIncludes configurables can also be used to specify alternative file extensions to the defaults, which are **/*.avsc, **/*.avpr and **/*.avdl for schema, protocol and IDL files respectively. Let’s look at an example of how we can specify all of these options for schema compilation. org.apache.avro avro-maven-plugin ${avro.version} generate-sources schema ${project.basedir}/src/main/myavro/ ${project.basedir}/src/main/java/ ${project.basedir}/src/main/myavro/ ${project.basedir}/src/test/java/ PRIVATE **/*.avro **/*.test As a reminder everything covered in this blog article can be seen in action in the GitHub repo at https://github.com/alexholmes/avro-maven.

May 26, 2013

by Alex Holmes

· 67,851 Views · 4 Likes

WSO2 ESB Filter Mediator Tutorial

This post will be for WSO2 ESB Filter Mediator and it will be cover simple usecase with basic Filter Mediator functions. It can be used for XPath filtering of messages. There are two modes of operation Specifies the XPath (boolean expression), return true or false XPath will be matched against the regular expression return true or false Syntax mediator+ Usecase I have services call 'BusService' where I can give rootId (road name) and get list bus number that going on that root. In the same services it have some train deatils also. When client call busService it must give busService and also If client ask for train details rather bus system must give it also. Here is busServices calls request: root1 respond: root1Colombo Negombo Galle getTraingNo request: root1 respond: 12-Colombo 13-Muthu 01-Bange Now I have write simple WSO2 ESB proxy with filter mediator. 1. Download wso2 esb 4.6.0 2. Start wso2 esb /bin/wso2server.bat (offset 1) Other services expose in wso2 AS in offset 0 3. Go to https://localhost:9444/carbon/ 4. Then Create "Pass Through Proxy" 5. Here I am adding WSO2 Filter Mediator Specify As: XPath or a Regular expression. XPath: XPath expression if you selected the "Specify As" option to "XPath". Source: which is going match with the reguilar expression Regex: Regular expression to match with the source value. 6. In Here I am filtering for the action of the WS request and it log the client request is it bus or train request? Here is proxy Source View 6. Go to https://localhost:9444/services/transportProxy?tryit# Make bus request and train request and see console log You can improve this usecase with some WSO2 ESB mediator if you wish!!

May 15, 2013

by Madhuka Udantha

· 13,230 Views

Async I/O and ThreadPool Deadlock (Part 1)

I’ve mentioned in a past post that it was conceived while reading the source code for the System.Diagnostics.Process class. This post is about the reason that pushed me to read the source code in an attempt to fix the issue. It turned out that this was yet another case of LeakyAbstraction, which is a special interest of mine. As it turned out, this post ended being way too long (even for me). I don’t like installments, but I felt that it is something that is worth trying as the size was prohibitive for single-post consumption. As such, I’ve split it up on 5 parts, so that each part would be around a 1000 words or less. I’ll post one part a day. To give you an idea of the scope and subject of what’s to come, here is a quick overview. In part 1 I’ll lay out the problem. We are trying to spawn processes, read their output and kill if they take too long. Our first attempt is to use simple synchronous I/O to read the output and discover a deadlock. We solve the deadlock using asynchronous I/O. In part 2 we parallelize the code and discover reduced performance and yet another deadlock. We create a testbed and set about to investigate the problem at depth. In part 3 we will find out the root cause and we’ll discuss the mechanics (how and why) we hit such a problem. In part 4 we’ll discuss solutions to the problem and develop a generic solutions (with code) to fix the problem. Finally, in part 5 we see whether or not a generic solution could work before we summarize and conclude. Let’s begin at the very beginning. Suppose you want to execute some program (call it child), get all its output (and error) and, if it doesn’t exit within some time limit, kill it. Notice that there is no interaction and no input. This is how tests are executed in Phalanger using a test runner. Synchronous I/O The Process class has conveniently exposed the underlying pipes to the child process using stream instances StandardOutput and StandardError. And, like many, we too might be tempted to simply call StandardOutput.ReadToEnd() and StandardError.ReadToEnd(). Albeit, that would work, until it doesn’t. As Raymond Chen noted, it’ll work as long as the data fits into the internal pipe buffer. The problem with this approach is that we are asking to read until we reach the end of the data, which will only happen for certainty when the child process we spawned exits. However, when the buffer of the pipe which the child writes its output to is full, the child has to wait until there is free space in the buffer to write to. But, you say, what if we always read and empty the buffer? Good idea, except, we need to do that for both StandardOutput and StandardError at the same time. In the StandardOutput.ReadToEnd() call we read every byte coming in the buffer until the child process exits. While we have drained the StandardOutput buffer (so that the child process can’t be possibly blocked on that,) if it fills the StandardError buffer, which we aren’t reading yet, we will deadlock. The child won’t exit until it fully writes to the StandardError buffer (which is full because no one is reading it,) meanwhile, we are waiting for the process to exit so we can be sure we read to the end of the StandardOutput before we return (and start reading StandardError). The same problem exists for StandardOutput, if we first read StandardError, hence the need to drain both pipe buffers as they are fed, not one after the other. Async Reading The obvious (and only practical) solution is to read both pipes at the same time using separate threads. To that end, there are mainly two approaches. The pre-4.0 approach (async events), and the 4.5-and-up approach (tasks). Async Reading with Events The code is reasonably straight forward as it uses .Net events. We have two manual-reset events and two delegates that get called asynchronously when we read a line from each pipe. We get null data when we hit the end of file (i.e. when the process exits) for each of the two pipes. public static string ExecWithAsyncEvents(string path, string args, int timeoutMs) { using (var outputWaitHandle = new ManualResetEvent(false)) { using (var errorWaitHandle = new ManualResetEvent(false)) { using (var process = new Process()) { process.StartInfo = new ProcessStartInfo(path); process.StartInfo.Arguments = args; process.StartInfo.UseShellExecute = false; process.StartInfo.RedirectStandardOutput = true; process.StartInfo.RedirectStandardError = true; process.StartInfo.ErrorDialog = false; process.StartInfo.CreateNoWindow = true; var sb = new StringBuilder(1024); process.OutputDataReceived += (sender, e) => { sb.AppendLine(e.Data); if (e.Data == null) { outputWaitHandle.Set(); } }; process.ErrorDataReceived += (sender, e) => { sb.AppendLine(e.Data); if (e.Data == null) { errorWaitHandle.Set(); } }; process.Start(); process.BeginOutputReadLine(); process.BeginErrorReadLine(); process.WaitForExit(timeoutMs); outputWaitHandle.WaitOne(timeoutMs); errorWaitHandle.WaitOne(timeoutMs); process.CancelErrorRead(); process.CancelOutputRead(); return sb.ToString(); } } } } We certainly can improve on the above code (for example we should make the total wait limit <= timeoutMs) but you get the point with this sample. Also, no error handling or killing the child process when it times out and doesn’t exit. Async Reading with Tasks A much more simplified and sanitized approach is to use the new System.Threading.Tasks namespace/framework to do all the heavy-lifting for us. As you can see, the code has been cut by half and it’s much more readable, but we need Framework 4.5 and newer for this to work (although my target is 4.0, but for comparison purposes I gave it a spin). The results are the same. public static string ExecWithAsyncTasks(string path, string args, int timeout) { using (var process = new Process()) { process.StartInfo = new ProcessStartInfo(path); process.StartInfo.Arguments = args; process.StartInfo.UseShellExecute = false; process.StartInfo.RedirectStandardOutput = true; process.StartInfo.RedirectStandardError = true; process.StartInfo.ErrorDialog = false; process.StartInfo.CreateNoWindow = true; var sb = new StringBuilder(1024); process.Start(); var stdOutTask = process.StandardOutput.ReadToEndAsync(); var stdErrTask = process.StandardError.ReadToEndAsync(); process.WaitForExit(timeout); stdOutTask.Wait(timeout); stdErrTask.Wait(timeout); return sb.ToString(); } } Again, a healthy doze of error-handling is in order, but for illustration purposes left out. A point worthy of mention is that we can’t assume we read the streams by the time the child exits. There is a race condition and we still need to wait for the I/O operations to finish before we can read the results. In the next part we’ll parallelize the execution in an attempt to maximize efficiency and concurrency.

April 3, 2013

by Ashod Nakashian

· 5,794 Views

In-Memory Data Grids

Introduction The IT buzzword of 2012 is without a doubt Big Data. It’s new and here to stay, and for a good reason. Big data is data that exceeds the processing capacity of conventional database systems. Great examples are CERN with the Large Hadron Collider, whose experiments generate 25 petabytes of data annually, or Walmart, which handles more than one million customer transaction every hour. Problems These vast amounts of data leave us with two problems. Problem 1: To gain value from this data, one must choose an alternative way to process it. The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports. Problem 2: The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. Remember the CERN case where the LHC produces over 25 Petabytes of data annually? No “classic” database architecture or setup is capable of holding these amounts of data. Solutions Fortunately, both problems can be solved by implementing the correct infrastructure and rethinking data storage. There are two critical factors in Big Data environments: size and speed. We already discussed the vast amounts of data and desire to be able to access and process the data fast. The latter is the main differentiator from more traditional data warehouses. Just imagine what you can do when you can access all your data real-time. Enter big data. A common Big Data implementation is an in-memory data grid that lives in a distributed cluster, ensuring both speed, by storing data in-memory, and capacity by using scalability features provided by a cluster. As a bonus, availability is ensured by using a distributed cluster. As for the data storage, there are typically two kinds: in-memory databases and in-memory data grids. But first some background. It is not a new attempt to use main memory as a storage area instead of a disk. In our daily lives there are numerous examples of main memory databases (MMDB), as they perform much faster than disk-based databases. An every day example is a mobile phone. When you SMS or call someone most mobile service providers use MMDB to get the information on your contact as soon as possible. The same applies to your phone. When someone calls you, the caller details are looked up in the contacts application, usually providing a name and sometimes a picture. In memory data grids In Memory Data Grid (IMDG) is the same as MMDB in that it stores data in main memory, but it has a totally different architecture. The features of IMDG can be summarized as follows: Data is distributed and stored on multiple servers. Each server operates in the active mode. A data model is usually object-oriented (serialized) and non-relational. According to the necessity, you often need to add or reduce servers. No traditional database features such as tables. In other words, IMDG is designed to store data in main memory, ensure scalability and store an object itself. These days, there are many IMDG products, both commercial and open source. Some of the most commonly used products are: Hazelcast (http://www.hazelcast.com) JBoss Infinispan (http://www.jboss.org/infinispan) GridGain DataGrid (http://www.gridgain.com/features/in-memory-data-grid/) VMware Gemfire (http://www.vmware.com/nl/products/application-platform/vfabric-gemfire/overview.html) Oracle Coherence (http://www.oracle.com/technetwork/middleware/coherence/overview/index.html) Gigaspaces XAP (http://www.gigaspaces.com/datagrid) Terracotta Enterprise Suite (http://terracotta.org/products/enterprise-suite) Why Memory? The main reasons for using main memory for data storage are once again the two main themes of Big Data: speed and capacity. The processing performance of main memory is 800 times faster than an HDD and up to 40 times faster than an. Moreover, the latest x86 server supports main memory of hundreds of GB per server. It is said that the limit of a traditional processing database’s (OLTP) data capacity is approximately 1 TB and that the OLTP processing data capacity would not increment well. If servers using main memory of 1 TB or larger become more commonly used, you will be able to conduct operations with the entire data placed in main memory, at least in the field of OLTP. IMDG Architecture To use main memory as a storage area, two weak points should be overcome: Limited capacity: involves data that exceeds the maximum capacity of the main memory of the server Reliability: involves data loss in case of a (system) failure. IMDG overcomes the limit of capacity by ensuring horizontal scalability using a distributed architecture, and resolves the issue of reliability through a replication system as part of the grid (or a distributed cluster). Now let’s discuss how an IMDG actually works. First of all, it is important to understand that an IMDG is not the same as an in-memory database, also referred to as MMDB (main memory databases). Typical examples of MMDBs are Oracle TimesTen or Sap Hana. MMDBs are full database products that simply reside in memory. As a result of being a full-blown database, they also carry the weight and overhead of database management features. IMDG is different. No tables, indexes, triggers, stored procedures, process managers etc. Just plain storage. The data model used in IMDG is key-value pairs. A key-value pair is a list with only two parts: a key and a value. The key can be used for storing and retrieving the values in the list. A key can be compared to the index or primary key of a table in a database. Note that IMDG are closely tied to development environments such as Java as the key-value pairs are represented by the structures provided by such a programming environment. Most IMDGs are written in Java, and can only be used within other Java applications. Therefore, the values of key-value pairs can be anything supported by Java, ranging from simple data types such as a string or number, to complex objects. This overcomes the two important hurdles: as you can store complex Java objects as value, there’s no need to translate these objects into a relational datamodel (which is the case in more traditional applications using a database for storage). Furthermore, the seeming limitation of being able to store only one value per key, is actually no limitation at all. Large memory sizes Most of the products introduced above use Java as an implementation language. Java reserves and uses a part of the RAM (internal memory) for dynamic memory allocation. This reserved memory space is called the Java heap. All runtime objects created by a Java application are stored in heap. Using large amounts of data causes two problems. Size limitation: By default, the heap size is 128 MB, but for current business applications, this limit is reached easily. Once the heap is “full”, no new objects can be created and the Java application will show some nasty errors. Performance: It is possible to increase the size of the heap, but this introduces some new problems. When a heap reaches a size of more than 4 gigabytes, Java will have serious issues with memory managements, causing your application to slow down or even freeze. Java has a feature called Garbage Collector, which periodically scans the heap and checks each object if it is still valid and being used. If not, the garbage collector removes the object and defragments the newly available space. The problem is, the larger the heap size, the more work to do for the garbage collector, resulting in performance degradation. Imagine a large bank has a Java application that manages customers, accounts and transactions. We have seen that an IMDG allows the application to store and access all data very quickly by caching it in memory, instead of storing the data in relatively slow databases. Let’s assume the combined data has a size of 40 gigabytes. Storing it in heap is simply not possible, considering the performance penalties of Java’s memory management capabilities. The graph below illustrates the garbage collection pause time when placing cached data in heap: Terracotta’s BigMemory product has a method to overcome these limitations. The method is to use an off-heap memory (direct buffer). Data will not be stored in Java’s heap, but directly in the available internal memory (RAM). Since, this is not subject to Java’s garbage collector, there are no performance penalties. The differences on performance are significant, as can be seen in the graph below: Using off-heap storage has some major benefits: You can use all the available memory on your machine, not just the memory that is allocated to the heap (usually less that 512 Mb). This allows you to store more data in a in-memory data grid, greatly speeding up your application. The heap can be relieved by storing data in native memory, speeding up Java applications as less heap space has to be garbage collected. Clustering, fail over and high availability So far, we have seen IMDG features that are applicable to a single server. However, the real power of IMDG lies in it’s networking and clustering capabilities, providing features as data replication, data synchronization between clients, fail over and high availability. To achieve this, a cluster of servers (or server array) acts a backbone of the infrastructure. Applications (that still can have their own IMDG or off-heap cache) that are connected to the cluster can share, replicate and backup their data with either the cluster or other applications. The graph below depicts a typical setup using Terracotta's BigMemory: The caches on the application servers are usually referred to as “level 1” cache, while the data cache on the server array is referred to as “level 2” cache. There are many different scenarios possible for storing, clustering, synchronizing and replicating data. Covering all these topics goes far beyond the scope of this article. For more information, consult the technical documentation of the product of your choice. Conclusion Big Data brings us some new challenges. First of all, storing and accessing vast amounts of data makes us rethink traditional methods and technologies. Next, there’s the question what to do with all the available data. The potential value for marketing, financial and other businesses is huge. In order to facilitate Big Data, in-memory data grids are considered the best option. IMDGs with off-heap storage are even more powerful, allowing data centric enterprise application to overcome certain limits of the Java platform, such as memory and performance constraints. As the amount of data that (large) companies produce and store, grows exponentially, databases will hit a limit. Accessing your data without a performance penalty simply will not be possible. The answer to this is using an IMDG.

March 13, 2013

by Roy Prins

· 32,687 Views · 5 Likes

JUnit testing of Spring MVC application: Testing the Service Layer

In continuation of my earlier blogs on Introduction to Spring MVC and Testing DAO layer in Spring MVC, in this blog I will demonstrate how to test Service layer in Spring MVC. The objective of this demo is 2 fold, to build the Service layer using TDD and increase the code coverage during JUnit testing of Service layer. For people in hurry, get the latest code from Github and run the below command mvn clean test -Dtest=com.example.bookstore.service.AccountServiceTest Since in my earlier blog, we have already tested the DAO layer, in this blog we only need to focus on testing service layer. We need to mock the DAO layer so that we can control the behavior in Service layer and cover various scenarios. Mockito is a good framework which is used to mock a method and return known data and assert that in the JUnit. As a first step we define the AccountServiceTestContextConfiguration class with AccountServiceTest class. If you notice there are 2 beans defined in that class and we marked the as a @Configuration which shows that it is a Spring Context class. In the JUnit test we @Autowired AccountService class. And AccountServiceImpl @Autowired the AccountRepository class. When creating the Bean in the configuration file we also stubbed the AccountRepository class using Mockito, @RunWith(SpringJUnit4ClassRunner.class) @ContextConfiguration public class AccountServiceTest { @Configuration static class AccountServiceTestContextConfiguration { @Bean public AccountService accountService() { return new AccountServiceImpl(); } @Bean public AccountRepository accountRepository() { return Mockito.mock(AccountRepository.class); } } //We Autowired the AccountService bean so that it is injected from the configuration @Autowired private AccountService accountService; @Autowired private AccountRepository accountRepository; During the setup of the JUnit we use Mockito mock findByUsername method to return a predefined account object as below @Before public void setup() { Account account = new AccountBuilder() { { address("Herve", "4650", "Rue de la gare", "1", null, "Belgium"); credentials("john", "secret"); name("John", "Doe"); } }.build(true); Mockito.when(accountRepository.findByUsername("john")).thenReturn(account); } Now we write the tests as below and test both the positive and negative scenarios, @Test(expected = AuthenticationException.class) public void testLoginFailure() throws AuthenticationException { accountService.login("john", "fail"); } @Test() public void testLoginSuccess() throws AuthenticationException { Account account = accountService.login("john", "secret"); assertEquals("John", account.getFirstName()); assertEquals("Doe", account.getLastName()); } } Finally we verify if the findByUsername method is called only once successfully as below in the teardown, @After public void verify() { Mockito.verify(accountRepository, VerificationModeFactory.times(1)).findByUsername(Mockito.anyString()); // This is allowed here: using container injected mocks Mockito.reset(accountRepository); } AccountService class looks as below, @Service @Transactional(readOnly = true) public class AccountServiceImpl implements AccountService { @Autowired private AccountRepository accountRepository; @Override public Account login(String username, String password) throws AuthenticationException { Account account = this.accountRepository.findByUsername(username, password); } else { throw new AuthenticationException("Wrong username/password", "invalid.username"); } return account; } } I hope this blog helped you. In my next blog, I will demo how to build a controller JUnit test. Reference: Pro Spring MVC: With Web Flow by by Marten Deinum, Koen Serneels

March 3, 2013

by Krishna Prasad

· 81,662 Views · 3 Likes

Parallelizing Work with Redis

Curator's Note: Here's a Redis how-to from back in 2011. Anyone who has heard of Redis has probably also heard of Resque, which is a lightweight queue'ing system. To the uninitiated it might seem strange, or maybe even impossible, to construct a queue'ing system using just a key-value store. In this article, I’m going to break down some of the primitives redis exposes that make building a queue'ing system over it trivial and show how Redis is so much more than just a key-value store. The problem Let’s say, you are a mathematician and have just come up with this super performant way of computing factors of numbers. You decide to write up the following sinatra service: def compute_factors(number) factors = crazy_performant_computation number end get "/compute_factors" do number, post_back_url = params[:number].to_i, params[:post_back_url] RestClient.post post_back_url, factors => compute_factors(number).to_json "OK" end You soon start seeing crazy traffic and realize, performant as your factor computation algorithm is, it’s not fast enough to keep up with the speed at which you are getting requests to your service. First pass at optimization by forking You realize that it’s going to be far more efficient to fork off a new Process or Thread and have that perform the computation and post back the result. So your code now changes to: get "/compute_factors" do number, post_back_url = params[:number].to_i, params[:post_back_url] Process.fork do RestClient.post post_back_url, factors => compute_factors(number).to_json end "OK" end While, this is great you soon realize that filling up the process table in your OS is not such a good idea. Capping process creation using a process pool It is exactly this problem that a process pool was meant to solve. The basic idea is that you would still like to perform your time-intensive task in the background, but would like to put a cap on the number of background processes you have running. There are some excellent libraries that solve this problem such as Delayed Job and Resque. However, being the hacker that you are, you decide to roll one yourself. There are however a bunch of issues that these libraries solve and you decide to pull a pen and paper and note them down to ensure that you are not missing anything: Cap how many workers you create You need to have a way to cap the number of background workers you create, that way you don’t have the same problem you were having before. Control worker creation and destruction You would like to be able to boot up and bring down your workers reasonably gracefully. Handle race conditions You realize, that spinning new processes means that you now have to ensure your code is concurrent-safe. Redis provides, some wonderful atomic operations out-of-the-box so this shouldn’t be too hard. Second pass using BRPOP Redis supports a couple of interesting data-structures including lists, sets and hashes. Redis lists have a command called RPOP which basically lets you pop an item off the tail of a list, in essence treating it like a queue. The RPOP command comes with a blocking variant of itself called BRPOP that blocks on the call to popping an element from the list. You can also specify a timeout for how long (in seconds) you would like to block on the call. def compute_factors(number) factors = crazy_performant_computation number end NUMBER_OF_WORKERS = (ENV['NUMBER_OF_WORKERS'] || 50).to_i NUMBER_OF_WORKERS.times do Process.fork do redis = Redis.new loop do val = redis.brpop "work_queue", 1 unless val puts "Process: #{Process.pid} is exiting" exit 0 end number, postback_url = Marshal.load val.last RestClient.post postback_url, factors => compute_factors(number).to_json end end end redis = Redis.new get "/compute_factors" do number, post_back_url = params[:number].to_i, params[:post_back_url] redis.lpush "work_queue", Marshal.dump([number, post_back_url]) "OK" end So you now have solved a bunch of problems in this new approach. We have a fixed number of workers running to handle our background processing – so now our process table getting filled is not subject to traffic conditions. Race conditions are handled for us by Redis, since BRPOP is atomic and guarantees no two workers will do duplicate work. And finally, workers destroy themselves if they break out of the brpop call due to their timeout being hit, in this case 1 second. So, that’s quite a slew of problems that have been solved for us by virtue of just using redis. We soon start, seeing a different problem though. As traffic in our site lags, workers seem to be dying off since their timeout is being hit. We’d really like to now have the workers block for a longer time than just 1 second, while also having the option to kill them off sooner if we need to. That way, they’ll not be hanging around for any longer than they have to. Gracefully shutting down workers Our mandate now is to shutdown our workers gracefully, using redis and little bit of UNIX signals magic (for examples of using signals in this area checkout Unicorn Is Unix and the Unicorn web-server. Our code now morphs to: def compute_factors(number) factors = crazy_performant_computation number end NUMBER_OF_WORKERS = (ENV['NUMBER_OF_WORKERS'] || 50).to_i NUMBER_OF_WORKERS.times do Process.fork do redis = Redis.new loop do val = redis.brpop "work_queue", 30 unless val puts "Process: #{Process.pid} is signing off due to timeout!" exit 0 end if val.last == "DIE!" puts "Process: #{Process.pid} has been asked to kill itself by parent" exit 0 end number, postback_url = Marshal.load val.last RestClient.post postback_url, factors => compute_factors(number).to_json end end end redis = Redis.new get "/compute_factors" do number, post_back_url = params[:number].to_i, params[:post_back_url] redis.lpush "work_queue", Marshal.dump([number, post_back_url]) "OK" end `echo #{Process.pid} > /tmp/factors.pid` puts "Parent process wrote PID to /tmp/factors.pid" trap('QUIT') do NUMBER_OF_WORKERS.times do redis.lpush "work_queue", "DIE!" end end We have now bumped up the timeout to 30 seconds and also have in place a way to bring down the workers near instantly. This is accomplished by the web-server trapping the QUIT signal and when it does, it pushes a “DIE!” message onto the redis “work_queue”. It pushes this message the same number of times as the NUMBER_OF_WORKERS. And since BRPOP is an atomic and concurrent-safe operation we are now supporting the bringing down of workers via redis. How cool is that! To gracefully shutdown the server and workers we just need to: kill -s QUIT `cat /tmp/factors.pid` Conclusion The next time you need to get some background job action going, stop yourself from just grabbing a library. Instead, toy around with redis lists a little. You’ll be surprised by how much you can accomplish with just straight redis primitives.

February 28, 2013

by Santosh Kumar

· 8,256 Views