Career Development Resources

The Latest Career Development Topics

Spring Batch Tutorial with Spring Boot and Java Configuration

I’ve been working on migrating some batch jobs for Podcastpedia.org to Spring Batch. Before, these jobs were developed in my own kind of way, and I thought it was high time to use a more “standardized” approach. Because I had never used Spring with Java configuration before, I thought this would be a good opportunity to learn about it by configuring the Spring Batch jobs in Java. And since I am all into trying new things with Spring, why not also throw Spring Boot into the boat…

Before you begin with this tutorial I recommend you first read Spring’s Getting started – Creating a Batch Service, because the structure and the code presented here build on that original.

1. What I’ll build

So, as mentioned, in this post I will present Spring Batch in the context of configuring it and developing some batch jobs with it for Podcastpedia.org. Here’s a short description of the two jobs that are currently part of the Podcastpedia-batch project:

addNewPodcastJob
- reads podcast metadata (feed URL, identifier, categories etc.) from a flat file
- transforms the data (parses it and prepares the episodes to be inserted, using the Apache HTTP Client)
- in the last step, inserts it into the Podcastpedia database and informs the submitter about it via email

notifyEmailSubscribersJob – people can subscribe to their favorite podcasts on Podcastpedia.org via email. For those who did, it is checked on a regular basis (DAILY, WEEKLY, MONTHLY) whether new episodes are available, and if they are, the subscribers are informed about them via email. The job
- reads from the database
- expands the read data via JPA
- re-groups it
- notifies the subscriber via email

Source code: The source code for this tutorial is available on GitHub – Podcastpedia-batch.

Note: Before you start I also highly recommend you read the Domain Language of Batch, so that terms like “Jobs”, “Steps” or “ItemReaders” don’t sound strange to you.

2. What you’ll need

- A favorite text editor or IDE
- JDK 1.7 or later
- Maven 3.0+

3. Set up the project

The project is built with Maven. It uses Spring Boot, which makes it easy to create stand-alone Spring-based applications that you can “just run”. You can learn more about Spring Boot by visiting the project’s website.

3.1. Maven build file

Because it uses Spring Boot, the project has spring-boot-starter-parent as its parent, plus a couple of other spring-boot-starters that pull in some of the libraries required in the project:

pom.xml of the podcastpedia-batch project

4.0.0
org.podcastpedia.batch podcastpedia-batch 0.1.0
1.1.6.RELEASE 1.7
org.springframework.boot spring-boot-starter-parent 1.1.6.RELEASE
org.springframework.boot spring-boot-starter-batch
org.springframework.boot spring-boot-starter-data-jpa
org.apache.httpcomponents httpclient 4.3.5
org.apache.httpcomponents httpcore 4.3.2
org.apache.velocity velocity 1.7
org.apache.velocity velocity-tools 2.0
org.apache.struts struts-core
rome rome 1.0
rome rome-fetcher 1.0
org.jdom jdom 1.1
xerces xercesImpl 2.9.1
mysql mysql-connector-java 5.1.31
org.springframework.boot spring-boot-starter-freemarker
org.springframework.boot spring-boot-starter-remote-shell
javax.mail mail
javax.mail mail 1.4.7
javax.inject javax.inject 1
org.twitter4j twitter4j-core [4.0,)
org.springframework.boot spring-boot-starter-test
maven-compiler-plugin
org.springframework.boot spring-boot-maven-plugin

Note: One big advantage of using the spring-boot-starter-parent as the project’s parent is that you only have to upgrade the version of the parent and it will get the “latest” libraries for you. When I started the project Spring Boot was at version 1.1.3.RELEASE, and by the time I finished writing this post it was already at 1.1.6.RELEASE.

3.2. Project directory structure

I structured the project in the following way:

└── src
    └── main
        └── java
            └── org
                └── podcastpedia
                    └── batch
                        ├── common
                        └── jobs
                            ├── addpodcast
                            └── notifysubscribers

Note:
- the org.podcastpedia.batch.jobs package contains sub-packages with classes specific to particular jobs.
- the org.podcastpedia.batch.common package contains classes used by all the jobs, for example the JPA entities that both current jobs require.

4. Create a batch Job configuration

I will start by presenting the Java configuration class for the first batch job:

    package org.podcastpedia.batch.jobs.addpodcast;

    import org.podcastpedia.batch.common.configuration.DatabaseAccessConfiguration;
    import org.podcastpedia.batch.common.listeners.LogProcessListener;
    import org.podcastpedia.batch.common.listeners.ProtocolListener;
    import org.podcastpedia.batch.jobs.addpodcast.model.SuggestedPodcast;
    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.file.LineMapper;
    import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
    import org.springframework.batch.item.file.mapping.DefaultLineMapper;
    import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.context.annotation.Import;
    import org.springframework.core.io.ClassPathResource;

    import com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException;

    @Configuration
    @EnableBatchProcessing
    @Import({DatabaseAccessConfiguration.class, ServicesConfiguration.class})
    public class AddPodcastJobConfiguration {

        @Autowired
        private JobBuilderFactory jobs;

        @Autowired
        private StepBuilderFactory stepBuilderFactory;

        // tag::jobstep[]
        @Bean
        public Job addNewPodcastJob() {
            return jobs.get("addNewPodcastJob")
                    .listener(protocolListener())
                    .start(step())
                    .build();
        }

        @Bean
        public Step step() {
            return stepBuilderFactory.get("step")
                    .<SuggestedPodcast, SuggestedPodcast>chunk(1) // important to be one in this case to commit after every line read
                    .reader(reader())
                    .processor(processor())
                    .writer(writer())
                    .listener(logProcessListener())
                    .faultTolerant()
                    .skipLimit(10) // default is set to 0
                    .skip(MySQLIntegrityConstraintViolationException.class)
                    .build();
        }
        // end::jobstep[]

        // tag::readerwriterprocessor[]
        @Bean
        public ItemReader<SuggestedPodcast> reader() {
            FlatFileItemReader<SuggestedPodcast> reader = new FlatFileItemReader<>();
            reader.setLinesToSkip(1); // first line is title definition
            reader.setResource(new ClassPathResource("suggested-podcasts.in"));
            reader.setLineMapper(lineMapper());
            return reader;
        }

        @Bean
        public LineMapper<SuggestedPodcast> lineMapper() {
            DefaultLineMapper<SuggestedPodcast> lineMapper = new DefaultLineMapper<>();

            DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
            lineTokenizer.setDelimiter(";");
            lineTokenizer.setStrict(false);
            lineTokenizer.setNames(new String[]{"FEED_URL", "IDENTIFIER_ON_PODCASTPEDIA", "CATEGORIES", "LANGUAGE", "MEDIA_TYPE", "UPDATE_FREQUENCY", "KEYWORDS", "FB_PAGE", "TWITTER_PAGE", "GPLUS_PAGE", "NAME_SUBMITTER", "EMAIL_SUBMITTER"});

            // note: this generic mapper is configured but not used; the custom mapper below is set instead
            BeanWrapperFieldSetMapper<SuggestedPodcast> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
            fieldSetMapper.setTargetType(SuggestedPodcast.class);

            lineMapper.setLineTokenizer(lineTokenizer);
            lineMapper.setFieldSetMapper(suggestedPodcastFieldSetMapper());
            return lineMapper;
        }

        @Bean
        public SuggestedPodcastFieldSetMapper suggestedPodcastFieldSetMapper() {
            return new SuggestedPodcastFieldSetMapper();
        }

        /** configure the processor related stuff */
        @Bean
        public ItemProcessor<SuggestedPodcast, SuggestedPodcast> processor() {
            return new SuggestedPodcastItemProcessor();
        }

        @Bean
        public ItemWriter<SuggestedPodcast> writer() {
            return new Writer();
        }
        // end::readerwriterprocessor[]

        @Bean
        public ProtocolListener protocolListener() {
            return new ProtocolListener();
        }

        @Bean
        public LogProcessListener logProcessListener() {
            return new LogProcessListener();
        }
    }

The @EnableBatchProcessing annotation adds many critical beans that support jobs and saves us configuration work. For example, you will also be able to @Autowired some useful stuff into your context:

- a JobRepository (bean name “jobRepository”)
- a JobLauncher (bean name “jobLauncher”)
- a JobRegistry (bean name “jobRegistry”)
- a PlatformTransactionManager (bean name “transactionManager”)
- a JobBuilderFactory (bean name “jobBuilders”) as a convenience to prevent you from having to inject the job repository into every job, as in the examples above
- a StepBuilderFactory (bean name “stepBuilders”) as a convenience to prevent you from having to inject the job repository and transaction manager into every step

The first part focuses on the actual job configuration:

    @Bean
    public Job addNewPodcastJob() {
        return jobs.get("addNewPodcastJob")
                .listener(protocolListener())
                .start(step())
                .build();
    }

    @Bean
    public Step step() {
        return stepBuilderFactory.get("step")
                .<SuggestedPodcast, SuggestedPodcast>chunk(1) // important to be one in this case to commit after every line read
                .reader(reader())
                .processor(processor())
                .writer(writer())
                .listener(logProcessListener())
                .faultTolerant()
                .skipLimit(10) // default is set to 0
                .skip(MySQLIntegrityConstraintViolationException.class)
                .build();
    }

The first method defines a job and the second one defines a single step. As you’ve read in The Domain Language of Batch, jobs are built from steps, where each step can involve a reader, a processor, and a writer. In the step definition, you define how much data to write at a time (in our case 1 record at a time). Next you specify the reader, processor and writer.

5. Spring Batch processing units

Most of the batch processing can be described as reading data, doing some transformation on it and then writing the result out.
This somewhat mirrors the Extract, Transform, Load (ETL) process, in case you know more about that. Spring Batch provides three key interfaces to help perform bulk reading and writing: ItemReader, ItemProcessor and ItemWriter.

5.1. Readers

ItemReader is an abstraction providing the means to retrieve data from many different types of input (flat files, XML files, databases, JMS etc.), one item at a time. See Appendix A. List of ItemReaders and ItemWriters for a complete list of available item readers. In the Podcastpedia batch jobs I use the following specialized ItemReaders:

5.1.1. FlatFileItemReader

As the name implies, the FlatFileItemReader reads lines of data from a flat file that typically describe records with fields of data defined by fixed positions in the file or delimited by some special character (e.g. a comma). This type of ItemReader is used in the first batch job, addNewPodcastJob. The input file is named suggested-podcasts.in, resides in the classpath (src/main/resources) and looks something like the following:

    FEED_URL; IDENTIFIER_ON_PODCASTPEDIA; CATEGORIES; LANGUAGE; MEDIA_TYPE; UPDATE_FREQUENCY; KEYWORDS; FB_PAGE; TWITTER_PAGE; GPLUS_PAGE; NAME_SUBMITTER; EMAIL_SUBMITTER
    http://www.5minutebiographies.com/feed/; 5minutebiographies; people_society, history; en; Audio; WEEKLY; biography, biographies, short biography, short biographies, 5 minute biographies, five minute biographies, 5 minute biography, five minute biography; https://www.facebook.com/5minutebiographies;https://twitter.com/5MinuteBios; ; Adrian Matei; [email protected]
    http://notanotherpodcast.libsyn.com/rss; NotAnotherPodcast; entertainment; en; Audio; WEEKLY; Comedy, Sports, Cinema, Movies, Pop Culture, Food, Games; https://www.facebook.com/notanotherpodcastusa;https://twitter.com/NAPodcastUSA;https://plus.google.com/u/0/103089891373760354121/posts; Adrian Matei; [email protected]

As you can see, the first line defines the names of the “columns”, and the following lines contain the actual data (delimited by “;”), which needs to be translated into domain objects relevant in the context. Let’s see now how to configure the FlatFileItemReader:

    @Bean
    public ItemReader<SuggestedPodcast> reader() {
        FlatFileItemReader<SuggestedPodcast> reader = new FlatFileItemReader<>();
        reader.setLinesToSkip(1); // first line is title definition
        reader.setResource(new ClassPathResource("suggested-podcasts.in"));
        reader.setLineMapper(lineMapper());
        return reader;
    }

You can specify, among other things, the input resource, the number of lines to skip, and a line mapper.

5.1.1.1. LineMapper

The LineMapper is an interface for mapping lines (strings) to domain objects, typically used to map lines read from a file to domain objects on a per-line basis. For the Podcastpedia job I used the DefaultLineMapper, which is a two-phase implementation consisting of tokenization of the line into a FieldSet, followed by mapping to an item:

    @Bean
    public LineMapper<SuggestedPodcast> lineMapper() {
        DefaultLineMapper<SuggestedPodcast> lineMapper = new DefaultLineMapper<>();

        DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
        lineTokenizer.setDelimiter(";");
        lineTokenizer.setStrict(false);
        lineTokenizer.setNames(new String[]{"FEED_URL", "IDENTIFIER_ON_PODCASTPEDIA", "CATEGORIES", "LANGUAGE", "MEDIA_TYPE", "UPDATE_FREQUENCY", "KEYWORDS", "FB_PAGE", "TWITTER_PAGE", "GPLUS_PAGE", "NAME_SUBMITTER", "EMAIL_SUBMITTER"});

        // note: this generic mapper is configured but not used; the custom mapper below is set instead
        BeanWrapperFieldSetMapper<SuggestedPodcast> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(SuggestedPodcast.class);

        lineMapper.setLineTokenizer(lineTokenizer);
        lineMapper.setFieldSetMapper(suggestedPodcastFieldSetMapper());
        return lineMapper;
    }

- the DelimitedLineTokenizer splits the input String via the “;” delimiter.
- if you set the strict flag to false, then lines with fewer tokens will be tolerated and padded with empty columns, and lines with more tokens will simply be truncated.
- the column names from the first line are set via lineTokenizer.setNames(...);
- and the field set mapper is set via lineMapper.setFieldSetMapper(...)

Note: The FieldSet is an “interface used by flat file input sources to encapsulate concerns of converting an array of Strings to Java native types. A bit like the role played by ResultSet in JDBC, clients will know the name or position of strongly typed fields that they want to extract.”

5.1.1.2. FieldSetMapper

The FieldSetMapper is an interface that is used to map data obtained from a FieldSet into an object. Here’s my implementation, which maps the fieldSet to the SuggestedPodcast domain object that will be further passed to the processor:

    public class SuggestedPodcastFieldSetMapper implements FieldSetMapper<SuggestedPodcast> {

        @Override
        public SuggestedPodcast mapFieldSet(FieldSet fieldSet) throws BindException {
            SuggestedPodcast suggestedPodcast = new SuggestedPodcast();
            suggestedPodcast.setCategories(fieldSet.readString("CATEGORIES"));
            suggestedPodcast.setEmail(fieldSet.readString("EMAIL_SUBMITTER"));
            suggestedPodcast.setName(fieldSet.readString("NAME_SUBMITTER"));
            suggestedPodcast.setTags(fieldSet.readString("KEYWORDS"));

            // some of the attributes we can map directly into the Podcast entity that we'll insert later into the database
            Podcast podcast = new Podcast();
            podcast.setUrl(fieldSet.readString("FEED_URL"));
            podcast.setIdentifier(fieldSet.readString("IDENTIFIER_ON_PODCASTPEDIA"));
            podcast.setLanguageCode(LanguageCode.valueOf(fieldSet.readString("LANGUAGE")));
            podcast.setMediaType(MediaType.valueOf(fieldSet.readString("MEDIA_TYPE")));
            podcast.setUpdateFrequency(UpdateFrequency.valueOf(fieldSet.readString("UPDATE_FREQUENCY")));
            podcast.setFbPage(fieldSet.readString("FB_PAGE"));
            podcast.setTwitterPage(fieldSet.readString("TWITTER_PAGE"));
            podcast.setGplusPage(fieldSet.readString("GPLUS_PAGE"));

            suggestedPodcast.setPodcast(podcast);
            return suggestedPodcast;
        }
    }

5.1.2. JdbcCursorItemReader

In the second job, notifyEmailSubscribersJob, the reader only reads email subscribers from a single database table; further on, in the processor, a more detailed read (via JPA) is executed to retrieve all the new episodes of the podcasts the user subscribed to. This is a common pattern employed in the batch world. Follow this link for more Common Batch Patterns.

For the initial read, I chose the JdbcCursorItemReader, which is a simple reader implementation that opens a JDBC cursor and continually retrieves the next row in the ResultSet:

    @Bean
    public ItemReader<User> notifySubscribersReader() {
        JdbcCursorItemReader<User> reader = new JdbcCursorItemReader<>();
        String sql = "select * from users where is_email_subscriber is not null";
        reader.setSql(sql);
        reader.setDataSource(dataSource);
        reader.setRowMapper(rowMapper());
        return reader;
    }

Note that I had to set the SQL, the data source to read from and a RowMapper.

5.1.2.1. RowMapper

The RowMapper is an interface used by JdbcTemplate for mapping rows of a ResultSet on a per-row basis. My implementation of this interface, UserRowMapper, performs the actual work of mapping each row to a result object, but I don’t need to worry about exception handling:

    public class UserRowMapper implements RowMapper<User> {

        @Override
        public User mapRow(ResultSet rs, int rowNum) throws SQLException {
            User user = new User();
            user.setEmail(rs.getString("email"));
            return user;
        }
    }

5.2. Writers

ItemWriter is an abstraction that represents the output of a Step, one batch or chunk of items at a time. Generally, an item writer has no knowledge of the input it will receive next, only the item that was passed in its current invocation. The writers for the two jobs presented are quite simple. They just use external services to send email notifications and post tweets on Podcastpedia’s account.
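To make the “one chunk of items at a time” contract more concrete, here is a minimal plain-Java sketch of a chunk-oriented loop. It is not Spring Batch’s actual implementation: SimpleReader, SimpleProcessor, SimpleWriter and ChunkLoopSketch are hypothetical stand-ins that only mimic the shape of ItemReader, ItemProcessor and ItemWriter, so that the example runs without any Spring dependency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins that mimic the shape of Spring Batch's interfaces.
interface SimpleReader<T> { T read(); }                  // returns null when the input is exhausted
interface SimpleProcessor<I, O> { O process(I item); }
interface SimpleWriter<T> { void write(List<? extends T> items); }

class ChunkLoopSketch {

    /** Reads items one by one, processes them and hands them to the writer in chunks. */
    static <I, O> int run(SimpleReader<I> reader, SimpleProcessor<I, O> processor,
                          SimpleWriter<O> writer, int chunkSize) {
        int writeCalls = 0;
        List<O> chunk = new ArrayList<>();
        I item;
        while ((item = reader.read()) != null) {
            chunk.add(processor.process(item));
            if (chunk.size() == chunkSize) {   // chunk is full: write it out
                writer.write(chunk);
                writeCalls++;
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {                // write the last, possibly smaller chunk
            writer.write(chunk);
            writeCalls++;
        }
        return writeCalls;
    }

    /** Runs the loop over three fixed lines and returns how often the writer was called. */
    static int demo(int chunkSize, List<String> written) {
        Iterator<String> lines = Arrays.asList("a", "b", "c").iterator();
        SimpleReader<String> reader = () -> lines.hasNext() ? lines.next() : null;
        SimpleProcessor<String, String> processor = line -> line.toUpperCase();
        SimpleWriter<String> writer = items -> written.addAll(items);
        return run(reader, processor, writer, chunkSize);
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        int calls = demo(1, written);
        System.out.println(calls + " write calls, items: " + written);
    }
}
```

With a chunk size of 1, as configured in the step of the first job, the writer in this sketch is called once per item read, which matches the “commit after every line read” comment in the step definition.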
Here is the implementation of the ItemWriter for the first job – addNewPodcast:

    package org.podcastpedia.batch.jobs.addpodcast;

    import java.util.Date;
    import java.util.List;

    import javax.inject.Inject;
    import javax.persistence.EntityManager;

    import org.podcastpedia.batch.common.entities.Podcast;
    import org.podcastpedia.batch.jobs.addpodcast.model.SuggestedPodcast;
    import org.podcastpedia.batch.jobs.addpodcast.service.EmailNotificationService;
    import org.podcastpedia.batch.jobs.addpodcast.service.SocialMediaService;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.beans.factory.annotation.Autowired;

    public class Writer implements ItemWriter<SuggestedPodcast> {

        @Autowired
        private EntityManager entityManager;

        @Inject
        private EmailNotificationService emailNotificationService;

        @Inject
        private SocialMediaService socialMediaService;

        @Override
        public void write(List<? extends SuggestedPodcast> items) throws Exception {
            // the chunk size is 1, so only the first item needs to be handled
            if (items.get(0) != null) {
                SuggestedPodcast suggestedPodcast = items.get(0);

                // first insert the data in the database
                Podcast podcast = suggestedPodcast.getPodcast();
                podcast.setInsertionDate(new Date());
                entityManager.persist(podcast);
                entityManager.flush();

                // notify submitter about the insertion and post a tweet about it
                String url = buildUrlOnPodcastpedia(podcast);
                emailNotificationService.sendPodcastAdditionConfirmation(
                        suggestedPodcast.getName(), suggestedPodcast.getEmail(), url);
                if (podcast.getTwitterPage() != null) {
                    socialMediaService.postOnTwitterAboutNewPodcast(podcast, url);
                }
            }
        }

        private String buildUrlOnPodcastpedia(Podcast podcast) {
            StringBuilder urlOnPodcastpedia = new StringBuilder("http://www.podcastpedia.org");
            if (podcast.getIdentifier() != null) {
                urlOnPodcastpedia.append("/").append(podcast.getIdentifier());
            } else {
                urlOnPodcastpedia.append("/podcasts/");
                urlOnPodcastpedia.append(String.valueOf(podcast.getPodcastId()));
                urlOnPodcastpedia.append("/").append(podcast.getTitleInUrl());
            }
            return urlOnPodcastpedia.toString();
        }
    }

As you can see, there’s nothing special here, except that the write method has to be overridden. This is where the injected external services EmailNotificationService and SocialMediaService are used to inform the podcast submitter via email about the addition to the podcast directory and, if a Twitter page was submitted, to post a tweet on Podcastpedia’s wall. You can find detailed explanations on how to send email via Velocity and how to post on Twitter from Java in the following posts:

- How to compose html emails in Java with Spring and Velocity
- How to post to Twitter from Java with Twitter4J in 10 minutes

5.3. Processors

ItemProcessor is an abstraction that represents the business processing of an item. While the ItemReader reads one item and the ItemWriter writes them, the ItemProcessor provides an access point to transform the item or apply other business processing. When using your own processors you have to implement the ItemProcessor<I, O> interface, with its only method O process(I item) throws Exception, returning a potentially modified or a new item for continued processing. If the returned result is null, it is assumed that processing of the item should not continue.
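The “return null to stop processing an item” rule can be illustrated with a small plain-Java sketch. The Processor interface and the NullFilterSketch class below are hypothetical stand-ins, not Spring Batch classes (the real process method also declares throws Exception), and the example.org feed URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in with the same single method as Spring Batch's ItemProcessor
// (the real method also declares "throws Exception").
interface Processor<I, O> { O process(I item); }

class NullFilterSketch {

    /** Returns null for feed URLs that are already known, so that they are dropped. */
    static Processor<String, String> skipKnownFeeds(Set<String> knownFeeds) {
        return url -> knownFeeds.contains(url) ? null : url;
    }

    /** Applies the processor and keeps only non-null results, as a step would. */
    static List<String> processAll(List<String> urls, Processor<String, String> processor) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String result = processor.process(url);
            if (result != null) {      // a null result means: filter this item out
                kept.add(result);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up feed URLs, for illustration only.
        Set<String> known = new HashSet<>(Arrays.asList("http://example.org/feed-a"));
        List<String> input = Arrays.asList("http://example.org/feed-a", "http://example.org/feed-b");
        System.out.println(processAll(input, skipKnownFeeds(known)));
    }
}
```

This is the same mechanism the processor of the first job relies on when it returns null for a podcast that is already in the directory: the item never reaches the writer.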
The processor of the first job requires a little bit more logic, because I have to set the etag and last-modified header attributes, the feed attributes, episodes, categories and keywords of the podcast:

    public class SuggestedPodcastItemProcessor implements ItemProcessor<SuggestedPodcast, SuggestedPodcast> {

        private static final int TIMEOUT = 10;

        @Autowired
        ReadDao readDao;

        @Autowired
        PodcastAndEpisodeAttributesService podcastAndEpisodeAttributesService;

        @Autowired
        private PoolingHttpClientConnectionManager poolingHttpClientConnectionManager;

        @Autowired
        private SyndFeedService syndFeedService;

        /**
         * Method used to build the categories, tags and episodes of the podcast
         */
        @Override
        public SuggestedPodcast process(SuggestedPodcast item) throws Exception {

            // returning null filters the item out: it will not be passed to the writer
            if (isPodcastAlreadyInTheDirectory(item.getPodcast().getUrl())) {
                return null;
            }

            String[] categories = item.getCategories().trim().split("\\s*,\\s*");

            item.getPodcast().setAvailability(org.apache.http.HttpStatus.SC_OK);

            // set etag and last modified attributes for the podcast
            setHeaderFieldAttributes(item.getPodcast());

            // set the other attributes of the podcast from the feed
            podcastAndEpisodeAttributesService.setPodcastFeedAttributes(item.getPodcast());

            // set the categories
            List categoriesByNames = readDao.findCategoriesByNames(categories);
            item.getPodcast().setCategories(categoriesByNames);

            // set the tags
            setTagsForPodcast(item);

            // build the episodes
            setEpisodesForPodcast(item.getPodcast());

            return item;
        }
        ......
    }

The processor of the second job uses the ‘Driving Query’ approach, where I expand the data retrieved from the reader with another “JPA-read” and group the items by podcasts with episodes, so that it looks nice in the emails that I am sending out to subscribers:

    @Scope("step")
    public class NotifySubscribersItemProcessor implements ItemProcessor<User, User> {

        @Autowired
        EntityManager em;

        @Value("#{jobParameters[updateFrequency]}")
        String updateFrequency;

        @Override
        public User process(User item) throws Exception {

            String sqlInnerJoinEpisodes = "select e from User u JOIN u.podcasts p JOIN p.episodes e WHERE u.email=?1 AND p.updateFrequency=?2 AND"
                    + " e.isNew IS NOT NULL AND e.availability=200 ORDER BY e.podcast.podcastId ASC, e.publicationDate ASC";
            TypedQuery<Episode> queryInnerJoinepisodes = em.createQuery(sqlInnerJoinEpisodes, Episode.class);
            queryInnerJoinepisodes.setParameter(1, item.getEmail());
            queryInnerJoinepisodes.setParameter(2, UpdateFrequency.valueOf(updateFrequency));

            List<Episode> newEpisodes = queryInnerJoinepisodes.getResultList();

            return regroupPodcastsWithEpisodes(item, newEpisodes);
        }
        .......
    }

Note: If you’d like to find out more about how to use the Apache HTTP Client to get the etag and last-modified headers, you can have a look at my post – How to use the new Apache Http Client to make a HEAD request

6. Execute the batch application

Batch processing can be embedded in web applications and WAR files, but in the beginning I chose the simpler approach of creating a standalone application that can be started by the Java main() method:

    package org.podcastpedia.batch;
    //imports ...;

    @ComponentScan
    @EnableAutoConfiguration
    public class Application {

        private static final String NEW_EPISODES_NOTIFICATION_JOB = "newEpisodesNotificationJob";
        private static final String ADD_NEW_PODCAST_JOB = "addNewPodcastJob";

        public static void main(String[] args) throws BeansException, JobExecutionAlreadyRunningException, JobRestartException, JobInstanceAlreadyCompleteException, JobParametersInvalidException, InterruptedException {

            Log log = LogFactory.getLog(Application.class);

            // fail early with a clear message instead of an ArrayIndexOutOfBoundsException
            if (args.length == 0) {
                throw new IllegalArgumentException("Please provide a valid Job name as first application parameter");
            }

            SpringApplication app = new SpringApplication(Application.class);
            app.setWebEnvironment(false);
            ConfigurableApplicationContext ctx = app.run(args);
            JobLauncher jobLauncher = ctx.getBean(JobLauncher.class);

            if (ADD_NEW_PODCAST_JOB.equals(args[0])) {
                // addNewPodcastJob
                Job addNewPodcastJob = ctx.getBean(ADD_NEW_PODCAST_JOB, Job.class);
                JobParameters jobParameters = new JobParametersBuilder()
                        .addDate("date", new Date())
                        .toJobParameters();

                JobExecution jobExecution = jobLauncher.run(addNewPodcastJob, jobParameters);

                // re-read the status on every iteration; a status captured once would never change
                while (jobExecution.getStatus().isRunning()) {
                    log.info("*********** Still running.... **************");
                    Thread.sleep(1000);
                }
                log.info(String.format("*********** Exit status: %s", jobExecution.getExitStatus().getExitCode()));
                JobInstance jobInstance = jobExecution.getJobInstance();
                log.info(String.format("********* Name of the job %s", jobInstance.getJobName()));

                log.info(String.format("*********** job instance Id: %d", jobInstance.getId()));

                System.exit(0);

            } else if (NEW_EPISODES_NOTIFICATION_JOB.equals(args[0])) {
                JobParameters jobParameters = new JobParametersBuilder()
                        .addDate("date", new Date())
                        .addString("updateFrequency", args[1])
                        .toJobParameters();

                jobLauncher.run(ctx.getBean(NEW_EPISODES_NOTIFICATION_JOB, Job.class), jobParameters);

            } else {
                throw new IllegalArgumentException("Please provide a valid Job name as first application parameter");
            }

            System.exit(0);
        }
    }

The best explanation of the SpringApplication, @ComponentScan and @EnableAutoConfiguration magic comes from the source – Getting Started – Creating a Batch Service:

“The main() method defers to the SpringApplication helper class, providing Application.class as an argument to its run() method. This tells Spring to read the annotation metadata from Application and to manage it as a component in the Spring application context.

The @ComponentScan annotation tells Spring to search recursively through the org.podcastpedia.batch package and its children for classes marked directly or indirectly with Spring’s @Component annotation. This directive ensures that Spring finds and registers BatchConfiguration, because it is marked with @Configuration, which in turn is a kind of @Component annotation.

The @EnableAutoConfiguration annotation switches on reasonable default behaviors based on the content of your classpath. For example, it looks for any class that implements the CommandLineRunner interface and invokes its run() method.”

Execution construction steps:

- the JobLauncher, which is a simple interface for controlling jobs, is retrieved from the ApplicationContext. Remember this is automatically made available via the @EnableBatchProcessing annotation.
- now, based on the first parameter of the application (args[0]), I retrieve the corresponding Job from the ApplicationContext
- then the JobParameters are prepared, where I use the current date, .addDate("date", new Date()), so that the job executions are always unique.
- once everything is in place, the job can be executed: JobExecution jobExecution = jobLauncher.run(addNewPodcastJob, jobParameters);
- you can use the returned jobExecution to gain access to the BatchStatus, exit code, or job name and id.

Note: I highly recommend you read and understand the Meta-Data Schema for Spring Batch. It will also help you better understand the Spring Batch domain objects.

6.1. Running the application on dev and prod environments

To be able to run the Spring Batch / Spring Boot application on different environments I make use of the Spring Profiles capability. By default the application runs with development data (database). But if I want the job to use the production database I have to do the following:

- provide the JVM argument -Dspring.profiles.active=prod
- have the production database properties configured in the application-prod.properties file in the classpath, right beside the default application.properties file

Summary

In this tutorial we’ve learned how to configure a Spring Batch project with Spring Boot and Java configuration, how to use some of the most common readers in batch processing, how to configure some simple jobs, and how to start Spring Batch jobs from a main method.

Note: As I mentioned, I am fairly new to Spring Batch, and especially to Spring Boot and Spring configuration with Java, so if you see any potential for improvement (code, job design etc.) please make a pull request or leave a comment below. Thanks a lot.

September 9, 2014

by Adrian Matei


Remote JMX Monitoring of a Mule Instance

In this post I will describe how to enable monitoring of a remote Mule instance using JMX. In addition, I will also enable the MX4J web interface, which exposes the JMX properties of the Mule instance in a web application, and I will install the Jolokia Mule agent, which makes it possible to use hawtio to monitor the Mule instance.

Enabling JMX for a Mule instance using a Mule application

As of writing this, the finest granularity for which JMX monitoring can be enabled is an entire Mule instance. Thus it is not possible to enable JMX monitoring on a per-application basis in a Mule instance. This of course has advantages and disadvantages, and we are going to use this fact to our advantage by deploying an application to a Mule server that does nothing but enable JMX monitoring for the server.

- Create a directory named “mulejmxenabler”.
- In this new directory, create a file named “mule-config.xml” with the following contents:
- In the above file, replace the IP address “192.168.1.73” with either the external IP address or the DNS name of the computer running the Mule server that you want to monitor. Optionally, you may change the port number 1096 to any port that you would rather use.
- Deploy the application by copying the directory mulejmxenabler with its contents to the apps directory of the Mule instance that is to be monitored.
- Verify in the log of the Mule instance that the application was successfully deployed (mule.log if you are using the community edition, mule_ee.log if you are using the enterprise edition). You should see something like this:

    info 2014-08-24 16:24:24,009 [mule.app.deployer.monitor.1.thread.1] org.mule.module.launcher.muledeploymentservice:
    ================== new exploded application: mulejmxenabler
    info 2014-08-24 16:24:24,010 [mule.app.deployer.monitor.1.thread.1] org.mule.module.launcher.application.defaultmuleapplication:
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    + new app 'mulejmxenabler' +
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    info 2014-08-24 16:24:24,668 [mule.app.deployer.monitor.1.thread.1] org.mule.module.launcher.muledeploymentservice:
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    + started app 'mulejmxenabler' +
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

The above is the minimum configuration that I need to enable remote JMX access to a Mule instance. Try to connect to the Mule instance using some JMX monitoring application, like jvisualvm or Java Mission Control. The following steps describe how to do this when using Java Mission Control:

- Launch Java Mission Control.
- Right-click in the JVM browser pane and select New Connection.
- In the Host field, enter the IP address or DNS name that you used earlier in the mule-config.xml file. In my case that will be “192.168.1.73”.
- In the Port field, enter the port number that you also entered in the mule-config.xml file earlier. In my case it was “1096”.
- Click the Test Connection button. The status should be reported as OK.
- Click the Finish button.

You are now ready to connect to the MBean server of the Mule instance and start monitoring it.

This approach has the advantage of allowing modifications to the JMX configuration of a Mule instance without having to restart it. If I want to make any modifications to the JMX configuration, or enable or disable JMX in the Mule instance, I just edit the mule-config.xml file in the apps/mulejmxenabler directory in the Mule instance home and save it.
If hot deployment has not been disabled, the Mule server will automatically re-deploy the application and pick up the changes.

Enabling JMX for a Mule instance using wrapper parameters

One alternative for enabling JMX monitoring of a Mule instance that I have had success with is to add a set of Java VM parameters to the wrapper configuration file of the Mule instance. The wrapper configuration file is named “wrapper.conf” and can be found in the conf directory of the Mule instance. To enable remote JMX access, add the following additional JVM parameters to your wrapper.conf file:

# Enables remote JMX management without authentication or SSL over port 1096.
wrapper.java.additional.4=-Dcom.sun.management.jmxremote
wrapper.java.additional.5=-Dcom.sun.management.jmxremote.port=1096
wrapper.java.additional.6=-Dcom.sun.management.jmxremote.authenticate=false
wrapper.java.additional.7=-Dcom.sun.management.jmxremote.ssl=false
wrapper.java.additional.8=-Djava.rmi.server.hostname=192.168.1.73

Important notes!
You must adjust the numbering of the additional JVM parameters, as described in the comments in the wrapper.conf file! Thus 4, 5, 6, 7 and 8 may not be the correct numbers for your wrapper.conf file.
You must change the IP address to the remote IP address of the computer on which the Mule server is running.
You may want to change the port number.
The Mule instance must be restarted after the modifications in order for them to take effect.

The drawback with this approach is that the Mule instance needs to be restarted each time there is a modification to the JMX-related configuration parameters.

With the target computer running Mac OS X, I need to choose one of the two described ways to enable remote JMX monitoring: if I use wrapper parameters, the mulejmxenabler application will not start properly. On Ubuntu I can choose either of the two approaches, and I was also able to use both approaches simultaneously without errors.
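With the wrapper parameters in place, a connection can also be checked from code rather than from a monitoring GUI. The following is a minimal sketch, not part of the original setup: the class name MuleJmxProbe is made up for illustration, and the "/jmxrmi" path is the default used by the JVM's built-in JMX agent that the com.sun.management.jmxremote parameters enable. It is not assumed to match a connector URL configured in a Mule application.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class; the name is an assumption made for this example.
class MuleJmxProbe {

    // Builds the RMI service URL of the JVM's built-in JMX agent.
    static JMXServiceURL serviceUrl(String host, int port) {
        try {
            return new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad host or port", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the address and port used in this post.
        String host = args.length > 0 ? args[0] : "192.168.1.73";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 1096;

        // No credentials are passed, matching authenticate=false and ssl=false.
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port))) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            System.out.println("MBean count: " + mbeans.getMBeanCount());
            for (String domain : mbeans.getDomains()) {
                System.out.println("Domain: " + domain);
            }
        }
    }
}
```

If the connection succeeds, the program lists the JMX domains registered in the remote JVM. A connection failure usually points at the host name, the port or a firewall in between.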
Enabling the MX4J web interface

If you are using the first approach, enabling JMX using a Mule application, then adding the following XML element to the mule-config.xml file as a child element of the element enables the MX4J web interface exposing the JMX beans of the Mule instance in question:
As before, the IP address needs to be changed to the external IP address of the computer on which the Mule instance is run. The port number may be changed from 1100 to any free port that you would rather use. The MX4J web interface may, in my case, then be accessed using the URL http://192.168.1.73:1100.

Install the Jolokia Mule agent

With the first approach to enabling JMX in a Mule instance described above, you can also enable Jolokia, the JMX-to-HTTP bridge that makes it possible to use hawtio, an extensible web-based management console for Java applications.

Note! Version 1.2.2 of the Jolokia Mule agent, which at the time of writing is the latest, does not work with Mule 3.5 due to the Jetty libraries having been updated from version 6 to version 8. In this example I used Mule 3.4, which is the latest version of Mule that still uses the Jetty 6 libraries, as far as I know.

Download the Jolokia Mule agent from http://www.jolokia.org/download.html
Move the agent JAR file to the lib/opt directory in the Mule instance that you want to monitor.
Add the following configuration as a child element of the element in the mule-config.xml of the mulejmxenabler application created earlier:
The port number may be modified as desired.
Open the URL http://192.168.1.73:1095/jolokia/ in a web browser. Note that you have to change the IP address to the remote IP address of the computer running the Mule instance and the port to the port number entered in the configuration earlier.
If the Jolokia agent was successfully enabled, you should see something like this in your web browser:

{"timestamp":1408898064,"status":200,"request":{"type":"version"},"value":{"protocol":"7.2","config":{"maxDepth":"5","maxObjects":"10000","historyMaxEntries":"10","agentId":"192.168.1.73-1277-41c62ae4-mule","agentType":"servlet","debug":"false","debugMaxEntries":"100"},"agent":"1.2.2","info":{"product":"jetty","vendor":"mortbay","version":"6.1.26"}}}

With the Jolokia agent in place, we can now monitor the Mule instance using hawtio:

Download the hawtio JAR from http://hawt.io/getstarted/index.html
Launch hawtio using the command “java -jar hawtio-app-1.4.17.jar” (the name of the JAR file needs to match that of the file you downloaded).
hawtio should open a web browser and display a welcome page. If not, try the URL http://localhost:8080/hawtio/welcome
Click the Connect button in the upper left corner of the hawtio webpage.
Click the Remote button.
On the right, enter the remote IP address of the computer on which your Mule instance is running in the Host field; 192.168.1.73 in my case.
Enter the port from the configuration file in the Port field; 1095 in my case.
Make sure the Use Proxy checkbox is checked.
Click the Connect to Remote Server button. A new window or tab should open in your web browser.
Click the Dashboard button in the upper left corner. You should now see information about the target computer, such as CPU and memory usage.
Click the JMX button next to the Dashboard button. Here you find the regular JMX management features.
Click the Threads button next to the JMX button. This tab shows information about the different threads running in the target JVM.

Final words

There are more things to tweak, but I hope that this will get you started with remote JMX monitoring of a Mule instance. From there you can start modifying the parameters, adding security and so on, according to your requirements.
More information on Mule JMX management can be found here: http://www.mulesoft.org/documentation/display/current/jmx+management
The Oracle webpage on JMX technology can be found here: http://www.oracle.com/technetwork/java/javase/tech/javamanagement-140525.html
Finally, the Oracle JMX tutorial can be found here: http://docs.oracle.com/javase/tutorial/jmx/
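Once the Jolokia agent answers on /jolokia/, individual JMX attributes can be read over plain HTTP, which is what hawtio does behind the scenes. The sketch below is an illustration only. It assumes the agent listens on 192.168.1.73:1095 as in this post, uses Jolokia's GET form /jolokia/read/<MBean name>/<attribute>, and reads the standard JVM MBean java.lang:type=Memory. The class name JolokiaRead and its methods are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption made for this example.
class JolokiaRead {

    // Builds a Jolokia "read" URL for one attribute of one MBean.
    static String readUrl(String host, int port, String mbean, String attribute) {
        return "http://" + host + ":" + port + "/jolokia/read/" + mbean + "/" + attribute;
    }

    // Performs a plain HTTP GET and returns the response body (JSON text).
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Host and port are the values used in this post; adjust them to your setup.
        String url = readUrl("192.168.1.73", 1095, "java.lang:type=Memory", "HeapMemoryUsage");
        System.out.println(fetch(url));
    }
}
```

The response is a JSON document of the same general shape as the version response shown earlier, with the attribute value under the "value" key.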

August 26, 2014

by Ivan K

· 18,603 Views · 1 Like

Java Unit Testing Interview Questions

This article presents some frequently asked interview questions about unit testing Java code. Please suggest other questions that you have come across and I shall include them in the list below.

What is unit testing? Which unit testing framework did you use? What are some of the common Java unit testing frameworks?
Ans: Read the definition of unit testing on the Wikipedia page for unit testing. Simply speaking, unit testing is about testing a block of code in isolation. There are two popular unit testing frameworks in Java: JUnit and TestNG.

In the SDLC, when is the right time to start writing unit tests?
Ans: Test-along, if not test-driven; writing unit tests towards the end is not very effective. The test-along technique recommends that developers write the unit tests as they go with their development.

With JUnit 4, do we still need methods such as setUp and tearDown?
Ans: No. This is taken care of with the help of the @Before and @After annotations respectively.

What do the following JUnit test annotations mean?
Ans: Following is a list of frequently used JUnit 4 annotations:
@Test (identifies a test method)
@Before (the method will execute before every JUnit 4 test)
@After (the method will execute after every JUnit 4 test)
@BeforeClass (the method will be executed once, before the JUnit tests for a class start)
@AfterClass (the method will be executed once, after the JUnit tests for a class have completed)
@Ignore (the test method will not be executed)

How does one unit test exception handling using the @Test annotation?
Ans: @Test(expected={exception class}). For example: @Test(expected=IllegalArgumentException.class)

Write a sample unit test method that expects an IndexOutOfBoundsException when working with ArrayList.
@Test(expected=IndexOutOfBoundsException.class) public void outOfBounds() { new ArrayList
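What the expected attribute checks can be illustrated without JUnit on the classpath. The sketch below is not JUnit's implementation: throwsExpected is a hypothetical helper written for this example, and reading index 0 of an empty ArrayList is used simply as a well-known way to trigger IndexOutOfBoundsException. The helper reports success only when the code under test throws an exception of the expected type, which mirrors the pass condition of @Test(expected=...).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the check performed by @Test(expected=...).
class ExpectedExceptionSketch {

    // Returns true only if running the action throws the expected exception type.
    static boolean throwsExpected(Class<? extends Throwable> expected, Runnable action) {
        try {
            action.run();
        } catch (Throwable t) {
            // A different exception type counts as a failure.
            return expected.isInstance(t);
        }
        // Nothing was thrown: a test declared with expected=... would fail here.
        return false;
    }

    public static void main(String[] args) {
        List<Object> empty = new ArrayList<>();
        boolean passed = throwsExpected(IndexOutOfBoundsException.class, () -> empty.get(0));
        System.out.println("outOfBounds passed: " + passed);
    }
}
```

In a real JUnit 4 test the same condition is expressed declaratively through the annotation, and the test body contains only the statement that is expected to throw.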

August 6, 2014

by Ajitesh Kumar

· 48,471 Views · 3 Likes

Using the OpenXML SDK Productivity Tool to "decompile" Office Documents

Ode To Code - Easily Generate Microsoft Office Files From C#

"... These days, Office files are no longer in a proprietary binary format, and we can create the files directly without using COM automation. A .docx Word file, for example, is a collection of XML documents zipped into a single file. The official name of the format is Open XML. There is an SDK to help with reading and writing OpenXML, and a Productivity Tool that can generate C# code for a given file. All you need to do is load a document, presentation, or workbook into the tool and press the “Reflect Code” button. The downside to this tool is that even a simple document will generate 4,000 lines of code. Another downside is that the generated code assumes it will write directly to the file system, however it is easy to pass in an abstract Stream object instead. So while this code isn’t perfect, the code does produce valid document and..."

I've been blogging about the OpenXML SDK for years now, but I think this is the first time I've seen this part of it, this utility. And like he says, 4K LoC is, well, a lot, but it does look like an awesome way to learn the low-level OpenXML SDK ins and outs.

Related Past Post XRef: Open Sesame - Open XML SDK is now open source Using OpenXML to load an Excel Worksheet into a DataTable (or just how different OpenXML is from the old Excel API we're used too) Using OpenXML SDK to generate Word documents via templates (and without Word being installed) Checking for Microsoft Word DocX/DocM Revisions/Track Changes without using Word... (via OpenXML SDK, LINQ to XML or XML DOM) LINQ to XlsX... Using VB.Net, LINQ, the OpenXML SDK and a little C# helper, to query an Excel XlsX Using native OpenXML to create an XlsX (Which provides an example of why I highlight tools that make OpenXML easier...) Generating Xlsx's on the Server? You're using OpenXML, right? With help from the PowerTools for OpenXML?
Official boat-load, as in supertanker, sized OpenXML content list (Insert "One OpenXML content list to rule them all" here) So how do I get from here to OpenXML? Got a map for you, an Open XML SDK Blog Map… Where to go to scratch your OpenXML dev info itch… "Open XML Explained" Free eBook (PDF) The Noob's Guide to Open XML Dev (If you know how to spell OpenXML but that's about it, this is your Getting Started guide...) Reusing the PowerShell PowerTools for Open XML in your C# or VB.Net world PowerShell, OpenXML, WMI and the PowerTools for OpenXML = Doc generation for our inner geek Because it’s a PowerShell kind of day… PowerTools for Open XML V1.1 Released OpenXML PowerTools updated – Cell your Excel via PowerShell Powering into OpenXML with PowerShell Open XML SDK 2.0 for Microsoft Office Released – Automate Office documents without Office Open XML 2.0 Code Snippets for VS2010 (and VS2008 too) Open XML Format SDK 2.0 Code Snippets for Visual Studio 2008 – 52 C#/VB Code Snippets to help ease your Open XML coding Open XML File Format Code Snippets for Visual Studio 2005 (Office 2007 NOT required) Open XML SDK v1 Released OpenXML Viewer 1.0 Released – Open source DocX to HTML conversion, with IE, Firefox and Opera (and/or command line) support

July 31, 2014

by Greg Duncan

· 16,608 Views

A Simple Cron Wrapper Script With Logging

When working with the crontab service, one thing I often need is to capture the output of a job. Making the job script itself aware of output handling and logging is tedious, and often makes the script harder to read. So I wrote a shell wrapper that redirects all of a job script's STDOUT and STDERR into a log file. This way I can inspect the output after a job has run, and the job script can focus on the task itself.

    #!/bin/bash
    # file: runcmd.sh
    # Helper/wrapper script to run any command in the crontab env. This script will ensure
    # user profile script is loaded and to log any command output into log files. It also
    # ensure not to print anything to STDOUT to avoid crontab system mail alert.
    #
    # NOTE: be sure to pass in absolute path of the command to be run so it can be found.
    #
    # Usage:
    #   ./runcmd.sh find $HOME/crontab/test.sh               # Simple use case
    #   LOG_NAME=mytest ./runcmd.sh $HOME/crontab/test.sh    # Change the log name to something specific
    #

    # Options
    DIR=`dirname $0`
    CMD="$@"
    CMD_NAME=`basename $1`
    LOG_NAME=${LOG_NAME:=$CMD_NAME}
    LOG="$DIR/logs/$LOG_NAME.log`date +%s`"

    # Ensure logs dir exists
    if [[ ! -e $DIR/logs ]]; then
        mkdir -p $DIR/logs
    fi

    # Run cron command
    source $HOME/.bash_profile
    echo "`date` Started cron cmd=$CMD, logname=$LOG_NAME" >> $LOG 2>&1
    $CMD >> $LOG 2>&1
    echo "`date` Cron cmd is done." >> $LOG 2>&1

With this wrapper, you can run any shell script and its output will be recorded. For example, the job script below cleans up the logs accumulated in our logs folder. Note that the wrapper also sources ".bash_profile" automatically. This is often needed if your job script expects all the environment variables you have already set up in your login shell scripts.

    # file: remove-crontab-logs.sh
    DIR=`dirname $0`/logs
    echo "Checking and removing logs in $DIR"
    find $DIR -type f -mtime +31 -print -delete
    echo "Done"

Now in the crontab file, you may run the job script like this:

    # Clean up crontab logs
    @monthly $HOME/crontab/runcmd.sh $HOME/crontab/remove-crontab-logs.sh

June 9, 2014

by Zemian Deng

· 8,174 Views

New report looks at the role of Chambers of Commerce

The business world is an increasingly complex one. The past few IBM CEO surveys have highlighted the growing importance of being able to manage this complexity, and of doing so in a collaborative way. This shifting zeitgeist was reflected in a series of seminars hosted by Xincus, and prompted the launch of a two-phase study into how Chambers of Commerce can evolve within this new landscape.

The study, consisting of in-depth one-on-one interviews and a nationwide online survey, aimed to better understand both how Chambers can adapt and what changes would be required to do so. The findings from this research are now available in a new paper called Chamber 2.0: Digital – Connected – Global. The paper outlines both the main challenges currently facing Chambers and the steps they can take to thrive in such an environment.

Amongst the main challenges identified by the research was a fundamental desire to change and modernize, with a strategic positioning and business model that would allow Chambers to flourish. There was also a strong desire to work more effectively with partners, both inside and outside of the Chamber network, sharing both resources and insights.

The report concludes with a road map derived by combining these findings from within the network with best practice from the wider business world. The road map consists of five broad stages, each containing more detailed steps Chambers can take to prepare for the modern world.

Become a one-stop shop for members, including positioning the Chamber brand for the modern world as centers for Business, Innovation, and Economic Development, with a new and modernized approach to business that sees an adaptive and responsive leadership style as essential to a revitalized business model.
Offer new value, with a new emphasis on virtual services to reflect modern ways of working. Chambers will become a solution hub that connects and matchmakes members, with co-working spaces connecting the physical and virtual worlds.
Collaborate beyond borders, by building an extensive Chamber alliance network, allowing Chambers to become specialized regional hubs whilst tapping into the collective wisdom of the entire network, as well as offering “health-club” type e-memberships to professionals, academics, entrepreneurs and “free agent” millennials alike.
Nurture new economic development, by facilitating entrepreneurial collaboration between members and stakeholders, connecting the right people with the right resources, helping to forge an innovation economy and a thriving business community and jobs.
Foster global innovation ecosystems, by tying all of these communities together to form a hyperconnected ecosystem, with Chambers at its heart, thus empowering the next wave of new economic development around the world.

The report makes clear that whilst change is desired, the network remains positive that the right developments will occur. With Chambers striving to maintain their position at the heart of the business community, this report will go some way towards helping them achieve that goal. You can get your copy of the report here. Original post

May 22, 2014

by Adi Gaskell

· 3,506 Views

3 Reasons Why Knowledge Worker Engagement Is Decreasing

Due to the technological development of the Internet and social media, markets are no longer created and controlled with broadcast marketing. People can now find and connect with people like themselves all over the world, and are no longer limited to the people in their close proximity and to existing ties such as family members, friends, colleagues or neighbors. They can connect with anyone, and they all influence each other, immediately and with multiplier effects. The power is shifting from companies to consumers. It is a radical shift, but it was predicted as early as the mid-1990s by marketing guru Philip Kotler as a consequence of the Internet. So we shouldn’t be too surprised. Yet a lot of companies are. And they haven’t prepared at all for this. Companies and organizations are waking up to a new reality, and the wake-up call can sometimes be harsh. A number of things are changing, and I will mention four of these here.

1. Change and uncertainty is the new normal

To start with, today’s business environment is anything but static. It’s changing faster and faster, and in new ways. It’s becoming more and more unpredictable. This means that companies and organizations can’t do long-term planning like they used to. Instead they have to be prepared for change, and to quickly adapt to new conditions and situations, such as changing consumer behaviors, new competition, new innovations, and so forth.

2. Diminishing return on optimization efforts

The second big change is that the return on optimization efforts is diminishing. The companies that lead the development in their industries, and get all the profit, are those that are able to create new value. They don’t do that with optimization. They do it by innovating new products and services, by creating and developing relationships with consumers and others, by collaborating internally and externally, and by constantly learning how to change their strategies.

3. Growth and efficiency is not enough

Thirdly, being able to grow in terms of production volumes, market presence and market share is not enough to be successful; neither is producing and marketing products or services as efficiently as possible. Instead, continuous innovation and high responsiveness to change and customer demands are becoming more and more critical. This obviously can’t be addressed solely by streamlining and optimizing transactional processes, as we have done for the last few decades with the help of information technology. Innovation and responsiveness require empowered people who can collaborate efficiently and effectively. That is why collaboration is the new productivity frontier.

4. Non-routine knowledge work is increasing in importance

Finally, we can see that work is shifting from manual work to knowledge work, but most importantly from routine work to non-routine work. Computers and software are taking over repetitive and routine-based knowledge work, just as robots have replaced workers doing repetitive and routine manual work in the factories. The work that remains, and is increasing, is non-routine knowledge work that is often highly interdependent, such as problem solving, product development, sales and so forth. Knowledge work is completely different from most of the work that organizations tried to improve and optimize during the 20th century. It’s fluid, dynamic, unpredictable, and non-repeatable. Knowledge workers need to look beyond the standard ways of doing things, and to question information, rules, and ways of working. This is completely different from working at a production line in a factory, where workers follow predefined and highly repeatable processes and procedures. Most organizations have been designed for efficiency and economies of scale, not for empowering people and enabling collaboration, innovation and responsiveness.
Too often, knowledge workers feel like they are just cogs in a big machine. And unfortunately, in most cases, their feelings are justified. We see all over that employees are increasingly dissatisfied with their jobs. Employee engagement is falling. It is especially bad in large and distributed organizations. The consequences are many and severe. Innovation is stifled. Productivity is, if not falling, not improving. People are leaving, or want to leave, their jobs. It is hard to sustain and improve quality. And it’s not possible to recruit and retain talent of the quality and in the numbers that are needed. So what is causing this? I have grouped a number of causes into three overall themes: complexity, inflexibility, and disconnectedness.

1. Complexity

As knowledge workers we often find ourselves stuck between a rock and a hard place. Workload and complexity at work are increasing, while at the same time we are expected to produce more, faster and faster, and to adapt to new conditions. Not only that, we are expected to be creative and innovative as well. Still, if we look at an average day in the life of a knowledge worker, we struggle a lot with finding answers to basic questions, such as what is happening in our work environment, who is doing what, where I can find a piece of information, when it is my turn to contribute, and so on. This means that we spend a lot of time on things that are not creating value, just getting ready to create value. For example, Intel estimated that their employees spent one day per week trying to find information and locate the expertise they needed to do their job. Although the tasks of knowledge workers come in all shapes and sizes, many of them rely on a number of basic capabilities, such as finding information or locating expertise. These capabilities are vital to knowledge worker productivity, but also to innovation, and it is evident that poor capabilities generate a lot of waste.
Of course, we constantly get new tools that aim to help us. But when new tools and features are introduced to knowledge workers, there is often no guidance, and little customization to fit our needs. The problem is that we already have a huge pile of complex products to deal with, and the new tools need to fit in with these. This technology-centric approach adds complexity instead of reducing it and making things simpler for us. A study by Oracle found that the productivity of enterprise application users had fallen by almost one fifth over a period of only three years. It’s like giving everybody Friday off. How can that be? I would argue that it’s the increasing complexity that is hampering productivity.

2. Inflexibility

The second theme is inflexibility. By this I mean that our organizations, and the systems that are there to help us get our work done, are designed in a way that makes change, creativity and improvisation hard. Instead of empowering knowledge workers, our organizations often constrain us and prevent us from being productive and innovative. First of all, there is a mismatch between what science knows and what organizations do when it comes to how they try to motivate knowledge workers to perform better. In most organizations, existing performance models are built on extrinsic motivators, or carrots and sticks if you like. These models worked fairly well for the routine, left-brain, rule-based work of the 20th century, but they are not working very well for right-brained, creative, and self-propelled people performing non-routine and highly collaborative conceptual tasks. For example, bonuses and commissions don’t work for this kind of work. As a matter of fact, science shows they have the opposite effect to the one intended: the higher the extrinsic rewards, the worse the performance gets. Organizations are apparently making important decisions about their future based on the wrong assumptions.
The left circle in this Venn diagram represents things that have been considered important for trying to maximize the productivity of manual routine work. The right circle represents things that are important for motivating knowledge workers doing non-repetitive work. There is still little understanding and experience of how to do the things in the right circle, so organizations and managers tend to stick with the things they know how to do. Those are the things in the left circle. Furthermore, knowledge workers need to have flexible working conditions. When it comes to knowledge work, work is not a place, it is something you do. Most knowledge worker tasks can be performed from any location, even those that require close collaboration with others. Organizations need to support this, not only to increase performance, but also to make people more engaged at work. Research shows that what employees of all age groups want is the flexibility to determine for themselves where, when, and how they work, and that increasing workplace flexibility has a positive effect on employee engagement and thereby also on employee productivity. A Virgin Media Business study found that 40% of the surveyed organizations often overhear employees complain about being tied to their desks, and that 7 in 10 organizations believe flexible working would make their employees both happier and more productive, boosting employee engagement.

3. Disconnectedness

Finally, we have a theme that I call disconnectedness. It is about people and information being disconnected from each other, and thereby unable to share, cooperate and collaborate as is required to be productive and deal with the challenges organizations face. Collaborating isn’t as easy as it sometimes might sound, especially not in large and distributed organizations; there are too many barriers to collaborating naturally across an organization and across locations.
In a complex and constantly changing work environment, it becomes even harder to find time and energy to overcome these barriers. It is only natural that we tend to share, cooperate and collaborate with people in our close proximity whom we already know and trust, failing to help and collaborate with others or share information that they might have use for. People work in silos. Silo thinking is a typical phenomenon in large organizations. Teams tend to focus on the parts they are responsible for and specialize in. They sub-optimize and focus on their own goals. They become organizational barriers that limit communication and impede sharing, collaboration, and innovation within the enterprise. Organizations have also created digital work environments to optimize personal productivity and teamwork, but in doing so they have neglected the fact that knowledge work increasingly relies on collaboration in networks across locations and organizations, stretching far beyond teams. It might seem like a paradox, but modern and increasingly digital work environments have in fact made people more isolated and unaware of what is happening at work. This disconnectedness means that people become less engaged. And in a rapidly changing and complex work environment, this has serious implications, such as lost productivity and innovation. Or worse: talent is wasted and people leave.

What to do about it?

So what should organizations do to avoid the negative consequences of complexity, inflexibility and disconnectedness? The simple answer is that they should start working towards increased simplicity, flexibility and connectedness. What they should do and how, I will return to in my next post.

April 28, 2014

by Oscar Berg

· 3,428 Views

The Programmer Productivity Paradox

Programmers seem to be fairly productive people. You always see them typing at their desks; they chafe for meetings to finish so that they can go back to their desks and code. When asked, they will say that there is not enough time to produce the code, and the sooner they can start coding, the sooner they will be done. So writing code must be the most important thing, correct?

If the average programmer writes about 50 lines of production code a day, a 50,000-line program would take 1,000 man-days to produce. The 50,000-line listing can be entered by a programmer at about 1,000 lines a day, or in about 50 man-days. So what the heck are the developers doing for the other 950 days?

Before addressing that issue, let's make a simple observation. Capers Jones has compared many methodologies (RUP, XP, Agile, Waterfall, etc.) and programming languages over thousands of projects and determined that programmers write between 325 and 750 lines of code (LOC) per month, which is less than the roughly 1,000 LOC per month that 50 lines a day over about 20 working days would give [1]. Even if programmers do not average 50 lines of code per day, the following is clear [2]:

Methodology does not explain the apparent productivity gap
No language accounts for the apparent productivity gap

The reality is that only a fraction of a developer's time is actually spent writing production code. If a developer is typing in code all the time, then they are really trying different combinations of code until they finally find the combination that works. Or more correctly, the combination that seems to match the requirements until either QA or the business analyst comes back and lets them know there is a problem. That is why developers who plan their code before using the keyboard tend to outperform other developers. Not only do few developers really plan out their code before coding, but years of experience do not teach developers to plan either.
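The back-of-the-envelope numbers above can be reproduced in a few lines. This is only an illustrative sketch: the class and method names are made up, and the rates of 50 and 1,000 lines per day are the round figures used in the text, not measurements.

```java
// Hypothetical illustration of the arithmetic in the text.
class ProductivityGap {

    // Days needed to produce totalLines at a steady rate, rounded up to whole days.
    static int daysNeeded(int totalLines, int linesPerDay) {
        return (totalLines + linesPerDay - 1) / linesPerDay;
    }

    public static void main(String[] args) {
        int programSize = 50_000;

        int producingDays = daysNeeded(programSize, 50);   // writing production code
        int typingDays = daysNeeded(programSize, 1_000);   // merely typing the listing in

        System.out.println("Days to produce the program: " + producingDays);
        System.out.println("Days to type the listing:    " + typingDays);
        System.out.println("Days spent on something else: " + (producingDays - typingDays));
    }
}
```

The difference between the two figures is the gap the rest of the article tries to explain.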
In fact, studies over 40 years show that developer productivity does not change with years of experience (see No Experience Required!).

Years of experience do not lead to higher productivity

Interestingly enough, there are methodologies that have been around for a long time that emphasize planning code. Watts Humphrey is the creator of the Personal Software Process (PSP) [3]. Measurements of PSP in use show that:

PSP can raise productivity by 21.2% and quality by 31.2%

If you are interested, there are many other proven methods of raising code quality that are not commonly used (see Not Planning is for Losers). If your developers are at their keyboards and not planning at a whiteboard, then odds are that your productivity is not as high as it could be.

Bibliography
[1] The Mythical Man-Month is even more pessimistic, suggesting that programmers produce 10 production lines of code per day.
[2] Jones, Capers and Bonsignour, Olivier. The Economics of Software Quality. Addison Wesley. 2011.
[3] Humphrey, Watts. Introduction to the Personal Software Process. Addison Wesley Longman. 1997.

March 17, 2014

by Dalip Mahal

· 111,862 Views · 24 Likes

Setting Job Goals for Your Team: Senior Developer and Designer

If your employees aren't continuing to grow, your company will become stagnant. Here we examine a goal-setting case study with the senior dev and the designer.

March 15, 2014

by Christina Popova

· 94,289 Views · 3 Likes

MongoDB and its locks

Sometimes, you need your jobs to be persisted to a database. Existing solutions such as Gearman only offered relational or file-based persistence, so they were a no-go for us and we went with MongoDB. Fast-forward a few months, and we had some problems with the database load. However, it was not that workers were pestering it too much: the problem was related to locks.

MongoDB locking model

As of 2.4, MongoDB holds write locks on an entire database for each write operation. Since atomicity is guaranteed only on a single document, this isn't usually a problem: even if you are inserting thousands of documents, you are doing so in thousands of different operations that can be interleaved with queries and other inserts with a fair policy. This sometimes results in count() queries being inconsistent as documents are moved and indexes are asynchronously updated. However, write corruption is nonexistent, as documents are a very cohesive entity.

Atomic operations over a single document still lock the whole database, as in the case of findAndModify(), which looks for a document matching a certain query and updates it with a $set operation before returning it; all in a single shot and with the guarantee that no other process will be able to perform the same read-and-write operation at the same time. You can see this operation is ideal for implementing workers based on a pull model, each asking the database for a new job to do and locking it with '$set: {locked: true}'. However, after the number of workers increases a little bit, locks become a problem.

Lock duration

We cleaned up the working space collection of our MongoDB database by keeping in it only the unfinished jobs, and moving all the rest (completed or failed) to a different collection for archival. As the load increased due to new contracts, we saw the locking time increase as well: the application and the workers were all hitting the same database.
The first of the problems was that, after reducing the specs of our primary server, we started seeing timeouts in unrelated code even though CPU and IO usage were low. The locks taken by workers to pick jobs were starting to take seconds or tens of seconds. Moreover, the MongoDB server started filling the logs with:

Fri Dec 6 00:01:07 [conn280998] warning: ClientCursor::yield can't unlock b/c of recursive lock...

I'm a user, not a MongoDB guru, but that does not look good, especially given that hundreds of these messages were written every day (although the queues continued to work correctly). We did not find any explanation for these messages in the documentation, but I suppose they mean that some operations are taking so long that they have to yield to make room for others, and that atomic operations cannot yield because doing so would break consistency.

An easy solution

Since MongoDB does not have collection-wide locks yet, we decided to move the job pool and the completed job collections to a different database. In this way, we had a main database with the usual collections and one containing just these two, named with a '_queue' suffix. Note that we're still writing to the same database server: there is still the same number of connections being created by each process. This solution preallocates more space, given that two databases are involved, but as you know space is cheap nowadays. Both the insertion of jobs and the worker reads must take place on the same database. Here is where we discovered that cohesion pays: if you have this information in a single place, it is very easy to change the configuration. If you have a singleton database, because "we should only have one database in this application, it will never change", this feature would cost you a lot. Fortunately, in our case it was about 10 lines of code, including the refactoring of the Factory Methods that create MongoDB database objects.
Long term

This solution is not for the long term, as we know the number of machines and the size of their worker pools will increase in the future; a sufficiently high number of workers will saturate the connections available on the MongoDB server and lock the common collection until picking a job takes dozens of seconds. The design towards which we are moving includes one "foreman" per machine, with many workers under its control; only the foreman polls the database and may lock the common collection. Distributing the job pool is not what we want, because a single pool makes it easy to retrieve a job in case something goes bad (ever done a query on multiple databases?). Also, we don't want a push solution, as it would involve registering workers or foremen with a central point of failure that assigns them their jobs. Since most of our servers are shut down and rebooted according to the user load, we prefer a dynamic solution where a server can start picking jobs whenever it wants and stop without notifying remote machines.
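The foreman design described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our production code: the Foreman class is a hypothetical name, an in-memory queue stands in for the common job collection, and the "work" is just recording the job id.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one foreman per machine, many local workers under its control.
class Foreman {
    private final Queue<String> sharedPool; // stand-in for the common job collection
    private final ExecutorService workers;  // the local worker pool
    private final List<String> done = Collections.synchronizedList(new ArrayList<String>());

    Foreman(Queue<String> sharedPool, int workerCount) {
        this.sharedPool = sharedPool;
        this.workers = Executors.newFixedThreadPool(workerCount);
    }

    // Only the foreman polls the shared pool; workers never touch it.
    // One-shot: the worker pool is shut down once the shared pool is drained.
    int drain() {
        int picked = 0;
        String job;
        while ((job = sharedPool.poll()) != null) {
            final String claimed = job;
            workers.submit(() -> { done.add(claimed); }); // placeholder for the real work
            picked++;
        }
        workers.shutdown();
        try {
            workers.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return picked;
    }

    List<String> completed() {
        return new ArrayList<>(done);
    }
}
```

With this shape, the number of processes contending for the shared pool grows with the number of machines rather than with the number of workers, and a machine can still start or stop pulling jobs without notifying anyone.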

December 6, 2013

by Giorgio Sironi

· 27,651 Views

Securing Docker’s Remote API

One piece of Docker that is AMAZING is the Remote API, which can be used to programmatically interact with docker. I recently had a situation where I wanted to run many containers on a host with a single container managing the other containers through the API. But the problem I soon discovered is that, at the moment, turning networking on is an all or nothing type of thing… you can’t turn networking off selectively on a container by container basis. You can disable IPv4 forwarding, but you can still reach the docker remote API on the machine if you can guess its IP address. One solution I came up with for this is to use nginx to expose the unix socket for docker over HTTPS and utilize client-side ssl certificates to only allow trusted containers to have access. I liked this setup a lot so I thought I would share how it’s done. Disclaimer: assumes some knowledge of docker!

Generate The SSL Certificates

We’ll use openssl to generate and self-sign the certs. Since this is for an internal service we’ll just sign it ourselves. We also remove the password from the keys so that we aren’t prompted for it each time we start nginx.

    # Create the CA Key and Certificate for signing Client Certs
    openssl genrsa -des3 -out ca.key 4096
    openssl rsa -in ca.key -out ca.key # remove password!
    openssl req -new -x509 -days 365 -key ca.key -out ca.crt

    # Create the Server Key, CSR, and Certificate
    openssl genrsa -des3 -out server.key 1024
    openssl rsa -in server.key -out server.key # remove password!
    openssl req -new -key server.key -out server.csr

    # We're self signing our own server cert here. This is a no-no in production.
    openssl x509 -req -days 365 -in server.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out server.crt

    # Create the Client Key and CSR
    openssl genrsa -des3 -out client.key 1024
    openssl rsa -in client.key -out client.key # no password!
    openssl req -new -key client.key -out client.csr

    # Sign the client certificate with our CA cert.
    # Unlike signing our own server cert, this is what we want to do.
    openssl x509 -req -days 365 -in client.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out client.crt

Another option may be to leave the passphrase in and provide it as an environment variable when running a docker container, or through some other means, as an extra layer of security. We’ll move ca.crt, server.key and server.crt to /etc/nginx/certs.

Setup Nginx

The nginx setup for this is pretty straightforward. We just listen for traffic on localhost on port 4242. We require client-side ssl certificate validation and reference the certificates we generated in the previous step. And most important of all, we set up an upstream proxy to the docker unix socket. I simply overwrote what was already in /etc/nginx/sites-enabled/default.

    upstream docker {
      server unix:/var/run/docker.sock fail_timeout=0;
    }

    server {
      listen 4242;
      server_name localhost;

      ssl on;
      ssl_certificate /etc/nginx/certs/server.crt;
      ssl_certificate_key /etc/nginx/certs/server.key;
      ssl_client_certificate /etc/nginx/certs/ca.crt;
      ssl_verify_client on;

      access_log on;
      error_log /dev/null;

      location / {
        proxy_pass http://docker;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        client_max_body_size 10m;
        client_body_buffer_size 128k;
        proxy_connect_timeout 90;
        proxy_send_timeout 120;
        proxy_read_timeout 120;
        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
      }
    }

One important piece to make this work is that you should add the user nginx runs as to the docker group so that it can read from the socket. This could be www-data, nginx, or something else!

Hack It Up!

With this setup in place and nginx restarted, let’s first run a few curl commands to make sure that everything is set up correctly. First we’ll make calls without the client cert to double check that we get denied access, then a proper one.

    # Is normal http traffic denied?
    curl -v http://localhost:4242/info
    # How about https, sans client cert and key?
    curl -v -s -k https://localhost:4242/info
    # And the final good request!
    curl -v -s -k --key client.key --cert client.crt https://localhost:4242/info

For the first two we should get some run of the mill 400 http response codes before we get a proper JSON response from the final command! Woot!

But wait, there’s more… let’s build a container that can call the service to launch other containers! For this example we’ll simply build two containers: one that has the client certificate and key and one that doesn’t. The code for these examples is pretty straightforward, and to save space I’ll leave the untrusted container out. You can view the untrusted container on github (although it is nothing exciting).

First, the node.js application that will connect and display information:

    https = require 'https'
    fs = require 'fs'

    options =
      host: '172.42.1.62'
      port: 4242
      method: 'GET'
      path: '/containers/json'
      key: fs.readFileSync('ssl/client.key')
      cert: fs.readFileSync('ssl/client.crt')
      headers: { 'Accept': 'application/json' } # not required, but being semantic here!

    req = https.request options, (res) ->
      console.log res

    req.end()

And the Dockerfile used to build the container. Notice we add the client.crt and client.key as part of building it!

    FROM shykes/nodejs
    MAINTAINER James R. Carr
    ADD ssl/client* /srv/app/ssl
    ADD package.json /srv/app/package.json
    ADD app.coffee /srv/app/app.coffee
    RUN cd /srv/app && npm install .
    CMD cd /srv/app && npm start

That’s about it. Run docker build . and docker run -n <IMAGE ID> and we should see a json dump to the console of the actively running containers. Doing the same in the untrusted directory should present us with some 400 error about not providing a client ssl certificate. I’ve shared a project with all this code plus a vagrant file on github for your own perusal. Enjoy!

October 31, 2013

by James Carr

· 14,313 Views

A MindMap for Java Developer Interviews

Over the years I have been a panelist in many interviews for Java developers. I have previously written a post titled Top 7 tips for succeeding in a technical interview for software engineers, which covers a few of the general guidelines. In this post I will share a mind map containing the general topics covered in a Java developer interview. I have prepared this as a general reference for myself, to remember the pointers and to keep a common standard across multiple interviews. XMind gives a nice listing of the map. You can find the map here. Here is an image which you can download and use. Finally, here is an old-fashioned tabbed content list which is easier to copy and paste. Java-Topics OOPs Encapsulation Abstraction Inheritance Interface - Abstract Class Casting IS-A vs HAS-A Relationships Aggregation vs Composition Polymorphism Method Overloading vs Method Overriding Compile time vs Runtime Threads Creating threads Multitasking Synchronization Thread Transitions Marker Interface Serialization Cloneable Shallow copy vs Deep Copy Collections Map, List and Set Equals - Hashcode Legacy - Synchronized Classes JVM Stack vs Heap Memory Garbage Collection JRE, JVM, JDK Class loaders Exception Checked Vs Unchecked Exceptions Exception handling best practices try, catch, finally, throw, throws APIs Files String - StringBuffer - StringBuilder Java IO XML SAX Based & DOM Based JAXB - Java Architecture for XML Binding Access specifier Access modifier public protected default private final static synchronized abstract transient volatile Inner/Nested Classes JavaEE Basics Packaging the Applications WAR EAR Basics MVC Servlets Listeners Lifecycle JSPs APIs JPA JAX-WS SOAP, WSDL Webservices basics Contract first vs JAX-RS RESTful and its advantages JSF This is a work in progress and I hope to refine it further. Let me know if you have any comments.

October 27, 2013

by Manu Pk

· 20,398 Views · 1 Like

The Blogging Programmer's Style Guide: Front-End or Frontend?

Even among the large IT/development publications, I see inconsistencies in the use of the word front-end. Is it hyphenated or not?

October 1, 2013

by Mitch Pronschinske

· 55,793 Views · 4 Likes

This is how Facebook develops and deploys software. Should you care?

A recently published academic paper by Prof. Dror Feitelson at Hebrew University, Eitan Frachtenberg a research scientist at Facebook, and Kent Beck (who is also doing something at Facebook), describes Facebook’s approach to developing and deploying its front-end software. While it would be more interesting to understand how back-end development is done (this is where the real heavy lifting is done scaling up to handle hundreds of millions of users), there are a few things in the paper that are worth knowing about. Continuous Deployment at Facebook is Not Continuous Deployment Rather than planning work out into projects or breaking work into time-boxed Sprints, Facebook developers do most of their work in independent, small changes that are released frequently. This makes sense in Facebook’s online business model, everyone constantly tuning the platform and trying out new options and applications in different user communities, seeing what sticks. It’s a credit to their architecture that so many small, independent changes can actually be done independently and cheaply. Facebook says that it follows Continuous Deployment, but it’s not Continuous Deployment the way that IMVU made popular where every change is pushed out to customers immediately, or even how a company like Etsy does Continuous Deployment. At Facebook, code can be released twice a day, but this is done mostly for bug fixes and internal code. New production code is released once per week: thousands of changes by hundreds of developers are packaged up by their small release team on Sundays, run through automated regression testing, and released on Tuesday if the developers who contributed the changes are present. Release engineers assess the risk of changes based on the size of the change, the amount of discussion done in code reviews (which is recorded through an internal code review tool), and on each developer’s “push karma”: how many problems they have seen from code by this developer before. 
A tool called “Gatekeeper” controls what features are available to which customers to support dark launching, and all code is released incrementally – to staging, then a subset of users, and so on. Changes can be rolled back if necessary – individually, or, as a last resort, an entire code release. However, like a lot of Silicon Valley DevOps shops, they mostly follow the “Real Men only Roll Forward” motto. Code Ownership A key to the culture at Facebook is that developers are individually responsible for the code that they write, for testing it and supporting it in production. This is reflected in their code ownership model: Developers must also support the operational use of their software — a combination that’s become known as “DevOps.” This further motivates writing good code and testing it thoroughly. Developers’ personal stake in keeping the system running smoothly complements the engineering procedures and lets the system maintain quality at scale. Methodologies and tools aren’t enough by themselves because they can always be misused. Thus, a culture of personal responsibility is critical. Consequently, most source files are modified by only a few engineers. Although at least one other engineer reviews all changes before they’re committed, a third of the source files have only been edited by one engineer, and another quarter by two. Only 10 percent of the files are handled by more than seven engineers. On the other hand, the distribution of engineers per file has a heavy tail, with the most widely shared file handled by no fewer than 870 distinct engineers. These widely shared files are predominantly library files and also include major configuration and top-level PHP files. Testing? We don’t need no stinking testing … Facebook doesn't have an independent test team, because, it says, it doesn't need one. First, they depend a lot on code reviews to find bugs: At Facebook, code review occupies a central position.
Every line of code that’s written is reviewed by a different engineer than the original author. This serves multiple purposes: the original engineer is motivated to ensure that the code is of high quality, the reviewer comes with a fresh mind and might find defects or suggest alternatives, and, in general, knowledge about coding practices and the code itself spreads throughout the company. Developers are also responsible for writing unit tests and their own regression tests – they have “tens of thousands of regression tests” (which doesn't sound like nearly enough for 10+ million lines of mostly PHP code compiled into C++, in both of which languages coding mistakes are easy to make) and automated performance tests. And developers also test the software by using the development version of Facebook for their personal Facebook use. According to the authors, “this is just one aspect of the departure from traditional software development”. But Facebook developers using their own software internally (and passing this off as “testing”) is no different than the early days at Microsoft where employees were supposed to “eat their own dog food”, a practice that did little if anything to improve the quality of Microsoft products. Facebook also depends on customers to test the software for it. Software is released in steps for A/B testing and “live experimentation” on subsets of the user base, whether customers want to participate in this testing or not. Because its customer base is so large, it can get meaningful feedback from testing with even a small percentage of users, which at least minimizes the risk and inconvenience to customers. Security??? While performance is an important consideration for developers at Facebook, there is no mention of security checks or testing anywhere in this description of how Facebook develops and deploys software. 
No static analysis, dynamic analysis/scanning, pen testing or explanation of how the security team and developers work together, not even for “privacy sensitive code” – although this code is “held to a higher standard” it doesn’t explain what this “higher standard” is. Presumably it relies on the use of libraries and frameworks to handle at least some AppSec problems, and possibly to look for security bugs in its code reviews, but it doesn't say. There isn’t much information available on Facebook’s AppSec program anywhere. The security team at Facebook seems to spend a lot of time educating people on how to use Facebook safely and how to develop Facebook apps safely and running their bug bounty program which pays outsiders to find security bugs for them. A search on security on Facebook mostly comes back with a long list of public security failures, privacy violations and application security vulnerabilities found over the years and continuing up to the present day. Maybe the lack of an effective AppSec program is the reason for this. This is the way Facebook is Developed. Should you care? While it’s interesting to get a look inside a high-profile organization like Facebook and how it approaches development at scale, it’s not clear why this paper was written. There is little about what Facebook is doing (on its front-end development at least) that is unique or innovative, except maybe the way it uses BitTorrent to push code changes out to thousands of servers like Twitter does, something that I already heard about a few years ago at Velocity and that has been written about before. I like the idea of developers being responsible for their work, all the way into production, which is a principle that we also follow. Code reviews are good. Dark launching features is a good practice and has been a common practice in systems for a long time (even before it was called "dark launching"). Not having testers or doing AppSec is not good. 
Otherwise, I'm not sure what the rest of us can learn from this or would want to adopt from it.

September 4, 2013

by Jim Bird

· 43,055 Views · 1 Like

Jersey Client: Testing External Calls

Jim and I have been doing a bit of work over the last week which involved calling neo4j’s HA status URI to check whether or not an instance was a master/slave and we’ve been using jersey-client. The code looked roughly like this: class Neo4jInstance { private Client httpClient; private URI hostname; public Neo4jInstance(Client httpClient, URI hostname) { this.httpClient = httpClient; this.hostname = hostname; } public Boolean isSlave() { String slaveURI = hostname.toString() + ":7474/db/manage/server/ha/slave"; ClientResponse response = httpClient.resource(slaveURI).accept(TEXT_PLAIN).get(ClientResponse.class); return Boolean.parseBoolean(response.getEntity(String.class)); } } While writing some tests against this code we wanted to stub out the actual calls to the HA slave URI so we could simulate both conditions and a brief search suggested that mockito was the way to go. We ended up with a test that looked like this: @Test public void shouldIndicateInstanceIsSlave() { Client client = mock( Client.class ); WebResource webResource = mock( WebResource.class ); WebResource.Builder builder = mock( WebResource.Builder.class ); ClientResponse clientResponse = mock( ClientResponse.class ); when( builder.get( ClientResponse.class ) ).thenReturn( clientResponse ); when( clientResponse.getEntity( String.class ) ).thenReturn( "true" ); when( webResource.accept( anyString() ) ).thenReturn( builder ); when( client.resource( anyString() ) ).thenReturn( webResource ); Boolean isSlave = new Neo4jInstance(client, URI.create("http://localhost")).isSlave(); assertTrue(isSlave); } which is pretty gnarly but does the job. I thought there must be a better way so I continued searching and eventually came across this post on the mailing list which suggested creating a custom ClientHandler and stubbing out requests/responses there. 
I had a go at doing that and wrapped it with a little DSL that only covers our very specific use case: private static ClientBuilder client() { return new ClientBuilder(); } static class ClientBuilder { private String uri; private int statusCode; private String content; public ClientBuilder requestFor(String uri) { this.uri = uri; return this; } public ClientBuilder returns(int statusCode) { this.statusCode = statusCode; return this; } public Client create() { return new Client() { public ClientResponse handle(ClientRequest request) throws ClientHandlerException { if (request.getURI().toString().equals(uri)) { InBoundHeaders headers = new InBoundHeaders(); headers.put("Content-Type", asList("text/plain")); return createDummyResponse(headers); } throw new RuntimeException("No stub defined for " + request.getURI()); } }; } private ClientResponse createDummyResponse(InBoundHeaders headers) { return new ClientResponse(statusCode, headers, new ByteArrayInputStream(content.getBytes()), messageBodyWorkers()); } private MessageBodyWorkers messageBodyWorkers() { return new MessageBodyWorkers() { public Map<MediaType, List<MessageBodyReader>> getReaders(MediaType mediaType) { return null; } public Map<MediaType, List<MessageBodyWriter>> getWriters(MediaType mediaType) { return null; } public String readersToString(Map<MediaType, List<MessageBodyReader>> mediaTypeListMap) { return null; } public String writersToString(Map<MediaType, List<MessageBodyWriter>> mediaTypeListMap) { return null; } public <T> MessageBodyReader<T> getMessageBodyReader(Class<T> tClass, Type type, Annotation[] annotations, MediaType mediaType) { return (MessageBodyReader<T>) new StringProvider(); } public <T> MessageBodyWriter<T> getMessageBodyWriter(Class<T> tClass, Type type, Annotation[] annotations, MediaType mediaType) { return null; } public <T> List<MediaType> getMessageBodyWriterMediaTypes(Class<T> tClass, Type type, Annotation[] annotations) { return null; } public <T> MediaType getMessageBodyWriterMediaType(Class<T> tClass, Type type, Annotation[] annotations, List<MediaType> mediaTypes) { return null; } }; } public ClientBuilder content(String content) { this.content = content; return
this; } } If we change our test to use this code it now looks like this: @Test public void shouldIndicateInstanceIsSlave() { Client client = client().requestFor("http://localhost:7474/db/manage/server/ha/slave"). returns(200). content("true"). create(); Boolean isSlave = new Neo4jInstance(client, URI.create("http://localhost")).isSlave(); assertTrue(isSlave); } Is there a better way? In Ruby I’ve used WebMock to achieve this and Ashok pointed me towards WebStub which looks nice except I’d need to pass in the hostname + port rather than constructing that in the code.
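One possible alternative, sketched here purely as an illustration and not something we have adopted, is to stand up a throwaway HTTP server inside the test using the JDK's built-in com.sun.net.httpserver.HttpServer. The SlaveEndpointStub class and its method names are hypothetical, and the isSlave method below is a plain-JDK stand-in for the jersey-client call rather than the Neo4jInstance code above. As with WebStub, the catch is that the port has to be passed in rather than hard-coded to 7474.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jersey-client or neo4j.
class SlaveEndpointStub {
    static final String SLAVE_PATH = "/db/manage/server/ha/slave";

    // Starts a throwaway server on a free loopback port that answers SLAVE_PATH with `body`.
    static HttpServer start(String body) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext(SLAVE_PATH, exchange -> {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    // Plain-JDK stand-in for isSlave(); note that the port is a parameter here.
    static boolean isSlave(String host, int port) throws IOException {
        URL url = new URL("http://" + host + ":" + port + SLAVE_PATH);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "text/plain");
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return Boolean.parseBoolean(new String(buf.toByteArray(), StandardCharsets.UTF_8).trim());
        } finally {
            conn.disconnect();
        }
    }

    // Stubs the endpoint with `body`, queries it once, and always stops the server.
    static boolean stubAndCheck(String body) {
        try {
            HttpServer server = start(body);
            try {
                return isSlave("127.0.0.1", server.getAddress().getPort());
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is that the test exercises a real HTTP round trip instead of mocks, at the cost of making the port injectable in the code under test.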

August 1, 2013

by Mark Needham

· 10,837 Views

Bucketing, Multiplexing and Combining in Hadoop - Part 1

this is the first blog post in a series which looks at some data organization patterns in mapreduce. we’ll look at how to bucket output across multiple files in a single task, how to multiplex data across multiple files, and also how to coalesce data. these are all common patterns that are useful to have in your mapreduce toolkit. we’ll kick things off with a look at bucketing data outputs in your map or reduce tasks. by default when using a fileoutputformat-derived outputformat (such as textoutputformat), all the outputs for a reduce task (or a map task in a map-only job) are written to a single file in hdfs. imagine a situation where you have user activity logs being streamed into hdfs, and you want to write a mapreduce job to better organize the incoming data. as an example a large organization with multiple products may want to bucket the logs based on the product. to do this you’ll need the ability to write to multiple output files in a single task. let’s take a look at how we can make that happen. multipleoutputformat there are a few ways you can achieve your goal, and the first option we’ll look at is the multipleoutputformat class in hadoop. this is an abstract class that lets you do the following: define the output path for each and every key/value output record being emitted by a task. incorporate the input paths into the output directory for map-only jobs. redefine the key and value that are used to write to the underlying recordwriter . this is useful in situations where you want to remove data from the outputs as it duplicates data in the filename. for each output path, define the recordwriter that should be used to write the outputs. ok enough with the words - let’s look at some data and code. first up is the simple data we’ll use in our example - imagine you work at a fruit market with locations in multiple cities, and you have a purchase transaction stream which contains the store location along with the fruit that was purchased. 
cupertino apple sunnyvale banana cupertino pear to help bucket your data for future analysis, you want to bin each record into city-specific files. for the simple data set above you don’t want to filter, project or transform your data, just bucket it out, so a simple identity map-only job will do the job. to force more than one mapper, we’ll write the data to two separate files. $ tab="$(printf '\t')" $ hdfs -put - file1.txt << eof cupertino${tab}apple sunnyvale${tab}banana eof $ hdfs -put - file2.txt << eof cupertino${tab}pear eof here’s the code which will let you write city-specific output files. import org.apache.commons.lang.stringutils; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.conf.configured; import org.apache.hadoop.fs.filesystem; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.text; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.identitymapper; import org.apache.hadoop.mapred.lib.multipletextoutputformat; import org.apache.hadoop.util.progressable; import org.apache.hadoop.util.tool; import org.apache.hadoop.util.toolrunner; import java.io.ioexception; import java.util.arrays; /** * an example of how to use {@link org.apache.hadoop.mapred.lib.multipleoutputformat}. */ public class mofexample extends configured implements tool { /** * create output files based on the output record's key name. */ static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } /** * the main job driver. 
*/ public int run(final string[] args) throws exception { string csvinputs = stringutils.join(arrays.copyofrange(args, 0, args.length - 1), ","); path outputdir = new path(args[args.length - 1]); jobconf jobconf = new jobconf(super.getconf()); jobconf.setjarbyclass(mofexample.class); jobconf.setnumreducetasks(0); jobconf.setmapperclass(identitymapper.class); jobconf.setinputformat(keyvaluetextinputformat.class); jobconf.setoutputformat(keybasedmultipletextoutputformat.class); fileinputformat.setinputpaths(jobconf, csvinputs); fileoutputformat.setoutputpath(jobconf, outputdir); return jobclient.runjob(jobconf).issuccessful() ? 0 : 1; } /** * main entry point for the utility. * * @param args arguments * @throws exception when something goes wrong */ public static void main(final string[] args) throws exception { int res = toolrunner.run(new configuration(), new mofexample(), args); system.exit(res); } } run this code and you’ll see the following files in hdfs, where /output is the job output directory: $ hadoop fs -lsr /output /output/cupertino/part-00000 /output/cupertino/part-00001 /output/sunnyvale/part-00000 if you look at the output files you’ll see that the files contain the correct buckets. $ hadoop fs -lsr /output/cupertino/* cupertino apple cupertino pear $ hadoop fs -lsr /output/sunnyvale/* sunnyvale banana awesome, you have your data bucketed by store. now that we have everything working, let’s look at what we did to get there. we had to do two things to get this working: extend multipletextoutputformat this is where the magic happened - let’s look at that class again. static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } you are working with text, which is why you extended multipletextoutputformat , a class that in turn extends multipleoutputformat . 
multipletextoutputformat is a simple class which instructs the multipleoutputformat to use textoutputformat as the underlying output format for writing out the records. if you were to use multipleoutputformat as-is it behaves as if you were using the regular textoutputformat , which is to say that it’ll only write to a single output file. to write data to multiple files you had to extend it, as with the example above. the generatefilenameforkeyvalue method allows you to return the output path for an input record. the third argument, name , is the original fileoutputformat -created filename, which is in the form “part-nnnnn”, where “nnnnn” is the task index, to ensure uniqueness. to avoid file collisions, it’s a good idea to make sure your generated output paths are unique, and leveraging the original output file is certainly a good way of doing this. in our example we’re using the key as the directory name, and then writing to the original fileoutputformat filename within that directory. specify the outputformat the next step was easy - specify that this output format should be used for your job: jobconf.setoutputformat(keybasedmultipletextoutputformat.class); earlier we also mentioned that you can use the input path as part of the output path, which we will look at next. using the input filename as part of the output filename in map-only jobs what if we wanted to keep the input filename as part of the output filename? this only works for map-only jobs, and can be accomplished by overriding the getinputfilebasedoutputfilename method. 
let’s look at the following code to understand how this method fits into the overall sequence of actions that the multipleoutputformat class performs: public void write(k key, v value) throws ioexception { // get the file name based on the key string keybasedpath = generatefilenameforkeyvalue(key, value, myname); // get the file name based on the input file name string finalpath = getinputfilebasedoutputfilename(myjob, keybasedpath); // get the actual key k actualkey = generateactualkey(key, value); v actualvalue = generateactualvalue(key, value); recordwriter rw = this.recordwriters.get(finalpath); if (rw == null) { // if we don't have the record writer yet for the final path, create // one // and add it to the cache rw = getbaserecordwriter(myfs, myjob, finalpath, myprogressable); this.recordwriters.put(finalpath, rw); } rw.write(actualkey, actualvalue); }; the getinputfilebasedoutputfilename method is called with the output of generatefilenameforkeyvalue , which contains our already-customized output file. our new keybasedmultipletextoutputformat can now be updated to override getinputfilebasedoutputfilename and append the original input filename to the output filename: static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(object key, object value, string name) { return key.tostring() + "/" + name; } @override protected string getinputfilebasedoutputfilename(jobconf job, string name) { string infilename = new path(job.get("map.input.file")).getname(); return name + "-" + infilename; } if you run with your modified outputformat class you’ll see the following files in hdfs, confirming that the input filenames are now concatenated to the end of each output file. 
    $ hadoop fs -lsr /output
    /output/cupertino/part-00000-file1.txt
    /output/cupertino/part-00001-file2.txt
    /output/sunnyvale/part-00000-file1.txt

The implementation of getInputFileBasedOutputFileName in MultipleOutputFormat doesn’t do anything interesting by default, but if you set the value of the mapred.outputformat.numOfTrailingLegs configurable to an integer greater than 0, then getInputFileBasedOutputFileName will use part of the input path as the output path. Let’s see what happens when we set the value to 1:

    jobConf.setInt("mapred.outputformat.numOfTrailingLegs", 1);

The output files in HDFS now exactly mirror the input files used for the job:

    $ hadoop fs -lsr /output
    /output/file1.txt
    /output/file2.txt

If we set mapred.outputformat.numOfTrailingLegs to 2, and our input files exist in the /input directory, then our output directory looks like this:

    $ hadoop fs -lsr /output
    /output/input/file1.txt
    /output/input/file2.txt

Basically, as you keep incrementing mapred.outputformat.numOfTrailingLegs, MultipleOutputFormat will continue to go up the parent directories of the input file and use them in the output path.

Modifying the output key and value

It’s very possible that the actual key and value you want to emit are different from those that were used to determine the output file. In our example, we took the output key and wrote to a directory using the key name. If you do that, keeping the key in the output file may be redundant. How would we modify the output record so that the key isn’t written? MultipleOutputFormat has your back with the generateActualKey method.
    class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            return key.toString() + "/" + name;
        }

        @Override
        protected Text generateActualKey(Text key, Text value) {
            // returning null means no key is written to the output file
            return null;
        }
    }

The returned value from this method replaces the key that’s supplied to the underlying RecordWriter, so if you return null as in the above example, no key will be written to the file.

    $ hadoop fs -cat /output/cupertino/*
    apple
    pear
    $ hadoop fs -cat /output/sunnyvale/*
    banana

You can achieve the same result for the output value by overriding the generateActualValue method.

Changing the RecordWriter

In our final step we’ll look at how you can leverage multiple RecordWriter classes for different output files. This is accomplished by overriding the getRecordWriter method. In the example below we’re leveraging the same TextOutputFormat for all the files, but it gives you a sense of what can be accomplished.

    static class KeyBasedMultipleTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            return key.toString() + "/" + name;
        }

        @Override
        public RecordWriter<Text, Text> getRecordWriter(FileSystem fs, JobConf job, String name, Progressable prog) throws IOException {
            if (name.startsWith("apple")) {
                return new TextOutputFormat<Text, Text>().getRecordWriter(fs, job, name, prog);
            } else if (name.startsWith("banana")) {
                return new TextOutputFormat<Text, Text>().getRecordWriter(fs, job, name, prog);
            }
            return super.getRecordWriter(fs, job, name, prog);
        }
    }

Conclusion

When using MultipleOutputFormat, give some thought to the number of distinct files that each reducer will create. It would be prudent to plan your bucketing so that you have a relatively small number of files. In this post we extended MultipleTextOutputFormat, which is a simple extension of MultipleOutputFormat that supports text outputs.
MultipleSequenceFileOutputFormat also exists to support SequenceFiles in a similar fashion.

So what are the shortcomings of the MultipleOutputFormat class?

If you have a job that uses both map and reduce phases, then MultipleOutputFormat can’t be used on the map side to write outputs. Of course, MultipleOutputFormat works fine in map-only jobs.

All RecordWriter classes must support exactly the same output record types. For example, you wouldn’t be able to have one RecordWriter that emitted one pair of key and value types for one output file, and another RecordWriter that emitted a different pair of types for another.

MultipleOutputFormat exists in the mapred package, so it won’t work with a job that requires use of the mapreduce package.

All is not lost if you bump into any of these issues, as you’ll discover in the next blog post.
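As a recap of the mapred.outputformat.numOfTrailingLegs behaviour described earlier, here is a minimal plain-Java sketch with no Hadoop dependency. It is a simplified model of the idea (keep the last N components of the input path), not Hadoop's actual implementation; the class name TrailingLegs, the method outputName, and the sample path /input/file1.txt are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of the numOfTrailingLegs idea: keep the last `legs`
// components of the input path. This is NOT Hadoop's own code.
final class TrailingLegs {

    /** Returns the last `legs` components of `inputPath`, or `fallback` when legs <= 0. */
    static String outputName(String inputPath, int legs, String fallback) {
        if (legs <= 0) {
            return fallback; // default behaviour: the key-based name is left untouched
        }
        Deque<String> kept = new ArrayDeque<>();
        String[] parts = inputPath.split("/");
        // Walk from the file name upwards through the parent directories.
        for (int i = parts.length - 1; i >= 0 && kept.size() < legs; i--) {
            if (!parts[i].isEmpty()) {
                kept.addFirst(parts[i]);
            }
        }
        return String.join("/", kept);
    }

    public static void main(String[] args) {
        System.out.println(outputName("/input/file1.txt", 0, "cupertino/part-00000"));
        System.out.println(outputName("/input/file1.txt", 1, "cupertino/part-00000"));
        System.out.println(outputName("/input/file1.txt", 2, "cupertino/part-00000"));
    }
}
```

With 0 legs the key-based name is returned unchanged; with 1 leg only the input file name survives, and with 2 legs its parent directory is included as well, which matches the directory listings shown in the article.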

May 20, 2013

by Alex Holmes

· 6,333 Views

Monitoring Background Jobs in Ruby’s Resque

How to get visibility into an important component of any complex system: the messaging queue

Here at AppNeta, we get to see a lot about how people build their web applications. From simple PHP scripts to heavily service-oriented Java clouds to monolithic Django apps, everybody’s product is architected a little differently. We’re still out to trace everything, and today I want to talk about how to get visibility into an important component of any complex system: the messaging queue. Specifically, let’s look at how to trace a job from Rails using Resque.

Messaging Queues

If you haven’t used a messaging queue in your app, the idea is simple. Instead of forcing all the work to happen during the request, while the user is waiting, you can delay some of the more time-consuming tasks. You can do anything in these tasks, ranging from a simple insert to kicking off a series of user analytics that touch all parts of your infrastructure. The advantage is that you can return a speedy response to the user, or, if they are actually waiting on the task results, give them a better loading interface than a white screen and browser loading bar.

A Quick Resque Tutorial

In Ruby, Resque is a task runner, which by default stores the task descriptions in Redis (though other options are available). Resque jobs are just Ruby classes with a single mandatory method, perform. Resque will call perform with the arguments given in the task description. Let’s look at a minimal task that takes a single argument and prints it. (Useless, I know.) The @queue variable defines a name that a worker can bind to, in case you want to spread different types of jobs across different machines. To create a task that this worker could run, we just call it from our request: And that’s our job! Maybe not the most interesting job, and probably not prone to performance issues, but we don’t know that yet. So let’s measure it!
Tracing a Resque Task

Now that we’ve added this to our system, we should have monitoring around it. The easiest way to do this would be to just measure the time each task takes, and log that information: Unfortunately, the data presentation here leaves a bit to be desired, so I’m going to use TraceView to log this information instead. This also has the benefit of logging any SQL queries, cache accesses, or service calls that we might do in a more complex task, as well as reporting errors. To start a trace fresh, we can wrap this call in the start_trace block: That’s a start! We’ve now got some visibility into our Resque jobs, and we can rest easy knowing that this is running smoothly in production.

Tracing a Resque Job (with multiple tasks!)

For cron-style jobs, the approach of tracing each task individually works fine. For reference, let’s look at the events we’re generating with that code: Pretty straightforward. Now let’s consider a more complicated set of tasks: a document-processing pipeline. That code might look like this: In this case, our first task takes a document, and the second one archives it. If we have multiple tasks, each one gets logged separately, and we can figure out the same statistics for each (average, std. dev., percentiles, and the like).

But what if you have a job that spans multiple tasks? We can further aggregate the stats, but we might be starting to miss things, like large inputs that cause the entire pipeline to slow down. What we’d really like is to correlate the related tasks, so instead of timing each task, we’re timing the entire job. Under the hood, TraceView generates a token for each request. If we pass this ID (generally stored in xtrace, after the X-Trace header it’s passed around in) to each task, we can correlate those timings before storing them, and retrieve them all together. To do this, we can modify each task to take this token, and trace using that ID.
ProcessDoc then becomes: Now we need to start the trace somewhere, but we’re not doing it in the job. We could start it in the first task, or we could link this one step further up the chain and tie it back to the web request that started it in the first place. In a default Rails stack, that request generates the following events: To add in the task queue call to the logged request, we can call the following function: We have to force a fork in the execution path to indicate that we’re running an asynchronous task, possibly in parallel, with the rest of the web request, which is done with the call to fromString. Aside from that, this is the same underlying call as is done by start_trace above: log that we’re entering a named block of code, and start timing it.

When we put it all together, we get a secondary execution path attached to the web request, and the logged events look like this: Now we’ve got everything: the original request, all tasks, individual timing information, and a global view of how the process performed. Note that we now have an additional timing measurement here: the delay before starting the task at all. In this case, we waited a full 500ms between queuing the job and actually executing it! Once we were in the pipeline, the tasks happened much faster (only 25ms between processing and archiving).

Caveats

Lest you think that everything was easy, there are a couple of things to keep in mind when you use this in your own application.

Because we’re starting the timing in the web request and ending it in a task queue, we’re relying on those two processes to have an identical clock. If they’re on the same machine, it won’t be a problem, but on different machines, any clock skew will affect the timing.

I’ve quietly assumed everything in this system is reliable, which is almost certainly wrong. Whatever your error handling is, make sure you always log the exit event for ‘job’, or you may never know that you have errors!
As long as I haven’t totally dissuaded you from trying this out, all the code is available in one place in this gist, and you can try it out in your application today with our free version of TraceView!

May 17, 2013

by TR Jordan

· 8,022 Views

Synchronising Multithreaded Integration Tests revisited

I recently stumbled upon an article, Synchronising Multithreaded Integration Tests, on Captain Debug's Blog. That post emphasizes the problem of designing integration tests involving a class under test that runs business logic asynchronously. This contrived example was given (I stripped some comments):

    public class ThreadWrapper {

        public void doWork() {
            Thread thread = new Thread() {
                @Override
                public void run() {
                    System.out.println("Start of the thread");
                    addDataToDB();
                    System.out.println("End of the thread method");
                }

                private void addDataToDB() {
                    // Dummy Code...
                    try {
                        Thread.sleep(4000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            };
            thread.start();
            System.out.println("Off and running...");
        }
    }

This is only an example of a common pattern where business logic is delegated to some asynchronous job pool we have no control over. Roger Hughes (the author) enumerates a few techniques of testing such code, including:

- an arbitrary ("long enough") sleep() in the test method to make sure the background logic finishes
- refactoring doWork() so that it accepts a CountDownLatch and agrees to notify it when the job is done
- making the method above package private and @VisibleForTesting only
- "The" solution: refactoring doWork() so that it accepts an arbitrary Runnable. In the test we can wrap this Runnable (decorator pattern) and wait for the inner Runnable to complete

The last solution is not bad but it changes the responsibilities of ThreadWrapper significantly. Now it's up to the caller to decide what kind of job should be executed asynchronously, while previously ThreadWrapper was encapsulating business logic completely. I am not saying it's a bad design, but it's drastically different from the original method.

Awaitility

Can we write a test without such a massive refactoring? The first solution involves a handy library called Awaitility. This library is not a silver bullet, it simply evaluates a given condition periodically and makes sure it's fulfilled within a given time.
It's the kind of code you probably wrote once or twice, wrapped in a library with a well designed API. So here is our initial approach:

    import static com.jayway.awaitility.Awaitility.await;
    import static java.util.concurrent.TimeUnit.SECONDS;

    //...

    await().atMost(10, SECONDS).until(recordInserted());

    //...

    private Callable<Boolean> recordInserted() {
        return new Callable<Boolean>() {
            @Override
            public Boolean call() throws Exception {
                return dataExists();
            }
        };
    }

I think there is nothing to explain here. dataExists() is simply a boolean method that initially returns false but will eventually return true once the background task (addDataToDB()) is done. In other words we assume that the background task introduces some side effect and dataExists() can detect that side effect. BTW I happened to have JDK 8 with Lambda support installed and IntelliJ IDEA gives me this nice tooltip: Suddenly I get this Java 8-compatible alternative suggested:

    private Callable<Boolean> recordInserted() {
        return () -> dataExists();
    }

But there's more: Which transforms my code to:

    private Callable<Boolean> recordInserted() {
        return this::dataExists;
    }

The this:: prefix means that dataExists is a method of the current object. Just as well we can say someDao::dataExists. Simply put, this syntax turns a method into a function object we can pass around (this process is called eta expansion in Scala). By now the recordInserted() method is no longer that needed, so I can inline it and remove it completely:

    await().atMost(10, SECONDS).until(this::dataExists);

I am not sure what I love more: the new lambda syntax or how IntelliJ IDEA takes pre-Java 8 code and retrofits it for me automatically (well, it's still a bit experimental, just reported IDEA-106670). I can run this intention in IntelliJ project-wide, Lambda-enabling my whole code base in seconds. Sweet! But back to the original problem. Awaitility helps a lot by providing a decent API and some handy features. I use it extensively in combination with FluentLenium.
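To make concrete what "evaluates a given condition periodically" means, here is a minimal hand-rolled polling loop using only the JDK. It is a sketch of the general technique, not Awaitility's implementation; the class name Poller, the method awaitTrue, and the AtomicBoolean standing in for dataExists() are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hand-rolled polling helper; a sketch of the idea, not Awaitility's code.
final class Poller {

    /** Polls `condition` every `intervalMillis` until it is true or `timeout` elapses. */
    static void awaitTrue(Callable<Boolean> condition, long timeout, TimeUnit unit,
                          long intervalMillis) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!condition.call()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("condition not met within " + timeout + " " + unit);
            }
            Thread.sleep(intervalMillis); // this sleep is the polling latency
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean dataExists = new AtomicBoolean(false);
        // Stand-in for the background job: flips the flag after 200 ms.
        Thread background = new Thread(() -> {
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            dataExists.set(true);
        });
        background.start();
        awaitTrue(dataExists::get, 10, TimeUnit.SECONDS, 50);
        System.out.println("record inserted: " + dataExists.get());
    }
}
```

The polling interval is the trade-off here: a shorter interval notices the change sooner but checks the condition more often, and the test can never react faster than one interval after the side effect appears.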
But periodically polling for state changes feels a bit like a workaround and still introduces minimal latency. Notice, however, that running and synchronizing on asynchronous tasks is quite common and the JDK already provides the necessary facilities: the Future abstraction!

java.util.concurrent.Future

To limit the scope of refactoring I will leave the original new Thread() approach for now and use SettableFuture from Guava. It is a Future implementation that allows triggering completion or failure at any time, from any thread (see DeferredResult - asynchronous processing in Spring MVC for more advanced usage). As you can see the changes are quite small:

    public class ThreadWrapper {

        public ListenableFuture<Void> doWork() {
            final SettableFuture<Void> future = SettableFuture.create();
            Thread thread = new Thread() {
                @Override
                public void run() {
                    addDataToDB();
                    //...

                    //last instruction
                    future.set(null);
                }

                private void addDataToDB() {
                    // Dummy Code...
                    // ...
                }
            };
            thread.start();
            return future;
        }
    }

doWork() now returns a ListenableFuture whose lifecycle is controlled inside the asynchronous task. We use Void but in reality you might want to return some asynchronous result instead. The future.set(null) invocation at the end is crucial. It signals that the future is fulfilled, and all threads waiting for that future will be notified. Once again, in practice you would use e.g. Future<Integer>, and then instead of null we would say future.set(someInteger). Here null is just a placeholder for the Void type. How does this help us? Test code can now rely on future completion:

    final ListenableFuture<Void> future = wrapper.doWork();
    future.get(10, SECONDS);

future.get() blocks until the future is done (with timeout), i.e. until we call future.set(...). BTW I use ListenableFuture from Guava but Java 8 introduces the equivalent and standard CompletableFuture - I will write about it soon. So, we are getting somewhere. Future is a useful abstraction for waiting for and signalling completion of background jobs.
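For readers without Guava on the classpath, the same signalling idea can be sketched with the JDK's CompletableFuture, which the text above mentions as the Java 8 equivalent. This is an illustrative variant, not the author's code; the class name CompletableThreadWrapper and the 100 ms dummy delay are assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// JDK-only variant of the SettableFuture idea; names are illustrative.
final class CompletableThreadWrapper {

    /** Starts the background work and returns a future completed when it ends. */
    CompletableFuture<Void> doWork() {
        CompletableFuture<Void> future = new CompletableFuture<>();
        Thread thread = new Thread(() -> {
            addDataToDB();
            future.complete(null); // last instruction: wakes up every waiting thread
        });
        thread.start();
        return future;
    }

    private void addDataToDB() {
        // Dummy work standing in for the real database call.
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> future = new CompletableThreadWrapper().doWork();
        future.get(10, TimeUnit.SECONDS); // blocks until complete(null) is called
        System.out.println("done: " + future.isDone());
    }
}
```

Here complete(null) plays the role of future.set(null): the test thread returns from get() as soon as the background thread finishes, with no polling interval involved.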
But there is also one immense advantage of Future which we are not taking, ekhm, advantage of: exception handling and propagation. Future.get() will block until the future is complete and return the asynchronous result or throw an exception initially thrown from our job. This is really useful for asynchronous tests. Currently if Thread.run() throws an exception it may or may not be logged or visible to us, and the future will never be completed. With Awaitility it's slightly better: it will time out without any meaningful reason, which has to be tracked down manually in console/logs. But with a minor modification our test is much more verbose:

    public void run() {
        try {
            addDataToDB();
            //...
            future.set(null);
        } catch (Exception e) {
            future.setException(e);
        }
    }

If some exception occurs in the asynchronous job, it will pop up and be shown as the JUnit/TestNG failure reason.

(Listening)ExecutorService

That's it. If addDataToDB() throws an exception it will not be lost. Instead our future.get() in the test will re-throw that exception for us. Our test won't simply time out, leaving us with no clue what went wrong. Great, but do we really have to create this special SettableFuture instance? Can't we just use existing libraries that already give us a Future with the correct underlying implementation? Of course! But this requires further refactoring:

    import com.google.common.util.concurrent.ListenableFuture;
    import com.google.common.util.concurrent.ListeningExecutorService;
    import com.google.common.util.concurrent.MoreExecutors;

    import java.util.concurrent.Executors;

    public class ThreadWrapper {

        private final ListeningExecutorService executorService =
            MoreExecutors.listeningDecorator(
                Executors.newSingleThreadExecutor()
            );

        public ListenableFuture<?> doWork() {
            Runnable job = new Runnable() {
                @Override
                public void run() {
                    //...
                }
            };
            return executorService.submit(job);
        }
    }

This is what you've all been waiting for. Don't start a new Thread all the time, use a thread pool!
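The exception propagation described above also works with a plain JDK ExecutorService: an exception thrown inside the submitted job resurfaces from Future.get() wrapped in an ExecutionException. The sketch below assumes a hypothetical failing flag to simulate a broken addDataToDB(); the class name PooledThreadWrapper is likewise an assumption.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// JDK-only sketch of exception propagation through a Future; names are illustrative.
final class PooledThreadWrapper {
    private final ExecutorService executorService = Executors.newSingleThreadExecutor();
    private final boolean failing; // hypothetical switch that simulates a broken job

    PooledThreadWrapper(boolean failing) {
        this.failing = failing;
    }

    /** Submits the job to the pool; the returned Future carries any exception it throws. */
    Future<?> doWork() {
        Runnable job = this::addDataToDB;
        return executorService.submit(job);
    }

    private void addDataToDB() {
        if (failing) {
            throw new IllegalStateException("insert failed");
        }
    }

    void shutdown() {
        executorService.shutdown();
    }

    public static void main(String[] args) throws Exception {
        PooledThreadWrapper wrapper = new PooledThreadWrapper(true);
        try {
            wrapper.doWork().get(10, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            // The original exception from the background thread is the cause.
            System.out.println("job failed with: " + e.getCause().getMessage());
        } finally {
            wrapper.shutdown();
        }
    }
}
```

In a JUnit or TestNG test the ExecutionException would simply be left uncaught, so the background failure becomes the reported failure reason instead of a silent timeout.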
I actually went one step further by using ListeningExecutorService, an extension to ExecutorService that returns ListenableFuture instances (see why you want that). But the solution doesn't require this, I just spread good practices. As you can see, the Future instance is now created and managed for us. The test is exactly the same but the production code is cleaner and more robust.

MoreExecutors.sameThreadExecutor()

The final trick I want to show you involves dependency injection. First let's externalize the creation of a thread pool from the ThreadWrapper class:

    private final ListeningExecutorService executorService;

    public ThreadWrapper() {
        this(Executors.newSingleThreadExecutor());
    }

    public ThreadWrapper(ExecutorService executorService) {
        this.executorService = MoreExecutors.listeningDecorator(executorService);
    }

We can now optionally supply a custom ExecutorService. This is good for various other reasons, but for us it opens a brand new testing opportunity: MoreExecutors.sameThreadExecutor(). This time we modify our test slightly:

    final ThreadWrapper wrapper = new ThreadWrapper(MoreExecutors.sameThreadExecutor());
    wrapper.doWork().get();

See how we pass a custom ExecutorService? It's a very special implementation that doesn't really maintain a thread pool of any kind. Every time you submit() some task to that "pool" it will be executed in the same thread in a blocking manner. This means that we no longer have an asynchronous test, even though the production code wasn't changed that much! wrapper.doWork() will block until the "background" job finishes. The extra call to get() is still needed to make sure exceptions are propagated, but is guaranteed to never block (because the job is already done). Using the same thread to execute an asynchronous task instead of a thread pool might have unexpected results if you somehow depend on thread-based properties, e.g. transactions, security, ThreadLocal.
However if you use a standard ThreadPoolExecutor with CallerRunsPolicy, the JDK already behaves this way if the thread pool is overflowed. So it's not that unusual.

Summary

Testing asynchronous code is hard, but you have options. Several options. But one conclusion that strikes me is the side effect of our efforts. We refactored the original code in order to make it testable. But the final production code is not only testable, but also much better structured and more robust. Surprisingly it's even source-code compatible with the previous version, as we barely changed the return type from void to Future. It seems to be a rule: testable code is often better designed and implemented. A unit test is the first client code using our library. It naturally forces us to think more about consumers, not the implementation.
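The same-thread trick from the MoreExecutors.sameThreadExecutor() section can also be sketched with the JDK alone. The class below is a minimal, assumption-laden illustration (the name DirectExecutorService is made up here, and it skips the bookkeeping a production-grade executor needs): it extends AbstractExecutorService so that submit() runs each task in the calling thread.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Minimal caller-thread executor for tests; a sketch, not Guava's implementation.
final class DirectExecutorService extends AbstractExecutorService {
    private volatile boolean shutdown;

    @Override
    public void execute(Runnable command) {
        command.run(); // no pool: the submitting thread does the work
    }

    @Override
    public void shutdown() {
        shutdown = true;
    }

    @Override
    public List<Runnable> shutdownNow() {
        shutdown = true;
        return Collections.emptyList(); // nothing is ever queued
    }

    @Override
    public boolean isShutdown() {
        return shutdown;
    }

    @Override
    public boolean isTerminated() {
        return shutdown;
    }

    @Override
    public boolean awaitTermination(long timeout, TimeUnit unit) {
        return shutdown;
    }

    public static void main(String[] args) throws Exception {
        DirectExecutorService executor = new DirectExecutorService();
        String caller = Thread.currentThread().getName();
        Callable<String> task = () -> Thread.currentThread().getName();
        Future<String> worker = executor.submit(task);
        System.out.println("same thread: " + caller.equals(worker.get()));
    }
}
```

Because submit() has already run the task by the time it returns, the later get() never blocks; it only hands back the result or rethrows the failure.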

May 7, 2013

by Tomasz Nurkiewicz

· 9,000 Views · 1 Like

Coalition or Council: Which One Are You?

I have been thinking about institutions that strive for change. Sometimes we call them communities or organizations, sometimes we call them alliances or parties. But whatever their nature, these institutions are usually led and managed by a small group of people. I see two kinds of leading groups: coalitions and councils.

coalition: A temporary alliance of distinct parties, persons, or states for joint action
council: A group elected or appointed as an advisory or legislative body

Coalitions

A coalition is a self-selecting team. The persons seek each other out because they want to be active agents for change, and by working together they can be more successful in achieving a common goal. In his change management books John Kotter referred to them as guiding coalitions. They are not elected. They are not appointed. They select each other because they want to. And they can even work undercover, because their goal is to influence, not to govern. The allied powers in World War II were a coalition. The Google founders were a coalition. The originators of the Stoos Network were a coalition.

Councils

A council is a group of representatives. These people also want to be active agents for change. But, their primary concern is to have buy-in from the larger group of people they are representing within the institute (community, organization, or party). The concept of democracy has led to many different versions of these councils. Sometimes we call them a government. Sometimes a committee. And everything has to be out in the open, because if it’s not, we call them cronies. Their goal is primarily to govern or advise the institute. The United Nations has a council. My former students society had a council. And many workplaces have management teams acting as councils.

And you?

If you have a group of people who all desire change, do you lead with a coalition or with a council? This is the big problem with some alliances and consortiums for change.
They have directors who try to be both. It is a recipe for disaster. Maybe the best institutions have both: a coalition and a council. (image from Veni Markovski)

April 21, 2013

by Jurgen Appelo

· 7,118 Views

Job Chaining in Quartz and Obsidian Scheduler

In this post I’m going to cover how to do job chaining in Quartz versus Obsidian Scheduler. Both are Java job schedulers, but they have different approaches, so I thought I’d highlight them here and give some guidance to users of both options.

It’s very common when using a job scheduler to need to chain one job to another. Chaining in this case refers to executing a specific job after a certain job completes (or maybe even fails). Often we want to do this conditionally, or pass on data to the target job so it can receive it as input from the original job.

We’ll start by demonstrating how to do this in Quartz, which will take a fair bit of work. Obsidian will come after since it’s so simple.

Chaining in Quartz

Quartz is the most popular job scheduler out there, but unfortunately it doesn’t provide any way to give you chaining without you writing some code. Quartz is a low-level library at heart, and it doesn’t try to solve these types of problems for you, which in my mind is unfortunate since it puts the onus on developers. But despite this, many teams still end up using Quartz, so hopefully this is useful to some of you.

I’m going to outline probably the most basic way to perform chaining. It will allow a job to chain to another, passing on its JobDataMap (for state). This is simpler than using listeners, which would require extra configuration, but if you want to take a look, check out this listener for a starting point.

Sample code

This will rely on an abstract class that will provide basic flow and chaining functionality to any subclasses. It acts as a very simple template class.
First, let’s create the abstract class that gives us chaining behaviour:

    import static org.quartz.JobBuilder.newJob;
    import static org.quartz.TriggerBuilder.newTrigger;

    import org.quartz.*;
    import org.quartz.impl.*;

    public abstract class ChainableJob implements Job {
        private static final String CHAIN_JOB_CLASS = "chainedJobClass";
        private static final String CHAIN_JOB_NAME = "chainedJobName";
        private static final String CHAIN_JOB_GROUP = "chainedJobGroup";

        @Override
        public void execute(JobExecutionContext context) throws JobExecutionException {
            // execute actual job code
            doExecute(context);

            // if chainJob() was called, chain the target job, passing on the JobDataMap
            if (context.getJobDetail().getJobDataMap().get(CHAIN_JOB_CLASS) != null) {
                try {
                    chain(context);
                } catch (SchedulerException e) {
                    e.printStackTrace();
                }
            }
        }

        // actually schedule the chained job to run now
        private void chain(JobExecutionContext context) throws SchedulerException {
            JobDataMap map = context.getJobDetail().getJobDataMap();
            @SuppressWarnings("unchecked")
            Class<? extends Job> jobClass = (Class<? extends Job>) map.remove(CHAIN_JOB_CLASS);
            String jobName = (String) map.remove(CHAIN_JOB_NAME);
            String jobGroup = (String) map.remove(CHAIN_JOB_GROUP);

            JobDetail jobDetail = newJob(jobClass)
                    .withIdentity(jobName, jobGroup)
                    .usingJobData(map)
                    .build();

            Trigger trigger = newTrigger()
                    .withIdentity(jobName + "Trigger", jobGroup + "Trigger")
                    .startNow()
                    .build();
            System.out.println("Chaining " + jobName);
            StdSchedulerFactory.getDefaultScheduler().scheduleJob(jobDetail, trigger);
        }

        protected abstract void doExecute(JobExecutionContext context) throws JobExecutionException;

        // trigger job chain (invocation waits for job completion)
        protected void chainJob(JobExecutionContext context, Class<? extends Job> jobClass, String jobName, String jobGroup) {
            JobDataMap map = context.getJobDetail().getJobDataMap();
            map.put(CHAIN_JOB_CLASS, jobClass);
            map.put(CHAIN_JOB_NAME, jobName);
            map.put(CHAIN_JOB_GROUP, jobGroup);
        }
    }

There’s a fair bit of code here, but
it’s nothing too complicated. We create the basic flow for job chaining by creating an abstract class which calls a doExecute() method in the child class, then chains the job if it was requested by calling chainJob().

So how do we use it? Check out the job below. It actually chains to itself to demonstrate that you can chain any job and that it can be conditional. In this case, we will chain the job to another instance of the same class if it hasn’t already been chained, and we get a true value from new Random().nextBoolean().

    import java.util.*;

    import org.quartz.*;

    public class TestJob extends ChainableJob {

        @Override
        protected void doExecute(JobExecutionContext context) throws JobExecutionException {
            JobDataMap map = context.getJobDetail().getJobDataMap();
            System.out.println("Executing " + context.getJobDetail().getKey().getName()
                    + " with " + new LinkedHashMap<String, Object>(map));

            boolean alreadyChained = map.get("jobValue") != null;
            if (!alreadyChained) {
                map.put("jobTime", new Date().toString());
                map.put("jobValue", new Random().nextLong());
            }

            if (!alreadyChained && new Random().nextBoolean()) {
                chainJob(context, TestJob.class, "secondJob", "secondJobGroup");
            }
        }
    }

The call to chainJob() at the end will result in the automatic job chaining behaviour in the parent class. Note that this isn’t called immediately, but only executes after the job completes its doExecute() method.
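The flow described above (run doExecute() first, then chain if chainJob() was called) is a template method. The sketch below models it without Quartz so the ordering is easy to follow: a plain Map stands in for the JobDataMap and a Queue of Runnables stands in for the scheduler. All names here (ChainableTask, chainTask, ChainDemo, the "chainedTask" key) are hypothetical and are not part of Quartz.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Quartz-free model of ChainableJob: run doExecute(), then enqueue the requested follow-up. */
abstract class ChainableTask {
    static final String CHAIN_TARGET = "chainedTask"; // hypothetical map key

    /** Template method: subclass logic first, chaining afterwards. */
    final void execute(Map<String, Object> data, Queue<Runnable> scheduler) {
        doExecute(data);
        Object target = data.remove(CHAIN_TARGET);
        if (target instanceof ChainableTask) {
            ChainableTask next = (ChainableTask) target;
            // Pass the same data map on to the chained task.
            scheduler.add(() -> next.execute(data, scheduler));
        }
    }

    protected abstract void doExecute(Map<String, Object> data);

    /** Only records the request; the chained task runs after doExecute() returns. */
    protected void chainTask(Map<String, Object> data, ChainableTask next) {
        data.put(CHAIN_TARGET, next);
    }
}

final class ChainDemo {
    static String run() {
        StringBuilder log = new StringBuilder();
        ChainableTask second = new ChainableTask() {
            @Override
            protected void doExecute(Map<String, Object> data) {
                log.append("second sees ").append(data.get("jobValue"));
            }
        };
        ChainableTask first = new ChainableTask() {
            @Override
            protected void doExecute(Map<String, Object> data) {
                data.put("jobValue", 42);
                chainTask(data, second);
                log.append("first done; ");
            }
        };
        Queue<Runnable> scheduler = new ArrayDeque<>();
        Map<String, Object> data = new HashMap<>();
        scheduler.add(() -> first.execute(data, scheduler));
        while (!scheduler.isEmpty()) {
            scheduler.poll().run();
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "first done; second sees 42"
    }
}
```

As in the Quartz version, the chained task only starts once the first task's doExecute() has returned, and it receives the data the first task stored.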
Here’s a simple harness that demonstrates everything together:

    import org.quartz.*;
    import org.quartz.impl.*;

    public class Test {

        public static void main(String[] args) throws Exception {
            // start up scheduler
            StdSchedulerFactory.getDefaultScheduler().start();

            JobDetail job = JobBuilder.newJob(TestJob.class)
                    .withIdentity("firstJob", "firstJobGroup").build();

            // trigger our source job, which in turn triggers another
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("firstJobTrigger", "firstJobTriggerGroup")
                    .startNow()
                    .withSchedule(
                            SimpleScheduleBuilder.simpleSchedule().withIntervalInSeconds(1)
                                    .repeatForever()).build();

            StdSchedulerFactory.getDefaultScheduler().scheduleJob(job, trigger);
            Thread.sleep(5000); // let job run a few times

            StdSchedulerFactory.getDefaultScheduler().shutdown();
        }
    }

Sample output

    Executing firstJob with {}
    Chaining secondJob
    Executing secondJob with {jobValue=5420204983304142728, jobTime=Sat Mar 02 15:19:29 PST 2013}
    Executing firstJob with {}
    Executing firstJob with {}
    Chaining secondJob
    Executing secondJob with {jobValue=-2361712834083016932, jobTime=Sat Mar 02 15:19:31 PST 2013}
    Executing firstJob with {}
    Chaining secondJob
    Executing secondJob with {jobValue=7080718769449337795, jobTime=Sat Mar 02 15:19:32 PST 2013}
    Executing firstJob with {}
    Chaining secondJob
    Executing secondJob with {jobValue=7235143258790440677, jobTime=Sat Mar 02 15:19:33 PST 2013}
    Executing firstJob with {}

Deficiencies

Well, we’re up and chaining, but there are some problems with this approach:

- It doesn’t integrate with a container like Spring to use configured jobs. More code would be required.
- It forces you to know up front which jobs you want to chain, and write code for it. Configuration is fixed, unless, once again, you write more code.
- No real-time changes (unless you write more code).
- A fair bit of code to maintain, and a high likelihood you will have to expand it for more functionality.
The theme here is that it’s doable, but it’s up to you to do the work to make it happen. Obsidian avoids these problems by making chaining configurable, instead of it being a feature of the job itself. Read on to find out how.

Chaining in Obsidian

In contrast to Quartz, chaining in Obsidian requires no code and no up-front knowledge of which jobs will chain or how you might want to chain them later. Chaining is a form of configuration, and like all job-related configuration in Obsidian, you can make live changes at any time without a build or any code at all. Job configuration can use a native REST API or the web UI that’s included with Obsidian.

The following chaining features are available for free:

- No code and no redeploy to add or remove chains.
- You can chain specific configurations of job classes.
- You can chain only on certain states, including failure.
- Chain conditionally based on source job saved state (equivalent to Quartz’s JobDataMap), including multiple conditions. Regexp/equals/greater than, etc.
- Chain only when matching a schedule.

Check out the feature and UI documentation to find out more.

Now that we know what’s possible, let’s see an example. Once you have your jobs configured, just create a new chain using the UI. REST API support will be here shortly, but as of 1.5.1 chaining isn’t included in the API. If you need to script this right now, we can provide pointers. In the UI, it looks like the following: Easy, huh?

All configuration is stored in a database, so it’s easy to replicate it in various environments or to automate it via scripting. As a bonus, Obsidian tracks and shows you all chaining state, including what job triggered a chained job. It will even tell you why a job chain didn’t fire, whether it’s because the job status didn’t match, or one of your conditions didn’t.

Conclusion

That summarizes how you can go about chaining in Quartz and Obsidian.
Quartz definitely has a minimalist approach, but that leaves developers with a lot of work to do. Meanwhile, Obsidian provides rich functionality out of the box to keep developers working on their own rich functionality, instead of the plumbing that so often seems to consume their time. If you have any suggestions or feature requests for Obsidian, drop us a note by leaving a comment or by contacting us.
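To illustrate the difference between chaining as code and chaining as configuration, here is a small plain-Java sketch in which chains are stored as data and looked up when a job finishes. It is purely hypothetical: it does not reflect Obsidian's data model or API, and the names ChainRules, Rule, Status, importJob, reportJob and alertJob are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Chains kept as data: adding or removing one needs no change to any job class. */
final class ChainRules {

    enum Status { COMPLETED, FAILED }

    /** One configured chain: when `source` ends with status `on`, run `target`. */
    static final class Rule {
        final String source;
        final Status on;
        final String target;

        Rule(String source, Status on, String target) {
            this.source = source;
            this.on = on;
            this.target = target;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void add(String source, Status on, String target) {
        rules.add(new Rule(source, on, target));
    }

    /** Returns the jobs to trigger after `source` finished with `status`. */
    List<String> targetsFor(String source, Status status) {
        List<String> targets = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.source.equals(source) && rule.on == status) {
                targets.add(rule.target);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        ChainRules rules = new ChainRules();
        rules.add("importJob", Status.COMPLETED, "reportJob");
        rules.add("importJob", Status.FAILED, "alertJob");
        System.out.println(rules.targetsFor("importJob", Status.FAILED)); // prints [alertJob]
    }
}
```

Because the rules live outside the job classes, they could be loaded from a database or changed at runtime, which is the property the Quartz sample above lacks without extra code.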

March 10, 2013

by Carey Flichel

· 16,861 Views · 1 Like