We had finished the first set of requirements for a project and had a fully working application. It was a cloud-based application for the retail industry that stores article information: each company paid for a subscription, uploaded its data and gained access to all kinds of statistics. There was only one problem: customers with a lot of articles found it hard to upload them, and it was essential for them to download the data in a format their ERP system could handle (usually XML or CSV). We needed a framework that could provide this: easy parsing and generating of documents, some kind of scheduling or triggering to start an import or export, restartability, and monitoring of jobs. Since the project was mainly Spring and Java based, we decided to go with Spring Batch. It met all of our requirements except monitoring, but there is a project called Spring Batch Admin which runs as a web application on top of the Spring Batch meta tables and provides good monitoring capabilities.
Spring Batch is a Spring tool for executing batch jobs. A job configuration consists of steps. A step can have a reader, one or more processors and a writer, or it can have a tasklet, which is basically a single-action step with an execute method. A job can be configured in various ways: you can specify chunk processing for a step with a commit interval (how many items are read and processed before they are written out in one transaction), the restartability option, the maximum number of invocations, job parameters and so on. This is only a subset of the options; for all possibilities or specific cases, check the reference documentation.
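To make the idea of chunk processing and the commit interval more concrete, here is a minimal plain-Java sketch of the read, process, write loop. It is not Spring Batch code: ChunkSketch and runStep are hypothetical names, the reader is modelled as a plain Iterator, and transactions are left out entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Plain-Java illustration of chunk-oriented processing; not Spring Batch code. */
final class ChunkSketch {

    /**
     * Reads items one at a time, processes each, and hands them to the writer
     * in chunks of at most commitInterval items.
     * Returns the number of chunks that were written.
     */
    static <I, O> int runStep(Iterator<I> reader,
                              Function<I, O> processor,
                              Consumer<List<O>> writer,
                              int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be >= 1");
        }
        int chunks = 0;
        List<O> chunk = new ArrayList<>(commitInterval);
        while (reader.hasNext()) {
            chunk.add(processor.apply(reader.next()));
            if (chunk.size() == commitInterval) {
                // One write per full chunk; a real step would commit here.
                writer.accept(new ArrayList<>(chunk));
                chunk.clear();
                chunks++;
            }
        }
        if (!chunk.isEmpty()) {
            // Flush the last, possibly smaller, chunk.
            writer.accept(new ArrayList<>(chunk));
            chunks++;
        }
        return chunks;
    }
}
```

With five input items and a commit interval of 2, the writer in this sketch is called three times: twice with two items and once with the remaining one.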
Spring Batch provides item readers and item writers for the most common cases. It is at its most powerful when reading from a file, database, message queue or various lists and writing to mostly the same formats. We had a case where we provided an id and wanted to make an HTTP request to fetch the list to process, so we needed a custom reader which implemented ItemReader and fetched the list from an HTTP endpoint during initialization. That way we could pass the id as an input job parameter and the step would fetch the list when needed. Each time you restart the job it fetches the list again, so each run is isolated.
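A custom reader of this kind might be sketched roughly as follows. The ItemReader interface here is a local stand-in that mirrors the shape of Spring Batch's interface (read() returns null once the input is exhausted), so the example compiles without the library. LazyListItemReader and the fetcher supplier are hypothetical names, and fetching on the first read() call is our assumption about what "when needed" means; the real HTTP call is not shown.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Local stand-in mirroring Spring Batch's ItemReader: read() returns null at end of input. */
interface ItemReader<T> {
    T read() throws Exception;
}

/** Fetches the whole list on the first read() and then hands out one item per call. */
final class LazyListItemReader<T> implements ItemReader<T> {

    // Hypothetical: in our case this would wrap the HTTP request for the given id.
    private final Supplier<List<T>> fetcher;

    // Stays null until the first read(), so nothing is fetched before the step runs.
    private Iterator<T> items;

    LazyListItemReader(Supplier<List<T>> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public synchronized T read() {
        if (items == null) {
            items = fetcher.get().iterator();
        }
        // Returning null tells the caller that there is nothing left to read.
        return items.hasNext() ? items.next() : null;
    }
}
```

Because a fresh reader instance fetches the list again, every new run of the step starts from the current state of the endpoint rather than from data cached by a previous run.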
Spring Batch uses a database for the meta tables that hold job and step state. Most relational databases can be used; Spring Batch provides connectors and scripts to create and tear down the schema for these tables. You can configure Spring Batch to initialize the tables each time the application is started.
A good place to look at various examples and jobs is the Spring Batch examples repository on GitHub.
Spring Batch Admin
Spring Batch Admin is an open source project which provides a web-based interface and a REST-like API for Spring Batch. It provides a nice way to monitor your jobs, trace exceptions which occur during a job run, restart, stop or abandon jobs, and upload the whole job configuration or a job-specific configuration file. It works as a web application running on a Tomcat server, and you can customize the port it listens on. In essence it is a GUI on top of the regular Spring Batch tables in the database: it visualizes the Spring Batch meta tables and gives you a convenient way of controlling jobs.
Each job configuration must be placed in META-INF/spring/batch/jobs/ as XML. Each job should be self-contained and must have all the dependencies it needs for successful processing.
When you want to work with CSV files, a relational database or a message queue, you get a fully functional job with little or no configuration. There is a great selection of ItemReaders and ItemWriters shipped out of the box with the Spring Batch project, and using them is only a matter of configuration.
As the number of jobs grows, adding new ones gets much easier, since all jobs look alike. You can even create abstract steps and jobs and reuse them across different jobs.
Testability support is really great. There is JobLauncherTestUtils, which can start jobs in a controlled environment so that you can verify the output against the expected one.
It has nice scaling options, where a master step creates many threads and gives work to each thread so that you can parallelize execution. This is called partitioning, and you can have an aggregator to collect all of the data at the end. This is the one way of scaling that we have tried; for more options you can check the scaling documentation.
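The idea behind partitioning can be illustrated with a small plain-Java sketch: split the input, let each worker thread handle one part, and aggregate the partial results at the end. This is not Spring Batch's partitioning API; PartitionSketch, partition and runPartitioned are hypothetical names, and gridSize is used here simply to mean the desired number of partitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Plain-Java illustration of partitioned execution; not Spring Batch code. */
final class PartitionSketch {

    /** Splits items into at most gridSize contiguous partitions. */
    static <T> List<List<T>> partition(List<T> items, int gridSize) {
        if (gridSize < 1) {
            throw new IllegalArgumentException("gridSize must be >= 1");
        }
        List<List<T>> partitions = new ArrayList<>();
        int size = (items.size() + gridSize - 1) / gridSize; // ceiling division
        for (int from = 0; from < items.size(); from += size) {
            partitions.add(items.subList(from, Math.min(from + size, items.size())));
        }
        return partitions;
    }

    /** Runs the worker on each partition in its own thread and sums the partial results. */
    static <T> long runPartitioned(List<T> items, int gridSize, Function<List<T>, Long> worker) {
        List<List<T>> partitions = partition(items, gridSize);
        if (partitions.isEmpty()) {
            return 0L;
        }
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (List<T> part : partitions) {
                Callable<Long> task = () -> worker.apply(part);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // aggregation: wait for each partition, then combine
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for partitions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a partition failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch the worker must not depend on state shared with other partitions, which is also the main thing to watch for when partitioning real steps.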
We wanted to use Java configuration, but many examples on the internet are XML based, so it can sometimes be hard to find an example of the configuration you are trying to implement.
When you move away from the conventional use cases, you have to roll up your sleeves to make things work. Our input was a SOAP server, so we needed to create a custom reader which pulled the initial data from the SOAP server into a list.
Using Spring Batch with Spring Boot for multiple jobs is not well documented. We created a parent project which defined multiple child contexts, one for each job, and each job had its own launcher. Spring Batch examples with Spring Boot are mostly single-job examples, so some tweaking is needed to make things work. This CodeCentric starter project helped us a lot in our baby steps. We have explained in this StackOverflow answer how to achieve a multiple-job configuration.
Integrating Spring Batch and Spring Batch Admin with Spring Boot was more problematic than beneficial. Spring Batch Admin works best if you deploy and start jobs through it, whereas we wanted to have our own launchers and use Spring Batch Admin only for monitoring. Importing properties and using a scheduler are just some of the problems explained in StackOverflow answers. We finally decided to drop the single-application approach and split Spring Batch and Spring Batch Admin into two applications connected to the same meta tables. By doing so, we lost the ability to launch jobs through Spring Batch Admin, which we did not need, but we lowered the complexity and moved monitoring out of the main processing application.
From an initial read of the Spring Batch documentation, it looks like an easy, straightforward framework. It has readers, processors and writers, and it basically functions as an input-output framework. However, it has a steep learning curve when it comes to custom use cases. There are many possibilities, and it is easy to choose the wrong one, kill performance or mess up multithreaded execution. It is a fun framework to work with, great for file parsing and database integration use cases, and certainly a tool we at SmartCat would choose again for solving this kind of problem.