Originally Written by Robert Buck
We’ve all seen this: The car that won’t start on a cold morning, the air-conditioner that won’t work mid-August. We cope and move on. But in the age of Cloud, when it comes to personal information access, coping can be somewhat more challenging.
We’ve all heard the horror stories of Amazon Cloud failures, their business impact, etc. Just two weeks ago while preparing an application in AWS for a prospect, I was greeted by this web page, downed services, hours before an engagement:
And a week before the most recent spate of DDoS attacks against Github.com, I was similarly faced with the dilemma of not being able to access source code repositories.
In an age where information is increasingly online, data availability is more essential than ever.
Enter NuoDB, a distributed SQL database that solves the data availability problem. I will leave it to the reader to review Seth Proctor's tech-blog article on data availability to understand how NuoDB provides continuous data availability and protection. Beyond that, comprehensive resiliency is not just a database concern; it needs to be applied holistically to the entire application stack. This article illustrates some strategies for applying comprehensive resiliency and taking advantage of a cloud-enabled database such as NuoDB, a database that supports elastic scaling, continuous availability, and geo-distribution.
As the material is fresh in my mind and is of interest to some banking sector organizations I've spoken with recently, I will demonstrate techniques to achieve comprehensive client-side resiliency so that end users will not see the above sorts of web site errors. The solution presumes that NuoDB is used, that it is configured and deployed in a continuously available manner, and that the database already provides fully online backup/restore, schema evolution, rolling upgrades, etc.
For many, Mule is ESB nirvana; its Anypoint Studio provides an Eclipse-based development environment with UI-driven designers and hundreds of out-of-the-box micro-service components that produce an XML-based workflow description and source code that are easily deployable and extensible. Enterprise service composition is what Mule is great at, but it may also be one of Mule’s greatest areas of weakness. Whereas the Mule environment frequently encourages one to adopt new components, its ever-evolving API and trail of deprecated service components frequently leave developers stranded on prior releases, facing rewrites to adopt newer components. The greatest risk, however, may be coding logic into XML that clearly belongs in application space, as XML makes for a terrible programming language.
Making the Hard Look Simple
Mule makes the hard look simple. To illustrate, take this example, a financial services REST API that receives a JSON object, performs a database INSERT, and returns a result. The XML definition for this example is marvelously simple and can be found in the Mule documentation.
But to the main thrust of this article: given a distributed cloud-scale database, does the application itself also support comprehensive resiliency? What if you scale out the database: does the connection pool automatically rebalance connections over the set of available transactional endpoints? What if those transactional endpoints fail, are taken down as part of a rolling upgrade, or are removed when you scale in the database tier after hours of peak traffic: do you lose transactions, and do end users see database-unavailable messages on web pages? What if you need to evolve the schema of the database: can it be done during peak hours without loss of service availability? NuoDB uniquely provides many capabilities to help solve these concerns, but it also demands that applications be written with resiliency in mind.
So in the following sections we will look at these application-side concerns:
- being able to take advantage of additional transactional endpoints (scale-out)
- being able to dynamically adapt to fewer transactional endpoints (scale-in)
- being able to reason about failure and take appropriate actions (retry)
In the NuoDB architecture, the broker is responsible for directing clients to suitable transactional endpoints (transaction engines). Since the routing algorithm is flexible, you can route connection requests in a round-robin fashion, and preferentially route to affinity-based (local-host) transactional endpoints. Alternatively, you may route based upon geo-region. The key point here is that the connection string for the database is typically the address for a broker, or may optionally be a comma-separated list of broker addresses. The broker plays the central role in load-balancing connections amongst the available transaction endpoints. However, the client application also plays a role in that it needs to be able to automatically adapt to changing counts of transactional endpoints by periodically reconnecting, which will, as a matter of course, rebalance connections over newly available resources.
The approaches to dynamic adaptation described here apply conceptually to more than just Mule; the same considerations apply to applications deployed on competing ESB technologies, in non-ESB environments such as JBoss/Wildfly, or on platforms that are not Java-based. One approach to achieving greater efficiency, common to all of these platforms though perhaps different in implementation, is connection pooling. Whether you use Mule ESB or JBoss/Wildfly, underlying each stack are implementations of connection pools in varying degrees of maturity. It’s this central element of the platform that participates most significantly in scale-out, and it’s this component that, if poorly suited to the task, will prevent the application from dynamically adapting to additional transactional endpoints.
There are several connection pool technologies available in the FOSS community, and several platforms have written their own implementations as well. Some great analysis comparing these has already been performed and is available online. In summary, their ability to seamlessly take advantage of elastically scalable distributed databases hinges upon two properties:
- whether they can set a maximum connection age
- whether they can configure a connection validation check (a query string)
Here is an abbreviated list of connection pool technologies commonly used with JDBC and their capabilities along these two dimensions:
However, these are not the sole selection criteria; there are several other criteria that the reader ought to investigate when choosing:
- when validation checks are performed: upon return or upon retrieval
- whether validation checks are performed in isolated transactions, or whether a transaction is left in progress when a connection is handed to the caller
- whether or not it checks for and closes abandoned statements
- whether or not settings are reapplied when connections are returned to the pool
- whether or not SQL warnings are cleared when connections are returned to the pool
- whether or not rollback is called when connections are returned to the pool
- whether or not statement caching is available
HikariCP, although a relative newcomer, has become rather popular, to the extent that authors of other implementations have declared their libraries deprecated in favor of HikariCP, for several reasons detailed in the analysis on the HikariCP site. As such, the sample code here makes use of HikariCP for the Mule ESB demo.
With respect to the two configuration properties previously mentioned, as general guidance I typically recommend setting the maximum connection age to the larger of five minutes or, if you have long-running statements, twice the slowest statement time. The application will then periodically reconnect, asking the broker which endpoint to connect to, and thus rebalance connections.
The complete definition of a data source, its connection pool, and the settings that support elastic scaling looks like this:
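As a minimal sketch of what such settings might look like, the following builds HikariCP-style properties in plain Java. The property names (`jdbcUrl`, `username`, `password`, `maxLifetime`, `connectionTestQuery`) are HikariCP's documented configuration keys; the class name, broker host names, database name, and credentials are placeholders, and the URL format shown is an assumption that should be checked against the NuoDB driver documentation.

```java
import java.time.Duration;
import java.util.Properties;

/** Illustrative helper (hypothetical name): pool settings that let connections rebalance. */
final class ElasticPoolSettings {
    /** Guidance from the text: the larger of twice the slowest statement time and five minutes. */
    static Duration maxConnectionAge(Duration slowestStatement) {
        Duration doubled = slowestStatement.multipliedBy(2);
        Duration floor = Duration.ofMinutes(5);
        return doubled.compareTo(floor) > 0 ? doubled : floor;
    }

    /** Builds properties using HikariCP's property names; all values are supplied by the caller. */
    static Properties hikariProperties(String jdbcUrl, String user, String password,
                                       Duration slowestStatement) {
        Properties props = new Properties();
        props.setProperty("jdbcUrl", jdbcUrl);
        props.setProperty("username", user);
        props.setProperty("password", password);
        // Maximum connection age in milliseconds; retired connections are reopened via the broker.
        props.setProperty("maxLifetime",
                Long.toString(maxConnectionAge(slowestStatement).toMillis()));
        // The validation statement this article uses for NuoDB.
        props.setProperty("connectionTestQuery", "SELECT 1 FROM DUAL");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder URL and credentials, assumed for illustration only.
        Properties props = hikariProperties(
                "jdbc:com.nuodb://broker1.example.com,broker2.example.com/demo",
                "dba", "secret", Duration.ofSeconds(20));
        System.out.println(props.getProperty("maxLifetime")); // 300000
    }
}
```

With HikariCP on the classpath, properties like these could be passed to `new HikariConfig(props)` and from there to a `HikariDataSource`.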
Likewise, as NuoDB is an elastically scalable database, it supports scaling in the database in real time; no re-sharding is required. Given that an application already has connections in its pool, some of those connections will occasionally become invalid. So by what means are connections identified as invalid, and how are replacement connections created?
Most connection pool technologies provide some sort of hook for performing validation checks on their pooled connections. For databases, this is typically expressed as a simple side-effect-free statement; for NuoDB, this statement is SELECT 1 FROM DUAL. The connection pool itself executes the validation statement to check the validity of a pooled connection. Provided that the connection pool technology offers useful semantics around validation, scaling in simply becomes a matter of reconnecting to a surviving endpoint and retrying your transaction.
That seems straightforward enough, but what impact do validation tests have on latency, when are validation checks performed, and are these checks performed in an isolated manner?
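To make the validation check concrete, here is a rough sketch of what a pool might do internally with the statement above, using only standard JDBC interfaces. The class and method names are hypothetical; real pools implement this differently and usually add timeouts.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Illustrative validation check (hypothetical name), written against plain JDBC. */
final class ConnectionCheck {
    static final String VALIDATION_SQL = "SELECT 1 FROM DUAL";

    /** Returns true only if the side-effect-free statement runs and yields a row. */
    static boolean isUsable(Connection connection) {
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(VALIDATION_SQL)) {
            return rows.next();
        } catch (SQLException e) {
            // The endpoint may have been removed or restarted: treat the connection as invalid.
            return false;
        }
    }
}
```

A connection that fails this check would be discarded and replaced by a fresh one obtained through the broker.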
For pools that perform background validation, observable latency interacting with the pool is low; however, it would also be the case that for elastically scalable databases, there is greater likelihood the client will be handed an invalid connection. And granted, for the case where validation is performed at the moment of retrieval of a connection from the pool, the observable latency will be slightly higher, but there is also less likelihood that a connection handed to the client is invalid.
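The retrieval-time option can be illustrated with a toy pool. This is only a sketch under simplifying assumptions (no sizing limits, no background threads, hypothetical names); it is generic over the pooled type so that the validator can be any check, such as running the validation statement.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Toy pool illustrating validate-on-retrieval; not a substitute for a real connection pool. */
final class ValidatingPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;    // opens a replacement, e.g. by asking the broker
    private final Predicate<T> validator; // e.g. runs the validation statement

    ValidatingPool(Supplier<T> factory, Predicate<T> validator) {
        this.factory = factory;
        this.validator = validator;
    }

    /** Hands out an idle resource only if it passes validation; otherwise opens a new one. */
    synchronized T borrow() {
        T candidate;
        while ((candidate = idle.poll()) != null) {
            if (validator.test(candidate)) {
                return candidate;
            }
            // Invalid: drop it and try the next one (a real pool would also close it).
        }
        return factory.get();
    }

    /** Returns a resource to the pool without validating it. */
    synchronized void release(T resource) {
        idle.push(resource);
    }

    synchronized int idleCount() {
        return idle.size();
    }
}
```

The caller pays for one validation per `borrow()`, which is the slightly higher latency described above, in exchange for a much smaller chance of receiving a dead connection.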
The timing of the validation checks differs between connection pool implementations; some perform the checks when connections are returned, others when connections are retrieved, and others offer both options as a matter of configuration. The breakdown of the types of validation checks supported by the remaining connection pool candidates is as follows:
Though it’s generally a trade-off between lower latency and safety, I tend to prefer safety, which means validating the connection upon retrieval. Others may prefer otherwise, but clearly the most important detail here is how validation checks are performed: whether or not they run in their own isolated transaction when auto-commit defaults to false. For example, BoneCP will leave an open transaction in progress when the connection is handed to the user; this is why BoneCP received a red ding in the chart above.
Reasoning About Failure
The final line of discussion here is error handling. Java forms the basis of all the examples here, and while not every language provides identical capabilities for expressing and classifying error types, Java and C# share the same concepts: transient and non-transient database exceptions. Transient exceptions are those thrown in situations where a previously failed operation might succeed when retried, without any intervention by application-level functionality. Non-transient exceptions are those where a retry of the same operation would fail unless the cause of the exception is corrected. Examples of the former include timeouts, connection failures, and rollbacks, whereas examples of the latter include SQL syntax errors, authorization issues, and unsupported features.
The distinction between transient and non-transient exceptions is key to being able to reason about database failures as seen by clients. Reasoning about failure by class of exception allows users to programmatically determine whether or not a user transaction should be retried. Microsoft's patterns & practices Enterprise Library provides this for C#; for Java, one library exists for Spring and another for standard POJOs. We will be using the latter in the enclosed samples. By applying a declarative retry policy to the code, the code is made more resilient, and end users are far less likely to see transient failures.
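In Java, this classification is built into JDBC as `java.sql.SQLTransientException` and `java.sql.SQLNonTransientException`. The following is a hand-rolled sketch of the idea, not the retry library used in the samples; the class name, the attempt limit, and the fixed pause are all assumptions made for illustration.

```java
import java.sql.SQLTransientException;
import java.util.concurrent.Callable;

/** Illustrative retry helper (hypothetical name), not the library used in the article's samples. */
final class TransientRetry {
    /**
     * Runs the unit of work, retrying only when JDBC reports a transient failure.
     * Non-transient failures (syntax errors, authorization problems) propagate at once.
     */
    static <T> T call(Callable<T> work, int maxAttempts, long delayMillis) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.call();
            } catch (SQLTransientException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted: surface the last transient failure
                }
                Thread.sleep(delayMillis); // fixed pause before the next attempt
            }
        }
    }
}
```

For this to be safe, the unit of work should cover the whole transaction, including borrowing a connection from the pool, so that each attempt starts on a validated connection.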
Here are some sample Spring Bean definitions that adopt declarative retry:
Those are the Spring Bean definitions within my Mule ESB workflow. The resilient REST API code is now simply the following; the retry framework handles retries automatically according to the configured policy:
By using an Inversion of Control (IoC) container, we have separated the retry policy declaration from implementation so that through configuration, the policy may be changed, perhaps to use fixed-interval retries, or some other pluggable retry strategy that you come up with.
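To show what a pluggable strategy could look like, here is a small sketch of a delay policy with fixed-interval and capped exponential variants. The interface and factory names are hypothetical and are not taken from any particular retry library; an IoC container would simply inject whichever instance the configuration names.

```java
/** Hypothetical strategy interface: maps a 1-based attempt number to a pause in milliseconds. */
interface BackoffPolicy {
    long delayMillis(int attempt);

    /** Same pause before every retry. */
    static BackoffPolicy fixed(long millis) {
        return attempt -> millis;
    }

    /** Pause doubles with each attempt, starting at initialMillis and never exceeding capMillis. */
    static BackoffPolicy exponential(long initialMillis, long capMillis) {
        return attempt -> {
            long delay = initialMillis;
            for (int i = 1; i < attempt && delay < capMillis; i++) {
                delay *= 2;
            }
            return Math.min(delay, capMillis);
        };
    }
}
```

Because callers depend only on `delayMillis`, swapping `BackoffPolicy.fixed(200)` for `BackoffPolicy.exponential(100, 5000)` is purely a configuration change.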
In this article we discussed some strategies to deliver greater client-side resiliency so applications can realize the benefits of NuoDB’s unique architecture, namely its ability to scale elastically and to provide continuous data availability. We chose and incorporated a connection pool technology, configured checks for connection state, and lastly added policy-based transaction retry. The sample code related to this article is available online at GitHub.