ESB Logging (Part 2)
ESB Logging (Part 2)
In this article, I am going to describe some techniques that can be adopted to implement this functionality.
Join the DZone community and get the full member experience.Join For Free
In my earlier article ESB Logging (Part 1) I have discussed ESB logging functionality at a high level. In this second part, I am going to describe some techniques that can be adopted to implement this functionality.
Different techniques and technical choices can be applied at each level of the following generic logging architecture shown below:
In this article, I will focus essentially on the component of the architecture which is closer to the ESB proper, and notably on the logic of the Logging Publisher, which needs to be called at each logging point of our ESB integrations.
The role of the Logging Transport is to decouple the Logging Publisher from the Logging Data Store and is an essential one. A tightly-coupled design that forces Logging Publisher API calls from integration code to block (until data is committed to the Logging Data Store) would dramatically slow down our integrations. Worse still, it may even leave them hanging in case the datastore is temporarily unavailable. Mission-critical integrations must NOT in any case be impacted by the availability of the logging infrastructure!
The Logging Publisher component must be directly callable from any ESB component via a simple API. It is imperative that:
- The logging call is simple. If not, developers will need too much effort to place logging calls throughout the integration code in a consistent manner. The simplicity of the logging call largely depends on the number and nature of its mandatory inputs. In most designs, the severity level and the log message should be the only mandatory inputs that the developer is explicitly required to supply.
- The logging call is efficient, imposing the minimum possible overhead on the calling code. The Logging Publisher must decide in the shortest possible time whether to hand over the data to the Logging Transport and, if so the handover itself must be as quick as possible.
The logging API constitutes a very simple contract between the logging solutions and its clients: the logging solution can freely evolve as long as the API contract is not broken.
Explicit and Implicit Inputs to Logging Calls
Part of the logged data is passed explicitly by the caller:
- Severity Level of the logging entry (e.g., DEBUG, INFO, WARNING, ERROR, FATAL)
- A Log Message. This can be a potentially long string of text, which can include dynamic sections. It is not, however, to be used to store a whole message (which is the purpose of the document attachments mentioned below)
- A Log Id to identify the context from which the logging call is made; use of this identifier may be optional or mandatory, depending on the logging design
- Optionally, or more document attachments (representations of data objects treated by the ESB component, like for example a request payload); the persistent storage of documents associated with logging entries is often called message archiving and is treated later in this article.
Considering the ubiquitous Log4J API, the four data elements above are supported: the LogId is related to the identity
Logger the object itself (which may be instantiated with a name), the Severity Level derives from the
Logger a method called, the Log Message is always required by the
Logger the method, while document attachments optional objects, as shown in this
Logger method definition:
void error(String message, Object... params)
A logging time-stamp will be of course generated internally within the logging call.
However the above is not generally enough for good logging. Additional data elements we need are :
- The name of the ESB component calling the Logging Publisher function (this is a technical name)
- The server and port associated with the execution of the ESB component
- The user-id under which the ESB component is executing
- The consumer id for which the ESB component is executing; this may not be present in case the ESB component does not implement a service (e.g., in the case the component is of the polling kind); also, depending on the authorization design, the consumer id may be the same as the user id mentioned above
- Tracking or Correlation Identifiers that unequivocally relate to the service execution and correlate all logging entries originating from this execution (more on this later)
- Optionally, a set of tags that associate a functional domain with the component (e.g., #creditcheck or #billing) and/or tags that associate the organization on behalf of which the integration is executed (e.g. #company_uk20); the combination Logging Data Store + Search/Analysis Tool + Logging UI may be set up in such a way that tag searches are possible (e.g., what billing integration errors did we have for company uk20 in the last 24 hours?)
- Optionally, an incremental sequence number that sequences logging entries generated by the same execution (i.e., that have the same correlation identifier value)
Two considerations can be immediately made.
First, to include all (or most) of the above additional data elements into the logging event stream, it is impractical to directly invoke Log4J methods: the code of each integration would have to bundle every time the information into an object or two (the payload, if logged, would probably make an object of its own) and pass it to Log4J. This would be a clear violation of the DRY principle, and would all but ensure inconsistency.
The additional data elements should be extracted implicitly and automatically by the Logging Publisher function from the execution context of the running integration and passed to the lower level logging API (typically Log4J) after packaging this information into standard objects. Therefore the Logging Publisher needs to be a shared custom component in the "ESB stack", to be reused across all integrations.
Second, to retain all the above information across logging calls from the same execution the Logging Publisher must be stateful. To achieve this, two main approaches are used:
- Deferring logging state to the client of the Logging Publisher API (that is, to the integration code), by having logging state information "carried around" throughout the service implementation logic in some kind of standard "control header" or "execution context" structure.
- Caching logging state.
The first approach is simpler, but relying on the integration code to correctly hold executions context information is problematic. This requires careful coding and good QA on the produced integration code. The second approach, which is to be considered a best practice, is described next.
Caching Logging State
When the Logging Publisher API is called, it will pass its data (received from the caller + collected automatically) to the Logging Transport to persist it.
The decision to persist (or not) a logging entry normally depends on its Severity Level (as passed by the caller) relative to the Severity Level Threshold set at the system or ESB component level.
It makes sense in most cases to immediately deliver to the Logging Transport any logging entry whose severity exceeds the applicable Severity Level Threshold, so there is the shortest possible lag time between the generation and the persistence of these events.
For better supportability, it is crucial that Severity Level Thresholds can be set into a data store that is accessed dynamically at runtime, so logging can be "turned up" and "turned down" on the fly by application support staff.
The logging thresholds (global, and by integration/integration component) must be configurable in a persistent store and exposed via an API. Since this configuration is accessed every time the Logging Publisher function is called, caching it is essential to avoid excessive overhead. Application Support can use a custom UI, a script, or a standard REST client to alter the logging thresholds in the persistent store, and then force a refresh/reload of the cache to have the new values take effect in real-time.
In most designs, logging events having severity lower than the applicable Severity Level Threshold are simply ignored but is some cases they may still be processed later in case of an error (see section Deferred Logging at the end of this article).
The use of (automatically assigned) sequence numbers associated with logging entries allows logical ordering of the entries from a certain execution instance, regardless of which entries were persisted first.
The role of the Logging Transport is to decouple the Logging Publishercfrom the downstream part of the logging architecture, which starts with the Logging Data Store. Generally, once the Logging Publisher passed a logging entry (or a set of related logging entries) to the Logging Transport, the data passed over is considered as "committed" and will be eventually persisted.
The Logging Transport must:
- expose a very simple API towards the Logging Publisher,
- internally operate asynchronously concerning the Logging Publisher.
Once the Logging Publisher logic decides that an entry is to be persisted, it simply calls the transport API to enqueue the entry and then returns very quickly to its caller.
The processing from the Logging Transport to the Logging Data Store is thus completely decoupled.
In the simplest implementations, the Logging Transport just consists of a wrapper around the access protocol of the Logging Data Store. For example, with a relational database used as Logging Data Store, the Logging Transport API would just wrap a DB call (or set of calls), executed in a separate thread. This solution shows its shortcomings in the face of peaks of logging traffic, where the DB pools size is exhausted and logging entries must be discarded (with loss of potentially important logging information).
A robust solution for the Logging Transport should use persistent queues for the temporary storage of log entries to be persisted into the Logging Data Store. This provides the necessary buffer to cope with variable loads.
An important point is that the queuing solution used by logging should be independent (both architecturally and from an infrastructural standpoint) from the asynchronous messaging solution used for business-critical messages (as part of the integration logic). A very high logging volume should not compromise the throughput and latency of business-critical messages, and for sure we want to completely avoid the risk of a flood of log entries bringing the main messaging system to its knees.
As an alternative to using a custom solution based message queue software, the Logging Transport can be more easily implemented by leveraging a prebuilt solution such as Logstash, which is part of the ELK stack.
Finally, there are interesting possibilities for advanced use of the Logging Transport, such as:
- Holding entries in queue until a "flush" method is called, and releasing them to the Logging Data Store only then (this is required by a particular implementation of the Deferred Logging technique, described at the end of this article).
- Aggregating related entries from the queue before persisting them into the Logging Data Store. This is useful above all when a NoSQL (Document-oriented) DB is used as Logging Data Store, to bundle multiple related entries into a single document.
Logging Data Store and Search/Analysis Tool
As already stated in part 1, the Logging Data Store must be logically centralized to be effective, even if it physically distributed.
The most common way of implementing a centralized data store for logging is via some kind of database. In the past, ESB tools tended to use a traditional relational DBMS also for this purpose, then the trend moved towards NoSQL, document-oriented databases such as MongoDB.
More recently, the emphasis on the need of searching logs more openly by tags or free text has increasingly driven the choice towards data stores that are built from the ground up for flexible searching, the obvious example being Elasticsearch.
What matters, anyway, is that the datastore is optimized for two things:
- insertions (there is no modification of logs beside the eventual purge of old log data)
- queries by indexed fields (e.g., correlation ID) and tags (which as we discussed typically have functional meaning)
The search functionality, although shown in the architecture diagram as a separate component, is very often an integral part of the data storage solution as it is for any database, with analytics capabilities often an option that can be added as a "bolt-on" or via custom development.
Whenever there is a functional audit requirement of logging the full message payload (payload archiving) we have to be careful: the inclusion of heavy payloads (in the order of 1 MB or more) has a performance impact in the logging data store which most of the time is not designed for this. Besides, transmitting a large volume of big messages over the network just for logging purposes is not the best use of bandwidth.
A couple of best practices on this topic:
- Log the full payload only when it is strictly required
- Use a separate data store as Message Archiving Store for the payloads; in B2B scenarios, all messages exchanged with trading partners are normally persisted in a dedicated database that is already available as part of the B2B solution (e.g. Software AG webMethods Trading Networks); in non-B2B scenarios, an alternative message store may have to be implemented
- Include as part of the log entry only a unique reference to the payload stored in the Message Archiving Store;
Logging User Interface
I am not going to say much on this topic as I am no UI/UX specialist. Supporting different levels of "drill down" and easy customization of the user interface are important factors here.
Alerting (whether triggered by errors or by trends detected by a log analysis tool) is not in the scope of this article, but I believe a key requirement is that URL's can be generated (and included into the alert message), through which users can directly access the relevant log entries in the Logging UI (e.g., to show the log entries with give correlation Id in a given environment).
Clearly, the tighter the integration between the Logging UI is and the Search/Analysis Tool (and hence the Logging Data Store), the easier the setup. This is the case of the ELK stack, in which the Kibana UI was designed upfront to work with Elasticsearch.
In the case of complex integrations and complex networks of service, the sheer number of logging entries produced, spanning many components having complex mutual relationships, can make it a challenge to understand the execution flow for somebody who is not well acquainted with the design, even with a good UI. For this, I can point you to this excellent article, where the set of logging entries related to the same service execution is called a traceview.
Specific Logging Functionality
The correlation of all logging entries generated as part of the particular execution of integration (or top-level service/API) to a single set (traceview) is a fundamental aspect of every logging solution. Without such a correlation, a lot of value of logging is no realized.
Local Correlation Identifier
This globally unique identifier is generated at the beginning of each thread involved in the execution of a service.
Only logging entries generated within the same synchronous execution context share the same value for this identifier. This local correlation identifier can be normally directly derived from the ESB runtime context (e.g., root context Id in webMethods ESB or correlation id in Mule ESB).
Global Correlation Identifier
This is is the most important correlation ID and also needs to be globally unique. Many times it gets the value of the Local Correlation Identifier of the first ESB component that is executed as part of a service invocation. The service/integration logic must then propagate it across the whole execution, regardless of whether the components involved are loosely or tightly coupled. This is achieved technically via protocol properties depending on the implementation of the service (e.g., HTTP headers, SOAP headers, JMS properties, etc.)
The ubiquitous availability of this correlation identifier throughout the execution chain is what allows all logging entries to be effectively correlated.
For request-response services, it is a best practice to return the Global Correlation Identifier as part of the service response. This allows a consumer application to create a cross-reference between the ESB Global Correlation Identifier and any internal identifier (for instance the one used as "Tracking ID", described below).
External Correlation Identifier
Technically, this identifier may be handled exactly like the Global Correlation Identifier, but there is an important difference: it is generated externally by the service/API consumer, not internally by the ESB. Its main purpose is to allow tracking via a "source identifier" that has meaning in the context of the consumer application. It can thus be called the "Consumer Tracking ID"
The ESB can enforce that each consumer supply a Tracking ID via the service/API contract, rejecting service requests that lack this ID. However, the ESB cannot easily enforce the uniqueness of this ID (even at the consumer level). For this reason, a Consumer Tracking ID cannot be used as a substitute for the Global Correlation Identifier. Rather than propagating the External Correlation Identifier throughout service execution, it is sufficient to simply associate it with the Global Correlation Identifier in the Logging Data Store.
Logging and Business Process Management (BPM)
When using a Business Process Management System (BPMS) to orchestrate ESB services, custom logging is usually less necessary as most BPMS tools offer ample process audit and monitoring capabilities. For each process instance, the sequence of states (i.e., the sequence of traversed process steps) is tracked, usually with the possibility of logging selected input and outputs for process model steps.
Virtually all BPM engines natively support the definition of a Process Correlation ID to join events to running process instances. This correlation identifier normally has functional significance and is to be considered the equivalent of the External Correlation Identifier discussed above.
Each process instance is assigned a unique Process Instance ID (PID) by the BPMS, which is to be considered the equivalent of the Local Correlation Identifier (although it is not in general associated with a single thread of execution).
For the Global Correlation Identifier, a similar logic can be followed as for non-BPM integrations, by adopting the PID of the "root" process instance and propagating this value to all sub-processes, regardless of whether they are detached or not. Detached sub-processes will have their instance ID (= Local Correlation Identifier), but can still share the propagated Global Correlation Identifier (and the Process Correlation ID = External Correlation Identifier).
In case the out-of-the-box functionality of the BPMS does not allow easy audit logging, the techniques described in this article are perfectly applicable to automated business processes, with the equivalences just mentioned.
In Production environments, it is common practice to log only at INFO level and higher. This limits the logging information available when an error occurs as DEBUG entries occurring before the error was not logged, and these would be useful to troubleshoot the incident.
When a component or service instance experiences an error, we may want the maximum amount of logging information available before the error that was made available irrespective of the Severity Level Threshold.
The technique called Deferred Logging consists in generating and temporarily holding all logging information (also at DEBUG level), until the execution of the API/service instance terminates, at which point a "flushing" action is triggered:
- If the instance terminated successfully, the pending logging entries are filtered based on the current Severity Level Threshold. Only the entries meeting or exceeding the threshold are flushed to the Logging Data Store (e.g., if the threshold is set to "WARNING", then "INFO" and "DEBUG" entries are discarded).
- If the instance terminates with an error, then all pending logging entries are flushed to the Logging Data Store, this storing a complete "history" of what happened before the error.
Deferred Logging may be implemented either within the Logging Publisher or the Logging Transport and must use some kind of memory cache (e.g., ehcache or Redis DB) to temporarily hold the information until there is an outcome for the execution. The impact of performance and memory occupation must, of course, be considered, based on the volumes, the number of log entries per execution, and the duration of the execution (= period of caching of the entries before flush).
A Final Word
Although this article has not been especially brief, we need to recognize that the topic is rather broad and its scope goes very much beyond ESB's. The discussion about distributed logging in microservices remains a very active one.
The advice (as always) is to start small and evolve incrementally, based on the supportability requirements that are specific to your organization. Once a flexible logging API is carefully designed, its implementation (the logging solution) can be made more and more sophisticated over time.
Published at DZone with permission of Francesco Ganora . See the original article here.
Opinions expressed by DZone contributors are their own.