ESB Logging (Part 1)
Logging is a necessary cross-cutting concern in any software system, and specifically also for an Enterprise Service Bus (ESB).
This article is the first of two: in this one (part 1), I am going to discuss what functionality an ESB logging system should have, while in the second article (part 2) I will introduce some techniques to implement this functionality.
Note on terminology: in these articles I am going to use the somewhat older terms of ESB Service Component and ESB Service Component Endpoint to generically refer to ESB components and their exposed endpoints, respectively, regardless of technology, communication protocol, and exchange pattern. So the discussion is equally applicable to ESB components that expose a REST API, a SOAP Web Service, or a WebSocket, or that are triggered via a queuing mechanism.
Functional and Technical Requirements of ESB Logging
In ESB systems, there are several reasons why we need to do logging, the main ones being:
- To keep a functional audit of integration activities (sometimes called "transaction logging"); this is essentially a traceability requirement.
- To have detailed troubleshooting information available in case of errors or malfunctions
- To collect performance information, which helps us identify bottlenecks in our service implementation
These three points can be considered very high-level functional requirements for ESB logging. As always, depending on the specifics, each of these translates into technical requirements that must be appropriately addressed. The following subsections discuss these three macro-requirements in more depth.
Audit Logging
The creation of a functional audit trail is normally mandated by business policies and/or by external regulation. In most cases, it is required to include a copy of the payload of the service (API) requests and responses. In other cases, not only the payload but also all control information (in the form of SOAP headers for SOAP services, HTTP headers and status code for REST services, JMS properties for asynchronous JMS services, etc.) must be logged.
Logging that includes payloads is sometimes called archiving, but strictly speaking, this term should be used only in those cases where the system keeps persistent copies of the payloads in normal operation, which is often the case for B2B gateways. After a set period, these messages are moved from the "online" data store to the "offline" archive store.
Looking at the audit requirement more in detail, we need to define:
- Where to log (audit points)
- What to log (logging content)
- How long to keep the logs (retention policy)
For discussing the first point (perhaps the most important of the three), consider the diagram below that shows a simple example of an ESB service implemented via a single ESB component. The component exposes a request-response interface, implements it by composing functionality from endpoints of two different backend systems but does not compose other ESB services.
Note that such a design is not a best practice: we should expose functionality from each backend via a separate lower-level service ("System API"), and then orchestrate these lower-level services via a "Process API". However, I am keeping the design very simple for the sake of the discussion.
The diagram shows six different audit logging points:
- two at the ESB service endpoint, facing the consumer (logging its request and the response delivered to it)
- four facing the backend systems, logging requests and responses exchanged between the ESB (in a service consumer role) and the native endpoints of the backend systems.
As a minimum, audit logging must be done at the two consumer-facing endpoints, but for comprehensive logging, all messages that enter and exit an ESB component must be included:
1. The inbound request from the service consumer
2. Outbound request to native backend service
3. Inbound response (or error) from native backend service
4. Outbound request to composed ESB service
5. Inbound response (or error) from composed ESB service
6. Outbound response (or error) to the service consumer
Logging points of types 2 through 5 are due to service composition, so the number of such logging points is proportional to the number of service/API operations being composed.
It is necessary to correlate the log entries generated at each audit logging point, which can be done in two ways within a component:
- by logging independently at each audit logging point (generating a separate log entry at each point), but having the log entries share a common unique identifier
- by grouping the log information generated at each audit logging point into a single log entry which is written when the service completes execution (either successfully or with an error). This can be called delayed logging.
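The two correlation styles above can be sketched as follows. This is a minimal illustration, not a real ESB API: the `AuditLogger` class and its method names are hypothetical, and the "sink" is simply standard output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Sketch of the two correlation styles: independent logging with a shared
 *  correlation id, versus delayed logging into a single grouped entry. */
public class AuditLogger {
    private final String correlationId = UUID.randomUUID().toString();
    private final List<String> buffer = new ArrayList<>();
    private final boolean delayed;

    public AuditLogger(boolean delayed) { this.delayed = delayed; }

    /** Called at each audit logging point. */
    public void log(String auditPoint, String payload) {
        String entry = correlationId + " | " + auditPoint + " | " + payload;
        if (delayed) {
            buffer.add(entry);   // delayed logging: accumulate until completion
        } else {
            write(entry);        // independent logging: one entry per point
        }
    }

    /** Called once when the service completes (successfully or with an error). */
    public void complete() {
        if (delayed) {
            write(String.join(System.lineSeparator(), buffer)); // single grouped entry
            buffer.clear();
        }
    }

    protected void write(String entry) { System.out.println(entry); }

    public String correlationId() { return correlationId; }
    public int bufferedCount() { return buffer.size(); }
}
```

With independent logging, each point's entry reaches the store immediately; with delayed logging, the store receives one consolidated record per service execution, at the cost of losing entries if the component crashes before completion.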
The choice between minimal audit logging (consumer request and response only) and full audit logging should be configurable at the level of service and service operation (or resource / sub-resource / operation level for a REST service), with the usual pattern of configuration inheritance and override between parent and child resource (the setting for service is inherited by service operations unless specifically overridden at operation level).
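The inheritance-and-override pattern can be sketched as a simple lookup, where an operation-level setting wins over the service-level one. The class and key format below are purely illustrative assumptions, not a product API.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of service/operation audit configuration with inheritance and override. */
public class AuditConfig {
    public enum Mode { MINIMAL, FULL }

    private final Map<String, Mode> settings = new HashMap<>();

    public void set(String path, Mode mode) { settings.put(path, mode); }

    /** Resolve "service/operation": operation-level override wins,
     *  else the service-level setting is inherited, else a system default. */
    public Mode resolve(String service, String operation) {
        Mode m = settings.get(service + "/" + operation); // operation-level override
        if (m == null) m = settings.get(service);         // inherited from service
        return m != null ? m : Mode.MINIMAL;              // system-wide default
    }
}
```

The same resolution logic applies unchanged to REST resources, with resource / sub-resource / operation standing in for service / operation.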
The second point relates to the content of the log entries. A minimal set of information fields for the logging function will be suggested in part 2 of this article, but here the important point to remark is that information logged at each log audit point may be heterogeneous and greatly vary in size.
Often the information is already in textual form (e.g., XML payloads, JSON payload, SOAP headers, HTTP headers), but in some cases, we have binary information (e.g., scanned documents).
Message payloads can be very large, and this must be taken into account as well. This is why ESBs normally ingest them as stream objects rather than as huge string objects, which would be prohibitive to handle in terms of memory footprint.
Thus, the format and size of the data to be logged has a significant impact on the design of ESB logging, which must be flexible in this respect without introducing excessive overhead.
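One common way to keep payload logging bounded is to capture at most a fixed number of bytes from the payload stream and mark the entry as truncated. The sketch below illustrates the idea; a real ESB would tee or duplicate the stream so the payload can still be forwarded after capture, which this simplified version does not do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Sketch: read at most maxBytes of a payload stream for the log entry,
 *  bounding the memory footprint of audit logging. */
public class PayloadCapture {
    public static String capture(InputStream payload, int maxBytes) {
        try {
            byte[] buf = new byte[maxBytes];
            int total = 0, n;
            while (total < maxBytes
                    && (n = payload.read(buf, total, maxBytes - total)) != -1) {
                total += n;
            }
            String text = new String(buf, 0, total, StandardCharsets.UTF_8);
            // Mark truncation so the audit trail is explicit about partial payloads.
            return payload.read() != -1 ? text + "...[truncated]" : text;
        } catch (IOException e) {
            return "[payload unreadable: " + e.getMessage() + "]";
        }
    }
}
```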
The third point relates to the retention period of the entries created into the audit log. Here, like in the case of the determination of the audit logging points, it should be possible to configure the retention period hierarchically by service/operation (or parent/child resource), so that information related to more critical operations or resources can be retained longer. A significant plus, on top of this functionality, would be to allow a retention policy that also takes into account the identity of the consumer.
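A consumer-aware retention lookup can reuse the same most-specific-key-wins idea. The key format (`service/operation@consumer`) below is an illustrative assumption, not a standard convention.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: hierarchical retention lookup, optionally refined per consumer. */
public class RetentionPolicy {
    private final Map<String, Integer> days = new HashMap<>();
    private final int defaultDays;

    public RetentionPolicy(int defaultDays) { this.defaultDays = defaultDays; }

    public void set(String key, int retentionDays) { days.put(key, retentionDays); }

    /** Most specific key wins:
     *  service/operation@consumer > service/operation > service > default. */
    public int resolve(String service, String operation, String consumer) {
        String op = service + "/" + operation;
        if (days.containsKey(op + "@" + consumer)) return days.get(op + "@" + consumer);
        if (days.containsKey(op)) return days.get(op);
        if (days.containsKey(service)) return days.get(service);
        return defaultDays;
    }
}
```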
Technical Logging
While the preceding subsection was about audit logging (that is, logging that must be done at every service invocation due to functional requirements), this subsection briefly discusses the additional technical logging that can be done on top of audit logging to provide useful information about service behavior. This can be essential to troubleshoot obscure malfunctions detected in the Production environment.
First of all, let's dispel the rather common misconception that conflates technical logging with error handling.
Technical logging is NOT the same as error handling. Error handling is ESB logic that must be executed whenever an error occurs that either requires manual recovery (as part of an Incident Management workflow) or must in any case be notified to a human user (even if the error was recovered from automatically), because the error response returned to the service consumer (for a request-response interaction) is not sufficient.
On the other hand, technical logging, when enabled, is done from ESB components independently of whether error conditions are happening or not. It simply traces the execution logic within components at a fine-grained level of detail, providing detailed insight into what happens between the audit logging points.
This does not mean that the implementation of logging can never share anything with that of error handling. In most ESB environments, designers have decided to make use of a logging framework such as Log4J as part of both their logging and their error-handling solutions.
The first consideration about this kind of logging is that, for performance reasons, it should be completely turned off in normal Production operation. Such logging is normally enabled in environments that support System Integration Testing, but disabled not just in Production but also in every environment that must closely mirror Production conditions, such as Pre-Production environments where performance testing happens.
Still, enabling technical logging in a very granular way (e.g., per component or component/consumer combination) can be occasionally invaluable to troubleshoot incorrect behavior in Production. This must be done, of course, very carefully, due to the performance impact in case a component implements a service that is heavily used.
It is imperative here to have the capability of hierarchical, fine-grained configuration, so the appropriate level of logging can be set very flexibly.
Normally, technical logging should not include payloads but just short dynamic log messages that can contain the values of selected data elements in the component's logic.
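A typical shape for such entries is a step name plus a few selected data elements, emitted only when the corresponding log level is enabled. The sketch below uses `java.util.logging` from the JDK; the same guarded pattern applies to Log4J or any other framework, and the logger name and step labels are illustrative.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: fine-grained technical logging of short dynamic messages
 *  with selected data elements, never full payloads. */
public class TechTrace {
    private static final Logger LOG = Logger.getLogger("esb.component.OrderFlow");

    /** Builds the short dynamic message: step name plus selected data elements. */
    public static String formatStep(String step, Object... values) {
        return step + " " + Arrays.toString(values);
    }

    public static void trace(String step, Object... values) {
        if (LOG.isLoggable(Level.FINE)) {   // guard: near-zero cost when logging is off
            LOG.fine(formatStep(step, values));
        }
    }
}
```

The `isLoggable` guard is what makes it cheap to leave these calls in place permanently and control them purely through configuration.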
An interesting option for technical logging (in between the "on" and "off" settings) is conditional logging: technical log entries are written (at the end of the execution of a component, that is, with delayed logging) only if the component experienced an error. By choosing this option, we recognize that detailed tracking of successful service executions is superfluous, which is almost always the case. On the other hand, under this option "mass" errors, typically due to the unavailability of a backend resource, will produce a very large amount of logging information that is largely redundant (since all service invocations failed for the same reason).
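Conditional logging reduces to buffering trace entries during execution and flushing them only on failure. A minimal sketch, with an in-memory list standing in for the real log sink:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of conditional (error-only) delayed technical logging. */
public class ConditionalLog {
    private final List<String> buffer = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>(); // stands in for the real sink

    public void trace(String message) { buffer.add(message); }

    /** At the end of execution: emit buffered entries only if the component failed. */
    public void complete(boolean failed) {
        if (failed) flushed.addAll(buffer);
        buffer.clear();
    }

    public List<String> flushed() { return Collections.unmodifiableList(flushed); }
}
```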
Performance Logging (Profiling)
Since logging entries (both audit and technical) are associated with a timestamp, it is obvious that logging can be used to collect performance statistics (profiling) for service operations. The relevant log entries are only those logged at the audit logging points, for two reasons:
- The technical logging points are normally too dense within a component
- Technical logging in Production and during Performance Testing introduces an overhead which is normally unacceptable and that in any case will bias the profiling results (i.e., the very act of measuring affects the measurements).
The main statistics (min, max, average, standard deviation, etc.) that can be collected are:
- those related to the distribution of service operation invocations (by period, consumer, etc.)
- those related to the total service operation execution time, within the ESB component exposing the operation
- those related to the time taken to invoke composed services, as measured by the composing component; these, combined with the total execution time statistics of the services being composed, can provide statistics for message transit times between components (significant when the services being composed are geographically dispersed)
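Once invocation durations have been derived from pairs of audit-point timestamps, the statistics above are straightforward to compute. A minimal sketch over durations in milliseconds:

```java
/** Sketch: basic profiling statistics derived from audit-point timestamps.
 *  Returns { min, max, average, standard deviation } for durations in ms. */
public class ProfilingStats {
    public static double[] summarize(long[] durationsMs) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
        for (long d : durationsMs) {
            min = Math.min(min, d);
            max = Math.max(max, d);
            sum += d;
        }
        double avg = (double) sum / durationsMs.length;
        double variance = 0;
        for (long d : durationsMs) {
            variance += (d - avg) * (d - avg);
        }
        double stdDev = Math.sqrt(variance / durationsMs.length);
        return new double[] { min, max, avg, stdDev };
    }
}
```

In practice these aggregates would be computed per operation and per period (and possibly per consumer) over the entries in the Logging Data Store, rather than over an in-memory array.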
Log entry Consolidation, Correlation, and Sequencing
The diagram used above to illustrate the audit logging points shows a very simple situation where all logging points are reached sequentially within the execution thread of the (only) component. In this case, correlation and sequencing of the produced log entries are trivial.
The situation changes a lot when we have to deal with complex service compositions, distributed ESB components (e.g. deployed in a hybrid cloud environment), asynchronous interactions, clustered environments, or services orchestrated by a BPM engine.
Even with a rather simple synchronous composition of ESB services, it is not guaranteed that these services run in the same thread: a supposedly "local" service invocation in a clustered environment may execute on a cluster node other than the one where the calling service runs.
We must always assume that composed services are implemented by distributed components when they are invoked asynchronously.
Each component involved in a service execution creates multiple log entries in the Logging Data Store, and all these log entries must be mutually correlated and sequenced when browsing the log.
An example, purposefully a bit involved, is given in the following diagram:
In the picture above, a distributed service composition combines functionality from four different backend systems.
The main service (implemented on component 1) hosted on runtime A directly or indirectly invokes three other services (one hosted in the same runtime, and the other two on different runtimes). The service hosted on runtime C (component 4) is invoked asynchronously through a messaging queue.
Each service component performs logging at its audit logging points, and perhaps some components will also have technical logging enabled. As a result, logging entries will originate from different runtimes and must be correlated and consolidated into a single place (the Logging Data Store). That, by itself, can pose a concern: it is clear upfront that a naive logging implementation, such as a separate synchronous INSERT into a central database for every individual log entry, would be inefficient and hurt our service response times.
The Logging Data Store does not need to be physically centralized, but it needs to be centralized logically, allowing a User Interface to access all of its information no matter where it is physically stored.
For the audit and logging information to make sense, log entries originating from different components that relate, directly or indirectly, to the same invocation by the External Service Consumer must be:
- Logically grouped together
- Accessible as a whole in a location-independent fashion (i.e., even when not all related entries are stored in the same physical data store)
- Organized hierarchically based on service composition logic (e.g., showing the execution of component 3 as a subordinate of the execution of component 2, which is, in turn, a subordinate of the execution of component 1)
- Sequenced correctly (regardless of deviation of the system clocks of the different machines). This is necessary for proper profiling.
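The four requirements above are commonly met by propagating a small correlation context from caller to callee (for example in a transport header): a global id shared by the whole invocation, a parent/child link for the hierarchy, and a local sequence counter that orders entries without relying on system clocks. The class and field names below are hypothetical, chosen only to illustrate the idea.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: correlation context propagated across components of a composition. */
public class LogContext {
    public final String globalId;     // shared by every entry of one consumer invocation
    public final String parentId;     // execution id of the invoking component (null at root)
    public final String executionId;  // this component's own execution id
    private final AtomicInteger sequence = new AtomicInteger(); // clock-independent ordering

    public LogContext(String globalId, String parentId, String executionId) {
        this.globalId = globalId;
        this.parentId = parentId;
        this.executionId = executionId;
    }

    /** Context to pass along (e.g., in a transport header) when invoking a composed service:
     *  same global id, with this component recorded as the parent. */
    public LogContext child(String childExecutionId) {
        return new LogContext(globalId, executionId, childExecutionId);
    }

    /** Sequence number stamped on each log entry produced by this execution. */
    public int nextSequence() { return sequence.incrementAndGet(); }
}
```

Grouping is then a query by `globalId`, the hierarchy is reconstructed from the parent/child links, and per-component sequencing comes from the counter rather than from machine clocks.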
Several requirements must thus be satisfied to put effective and efficient logging in place in a distributed ESB environment.
The next article (ESB Logging, Part 2) will discuss some techniques to help us implement these requirements.
Published at DZone with permission of Francesco Ganora. See the original article here.