Loggregator: The Voice Of Cloud Foundry

Last week I met with Alex Jackson from Pivotal, one of the creators of Loggregator, the logging system of Cloud Foundry. We discussed the past, present and future of this critical component of the open-source PaaS project.

I asked Alex for his elevator pitch for Loggregator. "It's the Logging and Metrics subsystem of Cloud Foundry." He said Erik Jasiak, the Product Manager of Loggregator, calls it "the voice of the system", which tells the world what's going on inside.

Where it began

So when did this voice start speaking to us? Alex said it started around the beginning of version 2 of Cloud Foundry. At that point there was no debugging information coming from applications and everything was limited to static logs. The gap was felt most during an application's staging process: there were no logs streaming back in real time from staging, as you had with something like Heroku.

Before creating Loggregator they looked at a number of open-source solutions, such as Heroku's Logplex, Heka from Mozilla, rsyslog, fluentd and various other logging agents. The major issue with these other solutions was that none of them supported multi-tenancy. They also considered Stackato's Logyard, but that was not open-source at the time.

Loggregator started with a single cluster of "loggregator" processes and without the Metron agent (which now resides on each node). Therefore, emitting to Loggregator involved the Cloud Foundry components having a dependency on an in-process library. Now, the Metron agent sits on every node, so it is easily discoverable on the local machine by components emitting logging and metrics. Metron itself finds the Loggregator (now named "Doppler") servers using service discovery facilitated by etcd. When a Doppler server comes online it registers itself with etcd to tell the cluster it is available and which zone it resides in. Metron agents listen to etcd for changes to the Doppler cluster and then direct logging and metrics to the available Doppler servers.
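
The register-and-watch flow can be sketched with an in-memory stand-in for etcd. This is only an illustration, not Loggregator's code: the Registry and MetronAgent classes, the addresses and the same-zone preference are all assumptions made for the example, and a real deployment would use an etcd client and its watch mechanism.

```python
import random


class Registry:
    """In-memory stand-in for etcd (hypothetical; not the etcd API)."""

    def __init__(self):
        self._dopplers = {}   # address -> zone
        self._watchers = []   # callbacks invoked on every change

    def watch(self, callback):
        self._watchers.append(callback)
        callback(dict(self._dopplers))  # deliver the current state immediately

    def register(self, address, zone):
        self._dopplers[address] = zone
        self._notify()

    def deregister(self, address):
        self._dopplers.pop(address, None)
        self._notify()

    def _notify(self):
        for callback in self._watchers:
            callback(dict(self._dopplers))


class MetronAgent:
    """Tracks the available Doppler servers and picks one per message."""

    def __init__(self, registry, zone):
        self.zone = zone
        self.dopplers = {}
        registry.watch(self._on_change)

    def _on_change(self, dopplers):
        self.dopplers = dopplers

    def pick_doppler(self):
        if not self.dopplers:
            return None  # nothing available; the message would be dropped
        # Assumption: prefer servers in the agent's own zone, else any server.
        local = [a for a, z in self.dopplers.items() if z == self.zone]
        return random.choice(local or list(self.dopplers))


registry = Registry()
agent = MetronAgent(registry, zone="z1")
registry.register("10.0.0.5:3457", "z1")
registry.register("10.0.1.5:3457", "z2")
print(agent.pick_doppler())   # the z1 server while it is registered
registry.deregister("10.0.0.5:3457")
print(agent.pick_doppler())   # falls back to the z2 server
```

The point of the sketch is that emitters never need a list of Doppler addresses: they talk to the local agent, and the agent's view of the cluster follows the registry.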

Alex said the biggest evolution since the creation of Loggregator was the recently added metrics, which are supported via an additional event type. There is also the firehose, which allows operators to see all logging and metrics emitted by all applications and Cloud Foundry components.

Added early last year (2014) was the ability to bind syslog drains for applications. TLS and HTTPS drains are also supported.

Protobufs

Google Protocol Buffers have been used to encode logging messages since the beginning. This is a popular encoding format and provides fast marshalling and unmarshalling of data. They also looked at Thrift, although that project seemed less popular. It was a similar story for Cap'n Proto - it was interesting, but too new. JSON, BSON and MessagePack were not efficient enough for marshalling and unmarshalling of the data. An added benefit of Google Protocol Buffers was that it was possible to add and remove fields without the system breaking.
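
The last point, tolerating added and removed fields, comes from tagging every field on the wire so that a reader can skip tags it does not recognise. The sketch below shows that idea with a deliberately simplified tag-length-value encoding. It is not the real Protocol Buffers wire format (which uses varints and wire types), and the field numbers and names are hypothetical.

```python
import struct


def encode(fields):
    """Encode {tag: bytes} as repeated (tag, length, value) records."""
    out = b""
    for tag, value in fields.items():
        out += struct.pack("!BH", tag, len(value)) + value
    return out


def decode(data, known_tags):
    """Decode records, silently skipping tags this reader does not know."""
    result = {}
    offset = 0
    while offset < len(data):
        tag, length = struct.unpack_from("!BH", data, offset)
        offset += 3  # one byte of tag plus two bytes of length
        value = data[offset:offset + length]
        offset += length
        if tag in known_tags:
            result[known_tags[tag]] = value
    return result


# Hypothetical schema: an old reader knows two fields, a newer writer sends three.
OLD_READER = {1: "message", 2: "app_id"}
payload = encode({1: b"hello", 2: b"app-123", 3: b"instance-0"})
print(decode(payload, OLD_READER))  # the unknown tag 3 is ignored
```

A reader built against the older schema keeps working when a writer adds field 3, and a payload that omits a field simply yields a result without that key.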

When Metron was introduced, the Protocol Buffer format used by Loggregator was changed to support more message types. The legacy format is still emitted by Ruby components such as the Cloud Controller and DEA, via the loggregator_emitter gem, so Metron continues to support it. Metron internally converts messages in this legacy format to the new format before sending them on to a Doppler server. At the other end of the system, the legacy format is also supported by the TrafficController so that older consumers can still plug into the Loggregator system. At both ends of Loggregator, different ports are used for the legacy and non-legacy format channels.

Scalability

I asked Alex what the challenges have been with Loggregator. He said scalability is always an issue. For most clusters you will generally scale up the DEA nodes, which then makes the Doppler servers the bottleneck. It is also possible for a single app to overwhelm the logging system, although there is rate limiting, which is configurable by the cluster administrator.
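
Alex did not describe how the rate limiting is implemented, so the following is a generic token-bucket sketch rather than Loggregator's mechanism. It shows one common way to stop a single noisy application from starving the others; the class names, the rate and burst parameters and the per-application keying are all assumptions.

```python
import time


class TokenBucket:
    """Allows up to `rate` messages per second with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: the caller drops the message


class PerAppLimiter:
    """One bucket per application ID, so a noisy app only throttles itself."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets = {}

    def allow(self, app_id):
        if app_id not in self.buckets:
            self.buckets[app_id] = TokenBucket(self.rate, self.burst, self.clock)
        return self.buckets[app_id].allow()


limiter = PerAppLimiter(rate=100, burst=200)
print(limiter.allow("app-123"))  # True while the app stays under its budget
```

The clock is injectable so the behaviour can be checked deterministically without sleeping.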

Nozzles

Loggregator has the concept of "nozzles" that are added to the edge and provide a way to stream logging and metrics into endpoints such as Syslog and Graphite. Previously this was supported by the Collector's "historians", but adding a historian meant rebuilding Cloud Foundry. Nozzles, in contrast, allow for a more drop-in solution and Alex expects that the Cloud Foundry community will implement a broader set of these. There are not yet many examples of nozzles, but Alex pointed me at this syslog nozzle.
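
Conceptually, a nozzle is a small consumer that reads events from the firehose and translates them for one destination. The sketch below forwards metric events as Graphite plaintext lines ("path value timestamp"). The event dictionaries and the graphite_nozzle function are hypothetical: real firehose messages are protobuf-encoded and arrive over a websocket connection.

```python
def graphite_nozzle(firehose, write):
    """Forward metric events from `firehose` to `write` as Graphite plaintext lines."""
    forwarded = 0
    for event in firehose:
        if event.get("type") != "metric":
            continue  # this nozzle only cares about metrics, not log lines
        path = "cf.{origin}.{name}".format(**event)
        write("{} {} {}\n".format(path, event["value"], event["timestamp"]))
        forwarded += 1
    return forwarded


# Hypothetical events standing in for decoded firehose messages.
events = [
    {"type": "log", "origin": "app", "message": "started"},
    {"type": "metric", "origin": "router", "name": "latency",
     "value": 12.5, "timestamp": 1420070400},
]
lines = []
graphite_nozzle(events, lines.append)
print(lines)
```

Because the nozzle only depends on an iterable of events and a write callable, swapping Graphite for another endpoint means changing the formatting line, not the platform.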

Lossy

From the beginning the Loggregator system was built to be lossy. It uses UDP packets throughout, which means there is no guarantee that a logging message sent by a component will ever appear in the firehose. Alex said they have considered moving away from UDP in some areas, such as the communication with Metron. As Cloud Foundry users depend more on the metrics that Loggregator is supporting, there may be greater demand for delivery guarantees.
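
The fire-and-forget nature of UDP is easy to see with plain sockets. In this minimal sketch (the payload and the loopback receiver are stand-ins, not Loggregator's protocol), the sender gets no acknowledgement, so it cannot tell a delivered datagram from a dropped one.

```python
import socket

# A local receiver standing in for the component that collects log messages.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
receiver.settimeout(1.0)
address = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"app log line", address)  # returns at once; no acknowledgement

try:
    data, _ = receiver.recvfrom(65535)
    print("received", data)
except socket.timeout:
    # Under load the datagram may simply vanish, and the sender never finds out.
    print("datagram lost")
```

On an idle loopback interface the datagram will normally arrive; the losses that matter happen when buffers overflow under load.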

Diego

Alex said that, with Diego, Loggregator is a first-class citizen. When they applied Loggregator to the DEA they essentially had to do it with shims to retrofit it. With Diego they are now able to emit system metrics directly out of the runtime components and they get better metrics on CPU, memory and potentially networking. The major difference is that the metrics are streamed rather than doing static polling. This was enabled by the re-implementation of the Warden interface as "Garden", which is implemented in Go.

Router

The GoRouter (Cloud Foundry's default router) is involved in Loggregator in two ways. First, it emits metrics on all application HTTP requests, including request latency measurements and HTTP response codes. Second, it sits in front of the TrafficController and proxies all requests. Alex said that the router does very little other than forwarding the requests and in fact you could configure HAProxy or another load-balancer to forward traffic directly to TrafficController instances. He said bypassing the router is undocumented and not supported out-of-the-box. The reason the router was initially used was to ease development.

Testing

I asked about the testing that Loggregator goes through. Alex said they have a lot of unit tests, as well as integration tests. These integration tests are being "beefed up". He made a distinction between "integration" tests, where you mock out other components, and "system" tests, where you actually run the other components. There is also the CATs (Cloud Foundry Acceptance Tests) suite and ad-hoc load testing against AWS and vSphere. There is an intention to make the ad-hoc tests more regular and formal.

Lattice

While looking at Loggregator I came across "Lattice", so I asked Alex about it. It is a project run by a Pivotal team in Chicago which Alex's Loggregator team has been supporting. It is a tool for developers that combines Diego, Loggregator and the GoRouter. It has a much smaller footprint than the entire Cloud Foundry stack and provides something that is just functional enough to run an application. It is designed to provide fast feedback to application developers.

System logs

Loggregator focuses on application logs and there is no support yet for system logs. Alex said this is a priority for later this year. The current recommendation is to use rsyslog to forward system logs directly from the host VMs to an external store such as Splunk, Papertrail or an ELK stack.

etcd

etcd is currently used in Loggregator for discovery of Doppler servers by the Metron agents and TrafficController instances. I'm very interested in the recent rise of service discovery solutions, so I asked Alex about their experience with etcd and whether they have looked at Consul from HashiCorp, which is also growing in popularity. He said that their usage of etcd is quite minimal and the Diego team is doing a lot more in this area. So far they have been happy with etcd. There has been some weirdness in etcd library code, such as errors getting buried, but CoreOS has been good at fixing these issues quickly and accepting pull requests. As far as Consul goes, Alex said the Diego team are looking at it.

BOSH release?

As I discussed with Onsi Fakhouri previously, Loggregator is currently deployed as part of the larger cf-release, whereas Diego is deployed as its own diego-release. Onsi mentioned that it would make sense to break up cf-release into smaller units of deployment and that Loggregator was an obvious first choice for this. These separate releases would then be composable. Alex echoed this idea.

Buffering

The Doppler servers cache log lines in memory using a circular buffer. The size of the buffer is configurable via the maxRetainedLogMessages setting. Alex said there is some consideration as to whether to move this buffer to disk and provide a more variable size. Alex also mentioned a "crazy idea" of potentially getting rid of the Doppler servers and having every node, where the Metron agent currently resides, buffer the log lines locally to provide a more distributed architecture. This would be closer to the architecture of Stackato's Logyard.
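
A fixed-size circular buffer keeps only the most recent lines and silently overwrites the oldest. Here is a minimal sketch of that behaviour using a bounded deque; the LogBuffer class is hypothetical, with its constructor parameter named after the maxRetainedLogMessages setting, and it is not Doppler's actual implementation.

```python
from collections import deque


class LogBuffer:
    """Retains only the most recent `max_retained_log_messages` lines."""

    def __init__(self, max_retained_log_messages):
        self._lines = deque(maxlen=max_retained_log_messages)

    def append(self, line):
        self._lines.append(line)  # when full, the oldest line is discarded

    def recent(self):
        return list(self._lines)


buf = LogBuffer(max_retained_log_messages=3)
for i in range(5):
    buf.append("line %d" % i)
print(buf.recent())  # ['line 2', 'line 3', 'line 4']
```

Memory use is bounded by the configured size regardless of how chatty an application is, which is the trade-off against keeping a complete history.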

Clients

There is currently a client implementation in Go, which is used by the Cloud Foundry CLI. In fact, all of Loggregator is written in Go. Pivotal also has an internal proof-of-concept implementation for JavaScript. Alex said that clients in other languages should be fairly straightforward to implement, as websockets and protocol buffers are widely supported now.

Conclusion

Thanks to Alex Jackson for taking the time to talk Loggregator with me. It seems that the evolution of Loggregator is increasing in pace as it expands to handle metrics, more data, larger clusters and more integrations.

Published at DZone with permission of Phil Whelan, DZone MVB. See the original article here.
