The Mystery of Eureka Health Monitoring
We take a look at how you can be sure your microservices architecture is performing as it should by using this health monitoring tool.
Even though I'm fairly experienced with designing and configuring systems with health monitoring, when I first started configuring Eureka health checks, I had to put considerable effort into answering a few 'why' and 'when' questions. I'm therefore writing this post to complement the Spring Cloud Netflix Eureka documentation on health checks with my findings.
First of all, let's look at the description of one of the configurations from the documentation.
eureka.instance.health-check-url - Gets the absolute health check page URL for this instance. The users can provide the healthCheckUrlPath if the health check page resides in the same instance talking to eureka, otherwise, in the cases where the instance is a proxy for some other server, users can provide the full URL. If the full URL is provided it takes precedence.
If it's hard to comprehend, read on.
Learned or Taught?
Health checks are performed in order to identify and evict down/non-reachable microservices from the Eureka server registry. However, the Eureka server never sends keep-alive requests to its registered clients in order to learn whether they are active or not (as opposed to some Traffic Managers). Instead, Eureka clients send heartbeats to the Eureka server in order to educate the server on their status.
On a side note, I would like to coin the term Learned Monitoring for the approach where servers send keep-alive requests to clients; and the term Taught Monitoring for the approach where clients send heartbeats to servers.
Taught monitoring is OK for microservices since it's not much of a burden to embed a client (i.e. Eureka client) who knows how to send heartbeats. Also, it's obvious that the clients themselves have to determine their health status and Eureka server has to expose some REST operations for clients to publish heartbeats.
Eureka Server REST Operations
The Eureka server exposes the following resource to which the clients send heartbeats.
PUT /eureka/apps/{app id}/{instance id}?status={status}
I've omitted a few other query parameters for clarity. {instance id} takes the form hostname:app id:port and identifies a unique Eureka client instance. The Eureka server recognizes a few statuses: UP, DOWN, STARTING, OUT_OF_SERVICE, and UNKNOWN.
Thus, when values are assigned, it would look like this:
PUT /eureka/apps/ORDER-SERVICE/localhost:order-service:8886?status=UP
Upon receiving a heartbeat request, the Eureka server renews the lease of that instance. If it's the very first heartbeat, the Eureka server responds with a 404 and right after that the client sends a registration request.
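The heartbeat-then-register handshake above can be sketched as follows. This is a minimal illustration, not the real Eureka client: the URL shapes follow the REST operations shown here, while the helper names and the injected HTTP callables are our own assumptions.

```python
# Hypothetical sketch of the client-side heartbeat logic described above.
# A real Eureka client also handles retries, timeouts, and payloads.

def heartbeat_url(base, app_id, instance_id, status):
    """Build the renewal (heartbeat) URL for a registered instance."""
    return f"{base}/eureka/apps/{app_id}/{instance_id}?status={status}"

def send_heartbeat(http_put, http_register, base, app_id, instance_id, status):
    """PUT a heartbeat; on 404 (unknown instance), fall back to registration."""
    code = http_put(heartbeat_url(base, app_id, instance_id, status))
    if code == 404:
        # The server has no lease for this instance yet: (re-)register.
        http_register(f"{base}/eureka/apps/{app_id}")
    return code
```

Injecting `http_put` and `http_register` as callables keeps the control flow testable without a running Eureka server.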
Furthermore, the Eureka server exposes the following operations to allow the overriding and undo-overriding of health statuses:
PUT /eureka/apps/{app id}/{instance id}/status?value={status}
DELETE /eureka/apps/{app id}/{instance id}/status
The overriding operation (i.e. the PUT operation above) is used to take an otherwise healthy instance OUT_OF_SERVICE, either manually or via administration tools such as Asgard, in order to temporarily disallow traffic to that instance.
This is useful for 'red/black' deployments, where you run the older and newer versions of a microservice side by side for some period of time (so you can easily roll back to the older version if the new one is unstable). Once the new version is deployed and serving requests, the older instances can be taken OUT_OF_SERVICE (without bringing them down) so that they simply stop serving requests:
PUT /eureka/apps/ORDER-SERVICE/localhost:order-service:8886/status?value=OUT_OF_SERVICE
The overridden status can later be discarded, instructing the Eureka server to resume honoring the status published by the instance itself:
DELETE /eureka/apps/ORDER-SERVICE/localhost:order-service:8886/status
This is useful when you find the new version of a microservice is unstable and you want an older version (which is currently OUT_OF_SERVICE) to start serving requests again.
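The two operations pair up as shown in this small sketch. The endpoint shapes follow the Eureka REST API described above; the helper names are illustrative assumptions.

```python
# Illustrative helpers for the status-override operations above.

def override_status_url(base, app_id, instance_id, status):
    """URL whose PUT forces a status (e.g. OUT_OF_SERVICE) onto an instance."""
    return f"{base}/eureka/apps/{app_id}/{instance_id}/status?value={status}"

def undo_override_url(base, app_id, instance_id):
    """URL whose DELETE discards the override, restoring the instance's own status."""
    return f"{base}/eureka/apps/{app_id}/{instance_id}/status"
```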
Eureka Client Self-Diagnosis
The Eureka client (or server) never invokes the /health endpoint to determine an instance's health status. The health status of a Eureka instance is determined by a HealthCheckHandler implementation. The default HealthCheckHandler always announces that the application is UP as long as the application is running.
Eureka allows custom HealthCheckHandlers to be plugged in through the EurekaClient#registerHealthCheck() API. Spring Cloud leverages this extension point to register its own handler, EurekaHealthCheckHandler, if the following property is set:
eureka.client.healthcheck.enabled=true
The EurekaHealthCheckHandler works by aggregating the health statuses from multiple health indicators, such as:
- DiskSpaceHealthIndicator
- RefreshScopeHealthIndicator
- HystrixHealthIndicator
It then maps that status into one of the Eureka-supported statuses. This status will then be propagated to the Eureka server through heartbeats.
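Conceptually, the aggregation picks the worst of the indicator statuses and translates it into a Eureka status. The sketch below is a rough approximation, assuming a worst-wins severity ordering; consult the Spring Cloud source for the exact mapping rules.

```python
# Rough sketch of EurekaHealthCheckHandler-style aggregation (assumed
# worst-wins ordering; the real mapping lives in Spring Cloud Netflix).

SEVERITY = ["UP", "UNKNOWN", "OUT_OF_SERVICE", "DOWN"]  # least to most severe

TO_EUREKA = {"UP": "UP", "UNKNOWN": "UNKNOWN",
             "OUT_OF_SERVICE": "OUT_OF_SERVICE", "DOWN": "DOWN"}

def aggregate(indicator_statuses):
    """Pick the most severe indicator status and translate it for Eureka."""
    worst = max(indicator_statuses, key=SEVERITY.index, default="UP")
    return TO_EUREKA[worst]
```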
Eureka Client Health Endpoints
Eureka clients POST a healthCheckUrl in the payload when registering themselves with the server. The value of healthCheckUrl is calculated from the following instance properties:
eureka.instance.health-check-url
eureka.instance.health-check-url-path
The default value of health-check-url-path is /health, which is the Spring Boot default health actuator endpoint; it is ignored if health-check-url is configured.
These properties should be configured if you implement a custom health endpoint or change the properties that affect the default health endpoint path. For example, if you change the default health endpoint path:
endpoints.health.path=/new-health
# either relative path
eureka.instance.health-check-url-path=${endpoints.health.path}
# or absolute path
eureka.instance.health-check-url=http://${eureka.hostname}:${server.port}${endpoints.health.path}
Or if you introduce a management.context-path:
management.context-path=/admin
# either relative path
eureka.instance.health-check-url-path=${management.context-path}/health
# or absolute path
eureka.instance.health-check-url=http://${eureka.hostname}:${server.port}${management.context-path}/health
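The precedence rule between the two properties can be summed up in a few lines. The property names are real; the resolution logic below is our simplification of what the client registers.

```python
# Sketch of the registered healthCheckUrl: an absolute health-check-url
# wins; otherwise health-check-url-path (default /health) is resolved
# against the instance's host and port. A simplification for illustration.

def resolve_health_check_url(hostname, port, url=None, path="/health"):
    """Return the healthCheckUrl a client would register."""
    if url:  # absolute URL takes precedence
        return url
    return f"http://{hostname}:{port}{path}"
```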
Making Use of Health Status
The Eureka server doesn't care much about what a client's status is - it just records it. When somebody queries its registry, it publishes the clients' health statuses as well:
GET /eureka/apps/ORDER-SERVICE
<application>
<name>ORDER-SERVICE</name>
<instance>
<instanceId>localhost:order-service:8886</instanceId>
<ipAddr>192.168.1.6</ipAddr>
<port>8886</port>
<status>UP</status>
<overriddenstatus>UP</overriddenstatus>
<healthCheckUrl>http://localhost:8886/health</healthCheckUrl>
...
...
</instance>
</application>
The response has three important health-related pieces of information: status, overriddenstatus, and healthCheckUrl.
- status is the health status as published by the Eureka instance itself.
- overriddenstatus is the health status that is enforced either manually or by tools. The PUT /eureka/apps/{app id}/{instance id}/status?value={status} operation is used to override the status published by the Eureka instance and, once invoked, both status and overriddenstatus are changed to the new status.
- healthCheckUrl is the endpoint that the client exposes to GET its health status.
This information can be leveraged by tools for various purposes.
- Client-side load balancers like Ribbon use it to make load-balancing decisions. Ribbon reads the status attribute and considers only the instances with an UP status for load balancing. Ribbon, however, does not invoke the healthCheckUrl but relies on the published instance status available in the registry.
- Deployment tools like Asgard use it to make deployment decisions. During rolling deployments, Asgard deploys a new version of a microservice and waits until the instance has transitioned to the UP status before deploying the rest of the instances (as a risk mitigation strategy). However, rather than relying on the instance status available in the registry (i.e. the status attribute), Asgard learns the instance status by invoking healthCheckUrl. This is likely because the status attribute can be stale (since it depends on a few factors, as described in the next section), and a live health status matters here in order to avoid deployment delays.
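Ribbon's use of the status attribute amounts to a simple filter over registry entries, as this minimal sketch shows. The instance records are simplified dicts, not Ribbon's actual model.

```python
# What "considers only instances with an UP status" amounts to:
# filter the registry entries a client-side balancer would route to.

def eligible_instances(instances):
    """Keep only instances whose published status is UP."""
    return [i for i in instances if i.get("status") == "UP"]
```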
Accuracy of Health Status
The Eureka server registry (hence the health status) is not always accurate due to the reasons listed below.
- AP in CAP - As Eureka is a Highly Available system in terms of the CAP theorem, information in the registry may not be consistent between Eureka servers in the cluster - during a network partition.
- Server response cache - The Eureka server maintains a response cache which is updated every 30 seconds by default. Therefore, an instance which is actually DOWN may appear as UP in the GET /eureka/apps/{app id} response.
- Scheduled heartbeats - Since the clients send heartbeats every 30 seconds by default, the health status of an instance in the server registry can lag behind reality.
- Self-preservation - The Eureka server stops expiring clients from the registry when it does not receive heartbeats beyond a certain threshold, which in turn makes the registry inaccurate.
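The two default intervals above compound. A back-of-the-envelope sketch of the worst-case staleness, ignoring self-preservation and replication lag:

```python
# Worst-case delay between a client's status change and a reader seeing
# it, with the default 30s heartbeat and 30s response-cache intervals.
# (An assumption for illustration; ignores self-preservation and AP lag.)

def worst_case_staleness(heartbeat_interval=30, response_cache_ttl=30):
    """Seconds a status change may take to appear in a registry read."""
    return heartbeat_interval + response_cache_ttl
```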
Therefore, clients should employ proper failover mechanisms to compensate for this inaccuracy.
Published at DZone with permission of Fahim Farook. See the original article here.