The Mystery of Eureka Health Monitoring
We take a look at how you can be sure your microservices architecture is performing as it should by using this health monitoring tool.
Even though I'm fairly experienced with designing and configuring systems with health monitoring, when I first started configuring Eureka health checks, I had to put considerable effort into answering a few 'why' and 'when' questions. I'm therefore writing this post to complement the Spring Cloud Netflix Eureka documentation on health checks with my findings.
First of all, let's look at the description of one of the configurations from the documentation.
eureka.instance.health-check-url - Gets the absolute health check page URL for this instance. The users can provide the healthCheckUrlPath if the health check page resides in the same instance talking to eureka, otherwise, in the cases where the instance is a proxy for some other server, users can provide the full URL. If the full URL is provided it takes precedence.
If it's hard to comprehend, read on.
Learned or Taught?
Health checks are performed in order to identify and evict down/non-reachable microservices from the Eureka server registry. However, the Eureka server never sends keep-alive requests to its registered clients in order to learn whether they are active or not (as opposed to some Traffic Managers). Instead, Eureka clients send heartbeats to the Eureka server in order to educate the server on their status.
On a side note, I would like to coin the term Learned Monitoring for the approach where servers send keep-alive requests to clients; and the term Taught Monitoring for the approach where clients send heartbeats to servers.
Taught monitoring is OK for microservices since it's not much of a burden to embed a client (i.e. Eureka client) who knows how to send heartbeats. Also, it's obvious that the clients themselves have to determine their health status and Eureka server has to expose some REST operations for clients to publish heartbeats.
Eureka Server REST Operations
The Eureka server exposes the following resource to which the clients send heartbeats.
PUT /eureka/apps/{app id}/{instance id}?status={status}
I've omitted a few other query parameters for clarity. {instance id} takes the form hostname:app id:port and identifies a unique Eureka client instance. The Eureka server recognizes a few statuses: UP, DOWN, STARTING, OUT_OF_SERVICE, and UNKNOWN.
Thus, when values are assigned, it would look like this:
PUT /eureka/apps/ORDER-SERVICE/localhost:order-service:8886?status=UP
Upon receiving a heartbeat request, the Eureka server renews the lease of that instance. If it's the very first heartbeat, the Eureka server responds with a 404 and right after that the client sends a registration request.
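The heartbeat-then-register handshake above can be sketched as follows. This is a minimal illustration, not the real Eureka client: the URL shapes follow the REST operations shown here, while the helper names and the injected HTTP callables are our own assumptions.

```python
# Hypothetical sketch of the client-side heartbeat logic described above.
# A real Eureka client also handles retries, timeouts, and payloads.

def heartbeat_url(base, app_id, instance_id, status):
    """Build the renewal (heartbeat) URL for a registered instance."""
    return f"{base}/eureka/apps/{app_id}/{instance_id}?status={status}"

def send_heartbeat(http_put, http_register, base, app_id, instance_id, status):
    """PUT a heartbeat; on 404 (unknown instance), fall back to registration."""
    code = http_put(heartbeat_url(base, app_id, instance_id, status))
    if code == 404:
        # The server has no lease for this instance yet: (re-)register.
        http_register(f"{base}/eureka/apps/{app_id}")
    return code
```

Injecting `http_put` and `http_register` as callables keeps the control flow testable without a running Eureka server.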
Furthermore, the Eureka server exposes the following operations to allow the overriding and undo-overriding of health statuses:
PUT /eureka/apps/{app id}/{instance id}/status?value={status}
DELETE /eureka/apps/{app id}/{instance id}/status
The overriding operation (i.e. the PUT operation above) is used to take an otherwise healthy instance OUT_OF_SERVICE, either manually or via administration tools such as Asgard, in order to temporarily disallow traffic to that instance.
This is useful for 'red/black' deployments, where you run the older and newer versions of a microservice side by side for some period of time (so you can easily roll back to the older version if the new one is unstable). Once the new version is deployed and serving requests, the older instances can be taken OUT_OF_SERVICE (without bringing them down) so that they simply stop serving requests:
PUT /eureka/apps/ORDER-SERVICE/localhost:order-service:8886/status?value=OUT_OF_SERVICE
The overridden status can later be discarded, instructing the Eureka server to resume honoring the status published by the instance itself:
DELETE /eureka/apps/ORDER-SERVICE/localhost:order-service:8886/status
This is useful when you find the new version of a microservice is unstable and you want an older version (which is currently OUT_OF_SERVICE) to start serving requests again.
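The two operations pair up as shown in this small sketch. The endpoint shapes follow the Eureka REST API described above; the helper names are illustrative assumptions.

```python
# Illustrative helpers for the status-override operations above.

def override_status_url(base, app_id, instance_id, status):
    """URL whose PUT forces a status (e.g. OUT_OF_SERVICE) onto an instance."""
    return f"{base}/eureka/apps/{app_id}/{instance_id}/status?value={status}"

def undo_override_url(base, app_id, instance_id):
    """URL whose DELETE discards the override, restoring the instance's own status."""
    return f"{base}/eureka/apps/{app_id}/{instance_id}/status"
```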
Eureka Client Self-Diagnosis
The Eureka client (or server) never invokes the /health endpoint to determine an instance's health status. The health status of a Eureka instance is determined by a HealthCheckHandler implementation. The default HealthCheckHandler always announces that the application is UP as long as the application is running.
Eureka allows custom HealthCheckHandlers to be plugged in through the EurekaClient#registerHealthCheck() API. Spring Cloud leverages this extension point to register its own handler, EurekaHealthCheckHandler, if the following property is set:
eureka.client.healthcheck.enabled=true
The EurekaHealthCheckHandler works by aggregating the health statuses from multiple health indicators, such as:
- DiskSpaceHealthIndicator
- RefreshScopeHealthIndicator
- HystrixHealthIndicator
It then maps that status into one of the Eureka-supported statuses. This status will then be propagated to the Eureka server through heartbeats.
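Conceptually, the aggregation picks the worst of the indicator statuses and translates it into a Eureka status. The sketch below is a rough approximation, assuming a worst-wins severity ordering; consult the Spring Cloud source for the exact mapping rules.

```python
# Rough sketch of EurekaHealthCheckHandler-style aggregation (assumed
# worst-wins ordering; the real mapping lives in Spring Cloud Netflix).

SEVERITY = ["UP", "UNKNOWN", "OUT_OF_SERVICE", "DOWN"]  # least to most severe

TO_EUREKA = {"UP": "UP", "UNKNOWN": "UNKNOWN",
             "OUT_OF_SERVICE": "OUT_OF_SERVICE", "DOWN": "DOWN"}

def aggregate(indicator_statuses):
    """Pick the most severe indicator status and translate it for Eureka."""
    worst = max(indicator_statuses, key=SEVERITY.index, default="UP")
    return TO_EUREKA[worst]
```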
Eureka Client Health Endpoints
Eureka clients POST a healthCheckUrl in the payload when registering themselves with the server. The value of healthCheckUrl is calculated from the following instance properties:
eureka.instance.health-check-url
eureka.instance.health-check-url-path
The default value of health-check-url-path is /health, which is the Spring Boot default health actuator endpoint; it is ignored if health-check-url is configured.
These properties should be configured if you implement a custom health endpoint or change the properties that affect the default health endpoint path. For example, if you change the default health endpoint path:
endpoints.health.path=/new-health
# either relative path
eureka.instance.health-check-url-path=${endpoints.health.path}
# or absolute path
eureka.instance.health-check-url=http://${eureka.hostname}:${server.port}${endpoints.health.path}
Or if you introduce a management.context-path:
management.context-path=/admin
# either relative path
eureka.instance.health-check-url-path=${management.context-path}/health
# or absolute path
eureka.instance.health-check-url=http://${eureka.hostname}:${server.port}${management.context-path}/health
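The precedence rule between the two properties can be summed up in a few lines. The property names are real; the resolution logic below is our simplification of what the client registers.

```python
# Sketch of the registered healthCheckUrl: an absolute health-check-url
# wins; otherwise health-check-url-path (default /health) is resolved
# against the instance's host and port. A simplification for illustration.

def resolve_health_check_url(hostname, port, url=None, path="/health"):
    """Return the healthCheckUrl a client would register."""
    if url:  # absolute URL takes precedence
        return url
    return f"http://{hostname}:{port}{path}"
```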
Making Use of Health Status
The Eureka server doesn't care much about what a client's status is - it just records it. When somebody queries its registry, it publishes the clients' health statuses as well:
GET /eureka/apps/ORDER-SERVICE
<application>
<name>ORDER-SERVICE</name>
<instance>
<instanceId>localhost:order-service:8886</instanceId>
<ipAddr>192.168.1.6</ipAddr>
<port>8886</port>
<status>UP</status>
<overriddenstatus>UP</overriddenstatus>
<healthCheckUrl>http://localhost:8886/health</healthCheckUrl>
...
...
</instance>
</application>
The response has three important health-related pieces of information: status, overriddenstatus, and healthCheckUrl.
- status is the health status as published by the Eureka instance itself.
- overriddenstatus is the health status that is enforced either manually or by tools. The PUT /eureka/apps/{app id}/{instance id}/status?value={status} operation is used to override the status published by the Eureka instance and, once invoked, both status and overriddenstatus are changed to the new status.
- healthCheckUrl is the endpoint that the client exposes to GET its health status.
This information can be leveraged by tools for various purposes.
- Client-side load balancers like Ribbon use it to make load-balancing decisions. Ribbon reads the status attribute and considers only the instances with an UP status for load balancing. Ribbon, however, does not invoke the healthCheckUrl but relies on the published instance status available in the registry.
- Deployment tools like Asgard use it to make deployment decisions. During rolling deployments, Asgard deploys a new version of a microservice and waits until the instance has transitioned to the UP status before deploying the rest of the instances (as a risk mitigation strategy). However, rather than relying on the instance status available in the registry (i.e. the status attribute), Asgard learns the instance status by invoking healthCheckUrl. This is likely because the status attribute can be stale (since it depends on a few factors, as described in the next section), and a live health status matters here in order to avoid deployment delays.
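Ribbon's use of the status attribute amounts to a simple filter over registry entries, as this minimal sketch shows. The instance records are simplified dicts, not Ribbon's actual model.

```python
# What "considers only instances with an UP status" amounts to:
# filter the registry entries a client-side balancer would route to.

def eligible_instances(instances):
    """Keep only instances whose published status is UP."""
    return [i for i in instances if i.get("status") == "UP"]
```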
Accuracy of Health Status
The Eureka server registry (hence the health status) is not always accurate due to the reasons listed below.
- AP in CAP - As Eureka is a Highly Available system in terms of the CAP theorem, information in the registry may not be consistent between Eureka servers in the cluster - during a network partition.
- Server response cache - The Eureka server maintains a response cache which is updated every 30 seconds by default. Therefore, an instance which is actually DOWN may appear as UP in the GET /eureka/apps/{app id} response.
- Scheduled heartbeats - Since the clients send heartbeats every 30 seconds by default, the health status of an instance in the server registry can lag behind reality.
- Self-preservation - The Eureka server stops expiring clients from the registry when it does not receive heartbeats beyond a certain threshold, which in turn makes the registry inaccurate.
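The two default intervals above compound. A back-of-the-envelope sketch of the worst-case staleness, ignoring self-preservation and replication lag:

```python
# Worst-case delay between a client's status change and a reader seeing
# it, with the default 30s heartbeat and 30s response-cache intervals.
# (An assumption for illustration; ignores self-preservation and AP lag.)

def worst_case_staleness(heartbeat_interval=30, response_cache_ttl=30):
    """Seconds a status change may take to appear in a registry read."""
    return heartbeat_interval + response_cache_ttl
```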
Therefore, clients should employ proper failover mechanisms to compensate for this inaccuracy.
Published at DZone with permission of Fahim Farook. See the original article here.