Observability in Hybrid Multi-Cloud Environment
This is my study of a real customer use case on monitoring and logging in a multi-cloud environment using Red Hat’s open-source technology.
In my last article, I talked about the study I did among Red Hat customers that are making the jump toward deploying their workloads on hybrid and multi-cloud environments. These articles are abstractions of the common, generic components, summarized from the actual implementations.
Going hybrid and multi-cloud comes with common obstacles: finding talent with multi-cloud knowledge, securing and protecting workloads across low-trust networks, and simply running day-to-day operations across the board. From the study, I have identified some solutions, which I will be covering in this series of articles:
- Briefing of the Hybrid Multi-Cloud Study
- Overview of Making Hybrid Multi-Cloud GitOps Works
- Hybrid Multi-Cloud Dynamic Security Management
- Observability in Hybrid Multi-Cloud Environment (This Article)
Customize Monitoring and Logging Strategy
One of the keys to managing multiple clusters is observability. Even in a single-cluster environment, metrics and logs are segregated across different layers: applications, cluster components, and nodes. Adding multiple clusters on different clouds makes things more chaotic than ever. Customizing how you gather and view metrics lets you operate more effectively and pinpoint problems quickly. To do that, we first need to decide how to aggregate the collected data: by region, by operational domain, or at a single centralized point (this depends on your HA strategy and on how much data you are collecting). We then decide how to scrape or collect the metrics and logs. Historical data is valuable not only for viewing and identifying problems; many vendors now support AI-based remediation, which validates and extracts the data for operational models. Therefore, we will also need to persist the data.
How It Works
Check out my previous article on the setup of the hybrid and multi-cloud environment and, if you are interested, the other articles on GitOps and secure dynamic infrastructure. This time, let's look at how observability works. First, to view everything in a single pane of glass, we host a Grafana dashboard in the Hub cluster that streams and queries the Observatorium services across all managed clusters.
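As a rough sketch of that single pane of glass, the hub cluster's Grafana can be provisioned with one Prometheus-compatible data source per regional Observatorium endpoint. The URLs, tenant path, and secret reference below are hypothetical placeholders, not values from the actual deployment:

```yaml
# Grafana data source provisioning (hypothetical endpoints and tenant)
apiVersion: 1
datasources:
  - name: observatorium-us-east          # one data source per region
    type: prometheus                     # Observatorium exposes a Prometheus-compatible query API
    access: proxy
    url: https://observatorium-us-east.example.com/api/metrics/v1/default
    jsonData:
      httpHeaderName1: Authorization     # token-based auth, if enabled on the endpoint
    secureJsonData:
      httpHeaderValue1: Bearer ${OBSERVATORIUM_TOKEN}
```

A dashboard can then mix panels from several of these data sources to compare regions side by side.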
When bootstrapping the managed clusters on each cloud, we will need to install the following:
Prometheus is installed to scrape metrics from cluster components as well as from the applications. A Thanos sidecar is deployed alongside each Prometheus instance to persist metrics to storage and to allow instant queries against Prometheus data. You may run multiple Prometheus instances per cluster, depending on how you want to distribute the workload.
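A minimal sketch of this pairing, assuming the Prometheus Operator is used (the `thanos` field is part of the Operator's `Prometheus` custom resource; the secret name is a placeholder):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: cluster-prometheus
spec:
  replicas: 2                        # multiple instances to spread the scrape workload
  serviceMonitorSelector: {}         # scrape every ServiceMonitor in scope
  thanos:
    # Injects a Thanos sidecar container into each Prometheus pod;
    # the sidecar uploads TSDB blocks to the configured object storage.
    objectStorageConfig:
      name: thanos-objstore-secret   # placeholder Secret holding objstore.yml
      key: objstore.yml
```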
Observatorium is installed according to your defined observability strategy (in this case, per region). This deploys a set of service instances on the cluster that are responsible for aggregating and efficiently storing incoming metrics, and for providing API endpoints for observation tools like Grafana to query the persisted data.
- Queries from the Grafana dashboard in the Hub cluster are processed by the central Querier component in Observatorium, which evaluates the PromQL queries and aggregates the results.
- Prometheus scrapes metrics in the local cluster; the Thanos sidecar pushes metrics to Observatorium to persist them in storage.
- The Thanos sidecar also acts as a proxy that serves Prometheus's local data to the Querier over Thanos's gRPC API.
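The sidecar's object-storage target is configured with a small `objstore.yml`. As a sketch for an S3-compatible bucket (the bucket name, endpoint, and credential variables are placeholders):

```yaml
# objstore.yml consumed by the Thanos sidecar (placeholder values)
type: S3
config:
  bucket: thanos-metrics-us-east       # one bucket per region, matching the aggregation strategy
  endpoint: s3.us-east-1.amazonaws.com
  access_key: ${AWS_ACCESS_KEY_ID}
  secret_key: ${AWS_SECRET_ACCESS_KEY}
```

Thanos supports several other backends (GCS, Azure, etc.) with the same `type`/`config` shape, which is convenient when each cloud in the fleet uses its own native storage.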
Promtail collects log files with fine-grained control of what to ingest, what to drop, and the final metadata to attach to each log line. Similar to Prometheus, multiple Promtail instances can be installed per cluster, depending on how you want to distribute the logging workload.
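For illustration, a Promtail configuration fragment that discovers pod logs, drops debug-level lines, and attaches region metadata before pushing to the Loki API. The push URL, tenant path, and label values are hypothetical:

```yaml
# Promtail sketch: what to ingest, what to drop, what metadata to attach
clients:
  - url: https://observatorium-us-east.example.com/api/logs/v1/default/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                               # discover pod logs via the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app                       # final metadata attached to the log stream
      - replacement: us-east                    # placeholder region label
        target_label: region
    pipeline_stages:
      - drop:
          expression: ".*level=debug.*"         # fine-grained control: drop noisy lines
```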
Observatorium in each defined region is also configured with Grafana Loki, which aggregates logs by labeling each log stream pushed from Promtail. Not only does it persist logs in storage, but it also allows you to query high-cardinality data for better analysis and visualization.
- Promtail is used to collect logs and push to Loki API (Observatorium).
- In Observatorium, the Loki distributor sends logs in batches to the ingesters, where they are persisted. One thing to beware of: both the ingester and the querier consume a lot of memory, so you will likely need more replicas of them.
- The Grafana dashboard in the Hub cluster displays logs by requesting:
- Real-time display (tail) with WebSocket.
- Time-series-based query with HTTP.
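Once the streams are persisted, they can be queried from Grafana with LogQL. Two illustrative examples (the label names and values are hypothetical):

```logql
# Tail error lines from one cluster's frontend pods
{cluster="aws-us-east", app="frontend"} |= "error"

# Error rate per cluster over the last 5 minutes, as a metric query
sum by (cluster) (rate({app="frontend"} |= "error" [5m]))
```

The first form backs the real-time (tail) view over WebSocket; the second is a time-series query served over HTTP, suitable for dashboard panels.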
This summarizes how observability was implemented in our customer’s hybrid multi-cloud environment. If you want to dive deeper into the technology, check out these videos done by Ales Nosek.
Published at DZone with permission of Christina Lin, DZone MVB. See the original article here.