Achieving Multi-Tenancy in Monitoring With Prometheus: The Mighty Thanos Receiver
In this article, I evaluate Thanos' recently GA'ed component, "Thanos Receiver," and how we can use it to implement a simple multi-tenant monitoring solution.
Hey there! If you are reading this blog post, then I guess you are already aware of Prometheus and how it helps us monitor distributed systems like Kubernetes. And if you are familiar with Prometheus, then chances are you have come across a project called Thanos. Thanos is a popular OSS project that helps enterprises achieve a highly available Prometheus setup with long-term storage capabilities. One of the common challenges of distributed monitoring is implementing multi-tenancy, and the Thanos receiver is the Thanos component designed to address this challenge. The receiver was part of Thanos for a long time, but it was experimental; recently, Thanos went GA with the receiver component.
Motivation
We tried this component with one of our clients and it worked well. However, due to a lack of documentation, the setup wasn't as smooth as we would have liked. The purpose of this blog post is to lay out a simple guide for those looking to create a multi-tenant monitoring setup using Prometheus and the Thanos receiver. In this post, we will use the Thanos receiver to achieve a simple multi-tenant monitoring setup where Prometheus can be a nearly stateless component on the tenant side.
A Few Words on the Thanos Receiver
The receiver is a Thanos component that can accept remote write requests from any Prometheus instance and store the data in its local TSDB; optionally, it can upload those TSDB blocks to object storage, like S3 or GCS, at regular intervals. The receiver does this by implementing the Prometheus Remote Write API. It builds on top of the existing Prometheus TSDB and retains its usefulness while extending its functionality with long-term storage, horizontal scalability, and downsampling. It exposes the StoreAPI so that Thanos queriers can query received metrics in real time.
Multi-Tenancy
The Thanos receiver supports multi-tenancy. It accepts Prometheus remote write requests and writes them into a local instance of the Prometheus TSDB. The value of the HTTP header ("THANOS-TENANT") of the incoming request determines the tenant ID for the written data. To prevent data leaking at the database level, each tenant gets an individual TSDB instance, meaning a single Thanos receiver may manage multiple TSDB instances. A request returns successfully once the data is committed to the tenant's TSDB. The receiver also attaches the tenant ID to the stored series as a label (similar to Prometheus external labels), which is what lets us query per tenant later on.
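To make the header-based tenancy concrete, here is a schematic request (the hostname is made up, and a real remote write body is a Snappy-compressed protobuf, so this only illustrates the routing-relevant parts, not a working payload):
# Illustrative only: shows the endpoint and tenant header; the payload file is hypothetical.
curl -X POST http://receiver.example.local:19291/api/v1/receive \
  -H "Content-Encoding: snappy" \
  -H "Content-Type: application/x-protobuf" \
  -H "THANOS-TENANT: tenant-a" \
  --data-binary @remote-write-payload.bin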
Hashring Configuration File
If we want features like load-balancing and data replication, we can run multiple instances of the Thanos receiver as a part of a single hashring. The receiver instances within the same hashring become aware of their peers through a hashring configuration file. The following is an example of a hashring configuration file:
[
    {
        "hashring": "tenant-a",
        "endpoints": ["tenant-a-1.metrics.local:19291/api/v1/receive", "tenant-a-2.metrics.local:19291/api/v1/receive"],
        "tenants": ["tenant-a"]
    },
    {
        "hashring": "tenants-b-c",
        "endpoints": ["tenant-b-c-1.metrics.local:19291/api/v1/receive", "tenant-b-c-2.metrics.local:19291/api/v1/receive"],
        "tenants": ["tenant-b", "tenant-c"]
    },
    {
        "hashring": "soft-tenants",
        "endpoints": ["http://soft-tenants-1.metrics.local:19291/api/v1/receive"]
    }
]
- Soft tenancy: If a hashring specifies no explicit tenants, then any tenant is considered a valid match; this allows a cluster to provide soft tenancy. Requests whose tenant ID explicitly matches no other hashring will automatically land in this soft-tenancy hashring. All incoming remote write requests that don't set the tenant header in the HTTP request fall under soft tenancy, and the default tenant ID (configurable through the --receive.default-tenant-id flag) is attached to their metrics.
- Hard tenancy: Hard tenants must set the tenant header in every remote write HTTP request. Hard tenants in the Thanos receiver are configured in the hashring config file, and changes to this configuration must be orchestrated by a configuration management tool. When a remote write request is received by a Thanos receiver, it goes through the list of configured hard tenants; each hard tenant has its own set of associated receiver endpoints.
Note: A remote write request can initially be received by any receiver instance; however, it will only be dispatched to the receiver endpoints that correspond to that tenant's hashring.
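For context, a receiver participating in a hashring is typically started with flags along these lines (the paths and values here are illustrative, not copied from the manifests we use later):
thanos receive \
    --tsdb.path=/var/thanos/receive \
    --tsdb.retention=15d \
    --label='replica="0"' \
    --remote-write.address=0.0.0.0:19291 \
    --receive.local-endpoint=127.0.0.1:10901 \
    --receive.hashrings-file=/etc/thanos/hashring.json \
    --receive.replication-factor=2 \
    --receive.default-tenant-id=cluster \
    --objstore.config-file=/etc/thanos/thanos-s3.yaml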
Architecture
In this blog post, we are going to implement the architecture shown below, using Thanos v0.14.
A brief overview of the above architecture:
We have three Prometheus instances running in the sre, tenant-a, and tenant-b namespaces, respectively.
- The Prometheus in the sre namespace is demonstrated as a soft tenant; therefore, it does not set any additional HTTP headers on its remote write requests.
- The Prometheus instances in tenant-a and tenant-b are demonstrated as hard tenants. The NGINX servers in those respective namespaces are used to set the tenant header for the tenant Prometheus.
- From a security point of view, we only expose the Thanos receiver statefulset responsible for the soft tenant (the sre Prometheus).
- For both of the Thanos receiver statefulsets (soft and hard), we set a replication factor of 2. This ensures that the incoming data gets replicated between two receiver pods.
- Remote write requests received by the soft-tenant receiver instances are forwarded to the hard-tenant Thanos receivers. This routing is based on the hashring config.
The above architecture obviously misses a few features that one would expect from a multi-tenant architecture, e.g., tenant isolation and authentication. This blog post only focuses on how we can use the Thanos receiver to store time-series from multiple Prometheus instances to achieve multi-tenancy. The idea behind this setup is also to show how we can make Prometheus on the tenant side nearly stateless while maintaining data resiliency.
We will improve upon this architecture in the upcoming posts. So, stay tuned.
Prerequisites
Cluster Setup
Clone the repo:
git clone https://github.com/dsayan154/thanos-receiver-demo.git
Setup a Local KIND Cluster
cd local-cluster/
- Create the KIND cluster with calico, ingress, and extra-port mappings:
./create-cluster.sh cluster-1 kind-calico-cluster-1.yaml
- Deploy the NGINX ingress controller:
kubectl apply -f nginx-ingress-controller.yaml
Install Minio as Object Storage
kubectl create ns minio
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install --namespace minio my-minio bitnami/minio --set ingress.enabled=true --set accessKey.password=minio --set secretKey.password=minio123 --debug
- Add the following line to /etc/hosts:
127.0.0.1 minio.local
- Log in to http://minio.local/ with the credentials minio:minio123.
- From the UI, create a bucket named thanos.
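If you prefer the CLI to the UI, the same bucket can be created with the MinIO client (assuming mc is installed on your machine; on older mc releases, use mc config host add instead of mc alias set):
# "local" is an arbitrary alias name pointing at the Minio ingress from above.
mc alias set local http://minio.local minio minio123
mc mb local/thanos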
Install Thanos Components
Create Shared Components
kubectl create ns thanos
## Create a file thanos-s3.yaml containing the minio object storage config:
cat << EOF > thanos-s3.yaml
type: S3
config:
  bucket: "thanos"
  endpoint: "my-minio.minio.svc.cluster.local:9000"
  access_key: "minio"
  secret_key: "minio123"
  insecure: true
EOF
## Create a secret from the file created above, to be used with the Thanos components, e.g., store and receiver:
kubectl -n thanos create secret generic thanos-objectstorage --from-file=thanos-s3.yaml
kubectl -n thanos label secrets thanos-objectstorage part-of=thanos
## Go to the manifests directory:
cd manifests/
Install a Thanos Receive Controller
Deploy a thanos-receive-controller to auto-update the hashring configmap when the Thanos receiver statefulsets scale:
kubectl apply -f thanos-receiver-hashring-configmap-base.yaml
kubectl apply -f thanos-receive-controller.yaml
The deployment above will generate a new configmap, thanos-receive-generated, and keep it updated with the list of endpoints whenever a statefulset with the label controller.receive.thanos.io/hashring=hashring-0 and/or controller.receive.thanos.io/hashring=default gets created or updated. The Thanos receiver pods mount the thanos-receive-generated configmap.
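For intuition, the generated hashring file ends up with a shape roughly like the following (the exact endpoint names depend on the statefulset and service names in the manifests, so treat this as an illustrative shape, not literal output):
[
    {
        "hashring": "default",
        "tenants": [],
        "endpoints": [
            "thanos-receive-default-0.thanos-receive-default.thanos.svc.cluster.local:10901",
            "thanos-receive-default-1.thanos-receive-default.thanos.svc.cluster.local:10901"
        ]
    }
]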
Install the Thanos Receiver
kubectl apply -f thanos-receive-default.yaml
kubectl apply -f thanos-receive-hashring-0.yaml
The receiver pods are configured with 15 days of local retention and a replication factor of 2.
Create a service in front of the Thanos receiver statefulset for the soft tenants:
kubectl apply -f thanos-receive-service.yaml
The pods of the thanos-receive-default statefulset will load-balance the incoming requests to the other receiver pods based on the hashring config maintained by the thanos-receive-controller.
Install a Thanos Store
Create Thanos store statefulsets:
kubectl apply -f thanos-store-shard-0.yaml
We have configured the store such that the Thanos querier fans out queries to it only for data older than 2 weeks; data newer than 15 days is served by the receiver pods. Note: the 1-day overlap between the two time windows is intentional, for data resiliency.
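The store side of that time window boils down to a flag like the following (an illustrative excerpt; --max-time accepts durations relative to now, and the receiver's --tsdb.retention=15d provides the complementary window):
thanos store \
    --objstore.config-file=/etc/thanos/thanos-s3.yaml \
    --max-time=-2w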
Install a Thanos Querier
Create a Thanos querier deployment and expose it through service and ingress:
kubectl apply -f thanos-query.yaml
We configure the Thanos querier to connect to the receiver(s) and store(s) for fanning out queries.
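Under the hood, the querier only needs the gRPC StoreAPI endpoints of the receivers and stores; with DNS service discovery, that looks roughly like this (the service names are assumed to match the manifests, and both services are assumed to be headless with a named grpc port):
thanos query \
    --store=dnssrv+_grpc._tcp.thanos-receive.thanos.svc.cluster.local \
    --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local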
Install Prometheus(es)
kubectl create ns sre
kubectl create ns tenant-a
kubectl create ns tenant-b
Install Prometheus Operator and Prometheus
We install the Prometheus-operator and a default Prometheus to monitor the cluster.
helm upgrade --namespace sre --debug --install cluster-monitor stable/prometheus-operator \
  --set prometheus.ingress.enabled=true \
  --set prometheus.ingress.hosts[0]="cluster.prometheus.local" \
  --set prometheus.prometheusSpec.remoteWrite[0].url="http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive" \
  --set alertmanager.ingress.enabled=true \
  --set alertmanager.ingress.hosts[0]="cluster.alertmanager.local" \
  --set grafana.ingress.enabled=true \
  --set grafana.ingress.hosts[0]="grafana.local"
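The remoteWrite value above renders into the sre Prometheus configuration as a plain remote_write block; conceptually:
remote_write:
  - url: http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive
    # No tenant header is set here, so these requests fall under the soft
    # tenancy and get the default tenant ID attached.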
Install Prometheus and ServiceMonitor for Tenant-A
In the tenant-a namespace, deploy an NGINX proxy to forward the requests from Prometheus to the thanos-receive service in the thanos namespace. It also sets the tenant header on the outgoing requests:
kubectl apply -f nginx-proxy-a.yaml
kubectl apply -f prometheus-tenant-a.yaml
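The proxy itself can be as small as the following (an illustrative nginx.conf, not the literal manifest content; the header name must match the receiver's --receive.tenant-header flag, which defaults to THANOS-TENANT):
events {}
http {
  server {
    listen 19291;
    location / {
      # Inject the hard-tenant ID before forwarding the remote write request.
      proxy_set_header THANOS-TENANT tenant-a;
      proxy_pass http://thanos-receive.thanos.svc.cluster.local:19291;
    }
  }
}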
Install Prometheus and ServiceMonitor for Tenant-B
In the tenant-b namespace, deploy an NGINX proxy to forward the requests from Prometheus to the thanos-receive service in the thanos namespace. It also sets the tenant header on the outgoing requests (the same configuration as above, with tenant-b as the header value):
kubectl apply -f nginx-proxy-b.yaml
kubectl apply -f prometheus-tenant-b.yaml
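Note that each tenant Prometheus remote-writes to its local NGINX proxy rather than to thanos-receive directly; in the Prometheus operator CR that is conceptually an excerpt like the following (the service name nginx-proxy-b here is hypothetical, check the actual manifest):
spec:
  remoteWrite:
    - url: http://nginx-proxy-b.tenant-b.svc.cluster.local:19291/api/v1/receive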
Add Some Extra Localhost Aliases
Add the following lines to /etc/hosts:
127.0.0.1 minio.local
127.0.0.1 query.local
127.0.0.1 cluster.prometheus.local
127.0.0.1 tenant-a.prometheus.local
127.0.0.1 tenant-b.prometheus.local
The above will allow you to locally access Minio, the Thanos querier, the cluster-monitoring Prometheus, and the tenant-a and tenant-b Prometheus instances. We are also exposing Alertmanager and Grafana, but we don't need those for this demo.
Test the Setup
Access the Thanos querier at http://query.local/graph and, from the UI, execute the query count(up) by (tenant_id). We should see one series per tenant, matching the command-line output below.
Alternatively, if you have jq installed, you can run the following command:
curl -s -G http://query.local/api/v1/query --data-urlencode 'query=count(up) by (tenant_id)' | jq -r '.data.result[]|"\(.metric) \(.value[1])"'
{"tenant_id":"a"} 1
{"tenant_id":"b"} 1
{"tenant_id":"cluster"} 17
Either of the above outputs shows that the cluster, a, and b Prometheus tenants have 17, 1, and 1 scrape targets up and running, respectively. All of this data is being stored in the Thanos receivers in real time by Prometheus' remote write queue. This model creates an opportunity for the tenant-side Prometheus to be nearly stateless while still maintaining data resiliency.
In our next post, we will improve upon this architecture to enforce tenant isolation on the Thanos querier.