
How to Get Metrics for Advance Alerting to Prevent Trouble


Learn how being proactive and using alerts can help you prevent issues altogether and cut down on the need for troubleshooting.


Although we all have to deal with unexpected events, we also have tools to prevent them. As mentioned in the last post, log files must be accessible upfront; otherwise, troubleshooting is compromised. Before any issue occurs, there’s a lot we can do in order to be aware of what’s going on, act proactively, and keep the problem from becoming reality.

Most companies have already implemented a monitoring solution. Usually, my sysadmin friends are the people in charge of such solutions. If you have this responsibility, you know how difficult it is to gather all the metrics, show them in fancy dashboards, and properly send alerts to the people who must react when there’s evidence of trouble. Maybe, more often than you would like, you have to justify why some metric wasn’t considered, or wasn’t shown, or why some alert wasn’t sent. The bigger the monitoring service, the more likely this kind of situation is to happen.

Don’t let the task of avoiding problems become a problem itself. You can use open source tools to get a monitoring server ready to do the job. Once it’s up and running, you will be able to easily plug any other server into the monitoring service, with no need for an installed agent. In addition, you will be able to send alert notifications through instant messaging apps, like Slack and Telegram, instead of by email.

The solution combines InfluxDB, a high-performance time series database; Grafana, a time series analytics and monitoring tool; and Ansible, an agentless automation tool. With Ansible, it’s possible to constantly extract the servers’ hardware metrics and store them in the InfluxDB database. With Grafana, it’s possible to connect to the InfluxDB database, show the metrics in dashboards, define thresholds, and configure alerts. The solution can be checked out on GitHub, and the details are shown right below.

The Development Environment

The monitored environment was reproduced using local VirtualBox machines: one representing the monitoring server (monitor) and two others representing servers that could be plugged into the monitoring service (server1 and server2). Vagrant was used to manage this development environment. With the Vagrantfile below, it’s possible to smoothly turn on and provision the monitoring server by executing the command vagrant up monitor. Notice that the VMs server1 and server2 are also defined, but they can be booted up later if you want to plug just one, or both, into the monitoring service.

Vagrant.configure("2") do |config|
  config.vm.box = "minimal/trusty64"

  config.vm.define "monitor" do |monitor|
    monitor.vm.hostname = "monitor.local"
    monitor.vm.network "private_network", ip: "192.168.33.10"
    monitor.vm.provision "ansible" do |ansible|
      ansible.playbook = "playbook-monitor.yml"
    end
  end

  (1..2).each do |i|
    config.vm.define "server#{i}" do |server|
      server.vm.hostname = "server#{i}.local"
      server.vm.network "private_network", ip: "192.168.33.#{i+1}0"
    end
  end
end

The monitoring server provisioning is done by Ansible and is divided into two basic parts: the installation of the tools (InfluxDB, Grafana, and Ansible) and the configuration of the monitoring service. Notice that Ansible is used to install Ansible! The playbook-monitor.yml below shows that.

Besides, rather than putting all the tasks in one big file, each tool’s installation tasks were placed in a specific YML file, in order to keep the code clean, organized, and easy to understand. The grouped tasks can then be dynamically included in the main playbook through the include_tasks statement. A sketch of what one of those included files might look like is shown right after the playbook.

---
- hosts: monitor
  become: yes
  gather_facts: no
  tasks:
  - name: Install apt-transport-https (required for the apt_repository task)
    apt:
      name: apt-transport-https
      update_cache: yes
    tags:
      - installation
  - name: Install InfluxDB
    include_tasks: influxdb-installation.yml
    tags:
      - installation
  - name: Install Grafana
    include_tasks: grafana-installation.yml
    tags:
      - installation
  - name: Install Ansible
    include_tasks: ansible-installation.yml
    tags:
      - installation
  - name: Configure monitoring
    include_tasks: monitoring-configuration.yml
    tags:
      - configuration
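
The included installation files (influxdb-installation.yml, grafana-installation.yml, and ansible-installation.yml) aren’t reproduced in this post. As an illustration, here is a minimal sketch of what influxdb-installation.yml might look like, assuming the official InfluxData apt repository for Ubuntu Trusty; the repository URL, key location, and package name are assumptions, so check the GitHub repository for the actual tasks.

---
# Sketch only: the repository URL, key location, and package name are assumptions.
- name: Add the InfluxData apt signing key
  apt_key:
    url: https://repos.influxdata.com/influxdb.key
    state: present

- name: Add the InfluxData apt repository
  apt_repository:
    repo: deb https://repos.influxdata.com/ubuntu trusty stable
    state: present

- name: Install the InfluxDB package
  apt:
    name: influxdb
    update_cache: yes

- name: Make sure the InfluxDB service is started and enabled
  service:
    name: influxdb
    state: started
    enabled: yes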

The Monitoring Service Configuration

The monitoring service configuration is composed of a few steps, as shown in the monitoring-configuration.yml file below. First and foremost, the InfluxDB database, named monitor, is created. InfluxDB provides a very useful API which can be used for a variety of database operations, and the Ansible uri module is the best fit for interacting with web services. All the metrics extracted from the monitored servers are stored in the monitor database.

After that, the Grafana data source that connects to the InfluxDB database is created. That way, Grafana is able to access all the stored metrics data. Like InfluxDB, Grafana has an API which allows us to perform most, if not all, of the configuration through JSON-formatted content. Besides the data source, the Slack notification channel and the first dashboard are also created. Notice that, in order to keep the playbook idempotent and treat these tasks as successful when the playbook is executed again, response statuses other than 200 are accepted as well.
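
The monitor-datasource.json file referenced in the playbook isn’t reproduced in this post. Here is a minimal sketch of what it might contain, following Grafana’s data source API; the data source name and the isDefault flag are assumptions:

{
  "name": "influxdb-monitor",
  "type": "influxdb",
  "access": "proxy",
  "url": "http://localhost:8086",
  "database": "monitor",
  "isDefault": true
}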

The configured Slack notification channel points to a test Slack workspace. Of course, you can join it, but I’m pretty sure you will want to create your own and invite the troubleshooting folks to join. Don’t forget to create an incoming webhook in your Slack workspace and replace the JSON url field value with the generated webhook URL.
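
Likewise, a minimal sketch of what slack-notification-channel.json might look like, following Grafana’s alert notification API; the channel name is an assumption, and the url value is only a placeholder to be replaced with your own incoming webhook URL:

{
  "name": "slack-alerts",
  "type": "slack",
  "isDefault": true,
  "settings": {
    "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  }
}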

The initial dashboard shows the used memory percentage metric. Other metrics can be added to it, or you can create new dashboards as you wish. A threshold of 95% was defined, so you can visually tell when the metric exceeds that limit. An alert was also defined, and a notification is sent to the configured Slack channel when the last five metric values are greater than or equal to the 95% limit. The alert also sends a notification when the server’s health is restored.

With Ansible, you can perform tasks on several servers at the same time. This is possible because everything is done through SSH from a master host, even if it’s your own machine. Besides that, Ansible knows the target servers through the inventory file (/etc/ansible/hosts), where they are defined and grouped. During the monitoring service configuration, the group monitored_servers is created in the inventory file. Any server in this group is automatically monitored, so plugging a server into the monitoring service is as simple as adding a line to the file. The first monitored server is the monitoring server itself (localhost).
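
For illustration, here is roughly what /etc/ansible/hosts could look like after the configuration has run and one extra server has later been plugged in; the second entry is an assumption, based on the values prompted by playbook-add-server.yml shown further down:

[monitored_servers]
localhost ansible_connection=local
192.168.33.20 ansible_user=vagrant ansible_ssh_pass=vagrant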

In order to prevent Ansible from checking the SSH keys of the servers plugged into the monitoring service, it’s necessary to disable that default behavior in the Ansible configuration file (/etc/ansible/ansible.cfg). This way, Ansible won’t have problems collecting metrics from any new server through SSH.
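
The ini_file task in the configuration file below produces an entry like this in /etc/ansible/ansible.cfg:

[defaults]
host_key_checking = False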

Finally, an Ansible playbook (playbook-get-metrics.yml) is used to connect to all the monitored servers and extract the relevant metrics. It’s placed in the /etc/ansible/playbooks directory and configured in cron to be executed every minute. To sum up: every minute, the metrics are collected, stored, and shown, and in case of evidence of trouble, an alert is sent. Isn’t that awesome?
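
The cron task in the configuration file below creates an entry like this in root’s crontab; since Ansible’s cron module defaults every time field to *, the job runs every minute:

#Ansible: get metrics
* * * * * ansible-playbook /etc/ansible/playbooks/playbook-get-metrics.yml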

---
- name: Create the InfluxDB database
  uri:
    url: http://localhost:8086/query
    method: POST
    body: "q=CREATE DATABASE monitor"
- name: Create the Grafana datasource
  uri:
    url: http://localhost:3000/api/datasources
    method: POST
    user: admin
    password: admin
    force_basic_auth: yes
    body: "{{lookup('file','monitor-datasource.json')}}"
    body_format: json
  register: response
  failed_when: response.status != 200 and response.status != 409
- name: Create the Slack notification channel
  uri:
    url: http://localhost:3000/api/alert-notifications
    method: POST
    user: admin
    password: admin
    force_basic_auth: yes
    body: "{{lookup('file','slack-notification-channel.json')}}"
    body_format: json
  register: response
  failed_when: response.status != 200 and response.status != 500
- name: Create the Grafana dashboard
  uri:
    url: http://localhost:3000/api/dashboards/db
    method: POST
    user: admin
    password: admin
    force_basic_auth: yes
    body: "{{lookup('file','used_mem_pct-dashboard.json')}}"
    body_format: json
  register: response
  failed_when: response.status != 200 and response.status != 412
- name: Add localhost to Ansible inventory
  blockinfile:
    path: /etc/ansible/hosts
    block: |
      [monitored_servers]
      localhost ansible_connection=local
- name: Disable SSH key host checking
  ini_file:
    path: /etc/ansible/ansible.cfg
    section: defaults
    option: host_key_checking
    value: False
- name: Create the Ansible playbooks directory if it doesn't exist
  file:
    path: /etc/ansible/playbooks
    state: directory
- name: Copy the playbook-get-metrics.yml
  copy:
    src: playbook-get-metrics.yml
    dest: /etc/ansible/playbooks/playbook-get-metrics.yml
    owner: root
    group: root
    mode: 0644
- name: Get metrics from monitored servers every minute
  cron:
    name: "get metrics"
    job: "ansible-playbook /etc/ansible/playbooks/playbook-get-metrics.yml"

Collecting the Metrics

The playbook-get-metrics.yml file below is responsible for extracting all the important metrics from the monitored_servers and storing them in the monitor database. Initially, the only extracted metric is the used memory percentage, but you can easily start extracting more metrics by adding tasks to the playbook; a sketch of one possible extra task is shown right after the playbook.

Notice that the InfluxDB write API is used to store the metric in the monitor database: 192.168.33.10 is the IP address of the monitoring server, and 8086 is the port InfluxDB listens on. The used memory percentage is stored under the key used_mem_pct, and you must choose an appropriate key for each metric you start to extract.

By default, Ansible collects information about the target host as an initial step before executing the tasks. The collected data is then available to be used by the tasks. The hostname (ansible_hostname) is one of these facts, essential for differentiating the server from which the metric is extracted. The used memory percentage is also calculated from two facts gathered by Ansible: the used real memory in megabytes (ansible_memory_mb.real.used) and the total real memory in megabytes (ansible_memory_mb.real.total). If you want to see all of this data, execute the command ansible monitor -m setup -u vagrant -k -i hosts, and type vagrant when prompted for the SSH password. Notice that the information is JSON-formatted, and the values can be accessed through dot notation.

---
- hosts: monitored_servers
  tasks:
  - name: Used memory percentage
    uri:
      url: http://192.168.33.10:8086/write?db=monitor
      method: POST
      body: "used_mem_pct,host={{ansible_hostname}} value={{ansible_memory_mb.real.used / ansible_memory_mb.real.total * 100}}"
      status_code: 204
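
As an illustration of adding another metric, here is a hedged sketch of an extra task that would also store the used disk percentage of the root filesystem, computed from Ansible’s ansible_mounts facts. The measurement name used_disk_pct is an assumption, and this task isn’t part of the original repository.

  - name: Used disk percentage of the root filesystem
    uri:
      url: http://192.168.33.10:8086/write?db=monitor
      method: POST
      # the value is derived from the mount facts Ansible gathers for each filesystem
      body: "used_disk_pct,host={{ansible_hostname}} value={{(1 - item.size_available / item.size_total) * 100}}"
      status_code: 204
    # only write the metric for the root filesystem
    when: item.mount == '/'
    with_items: "{{ansible_mounts}}"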

Plugging a Server Into the Monitoring Service

You’ve probably already executed the command vagrant up monitor in order to get the monitoring server up and running. If not, do it right now. It takes some time, depending on how fast your Internet connection is. You can follow the output and see each step of the server provisioning.

When it finishes, open your browser and access the Grafana web application by typing the URL http://192.168.33.10:3000. The user and the password to log in are the same: admin. Click on the used_mem_pct dashboard link and take a look at the values for the monitoring server in the line chart. You may need to wait a few minutes until there are enough values to track.

OK, you may now want to plug another server into the monitoring service and see its values in the line chart, too. So, turn on server1, for example, by executing the command vagrant up server1. After that, execute the Ansible playbook below through the command ansible-playbook playbook-add-server.yml -u vagrant -k -i hosts. The -u argument defines the SSH user, the -k argument prompts for the password (vagrant, too), and the -i argument points to the hosts file, where the monitoring server is defined.

You will be prompted for the new server’s IP address and SSH credentials, in order to enable Ansible to connect to it. That’s enough to plug the server into the monitoring service, simply by inserting a line into the monitoring server’s /etc/ansible/hosts file. The next time cron executes playbook-get-metrics.yml, one minute later, server1 will also be considered a monitored server, so its metrics will be extracted, stored, and shown in the dashboard, too.

---
- hosts: monitor
  become: yes
  gather_facts: no
  vars_prompt:
  - name: "host"
    prompt: "Enter host"
    private: no
  - name: "user"
    prompt: "Enter user"
    private: no
  - name: "password"
    prompt: "Enter password"
    private: yes
  tasks:
  - name: Add the server into the monitored_servers group
    lineinfile:
      path: /etc/ansible/hosts
      insertafter: "[monitored_servers]"
      line: "{{host}} ansible_user={{user}} ansible_ssh_pass={{password}}"

Conclusion

Among the variety of monitoring solutions, the one just described aims to be cheap, flexible, and easy to implement. Some benefits of adopting it are:

  • The solution does not require installing an agent on every monitored server, taking advantage of Ansible’s agentless nature;

  • The solution stores all the metrics data in InfluxDB, a high-performance time series database;

  • The solution centralizes the data presentation and the alerts configuration in Grafana, a powerful data analytics and monitoring tool.

I hope this solution can solve at least one of the pain points in your monitoring tasks. Experiment with it, improve it, and share it as you wish.

Finally, if you want my help in automating something, please give me more details and tell me about your problem. It may be a problem someone else has, too.

Topics:
ansible, grafana, influxdb, ubuntu, vagrant, sysadmin, devops, infracode, performance, monitoring

Published at DZone with permission of Gustavo Carmo.

