DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Optimized Metrics Generation With Metadata-Driven Dynamic SQL
  • Understanding Prometheus Metric Types: A Guide for Beginners
  • Confusion Matrix vs. ROC Curve: When to Use Which for Model Evaluation
  • Building Product to Learn AI, Part 3: Taste Testing and Evaluating

Trending

  • Navigating Double and Triple Extortion Tactics
  • Intro to RAG: Foundations of Retrieval Augmented Generation, Part 1
  • Designing for Sustainability: The Rise of Green Software
  • Mastering Advanced Traffic Management in Multi-Cloud Kubernetes: Scaling With Multiple Istio Ingress Gateways

Troubleshooting Percona Monitoring and Management (PMM) Metrics

If Percona's PMM tool isn't working quite the way it should be, this article should provide some clarity and behind-the-scenes education.

By 
Peter Zaitsev user avatar
Peter Zaitsev
·
Jan. 24, 18 · Tutorial
Likes (2)
Comment
Save
Tweet
Share
14.0K Views

Join the DZone community and get the full member experience.

Join For Free

With any luck, Percona Monitoring and Management (PMM) works for you out of the box. Sometimes, however, things go awry and you see empty or broken graphs instead of dashboards full of insights.

Before we go through troubleshooting steps, let's talk about how data makes it to the Grafana dashboards in the first place. The PMM Architecture documentation page helps explain it:

If we focus just on the "Metrics" path, we see the following requirements:

  • The appropriate "exporters" (Part of PMM Client) are running on the hosts you're monitoring
  • The database is configured to expose all the metrics you're looking for
  • The hosts are correctly configured in the repository on PMM Server side (stored in Consul)
  • Prometheus on the PMM Server side can scrape them successfully — meaning it can reach them successfully, does not encounter any timeouts and has enough resources to ingest all the provided data
  • The exporters can retrieve metrics that they requested (i.e., there are no permissions problems)
  • Grafana can retrieve the metrics stored in Prometheus Server and display them

Now that we understand the basic requirements let's look at troubleshooting the tool.

PMM Client

First, you need to check if the services are actually configured properly and running.

Second, you can also instruct the PMM client to perform basic network checks. These can spot connectivity problems, time drift and other issues.

If everything is working, next we can check if exporters are providing the expected data directly.

Checking Prometheus Exporters

Looking at the output from pmm-admin check-network, we can see the "REMOTE ENDPOINT". This shows the exporter address, which you can use to access it directly in your browser:

You can see MySQL Exporter has different sets of metrics for high, medium, and low resolution, and you can click on them to see the provided metrics:

There are few possible problems you may encounter at this stage:

  • You do not see the metrics you expect to see. This could be a configuration issue on the database side (docs for MySQL and MongoDB), permissions errors or exporter not being correctly configured to expose the needed metrics.
  • Page takes too long to load. This could mean the data capture is too expensive for your configuration. For example, if you have a million tables, you probably can't afford to capture per-table data.

mysql_exporter_collector_duration_seconds is a great metric that allows you to see which collectors are enabled for different resolutions, and how much time it takes for a given collector to execute. This way you can find and potentially disable collectors that are too expensive for your environment.

Let's look at some more advanced ways to troubleshoot exporters.

Looking at ProcessList
root@rocky:/mnt/data# ps aux | grep mysqld_exporter
root      1697  0.0  0.0   4508   848 ?        Ss    2017   0:00 /bin/sh -c
/usr/local/percona/pmm-client/mysqld_exporter -collect.auto_increment.columns=true
-collect.binlog_size=true -collect.global_status=true -collect.global_variables=true
-collect.info_schema.innodb_metrics=true -collect.info_schema.processlist=true
-collect.info_schema.query_response_time=true -collect.info_schema.tables=true
-collect.info_schema.tablestats=true -collect.info_schema.userstats=true
-collect.perf_schema.eventswaits=true -collect.perf_schema.file_events=true
-collect.perf_schema.indexiowaits=true -collect.perf_schema.tableiowaits=true
-collect.perf_schema.tablelocks=true -collect.slave_status=true
-web.listen-address=10.11.13.141:42002 -web.auth-file=/usr/local/percona/pmm-client/pmm.yml
-web.ssl-cert-file=/usr/local/percona/pmm-client/server.crt
-web.ssl-key-file=/usr/local/percona/pmm-client/server.key >>
/var/log/pmm-mysql-metrics-42002.log 2>&1


This shows us that the exporter is running, as well as specific command line options that were used to start it (which collectors were enabled, for example).

Checking out Log File
root@rocky:/mnt/data# tail /var/log/pmm-mysql-metrics-42002.log
time="2018-01-05T18:19:10-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:11-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:12-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:492"
time="2018-01-05T18:19:12-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:12-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:616"
time="2018-01-05T18:19:13-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:14-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:15-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:16-05:00" level=error msg="Error pinging mysqld: dial unix
/var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
2018/01/06 09:10:33 http: TLS handshake error from 10.11.13.141:56154: tls: first record does not look like a TLS handshake


If you have problems such as authentication or permission errors, you will see them in the log file. In the example above, we can see the exporter reporting many connection errors (the MySQL Server was down).

Prometheus Server

Next, we can take a look at the Prometheus Server. It is exposed in PMM Server at /prometheus path. We can go to Status->Targets to see which Targets are configured and if they are working: correctly

In this example, some hosts are scraped successfully while others are not. As you can see I have some hosts that are down, so scraping fails with "no route to host." You might also see problems caused by firewall configurations and other reasons.

The next area to check, especially if you have gaps in your graph, is if your Prometheus server has enough resources to ingest all the data reported in your environment. Percona Monitoring and Management ships with the Prometheus dashboard to help to answer this question (see demo).

There is a lot of information in this dashboard, but one of the most important areas you should check is if there is enough CPU available for Prometheus:

The most typical problem to have with Prometheus is getting into "Rushed Mode" and dropping some of the metrics data:

Not using enough memory to buffer metrics is another issue, which is shown as "Configured Target Storage Heap Size" on the graph:

Values of around 40% of total memory size often make sense. The details how to tune this setting.

If the amount of memory is already configured correctly, you can explore upgrading to a more powerful instance size or reducing the number of metrics Prometheus ingests. This can be done either by adjusting Metrics Resolution (as explained in ) or disabling some of the collectors ( ).

You might wonder which collectors generate the most data? This information is available on the same Prometheus Dashboard:

While these aren't not exact values, they correlate very well with what the load collectors generate. In this case, for example, we can see that the Performance Schema is responsible for a large amount of time series data. As such, disabling its collectors can reduce the Prometheus load substantially.

Hopefully, these troubleshooting steps were helpful to you in diagnosing PMM's metrics capture. In a later blog post, I will write about how to diagnose problems with Query Analytics (Demo).

Metric (unit)

Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Optimized Metrics Generation With Metadata-Driven Dynamic SQL
  • Understanding Prometheus Metric Types: A Guide for Beginners
  • Confusion Matrix vs. ROC Curve: When to Use Which for Model Evaluation
  • Building Product to Learn AI, Part 3: Taste Testing and Evaluating

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!