Common Issues With Performance and Monitoring
The most common issues are the need to improve visibility, ease of use, and performance, and the need to understand the impact code has on the UX.
To gather insights on the state of performance optimization and monitoring today, we spoke to 12 executives from 11 companies that provide performance optimization and monitoring solutions for their clients.
Here's what they told us when we asked, "What are the most common issues you see affecting performance optimization and monitoring?"
From a network-centric perspective, you need to recognize both congestion and latency in the network. Congestion may mean network links maxed out by traffic, or it may surface through quality-of-service monitoring. It’s challenging but necessary to look at all traffic: are packets being dropped or lost? For latency, recognize problems anywhere in the chain: the path, the connection, the server layer within the app, or behind the server in the database, storage, or client. Find the problem, isolate it, and break it down as quickly as possible so you can address the root cause. The network presents a nice starting point because everything is converged and connected.
Incomplete visibility throughout the pipeline is a problem. It's bad when monitoring and optimization tools are not fully deployed and metrics do not reach all the way to the endpoint. It doesn’t matter where in the stack something is going wrong. There’s also an organizational challenge: monitoring is not a priority until there’s a problem. This forces you to be unnecessarily reactive rather than proactive.
It’s important to look at metrics to determine which ones matter most and to keep them humane. Close the loop and determine what the issue is. Over-sensitivity about which metrics to monitor and how to prioritize their resolution decreases morale and retention. DevOps brings a better experience for developers.
Avoiding outages and maintaining availability. DDoS attacks are larger and more frequent, and they target a broader range of businesses and sites, so DDoS is top of mind. Another issue is monitoring code changes to ensure they are not degrading the user experience (UX). Is it an error in the code or in the network that is making user transactions slower and less efficient?
There's a battle between hardware and software teams over how to solve the problem. Do you do it with more hardware or with software? Engineering time is expensive. You need to do both. Move toward higher-level development tools.
People don’t understand distributed computing and parallel processing. Most users are business intelligence analysts and data scientists who know SQL and Python but not how Hadoop and Spark work. We must remove the technical complexity for end users. We help with the configuration settings to optimize performance. We enable people to write good programs on big data stacks.
With multi-tenancy, many users are running on the same stack. Spark is memory intensive, so you need to know how to allocate resources across departments and jobs. Understand what apps require, analyze the workload, and allocate access to resources as needed. This is a matter of multi-tenant allocation and capacity planning. The end user can set the size of the container and the priority based on seeing the resources needed.
The common issue in our case was failing services during load testing. Once that happened, all the remaining testing was pretty much useless, whoever or whatever was making the requests (be it a load test, a REST client, or an application). The same set of services is heavily used by the robots, which are developed for the client and programmed to communicate with the services and each other. In such a setup, it is not always immediately clear how to reproduce the chain of actions leading to a failure, or where to look to pinpoint the exact source of the problem. Again, as mentioned before, the mean (average) response time, like the rest of the "averages," is not nearly as transparent and informative as the 50th percentile. Initially, Amazon CloudWatch only supported the averages, but after we reported the issue, they added the 50th percentile (as well as other arbitrary percentiles) to the monitoring metric selection.
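The gap between the mean and the percentiles can be sketched in a few lines of Python. This is only an illustration: the response times are made-up numbers, and the `percentile` helper is a hypothetical nearest-rank implementation, not how CloudWatch or any other monitoring product computes its statistics.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: 95 fast requests and 5 very slow ones.
response_times_ms = [100] * 95 + [5000] * 5

mean_ms = statistics.mean(response_times_ms)   # pulled upward by the 5 outliers
p50_ms = percentile(response_times_ms, 50)     # what a typical request experienced
p99_ms = percentile(response_times_ms, 99)     # what the slowest requests experienced

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

With this data the mean is 345 ms, a value no request actually experienced, while the 50th percentile shows the typical 100 ms request and the 99th percentile exposes the 5000 ms tail.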
The biggest challenge has been in improving ease of use through analytics so that IT operators do less data interpretation and can focus more on remediation. Monitoring systems that can quickly and accurately identify root causes of problems and emergent issues are essential. The lack of true real-time monitoring in many solutions prevents them from seeing all performance issues and, perhaps more importantly, from identifying the signals that lead up to them. Monitoring systems that only poll the infrastructure every two to five minutes are inadequate for business-critical applications and operations.
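A toy simulation shows why a multi-minute polling interval can miss a problem entirely. Everything here is assumed for illustration: the latency series is invented, and the `poll` helper models a poller that reads one instantaneous value per interval, which is a simplification of how real monitoring agents collect and aggregate data.

```python
def poll(series, interval_s):
    """Return the readings a poller would see if it sampled once every interval_s seconds."""
    return series[::interval_s]


# Hypothetical hour of per-second latency readings (ms) with a 90-second spike.
latency_ms = [20] * 3600
for second in range(1000, 1090):
    latency_ms[second] = 900

one_second_view = poll(latency_ms, 1)      # near real-time collection
five_minute_view = poll(latency_ms, 300)   # polls at 0 s, 300 s, 600 s, ...

print("max seen at 1 s resolution:", max(one_second_view))
print("max seen at 5 min resolution:", max(five_minute_view))
```

In this sketch the spike falls between two polls, so the five-minute view reports a steady 20 ms while the one-second view sees the 900 ms peak.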
Things that usually affect performance:
- Bad use of database indexes (or a lack of them).
- Wrong setup of the JVM in a prod environment.
- Using the same server side configuration for dev, test, and prod.
- Incorrect setup of prod hardware. For example, not putting enough memory on your servers could be a disastrous choice. However, overprovisioning could also be a poor choice. Again, understanding your product and load is crucial. Do load tests and pay attention to the performance graphs. Based on your observations, you will know what to do next.
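To illustrate the first item in the list above, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the index name are hypothetical, and other databases expose their query plans differently; the point is only that the same query switches from a full table scan to an index lookup once a suitable index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

lookup = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index, SQLite has to scan every row to find matching customers.
plan_without_index = query_plan(conn, lookup, (42,))

# With an index on the filtered column, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn, lookup, (42,))

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan varies between SQLite versions, but the first plan should report a scan of `orders` and the second should mention `idx_orders_customer`.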
With developers, the two common issues I see are polar opposites. On one end, developers do not understand the performance impact of their code, especially how it will perform at scale. On the opposite end, there is the issue of developers optimizing prematurely, which may lead to less readable code and difficult bugs. Since optimization is a trade-off, the extreme ends of this trade-off are usually problematic. With DevOps, the issue I most commonly see is that the metrics for a service are not clearly defined before it goes live, and nobody defines them until an issue appears. We try to deal with it at JFrog by making sure that no service is released without clear monitoring definitions (that are automatically applied on deployment).
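One way to picture a "no release without monitoring definitions" rule is a small check in the deployment pipeline. The sketch below is purely hypothetical and is not JFrog's actual mechanism: the manifest layout, the `REQUIRED_MONITORING_FIELDS` names, and both functions are assumptions made for illustration.

```python
# Hypothetical fields every service manifest must define before release.
REQUIRED_MONITORING_FIELDS = ("metrics", "alerts", "dashboard")


def missing_monitoring_fields(manifest):
    """List the required monitoring fields that are absent or empty in a manifest."""
    monitoring = manifest.get("monitoring") or {}
    return [field for field in REQUIRED_MONITORING_FIELDS if not monitoring.get(field)]


def can_release(manifest):
    """A service may be released only when every monitoring field is defined."""
    return not missing_monitoring_fields(manifest)


# Hypothetical manifests for two services.
ready_service = {
    "name": "checkout",
    "monitoring": {
        "metrics": ["request_latency_ms", "error_rate"],
        "alerts": ["error_rate > 1% for 5m"],
        "dashboard": "checkout-overview",
    },
}
unready_service = {"name": "search", "monitoring": {"metrics": ["request_latency_ms"]}}

for service in (ready_service, unready_service):
    print(service["name"], can_release(service), missing_monitoring_fields(service))
```

A pipeline step could fail the build whenever `can_release` returns `False`, so the gap is caught before the service goes live rather than after the first incident.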
We need to make sure we are monitoring the right metrics and presenting the data in the right way. As the perceived product experts, it’s sometimes easy to think we know everything we should be monitoring, but it’s important to set that aside and talk to the end users to qualify those beliefs and check them in the real world.
By the way, here's who we spoke to!
Josh Gray, Chief Architect, Cedexis.
Jeff Bishop, General Manager, ConnectWise Control.
Bryan Jenks, CEO and Co-Founder, DropLit.io.
Doru Paraschiv, Co-Founder, IRON Sheep TECH.
Yoav Landman, Co-Founder and CTO, JFrog.
Jim Frey, V.P. Strategic Alliances, Kentik.
Eric Sigler, Head of DevOps, PagerDuty.
Nick Kephart, Senior Director Product Marketing, ThousandEyes.
Kunal Agarwal, CEO, Unravel Data.
Len Rosenthal, CMO, Virtual Instruments.
Alex Rysenko, Lead Software Engineer, and Eugene Abramchuk, Senior Performance Engineer, Waverley Software.