Know When (and When Not) to Blame Your Network
What do you do when it's difficult to pull apart whether a performance issue is application or network related? The reality is that some of the same techniques you likely already use to monitor application experience can also help with network experience.
Nick Kephart leads Product Marketing at ThousandEyes, which develops Network Intelligence software, where he reports on Internet health and digs into the causes of outages that impact important online services. Prior to ThousandEyes, Nick worked to promote new approaches to cloud application architectures and automation while at cloud management firm RightScale.
Most of us are familiar with APM tools. They are broadly deployed; a recent Gartner survey showed APM being used to monitor more than 25% of applications in 40% of enterprises. APM instruments our code to make it easier to troubleshoot application-related issues. You can trace transactions, check SQL performance, and track error rates. But sometimes your users complain about performance and your APM shows everything is swell. That’s when it’s typically time to push the ticket over to the network team.
APM tools tend to be pretty opaque as to whether the network is at fault, and they also aren’t always well suited to the end-user and cloud environments you are increasingly seeing. The same Gartner survey found that a majority of APM users believe that current APM solutions are challenged by the prevalence of cloud-hosted applications and the Internet of Things (IoT).
So in this more distributed environment, where it’s already difficult to pull apart whether a performance issue is application or network related, what do you do? The reality is that some of the same techniques you likely already use to monitor application experience can also help with network experience. Getting better visibility into application delivery may not be as hard as it seems.
Seeing Application and Network As One
Active (or synthetic) monitoring is most associated with understanding page load and user transaction timings. But it can also help you understand when an issue is network-related, and when it isn’t, so you can be confident when assigning your team to look into a problem.
Active monitoring can give you insight into the performance of networks and infrastructure, as well as your application. And, in addition to your data center, it works in cloud environments and across the Internet, where many of your applications are hosted and where your customers are clicking away on your app. That way, you can see network and application data lined up right next to each other: not just average latencies, but in-depth information about how each portion of the network between your app and your users is performing. Most active monitoring tools will give you perspectives both from Internet locations and from within your own infrastructure, so you can use this technique for customer-facing or internal-facing applications.
How It Works
In addition to browser-level information such as page load and transaction timings, active monitoring can also send packets across the network that are specifically instrumented to give you a clear understanding of performance. By synchronizing these network measurements with browser-level data, you can get a ton of useful data about how your app is being accessed, not just loaded.
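To make the idea of an instrumented probe concrete, here is a minimal sketch in Python. It is not how any particular monitoring product works; it simply times repeated TCP handshakes, which roughly approximates one network round trip each, and counts failed connects as loss. The function name `tcp_probe` and its parameters are assumptions made for illustration.

```python
import socket
import statistics
import time


def tcp_probe(host, port, count=5, timeout=1.0):
    """Open `count` TCP connections and time each handshake.

    Returns loss and latency figures (milliseconds). A refused,
    unreachable, or timed-out connect is counted as a lost probe.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    rtts_ms = []
    for _ in range(count):
        start = time.perf_counter()
        try:
            # The handshake completes inside create_connection.
            with socket.create_connection((host, port), timeout=timeout):
                rtts_ms.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            pass  # treat any connection failure as loss
    lost = count - len(rtts_ms)
    return {
        "sent": count,
        "lost": lost,
        "loss_pct": 100.0 * lost / count,
        "min_ms": min(rtts_ms) if rtts_ms else None,
        "median_ms": statistics.median(rtts_ms) if rtts_ms else None,
        "max_ms": max(rtts_ms) if rtts_ms else None,
    }


# Example call (the host name is a placeholder):
# print(tcp_probe("app.example.com", 443, count=10))
```

Run on a schedule from several locations and stored next to page load timings, even a simple probe like this lets you see whether a slow page coincided with a jump in loss or handshake time.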
A Stack Trace for Your Network Connectivity
This isn’t just round-trip loss and latency measurements to the web server. You can measure loss and latency to each layer 3 hop between your users and your web server or CDN edge. Plus, you can measure both the forward and reverse path, so you can understand performance of both downloads and uploads between your app and users. That’s huge!
Why? Well, within your app you can solve problems a lot faster when you have an exact stack trace, right? It’s the same with the network. By knowing which points in the network are having issues, you can much more quickly triage the issue and even tell who’s responsible. All of a sudden you can answer questions such as:
- Is your cloud provider or hosted data center providing the throughput you expect?
- Is your IaaS provider having an outage, either regionally or more widespread?
- Is your CDN correctly caching and serving your content from an optimal edge location?
- Is your DNS service efficiently serving up DNS queries to your users?
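The DNS question in particular is easy to start measuring yourself. The sketch below times a single name resolution through the operating system's resolver. The function name `time_dns_lookup` and the injectable `resolver` parameter are assumptions for illustration, and because the OS may cache answers, a single fast result says little on its own.

```python
import socket
import time


def time_dns_lookup(hostname, resolver=socket.getaddrinfo):
    """Time one name resolution; return (ok, elapsed_ms, addresses).

    `resolver` defaults to the system resolver, so a cached answer
    will look much faster than a cold lookup.
    """
    start = time.perf_counter()
    try:
        results = resolver(hostname, None)
    except OSError:
        # socket.gaierror (resolution failure) is a subclass of OSError.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return False, elapsed_ms, []
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Each result is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    addresses = sorted({entry[4][0] for entry in results})
    return True, elapsed_ms, addresses


# Example call (the host name is a placeholder):
# ok, ms, addrs = time_dns_lookup("app.example.com")
```

Tracking the failure rate and the response time of calls like this over time gives you both DNS availability and DNS latency from wherever the probe runs.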
As any DevOps Engineer knows, a lot can go wrong in a lot of places and exhibit strange behavior. For example, traffic between countries in Asia often peers in the United States or Singapore given the congested links with China, Vietnam, and the Philippines. The same thing happens in Latin America, Africa, and the Middle East. Adjusting routing or other network configurations can dramatically speed up app delivery in these cases.
Having the equivalent of a stack trace for your network will make it possible to detect issues you may not know exist and to fix problems fast as they arise.
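As a rough illustration of reading such a "stack trace," the sketch below takes per-hop measurements that have already been collected and points at the hop where trouble appears to start. Everything here is an assumption for illustration: the `Hop` record, the 5% threshold, and the sample path, which uses reserved documentation addresses. One detail reflects common practice when reading traceroute-style output: loss at an intermediate hop that does not carry through to later hops is usually a router rate-limiting its replies, so only loss that persists to the destination is flagged.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hop:
    index: int             # position in the path, 1 = first hop
    address: str           # hop address (sample values are made up)
    sent: int              # probes sent to this hop
    received: int          # replies received from this hop
    avg_latency_ms: float  # average round-trip time to this hop

    @property
    def loss_pct(self) -> float:
        if self.sent == 0:
            return 0.0
        return 100.0 * (self.sent - self.received) / self.sent


def first_lossy_hop(path: List[Hop], threshold_pct: float = 5.0) -> Optional[Hop]:
    """Return the first hop from which loss persists to the end of the path."""
    candidate = None
    for hop in path:
        if hop.loss_pct >= threshold_pct:
            if candidate is None:
                candidate = hop
        else:
            candidate = None  # loss did not carry forward, so reset
    return candidate


def largest_latency_jump(path: List[Hop]) -> Optional[Tuple[Hop, float]]:
    """Return (hop, delta_ms) for the biggest latency increase over the prior hop."""
    best = None
    previous_ms = 0.0
    for hop in path:
        delta = hop.avg_latency_ms - previous_ms
        if best is None or delta > best[1]:
            best = (hop, delta)
        previous_ms = hop.avg_latency_ms
    return best


# Hypothetical sample path, for illustration only.
SAMPLE_PATH = [
    Hop(1, "192.0.2.1", 50, 50, 1.2),
    Hop(2, "198.51.100.7", 50, 40, 9.8),   # 20% loss, but next hop is clean
    Hop(3, "198.51.100.33", 50, 50, 11.0),
    Hop(4, "203.0.113.9", 50, 44, 85.5),   # loss starts here and persists
    Hop(5, "203.0.113.80", 50, 43, 87.0),
]
```

With the sample data, `first_lossy_hop(SAMPLE_PATH)` returns hop 4 rather than hop 2, and `largest_latency_jump(SAMPLE_PATH)` also points at hop 4. Mapping that address to its owner is what tells you whether to call your ISP, your CDN, or your own network team.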
Key Application Delivery Metrics
So what will you do with this sort of data? First, you can start collecting key metrics about the delivery of your application that you may not currently have at your disposal. Here are some of the most important metrics to keep an eye on:
- Page Load and Transaction Time: A standard metric in many APM tools, this can provide a good performance baseline.
- Object Size (wire and uncompressed): The size of objects on the wire can vary widely and is important to your app’s sensitivity to throughput constraints.
- Object Errors and Load Time: Most apps and webpages have objects coming from a variety of third-party locations and CDNs. Understand whether availability of one object is causing part of your app to fail.
- Web/App Server Availability and Response Time: Most likely a metric you’re already tracking, but a key one to correlate with network connectivity metrics to understand outages.
- Loss per Interface: By tracking loss per interface, you can easily correlate network connectivity issues with specific service providers and devices.
- Latency per Link: With a link-level view, you can understand which portions of your CDN, ISP, or data center networks are congested or faulty.
- Throughput per Location: Understanding throughput by ISP, region, and city can inform decisions about how fast bulky objects can be loaded by users.
- CDN Latency: Measure performance from users to edge locations as well as your origin to CDN ingestion servers.
- DNS Availability and Response Time: It doesn’t go wrong often, but when it does, you’re hosed. Keep an eye on your DNS service provider.
- Routing Path Changes: Keeping a pulse on routing changes can ensure that you know if there is network instability or suboptimal routing.
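Some of these metrics are simple to derive once you store the raw measurements. As one example, routing path changes can be detected by comparing two snapshots of the hops between a probe and your app. The sketch below uses Python's standard `difflib` module; the function name `path_changes` and the sample addresses are assumptions for illustration.

```python
from difflib import SequenceMatcher


def path_changes(previous, current):
    """Compare two hop lists and return the segments that differ.

    Each item is (tag, old_hops, new_hops), where tag is one of
    "replace", "delete", or "insert".
    """
    matcher = SequenceMatcher(a=previous, b=current, autojunk=False)
    return [
        (tag, previous[i1:i2], current[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical snapshots taken an hour apart.
earlier = ["192.0.2.1", "198.51.100.7", "198.51.100.33", "203.0.113.80"]
later = ["192.0.2.1", "198.51.100.90", "198.51.100.91", "203.0.113.80"]
changes = path_changes(earlier, later)
```

An empty result means the path is stable. A burst of non-empty results in a short window is the kind of instability worth lining up against your page load and error metrics.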
Adding Active Monitoring to Your Arsenal
Active monitoring can save you from huge headaches. One major payment processor that I’ve worked with had multiple senior engineers spend an entire holiday weekend trying to track down what they thought was a database transaction fault. Another team had just started deploying active monitoring in their environment and, upon reviewing the data, was able to track the problem to a routing issue that was causing unstable network connectivity. Upon seeing the data, the application development team became instant converts to adding active monitoring into the runbook for issue resolution.
As your applications increasingly rely on IaaS, microservices, and APIs from far-flung parts of the Internet, your app is more reliant on the network than ever. That means in order to have a complete view of application experience, you should be adding active network monitoring to your application troubleshooting arsenal. With this data, your development team can avoid dead ends and be more confident the next time you need to ask the network team to dive into an issue.