The following is from an interview I conducted with Avi Freedman, co-founder and CEO of Kentik.
1. How does network performance management (NPM) differ from application performance management (APM)?
NPM and APM apply separate but complementary monitoring perspectives to ensure an optimal user or customer experience (CX) for applications.
The APM side applies reporting and analytics to the following application dimensions:
- End-user experience monitoring in terms of metrics like average response time under peak load.
- Discovery and modeling of runtime application architecture.
- Monitoring health and activity of application components and the messaging stacks that connect them.
- Profiling transactions between distributed, modular application components and services.
The NPM side addresses the network and Internet’s role in the end-user experience (UX). This includes metrics such as:
- Latency--the time it takes to get a response to a packet, measured bidirectionally. One direction looks at when a local host such as an application or load balancing server (like HAProxy or NGINX) sends a packet to a remote host and times how long it takes to get a response back. The other direction looks at when a packet is received from a remote host and measures how long it takes for the application server to send a response.
- Number and percentage of out-of-order packets. This is an important measure because TCP can’t pass data up to applications until the bytes are in the right order. Small numbers of out-of-order packets typically don’t affect things much, but when the percentage gets too high, application performance suffers.
- TCP retransmits. When a portion of a network path is overloaded or having performance problems, it may drop packets. TCP ensures delivery of data by using ACKs to signal that data has been received. If a sender doesn’t get a timely ACK from the receiver, it resends the packet carrying the unacknowledged TCP segment. When TCP retransmits exceed even low single-digit percentages, application performance starts to degrade.
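To make the last two metrics concrete, here is a minimal Python sketch (not Kentik code; the flow data is hypothetical) that derives out-of-order and retransmit percentages from the sequence numbers observed on a single TCP flow:

```python
# Illustrative only: real NPM tools work from packet captures or flow
# records, not a plain list of sequence numbers.
def flow_metrics(seq_numbers):
    """Return (out_of_order_pct, retransmit_pct) for one captured flow.

    A segment counts as a retransmit if its exact sequence number was
    already seen; it counts as out of order if it arrives with a lower
    sequence number than the highest one seen so far.
    """
    seen = set()
    highest = -1
    out_of_order = retransmits = 0
    for seq in seq_numbers:
        if seq in seen:
            retransmits += 1
        elif seq < highest:
            out_of_order += 1
        seen.add(seq)
        highest = max(highest, seq)
    total = len(seq_numbers)
    return (100.0 * out_of_order / total, 100.0 * retransmits / total)

# Eight segments: 300 arrives after 400 (out of order), 200 is resent.
pcts = flow_metrics([100, 200, 400, 300, 500, 200, 600, 700])
```

With one out-of-order arrival and one retransmit in eight segments, both percentages come out to 12.5--already past the "low single digits" where applications start to feel it.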
2. Don’t companies need both to ensure the optimal customer experience (CX)?
Absolutely. Both APM and NPM are important for ensuring CX or UX. They provide related but highly complementary viewpoints for recognizing the onset of problems and identifying root causes.
3. How has the cloud changed NPM?
To properly answer this question, we need to define what we mean by ‘cloud.’ For NPM purposes, cloud refers to the move to distributed application architectures, where components are no longer all resident on the same server but are spread across networks and (sometimes) the Internet, and accessed via API calls. The days of monolithic applications running in a datacenter, serving users located strictly in campuses and branch offices connected over a strictly private WAN, are not quite over yet, but that is the architecture of the past. The new cloud reality means that NPM must measure, and allow engineers to flexibly combine and analyze, performance metrics, traffic flow, Internet routing/path and geolocation data to understand the network and Internet’s role in application performance.
4. What “real world” problems are you solving?
Many of our customers run digital business operations that make money by serving ads, delivering gaming experiences or supporting e-commerce transactions. If you are such an organization, maybe you’ve achieved sufficient scale that you are running your own datacenters and have built out enough internet connectivity and points of presence in various geographical regions to act as a private content delivery network (CDN). You’ve got a very complex network infrastructure whose purpose is to deliver great user experience.
With Kentik’s NPM solution, you can deploy the nProbe agent onto load balancing servers like HAProxy or NGINX. The nProbe agent software measures key performance metrics--TCP retransmits, latency between remote hosts and the application host, out-of-order packets and fragments--by regularly examining every packet in selected traffic flows. This information is sent in flow record format to the Kentik Detect big data back-end, where it is ingested, augmented with BGP path and geolocation data, and combined with millions to billions of other traffic flow records from network infrastructure devices like routers and switches. Together, all that data can be used to alert on anomalies.
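As a rough illustration of the enrich-and-alert step described above, here is a hedged Python sketch. The field names, lookup tables and the 2% retransmit threshold are assumptions for demonstration, not Kentik’s actual schema or pipeline:

```python
# Hypothetical flow-record enrichment: add geo and BGP context, then flag
# flows whose retransmit percentage crosses a (assumed) alert threshold.
RETRANSMIT_ALERT_PCT = 2.0  # low single digits already degrade apps

def enrich_and_check(record, geo_lookup, bgp_lookup):
    record = dict(record)  # don't mutate the caller's copy
    record["src_geo"] = geo_lookup.get(record["src_ip"], "unknown")
    record["bgp_path"] = bgp_lookup.get(record["dst_ip"], [])
    pct = 100.0 * record["retransmits"] / max(record["packets"], 1)
    record["retransmit_pct"] = pct
    record["alert"] = pct > RETRANSMIT_ALERT_PCT
    return record

# Example addresses are from documentation ranges; AS numbers are made up.
flow = {"src_ip": "203.0.113.5", "dst_ip": "198.51.100.9",
        "packets": 1000, "retransmits": 45}
enriched = enrich_and_check(flow, {"203.0.113.5": "US"},
                            {"198.51.100.9": [64500, 64501]})
```

A 4.5% retransmit rate on this flow trips the alert, and the attached geo and BGP path context is what lets an engineer start asking where in the path the loss occurs.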
From there, engineers can quickly determine whether real communications issues exist in the network. They can rapidly pivot through huge volumes of unsummarized network data to look at volumetric flows, congestion points in the infrastructure, internet paths, etc. With all these analytical options, they can discern whether problems are specific to a particular transit ISP connection. Or there may be an internal congestion point caused by a misconfiguration, so the communication issue lies between the application servers and where the traffic exits to the internet. The faster operators and engineers can find the cause, the faster they can shift to doing something about the problem. If one particular transit provider or ISP is having issues, engineers can shift where traffic is delivered from, perhaps to a different datacenter or via a different internet connection.
Another use case might be an enterprise IT group that has outsourced application loads to a couple different IaaS clouds. In many cases, that has meant they really don’t have much visibility or control over performance. However, by deploying host agents selectively they can start to take performance measurements and begin to plan more sophisticated network architecture approaches to improve control and outcomes for UX. For example, a growing number of IaaS providers allow you to request a direct connection to their network. By using a combination of traffic flow, internet path and performance monitoring data, you can start to create a more sophisticated connectivity scheme and gain better control over the network.
The important thing in both examples is that with a rich set of data and the ability to ask key questions in an extremely flexible manner, and get answers in seconds rather than minutes or hours, network operators and engineers can move past laborious analysis to faster action for restoring UX.
5. What are the most common issues infrastructure and operations (I&O) companies face as they migrate to the cloud?
Migration to the cloud and to digital business operations--where you’re not just serving up applications to internal users, but where network traffic is about external revenue--fundamentally changes both the importance of controlling the flow of network traffic and the business criticality of the user experience. One common issue I&O leaders have to adjust to is that cloud application performance may be more business critical and thus have more “owners” within the organization. As these applications increasingly support digital business initiatives with a direct connection to revenue generation, there are more line-of-business owners who are not only interested--they may be directly funding the initiative and infrastructure. This isn’t a technical issue, obviously, but it means that performance management will be under even more pressure and scrutiny than usual.
On the more technical level, many I&O leaders have not created a network monitoring strategy to deal with these new patterns of traffic. In fact, in many cases, I&O leaders don’t even have an inventory of their cloud and digital business components and all their API-driven dependencies. So, a smart move for I&O leaders is to draw up holistic plans around their digital operations that take these API and cloud dependencies into account, and to start prioritizing monitoring and analytics investments that deal with these realities.
6. Where are the greatest opportunities for improvement in NPM?
I really think taking NPM forward to address cloud and digital operations realities is one of the most important opportunities right now, which is why we’ve focused on it. Gartner analyst Sanjit Ganguli recently wrote a research note entitled “Network Performance Monitoring Tools Leave Gaps in Cloud Monitoring” which addresses this very point.
Beyond addressing cloud and digital operations specifically, NPM and network monitoring tools in general need to catch up with cloud-scale and big data technologies. Most NPM tools are based on appliance architectures that retain mid-1990s assumptions about the scarcity of compute and storage, which means that the way traditional NPM tools treat data is reductive. In other words, they tend to throw away a lot of detail and retain only summaries. I&O leaders shouldn’t accept the anachronistic assumption that network data details are too voluminous to retain and analyze rapidly.
7. What does the future hold for I&O companies, and end users/customers?
While cloud and big data are not exactly new technologies in and of themselves, they are the future in terms of I&O monitoring such as NPM. Cloud and big data economies of scale mean that the industry should expect a much higher level of detail available and analyzable at scale, across what were previously considered inviolable data siloes. I&O leaders should think of their infrastructure and network data details like a retail operations executive thinks of point of sale (PoS) data in terms of its availability and value.
8. What do developers need to keep in mind to optimize network performance?
As with any optimization problem, remember the critical importance of metrics. First instrument your code so that you have a good baseline. Now start changing things: Trade CPU for bandwidth with compression. Use persistent connections to avoid initialization overhead. Parallelize the tx/rx operation to take advantage of high bandwidth links. Can you make all network operations asynchronous? How does this affect your failure modes? Trade non-deterministic behavior for performance: remember you can use UDP if you are OK with its drawbacks. Talking locally? Use a Unix domain socket. Talking over a WAN or the Internet? Use a tool like Kentik Detect and the Kentik nProbe Host Agent to track performance and figure out the best routing paths to avoid buggy or congested intermediary networks.
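The “trade CPU for bandwidth with compression” tip can be sketched in a few lines of Python; the payload and compression level here are arbitrary illustrations, and real savings depend entirely on how repetitive your data is:

```python
import json
import zlib

# A repetitive JSON payload, typical of telemetry or status reporting.
payload = json.dumps(
    [{"host": f"web{i}", "status": "ok"} for i in range(500)]
).encode()

# Level 6 is zlib's default middle ground between CPU cost and ratio;
# higher levels spend more CPU for marginally smaller output.
compressed = zlib.compress(payload, level=6)
ratio = len(compressed) / len(payload)

# Fewer bytes cross the network, at the cost of compress/decompress CPU
# on each end--measure both before deciding it's a win for your workload.
assert zlib.decompress(compressed) == payload  # lossless round trip
```

The same instrument-first advice applies here: capture the baseline transfer time and CPU usage before enabling compression, or you won’t know whether the trade paid off.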