Last week, I had the opportunity to speak with a large fast food chain with over 1,000 locations in North America. Revenue is split between customers who dine in and those who order take-out. Today they rely on an operations-oriented, reactive monitoring solution. The problem is that with somewhat aging infrastructure and reactive, mostly backend-focused instrumentation, they don't learn that a store was offline until the next day. The annual revenue loss runs into seven figures and, more importantly, customers too often arrive to pick up their order and find a puzzled and weary staff. Not good.
When it comes to protecting revenue and brand reputation in production, there are many options and solutions to consider. One way to frame them is as a quadrant defined by the following two axes:
Proactive (Synthetic) vs. Real User Monitoring (RUM): We won't go into an extensive comparison here, but in short: in the former, you set up a lab with devices and browsers, then define user flows and schedules. You proactively execute these flows and alert on any outage or degradation in responsiveness. In the latter, the application code is instrumented so that measurements are taken while real users use the app. You learn about user behavior, analytics, and so on, but you depend on users going through a poor experience before you know something is wrong. Again, this is not a comprehensive comparison, but it is enough for now. Examples of vendors in this space include Perfecto and CatchPoint for the former; dynaTrace, AppDynamics, and NewRelic are well known for the latter.
Front end to back end perspectives: One could place "sensors" at each point of the application, especially on the backend (cloud or physical), covering servers, load balancers, and service APIs, as well as on the front end (web or native). Note that the proliferation of third-party content and services (in native apps and websites) complicates the problem: it is challenging to instrument a third-party backend that you do not own. Those really have to be monitored in a different fashion.
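To make the proactive (synthetic) side of the first axis concrete, here is a minimal sketch in Python of what a single scheduled check might look like. It is not how any of the vendors above implement it; the names `run_synthetic_check`, `CheckResult`, and `http_flow`, as well as the 2-second threshold, are assumptions made purely for illustration.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    status: str       # "ok", "slow", or "down"
    elapsed_s: float
    detail: str = ""


def run_synthetic_check(
    name: str,
    flow: Callable[[], None],
    slow_after_s: float = 2.0,                   # assumed responsiveness threshold
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    """Execute one scripted user flow and classify the outcome.

    The flow is any callable that raises an exception when it fails.
    """
    start = clock()
    try:
        flow()
    except Exception as exc:
        # Any failure of the flow is treated as an outage.
        return CheckResult(name, "down", clock() - start, repr(exc))
    elapsed = clock() - start
    status = "slow" if elapsed > slow_after_s else "ok"
    return CheckResult(name, status, elapsed)


def http_flow(url: str, timeout_s: float = 10.0) -> Callable[[], None]:
    """Build a very simple flow: fetch one page and expect HTTP 200.

    A real flow would drive a browser or device through several steps.
    """
    def flow() -> None:
        # urlopen raises on connection errors, timeouts, and 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            if resp.status != 200:
                raise RuntimeError(f"unexpected HTTP status {resp.status}")
    return flow
```

A scheduler (cron, or a simple loop) would call `run_synthetic_check` every few minutes for each flow and raise an alert whenever the status is not "ok". The point of the design is that the check runs whether or not a real customer is using the app at that moment.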
In terms of "what to monitor," I recommend asking: "if you had $1, where would you put it?" Let's take two options:
1. Monitor the end user experience (if you do that, you really are monitoring the entire app architecture): measure the availability and responsiveness of the website and native app(s). Note that if you spend your $1 here, you may not have health visibility readily available for each individual component of the architecture.
2. Monitor a component in the app architecture (let's say one of the central servers or service APIs), and not the end user experience.
In my view, option 1 has the benefit of taking ownership of the #1 metric of the business: the end user experience. Even if the team has to scramble to find the root cause of a website outage, knowing about the outage is the top priority. Option 2 means that you know some server or service is misbehaving, but you don't know the impact. Further, when the user sees an outage, it may not be related to what you're monitoring. A good example, again, is third-party content: if your native app relies on a third-party check-scanning service that is experiencing an outage, your app will fail and you won't know (because you're not monitoring the end user experience).
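The difference between the two options can be illustrated with a toy model in Python. This is only a sketch of the reasoning above, not a monitoring tool: the component names, the health values, and both helper functions are hypothetical.

```python
from typing import Dict, List


def user_flow_ok(dependencies: List[str], health: Dict[str, bool]) -> bool:
    """Option 1: the end user flow succeeds only if everything it touches is up."""
    return all(health.get(dep, False) for dep in dependencies)


def monitored_components_ok(monitored: List[str], health: Dict[str, bool]) -> bool:
    """Option 2: only the components you chose to instrument are checked."""
    return all(health.get(component, False) for component in monitored)


# Hypothetical state: your own infrastructure is healthy, but the
# third-party check-scanning service is experiencing an outage.
health = {
    "load_balancer": True,
    "service_api": True,
    "check_scanning_vendor": False,
}

# The flow the customer actually runs depends on all three.
deposit_flow = ["load_balancer", "service_api", "check_scanning_vendor"]

# With option 2, only your own backend components are watched.
monitored = ["load_balancer", "service_api"]

component_view = monitored_components_ok(monitored, health)  # looks healthy
end_user_view = user_flow_ok(deposit_flow, health)           # flow is broken
```

In this scenario the component view reports that everything is fine while the end user view reports a failure, which is exactly the blind spot described above.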