Are You and Your Systems Prepared for Disruption?
What keeps you up at night? We all know that the next significant local or global event is just around the corner. The question is: could the next geopolitical or macroeconomic event drive a surge in usage of your applications? In recent weeks, there have been many stories on how events such as Brexit, Pokemon Go, and Amazon Prime Day have caused application failures. If that happened to your organization, what would the impact be?
Or perhaps you are worried about scenarios that apply to your current state: your infrastructure is reaching end-of-life support, your mean time to failure is trending in the wrong direction, you have issues with 3rd party vendor services, or deployments of faulty code or errant changes are an ongoing problem. How prepared are your critical business systems, knowing that you are only as strong as your weakest link? You may want or need to do something, but where do you start?
The following are six key areas of focus to evaluate your current system(s) for mitigating risks surrounding unexpected events:
- Non-functional requirements (NFRs) should be clearly defined and documented. If they are not, start by answering the following:
- What are your performance, scalability, and availability requirements at any time during any situation?
- Are they documented in a format that is clear and verifiable via testing and production metrics?
- Are they comprehensive, covering all business services?
- Are they known and agreed to by all of IT, 3rd party services vendors, and the business owners?
- Do you have SLAs in place for any 3rd party services vendors?
- System awareness – this is a detailed, documented understanding of your systems specifying the services used by the key business processes and the infrastructure used to operate these services. It includes the interdependencies between system components and their fault tolerance details. Having and maintaining this information is critical to triage efforts in times of crisis. The following are key questions to start a comprehensive system inventory:
- What services are used to support each business process?
- Do you know what hardware supports each of these services?
- What are the change management and configuration management processes?
- How does the architecture satisfy the NFRs?
- Were performance budgets defined?
- What are all the key components, and what are the scalability plan and the high availability plan for each?
- If one key service slows down, how does it impact the rest of the system? Does everything slow down or is it contained?
- What is your weakest link?
- Application Performance Management (APM) – You cannot manage what you do not measure. APM provides in-depth insight into your users' experience and shows where the problems in your system are. Start by asking:
- Do you have a plan for measuring your production performance? If so, can you determine the system bottleneck or root cause of an issue?
- Are you satisfying your NFRs in production?
- Are you able to proactively identify and mitigate risk?
- Do you have insight into user satisfaction?
- Do you know user abandonment rates?
- What system and application errors are occurring and at what rate?
- What is the weakest link?
- Performance Testing & Validation Processes – Your performance test efforts must mimic reality and identify issues before production deployment. If not, then you need to ask the following:
- Do you have a comprehensive and accurate workload model?
- Does your load automation implementation properly simulate this workload?
- Do you have proper performance test environment operational controls and test execution processes in place?
- Do you have a suitable performance test environment that is ideally identical to production or at minimum logically equivalent?
- Do you have representative test data in terms of composition, volume, and physical layout?
- Are you monitoring and measuring all the right things during your tests?
- Do you ever cross-verify performance test (PT) results with actual production observations?
- Can you reproduce production issues in PT?
- Have you determined the weakest link in your system?
- Production performance issue remediation. This is usually more of a people and process challenge than a technical one. Most organizations have capable individuals but poorly performing teams. If issues do arise:
- Do you have the right processes, people, tools, and 3rd party vendor support in place to quickly identify and resolve issues?
- What is your weakest link in your current production support model?
- Capacity Planning – The process for determining current infrastructure needs and planning for future growth.
- Do you know how much headroom you have with current systems?
- Do you know the trends in system workload and system resource utilization?
- How quickly can you add new capacity?
- Does your architecture allow for horizontal or vertical scaling? Does it scale linearly? It likely does not, so at what rate will it scale?
- What is the tipping point, the weakest link in the system?
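The headroom and trend questions in the capacity planning checklist can be approximated in a few lines of Python. This is a minimal sketch, not a method from the article: the names `CapacityEstimate`, `fit_slope`, and `estimate_capacity`, the sample data, and the 80% utilization limit are all hypothetical, and the projection assumes utilization grows linearly. As noted above, real systems rarely scale linearly, so treat the result as a first approximation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CapacityEstimate:
    headroom_pct: float                # percentage points left before the limit
    growth_per_period: float           # fitted change in utilization per period
    periods_to_limit: Optional[float]  # None if utilization is flat or falling


def fit_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their index (0, 1, 2, ...)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def estimate_capacity(
    utilization_pct: Sequence[float],
    limit_pct: float = 80.0,  # assumed safe ceiling, not a universal rule
) -> CapacityEstimate:
    """Estimate headroom and time to the limit from evenly spaced samples."""
    slope = fit_slope(utilization_pct)
    headroom = limit_pct - utilization_pct[-1]
    if headroom <= 0:
        periods: Optional[float] = 0.0  # already at or past the limit
    elif slope > 0:
        periods = headroom / slope      # linear extrapolation (an assumption)
    else:
        periods = None                  # no growth, so no projected exhaustion
    return CapacityEstimate(headroom, slope, periods)


# Hypothetical weekly peak CPU utilization, in percent.
weekly_peak_cpu = [50.0, 52.0, 54.0, 56.0, 58.0, 60.0]
estimate = estimate_capacity(weekly_peak_cpu)
print(estimate)
```

Comparing `periods_to_limit` with how long it takes you to add new capacity gives a quick signal of whether the current trend leaves enough lead time.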
Yes, it is a lot of work and can be overwhelming. However, being unprepared is much worse and could have a disastrous impact on your business. As a first step in addressing the focus areas above, you must get senior management support, since achieving your performance goals involves the whole enterprise, both business and IT. Another important step is managing expectations. Performance improvement is a continuous and evolving process. You are never done. There is no big-bang approach after which everything is fixed. So avoid analysis paralysis, attack it on multiple fronts, and just start now.
Published at DZone with permission of Christopher Griffith, DZone MVB. See the original article here.