Preparation for the Failure
Are you prepared to fail?
Predicting software failures in a production environment is critical, in addition to ensuring quality during the development stage. Failures can happen in many ways; predicting them up front and having a solution ready for each one is a smart approach that keeps you ahead in the race.
The preparation for failure should start from the initial stage of software design and carry on to the development and testing cycles. We must keep questioning our decisions in all these stages about the probability of failures and associated solutions.
If you are prepared for failures, you can tackle them confidently in production. Otherwise, they can have a big impact on customers.
Let's look at a few preparations that help you recover from failures in production.
New Features Release
Important aspects to consider during the design stage are:
- The end customers and their patterns of software usage are at the center of the requirements.
- The architecture and design are flexible to extend based on user feedback and production escalations.
- The architecture and design allow scaling both horizontally and vertically based on the load.
- The test data is designed based on the end customer. Avoid using arbitrary data for testing.
We must anticipate possible failures during new feature development and log the crucial information that will help us detect and fix production issues. We must also log the data that led to an exception. It should be possible to assess the health of the feature just by looking at the logs. It is important to give enough attention to instrumentation during the development stage of a feature.
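As an illustration of logging the data that led to an exception, here is a minimal sketch using Python's standard `logging` module. The logger name `enrollment` and the function `parse_quantity` are hypothetical and stand in for any feature code; adapt the idea to whatever logging framework your application uses.

```python
import logging

# Hypothetical logger name for the feature being instrumented.
logger = logging.getLogger("enrollment")


def parse_quantity(raw_value: str) -> int:
    """Convert user input to a positive integer, logging the input on failure."""
    try:
        quantity = int(raw_value)
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        return quantity
    except ValueError:
        # logger.exception records the stack trace; the message keeps
        # the exact input that caused the failure.
        logger.exception("parse_quantity failed for raw_value=%r", raw_value)
        raise
```

The exception is re-raised after logging so that callers still see the failure; the log entry only adds the context needed to reproduce it.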
All new feature development must consider the size and type of the data generated by production users. A few examples are:
- Consider the latency of loading the data and implement loading ("thinking") indicators.
- Implement paging techniques to handle latency issues
- All data capturing fields are validated as per the business rules.
Test data should also cover the above points so that they are taken care of during the development stage.
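The paging idea can be sketched as follows. This is a simplified in-memory version; the function name `get_page` and the default page size are assumptions made for illustration, and a real application would usually push the paging down into the database query.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def get_page(rows: Sequence[T], page: int, page_size: int = 50) -> List[T]:
    """Return one page of rows; pages are numbered starting at 1."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be at least 1")
    start = (page - 1) * page_size
    # Slicing past the end is safe: the last page is simply shorter.
    return list(rows[start:start + page_size])
```

Testing this with a data set as large as the one production users generate shows whether the chosen page size keeps latency acceptable.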
Data Alerts for Data Corruption
Set up an alerting mechanism that scans the data in the relevant database tables and reports any anomalies through email alerts. It helps us detect data corruption. The better approach is to leave no room for data corruption in the first place; for the remaining corner cases, we should set up data-scanning alerts.
Another way to detect data corruption is to check the application logs for CRUD operation failures and data constraint-related exceptions. Such exceptions should be isolated and should trigger alerts when they occur in production.
It is important to understand the timing of a new feature release into production. For example, in the education field, we must be very careful about releasing new features in the middle of the semester. If the features are not well accepted or understood by the users, they might create a lot of confusion that leads to more production tickets.
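A data scan of this kind could look like the sketch below, which uses Python's built-in `sqlite3` module. The `grades` table, its columns, and the rule that scores lie between 0 and 100 are all assumptions made for illustration; substitute your own tables and business rules. Sending the email is left out, since it depends on your mail setup.

```python
import sqlite3


def find_anomalies(conn: sqlite3.Connection) -> list:
    """Return rows of a hypothetical `grades` table that violate the rules."""
    query = """
        SELECT id, student_id, score
        FROM grades
        WHERE student_id IS NULL   -- orphaned record
           OR score < 0            -- below the assumed valid range
           OR score > 100          -- above the assumed valid range
        ORDER BY id
    """
    return conn.execute(query).fetchall()
```

A scheduled job could run such a scan periodically and send an alert whenever the returned list is not empty.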
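Isolating such exceptions from the application logs can be sketched as a simple pattern scan. The marker strings in the pattern below are assumptions; the actual text depends on your database driver and logging format.

```python
import re
from typing import Iterable, List

# Assumed markers of constraint or CRUD failures; adjust to your own logs.
DATA_ERROR_PATTERN = re.compile(
    r"IntegrityError|constraint violation|foreign key", re.IGNORECASE
)


def find_data_errors(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like data constraint failures."""
    return [line.rstrip("\n") for line in log_lines if DATA_ERROR_PATTERN.search(line)]
```

The function accepts any iterable of lines, so an open log file can be passed to it directly.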
The preparation tasks before the release are:
- The production support team should be trained on the new features so that they can answer customer questions on new features.
- New-feature FAQs must be prepared and ready to guide users. Guiding customers through FAQs, on-screen help, and training sessions makes users comfortable adopting the new features and helps ensure the success of the release.
Good production monitoring is a must for a new release. Setting it up should be one of the pre-deployment tasks of a release. We must proactively monitor the following after a production release.
- Production usage: It is important to keep an eye on production usage, such as the number of active users, newly onboarded users, and recurring users. For monitoring product usage, we should use tools like Google Analytics. We must ensure that the bounce rate is low for a new feature of our product. Other important usage statistics should be monitored as well.
- Application exceptions: After the release, all new exceptions should be monitored and their causes proactively investigated. It helps find issues in the new features before a customer reports them. It also gives us a chance to prepare a fix quickly, which minimizes the impact on customers.
- Application performance monitoring: Using APM tools like New Relic and AppDynamics, monitor the response time, throughput, and error rate. Also monitor the slowest pages and transactions. It gives you an idea of the possible failures during peak usage. You should collect these performance pain points and address them before they become urgent.
Infrastructure alerting system: It mainly monitors hardware resources such as CPU, disk usage, and network connectivity. There are many tools available in the market, like Zenoss, Monza, AppDynamics, and New Relic. You can choose one of them based on cost.
Application health alerting system: It mainly monitors the response time, throughput, and error rate of the application. These can be monitored using tools like AppDynamics or New Relic.
For both kinds of monitoring, we should set up email alerts that fire whenever a parameter crosses its threshold.
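Tools such as the ones named above provide threshold alerts out of the box. Purely to illustrate the logic, a threshold check can be sketched as below; the metric names and limits are hypothetical, and delivering the resulting messages by email is left to your own mail setup.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Threshold:
    metric: str   # hypothetical metric name, e.g. "cpu_percent"
    limit: float  # alert when the sampled value exceeds this limit


def find_violations(sample: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alert message per metric that exceeds its threshold."""
    messages = []
    for threshold in thresholds:
        value = sample.get(threshold.metric)
        # Metrics missing from the sample are skipped rather than treated as violations.
        if value is not None and value > threshold.limit:
            messages.append(f"{threshold.metric}={value} exceeds limit {threshold.limit}")
    return messages
```

Each returned message could become the body of an alert email.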
Setting up monitoring and alerts for software abuse is a crucial part of preparing for failure. Some of the factors to consider are:
- Monitor the user agent field in your IIS logs. Ensure there are no unexpected user agents.
- Keep an eye on the type of data posted to the application using logging techniques.
- Set immediate alerts for any sudden surge in traffic.
- Check all repeated exceptions related to authentication, data retrieval, and CRUD operations. All these exceptions should trigger immediate alerts.
- Monitor for unexpected geographical usage of your software using tools like Google Analytics.
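As one example from the list above, detecting a sudden surge in traffic can be sketched by comparing the latest request count with the average of the preceding samples. The factor of 3 is an arbitrary assumption; tune it to your own traffic pattern.

```python
from typing import Sequence


def is_traffic_surge(requests_per_minute: Sequence[int], factor: float = 3.0) -> bool:
    """Return True when the latest sample exceeds `factor` times the earlier average."""
    if len(requests_per_minute) < 2:
        return False  # not enough history to form a baseline
    history = requests_per_minute[:-1]
    latest = requests_per_minute[-1]
    baseline = sum(history) / len(history)
    return latest > factor * baseline
```

With a baseline of about 100 requests per minute, a jump to 400 would be flagged, while normal fluctuation to 120 would not.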
The above practices keep you agile in reacting to production failures. They place you ahead of failures, help you take appropriate steps to avoid them or recover from them quickly, and reduce the amount of damage.