Preparation for the Failure

Are you prepared to fail?

by Lokesh Raj · Jan. 18, 19 · Presentation

Beyond ensuring quality during development, it is critical to predict how software can fail in a production environment. Failures can happen in many ways; predicting them up front and having a solution ready for each one puts you ahead in the race.

Preparation for failure should start at the initial stage of software design and carry on through the development and testing cycles. At every stage, we must keep questioning our decisions: how likely is a failure, and what is the solution if it happens?

If you are prepared for failures, you can tackle them in production with confidence. Otherwise, they can have a big impact on customers.

Let’s look at a few preparations that help you recover from failures in production.

New Features Release

Design

Important aspects to consider during the design stage are:

  1. The end customers and their patterns of software usage are at the center of the requirements.
  2. The architecture and design are flexible enough to be extended based on user feedback and production escalations.
  3. The architecture and design allow scaling both horizontally and vertically based on the load.
  4. The test data is designed based on the end customer. Avoid using arbitrary data for testing.

Application Logging

We must envisage the possibility of failures during new feature development and log the crucial information that will help us detect or fix production issues. We must also log the data that led to an exception. By looking at the logs, we should be able to assess the health of the feature. It is important to give enough attention to instrumentation during the development stage of a feature.
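As a minimal sketch of logging the data that led to an exception, the Python example below wraps a feature call and records its inputs on failure. The `enroll_student` function, its parameters, and the `enrollment` logger name are hypothetical and used only for illustration; they do not come from any specific product.

```python
import logging

logger = logging.getLogger("enrollment")  # hypothetical feature name


def enroll_student(student_id, course_id, seats_left):
    """Hypothetical feature code: enroll a student in a course."""
    if seats_left <= 0:
        raise ValueError("no seats left")
    return {"student_id": student_id, "course_id": course_id}


def enroll_with_logging(student_id, course_id, seats_left):
    """Run the feature and log the inputs that led to any exception."""
    try:
        result = enroll_student(student_id, course_id, seats_left)
    except Exception:
        # logger.exception logs at ERROR level and appends the stack trace;
        # the message carries the data that triggered the failure.
        logger.exception(
            "enroll failed: student_id=%s course_id=%s seats_left=%s",
            student_id, course_id, seats_left,
        )
        return None
    logger.info("enroll ok: student_id=%s course_id=%s", student_id, course_id)
    return result
```

Logging the inputs alongside the stack trace makes it possible to reproduce a production failure without asking the customer what they entered. Sensitive fields should be masked before they are logged.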

Data

All new feature development must consider the size and the type of the data generated by production users. A few examples are:

  1. Consider the latency of loading the data and show loading ("thinking") indicators while the user waits.
  2. Implement paging techniques to handle latency issues.
  3. Validate all data-capture fields against the business rules.

Test data should also reflect the above points so that they are taken care of during the development stage.
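The paging technique mentioned above can be sketched as follows. This is an illustrative in-memory version in Python; the `get_page` name and the default page size of 20 are assumptions. In a real application the paging would usually be pushed down into the database query, so that only one page of rows is loaded at a time.

```python
def get_page(records, page, page_size=20):
    """Return one page of records plus the total page count.

    Pages are numbered from 1. A page past the end yields an empty list.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    # Ceiling division: 45 records at 20 per page gives 3 pages.
    total_pages = (len(records) + page_size - 1) // page_size
    start = (page - 1) * page_size
    return records[start:start + page_size], total_pages
```

Testing this with production-sized record counts, rather than a handful of rows, is what reveals whether the page size and latency are acceptable.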

Data Alerts for Data Corruption

Set up an alerting mechanism that scans the data in the relevant database tables and reports any anomalies through email alerts. This helps us detect data corruption. The better approach is to leave no room for data corruption in the first place; the data-scanning alerts then cover the corner cases that slip through.

Another way to detect data corruption is to check the application logs for CRUD operation failures and exceptions related to data constraints. Such exceptions should be isolated and should trigger an alert when they occur in production.
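A data scan of this kind could look roughly like the sketch below, written in Python against SQLite. The `orders` table, its columns, and the two rules (positive quantity, email containing an `@`) are assumptions made up for the example; a real scan would encode the business rules of the actual schema and run on a schedule.

```python
import sqlite3


def find_anomalies(conn):
    """Return ids of rows in the (hypothetical) orders table that break the rules."""
    rows = conn.execute(
        "SELECT id FROM orders "
        "WHERE quantity <= 0 "
        "   OR customer_email IS NULL "
        "   OR customer_email NOT LIKE '%_@_%' "
        "ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def build_alert_body(anomalous_ids):
    """Return the text of an alert, or None when the data is clean."""
    if not anomalous_ids:
        return None
    ids = ", ".join(str(i) for i in anomalous_ids)
    return f"{len(anomalous_ids)} anomalous row(s) in orders: {ids}"
```

The alert text can then be handed to whatever email or paging mechanism the team already uses.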

Production Release

It is important to choose the timing of a new feature release into production carefully. For example, in the education field, we must be very careful about releasing new features in the middle of a semester. If the features are not well accepted or understood by the users, they can create a lot of confusion, which leads to more production tickets.

The preparation tasks before the release are:

  1. The production support team should be trained on the new features so that they can answer customer questions on new features.
  2. FAQs for the new features must be prepared and ready before the release. Together with on-screen help and training sessions, they guide customers so that they are comfortable adopting the new features, which helps ensure the success of the release.

Production Monitoring

Good production monitoring is a must for any new release, and setting it up should be one of the pre-deployment tasks. After a production release, we must proactively monitor the following:

  1. Production usage: Keep an eye on figures such as the number of users, newly onboarded users, and recurring users. Tools like Google Analytics can be used to monitor product usage. We must ensure that the bounce rate for a new feature of our product stays low, and consider any other usage statistics that are important to monitor.
  2. Application exceptions: After the release, monitor all new exceptions and proactively investigate their causes. This helps us find issues in the new features before a customer reports them. It also gives us a chance to prepare a fix quickly, which minimizes the impact on customers.
  3. Application performance monitoring: Using APM tools like New Relic and AppDynamics, monitor response time, throughput, and error rate, as well as the slowest pages and transactions. This gives you an idea of the possible failures during peak usage. Collect these performance pain points and address them before they become urgent.
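To make the three performance figures above concrete, here is a small Python sketch that derives them from raw request records. It is not how New Relic or AppDynamics work internally. The record shape `(path, duration_ms, status_code)` and the choice to count only 5xx responses as errors are assumptions for this example.

```python
from collections import defaultdict
from statistics import mean


def summarize(requests, window_seconds):
    """Compute average response time, throughput, and error rate.

    requests: list of (path, duration_ms, status_code) seen in the window.
    window_seconds: length of the observation window, must be > 0.
    """
    if not requests:
        return {"avg_response_ms": 0.0, "throughput_rps": 0.0, "error_rate": 0.0}
    errors = sum(1 for _path, _ms, status in requests if status >= 500)
    return {
        "avg_response_ms": mean(ms for _path, ms, _status in requests),
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
    }


def slowest_pages(requests, top_n=3):
    """Return the paths with the highest average duration, slowest first."""
    durations = defaultdict(list)
    for path, ms, _status in requests:
        durations[path].append(ms)
    averages = {path: mean(values) for path, values in durations.items()}
    return sorted(averages, key=averages.get, reverse=True)[:top_n]
```

Comparing these numbers before and after a release is a simple way to spot a regression introduced by a new feature.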

Alerting System

  1. Infrastructure alerting system: This mainly monitors hardware resources such as CPU, disk usage, and network connectivity. Many tools are available in the market, such as Zenoss, Monza, AppDynamics, and New Relic. You can choose one of them based on cost.

  2. Application health alerting system: This mainly monitors the response time, throughput, and error rate of the application. These can be monitored using tools like AppDynamics or New Relic.

For both kinds of monitoring, we should set up email alerts that fire whenever a parameter violates its threshold.
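The threshold check behind such alerts can be sketched in a few lines of Python. The metric names and limits in `THRESHOLDS` are assumed values for illustration only; real limits depend on the system. The sketch only builds the email message; sending it (for example with `smtplib`) is left out because the mail server details are environment-specific.

```python
from email.message import EmailMessage

# Assumed limits for illustration; tune these for the real system.
THRESHOLDS = {"cpu_percent": 85.0, "disk_percent": 90.0, "error_rate": 0.05}


def find_violations(metrics, thresholds=THRESHOLDS):
    """Return alert messages for every metric above its threshold."""
    alerts = []
    for name, limit in sorted(thresholds.items()):
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts


def build_alert_email(alerts, sender, recipient):
    """Build (but do not send) an email for the alerts; None if there are none."""
    if not alerts:
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"{len(alerts)} threshold violation(s)"
    message.set_content("\n".join(alerts))
    return message
```

Commercial monitoring tools provide this kind of rule out of the box; the sketch only shows the logic those rules express.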

Security

Setting up monitoring and alerts for software abuse is a crucial part of preparing for failure. Some of the factors to consider are:

  • Monitor the user agent field in your IIS logs and ensure there are no unexpected user agents.
  • Keep an eye on the type of data posted to the application, using logging techniques.
  • Set up immediate alerts for any sudden surge in traffic.
  • Check all repeated exceptions related to authentication, data retrieval, and CRUD operations. All these exceptions should trigger an alert immediately.
  • Monitor for unexpected geographical usage of your software, using software like Google Analytics.
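The user agent check from the first point can be sketched in Python as below. It assumes the IIS logs are in the W3C extended format, where a `#Fields:` line names the columns and the user agent column is called `cs(User-Agent)`. The allowlist in `EXPECTED_AGENT_PREFIXES` is a made-up example; each application would define its own.

```python
from collections import Counter

# Assumed allowlist: browsers normally send a value starting with "Mozilla/".
EXPECTED_AGENT_PREFIXES = ("Mozilla/",)


def unexpected_user_agents(log_lines, expected_prefixes=EXPECTED_AGENT_PREFIXES):
    """Count user agents in W3C-format log lines that match no expected prefix."""
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive lists the column names for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or "cs(User-Agent)" not in fields:
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed rows
        agent = values[fields.index("cs(User-Agent)")]
        if not agent.startswith(expected_prefixes):
            counts[agent] += 1
    return counts
```

Any non-empty result is a candidate for an alert. An unexpected agent is not proof of abuse, but it is a cheap signal worth reviewing.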

The above practices place you ahead of failures: they help you take appropriate steps to avoid them, react quickly when they occur in production, and recover from them with less damage.
