Increasing System Robustness With A ‘Let It Crash’ Philosophy

By Ben Wootton · Apr. 13, 13

Designing fault-tolerant systems is extremely difficult. You can try to anticipate and reason about all of the things that can go wrong with your software and code defensively for those situations, but in a complex system it is very likely that some combination of events or inputs will eventually conspire against you to cause a failure or bug.

In certain corners of the software community, such as the Erlang and Akka ecosystems, there's a philosophy that rather than trying to handle and recover from every possible exceptional and failure state, you should simply fail early and let your processes crash, then recycle them back into the pool to serve the next request. This gives the system a kind of self-healing property, where it recovers from failure without ceremony, whilst freeing the developer from overly defensive error handling.
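
In Akka, for example, this recycling is exactly what a supervisor strategy gives you. Below is a minimal sketch using classic Akka actors (the Worker and Supervisor names and the "empty request" failure are invented for illustration): the worker makes no attempt to defend itself, and its supervisor simply restarts it so it is ready for the next message.

```scala
import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props, SupervisorStrategy}
import akka.actor.SupervisorStrategy.Restart
import scala.concurrent.duration._

// A worker that makes no attempt to defend itself: bad input simply crashes it.
class Worker extends Actor {
  def receive: Receive = {
    case ""          => throw new IllegalArgumentException("empty request")
    case msg: String => sender() ! s"handled: $msg"
  }
}

// The supervisor embodies "let it crash": a crashed worker is restarted and
// recycled back into the pool to serve the next request.
class Supervisor extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: Exception => Restart
    }

  private val worker = context.actorOf(Props[Worker](), "worker")

  def receive: Receive = {
    case msg => worker.forward(msg)
  }
}

object LetItCrashDemo extends App {
  val system = ActorSystem("let-it-crash")
  val supervisor = system.actorOf(Props[Supervisor](), "supervisor")
  supervisor ! ""       // crashes the worker; the supervisor restarts it
  supervisor ! "hello"  // the recycled worker handles the next request
}
```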

I believe that implementing 'let it crash' semantics and working within this mindset will improve almost any application – not just the real-time telecoms systems where Erlang was born. By adopting 'let it crash', redundancy and defence against errors are baked into the architecture, rather than scenarios being defensively anticipated right down in the guts of the code. It will also encourage you to build more redundancy throughout your system.

Also ask yourself: if the components or services in your application did crash, how well would your system recover, with or without human intervention? Very few applications can recover fully automatically, and yet implementing this feels like relatively low-hanging fruit compared to writing 100% fault-tolerant code.

So how do we start to put this into practice?

At the hardware level, you can obviously look towards the 'Google model' of commodity servers, whereby the failure of any given server supporting the system does not lead to a fatal degradation of service. This is easier in the cloud world, where the economics encourage us to use a larger number of small virtualised servers. Just let them crash, and design for the fact that servers can die at a moment's notice.

Your application might be composed of different logical services. Think of a user authentication service or a shopping cart system. Design the system to let entire services crash. Where appropriate, your application should be able to proceed and degrade gracefully whilst a service is not available, or fall back onto another instance of the service whilst the first one is recycling. Nothing should be irreplaceable on the critical code path, because it might crash!
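
As a rough sketch of that kind of fallback and graceful degradation, in plain Scala (the cart service endpoints and the empty-cart default are hypothetical, not from the article): the request tries the primary shopping cart instance, falls back to a second instance while the first is recycling, and finally degrades to an empty cart rather than failing outright.

```scala
import scala.util.Try

final case class Cart(items: List[String])

object CartFallback extends App {
  // Hypothetical service calls; either instance may be down at any moment.
  def primaryCartService(userId: String): Cart   = sys.error("primary instance is recycling")
  def secondaryCartService(userId: String): Cart = Cart(List("book", "coffee"))

  // Try the primary, fall back to the secondary, and finally degrade
  // gracefully to an empty cart instead of failing the whole request.
  def loadCart(userId: String): Cart =
    Try(primaryCartService(userId))
      .orElse(Try(secondaryCartService(userId)))
      .getOrElse(Cart(Nil))

  println(loadCart("user-42"))  // Cart(List(book, coffee))
}
```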

Ideally, your distributed system will be organised to scale horizontally across different server nodes. The system should load-balance or intelligently route between processes in the pool, and different nodes should be able to join or leave the pool without too much ceremony or impact on the application. When you have this style of horizontal scalability, let nodes within your application crash and rejoin the pool when they're ready.
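
A minimal sketch of that idea (the pool and round-robin routing policy here are invented for illustration, not a particular framework's API): requests are routed across whatever nodes are currently in the pool, and a node can drop out when it crashes and rejoin when it's ready, without callers needing to care.

```scala
import java.util.concurrent.atomic.AtomicLong

// A pool that nodes can leave when they crash and rejoin when they're ready;
// requests are round-robined across whichever members are currently present.
class NodePool[N](initial: Seq[N]) {
  private var members: Vector[N] = initial.toVector
  private val counter = new AtomicLong(0)

  def leave(node: N): Unit = synchronized { members = members.filterNot(_ == node) }
  def join(node: N): Unit  = synchronized { members = members :+ node }

  def route(): Option[N] = synchronized {
    if (members.isEmpty) None
    else Some(members((counter.getAndIncrement() % members.size).toInt))
  }
}

object PoolDemo extends App {
  val pool = new NodePool(Seq("node-a", "node-b", "node-c"))
  pool.leave("node-b")   // node-b crashes out of the pool
  println(pool.route())  // Some(node-a): traffic keeps flowing
  pool.join("node-b")    // node-b recovers and rejoins
}
```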

What if we go further and implement let it crash semantics for our infrastructure?

For instance, say we have some messaging system or message broker that transports messages between the components of your application. What if we let that crash and come back online later? Could you design the application so that this is not as fatal as it sounds, perhaps by allowing application components to write to, or dynamically switch between, two message brokers?
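
Here is one hedged sketch of what that dynamic switching could look like (the Broker trait and publish signature are assumptions for illustration, not a real client API): the producer publishes to the first broker that accepts the message, so one broker crashing and coming back later is not fatal.

```scala
import scala.util.{Success, Try}

// Minimal publisher abstraction; real brokers (RabbitMQ, Kafka, ...) would sit behind it.
trait Broker {
  def name: String
  def publish(topic: String, payload: String): Unit  // throws if the broker is down
}

// Publish to the first broker that accepts the message; a crashed broker is
// simply skipped until it comes back online.
class FailoverPublisher(brokers: Seq[Broker]) {
  def publish(topic: String, payload: String): Either[String, String] =
    brokers.view
      .map(b => Try(b.publish(topic, payload)).map(_ => b.name))
      .collectFirst { case Success(name) => name }
      .toRight("all brokers unavailable; buffer or drop according to policy")
}
```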

Distributed NoSQL data stores give us 'let it crash' capability at the data persistence level. Data will be stored in some distributed grid of nodes and replicated to at least two different hardware nodes. At that point, it's easier to let database nodes crash than to try to achieve 100% uptime.
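
As a toy illustration of why that replication makes node crashes tolerable (a deliberate simplification, not how any particular NoSQL store is implemented): a write succeeds as long as a majority quorum of replicas acknowledges it, so a single crashed replica does not make the data unavailable.

```scala
import scala.util.Try

// A replica either stores the value or throws because its node has crashed.
trait Replica {
  def write(key: String, value: String): Unit
}

object QuorumWrite {
  // Write to every replica and succeed if a majority acknowledges, so with
  // three replicas one crashed node does not fail the write.
  def quorumWrite(replicas: Seq[Replica], key: String, value: String): Boolean = {
    val acks = replicas.count(r => Try(r.write(key, value)).isSuccess)
    acks >= replicas.size / 2 + 1
  }
}
```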

At the network level, we can design topologies such that we do not care if routers or network links crash, because there is always some alternate route through the network. Let them crash, and when they come back, the optimal routes will be there, ready for our application to make use of again in the future.

'Let it crash' is more than simple redundancy. It's about making the application able to recover itself. It's about putting your site reliability effort into your architecture rather than into low-level defensive coding. It's about decoupling your application and introducing asynchronicity, in recognition that things go wrong in surprising ways. Ironically, sitting back and coolly letting your software crash can lead to better software!

Tags: application, Crash (computing), philosophy, Robustness (computer science), IT

Published at DZone with permission of Ben Wootton, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
