Production Ready Code Is Much More Than Error Handling

If you're used to thinking of code as being production ready, it might be time to adopt a more holistic mindset.

Oren Eini · Apr. 23, 19 · Opinion

Production-ready code is a term that I don’t really like. I much prefer the term Production Ready System. This is because production readiness isn’t really a property of a particular piece of code, but of the entire system.

The term is often thrown around, usually to mean adding error handling and robustness to a piece of code. Take this example from the official docs:

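// A single shared HttpClient; note that BaseAddress is a Uri, not a string.
static HttpClient client = new HttpClient { BaseAddress = new Uri("http://api.server.url/") };

static async Task<Product> GetProductAsync(string id)
{
    Product product = null;
    HttpResponseMessage response = await client.GetAsync($"api/products/{id}");
    if (response.IsSuccessStatusCode)
    {
        // ReadAsAsync<T> comes from the Microsoft.AspNet.WebApi.Client package.
        product = await response.Content.ReadAsAsync<Product>();
    }
    // Any non-success status silently ends up as a null result.
    return product;
}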


This kind of code is obviously not production ready, right? Asked to review it, most people would point out the lack of error handling if the request fails. I asked about this on Twitter and got some good answers.

In practice, to make this piece of code production worthy you’ll need a lot more code and infrastructure:

  • .NET specific – ConfigureAwait(false), so that the continuation does not depend on a captured SynchronizationContext.
  • .NET specific – HttpClient caches proxy settings and DNS resolution, requiring you to replace it after a failure or on a timer.
  • .NET specific – HttpClient won’t throw an exception if the server sent an error back (including things like auth failures).
  • Input validation – especially if this is exposed to potentially malicious user input.
  • A retry mechanism (with a back-off strategy) to handle transient conditions, which needs either idempotent requests or a way to avoid duplicate actions.
  • Monitoring for errors, health checks, latencies, etc.
  • Metrics for performance: how long such operations take, how many ops/sec, how many failures, etc.
  • Metrics for the size of responses (which may surprise you).
  • Correlation ID for end-to-end tracing.
  • Proper handling of errors – including reading the actual response from the server and surfacing it to the caller/logs.
  • Handling successful requests that don’t contain the data they are supposed to.
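
To give a feel for how much code even a few of these items add, here is a minimal sketch covering three of them: retry with back off, surfacing the server's error body, and ConfigureAwait(false). The names RemoteCallException and ResilientHttp.ReadWithRetryAsync are hypothetical (they are not part of .NET), and treating only 5xx responses and HttpRequestException as transient is an assumption. It is only appropriate for idempotent requests such as a GET.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical exception type that carries the server's own error response.
public sealed class RemoteCallException : Exception
{
    public HttpStatusCode StatusCode { get; }
    public string ResponseBody { get; }

    public RemoteCallException(HttpStatusCode statusCode, string responseBody)
        : base($"Remote call failed with status {(int)statusCode}: {responseBody}")
    {
        StatusCode = statusCode;
        ResponseBody = responseBody;
    }
}

public static class ResilientHttp
{
    // Runs `send`, retrying transient failures with exponential back off.
    // Returns the response body on success; throws RemoteCallException with
    // the server's body on a non-transient error or once attempts run out.
    public static async Task<string> ReadWithRetryAsync(
        Func<Task<HttpResponseMessage>> send, int maxAttempts = 3, TimeSpan? baseDelay = null)
    {
        TimeSpan delay = baseDelay ?? TimeSpan.FromMilliseconds(200);
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (HttpResponseMessage response = await send().ConfigureAwait(false))
                {
                    string body = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
                    if (response.IsSuccessStatusCode)
                        return body;

                    // Assumption: only 5xx responses are worth retrying.
                    bool transient = (int)response.StatusCode >= 500;
                    if (!transient || attempt >= maxAttempts)
                        throw new RemoteCallException(response.StatusCode, body);
                }
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Network-level failure: fall through and retry after the delay.
            }

            await Task.Delay(delay).ConfigureAwait(false);
            delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait each time
        }
    }
}
```

Assuming the client field from the earlier example, a call might look like ResilientHttp.ReadWithRetryAsync(() => client.GetAsync($"api/products/{Uri.EscapeDataString(id)}")). That is roughly fifty lines for three of the eleven items, and none of the monitoring, metrics, or tracing is in there yet.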

And this is just the stuff that pops into my head from looking at 10 lines of really simple code.

And after you have done all of that, you are still not really production ready, mostly because if you implemented all of that in the GetProductAsync() function, you can’t really figure out what is actually going on.

These kinds of operations are something that you want to implement once, via the infrastructure. There are quite a few libraries that do robust service handling, and using them will help, but it will only take you part of the way toward a production-ready system.
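
As one possible shape for "implement once, via the infrastructure", .NET lets you chain a DelegatingHandler into the HttpClient pipeline so that every request passing through it gets the same treatment. The sketch below is an assumption-laden illustration, not a recommendation of a specific design: TelemetryHandler, the X-Correlation-ID header name, and the log line format are all hypothetical choices.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: tags every request with a correlation ID and logs
// status, latency, and response size, so call sites stay free of plumbing.
public sealed class TelemetryHandler : DelegatingHandler
{
    private const string HeaderName = "X-Correlation-ID"; // assumed header name
    private readonly Action<string> _log;

    public TelemetryHandler(HttpMessageHandler inner, Action<string> log) : base(inner)
    {
        _log = log;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Reuse a caller-supplied ID if there is one; otherwise create one.
        if (!request.Headers.Contains(HeaderName))
            request.Headers.Add(HeaderName, Guid.NewGuid().ToString("N"));
        string correlationId = string.Join(",", request.Headers.GetValues(HeaderName));

        Stopwatch stopwatch = Stopwatch.StartNew();
        try
        {
            HttpResponseMessage response =
                await base.SendAsync(request, cancellationToken).ConfigureAwait(false);
            _log($"{correlationId} {request.Method} {request.RequestUri} -> " +
                 $"{(int)response.StatusCode} in {stopwatch.ElapsedMilliseconds} ms, " +
                 $"{response.Content?.Headers.ContentLength} bytes");
            return response;
        }
        catch (Exception e)
        {
            _log($"{correlationId} {request.Method} {request.RequestUri} failed after " +
                 $"{stopwatch.ElapsedMilliseconds} ms: {e.GetType().Name}");
            throw;
        }
    }
}
```

It would be wired up once, for example as new HttpClient(new TelemetryHandler(new HttpClientHandler(), Console.WriteLine)), after which GetProductAsync() can stay as short as it was. A real system would send these values to its metrics and tracing infrastructure rather than to a string callback.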

Let’s take cars and driving as an example of a system. If you look at a car, you’ll find that quite a bit of its design, constraints, and feature set is driven directly by the need to handle failure modes.

A modern car will have (just the stuff that is obvious and pops to mind):

  • Drivers – a required explicit learning stage and a competency test, limits on driving in an impaired state, and higher certification levels for more complex vehicles.
  • Accident prevention – ABS, driver assist, and seat belt beeps.
  • Injury and death reduction when accidents do happen – seat belts, air bags, crumple zones.
  • On the road – rumble strips, road fences, road maintenance, traffic laws, active and passive enforcement.

I’m pretty sure that anyone who actually understands cars will be shocked by how sparse my list is. It is clear, however, that accidents, their prevention, and reducing their lethality and cost are part and parcel of all design decisions on cars. In fact, there is a multi-layered approach to increasing the safety of drivers and passengers. I’m not sure how comparable the safety of a car is to the production readiness of a piece of software, though. One of the ways that cars compete with one another is on safety features, so there is a strong incentive to improve there. That isn’t usually the case with software.

It usually takes a few (costly) lessons about how much being unavailable costs you before you can really feel how much not being production ready costs you. At that point, most people turn to error handling and recovery strategies. I think this is a mistake. A great read on the topic is "How Complex Systems Fail"; it is a short, highly readable paper that is very relevant to the field of software development.

I consider a system production ready when it has not merely error handling inside a particular component, but actual dedicated components for failure handling (note the difference from error handling), for managing failures, and for mitigating them.

The end goal is that you’ll be able to continue execution and maintain a semblance of normalcy to the outside world. That means having dedicated parts of the system that are just about handling (potentially very rare) failure modes, and it has a significant impact on your design and architecture. That is not an inexpensive proposition. It takes quite a lot of time and effort to get there, and it is usually only worth it if you actually need the reliability this provides.

With cars, the issue is literally human lives, so we are willing to spend quite a lot on preventing accidents and reducing their impact. However, the level of robustness I expect from a toaster is quite different (don’t catch fire, pretty much), and most of that is already handled by the electrical system in the house.

Erlang is a good example of a language and environment that has always prioritized production availability. Erlang systems are famously credited with 99.9999999% availability (that is, nine nines). That is about 32 milliseconds of downtime per year, which is less than a typical GC pause in many systems. Erlang has a lot of infrastructure to support these kinds of availability numbers, but you still need to understand the whole system.
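
For reference, the arithmetic behind that downtime figure, assuming a 365-day year:

```latex
\[
(1 - 0.999999999) \times 365 \times 24 \times 3600\ \text{s}
  = 10^{-9} \times 31{,}536{,}000\ \text{s}
  \approx 0.0315\ \text{s}
  \approx 32\ \text{ms}
\]
```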

For example, if your Erlang service depends on a database, a restart of the database server (which takes 2 minutes to cycle) might very well mean that your service processes will die and be restarted by their supervisors, only to die again and again. At this point, the supervisors themselves give up and die, passing the buck up the chain. The usual response is to restart the supervisor a few more times, but the database is still down and we are in a cascading failure scenario. Just restarting is really effective in handling errors, but for certain failure scenarios, you need to consider how you’ll actually make it work. A database being unavailable can make your entire system cycle through its restart options and die just as the database is back online. For that matter, what happens to all the requests that you tried to process during that time?
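
To make the timing problem concrete, here is a small simulation of a restart budget. It is a simplified model and not Erlang's actual supervisor implementation: RestartBudget.TimeOfGiveUp is a hypothetical helper, and the assumptions are that the child crashes the moment it starts while the dependency is down, that the supervisor waits a fixed delay between restarts, and that it gives up once one more restart would exceed a maximum number of restarts within a sliding window.

```csharp
using System;
using System.Collections.Generic;

public static class RestartBudget
{
    // Returns the time at which the supervisor gives up, or null if it is
    // still restarting the child when the outage ends.
    public static TimeSpan? TimeOfGiveUp(
        TimeSpan outage, int maxRestarts, TimeSpan window, TimeSpan restartDelay)
    {
        if (restartDelay <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(restartDelay));

        var recentRestarts = new Queue<TimeSpan>();
        for (TimeSpan now = TimeSpan.Zero; now < outage; now += restartDelay)
        {
            // Forget restarts that have aged out of the sliding window.
            while (recentRestarts.Count > 0 && now - recentRestarts.Peek() > window)
                recentRestarts.Dequeue();

            // One more restart would exceed the allowed intensity: give up.
            if (recentRestarts.Count >= maxRestarts)
                return now;

            recentRestarts.Enqueue(now);
        }
        return null;
    }
}
```

With illustrative numbers (a 2 minute outage, at most 3 restarts per 5 seconds, 100 ms between restarts), the supervisor in this model gives up 300 ms into the outage, long before the database returns. Spacing the restarts 10 seconds apart keeps it within budget for the whole outage, which is one reason to think about the failure mode and not only about the restart.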

I have had a few conversations that went something like: “Oh, we use Erlang, that is handled,” but production readiness isn’t something that you can solve at the infrastructure level. It has a global impact on your architecture, design, and the business itself. There are a lot of questions that you can’t answer from a technical point of view. “If I can’t validate the inventory status, should I accept an order or not?” is probably the most famous one, and that is something that the business itself needs to answer.

Although, to be honest, the most important answer that you need from the business is a much more basic one: “Do we need to worry about production readiness, and if so, by how much?”

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
