How Retailers Can Prevent Downtime More Effectively

The words "504 gateway time out" are pretty much the last four words any online retailer would ever want its shoppers to see, let alone on the biggest cyber shopping weekend of the year.

With online sales reaching a record-breaking $3.34 billion on Black Friday 2016, according to Adobe Digital Insights, it would have been a real shame if your site had gone down.

Unfortunately, big retail companies like Macy’s, Inc., Victoria’s Secret, Express, Inc., and Pier 1 Imports failed to heed the lessons of Black Fridays past and experienced major shopping jams because their sites just couldn’t handle the heavy traffic. It has happened before at Target, Neiman Marcus, and Best Buy, costing them customer loyalty and a significant amount of revenue.

This year, Macy’s website suffered an extended Black Friday disruption, directing customers to a page citing “heavier traffic than normal” or “we’re getting a makeover.” Mmm, perhaps not quite the best timing for a makeover.

When your site’s down, your customers lose their confidence in you and things get personal. They get frustrated, angry, and need to vent. Publicly. I mean, you did promise them a fabulous, hassle-free online shopping experience, but failed to deliver. They may eventually forgive you, but you’ll have to work hard to earn back their trust, probably putting in more work than if you had planned ahead to avoid the situation in the first place. You’ll probably also lose a few potential customers in the process, and who knows where the social media winds will blow, right?

Plus, your stock may drop. According to a Bloomberg report, the Macy’s site couldn’t keep up with the heavy traffic, and the company’s shares saw a reported decline of 1.71% during active trading on Friday. Not only were potential customers disappointed, causing them to leave the site or hit the refresh button over and over, but the downtime actually hurt the $13.52 billion retailer’s stock. That’s a straightforward lose-lose situation right there.

What should you do to avoid this?

100% Uptime Is Not Impossible

In our internal analysis of the holiday season, we found that retailers using XAP achieved success in surpassing key metrics for online retailing app performance. Notable metrics from our 2016 internal analysis include the following:

[Figure: Black Friday and Cyber Monday metrics]

A Story From the War Room

We sent our field engineer to work on-site with one of our top eCommerce customers. Here are some useful insights into how a preemptive support strategy and a short feedback loop work.

This retailer used XAP to provide access to its catalog, inventory, and tax data, achieving a zero-downtime holiday season for the second year in a row. As a result, this company delivered a fantastic customer experience with a page load time of 3.6 seconds on average, reaching 3.88 seconds at the highest peak of holiday traffic, and generated a 100% increase in 2016 holiday sales.

It’s worth noting that this retailer did not experience any system performance issues.

Identifying and Mitigating Key Risks 

Rather than reacting after a failure occurs, prevent the failure in the first place.

  • Most failures are the result of misconfiguration or capacity planning guesswork.

  • Planning and proactive tuning with continuous system monitoring produce predictable improvements at scale, eliminating risks from incorrect provisioning.

  • eCommerce applications are complex and built from many subsystems. In many cases, an eCommerce organization does not have expert skills in every one of those subsystems. Having an expert in the room helps to bridge this gap and builds the capabilities of business operations.

  • When product-related issues were identified, we were able to provide the fastest path to protect the business and address concerns in a timely fashion.

Using In-Memory Computing

You should use in-memory computing to buffer peak load access to shared data resources.

  • Typical eCommerce systems have shared data resources for managing inventory, orders, and catalog information. Putting the shared data resources in-memory provides faster and more efficient (parallel) access to this shared data.

  • Data is mirrored back into the database in batches. In this way, peak load transactions are buffered so that the write traffic does not overwhelm the database back end.

  • The in-memory computing grid acts as a system of record. A failure in the underlying database can be tolerated without affecting online users while the database is restored to a working state.

  • Using a combination of in-memory and SSD allows very large in-memory data sets to be stored at a reasonable cost, while still ensuring fast recovery during failure.
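The buffering idea in the list above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration of a write-behind store, not how XAP or any particular data grid is implemented. The class name `WriteBehindStore`, the `flush_batch` callable, and the `batch_size` parameter are all hypothetical names chosen for this example.

```python
from collections import OrderedDict


class WriteBehindStore:
    """In-memory store that mirrors writes to a slower back end in batches.

    Hypothetical sketch: single-threaded, no persistence of the buffer itself.
    """

    def __init__(self, flush_batch, batch_size=100):
        # flush_batch is an assumed callable that persists a list of
        # (key, value) pairs, for example one bulk write to the database.
        self._flush_batch = flush_batch
        self._batch_size = batch_size
        self._data = {}              # all reads are served from memory
        self._dirty = OrderedDict()  # writes not yet mirrored to the back end

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value
        # Repeated writes to one key collapse into a single pending entry.
        self._dirty[key] = value
        self._dirty.move_to_end(key)

    def pending(self):
        return len(self._dirty)

    def flush(self):
        """Mirror up to batch_size pending writes; return how many were sent.

        If the back end raises, the entries stay pending and reads continue
        to be served from memory.
        """
        batch = list(self._dirty.items())[: self._batch_size]
        if not batch:
            return 0
        self._flush_batch(batch)  # may raise; nothing is dropped if it does
        for key, _ in batch:
            del self._dirty[key]
        return len(batch)


# Usage: a plain list stands in for the database here.
database = []
store = WriteBehindStore(flush_batch=database.extend, batch_size=2)
store.put("sku-1", 10)
store.put("sku-2", 5)
store.put("sku-1", 9)   # overwrites the pending value for sku-1
store.flush()           # one batch reaches the "database", not three writes
```

In a real deployment the flush would typically run on a timer or a background worker, so that a traffic spike turns into a steady stream of batches rather than a burst of individual database writes.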

Remembering That Self-Healing Systems Recover From Failure in Real Time

  • Failures are inevitable. Keeping a backup copy in memory enables zero-downtime systems to service user traffic without interruption, even if something does go wrong.

  • Systems provisioned for failure handle failure by design.

  • Automatic fail-over and provisioning eliminate the need to overprovision (costly) resources in case of failure. Traditionally, it’s common for retailers to provision resources for the holiday season at five times the capacity needed for non-holiday traffic.
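As a rough illustration of the points above, the following Python sketch keeps a primary and a backup copy of the data in memory, promotes the backup when the primary is marked as failed, and then provisions a replacement backup. It is a toy model under simplifying assumptions (one process, synchronous replication, failure signaled by a flag). The names `Node` and `ReplicatedStore` are hypothetical and do not describe any specific product.

```python
class Node:
    """A hypothetical in-memory node holding one copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True  # flipped to False to simulate a failure


class ReplicatedStore:
    """Keeps a primary and a backup copy in memory and fails over on error."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def put(self, key, value):
        self._ensure_primary()
        self.primary.data[key] = value
        # Synchronous replication: update the backup before returning.
        if self.backup is not None and self.backup.alive:
            self.backup.data[key] = value

    def get(self, key):
        self._ensure_primary()
        return self.primary.data.get(key)

    def _ensure_primary(self):
        if self.primary.alive:
            return
        if self.backup is None or not self.backup.alive:
            raise RuntimeError("no live copy of the data")
        # Fail over: promote the backup, then provision a fresh backup
        # seeded from the new primary so the next failure is also covered.
        self.primary = self.backup
        self.backup = Node(self.primary.name + "-replacement")
        self.backup.data = dict(self.primary.data)


# Usage: the read still succeeds after the original primary fails.
node_a, node_b = Node("a"), Node("b")
store = ReplicatedStore(node_a, node_b)
store.put("cart:42", ["sku-1"])
node_a.alive = False
items = store.get("cart:42")  # served by the promoted backup
```

The design choice worth noting is that the replacement backup is created only when a failure happens, which is the sense in which automatic provisioning can stand in for keeping large amounts of spare capacity running all season.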

Final Notes

Peak loads tend to stress a system in the areas where problems are least expected, which makes them hard to handle. Quite often, peak loads lead to unexpected downtime.

In many cases, this sort of peak load is known in advance, as is the case with Black Friday and Cyber Monday. Still, many eCommerce sites continue to experience downtime or slowness during such events, leading to a huge loss of revenue and reputation.

For the past few years, we (together with our customers) have taken a preemptive approach by putting an engineer on-site to accompany the customer team during the event itself. This has been a huge success, leading to 100% uptime. Both sides learned a great deal from the experience. The customer learned how to operate our product better and what to look for to ensure that the system is running properly. We learned much about how the customer uses our product and were able to shorten the feedback loop between the customer and our product and engineering teams.

Let’s agree that next year, the last four words you want your customer to see are “Thank you for shopping!” That definitely has a better ring to it.

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
