Chaos Data Engineering Manifesto: 5 Laws for Successful Failures

Five laws for how data teams can start breaking data systems to make them more reliable.

Shane Murray

Feb. 28, 23 · Opinion

Like (1)

Save

3.4K Views

It's midnight in the dim and cluttered office of The New York Times, currently serving as the "situation room."

A powerful surge of traffic is inevitable. During every major election, the wave would crest and crash against our overwhelmed systems before receding, allowing us to assess the damage.

We had been in the cloud for years, which helped some. Our main systems would scale– our articles were always served– but integration points across backend services would eventually buckle and burst under the sustained pressure of insane traffic levels.

However, this night in 2020 differed from similar election nights in 2014, 2016, and 2018. That's because this traffic surge was simulated, and an election wasn't happening.

Pushing to the Point of Failure

Simulation or not, this was prod, so the stakes were high. There was suppressed horror as J-Kidd–our system that brought ad targeting parameters to the front end–went down hard. It was as if all the ligaments had been ripped from the knees of the pass-first point guard for which it had been named. Ouch.

I'm sorry, Jason; it was for the greater good.

J-Kidd wasn't the only system that found its way to the disabled list. That was the point of the whole exercise, to push our systems until they failed. We succeeded. Or failed, depending on your point of view.

The next day the team made adjustments. We decoupled systems, implemented failsafes, and returned to the court for game 2. As a result, the 2020 election was the first I can remember where the on-call engineers weren't on the edge of their seats, white-knuckling their keyboards…At least not for system reliability reasons.

Pre-Mortems and Chaos Engineering

We referred to that exercise as a "premortem." Its conceptual roots can be traced back to the idea of chaos engineering introduced by site reliability engineers.

For those unfamiliar, chaos engineering is a disciplined methodology for intentionally introducing points of failure within systems to understand their thresholds better and improve resilience.

It was largely popularized by the success of Netflix's Simian Army, a suite of programs that would automatically introduce chaos by removing servers and regions and introducing other points of failure into production. All in the name of reliability and resiliency.

While this idea isn't completely foreign to data engineering, it can certainly be described as an extremely uncommon practice.

No data engineer in their right mind has looked at their to-do list, the unfilled roles on their team, the complexity of their pipelines, and then said: "This needs to be harder. Let's introduce some chaos." That may be part of the problem.

Data teams need to think beyond providing snapshots of data quality to the business and start thinking about how to build and maintain reliable data systems at scale.

We cannot afford to overlook data quality management, and it plays an increasingly large role in critical operations. For example, just this year, we witnessed how deleting one file, and an out-of-sync legacy database could ground more than 4,000 flights.

Of course, you can't just copy and paste software engineering concepts straight into data engineering playbooks. Data is different. DataOps tweaks DevOps methodology as data observability does to observability.

Consider this manifesto as a proposal for taking the proven concepts of chaos engineering and applying them to the eccentric world of data reliability.

The 5 Laws of Data Chaos Engineering

The principles and lessons of chaos engineering are a good place to start defining the contours of a data chaos engineering discipline. Our first law combines two of the most important.

1. Have a Bias for Production, But Minimize the Blast Radius

There is a maxim among site reliability engineers that will ring true for every data engineer who has had the pleasure of the same SQL query returning two different results across staging and production environments. That is, "Nothing acts like prod except for prod."

To that, I would add "production data too." Data is just too creative and fluid for humans to anticipate. Synthetic data has come a long way, and don't get me wrong, it can be a piece of the puzzle, but it's unlikely to simulate key edge cases.

Like me, the mere thought of introducing points of failure into production systems probably makes your stomach churn. It's terrifying. Some data engineers justifiably wonder, "Is this even necessary within a modern data stack where so many tools abstract the underlying infrastructure?"

I'm afraid so. Remember, as the opening anecdote and J-Kidd's snapped ligaments illustrated, the elasticity of the cloud is not a cure-all.

In fact, it's that abstraction and opacity–along with the multiple integration points–that makes it so important to stress test a modern data stack. An on-premise database may be more limiting, but data teams tend to understand its thresholds as they hit them more regularly during day-to-day operations.

Let's move past the philosophical objections for the moment and dive into the practical. Data is different. Introducing fake data into a system won't be helpful because the input changes the output. It's going to get really messy too.

That's where the second part of the law comes into play: minimize the blast radius. There is a spectrum of chaos and tools that can be used:

In words only, "let's say this failed; what would we do?"
Synthetic data in production.
Techniques like data diff allow you to test snippets of SQL code on production data.
Solutions like LakeFS allow you to do this on a bigger scale by creating "chaos branches" or complete snapshots of your production environment where you can use production data but with complete isolation.
Do it in prod, and practice your backfilling skills. After all, nothing acts like prod but prod.

Starting with lesser chaotic scenarios is probably a good idea and will help you understand how to minimize the blast radius in production.

Deep diving into real production incidents is also a great place to start. But does everyone really understand what exactly happened? Production incidents are chaos experiments that you've already paid for, so make sure that you are getting the most out of them.

Mitigating the blast radius may also include strategies like backing up applicable systems or having data observability or data quality monitoring solution in place to assist with the detection and resolution of data incidents.

2. Understand It's Never a Perfect Time (Within Reason)

Another chaos engineering principle holds to observe and understand "steady state behavior."

There is wisdom in this principle, but it is also important to understand the field of data engineering isn't quite ready to measure by the standard of "5 9s" or 99.999% uptime.

Data systems are constantly in flux, and there is a wider range of "steady state behavior." As a result, there will be the temptation to delay the introduction of chaos until you've reached the mythical point of "readiness." Unfortunately, you can't out-architect bad data; no one is ever ready for chaos.

The Silicon Valley cliche of failing fast is applicable here. Or, to paraphrase Reid Hoffman, if you aren't embarrassed by the results of your first post-mortem/fire drill/chaos-introducing event, you introduced it too late.

Introducing fake data incidents while you are dealing with real ones may seem silly. Still, ultimately this can help you get ahead by better understanding where you have been putting bandaids on larger issues that may need to be refactored.

3. Formulate Hypotheses and Identify Variables at the System, Code, and Data Levels

Chaos engineering encourages forming hypotheses of how systems will react to understand what thresholds to monitor. It also encourages leveraging or mimicking past real-world incidents or likely incidents.

We'll dive deeper into the details of this in the next article, but the important modification here is to ensure these span the system, code, and data levels. Variables at each level can create data incidents, some quick examples:

System: You didn't have the right permissions set in your data warehouse.
Code: A bad left JOIN.
Data: A third-party sent you garbage columns with a bunch of NULLS.

Simulating increased traffic levels and shutting down servers impact data systems, and those are important tests but don't neglect some of the more unique and fun ways data systems can break badly.

4. Everyone in One Room (Or at Least Zoom Call)

This law is based on the experience of my colleague, site reliability engineer, and chaos practitioner Tim Tischler.

"Chaos engineering is just as much about people as it is systems. They evolve together, and they can't be separated. Half of the value from these exercises comes from putting all the engineers in a room and asking, 'what happens if we do X or if we do Y?' You are guaranteed to get different answers. Once you simulate the event and see the result, now everyone's mental maps are aligned. That is incredibly valuable," he said.

Also, the interdependence of data systems and responsibilities creates blurry lines of ownership, even on the most well-run teams. As a result, breaks often happen and are overlooked in those overlaps and gaps in responsibility where the data engineer, analytical engineer, and data analyst point at each other.

In many organizations, the product engineers creating the data and the data engineers managing it are separated and siloed by team structures. They also often have different tools and models of the same system and data. Feel free to pull these product engineers in as well, especially when the data has been generated from internally built systems.

Good incident management and triage can often involve multiple teams, and having everyone in one room can make the exercise more productive.

I'll also add from personal experience that these exercises can be fun (in the same weird way putting all your chips on red is fun). I'd encourage data teams to consider a chaos data engineering fire drill or pre-mortem event at the next offsite. It makes for a much more practical team bonding exercise than getting out of an escape room.

5. Hold Off on the Automation for Now

Truly mature chaos engineering programs like Netflix's Simian Army are automated and even unscheduled. While this may create a more accurate simulation, the reality is that automated tools don't currently exist for data engineering. Furthermore, if they did, I'm unsure if I would be brave enough to use them.

To this point, one of the original Netflix chaos engineers has described how they didn't always use automation as the chaos could create more problems than they could fix (especially in collaboration with those running the system) in a reasonable period.

Given data engineering's current reliability evolution and the greater potential for an unintentionally large blast radius, I would recommend data teams lean more towards scheduled, carefully managed events.

Practice as You Play

The important takeaway from the concept of chaos engineering is that practice and simulations are vital to performance and reliability. In my next article, I'll discuss specific things that can be broken at the system, code, and data level and what teams may find out about those systems by pushing them to their limits.

Chaos engineering Engineering Chaos Data (computing) Law (stochastic processes) systems

Published at DZone with permission of Shane Murray. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending