John Allspaw

Artwork: Micha Huigen

What we talk about when we talk about ‘root cause’

It’s a lot more nuanced than you might think.


John Allspaw // Cofounder, Adaptive Capacity Labs

The ReadME Project amplifies the voices of the open source community: the maintainers, developers, and teams whose contributions move the world forward every day.

What do we talk about when we talk about ‘root cause’? To begin, consider this passage from Thinking By Machine (deLatil, 1956), p. 153:

“Imagine an iron bar thrust into an electric furnace. The bar lengthens, and the “cause” of the lengthening is said to be the heat of the furnace. One is astonished—why should it not be the introduction of the bar into the furnace? Or the existence of the bar? Or the fact that the bar had been previously kept at a lower temperature? None of these possibilities can be termed secondary causes; they are all primary determining causes without which the lengthening phenomenon could not have occurred.”

In recent years, the understanding that failure in complex systems requires multiple contributors coming together to produce the surprising events we call incidents has gained traction. Much has been written and presented about this hallmark phenomenon of complex systems, and while this perspective isn’t yet considered a “mainstream” view, I suspect it aligns with what all experienced software engineers intuitively understand.

In his seminal paper How Complex Systems Fail (Cook, 1998), my colleague Dr. Richard Cook put it this way:

3) Catastrophe requires multiple failures—single point failures are not enough. The array of defenses works. System operations are generally successful. Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident. Each of these small failures is necessary to cause catastrophe but only the combination is sufficient to permit failure. Put another way, there are many more failure opportunities than overt system accidents. Most initial failure trajectories are blocked by designed system safety components. Trajectories that reach the operational level are mostly blocked, usually by practitioners. 

Another description of this perspective was made by Ryan Kitchens at SRECon Americas in 2019:

“There is no root cause. The problem with this term isn't just that it's singular or that the word root is misleading: there's more. Trying to find causes at all is problematic...looking for causes to explain an incident limits what you'll find and learn. And the irony is that root cause analysis is built on this idea that incidents can be fully comprehended. They can't. We already have a better phrase for this, and it sounds way cooler: it's called a perfect storm. In this way, separating out causes and breaking down incidents into their multiple contributing factors, we're able to see that the things that led to an incident are either always or transiently present. An incident is just the first time they combined into a perfect storm of normal things that went wrong at the same time.”

From an abstract perspective, language that describes causality is, ostensibly, value-neutral. But the term ‘root cause’ is almost always used in the context of untoward or negative outcomes, and not in situations where an outcome is deemed a success. Rarely does someone demand a search for the ‘root cause’ of a successful product launch, for example. It seems widely accepted that successful outcomes in complex systems come from many influences that come together in a positive way. Failures aren’t often viewed the same way.

Rather than restating what’s been written and spoken about (such as the references linked above), I’d like to explore in this article what seems to keep people using the term ‘root cause’ despite the growing skepticism of its value.

What makes this term attractive for the people using it? Is it simply used as shorthand language, as a way to summarize an otherwise too-detailed explanation for the reader or listener? Or is it used to simplify a story, to redirect people’s attention to a specific and bounded area so something—anything—practical can be done?

Research literature on this topic reveals that in descriptions of accidents and incidents, use of the term ‘root cause’ (or even multiple ‘root’ causes) serves social purposes more than technical ones. 

Providing reassurance about the future

Labeling something as a ‘root cause’ helps people cope with the (sometimes implicit) anxiety that comes along with the experience of incidents. When people are observing their systems working well (or at least well enough), and a seemingly out-of-nowhere incident happens, the contrast can be jarring, to say the least. We can go from feeling confident about how well we understand our technical systems to suddenly feeling astonished and quite uncertain.

The lived experience of people responding to these situations can leave them wanting for something—anything—to help them feel better about the future. Likewise, technical leaders are also not immune to the feeling of unease that incidents tend to bring with them. If this can happen unexpectedly, what else can? Do these events represent harbingers of more significant ones to come? What do these incidents say about the organization’s abilities...or my own leadership skills?

Incidents have a way of producing genuine and unsettling dismay; it’s understandable to search for an explanation, a cause, that we can be sure of. 

In this way, labeling some specific part of the story as a 'root cause' helps us. It provides some comfort that we’ve got a handle on things we previously didn’t. There’s a term for this phenomenon: Nietzschean anxiety. It reflects what the German philosopher wrote in Twilight of the Idols:

“With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none. Because it is fundamentally just our desire to be rid of an unpleasant uncertainty, we are not very particular about how we get rid of it: the first interpretation that explains the unknown in familiar terms feels so good that one “accepts it as true”….  The “why” shall, if at all possible, result not in identifying the cause for its own sake, but in identifying a cause that is comforting, liberating, and relieving. A second consequence of this need is that we identify as a cause something already familiar or experienced, something already inscribed in memory. Whatever is novel or strange or never before experienced is excluded.”

In other words, our experience with incidents can be so disturbing to us that we feel a strong and immediate desire to identify what “caused” an event, so we can then do something (which typically means fixing something) in order to regain a sense of being in control. This is what John Carroll (Carroll, 1995) called root cause seduction. 

On the face of it, this idea seems understandable, even relatable. But we have to acknowledge that labeling something as a ‘root cause’ reflects a cherry-picked perspective; it highlights one aspect of a complex event and discounts others. The label performs a sort of sleight-of-hand or redirection like a magician might, akin to saying “look right here—don’t concern yourself with other things.”  

Purposes and audiences

It can be useful to understand the context in which the term “root cause” is being used. 

Who is the author (or speaker) using it? What are they hoping to convey by using the term? Who is their audience? How do they understand the use of ‘root cause’ in the context of what they are reading? 

If it is used in a conversation amongst engineers on the same team, it might be used simply as a way to emphasize or highlight a specific location they believe warrants attention. Quite often, we find that this usage reflects something better described as a trigger than as a cause. The term ‘trigger’ tends to do a better job of describing a specific dynamic that “activates” already existing conditions, some of which might have been latent in the code or architecture’s arrangement for some time.

If it is used in a legal agreement or other contractual document, the term tends to be left intentionally ambiguous, allowing for the flexibility of interpretation that frequently comes with legal language.

When it comes to articles companies publish about incidents their services or products have experienced, the term ‘root cause’ tends to be used in very specific ways. The core audience for these public posts is current and potential future customers. The primary purpose is to provide a) confidence that the company understands the event sufficiently and b) some form of commitment to improving the situation in the future. By labeling a specific component as a “root cause” (or even a finite number of “root causes”), authors of these posts can project much more certainty or confidence than if they were to acknowledge the genuine complexity of the incident.

A challenge for readers and listeners

I’ll offer a few questions to consider the next time you read or hear the term ‘root cause’:

  • What is the author (or speaker) trying to convey by using the term?

  • What agenda(s) might the author (or speaker) have in their version of the story, other than providing the richest description they can?

  • What else can you imagine is influencing the outcome of the story being told, besides what is deemed the ‘root cause’?

  • What details seem to be noticeably absent in the story you’re being told?

  • What questions can you imagine being dismissed or discounted by the storyteller, if you had the chance to ask them?

Questions like these are garden-variety critical thinking exercises. But they might help us explore what the story doesn’t tell us, or what might be missing in the story.

Hi, I'm John. I'm a founder of Adaptive Capacity Labs, and the author of The Art of Capacity Planning and Web Operations. In 2009, I helped shape what later became known as the DevOps movement before I moved to Brooklyn, New York. My fascination with understanding how people handle challenging problems under pressure led me to my master's degree in Human Factors and Systems Safety at Lund University while I was CTO at Etsy. Since then, my colleagues and I have been working hard to bring perspectives and approaches from Resilience Engineering to the world of software engineering and operations.
