Everyone’s Fault, Nobody’s Fault


Someone pushes a new feature to prod the same day you go on-call. Hours later, your phone goes off - not a gentle buzz, but a full-blown siren that could wake up the entire neighborhood.

You open the alert, and it’s for a feature you didn’t even touch. Maybe it’s unhandled NPEs, maybe something else. Doesn’t matter. You’re the one on-call, so it’s your problem now.


When Things Break

In those moments, it’s usually faster to just debug and fix it - even without full context.

I’m pretty good at debugging (unless it’s a latency issue; race conditions are somehow easier).

By the time the person who wrote the feature sees my message, figures out what’s going on, and proposes a fix, more customers would’ve been affected. So I’d rather just handle it myself.

Sometimes I roll back. Sometimes I don’t - if the issue’s isolated and the rest of the system keeps running fine, I’ll patch it directly.

But this newsletter isn’t about debugging, or when to roll back.

It’s about what happens after.


At AWS, We Don’t Point Fingers

Even if it’s your code that caused the issue, you’re not alone in it. Every change has at least one reviewer, multiple analyzers, and automated checks. If something still slips through, it’s not just on you.

When something significant happens, we write a COE (Correction of Error). No names. Just “the engineer.”

Then we go through the five “whys.”

Why did it happen?

Why did that happen?

Why did that happen?

And so on, until you reach the real root cause - maybe a missing test, weak automation, or a blind spot in monitoring.

It’s never just one mistake. It’s a chain.


The Netflix Culture Thing

A friend keeps trying to convince me to apply to Netflix. I’ve read about their culture, though - supposedly, if you cause an outage, you present your mistake to the entire company.

Maybe that’s exaggerated, but still. I know myself. I’d always have that tiny voice in the back of my head: don’t screw up.

We have global ops meetings too, where the more interesting incidents get ~~roasted~~ reviewed. But the difference is that no one’s being judged.


Mistakes Aren’t Failures

Nobody gets fired for one mistake. As IBM’s former CEO Thomas J. Watson supposedly said:

“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him.”

If you’re causing weekly outages, sure, that’s a problem. But even then, it’s rarely just your problem - it’s the team’s.

So if your company has a blaming culture, be the one who changes it.

It’ll make everyone around you - and the product - better.


P.S. Once, I got paged at a casino while I wasn’t even on-call. Apparently, I had a setting enabled that triggered alerts whenever I had no service. The bug? It didn’t care whether you were on-call or not. My friend thought someone had just hit a jackpot. Turns out, it was just AWS yelling at me.

When I asked around, someone said it happened to them too - during a church service. So yeah, nobody’s safe from bad alert settings.

Cheers!

Evgeny Urubkov (@codevev)
