From Cleanup to Chaos: A CloudFormation Cautionary Tale


It was 5 PM on a Friday, and our intern had already dropped off his laptop, since it was his last day. Two days earlier, he’d started cleaning up the AWS resources he created in our test account. What none of us realized at the time was that his cleanup would block every single deployment in our pipeline—for two full days.

The culprit?

CloudFormation dependencies between stacks.


The Setup

In our setup, whenever an SQS queue is created, we automatically generate CloudWatch alarms for it—including any dead-letter queues. Then, in a separate stack, we add those alarms to our monitoring system, so we get alerted if something goes wrong.
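In CDK terms, the wiring looks roughly like this. It's a simplified sketch: the construct names are made up, and the dashboard stands in for whatever the real monitoring integration is.

```typescript
import { App, Stack, Duration } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Stack 1: the queue, its dead-letter queue, and an alarm on the DLQ.
class QueueStack extends Stack {
  public readonly dlqAlarm: cloudwatch.Alarm;

  constructor(scope: App, id: string) {
    super(scope, id);

    const dlq = new sqs.Queue(this, 'OrdersDlq');
    new sqs.Queue(this, 'OrdersQueue', {
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 },
    });

    // Alert as soon as anything lands in the dead-letter queue.
    this.dlqAlarm = new cloudwatch.Alarm(this, 'OrdersDlqAlarm', {
      metric: dlq.metricApproximateNumberOfMessagesVisible({ period: Duration.minutes(5) }),
      threshold: 1,
      evaluationPeriods: 1,
    });
  }
}

// Stack 2: monitoring. Using queueStack.dlqAlarm here makes CDK synthesize an
// export in QueueStack and a matching Fn::ImportValue in MonitoringStack.
class MonitoringStack extends Stack {
  constructor(scope: App, id: string, alarm: cloudwatch.IAlarm) {
    super(scope, id);
    new cloudwatch.Dashboard(this, 'OpsDashboard', {
      widgets: [[new cloudwatch.AlarmWidget({ alarm, title: 'Orders DLQ' })]],
    });
  }
}

const app = new App();
const queueStack = new QueueStack(app, 'QueueStack');
new MonitoringStack(app, 'MonitoringStack', queueStack.dlqAlarm);
```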

Totally reasonable.

Until someone tries to delete just one part of that system.


The Intern Meant Well

What the intern did was correct in theory:

He simply removed the CDK code that created those resources.

That change was approved, merged, and pushed into the pipeline—and that’s when CloudFormation threw a fit.

He and his mentor spent all of Thursday trying to fix it. But the mentor was off on Friday. And by then, I was the one trying to deploy something. (Not that I had to do it Friday… but I definitely didn’t want to deal with it on Monday.)

Apparently, I was the fourth person to take a look at the issue.

But I tend to be the last one who needs to.


Debugging the Dependencies

I traced the issue back to how the stacks were set up.

The monitoring stack still depended on an exported value from the stack that previously created the queue and alarm.

Normally, these exports are created automatically under the hood. You don't have to think about them until you remove the resource (or the last reference to it). Once that happens, CDK assumes the export is no longer needed and stops generating it.

In this case, the intern had deleted the SQS queue (and with it the alarm) from the CDK code… and that removed the export. So CloudFormation refused to deploy: the monitoring stack was still importing a value whose export was about to disappear.
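For what it's worth, aws-cdk-lib has an escape hatch for exactly this situation: Stack.exportValue() lets the producing stack keep exporting a value for one deployment while the consumer stops using it, and then both get removed in a follow-up. A rough sketch, reusing the hypothetical stacks from above:

```typescript
import { App, Stack } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Step 1 (first PR and deployment): keep the queue, the alarm, and the export,
// but stop referencing the alarm in MonitoringStack. Pinning the export here
// keeps the already-deployed monitoring stack's import satisfied.
class QueueStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    const dlq = new sqs.Queue(this, 'OrdersDlq');
    const dlqAlarm = new cloudwatch.Alarm(this, 'OrdersDlqAlarm', {
      metric: dlq.metricApproximateNumberOfMessagesVisible(),
      threshold: 1,
      evaluationPeriods: 1,
    });

    this.exportValue(dlqAlarm.alarmArn);
  }
}

// Step 2 (second PR and deployment): delete the queue, the alarm, and the
// exportValue() call. Nothing imports the value anymore, so CloudFormation
// happily removes the export along with the resources.
```

Had the removal been split across two deployments like that, the export would never have gone missing while something still imported it.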

I made the first PR to break the dependency by skipping alarm creation for that specific queue. Since this was the beta environment, I ran cdk deploy directly to get a faster feedback loop.
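The real change was in our own CDK code, but the idea was roughly this (the helper and the skip list are hypothetical):

```typescript
import { Stack } from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Queues we deliberately stop alarming on, e.g. while they're being removed.
const SKIP_ALARMS_FOR = new Set(['OrdersQueue']);

function addDlqAlarm(stack: Stack, queueId: string, dlq: sqs.Queue): cloudwatch.Alarm | undefined {
  if (SKIP_ALARMS_FOR.has(queueId)) {
    // No alarm means nothing for the monitoring stack to import for this queue.
    return undefined;
  }
  return new cloudwatch.Alarm(stack, `${queueId}DlqAlarm`, {
    metric: dlq.metricApproximateNumberOfMessagesVisible(),
    threshold: 1,
    evaluationPeriods: 1,
  });
}
```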

In theory, this should have worked.

It didn’t.


The Hidden Problem

Turns out, someone had manually deleted the SQS queue and alarm from the AWS Console.

And here’s the key thing:

Manually deleting a resource doesn’t change anything for CloudFormation.

As far as CloudFormation is concerned, the queue and alarm still exist.

But it can’t delete them.

And it won’t re-create them.

It just… stalls.
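Side note: if you want CloudFormation to tell you which resources were deleted behind its back, drift detection does exactly that. A minimal sketch using the AWS SDK for JavaScript v3 (the stack name is made up; the same check is available from the console and the CLI):

```typescript
import {
  CloudFormationClient,
  DetectStackDriftCommand,
  DescribeStackDriftDetectionStatusCommand,
  DescribeStackResourceDriftsCommand,
} from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});
const stackName = 'QueueStack'; // hypothetical

async function findManuallyDeletedResources(): Promise<void> {
  // Kick off drift detection and poll until it finishes.
  const { StackDriftDetectionId } = await cfn.send(
    new DetectStackDriftCommand({ StackName: stackName }),
  );

  let detectionStatus = 'DETECTION_IN_PROGRESS';
  while (detectionStatus === 'DETECTION_IN_PROGRESS') {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const status = await cfn.send(
      new DescribeStackDriftDetectionStatusCommand({ StackDriftDetectionId }),
    );
    detectionStatus = status.DetectionStatus ?? 'DETECTION_FAILED';
  }

  // Resources CloudFormation still tracks but that no longer exist in the account.
  const drifts = await cfn.send(
    new DescribeStackResourceDriftsCommand({
      StackName: stackName,
      StackResourceDriftStatusFilters: ['DELETED'],
    }),
  );
  for (const drift of drifts.StackResourceDrifts ?? []) {
    console.log(`${drift.LogicalResourceId} was deleted outside of CloudFormation`);
  }
}

findManuallyDeletedResources().catch(console.error);
```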


The Fix

Here’s what we actually had to do:

  1. Re-create the missing SQS queue manually, so CloudFormation could delete it properly. (We skipped the alarm—it wasn’t the blocker.)
  2. Update the CDK code:
    • Remove the SQS queue definition
    • Skip alarm creation
    • Explicitly define CfnOutput entries to manage the exports the monitoring stack expected (see the sketch after this list)
  3. Deploy. Now that the queue existed again, CloudFormation could cleanly delete it.
  4. Submit a follow-up PR to remove the now-unused CfnOutput and alarm logic.
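The CfnOutput part of step 2, roughly. The export name has to match the auto-generated one the deployed monitoring stack already imports, which you can look up in the Outputs section of the currently deployed template; the name and ARN below are placeholders.

```typescript
import { App, CfnOutput, Stack } from 'aws-cdk-lib';

class QueueStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    // Queue definition removed, alarm creation skipped (the first two bullets of step 2).

    // Keep exporting the value the deployed monitoring stack still imports,
    // under the exact export name it expects.
    new CfnOutput(this, 'OrdersDlqAlarmArnExport', {
      value: 'arn:aws:cloudwatch:us-west-2:123456789012:alarm:OrdersDlqAlarm',
      exportName: 'QueueStack:ExportsOutputRefOrdersDlqAlarmArn', // placeholder
    });
  }
}
```

Once the monitoring stack no longer imported that value, step 4's follow-up PR could drop the CfnOutput and the leftover alarm logic.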

The Lesson

If you take one thing from this:

Never manually delete CloudFormation-managed resources.

You’re not helping—you’re breaking the system.

And yes, we can see you did it. (Thanks, CloudTrail.)


What about you—what do you use for infrastructure-as-code?

CDK? CloudFormation? Terraform? Something else entirely?

Hit reply and let me know.

Cheers!

Evgeny Urubkov (@codevev)
