This Error Should Have Been Impossible


Last week I was looking into some errors popping up on our ops dashboard. Nothing mission-critical, but the goal is always zero errors, or as close as possible. In this particular Lambda, though, they were annoying enough: status updates wouldn’t show up in time, which meant customers could see outdated info for a while.

As always, I went to CloudWatch Logs Insights to dig into the errors. To my surprise, they came from retrieving items by ID from DynamoDB, and the items weren’t there. The problem? That Lambda shouldn’t even have run until the item existed in DynamoDB.

I went to the codebase to track the flow and confirmed my understanding—this should have been impossible. But then I added another log group to the query, the one from the Lambda that actually creates the item, and looked at the timestamps. The gap between the item creation and the error? Less than 200ms.

That’s when it hit me.

The issue wasn’t in our code. By default, DynamoDB uses eventually consistent reads. That means until a write has been replicated across all of the table’s AZs (Availability Zones), a read may return an outdated view depending on which replica gets queried.

The fix?

One option was to turn our SQS queue (where events get sent after the item is created) into a delay queue. That might work, but it’s not guaranteed, and intentionally adding latency for customers is a poor trade-off.
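For context, a delay queue is just a queue-level setting (or a per-message DelaySeconds). A rough sketch with boto3, the Python SDK, using a made-up queue URL:

  import boto3

  sqs = boto3.client("sqs")

  # Hypothetical queue URL for illustration.
  QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/item-created-events"

  # Option A: delay every message on the queue (0-900 seconds).
  sqs.set_queue_attributes(
      QueueUrl=QUEUE_URL,
      Attributes={"DelaySeconds": "1"},
  )

  # Option B: delay a single message at send time.
  sqs.send_message(
      QueueUrl=QUEUE_URL,
      MessageBody='{"itemId": "abc-123"}',
      DelaySeconds=1,
  )

Either way you’re trading latency for a guess at how long replication takes, which is why we didn’t go this route.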

What I ended up doing instead was enabling consistent reads in the AWS SDK. Simple setting, big impact.
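Here’s what that looks like, as a minimal sketch with boto3, the Python SDK (the table and key names are made up; other SDKs expose the same ConsistentRead flag):

  import boto3

  dynamodb = boto3.resource("dynamodb")
  table = dynamodb.Table("StatusUpdates")  # hypothetical table name

  # Default: eventually consistent read; can miss an item written milliseconds ago.
  maybe_stale = table.get_item(Key={"id": "abc-123"})

  # Strongly consistent read: reflects every successful write that happened
  # before the read, at twice the read-capacity cost.
  fresh = table.get_item(Key={"id": "abc-123"}, ConsistentRead=True)
  item = fresh.get("Item")  # missing only if the item truly doesn't exist

GetItem, Query, Scan, and BatchGetItem all accept the flag; GSI queries don’t (more on that below).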


When to use consistent reads:

  • You need the freshest version of the data right after a write.
  • Your system can’t tolerate stale reads (think: real-time updates, chained writes/reads).
  • You don’t care about the cost.

When you might not need them:

  • You’re pulling data for analytics or a dashboard where a few seconds’ lag is fine.
  • You care more about cost—because eventually consistent reads are half the price.
  • You’re running at serious scale, and an occasional retry might be cheaper than paying for consistency every time.

Also worth knowing: reads from Global Secondary Indexes (GSIs) and DynamoDB Streams are always eventually consistent. No way around that. So if you’re reacting to those, just expect a little delay.
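If you try to force it, DynamoDB just rejects the request. A quick sketch, again with boto3 and made-up table/index names:

  import boto3
  from botocore.exceptions import ClientError

  client = boto3.client("dynamodb")

  try:
      client.query(
          TableName="StatusUpdates",        # hypothetical
          IndexName="byCustomerId-index",   # hypothetical GSI
          KeyConditionExpression="customerId = :c",
          ExpressionAttributeValues={":c": {"S": "cust-42"}},
          ConsistentRead=True,  # not supported on GSIs
      )
  except ClientError as err:
      # DynamoDB returns a ValidationException explaining that consistent
      # reads are not supported on global secondary indexes.
      print(err.response["Error"]["Code"])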

We already had retries configured for when DynamoDB queries actually fail. In this case, though, the query succeeded: our DAO (the layer in our code responsible for fetching data from DynamoDB) returned an empty list, and the Lambda failed later when it expected at least one item to be there. I could’ve added more defensive checks to avoid the error, but that would’ve just masked the real issue, because we know for a fact this should never happen.

Wrap-up:

This was a good reminder that it’s hard to account for something you didn’t even know to look for. I didn’t write this part of the code, and I can see how easy it would be to overlook read consistency until it bites you. If you’re working with DynamoDB—or any distributed system—those small defaults can lead to big surprises.

Cheers!

Evgeny Urubkov (@codevev)

