This Error Should Have Been Impossible


Last week I was looking into some errors popping up on our ops dashboard. Nothing mission-critical, but the goal is always zero errors, or as close as possible. In this particular Lambda, though, they were annoying enough: status updates wouldn’t show up in time, which meant customers could see outdated info for a while.

As always, I went to CloudWatch Logs Insights to dig into the errors. To my surprise, they came from retrieving items by ID from DynamoDB, and the items weren’t there. The problem? That Lambda shouldn’t even have run until the item existed in DynamoDB.

I went to the codebase to track the flow and confirmed my understanding—this should have been impossible. But then I added another log group to the query, the one from the Lambda that actually creates the item, and looked at the timestamps. The gap between the item creation and the error? Less than 200ms.

That’s when it hit me.

The issue wasn’t in our code. By default, DynamoDB uses eventually consistent reads. That means until a write has been replicated across all of the table’s AZs (Availability Zones), a read may return an outdated view depending on which replica gets queried.

The fix?

One option was to turn our SQS queue (where events get sent after the item is created) into a delay queue. That might work, but it’s not guaranteed, and intentionally adding latency for customers is a poor trade-off.
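For context, a delay queue is just a queue-level setting (or a per-message DelaySeconds). A rough sketch with boto3, the Python SDK, using a made-up queue URL:

  import boto3

  sqs = boto3.client("sqs")

  # Hypothetical queue URL for illustration.
  QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/item-created-events"

  # Option A: delay every message on the queue (0-900 seconds).
  sqs.set_queue_attributes(
      QueueUrl=QUEUE_URL,
      Attributes={"DelaySeconds": "1"},
  )

  # Option B: delay a single message at send time.
  sqs.send_message(
      QueueUrl=QUEUE_URL,
      MessageBody='{"itemId": "abc-123"}',
      DelaySeconds=1,
  )

Either way you’re trading latency for a guess at how long replication takes, which is why we didn’t go this route.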

What I ended up doing instead was enabling consistent reads in the AWS SDK. Simple setting, big impact.
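Here’s what that looks like, as a minimal sketch with boto3, the Python SDK (the table and key names are made up; other SDKs expose the same ConsistentRead flag):

  import boto3

  dynamodb = boto3.resource("dynamodb")
  table = dynamodb.Table("StatusUpdates")  # hypothetical table name

  # Default: eventually consistent read; can miss an item written milliseconds ago.
  maybe_stale = table.get_item(Key={"id": "abc-123"})

  # Strongly consistent read: reflects every successful write that happened
  # before the read, at twice the read-capacity cost.
  fresh = table.get_item(Key={"id": "abc-123"}, ConsistentRead=True)
  item = fresh.get("Item")  # missing only if the item truly doesn't exist

GetItem, Query, Scan, and BatchGetItem all accept the flag; GSI queries don’t (more on that below).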


When to use consistent reads:

  • You need the freshest version of the data right after a write.
  • Your system can’t tolerate stale reads (think: real-time updates, chained writes/reads).
  • You don’t care about the cost.

When you might not need them:

  • You’re pulling data for analytics or a dashboard where a few seconds’ lag is fine.
  • You care more about cost—because eventually consistent reads are half the price.
  • You’re running at serious scale, and an occasional retry might be cheaper than paying for consistency every time.

Also worth knowing: reads from Global Secondary Indexes (GSIs) and DynamoDB Streams are always eventually consistent. No way around that. So if you’re reacting to those, just expect a little delay.
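If you try to force it, DynamoDB just rejects the request. A quick sketch, again with boto3 and made-up table/index names:

  import boto3
  from botocore.exceptions import ClientError

  client = boto3.client("dynamodb")

  try:
      client.query(
          TableName="StatusUpdates",        # hypothetical
          IndexName="byCustomerId-index",   # hypothetical GSI
          KeyConditionExpression="customerId = :c",
          ExpressionAttributeValues={":c": {"S": "cust-42"}},
          ConsistentRead=True,  # not supported on GSIs
      )
  except ClientError as err:
      # DynamoDB returns a ValidationException explaining that consistent
      # reads are not supported on global secondary indexes.
      print(err.response["Error"]["Code"])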

We already had retries configured for when DynamoDB queries actually fail. In this case, though, the query succeeded: our DAO (the layer in our code responsible for fetching data from DynamoDB) returned an empty list, and the Lambda failed later when it expected at least one item to be there. I could’ve added more defensive checks to avoid the error, but that would’ve just masked the real issue, because we know for a fact this should never happen.

Wrap-up:

This was a good reminder that it’s hard to account for something you didn’t even know to look for. I didn’t write this part of the code, and I can see how easy it would be to overlook read consistency until it bites you. If you’re working with DynamoDB—or any distributed system—those small defaults can lead to big surprises.

Cheers!

Evgeny Urubkov (@codevev)

