A Tale of ECS Service Stability
How ECS's new(ish) version consistency feature affected the stability of an old service
An Unexpected Evening Investigation
I was just about to log off for the day when I noticed a customer's SRE team chat suddenly erupting with activity. It was around 6pm my time, approaching midnight for their engineers in Europe. I hadn't received any direct notifications, but seeing the rapid-fire messages in the channel, I decided to hop in and see what was happening.
First things first, I pulled up the Datadog dashboard.
As it turned out, their production environment was experiencing something peculiar - their API Gateway traffic had completely flatlined. Not degraded, not sluggish, but completely silent. For a system that typically hums with constant activity, this digital silence was both unusual and concerning.
Diving into the Mystery
My first step was to check the enterprise firewall service that sits in front of their API Gateway - the crucial component responsible for traffic filtering and security. This service runs as a set of tasks in Amazon ECS, and surprisingly, the ECS console showed zero running tasks.
What made this particularly intriguing was that there hadn't been any deployments to this service in several weeks. The service was designed with auto-scaling and self-healing capabilities specifically to prevent this type of situation. Yet somehow, it was completely down with no obvious explanation.
The Puzzling Behavior
The logs revealed what appeared to be a straightforward issue:
Task stopped at: YYYY-MM-DDThh:mm:ss.555Z CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref [ECR_REPO].dkr.ecr.us-west-2.amazonaws.com/[SERVICE]/[IMAGE]:[TAG]@sha256:[DIGEST]: [ECR_REPO].dkr.ecr.us-west-2.amazonaws.com/[SERVICE]/[IMAGE]:[TAG]@sha256:[DIGEST]: not found
ECS couldn't find the container image it needed to launch the task. But here's where things got really interesting.
I decided to try pulling the images myself, and what I discovered was perplexing: I could successfully pull the associated tag, and I could see the new SHA digest associated with that tag. However, I couldn't pull the specific SHA digest that ECS was trying to use. It was as if ECS had cached an old version of the image digest and was stubbornly refusing to look at the current tag.
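If you want to reproduce that check programmatically, you can ask ECR directly whether the tag and the pinned digest still resolve. Below is a minimal sketch using boto3; the repository name, tag, and digest are hypothetical placeholders for the redacted values above, and it assumes credentials that can call ecr:DescribeImages in us-west-2.

    import boto3

    # Minimal sketch with placeholder values; the real repository, tag,
    # and digest are redacted in the log line above.
    ecr = boto3.client("ecr", region_name="us-west-2")

    REPO = "my-service/my-image"    # placeholder repository name
    TAG = "stable"                  # the mutable tag the task definition references
    PINNED_DIGEST = "sha256:..."    # the digest ECS recorded at deployment time

    def image_exists(image_id):
        try:
            ecr.describe_images(repositoryName=REPO, imageIds=[image_id])
            return True
        except ecr.exceptions.ImageNotFoundException:
            return False

    # The tag still resolves (to a newer digest)...
    print("tag exists:", image_exists({"imageTag": TAG}))
    # ...but the exact digest ECS is asking for is gone.
    print("pinned digest exists:", image_exists({"imageDigest": PINNED_DIGEST}))

This mirrors what I saw when pulling manually: the tag resolved to a newer digest, while the specific digest ECS wanted no longer existed.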
This behavior contradicted everything I'd previously understood about how ECS handles container images. In my experience, ECS services had always pulled whatever image was currently associated with a tag, without getting hung up on SHA digests from previous versions. This fundamental change in behavior was what led me to reach out to AWS Support.
A Helping Hand from AWS Support
After I explained the puzzling behavior I was observing, AWS Support provided invaluable assistance. While I had correctly identified the immediate cause, they helped me attribute it to a specific, documented change in ECS behavior - pointing me to the documentation for the Software Version Consistency feature introduced in July 2024.
This was one of those moments where past experience created a blind spot. I had actually seen this feature announcement when it was released, but I hadn't fully grasped how dramatically it would change the behavior of ECS service stability in certain scenarios. What I had assumed was a bug or misconfiguration was actually a deliberate design change to improve security.
The Technical Puzzle Pieces
The Software Version Consistency feature fundamentally altered how ECS handles container images. Instead of resolving image tags at runtime for each task, ECS now:
Resolves a container image tag to its digest when the first task of a deployment starts
Stores this digest in the ECS control plane
Uses this exact digest for all subsequent tasks in that deployment (the sketch below shows where that pinned digest is visible)
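To see where the pinned digest lives, here is a small sketch (assuming hypothetical cluster and service names, and credentials for the ECS read APIs) that lists a service's tasks and prints both the image reference from the task definition and the digest the deployment resolved it to:

    import boto3

    # Hypothetical cluster and service names; adjust to your environment.
    ecs = boto3.client("ecs", region_name="us-west-2")
    CLUSTER = "production"
    SERVICE = "edge-firewall"

    task_arns = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"]
    if task_arns:
        tasks = ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]
        for task in tasks:
            for container in task["containers"]:
                # "image" is what the task definition references (often a mutable tag);
                # "imageDigest" is the immutable digest this deployment is pinned to.
                print(container["name"], container["image"], container.get("imageDigest"))

In our incident there were no running tasks left to inspect, but the same pinned digest shows up in the stopped tasks' error message shown earlier.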
Here's where our customer's situation created the perfect storm:
Their ECR repository had a lifecycle policy that removed older, unused images (a representative policy is sketched just after this list)
The specific image digest referenced by the firewall service had been removed by this lifecycle policy
When tasks needed to restart, ECS tried to pull the exact image digest it had on record
That specific digest no longer existed in ECR, resulting in launch failures
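To make the first two points concrete, here is a representative lifecycle policy of the kind that can cause this. The repository name and retention count are hypothetical, and the rule is a sketch rather than the customer's actual configuration; the point is that nothing in a rule like this knows which digest the ECS control plane has recorded for a live deployment.

    import json
    import boto3

    ecr = boto3.client("ecr", region_name="us-west-2")

    # Hypothetical count-based expiry rule: keep the 20 most recent images and
    # expire everything older, regardless of whether a long-running ECS
    # deployment is still pinned to one of the expired digests.
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": "Keep only the 20 most recent images",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": 20,
                },
                "action": {"type": "expire"},
            }
        ]
    }

    ecr.put_lifecycle_policy(
        repositoryName="my-service/my-image",   # placeholder repository name
        lifecyclePolicyText=json.dumps(policy),
    )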
Before this feature, ECS would have simply pulled whatever image was currently associated with the tag - a behavior I had internalized over years of working with these services. The change to using immutable digests created a new failure mode that hadn't been accounted for in our operational practices.
Because most of the services we work with associate a unique tag with each image and task definition, this was my first exposure to this failure mode.
Philosophical Takeaways
While it was a short outage in the grand scheme of things, this incident was a good reminder that cloud architecture requires constant adaptation. Features designed to improve security can sometimes create unexpected ripple effects in our operational practices.
The Software Version Consistency feature itself is a valuable security improvement - it ensures workloads use precisely the container images they were designed to use. The challenge comes in adapting our operational practices to work effectively with these evolving platform capabilities.
Practical Takeaways and Recommendations
Here are my practical recommendations for situations like this:
Prefer a unique tag for each image: When an ECS task definition references an image tag, it is safest for that tag to uniquely identify a single immutable Docker image.
When images must share a tag, keep version consistency enabled, but add safety nets: Preserving the ECS version consistency feature rather than disabling it lets us audit exactly which image ECS is running. Teams should still implement automated recovery mechanisms that detect this specific failure pattern and force a new deployment (a rough sketch of such a recovery hook follows these recommendations). This keeps each deployment pinned to a distinct immutable Docker image while still allowing recovery from this failure mode.
Improve alerting granularity: Proactively escalate alerts on launch failures of mission-critical services. Where those failures have known root causes, surfacing the likely cause in the alert message helps SRE teams diagnose the problem quickly.
Align lifecycle policies with deployment strategies: Review ECR lifecycle policies to make sure they cannot expire an image digest that an ECS deployment is still pinned to.
Develop deeper awareness of platform changes: One area I think is worth exploring is how language models might help us consume updates from cloud and software providers and determine which of them are relevant to us. For teams constantly focused on innovating, these tools can help cut through the noise and surface information pertinent to our systems.
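To illustrate the safety-net idea from the second recommendation, here is a rough sketch of an automated recovery hook. It assumes (hypothetically) an EventBridge rule that matches ECS "Task State Change" events whose stoppedReason begins with "CannotPullContainerError" and routes them to a small Lambda function; the function simply forces a new deployment, which makes ECS resolve the tag to its current digest again. Treat it as a starting point, not a production implementation.

    import boto3

    ecs = boto3.client("ecs")

    def handler(event, context):
        """Hypothetical Lambda handler wired to an EventBridge rule on
        ECS Task State Change events."""
        detail = event.get("detail", {})

        # Only react to the specific failure pattern from this incident.
        if not detail.get("stoppedReason", "").startswith("CannotPullContainerError"):
            return

        # Tasks started by a service report their service in the "group"
        # field as "service:<name>".
        group = detail.get("group", "")
        if not group.startswith("service:"):
            return

        # Forcing a new deployment makes ECS re-resolve the image tag to its
        # current digest, recovering from a pinned digest that no longer exists.
        ecs.update_service(
            cluster=detail["clusterArn"],
            service=group.split(":", 1)[1],
            forceNewDeployment=True,
        )

The matching EventBridge rule can use prefix matching on the stoppedReason field, and in practice you would want guardrails (rate limiting, an allow-list of services) before letting anything force deployments automatically.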
Looking Forward
For this customer, the solution isn't to disable the safety feature, but rather to enhance their recovery processes to account for new potential failure modes. It's about finding the balance between embracing security improvements and maintaining resilience in the face of unexpected changes.
For the rest of us, this incident offers a valuable reminder about the continuous evolution of cloud platforms. What works perfectly today might behave differently tomorrow, not because of a flaw in our implementation, but because the underlying platform continues to improve and evolve.