I really like Amazon ECS and we have probably deployed it for at least 20 customers by now at Foresight Technologies. Both with Fargate and EC2 flavors depending on the use-case in question. A fully managed control plane and deep integration into various AWS services like Load Balancing, IAM, CloudMap, CloudWatch, and EventBridge make it incredibly appealing as an orchestration engine for containers.
When using ECS on EC2, my team and I mostly use the same Autoscaling ECS Cluster Terraform module that I built a couple of years ago[1] on top of Amazon Linux ECS Optimized AMIs.
This one client, however, asked for CIS Hardened EC2 images, required to meet contractual obligations. Ok, so swap the base AMIs over for CIS hardened AMIs from the AWS Marketplace, install the ECS agent, and easy peasy. And sure enough, the ECS agent started scheduling tasks configured with the recommended[2] awsvpc networking mode on the EC2 instances, and everything worked correctly.
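For context, getting the agent onto an Amazon Linux 2 based AMI that isn't ECS Optimized only takes a few lines of user data. The sketch below is illustrative rather than our exact setup, and the cluster name is a placeholder:

#!/bin/bash
# (sketch) install and start the ECS agent on an Amazon Linux 2 based AMI
amazon-linux-extras disable docker
amazon-linux-extras install -y ecs
echo "ECS_CLUSTER=my-cluster" >> /etc/ecs/ecs.config   # placeholder cluster name
systemctl enable --now ecs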
At least, I thought they did at the time. The services started without a problem, and everything appeared to be working as expected. It was only later that I came to discover that, despite the apparently seamless operation, some very important ECS features were broken.
For one, every time a call to AWS was made using a privilege granted in the Task Role, the call would time out while the ECS task tried to retrieve its temporary credentials. For another, if I forcibly killed a task from ECS, the entire Docker daemon froze, and I had to restart the machine before the ECS agent could launch any additional tasks on the instance.
The natural first port of call for debugging was the ECS agent logs. No dice: everything looked normal, and no errors were being reported. Strange.
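For anyone following along, the agent writes its logs to the instance itself; on a standard setup they live in the paths below (adjust if you've customised logging):

sudo tail -f /var/log/ecs/ecs-agent.log*      # ECS agent logs
sudo journalctl -u docker -e                  # Docker daemon logs, in case the agent itself looks healthy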
I connect to one of the EC2 instances with AWS Session Manager and start exploring. I take the following steps:
1. Start a task on the ECS instance in awsvpc mode and wait for it to start.
2. Execute docker ps to determine the ID of the container running the scheduled task.
3. Execute docker exec -it containerid /bin/bash to enter the container and explore.
4. Execute env to print the container's environment variables. I note the ECS_CONTAINER_METADATA_URI and AWS_CONTAINER_CREDENTIALS_RELATIVE_URI values.
5. Execute wget $ECS_CONTAINER_METADATA_URI and experience a timeout. The container is not able to access its metadata endpoint.

At this point, it is pertinent to note that $ECS_CONTAINER_METADATA_URI points to http://169.254.170.2/v3/some-long-guid-possibly-a-task-id, and that when configuring my ECS host's iptables, I have configured the following rules per AWS's documentation:

sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
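As a quick sanity check on the host side (my own addition, not something from AWS's documentation), you can confirm that those rules actually landed in the nat table and that the agent is listening where the DNAT points; 51678 and 51679 are the agent's introspection and credentials endpoints respectively:

sudo iptables -t nat -L -n -v | grep 169.254.170.2    # the PREROUTING and OUTPUT rules above
sudo ss -tlnp | grep -E ':5167[89]'                   # ECS agent listeners on the host
curl -s http://localhost:51678/v1/metadata            # introspection API: cluster name and agent version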
Unremarkably, 127.0.0.1:51679 points to my local ECS agent. My timeout, therefore, can be narrowed down to a failure to reach the ECS agent from within the ECS task container. I need to determine whether the ECS agent is accessible from outside the container and operating correctly. So I exit the container's shell, execute wget http://169.254.170.2/v3/some-long-guid-possibly-a-task-id from the host, and am immediately able to retrieve metadata. I can therefore conclude that the ECS agent seems to be reachable at least from the ECS instance, if not from within the task itself.
At this point I’m stymied and need to learn more. How does the ECS agent usually expose its metadata endpoint to the docker containers if they’re deployed with different network interfaces? I discover this AWS blog entry that touches on the subject.
Take a while to read the entire article if you're interested, as I think the background is necessary to understand the issue in full. If you are already familiar with how awsvpc container networking works, I have extracted the highly relevant portion below.
The ECS agent invokes a chain of CNI plugins to ensure that the elastic network interface is configured appropriately in the container’s network namespace. You can review these plugins in the amazon-ecs-cni-plugins GitHub repo.
The first plugin invoked in this chain is the ecs-eni plugin, which ensures that the elastic network interface is attached to container’s network namespace and configured with the VPC-allocated IP addresses and the default route to use the subnet gateway. The container also needs to make HTTP requests to the credentials endpoint (hosted by the ECS agent) for getting IAM role credentials. This is handled by the ecs-bridge and ecs-ipam plugins, which are invoked next.
So maybe the ecs-bridge and ecs-ipam plugins aren't having their effect because they aren't being invoked by the ECS agent for some reason. I run the validations described at the end of the blog post and verify that the first plugin in the chain (the ecs-eni plugin) is in fact operating perfectly well: each task has its own network interface as expected.
The ecs-bridge and ecs-ipam plugins, however, are not having their intended effect, and I am still not able to reach the bridge from the container.
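For reference, these are rough equivalents of the kind of host-level checks I mean, assuming iproute2 is available on the host and in the container; the authoritative validations are the ones in the AWS blog post:

ip link show                        # one extra interface should appear per running awsvpc task (ecs-eni plugin)
ip addr show dev ecs-bridge         # the bridge the ecs-bridge plugin sets up for credentials/metadata traffic
ip route get 169.254.170.2          # run inside the task container to see how the endpoint would be routed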
I conclude that one of two things is true:
1. The ECS agent requires additional undocumented configuration to run tasks in awsvpc networking mode on a CIS hardened Amazon Linux 2 machine; or
2. The ECS agent is not currently able to run tasks in awsvpc networking mode on a CIS hardened Amazon Linux 2 machine.
It is at this point that my terrible understanding of networking[3] prevents me from making further progress toward resolving this issue.
I open a support ticket with AWS and send it along to our AWS partner representative, who puts us in touch with a partner Solutions Architect. He initially misunderstands my problem and tells me he is able to run tasks on the CIS hardened Amazon Linux AMI, so I jump on a call to explain that it's not running tasks per se that I'm having issues with. Talking it over on the phone, I'm able to better illustrate the issues I'm facing, and he tells me he will reach out internally and try to find a resolution for me.
In a rather painless experience with the AWS team, they are able to very quickly help pinpoint the root cause and publish some solutions. I’m not sure whether I’m meant to be identifying the solutions engineer in question, but I am incredibly grateful to this Solutions Architect and the ECS agent team for the fantastic support on this issue.
The tl;dr is that the CIS hardened image includes the following iptables INPUT chain rule:
0 0 DROP all -- * * 127.0.0.0/8 0.0.0.0/0
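If you want to check whether your own hardened image ships the same rule (a quick check of my own, not part of the support exchange), it shows up in a plain listing of the INPUT chain:

sudo iptables -L INPUT -n -v --line-numbers | grep 127.0.0.0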
The ECS agent relies on a standard ACCEPT rule configured by default on regular, un-hardened Amazon Linux. With the hardened image dropping these packets by default, two new (undocumented at the time) iptables rules are required to configure the ECS agent:
iptables -A INPUT -i ecs-bridge -d 127.0.0.0/8 -p tcp -m tcp --dport 51679 -j ACCEPT
iptables -A INPUT -i docker0 -d 127.0.0.0/8 -p tcp -m tcp --dport 51679 -j ACCEPT
This additional configuration made everything work immediately. See this GitHub issue for more info.
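If your instances are replaced by an Auto Scaling group, the rules also need to be applied before the ECS agent starts on every new instance and persisted across reboots. Roughly something like this in the instance user data, assuming the iptables-services package is present on the image:

#!/bin/bash
# (sketch) allow task containers to reach the ECS agent's credentials endpoint on 127.0.0.1:51679
iptables -A INPUT -i ecs-bridge -d 127.0.0.0/8 -p tcp -m tcp --dport 51679 -j ACCEPT
iptables -A INPUT -i docker0 -d 127.0.0.0/8 -p tcp -m tcp --dport 51679 -j ACCEPT
service iptables save               # persist across reboots (assumes iptables-services)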
I don’t like to be beaten, but this was a fun rabbit hole and it taught me a ton about the way ECS is designed in the process. If you’re still with me at the end of all of this, I hope you learned a bit too.
[1] Albeit upgraded to Terraform 0.14 compatible syntax, with modified userdata to enable operation on Amazon Linux 2 ECS Optimized AMIs.
[2] Aside from being recommended by AWS, awsvpc networking mode attaches an ENI to each ECS task and allows the use of native AWS security groups to restrict ingress and egress traffic from each task at the network level, making it important for our particular use-case.
[3] Of the computer variety, although the same description could apply to my business networking skillset.