Can we get security group rule references that extend across Cloud WAN, transit gateway, and cross-region VPC peers?
What would make AWS even better? - #9 in the countdown from 10
This is item #9 in my countdown of items on my 2022 re:Invent wishlist. You can find item #10 here.
Back when I started building on AWS, I used to think of network security in terms of private and public subnets and nothing more. I was a software engineer after all, and what did I care for configuring firewalls or routes? Networking was something that needed to get out of the way and just let my code work. And now it was my problem.
Back in college, we learned some theory about the TCP handshake protocol and subnet masks, but I didn’t fully internalize the concepts. Years later, I had forgotten most of the meager theory I once knew.
So, I did what I usually do when I don’t know what I’m doing: I consulted the great sage, Google, and found an article called Practical VPC Design, which I read carefully and followed to a T.
At the time I would use two security groups on any network interfaces I deployed:
A public security group that I would attach to internet-facing load balancers. This security group allowed ingress TCP traffic from 0.0.0.0/0 on application ports (usually 443 and 80). It also allowed egress TCP traffic on all ports to the CIDR range of the VPC.
A private security group that allowed ingress from itself and the public security group, and egress traffic to the internet.
I believe that my present state of embarrassment at admitting this fact is appropriate. After seeing Security Hub findings highlighting open security group ports, I started taking this more seriously, but I never had a clear strategy for deciding when to create a new security group or which kinds of network interfaces to place in each.
This changed around two and a half years ago, when I met someone over the internet named Evan Spaeder. He is now, and was then, the CEO of Foresight Technologies. Back then, he was the company’s only employee.
Evan was and still is a great network engineer. While my bread and butter was software, his was enterprise architecture and networking. This is relevant because he made an offhand remark that changed everything I thought I knew about using security groups. I’ll paraphrase, because I don’t remember the exact words he used:
“AWS security groups are powerful and easy: just give every service or component its own security group. That way you can create a zero trust network by whitelisting any network paths you want to use.”
This approach resembled something I was already familiar with: authorization rules in a service mesh. Except while authorization policies in a service mesh were enforced by the sidecar, the zero-trust network model that Evan described was enforced at the network layer. Since this conversation, security groups have brought a huge paradigm shift to how I think about network security. These days I use security rules to build zero-trust networks wherever possible.
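A minimal sketch of the "one security group per service" idea, assuming three hypothetical services (web, api, db), placeholder group IDs, and made-up ports. It only builds rule dictionaries in the shape that boto3's authorize_security_group_ingress accepts for its IpPermissions parameter; it does not call AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowedPath:
    source: str       # service that initiates the connection
    destination: str  # service that accepts it
    port: int         # TCP port on the destination


# Hypothetical services and placeholder security group IDs, one group per service.
GROUP_IDS = {
    "web": "sg-web-placeholder",
    "api": "sg-api-placeholder",
    "db": "sg-db-placeholder",
}

# The allow-list: only these network paths should exist.
ALLOWED_PATHS = [
    AllowedPath("web", "api", 8443),
    AllowedPath("api", "db", 5432),
]


def ingress_permission(path):
    """Build one ingress rule that names the source group rather than a CIDR."""
    return {
        "IpProtocol": "tcp",
        "FromPort": path.port,
        "ToPort": path.port,
        "UserIdGroupPairs": [
            {
                "GroupId": GROUP_IDS[path.source],
                "Description": f"{path.source} to {path.destination}",
            }
        ],
    }


def ingress_rules_by_group(paths):
    """Collect the ingress rules under the security group that receives them."""
    rules = {}
    for path in paths:
        group_id = GROUP_IDS[path.destination]
        rules.setdefault(group_id, []).append(ingress_permission(path))
    return rules


RULES = ingress_rules_by_group(ALLOWED_PATHS)
# Each entry could then be applied with boto3, for example:
#   ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=permissions)
```

Because every rule names a source group, adding a new service to the same subnet grants it nothing until a path for it is added to the allow-list. In this sketch the web group receives no ingress rules at all; an internet-facing service would still need its own CIDR-based rule.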
I’ll briefly digress to say that Evan and I developed a mutual respect and trust and our companies merged in January of last year (2021) after we had already been working together for several months. Since merging, our joint team has grown from a headcount of 4 to 24 with no sign of slowing down. Along the way, I have learned a great deal from Evan about enterprise architecture and networking.
Returning to the subject at hand, you will notice that I said I use security rules to build zero-trust networks wherever possible. Building zero-trust networks with security rules is not always possible.
The first time I ran into a limitation regarding referencing a security group from another security group was also the first time I built a transit gateway. It was at some point in 2020; I don’t quite remember when. It was also the first time that I tried to put into practice learnings from the then newly released AWS whitepaper Building a Scalable and Secure Multi-VPC AWS Network Infrastructure.
By this point, I was accustomed to referencing security groups across multiple accounts across a VPC peering connection and was surprised when my attempt to reference a security group in another security group’s rule did not work over a transit gateway.
I shouldn’t have been surprised. The very whitepaper I was basing my network architecture on contains this excerpt:
Security groups referencing works with intra-Region VPC peering. It does not currently work with Transit Gateway. Within your Landing Zone setup, VPC Peering can be used in combination with the hub and spoke model enabled by Transit Gateway.
I hadn’t read it closely enough. So I started configuring hybrid network topologies that combine a transit gateway with VPC peering. In these models, traffic between VPCs is routed through peering connections, while traffic destined for other transit gateway attachments is routed through the transit gateway. In this way, I could define zero-trust security group rules that spanned accounts and VPCs by referencing other security groups across a VPC peering connection.
This network topology has created weird asymmetric routing issues when traffic is sent through a VPC peer from a Network Load Balancer that is configured to preserve the client IP address. In these cases, return traffic is dropped as it egresses the VPC via the transit gateway instead of the VPC peering connection it arrived on. For the most part, however, the setup works well.
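The routing split and the asymmetric return path can be sketched as a longest-prefix-match lookup against one VPC's route table. Everything here is an assumption for illustration: the CIDR ranges, the target names (pcx-vpc-a, tgw), and the client address are hypothetical, and the lookup is a simplified model of how a VPC route table picks the most specific route.

```python
import ipaddress

# Hypothetical route table for "VPC B" (10.2.0.0/16) in a hybrid topology:
# the peered VPC A is reached over the peering connection, everything else
# goes to the transit gateway.
VPC_B_ROUTES = {
    "10.2.0.0/16": "local",
    "10.1.0.0/16": "pcx-vpc-a",  # VPC peering connection to VPC A
    "0.0.0.0/0": "tgw",          # transit gateway for all other destinations
}


def next_hop(route_table, destination_ip):
    """Return the target of the most specific route that matches the destination."""
    ip = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in route_table.items():
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches)[1]


# A request arrives in VPC B over the peering connection from a load balancer in VPC A.
# Without client IP preservation the target sees the load balancer's address (in VPC A):
reply_without_preservation = next_hop(VPC_B_ROUTES, "10.1.3.20")

# With client IP preservation the target sees the original client address, assumed
# here to live in a network reached through the transit gateway:
reply_with_preservation = next_hop(VPC_B_ROUTES, "10.9.4.7")

print(reply_without_preservation)  # pcx-vpc-a
print(reply_with_preservation)     # tgw
```

In the second case the reply leaves through a different path than the one the request arrived on, which is the asymmetry described above.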
The zero-trust network model breaks down when you start trying to reference security groups across an inter-region VPC peer. You see, a security group is a regional construct. When you attempt to reference a security group from a security group rule that crosses an inter-region VPC peer, your request will fail. The security group rule in us-east-1 cannot see a security group in us-east-2.
The effect of this is that if a service in us-east-2 needs to receive traffic from a source in us-east-1, an ingress rule must be added to its security group that allows traffic from the CIDR blocks of all subnets that the source in us-east-1 is deployed to. After all, with the exception of NLB ENIs, no network interface in an AWS VPC is guaranteed a static private IP address.
After adding the requisite rules to establish connectivity, all other services deployed to the same subnet as the source have network access to the destination and zero-trust is broken.
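To make the over-permissioning concrete, here is a small sketch that checks which network interfaces a CIDR-based ingress rule would admit. The subnet ranges, service names, and IP addresses are all hypothetical, and the check models only the source-address match of a security group rule.

```python
import ipaddress

# Hypothetical subnets in us-east-1 that the intended source is deployed to.
SOURCE_SUBNET_CIDRS = ["10.1.10.0/24", "10.1.11.0/24"]

# Hypothetical network interfaces in the us-east-1 VPC.
INTERFACES = {
    "orders-service": "10.1.10.15",  # the source we actually want to allow
    "batch-worker": "10.1.10.77",    # unrelated service in the same subnet
    "admin-tool": "10.1.42.9",       # unrelated service in a different subnet
}


def admitted_by_cidr_rule(cidrs, interfaces):
    """List every interface whose address falls inside one of the allowed CIDRs."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    admitted = []
    for name, address in interfaces.items():
        ip = ipaddress.ip_address(address)
        if any(ip in network for network in networks):
            admitted.append(name)
    return sorted(admitted)


ADMITTED = admitted_by_cidr_rule(SOURCE_SUBNET_CIDRS, INTERFACES)
print(ADMITTED)  # ['batch-worker', 'orders-service']
```

The rule was written for orders-service, but batch-worker is admitted as well simply because it shares a subnet. A rule that referenced the source's security group would not have this side effect.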
Recently, Evan and I were at the AWS Summit in New York where we attended a session on AWS Cloud WAN. Cloud WAN has the same limitation, and security group rules cannot reference other security groups across Cloud WAN attachments.
While there, we met and had a great conversation with an exceptional solutions architect at AWS. We had an extensive discussion with her about the limitations of security groups in particular.
The architect in question seemed unfamiliar with the methodology of using security groups to form a service mesh. She clearly understood the methodology’s merits, however, and spent time discussing some of the underlying factors that make the prospect tricky.
My takeaway from the conversation was that security group rules that reference a source security group were initially designed to operate within a single VPC. AWS has extended this concept to intra-region VPC peering connections, but has not yet managed to apply it to transit gateways, cross-region VPC peering connections, or Cloud WAN because of the technical challenges involved.
It is my hope that this re:Invent changes that, and we can begin to build zero-trust networks that span the globe on AWS.