How an Azure migration kicked my butt
A less than epic tale of moving a hybrid-tenant application from AWS to Azure
Only a year and a half ago, I led the effort to build a backend for a hybrid-tenant solution for one of our customers on AWS.
The idea was simple: bin-pack application micro-containers on Amazon EC2 instances, and schedule them with ECS. One application container per tenant. Clone a prepared seed database to bootstrap the application and place it in a shared database cluster to save on infrastructure costs. Maintain an inventory of tenant state in a DynamoDB table. Inject an email approval workflow in the middle. Orchestrate all steps and gain visibility into tenant deployments with Step Functions.
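To make that concrete, here is a minimal sketch of what kicking off a tenant deployment looks like at the orchestration layer; the table name, state machine ARN, and field names are illustrative rather than our actual ones.

import json

import boto3

dynamodb = boto3.client("dynamodb")
sfn = boto3.client("stepfunctions")


def start_tenant_deployment(tenant_id, state_machine_arn):
    # Record the tenant in the inventory table (table name and schema are illustrative).
    dynamodb.put_item(
        TableName="tenant-inventory",
        Item={"tenant_id": {"S": tenant_id}, "status": {"S": "PENDING"}},
    )
    # Hand the rest of the workflow (database clone, approval email, ECS scheduling)
    # to the Step Functions state machine.
    return sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=f"deploy-{tenant_id}",
        input=json.dumps({"tenant_id": tenant_id}),
    )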
It seemed simple, and it was relatively simple. There were a couple of hoops to jump through in terms of how we would fit tons of containers onto an EC2 instance with only a few network interfaces, but that’s a story for another time.
So we defined multi-tenant infrastructure code templates and deployed them. We defined single-tenant infrastructure code templates, created a deployment CLI for long-running tasks, and dockerized it. We defined a set of Step Functions state machines to handle the data integrations, decisions, and inventory management for these deployments.
When we built this pipeline we came up with a plan, executed it, and it worked without any major wrenches.
A couple of months ago, we got an interesting request. Certain tenants were dead-set on dedicated hybrid-tenant environments in Azure and GCP, and were willing to pay to make it happen. The team and I are not entirely unfamiliar with Azure and GCP environments, but we hardly have the depth and breadth of experience there that we do with AWS.
The remainder of this blog post is dedicated to five primary goals:
Detailing the plan to deploy this application to Azure.
Describing the roadblocks that we ran into along the way arising from Azure limitations.
Describing how we adapted to said roadblocks.
Describing what changes I would make if I were to start from scratch.
Reflecting on the merits of my general experience building non-trivial Azure infrastructure.
The Plan
The plan is simple, but there are still a few kinks to work out. Azure environments will consist of a VNet with one subnet for the databases and another shared by the tenant applications.
We still need to decide where to host the container without ECS available to us.
AKS is always an option, but I like solutions that just work, and Kubernetes always feels like extra work to manage, especially if you want out-of-the-box monitoring, dashboards, and logs.
Azure Container Apps was still in beta at the time and did not yet have Terraform support. At a glance, Azure Service Fabric looks kind of similar to Kubernetes, and I'm looking for something that doesn't require me to manage underlying VMs. Azure Container Instances doesn't autoscale or manage SSL certificates.
One of our engineers discovers Azure App Service, which looks very close to what we are after. It gives us a user experience similar to AWS App Runner and should be super simple to get up and running with.
The Execution
An engineer on my team at Foresight Technologies writes some Terraform scripts, spins up a VNet, and deploys Azure Database for MySQL servers. He then deploys an App Service along with a regional VNet integration to allow it to reach the database, pushes an image to ACR, and seeds the database. The App Service runs fine, seems stable, and is super responsive.
Well, that was easy. All that remains is consolidating some of the code, adding some monitoring configuration, and extending the existing deployment pipeline to allow deployments to Azure environments. (Or so we think, as you will discover if you read on.)
It is with a single successful tenant deployment that our engineer takes paternity leave. It is also at this point that I pick up the project and the fun truly begins.
Extending the Existing Pipeline
Unfortunately, some organizational constraints make extending the existing pipeline a little tricky. You see, the customer requires that no data from any of the tenant databases leave Azure, and the deployment pipeline performs operations on the data inside those databases. It is therefore not enough to simply set up a site-to-site VPN to Azure; we need to run database operations from within Azure.
In AWS we ran these operations on the ECS cluster using the RunTask API action from Step Functions. In Azure we have to decide where to run them.
A Bad Idea
My plan is to run database operations inside Azure Container Instances. After all, you can set up container instances that don't self-heal. So the idea is: start a container instance, let it perform the work, and then let it die.
As I begin executing this plan, I realize that it's doomed to failure. You see, while ECS allows you to specify dynamic container overrides, like Docker command strings, when you run a task, Azure Container Instances has no way of passing such overrides. This means I need a separate container instance for every deployment pipeline execution.
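For comparison, this is roughly how the AWS side selects the work to run per execution: the same task definition is reused and the command is passed as a container override. A minimal boto3 sketch, with cluster, task definition, and container names made up for illustration:

import boto3

ecs = boto3.client("ecs")

# The override below is the feature Azure Container Instances lacks: the ability to
# pass a per-run command to an already-registered container definition.
ecs.run_task(
    cluster="tenant-ops",
    taskDefinition="db-operations",
    launchType="EC2",
    overrides={
        "containerOverrides": [
            {
                "name": "db-operations",
                "command": ["deploy-cli", "seed-database", "--tenant", "acme"],
            }
        ]
    },
)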
Investigating further, I discover that Azure has a quota limiting the number of Container Instances I can deploy. With many deployment pipeline executions, I do not want to have to garbage-collect every ephemeral container instance that gets spawned.
Azure Container Instances was clearly not designed for this, and my attempt amounted to an abuse of the service.
After additional investigation, short of spinning up an AKS cluster, I can find no way to run an event-driven, ephemeral task with transient Docker containers in Azure. In fact, most event-driven patterns in Azure are built on Azure Functions.
A Viable Solution
There are several system dependencies to deal with, but the best option I can think of is creating an Azure Function capable of handling events from the deployment pipeline. I will, of course, have to write an Azure Functions interface for the CLI we built.
Because these functions are long-running and rely on Linux system dependencies, I deploy a Docker-backed Azure Functions application on a more expensive App Service plan to ensure the runtime doesn't die after ten minutes. I build an adapter that uses Azure Service Bus messages to pass requests to the Azure Functions, and the Step Functions API to send output and errors back to the deployment pipeline.
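A stripped-down sketch of that adapter, assuming the Azure Functions Python v2 programming model and the Step Functions callback pattern (a task token forwarded inside each Service Bus message); the queue name, connection setting, and message fields are illustrative:

import json
import subprocess

import azure.functions as func
import boto3

app = func.FunctionApp()
sfn = boto3.client("stepfunctions")


@app.service_bus_queue_trigger(
    arg_name="msg", queue_name="pipeline-requests", connection="ServiceBusConnection"
)
def handle_pipeline_request(msg: func.ServiceBusMessage):
    # Each message carries a Step Functions callback token plus the CLI invocation to run.
    request = json.loads(msg.get_body().decode("utf-8"))
    token = request["task_token"]
    try:
        completed = subprocess.run(
            request["command"], capture_output=True, text=True, check=True
        )
        sfn.send_task_success(
            taskToken=token, output=json.dumps({"stdout": completed.stdout})
        )
    except subprocess.CalledProcessError as exc:
        sfn.send_task_failure(taskToken=token, error="CommandFailed", cause=exc.stderr)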
This was a little more work than I wanted, but apparently running ephemeral Docker tasks is not as common a need as I expected.
I work out a few more kinks in the Step Functions workflow and am elated when the first tenant deploys through the deployment pipeline. Everything just works. I am just about to email the customer to say we are done when I decide to deploy one more tenant, just in case.
Tenant Networking
I watch the state machine's execution graph light up green step after step, until the final service deployment step. And then it goes red. The deployment fails. Only one resource fails to create: the network integration between the new App Service and the existing subnet.
You see, as is abundantly clear from the Azure documentation:
The integration subnet can be used by only one App Service plan.
I am trying to use the same integration subnet with another App Service plan.
My instinct is to share a single App Service plan between multiple App Services. The documentation seems to endorse this idea:
You can have only one regional virtual network integration per App Service plan. Multiple apps in the same App Service plan can use the same virtual network.
It should be a simple fix, but first I search the App Service plan documentation to understand the implications of running all my tenant App Services on the same plan.
It turns out that an App Service plan represents the set of underlying compute instances that run its App Services, and the documentation carries a clear warning about scaling:
In this way, the App Service plan is the scale unit of the App Service apps. If the plan is configured to run five VM instances, then all apps in the plan run on all five instances. If the plan is configured for autoscaling, then all apps in the plan are scaled out together based on the autoscale settings.
If I'm planning on deploying hundreds of Azure tenants, I'm going to need some very expensive computers that I'm not sure my customer or I can afford.
Changing the Network Model
There is only one option left to me: change the networking model and deploy a separate subnet for every App Service. I do a little math: packing /28 CIDR blocks into a /16 block gives me up to 2^(28-16) = 4,096 of them. Azure, I verify, has a hard limit of 3,000 subnets per VNet. I'm very lucky that the number of tenants to be deployed to Azure is on the order of hundreds rather than thousands.
Deployment pipeline users know nothing about networking, and manually keeping track of this network space is going to be painful.
As such, the deployment pipeline needs to take care of CIDR block allocation for the Azure subnets. I write an AWS Lambda function to achieve this. The logic is simple enough.
import logging
import os
from ipaddress import ip_network

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# The subscription id is read from the environment; the variable name is illustrative.
network_client = NetworkManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)


def get_next_available_subnet_cidr(
    vnet_name="dev", cidr_prefix=28, resource_group_name="dev"
):
    vnet = network_client.virtual_networks.get(
        resource_group_name=resource_group_name,
        virtual_network_name=vnet_name,
    )
    # Enumerate every possible block of the requested prefix length inside the
    # vnet's address ranges.
    vnet_ranges = map(ip_network, vnet.address_space.address_prefixes)
    candidate_subnets = (
        subnet
        for vnet_range in vnet_ranges
        for subnet in vnet_range.subnets(new_prefix=cidr_prefix)
    )
    # Materialize the in-use prefixes as a list so they can be re-scanned per candidate.
    cidrs_in_use = [ip_network(s.address_prefix) for s in vnet.subnets]
    for candidate_cidr in candidate_subnets:
        overlaps = False
        for in_use_cidr in cidrs_in_use:
            logging.debug(
                f"Candidate: {candidate_cidr}, In use: {in_use_cidr}, "
                f"overlaps: {candidate_cidr.overlaps(in_use_cidr)}"
            )
            if candidate_cidr.overlaps(in_use_cidr):
                overlaps = True
                break
        if overlaps:
            continue
        return str(candidate_cidr)
    raise ValueError("No available CIDR ranges in vnet")
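For completeness, the Lambda entry point around this helper is just a thin wrapper; the event keys below are illustrative, not our exact contract.

def handler(event, context):
    # Thin Lambda entry point around the helper above; event field names are illustrative.
    return {
        "subnet_cidr": get_next_available_subnet_cidr(
            vnet_name=event.get("vnet_name", "dev"),
            cidr_prefix=int(event.get("cidr_prefix", 28)),
            resource_group_name=event.get("resource_group_name", "dev"),
        )
    }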
I run the Lambda function locally against several test cases. It works better and faster than expected, and finds the next available subnet in each case.
I make changes to the Step Functions state machine DSL and try to deploy the state machine along with the new Lambda function.
The Final Obstacle
The Lambda function deployment process has been working smoothly for the last year, but it explodes now. The AWS API informs me that my unzipped deployment package may not exceed 250MB.
I've only added one dependency, though. The pip package azure-mgmt-network lets me list the subnets in my Azure environment so I can find the next available one. It turns out that this package, along with its dependencies, pushes me over the limit.
I look at my Pipfile and I suppose that there are a few dependencies that might be large, but I am surprised that the azure-mgmt-network package pushed me over the edge.
No problem, I've dealt with large dependencies before. I try to create a Lambda layer for this dependency, but this too fails. Upon investigation, I discover that the unzipped size of azure-mgmt-network and its dependencies is 265MB. This package alone is over the Lambda layer size limit.
This stuns me, and apparently I'm not alone. A GitHub issue, open for over two years now, has over 50 reactions. It turns out the reason is that the pip package bundles every historical API version of the package, and the package metadata references functionality from some assortment of these.
I surrender, and since I already have an Azure Functions application, I move the subnet CIDR determination into Azure Functions. The state machine sends a message to Azure Service Bus, and we're off to the races.
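However the request is produced, the Service Bus side of it is only a few lines with the azure-servicebus package (which, mercifully, is nowhere near 265MB); the connection string, queue name, and payload shape here are illustrative:

import json

from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Connection string and queue name are placeholders for illustration.
with ServiceBusClient.from_connection_string("<service-bus-connection-string>") as client:
    with client.get_queue_sender(queue_name="pipeline-requests") as sender:
        payload = {
            "task_token": "<callback-token>",
            "command": ["deploy-cli", "get-subnet-cidr"],
        }
        sender.send_messages(ServiceBusMessage(json.dumps(payload)))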
Our second, third, and fourth tenants create successfully. I am very glad to put this implementation behind us.
Reflections on my Azure experience
Looking back on this migration, I find myself preferring the development experience on AWS.
For one, I do think that the breadth of features AWS offers makes it easy to find something that fits most use cases I encounter. More importantly, I rarely find myself faced with constraints whose rationale I don't understand. Why should an integration subnet only be usable by a single App Service plan?
None of these are egregious oversights to my mind. They are clearly documented, and had my team and I been more vigilant in our planning, we would have been able to design around them.
There is, however, no excuse to my mind for a network management client package that adds 265MB to a Python deployment package. This one veers more into the negligence category. For that issue to sit open for over two years without any sign of imminent resolution is not something I understand.
There are some magical things about Azure development, I admit. No need to worry about availability zones and NAT Gateway routes. If the magic comes with some of these constraints, however, I will gladly remain an AWS muggle.
I will probably need to do more Azure work like this in the future, but when I do, we'll be using AKS. For all the complexity Kubernetes brings, I rarely run into tooling or platform limitations that are overly difficult to design around.