Exploring the Emerging Cloud Development Tooling Landscape
In-depth thoughts about the past, present, and future of cloud development from my (opinionated) perspective
I’ve been building infrastructure with code and cloud applications since 2016. Over the years, some of what I have built has been deployed to servers, some to container platforms, and some to function-as-a-service platforms. The way my team and I build has evolved a lot, and I anticipate that it will continue to evolve over the next few years. This is an (opinionated) exploration of how we got here and where I believe we’re going.
While I expect the salient points of this article to apply across cloud platforms, I offer the following disclaimer: although I do have experience building on GCP, Azure, and Kubernetes, I have not fully explored the many options available for those platforms. As such, I will focus on AWS in my discussion.
This is a very long blog post, so if the subject matter interests you, sit tight and get comfortable.
The Beginning
When I started building on AWS as a Software Engineer at Audible, I, and everyone else I came into contact with, developed via click-ops. We opened the AWS console and manually configured DynamoDB tables, Aurora Clusters, SNS topics, SQS queues, Kinesis Streams, S3 Buckets, Firehoses, EMR clusters, and Lambda Functions.
When we wanted to replicate the configuration to production or set up a similar set of infrastructure, we would attempt to recreate our clicks a second time in a new AWS account.
I quickly became sick of repetitive clicking around in the AWS console and the resulting misconfigurations that led to production issues. I looked for alternatives.
Infrastructure-as-Code
I discovered CloudFormation and introduced it to our team in 2016.
The only opinion I had about infrastructure-as-code back then was that defining YAML CloudFormation templates to automate my AWS configurations beat the crap out of ClickOps every time.
In November of that year, SAM enabled me to start building serverless applications with more expressive YAML. It felt painless, and I grew excited by how expressive my infrastructure-as-code templates had become. At least until my configuration files grew longer and the resulting YAML became unwieldy. I got fed up with CloudFormation YAML’s verbosity and how difficult it was to debug.
The occasional cryptic failures during my CloudFormation deployments with unclear remediation paths (e.g. UPDATE_ROLLBACK_FAILED) also played a part in my dissatisfaction with the experience.
Developing Serverless Java Applications
In December or January, I found the aws-serverless-java-container package on GitHub while exploring ways to bundle a Spring-based Java application into a Lambda function. The idea was simple: we had applications with very irregular traffic and could reduce our operational footprint by moving an entire Spring-based application inside a Lambda function. We could then transparently reverse-proxy all traffic to that Lambda function from API Gateway.
I could use a lightweight SAM template to deploy a full-blown Java application that scaled elastically to meet demand. It beat the developer experience of Elastic Beanstalk or self-managed EC2 runtimes[1] by miles.
Running full Spring applications inside a Lambda function was not without pain, however. Cold starts were slow (especially due to Spring context builds), and AWS Lambda’s code size and timeout limitations reared their heads too frequently. With provisioned concurrency not yet available, I used CloudWatch event rules to keep Lambda functions warm. P95 and P99 latencies remained high.
These drawbacks frustrated me, but the approach had me thinking about how cloud development experiences should feel.
Terraform and a new Infrastructure-as-code obsession
By mid-2017, I was hooked on the power of infrastructure-as-code and started exploring alternatives to my CloudFormation YAML spaghetti. I quickly discovered Terraform and became obsessed. So much so that I quit my day job as an SDE in a failed attempt to sell reusable infrastructure-as-code modules. I then found my way into AWS Professional Services, where Terraform nonetheless became my new best friend.
Terraform was far more expressive than CloudFormation and had IDE highlighting. Some of the things I loved about it that were not true for CloudFormation back then were:
Support for modules.
A state manipulation API to import existing resources or remove a resource from a terraform stack.
Support for non-AWS providers and resources.
The ability to deploy to multiple AWS accounts and regions in a single stack, enabling an easier AWS multi-account development experience.
Superior documentation, community, and a visible roadmap.
Using terraform to build serverless workloads, however, was painful.
After working with SAM, creating serverless APIs in Terraform felt clunky. For one, zipping and deploying Lambda functions on the fly wasn’t something Terraform was suited for[2]. For another, configurations for API Gateway integrations seemed verbose, and there was no standard library of supported modules I could use to hide this verbosity.
Bridging the serverless Infrastructure-as-Code gap
Having started my career building monolithic RESTful APIs, I wanted the same ease of development along with the lower operational costs of AWS Lambda-based workloads.
I explored the Serverless Framework and found it to be easier to configure and maintain than SAM or Terraform. It also had an interesting plugin ecosystem. The developer experience still felt fragmented, however.
Chalice
After some exploration, I discovered Chalice, an AWS open-source framework for building event-driven serverless applications with Python.[3]
Developing applications with chalice is similar in user experience to building applications in Flask.
A Hello World app in chalice looks as follows:
from chalice import Chalice

app = Chalice(app_name='helloworld')

@app.route('/')
def index():
    return {'hello': 'world'}
This chalice deployment generates an AWS API Gateway endpoint and a Lambda function which handles an incoming request to /, invokes the index() function, and returns a 200 response with {'hello': 'world'}.
In addition to API routes, Chalice can also configure function subscriptions to SQS queues, S3 bucket events, Kinesis streams, EventBridge events and schedules, SNS, and DynamoDB streams.
Having a framework infer infrastructure configuration from application code immediately felt intuitive to me. Over five years later, it still feels good.
The pattern of inferring required infrastructure configuration from application code has recently become known as infrastructure-from-code[4]. We will explore infrastructure-from-code in greater detail after examining the CDK’s and Pulumi’s approaches to infrastructure-as-code.
Expressive cloud infrastructure declaration
It was probably early in 2018 when I started questioning whether Terraform’s HCL was a sufficiently expressive way to configure cloud infrastructure. I found I repeated myself a lot, and the declarative language felt suboptimal. IntelliSense for Terraform in my editor (VS Code) was also not up to par.
Control flow constructs like loops and conditionals (which were frequently necessary for my stacks) felt like second-class citizens in HCL[5], and configuration that could be expressed easily in regular programming languages like Python or TypeScript felt cumbersome, and sometimes infeasible, to build with Terraform.
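To make the contrast concrete, here is the kind of loop-plus-conditional logic that was awkward in HCL at the time, sketched in TypeScript with Pulumi’s AWS SDK (a tool I discuss later in this post); the queue names and retention values are purely illustrative:

import * as aws from "@pulumi/aws";

// One queue per environment, with a conditional override for production --
// a plain map() and ternary in TypeScript, but second-class constructs in HCL.
const environments = ["dev", "staging", "prod"];

const queues = environments.map(
  (env) =>
    new aws.sqs.Queue(`jobs-${env}`, {
      // Keep messages for 14 days in prod, 4 days elsewhere.
      messageRetentionSeconds: env === "prod" ? 1209600 : 345600,
    })
);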
Back then, I believed[6] the best way to resolve these Terraform pain points was to amend the HCL language and improve IDE tooling. I watched the HCL release notes very closely at the time, and the language spec has definitely improved somewhat. As has IDE support.
Pulumi
Toward the end of 2018, another infrastructure-as-code project arrived on my radar. Another engineer I spoke to in the infrastructure-as-code space had become obsessed with the early beta releases of Pulumi.
Pulumi offers an alternative to Terraform that enables users to configure infrastructure in an imperative paradigm using their programming language of choice. The Pulumi deployment engine is then responsible for ensuring that the deployed infrastructure matches the imperatively defined configuration. In this way, some parts of Pulumi operate imperatively and others declaratively.
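To make that model concrete, here is a minimal sketch of a Pulumi TypeScript program (the bucket name is illustrative). The code runs imperatively and registers desired resources; the engine then reconciles deployed state on each pulumi up:

import * as aws from "@pulumi/aws";

// Running this program imperatively *declares* a bucket; the Pulumi engine
// then diffs this desired state against what is actually deployed.
const artifactBucket = new aws.s3.Bucket("artifact-bucket", {
  versioning: { enabled: true },
});

// Stack outputs are resolved by the engine after deployment.
export const bucketName = artifactBucket.id;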
I saw value in this model but was unwilling to experiment. After all, Pulumi was new and had low adoption and little support. I did take notice though. More on this later.
The AWS CDK
In July 2019, AWS announced the AWS CDK. At the time, anyone serious about infrastructure-as-code was using Terraform. Managing infrastructure in YAML or JSON CloudFormation templates didn’t scale, and AWS looked to the CDK as its response.
The initial release announcement blog for the CDK showed 30 lines of infrastructure code defining an API that posted messages to an SQS queue that was consumed by an ECS Fargate service in a VPC. I tried it out, and the user experience and abstractions blew my mind.
I was almost convinced to abandon Terraform in favor of the CDK wherever possible, but three factors kept me using Terraform:
Terraform has a huge array of non-AWS providers that you can use to combine AWS and non-AWS resources into the same stacks.
Deploying infrastructure across multiple regions and AWS accounts was easier with Terraform than with the CDK.
Terraform has a state API as an escape hatch to easily remediate drift. I was attached to the ability to manipulate state files programmatically.
Experimenting with the CDK
In early 2021, I hired a new Principal Engineer who was (and still is) a very strong proponent of the CDK. He introduced the CDK to our team at Foresight, and his interest reignited my own. My discussions with him regarding the tradeoffs of Terraform and the CDK coincided with the announcement of the preview of the CDK for Terraform. My aversion to CloudFormation was no longer a good excuse to avoid the CDK.
I needed to build a small Fargate / SageMaker-based prototype for a customer: the perfect opportunity to try out the Terraform CDK for something non-trivial. My experience building with the Terraform CDK felt intuitive, with a couple of painful exceptions. Back then, the token system was a mess, and I was forced to litter my code with nasty workarounds that I found buried within GitHub issues and, failing that, the source code itself.
Issues with the token system have since been fixed, but cdktf support for the terraform state CLI is still poor. Refactoring Terraform stacks still requires users to understand the generated Terraform code rather than working through the cdktf CLI.
Problems aside, I immensely enjoyed the expressiveness of a full-fledged and mature programming language when defining infrastructure.
Reflections on infrastructure-as-code tools
To make a long story short, I continued to experiment in this space and have developed some opinions:
I currently favor Pulumi
When building greenfield systems, I favor Pulumi. With a state management API, an Automation API that enables triggering Pulumi deployments from code, and its own deployment engine, Pulumi feels more flexible and powerful than either the AWS or Terraform CDK.
In electing not to reuse the Terraform or CloudFormation deployment engines, Pulumi is able to support asynchronous processing in a construct using its Output.apply(callback) pattern. This is impossible with the AWS and Terraform CDKs, as synthesis creates CloudFormation or Terraform templates prior to creating infrastructure.
Pulumi is not without its drawbacks, but it feels like the most flexible infrastructure-as-code option currently. Some drawbacks are:
The community is small (albeit friendly and helpful) and the documentation feels fragmented. It can be very difficult to find the information you need.
The small community leads to a meager construct ecosystem, especially when compared with that of the AWS CDK. Pulumi has attempted to remedy this by building CDK construct interoperability. This interoperability is unfortunately buggy and relies on the AWS Cloud Control API, which does not yet fully support all necessary resources.
Pulumi’s Output<T> types expose a callback interface that feels unintuitive to the uninitiated. Language-native promise support allowing the use of async/await keywords would improve the developer experience tremendously.
Domain-Specific Languages vs Programming Languages
Defining infrastructure using existing programming languages like Python and TypeScript provides a superior experience to domain-specific languages like HCL or CloudFormation YAML. IntelliSense, autocomplete, testing frameworks, package ecosystems, and years of refinement have ensured that these languages are flexible, expressive, composable, and a relative pleasure to work with.
Dynamic vs Static Typing
Statically typed languages with mature type systems like TypeScript make infrastructure-as-code, and cloud development in general, a lot easier than dynamically typed languages like Python. This is especially true because the work is so dependent on modeling relationships and passing data between different cloud services.
Ensuring developers can detect invalid relationships statically using an IDE linter, rather than dynamically at run-time, is hugely impactful. While Python’s type annotations are helpful in this regard, I find that Python developers frequently pass complex data types around as dictionaries, lists, or tuples rather than modeling out these types ahead of time. Additionally, Python’s typing ecosystem does not feel as mature as TypeScript’s to me. See this Hacker News thread for a conversation comparing the two.
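A contrived TypeScript sketch (all names here are hypothetical) of why this matters: modeling the relationship between two resources as an explicit type lets the compiler reject invalid wiring that a dict-passing style would only surface at runtime.

// Hypothetical helper: wiring an SQS queue to a consumer function.
interface QueueConsumerConfig {
  queueArn: string;
  batchSize: number;
}

declare function attachConsumer(config: QueueConsumerConfig): void;

// Valid wiring compiles cleanly.
attachConsumer({
  queueArn: "arn:aws:sqs:us-east-1:123456789012:jobs",
  batchSize: 10,
});

// Both of these fail at compile time rather than at deploy time:
// attachConsumer({ queueUrl: "https://..." });              // wrong property name
// attachConsumer({ queueArn: "arn:...", batchSize: "10" }); // wrong type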
The future of infrastructure-as-code
It is interesting to me that the AWS CDK, the Terraform CDK, and Pulumi all elected to target a large number of programming languages rather than specializing in a single popular statically typed language like TypeScript.
I think a TypeScript-native implementation of an infrastructure-as-code tool would let its designers focus enough to build a much better user experience.
Output<T> types are one example of a sub-optimal developer experience in Pulumi. An intuitive async-await API built on language-native promises would significantly simplify code.
Consider the following Pulumi code allowing traffic from an application security group to a database security group in a separate AWS account, for example:
const appSecurityGroup = new aws.ec2.SecurityGroup(
  "InfraFunctionSecurityGroup",
  {
    description: "InfraFunctionSecurityGroup",
    vpcId: vpcId.value,
  }
);

// Read the database security group's ID from SSM Parameter Store.
const dbSecurityGroupId = aws.ssm.getParameterOutput({
  name: "/db/security-group-id",
});

// Unwrap the Output in a callback to build the cross-account reference.
const dbSecurityGroupRef = dbSecurityGroupId.value.apply(
  (value) => `${dbAccountId}/${value}`
);

const appSecurityGroupId = appSecurityGroup.id.apply((id) => `${id}`);

const allowAppToDb = new aws.ec2.SecurityGroupRule("allowAppToDb", {
  type: "egress",
  fromPort: 5432,
  toPort: 5432,
  protocol: "tcp",
  sourceSecurityGroupId: dbSecurityGroupRef,
  description: "Allow outgoing connections from app to db",
  securityGroupId: appSecurityGroupId,
});
A more succinct Promise-based TypeScript-specific declaration that leverages native promises and async-await semantics might resemble the following:
const appSecurityGroup = aws.ec2.securityGroup(
  "appSecurityGroup",
  { vpcId: vpcId.value }
);

const dbSecurityGroupId = aws.ssm.getStringParameter({
  name: "/db/security-group-id",
});

const allowAppToDb = await aws.ec2.securityGroupRule("allowAppToDb", {
  type: "egress",
  fromPort: 5432,
  toPort: 5432,
  protocol: "tcp",
  sourceSecurityGroupId: `${dbAccountId}/${await dbSecurityGroupId}`,
  description: "Allow outgoing connections from app to db",
  securityGroupId: await appSecurityGroup.id,
});
Of the two, the second example is more concise and intuitive. When describing more complex relationships and dependencies, the benefits compound. Language-native, promise-based infrastructure-as-code configuration flattens code and makes trivial the more complex cases that would otherwise require nesting entire resource declarations within callbacks.
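To see the nesting problem concretely, consider a case where the set of resources to declare depends on a resolved output. With Output<T>, the declarations must move inside an apply callback; a hypothetical async API would keep them in a plain loop. This sketch assumes vpcId and routeTableId are defined elsewhere in the stack:

// Today: resource declarations nested inside the apply callback.
const subnets = aws.ec2.getSubnetsOutput({
  filters: [{ name: "vpc-id", values: [vpcId] }],
});

subnets.ids.apply((ids) =>
  ids.map(
    (subnetId, i) =>
      new aws.ec2.RouteTableAssociation(`appRouteAssoc-${i}`, {
        subnetId,
        routeTableId,
      })
  )
);

// Hypothetical language-native async API: the same declarations, flat.
// const { ids } = await aws.ec2.getSubnets({ filters: [/* ... */] });
// ids.forEach((subnetId, i) =>
//   aws.ec2.routeTableAssociation(`appRouteAssoc-${i}`, { subnetId, routeTableId })
// );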
Cloud dev tool designers Sam Goodwin and Dax Raad discuss a compelling solution along these lines on Twitter[7]:
The next infrastructure-as-code tool I plan on adopting will provide a language-native, statically typed async API to define infrastructure, and will execute lazily.
The gap left by Infrastructure-as-code tools
The composability and flexibility of programming language-based IaC tools like Pulumi and the CDK are integral to cloud development. These tools enable us to create higher-level and reusable infrastructure-as-code constructs and share them using language-native package ecosystems. They are not sufficient, however, and achieving an optimal cloud development experience requires improved abstractions.
Like monolith developers, cloud developers must understand how to build software. Unlike monolith developers, they must also understand:
The purpose of a huge array of cloud services and how to configure them.
How to keep their codebase, made up of both infrastructure and application code, from growing complex and disorganized.
How to test their solutions without waiting 15 minutes between iteration cycles.
Perhaps most importantly, how to structure infrastructure code, application code, and release pipelines to keep blast radiuses contained and to ensure that systems remain stable during releases.[8]
Even with the above expertise, developing cloud workloads with infrastructure-as-code tools and utility scripts leveraging AWS SDKs feels slow and error-prone. A far cry from the magical developer experiences provided by application frameworks like Ruby on Rails, Spring Boot, or Django.
Cloud-development frameworks and Infrastructure-from-code
With a high barrier to entry, the tedium of testing, and the lack of conventions in cloud development, it was inevitable that old frameworks and tools would evolve and new frameworks and tools would emerge to target cloud development pain points and bridge the gap unfilled by infrastructure-as-code.
Cloud-focused enhancements to popular application frameworks
Many AWS services are used to support communication and persistence for business applications. This is true even of applications deployed outside AWS. These applications are frequently built with traditional application frameworks like Spring Boot in the Java world or Rails in the Ruby world.
Libraries and cloud-focused extensions to these frameworks enable first-class support for some key AWS services. For example, Ruby on Rails has been extended to support AWS’s DynamoDB and S3 with Dynamoid and Active Storage’s S3 support, respectively. Similarly, Spring Cloud AWS brings DynamoDB, S3, SES, SNS, SQS, Parameter Store, Secrets Manager, and CloudWatch metrics support to Spring.
Usually, but not always, these libraries and extensions focus on supporting AWS’s data-plane operations rather than control-plane operations[9].
Occasionally, as is the case with the popular JavaScript and TypeScript library dynamoose, these libraries perform control-plane operations like creating database tables. In these cases, I recommend disabling this functionality.
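For example, dynamoose lets you opt out of table creation so that the table’s lifecycle stays with your infrastructure-as-code tool. A sketch against the dynamoose v3 API (the option names have moved between major versions, so check the docs for the release you use):

import * as dynamoose from "dynamoose";

const User = dynamoose.model(
  "User",
  new dynamoose.Schema({ id: String })
);

// Disable dynamoose's control-plane behavior: no table creation, no
// schema updates, and no polling for the table to become active.
new dynamoose.Table("User", [User], {
  create: false,
  update: false,
  waitForActive: false,
});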
Libraries and extensions intended to live in existing applications are ill-suited to managing the operational lifecycle of the AWS resources they create. Instead, these tools should be used in conjunction with a dedicated infrastructure-as-code tool that manages the lifecycle of an application’s dependent cloud resources.[10]
Frameworks for serverless development
Early on, it became obvious that regular infrastructure-as-code tools did not provide an ideal developer experience for serverless workloads. One reason for this is that serverless workloads frequently consist of many cloud resources, including functions, event triggers, IAM roles and policies, and more. With each integration requiring a significant amount of configuration, serverless developers were spending more time configuring infrastructure than writing code.
Tools like the Serverless Framework and SAM emerged to enable more concise configurations for these types of applications. Tools like Chalice and Zappa aimed to unify the development experience further by moving this configuration inside your code using decorators.
All of these frameworks provide developer experiences that enable users to focus on defining and binding event triggers and business logic rather than the underlying cloud resources.
Almost every serverless stack, however, depends on additional cloud resources that live outside the application code. Persistent resources like DynamoDB tables, S3 buckets, and SQS queues are common examples[11].
To address extending serverless applications beyond event triggers, the Serverless Framework and SAM both support extension via CloudFormation YAML. The Chalice CLI supports a --merge-template argument that lets you merge separate CloudFormation files containing these resources into your Chalice deployment bundle[12].
My painful associations with CloudFormation led me to write a set of Python scripts to extract infrastructure dependencies from Terraform stacks and inject them into our Chalice configuration. I performed a similar trick to extract resources (like API Gateway endpoints) created by Chalice and inject them into dependent Terraform stacks. This approach has served us well at Foresight, despite all the upfront Python needle-and-thread work it required.
Frameworks as Infrastructure-as-code Constructs
The cleanest model for managing infrastructure dependencies for a Chalice application, however, emerged in January 2021, when Amazon released a CDK construct for a Chalice app. Suddenly, passing infrastructure configuration into a Chalice application, and extracting infrastructure configuration created by it, became trivial. Here’s a code snippet from the above blog post that illustrates this simplicity:
class ChaliceApp(cdk.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        self.dynamodb_table = self._create_ddb_table()
        self.chalice = Chalice(
            self, 'ChaliceApp', source_dir=RUNTIME_SOURCE_DIR,
            stage_config={
                'environment_variables': {
                    'APP_TABLE_NAME': self.dynamodb_table.table_name
                }
            }
        )
        self.dynamodb_table.grant_read_write_data(
            self.chalice.get_role('DefaultRole')
        )
A DynamoDB table is created with a CDK construct. A Chalice app is deployed with its own construct, which in turn creates a new IAM role. Permissions on the table are then modified to enable the Chalice role to read and write data.
This CDK-Chalice integration blended the best of an expressive infrastructure-as-code tool and a serverless application framework. To the best of my knowledge, Chalice was the first framework to offer this deployment model. It was not the last, however: I am aware of two other frameworks that use this deployment model.
The first of these frameworks is SST, and the second is Eventual. Like Chalice, Eventual is deployed as a construct. SST, on the other hand, exposes constructs to deploy applications built with popular frameworks to the cloud.
Each of these frameworks has more to offer than its integration with general-purpose, programming-language-based infrastructure-as-code tools, however. We will therefore devote some time to each.
SST
SST is a superset of the CDK focused on providing the primitives necessary to seamlessly build serverless applications on AWS. Of the frameworks I’ve worked with, SST represents the best current developer experience for building greenfield serverless applications.
Deploying web applications with SST
SST exposes constructs for building static, NextJS, Remix, Astro, and SolidStart websites. An example of a NextJS site with a custom domain, taken from the SST docs, follows:
import * as acm from "aws-cdk-lib/aws-certificatemanager";
import * as route53 from "aws-cdk-lib/aws-route53";
import * as route53Targets from "aws-cdk-lib/aws-route53-targets";

// Look up hosted zone
const hostedZone = route53.HostedZone.fromLookup(stack, "HostedZone", {
  domainName: "my-app.com",
});

// Create a certificate with alternate domain names
const certificate = new acm.DnsValidatedCertificate(stack, "Certificate", {
  domainName: "foo.my-app.com",
  hostedZone,
  region: "us-east-1",
  subjectAlternativeNames: ["bar.my-app.com"],
});

// Create site
const site = new NextjsSite(stack, "Site", {
  path: "my-next-app/",
  customDomain: {
    domainName: "foo.my-app.com",
    alternateNames: ["bar.my-app.com"],
    cdk: {
      hostedZone,
      certificate,
    },
  },
});

// Create A and AAAA records for the alternate domain names
const recordProps = {
  recordName: "bar.my-app.com",
  zone: hostedZone,
  target: route53.RecordTarget.fromAlias(
    new route53Targets.CloudFrontTarget(site.cdk.distribution)
  ),
};
new route53.ARecord(stack, "AlternateARecord", recordProps);
new route53.AaaaRecord(stack, "AlternateAAAARecord", recordProps);
Note how the certificate is passed into the NextjsSite construct, a path to the application code is specified, and Route 53 alias records point to the CloudFront distribution returned from the NextjsSite construct.
While these frontend frameworks are not cloud-development frameworks per se, they are frequently deployed to cloud resources, including CloudFront distributions, S3 buckets, and Lambda functions.
Expressing application code and infrastructure dependencies in a single code base without complex workflows or spaghetti configuration to deploy changes feels incredibly powerful.
What the SST adds to the CDK’s developer experience
It has only been over the last couple of months that we’ve started building workloads using the SST at Foresight. I have heard only good feedback about the developer experience from our team.
Aside from the set of expressive higher-level constructs the SST provides, the SST also differentiates itself by laser-focusing on the entire serverless developer experience.
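For a flavor of those constructs, here is what a minimal API backed by a DynamoDB table might look like (a sketch based on SST v2’s sst/constructs; the handler paths and names are illustrative). The bind property grants the route handlers access to the table:

import { StackContext, Api, Table } from "sst/constructs";

export function ApiStack({ stack }: StackContext) {
  // A DynamoDB table with a single partition key.
  const table = new Table(stack, "Notes", {
    fields: { noteId: "string" },
    primaryIndex: { partitionKey: "noteId" },
  });

  // An API whose handler functions can read and write the table.
  const api = new Api(stack, "Api", {
    defaults: { function: { bind: [table] } },
    routes: {
      "GET /notes": "packages/functions/src/list.handler",
      "POST /notes": "packages/functions/src/create.handler",
    },
  });

  stack.addOutputs({ ApiEndpoint: api.url });
}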
It provides a project bootstrapping mechanism that creates a best-practice monorepo. It offers intuitive testing support and mechanisms for managing database migrations and secrets, along with a set of excellent local development tools, including IDE support, a local Lambda debugging experience, and a visual console to aid your development.
The SST team’s focus on developer experience has quickly made it one of my favorite cloud development experiences. If you haven’t tried it yet, I’d recommend doing so.
Eventual (and the future of Infrastructure-from-code)
Eventual is the brainchild of Sam Goodwin and Sam Sussman, two former Amazon Alexa engineers. This cloud development framework is still in a closed beta that you can request access to here. Even in beta, this framework offers a working glimpse of what I hope cloud development becomes.
Eventual provides a few simple but very powerful abstractions. The Eventual compiler enables this level of abstraction by running Eventual Service code and having primitives self-register to produce an AppSpec. Eventual then uses esbuild to tree-shake and bundle distributed components. During the bundling process, the Eventual compiler performs some limited transformations on the source code using SWC libraries.
The produced AppSpec is consumed by infrastructure-as-code constructs to understand what infrastructure they must create.
The extent to which Eventual goes to both introspect and bundle service code is no parlor trick. It enables a development experience akin to programming a single computer, with a deployment experience that delivers the reliability and scale of battle-hardened cloud services.
An example Eventual Service
To illustrate the power of Eventual, I will excerpt (with permission) a couple of very simple examples from those provided in the as-yet non-public GitHub repository. Expect some of these interfaces to change before a stable version is released.
An Eventual Service is defined as a TypeScript file without much fanfare:
import { event, activity, workflow, api, HttpResponse } from "@eventual/core";

api.post("/work", async (request) => {
  const items: string[] = await request.json();
  const { executionId } = await myWorkflow.startExecution({
    input: items,
  });
  return new HttpResponse(JSON.stringify({ executionId }), {
    status: 200,
  });
});

export const myWorkflow = workflow("myWorkflow", async (items: string[]) => {
  const results = await Promise.all(items.map(doWork));
  await workDone.publishEvents({
    outputs: results,
  });
  return results;
});

export const doWork = activity("work", async (work: string) => {
  console.log("Doing Work", work);
  return work.length;
});

export interface WorkDoneEvent {
  outputs: number[];
}

export const workDone = event<WorkDoneEvent>("WorkDone");
This service lets you post an array of strings to an API Gateway endpoint at route /work. The API starts a workflow that processes this array of strings and immediately returns an HTTP response, including the workflow’s execution ID, to the caller. The workflow then iterates over each string and computes its length. Once the entire array has been processed, the workflow publishes a WorkDoneEvent with the computed array of string lengths.
This service does not seem all that impressive at first. After realizing that the workflow execution occurs in a different serverless runtime[13] than the API, you might be a little more impressed. Upon realizing that each string’s length is calculated in parallel in a separate serverless runtime, you might be floored. Especially after you realize that each workflow execution and activity result is durable, with exactly-once execution guarantees. I know I was floored.
This simple and monolithic-looking development experience describes a distributed serverless application that can be deployed to AWS using an API Gateway, a Lambda event handler, a durable workflow built on top of Lambda, SQS, and DynamoDB, and an EventBridge event bus.
Defining a similar workflow using AWS Step Functions and other equivalent AWS infrastructure is a lot more difficult. You need to configure a Lambda function construct for doWork, an event bus to publish the resulting event, a step function with correctly configured ASL, and a second Lambda function with an API Gateway integration to trigger the step function. You then need to wire up the environment variables in the API handler function with the step function’s identifier and create a Step Functions client with the AWS SDK before writing the code to trigger a workflow execution. This is in addition to the complex set of permissions that must be defined with IAM roles and policies.
Eventual wires everything up for you with a few elegant primitives, including workflow, activity, event, and api, along with its own cloud application compiler.
Not used in this example, but often very important in workflow design, is Eventual’s signal primitive, which enables workflows to pause and wait for additional information from API handlers and event handlers.
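I have not yet used signal myself, but a hypothetical sketch (the names and exact API shape may differ in the beta) of a workflow pausing for human approval might look like this:

import { signal, workflow } from "@eventual/core";

// Hypothetical: a typed signal that an API handler can send to a running
// workflow execution.
const approval = signal<{ approved: boolean }>("Approval");

export const reviewWorkflow = workflow("review", async (docId: string) => {
  // The workflow durably suspends here until the signal arrives.
  const { approved } = await approval.expectSignal();
  return approved ? `published:${docId}` : `rejected:${docId}`;
});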
A strong focus on statically-typed integrations
Another feature provided by Eventual is support for type-safe, tRPC-like commands that are validated with Zod schemas. An Eventual command might be defined as follows:
export const privateAccess = api.use(cors).use(authorized);

export const listPipelines = privateAccess.command(
  "listPipelines",
  {
    input: z.object({
      beforeTime: z.date({ coerce: true }).optional(),
    }),
  },
  async ({ beforeTime }, { user }) => {
    const query = Pipeline.query.byOwner({
      ownerId: user.username,
    });
    const pipelines = await (beforeTime
      ? query.gte({
          createTime: beforeTime?.toISOString(),
        })
      : query
    ).go();
    return {
      items: pipelines.data,
      nextToken: pipelines.cursor,
    };
  }
);
This example describes an RPC-like interface, with middleware to enable CORS and authorization, for a command named listPipelines. This particular command, which was sent to me by Sam Goodwin (one of the creators of Eventual), accepts three arguments. The first is the name of the command, the second is a Zod schema against which the command’s input is validated, and the third is the function that is invoked when the command is called.
This command can be invoked from a TypeScript frontend, like a React project for instance, via the Eventual ServiceClient documented here. An invocation might resemble the following:
import { useService } from '@/useService'
import { useUser } from '@/useUser'
import { useEffect, useState } from 'react'

export default function PipelineCountComponent () {
  const [pipelineCount, setPipelineCount] = useState(0);
  const { session } = useUser({ redirectTo: "/login" })
  const myEventualService = useService(session)

  useEffect(() => {
    // Effects cannot be async themselves, so wrap the call.
    (async () => {
      const pipelines = await myEventualService.listPipelines({
        beforeTime: new Date("2022-12-01")
      })
      setPipelineCount(pipelines.items.length)
    })()
  }, [setPipelineCount, session, myEventualService])

  return <div>You have {pipelineCount} pipelines</div>
}
Note how the Eventual ServiceClient enables you to use a statically typed abstraction for listPipelines. It takes care of serializing and deserializing the data as it traverses the network, mirroring the developer experience of a local library-method invocation.
These type definitions enable your compiler to alert you to any mismatches between your input and the expected format of the listPipelines function, and your IntelliSense to tell you which methods and input formats are available on myEventualService.
The Zod schema integration also enables Eventual to generate OpenAPI specs for API endpoints defined as commands. This way you can be sure that your OpenAPI spec accurately describes your API contracts.
Eventual does not stop at API validation, however. It extends its type-safe development experience to event-driven systems and enables validation of event structures with zod. In the case of event-driven integrations, Eventual can generate and publish JSON Schema for your event types.
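If you haven’t worked with Zod, the trick that makes this single-definition approach possible is that one schema yields both a runtime validator and a static type. A minimal sketch:

import { z } from "zod";

// One schema definition...
const WorkDoneEventSchema = z.object({
  outputs: z.array(z.number()),
});

// ...yields a static type for the compiler, with no hand-written
// interface to drift out of sync...
type WorkDoneEvent = z.infer<typeof WorkDoneEventSchema>;

// ...and a runtime validator at the service boundary.
const parsed: WorkDoneEvent = WorkDoneEventSchema.parse(
  JSON.parse('{"outputs":[3,5,8]}')
);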
This focus on type safety is a huge deal in a cloud development framework because it abstracts over the network layer entirely. Starting at the user’s browser and ending in backend upstream workflows, your compiler can alert you to broken integrations as you develop locally. Crucially, this speeds up the cloud development feedback loop and helps detect many integration errors early, lowering the probability of them going unnoticed until a production release.
Deploying an Eventual Service
Like deploying a Remix app with SST or a Chalice app with the CDK, an Eventual Service is defined as a construct. Support for a Pulumi construct is in progress[14], too. For instance:
const service = new Service(stack, "Service", {
  name: "my-service",
  entry: path.resolve("services", "my-service.ts"),
});
What if you need your service to know about a DynamoDB table you created in your CDK stack? Just pass in an environment variable like so:
const service = new Service(stack, "Service", {
  name: "my-service",
  entry: path.resolve("services", "my-service.ts"),
  environment: {
    TABLE_ARN: table.tableArn,
  },
});
But Eventual deploys many different isolated runtimes[15]. What if you only want to set the environment variable on one of them? Well, you can use a type-only import to bring the types from your Eventual service definition into your CDK stack. You can then override individual command, subscription, and activity handler properties in your Eventual definition, like this:
import type * as MyService from "@my/service";

const service = new Service<typeof MyService>(stack, "Service", {
  entry: path.resolve("services", "functions", "my-service.ts"),
  commands: {
    myCommand: {
      // set environment variables only on the myCommand function
      environment: {
        TABLE_ARN: table.tableArn,
      },
    },
  },
});
In the same way that the Eventual Service construct accepts configuration from dependencies in your infrastructure-as-code stack, it exports the resources it creates as attributes. You can then pass these attributes as inputs to dependent infrastructure components.
What’s missing from Eventual
Eventual is a really good start, but it is not yet feature complete. Here’s a list of features that many of the workloads I build rely on, but that don’t yet seem easily achievable with Eventual:
Support for exposing WebSocket endpoints
Consuming externally defined events like S3 event notifications, Amazon-native EventBridge events, and events published to SNS topics
Support for batched processing of streams and queues, like Kinesis and DynamoDB streams and SQS queues
Auth middleware that supports OIDC
Support for VPC configuration and private APIs that are inaccessible over the internet
Long-running activities that exceed Lambda’s current 15-minute runtime limit
The ability to provide base-runtime configurations, like Docker images or pre-built AMIs with preinstalled software, for API handlers, commands, activities, and event handlers
Support for transparently mounting POSIX-compliant file systems like FSx for Lustre and EFS, to enable dealing with data in large files and persisting files between workflow activities
WebSocket notifications for state changes during workflow executions and published events
An activity of type notebook that lets you run a specified Jupyter notebook as a step in a workflow
The good news is that the Eventual team is listening closely.
Concluding thoughts on Eventual
I am a huge fan of Eventual and its approach to building distributed cloud applications for three reasons:
Eventual focuses intently on providing a statically typed experience that integrates with your TypeScript toolchain.
Eventual integrates natively with and delegates the lifecycle of infrastructure to your infrastructure-as-code tool.
Eventual exposes an incredibly powerful set of abstractions that are non-trivial to build yourself.
Eventual’s most impressive abstraction represents a product in and of itself. Its workflow abstraction allows you to describe durable, distributed workflows using native TypeScript semantics. The experience is not unlike that of Temporal and similar workflow frameworks.
Eventual provides a far superior deployment experience to Temporal, however. It tree-shakes activities and deploys them, along with the other distributed components necessary to run these workflows, on your behalf. What’s more, the lifecycle of these components is delegated to your preferred infrastructure-as-code engine with a single construct.
Powerful primitives and a laser focus on developer experience make Eventual a formidable framework, capable of expressing both micro-service and event-driven architectures.
Other Emerging Infrastructure-from-code Frameworks
There has been a lot of talk about infrastructure-from-code frameworks, and concluding this article without addressing other contenders in this landscape would not do the topic justice. Several great resources about this space are available, and I recommend consuming them if you are interested in delving further:
Cloud Application Infrastructure from Code by Asher Sterkin[16]
The Unfulfilled Potential of Serverless by Jeremy Daly, the CEO of Ampt.
The Self-Provisioning Runtime by Swyx (Shawn Wang). Note that Swyx backs the Ampt project.
The Current State of Infrastructure-from-code by Allen Helton
Here are some brief initial thoughts about other infrastructure-from-code tools I have surveyed.
Winglang
Winglang’s thesis is that existing programming languages are not sufficient to describe cloud applications. It introduces a distinction between code that is executed during deployment and code that is executed after deployment. Code that executes during deployment is called preflight, while code that executes after deployment, as part of a cloud application’s runtime, is called inflight. In Winglang’s model, control-plane operations are generally preflight, and data-plane operations are generally inflight.
I would not discount Winglang, as it is backed by CDK veterans who are experts at infrastructure-as-code. To me, however, having inflight code deployed by a construct using an infrastructure-as-code tool, as Eventual does, provides a better separation of concerns and user experience than Winglang’s proposed new-language approach.
A new language and development tooling ecosystem also seems unnecessary to me, but I’ll be watching Winglang to see how it evolves. I might be proven wrong yet.
Ampt
Ampt is the successor to Serverless Cloud, and its developer experience looks very smooth at first glance. I’ve added my name to the waitlist, but have not yet managed to get my hands on the private beta. Jeremy Daly’s vision does appear well thought out and cohesive, and some highly competent distributed systems and dev-tools engineers have thrown their support behind this project.
The recent Sessions with SAM & Friends episode by Eric Johnson featuring Jeremy Daly gave a great demo of the Ampt control plane’s simple console UI. Some highlights were live updating of application configuration that propagates to Ampt environments in near-realtime, and an easy-to-use file-management UI that lets you read and modify assets used by your app.
The icing on top, and Ampt’s key differentiator to me, is the high-performance local development experience that transparently syncs changes from your local development environment to your connected Ampt backend. Also especially powerful is the ability to drop its libraries into your favorite full-stack application frameworks like Astro and Remix.
See the conversation here for a demo.
With a recently updated documentation repository, Ampt looks primed for a beta release, and I’m hoping to get my hands on it soon. There are, however, four attributes that make me more tentative about Ampt than Eventual at the moment:
I have not seen any evidence of existing support for durable and distributed workflows in Ampt. This is a non-trivial integration that sets Eventual apart for me.
Early documentation indicates that Ampt is meant to be deployed separately from your infrastructure-as-code stack. This makes me think that integrating with existing infrastructure or external resources will not be seamless.[17]
The underlying infrastructure created by an Ampt project remains a mystery to me, despite my having read its current documentation repo and seen a demo. This stands in opposition to Eventual’s well-organized generated AWS infrastructure, which is easy to grok as you plan and execute your infrastructure-as-code scripts. For AWS control freaks like myself, Eventual’s transparency could give it an edge over Ampt.
Eventual’s focus on an end-to-end type-safe tRPC-inspired development experience feels absent in Ampt.
Caveats aside, my initial impressions of Ampt are very positive and I am looking forward to eventually getting my hands on the beta.
Nitric
Nitric provides abstractions similar to Ampt’s, with more transparency into generated infrastructure, less integration into application frameworks, and slower deployments. Because Nitric is deployed with Pulumi, you can use Pulumi’s tools to visualize your infrastructure.
Because Nitric cannot be deployed from an existing Pulumi infrastructure stack as a construct, integration with the rest of your infrastructure-as-code ecosystem requires more effort than with Eventual. This is done via an external Nitric configuration file, which is an imperfect developer experience for me.
Unlike Eventual or Ampt, Nitric has been publicly available since early December. You can get your hands on it immediately, without joining any programs, and get a feel for what this new emerging paradigm is like. As with Eventual and Ampt, I am excited to see how Nitric evolves to find its place in the infrastructure-from-code ecosystem.
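For a quick taste, a minimal HTTP API in Nitric’s TypeScript SDK looks roughly like this (a sketch based on the shape of the Nitric docs at the time of writing; the route and names are illustrative):

import { api } from "@nitric/sdk";

// Declaring the api resource is what tells Nitric's engine to provision
// the underlying gateway infrastructure at deploy time.
const publicApi = api("public");

publicApi.get("/hello/:name", async (ctx) => {
  const { name } = ctx.req.params;
  ctx.res.body = `Hello ${name}`;
  return ctx;
});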
Klotho and Encore
Klotho and Encore let you annotate your code with comments that follow a defined spec. Their engines then convert your annotated code into the infrastructure and code dependencies necessary to generate a distributed cloud application.
I do not like comment-based annotations from an aesthetic standpoint, and the abstractions feel slightly weaker than Ampt’s or Eventual’s. As such, I have not investigated further.
Shuttle
Shuttle is a Rust-based framework for building cloud applications. It reminds me of Chalice and is very web-app focused. Right now, it seems to offer no support for long-running jobs or event-driven systems. I have not tried it out, but it appears to have a far smaller scope than most frameworks that bill themselves as infrastructure-from-code frameworks.
Modal
I really like the ideas underpinning Modal, which Erik Bernhardsson introduced in December. You develop locally and deploy to Modal’s managed infrastructure. The development experience lets you simply designate sections of code and specify, using Python decorators, the type of infrastructure you want to run them on. Modal takes care of orchestrating these runtimes[18] for you.
I also want to call out Modal’s native support for asynchronous Python APIs, which is welcome in the Python ecosystem, where performance and asynchronous programming are often neglected.
Modal represents an impressive feat of engineering along with a fantastic developer experience, especially for data-science-related tasks. Modal’s main drawback currently is its inability to deploy to infrastructure building blocks that you know and understand, like AWS compute and storage solutions.
I will continue to watch Modal closely and wait to understand its platform reliability better. I am also curious to see how workloads built on Modal are securely integrated into existing cloud environments.
Concluding Thoughts
I am very optimistic about the trends we are seeing in cloud-development tooling. Emerging frameworks seem to be language-specific, and focused on identifying the right abstractions necessary to build distributed cloud applications.
My expectation is that we will not be able to dispense with infrastructure-as-code in favor of infrastructure-from-code. The two paradigms are going to evolve independently of one another, and infrastructure-as-code tools will deploy infrastructure-from-code frameworks going forward.
Infrastructure-as-code languages will likely move away from DSLs like HCL to statically typed, tried-and-tested programming languages like TypeScript. I also expect these to eventually become lazy and provide good async programming models.
Meanwhile, applications and business logic will live in purpose-built frameworks. They will integrate natively with your infrastructure-as-code engine, mirroring the integration models of the Chalice construct for the CDK, SST’s constructs for deploying modern web frameworks, and Eventual’s native infrastructure-as-code construct deployment model.
For those with a background in building Spring applications in Java, having code that generates a set of static dependencies into a context, and business logic that dynamically consumes these dependencies during application run-time is a familiar concept. We are moving to a world where these static contexts are distributed and built by infrastructure-as-code tools. Our business logic will still be defined simply and concisely, built within infrastructure-from-code frameworks, and integrate seamlessly with this distributed infrastructure context.
With better abstractions on the horizon and a continued focus on refining development-iteration time, the future of cloud development tooling looks bright.
1. I talk about runtimes a lot in this blog post. In this context, when I use the term runtime, I am referring to a set of logically isolated compute resources dedicated to a specific task. A runtime might be an EC2 instance, a Fargate task, a Lambda function, a Modal container, a CodeBuild job, or any other location you might want to run your code.
2. See this old GitHub issue for more info.
3. As an aside, 6 months prior to AWS launching Chalice, Rich Jones created Zappa. Zappa is a framework with an incredibly similar API to Chalice and is, in some ways, more configurable. At the time, knowing little about either framework, I elected to use Chalice because of its friendlier documentation.
4. As far as I am aware, Jeremy Daly’s team coined this term to describe the Serverless Cloud framework.
5. HashiCorp Configuration Language, or HCL, is the domain-specific language used to define Terraform configuration.
6. At the time, I strongly believed that infrastructure configuration needed to be declarative. After all, rerunning an infrastructure deployment stack must ensure existing infrastructure conforms with the desired configuration. It should not create an entirely new set of infrastructure.
7. It is probably no coincidence that Dax and Sam work on two of my favorite dev tools, SST and Eventual. They are deeply focused on developer experience and seem hyper-aware of many pain points that I experience.
8. Anecdotally, a lot of the time my team and I spend with existing cloud development and SRE setups is targeted at reducing deployment blast radiuses and simplifying deployment processes.
9. If you are unfamiliar with how AWS architectural guidance divides services between data-plane and control-plane operations, I highly recommend the linked section in the AWS whitepaper, AWS Fault Isolation Boundaries.
10. Infrastructure is often mutable and stateful. One must understand potential blast radiuses at deployment time, and infrastructure-as-code tools let you create plans and map out deployment blast radiuses ahead of time. This lets cloud developers enact safe deployment processes and practices.
11. Some mostly serverless applications might also require server-dependent components, like an ElasticSearch cluster, or NAT gateways and network firewalls to keep traffic from traversing the internet. These resources certainly don’t belong in a serverless framework.
12. Zappa has no built-in answer to its infrastructure-as-code gap that I am aware of. Any infrastructure that a Zappa app depends on must be created outside of it.
13. See footnote 1 for what I mean when I say runtime.
14. Sam Goodwin has an open pull request for Pulumi support. This is visible on the GitHub repository that is currently only open to members of the beta program.
15. See footnote 1 for what I mean when I say runtime.
16. Just a short note to say that I find Asher Sterkin’s writing about infrastructure-from-code to be especially comprehensive.
17. Another potentially large issue with this decision is that Ampt documentation indicates that it manages the control planes for persistent resources with its storage and data APIs. I think these components should not be created or managed by application source code; they are stateful, and their lifecycle should be delegated to an infrastructure-as-code tool.
18. See footnote 1 for what I mean when I say runtime.