Organizing and Integrating Distributed Processes with AWS
Part 2 of 2: A comprehensive guide to communication in distributed systems with AWS
Previously, we identified and explored a set of components that enable us to communicate from a source to a destination using push-based and pull-based communication models. We explored application servers, load balancers, message queues, and data streams.
This second installment is designed to be consumable independently of the first and focuses on identifying a second set of components including message topics, event buses, and workflow engines. It explores how these components can help us operate and scale increasingly complex integrations. We examine how these building blocks can be used to achieve reliable, observable, flexible, and performant distributed processes.
Hopefully all without boring you senseless.
We will examine three patterns that are more complex than communicating from a source to a destination:
Message fan-out
Rule-based conditional routing of messages across a set of consumers
Orchestrating and choreographing complex workflows
Event-driven choreography
Engine-driven orchestration
Note that these topics are exclusively related to asynchronous, pull-based processes. We will therefore not cover service meshes or other components that are designed solely with push-based distributed service integrations in mind.
Tackling messaging complexity in human systems
Sending a message from a source to a destination can be thought of as passing a note in middle school. (I haven’t been in middle school for a while, so things may have changed. Either way, dinosaur that I am, I will proceed with this analogy.)
You have a note that you want to send to your buddy. If your buddy’s desk is next to yours, you just pass it along. If you’ve got good aim and you know where your buddy is sitting, you lob it over when the teacher is not looking. This method is both less reliable and less secure. If your note isn’t very private or you trust your neighbor not to look, you might write your buddy’s name on the cover and pass it off to your neighbor for delivery. This method also has reliability and security implications.
But let’s say you wanted to invite everyone in the class to your birthday party on Sunday. You might write a note and have everyone pass it around the class. Of course, your buddy would probably get mad at you if the note made its way to everyone in the class except them.
A better approach might be to have a class bulletin board. You put your invitation up on the bulletin board, and your buddies check the bulletin board to see if any new events are posted for the class. This way, if your buddy doesn’t see the birthday invitation, they have only themselves to blame.
This approach fails in several scenarios. What if you want your buddies to attend your party, but your buddies are not very diligent about checking the bulletin board? What if you want to selectively invite a subset of the class?
And thus emerged the practice of personalized party invitations.
You make a list of everyone you want to invite to your party. You create a separate invitation for each person on the list. You deliver these invitations to everyone you want to invite. You check off when you’ve delivered the invitation and you check off RSVPs you receive.
The notepad approach is manageable for a small party with a short guest list. As the party grows, however, so does the need for more complex tools, and your notepad quickly evolves along with the scale of your problems:
Your invitation list becomes longer. You start needing to keep track of gifts you receive. You need to collect RSVP counts for catering reasons. When delivering all the invitations in person is no longer feasible, you need to record and keep track of mail addresses. You need to send out reminders when you haven’t heard back.
For my bar-mitzvah and my sisters’ bat-mitzvahs, my father kept spreadsheets. My wife and I used the same spreadsheets for our wedding.
Message complexity in human systems and message complexity in distributed systems are very similar.
Tackling messaging complexity in distributed processes
Message Fan-out
Akin to the scenario of wanting to broadcast an invitation to everyone in your class, systems frequently need to broadcast information to a set of consumers. As such, engineers have built messaging primitives to enable pub/sub or publish/subscribe capabilities. In AWS, the simplest publish primitive is provided by Simple Notification Service or SNS. SNS enables you to create message topics to which you can publish messages.
Each message topic represents a bulletin board-like construct to which publishers can deliver messages. Messages can be posted to this bulletin board, and different consumers can subscribe to them. Subscribers can be Lambda functions, SQS queues, email addresses, or text messages. For an exhaustive list of subscriber types, see the AWS documentation.
A single SNS topic allows up to 12.5 million subscriptions. As such, if you need to take a single message and fan it out to a lot of consumers, SNS is a pretty good bet. RabbitMQ and ActiveMQ let you define an unlimited number of subscriptions per topic, but since you’re in control of the infrastructure, you bear the responsibility of scaling it to meet the demand.
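To make the fan-out model concrete, here is a minimal in-memory sketch of the topic/subscription pattern that SNS implements. This is plain Python with no AWS calls, and the topic and handler names are illustrative:

```python
class Topic:
    """A tiny in-memory stand-in for an SNS topic."""

    def __init__(self, name):
        self.name = name
        self.subscribers = []  # callables invoked for every published message

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        # Like SNS, deliver a copy of the message to every subscription.
        for handler in self.subscribers:
            handler(message)


received = []
topic = Topic("ProductPurchaseNotifications")
topic.subscribe(lambda m: received.append(("email", m)))
topic.subscribe(lambda m: received.append(("inventory", m)))
topic.publish({"productId": "p-1"})
# Both subscribers receive the single published message.
```

The real service adds durability, retries, and cross-service delivery on top of this basic shape, but the publisher still never knows who is subscribed.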
AWS does provide a managed Amazon MQ solution that lets you operate RabbitMQ and ActiveMQ clusters, but you need to interpret metrics and determine scaling policies yourself. This will usually add more operational complexity to your solutions than architecting with SNS limitations in mind.
Rule-based conditional routing of messages across a set of consumers
Frequently, when you fan out messages in distributed systems, you might not want to fan all messages out equally.
Assume, for instance, that you have a topic called ProductPurchaseNotifications in a web store. Now, assume you are selling both physical and digital products from the same storefront. You might have an upstream system responsible for ordering new inventory which is very interested in physical product purchases but is not interested in digital products at all. You might also have a royalty platform that is only interested in digital products. You’ll also probably have an email system that is indiscriminate and wishes to receive all product purchase notifications.
It is, of course, possible to deliver all product purchase notifications to all consumers and let them discard what they are not interested in, but this isn’t a very efficient solution. It results in a more congested network and resource waste.
Many pub/sub systems have therefore implemented filtering capabilities in the subscription layer. In fact, as one of the pre:Invent announcements in November 2022, Amazon added support for payload-based message filtering in SNS. Before this, more rudimentary metadata-based message filtering was already available, but as AWS Serverless Hero Yan Cui noted on Twitter at the time, this is a big deal.
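To illustrate the idea, here is a deliberately simplified sketch of how a filter policy decides whether a subscription receives a message. Real SNS filter policies support many more operators (prefix matching, numeric ranges, anything-but, and so on), and the filter names below are made up for this example:

```python
def matches(filter_policy, payload):
    """Simplified SNS-style filtering: every policy key must be present
    in the payload with a value in the allowed list. (An empty policy
    here means "receive everything" -- a simplification for the sketch.)"""
    return all(payload.get(key) in allowed
               for key, allowed in filter_policy.items())


inventory_filter = {"productType": ["physical"]}  # inventory system subscription
royalty_filter = {"productType": ["digital"]}     # royalty platform subscription
email_filter = {}                                 # indiscriminate email system

purchase = {"productType": "digital", "productId": "ebook-42"}
# The royalty platform and email system receive this purchase;
# the inventory system does not.
```

The point is that the filtering decision moves into the subscription layer, so the digital purchase above never even reaches the inventory consumer.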
When Yan mentions EventBridge, he is referring to Amazon’s other service for implementing pub/sub patterns.
EventBridge vs SNS
Unfortunately, in the world of AWS, there is no neat separation of responsibilities between services. That is to say that EventBridge and SNS definitely have some level of overlap.
Like SNS, EventBridge enables you to create an endpoint to which producers can publish messages. EventBridge calls this message topic-like construct an Event Bus. These constructs are similar but not identical.
SNS Merits
We mentioned earlier that you can create 12.5 million subscriptions for each SNS topic. In EventBridge, instead of creating subscriptions you create rules. Each rule can deliver to up to 5 destinations, and you can have up to 300 rules associated with an event bus. Even if you max out your rule configurations, an EventBridge bus will never achieve even 1/1000th of the fan-out of an SNS topic.
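The arithmetic behind that claim, using the quota numbers above:

```python
# Fan-out ceiling comparison using the quotas discussed above.
sns_subscriptions_per_topic = 12_500_000
eventbridge_max_targets = 300 * 5  # 300 rules per bus x 5 targets per rule

ratio = eventbridge_max_targets / sns_subscriptions_per_topic
# 1,500 / 12,500,000 = 0.00012 -- well under 1/1000
```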
Luc van Donkersgoed also measured SNS as roughly four times faster than EventBridge when triggering a Lambda function.
Lastly, as Yan mentions further in the same Twitter thread, unlike EventBridge, SNS also has the ability to guarantee ordered delivery of events.
EventBridge Merits
But SNS is not superior to EventBridge in all cases. In fact, the increased latency of EventBridge is likely due to some of the capabilities that EventBridge offers that are not offered by SNS.
Yan provides a couple of examples in another tweet in the above thread:
You see, in addition to enabling pub/sub architectures, EventBridge aims to provide all the necessary functionality for choreographing event driven processes. Later on in this blog post we will dive deeply into what choreographing event driven processes entails, but for now let’s just understand how EventBridge supports a few features that help with this.
As Yan notes, in distributed workflows it might be necessary to archive and replay events in the case of system failure. EventBridge offers both as options, while SNS does not. It also lets you target a far more diverse set of upstream systems than SNS does.
I do want to mention one more important benefit of EventBridge not mentioned by Yan. If your upstream inventory system has an API requiring a productId and quantity as input, and your email system also requires the purchaser’s userId, EventBridge lets you use the same source message to invoke both upstream services.
When defining an EventBridge rule, engineers can map and transform the input message into the exact format each specific consumer expects. This property enables consumers to expose simple, minimal endpoints that need not be aware of producer message formats.
While SNS message producers must publish messages in the same format that consumers expect to receive those messages, EventBridge’s input transformation enables integration to be handled without code changes in either the consumer or producer. (This assumes, of course, that the producer provides all of the data required by the consumer.)
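To show what per-rule reshaping buys you, here is a plain-Python simulation of the idea. Actual EventBridge input transformers are configured declaratively with an input-paths map and an input template; this sketch, with made-up event fields, just demonstrates the effect of one source event producing two differently shaped inputs:

```python
# One source event, reshaped per rule -- roughly what an EventBridge
# input transformer does with an input-paths map and template.
purchase_event = {
    "detail": {"productId": "sku-123", "quantity": 2, "userId": "u-9"},
}


def transform(event, paths):
    """Pick named fields out of the event detail, like an input-paths map."""
    detail = event["detail"]
    return {name: detail[field] for name, field in paths.items()}


# The inventory API only wants productId and quantity...
inventory_input = transform(purchase_event,
                            {"productId": "productId", "quantity": "quantity"})

# ...while the email system also needs the purchaser's userId.
email_input = transform(purchase_event,
                        {"productId": "productId", "quantity": "quantity",
                         "userId": "userId"})
```

Neither consumer ever sees the producer’s full event shape, which is exactly the decoupling described above.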
EventBridge + SNS
In the rare cases where you require the input transformation of EventBridge and the fan-out of SNS, the two can be used in conjunction. You can send a message to an EventBridge bus, and use a rule to send it to an SNS topic.
Unfortunately, this does not help with latency or delivery order limitations of EventBridge, but it does help alleviate the fan-out limitations.
Other Pub/Sub Services
Two other managed services capable of running pub/sub workloads on AWS fall outside the scope of this blog post, but I want to quickly touch on them.
AWS IoT
AWS IoT Core provides MQTT topics. While these topics are designed for IoT devices to send data to AWS and then process this data with a rules engine, it is possible to (ab)use these topics for general service communication and pub/sub.
AWS IoT Core Topic Rules, like EventBridge rules, can target a wide variety of destinations. They also support input transformations using a SQL-like syntax.
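For a flavor of that syntax, a topic rule statement follows a SELECT/FROM/WHERE shape; the topic name and fields below are hypothetical:

```sql
SELECT productId, quantity FROM 'store/purchases' WHERE productType = 'physical'
```

Here the FROM clause is an MQTT topic filter, and the SELECT clause doubles as a lightweight input transformation.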
I have not investigated the drawbacks to (ab)using AWS IoT as an EventBridge alternative outside of the IoT space, but I intend to do this at some point.
AWS MSK
In MSK, AWS provides a managed solution for Apache Kafka. Kafka is capable of pub/sub but is mostly intended as a data streaming platform, and it is probably overkill if all you are looking for is a pub/sub solution. We will therefore leave it out of the scope of this discussion.
Orchestrating and choreographing complex workflows
At this point, we have identified some powerful components AWS provides us to help manage service integrations in distributed processes.
Event Driven Choreography
Using EventBridge and SNS we can choreograph incredibly complex processes. Process choreography is like dance choreography. Like a dancer in a dance, every service in a process waits for its cue before performing its role and then exiting the stage. Each service only needs to understand the events that it reacts to, and no service need know about any other services in the workflow. Services are therefore said to be loosely coupled to each other.
We achieve this loose coupling by having each service publish important state changes to a central event bus. Each service therefore needs to know only about the state changes it performs; it need not know about any consuming service. When consuming events, integrators can look to the central event bus to define inputs for event consumers and need not be aware of event producers.
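A toy central event bus makes the choreography pattern tangible. In this sketch (all service and event names are illustrative), the order service only knows it emits an OrderPlaced event; the services that react to it are wired up independently:

```python
# A toy central event bus: services publish state changes and react to
# events without knowing about each other.
handlers = {}  # event name -> list of reacting services
log = []       # what actually happened, in order


def on(event_name, handler):
    handlers.setdefault(event_name, []).append(handler)


def publish(event_name, detail):
    log.append(event_name)
    for handler in handlers.get(event_name, []):
        handler(detail)


# The inventory service reacts to OrderPlaced and publishes its own state change.
on("OrderPlaced", lambda d: publish("InventoryReserved", d))
# The email service also reacts to OrderPlaced, independently.
on("OrderPlaced", lambda d: log.append(f"email:{d['orderId']}"))

publish("OrderPlaced", {"orderId": "o-1"})
```

Adding a new consumer is just another `on(...)` call; no existing producer or consumer changes. That low friction is precisely the double-edged sword discussed below.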
Dangers of Event Driven Choreography
Without very careful planning, however, choreographed processes can pose some daunting operational challenges, especially as the dependencies in the process grow in number and complexity.
Here are some examples of the type of operational challenges I am referring to:
You notice some incorrect data in the output of your system. Your system consists of lots of producers and lots of consumers. How do you work out the source of the data issue?
A process which was working before is no longer working. How do you trace which services have failed?
How do you discover that a process that was working is no longer working? Without the right monitors in place and without a central actor making sure that each step in the process is invoked and succeeds, you might only discover that your system is broken when it is too late to recover gracefully.
A producer wishes to deprecate an event. How do you ensure that the event is no longer in use?
How do you keep track of all the different kinds of messages that are available to consume? How do you keep track of the structure of these new messages and enable integrators to effectively integrate?
To help make sense of these choreographed and dynamic processes, distributed tracing solutions like Zipkin, Jaeger, and AWS X-Ray have emerged. EventBridge integrates with AWS X-Ray, which dynamically stitches the steps of a process together into visualizations that make the process easier to understand.
Distributed tracing solutions alongside rigorous monitoring and alerting implemented throughout event consumers and producers can operate together to form a rigorous observability program and help address the first three challenges mentioned.
With regard to the remaining two challenges, EventBridge offers a schema registry with a schema discovery solution to enable the dynamic cataloging of events that are present in your system.
To me, aside from the challenges involved in implementing a comprehensive observability program, there is another, perhaps greater, hazard in choreographed processes: they tend to become very complex very quickly.
Because consumers can start consuming events with very little friction, the overall distributed process grows organically. This property is powerful in that it enables choreographed processes to be extended easily. It is also dangerous in that organic growth of processes is chaotic. Without engineers constantly planning and centrally managing processes and ensuring processes remain as simple as possible, entropy will prevail.
I cannot profess to be a follower of Tony Robbins, nor can I vouch for the efficacy of his advice, but if he has ever said something true it is this:
The more complex a system is, the harder it is to secure, scale, extend safely, or fix if it breaks.
Consider the event planner who publishes invitations to a set of invitees and needs to collect RSVPs.
If 50% of the invitation list doesn’t RSVP in time, the planner risks over-catering or under-catering when the time comes to submit catering numbers. This is a consequence of not planning adequately up front. Had this concern been top of mind at the outset, the event planner could have sent RSVP reminders ahead of time and reduced the catering uncertainty.
If the caterer asks for special dietary requirements prior to the party, the event planner might need to scramble to collect this information. Had this been top of mind at the outset, the event planner could have collected this information on the RSVP card.
Compared with many distributed processes, party planning is simple. Yet even here it quickly becomes apparent that, without adequate planning, surprises make event planners’ lives more difficult.
I am not claiming that it is impossible to build a reliable, minimalist, and simple choreographed large distributed system. I am saying that I have never seen one.
Engine Driven Orchestration
Workflow engine orchestration is an alternative to event driven choreography that helps alleviate some of the challenges posed by choreographed systems.
Workflow engines operate by requiring engineers to build their processes explicitly in a central system. This system then becomes responsible for invoking each step of the distributed process with the appropriate input at the appropriate time and collecting any output for use in subsequent steps.
The workflow engine resembles the conductor of an orchestra, standing in front of the musicians and cueing when and how they perform their roles, with the goal of producing wonderful music. The workflow engine likewise tells a set of services when and how to execute the different steps that make up the process, with the goal of ensuring the process completes successfully and completely.
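A toy engine captures the essence of the orchestration model described above: the process is defined explicitly in one place, and the engine invokes each step, feeding outputs forward and recording everything. This is an illustrative sketch, not how any particular engine is implemented:

```python
# A toy workflow engine: the process is defined explicitly and centrally,
# and the engine records every step's input and output.
def run_workflow(steps, initial_input):
    history = []  # per-step record -- this is what makes debugging easy
    data = initial_input
    for name, step in steps:
        try:
            output = step(data)
        except Exception as exc:
            history.append({"step": name, "input": data, "error": str(exc)})
            raise  # a real engine would raise an alarm and/or retry here
        history.append({"step": name, "input": data, "output": output})
        data = output  # each step's output feeds the next step
    return data, history


# The whole process is visible in one place, in order.
steps = [
    ("reserve_inventory", lambda d: {**d, "reserved": True}),
    ("charge_payment", lambda d: {**d, "charged": True}),
]
result, history = run_workflow(steps, {"orderId": "o-1"})
```

Contrast this with the choreographed version: here the process definition is a single data structure you can read top to bottom, and `history` gives you the status of every execution for free.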
Forcing engineers to explicitly map out the process up front has two main benefits:
It forces engineers to pay attention to the entire process from beginning to end and gives them an opportunity to consider edge cases at the process level.
It gives engineers the opportunity to simplify complexity as it is introduced into the system.
Having a central engine conducting the process and saving the inputs and outputs of every step has other great benefits:
It becomes easy to understand the status of every process execution in real time.
Bugs become easy to find by inspecting the centrally defined process and the input and output of each step.
In case of a service outage or an unexpected step failure, the engine can raise an alarm making it far easier to implement an observability program.
For some additional reading material about monitoring and managing distributed workflows, I recommend the work of Bernd Ruecker, a co-founder of Camunda (a very popular workflow platform), starting with the excellent essay that first sent me down this rabbit hole several years ago.
When you wish to build distributed processes on AWS, I recommend using AWS Step Functions. Step Functions lets you define workflows in a visual editor or using Amazon States Language, a JSON-based DSL for defining expressive workflows as state machines.
Step Functions can invoke AWS services as steps, make decisions, run steps in parallel, iterate over lists (performing a step for every item), and transform data between steps using intrinsic functions. The Step Functions console also provides a great debugging experience: you can view execution graphs, inspect step inputs and outputs, and click through to CloudWatch Logs for Lambda functions.
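For a taste of Amazon States Language, here is a minimal state machine with a Task step followed by a Choice. The state names, the `$.charged` field, and the Lambda ARN are all hypothetical:

```json
{
  "StartAt": "ReserveInventory",
  "States": {
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ReserveInventory",
      "Next": "Charged?"
    },
    "Charged?": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.charged", "BooleanEquals": true, "Next": "Done" }
      ],
      "Default": "Failed"
    },
    "Done": { "Type": "Succeed" },
    "Failed": { "Type": "Fail" }
  }
}
```

Even this tiny example shows the appeal: the entire process, including its failure path, is declared up front in one document.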
If you prefer to define your workflows using a programming language, there is an awesome CDK construct you can use to build Step Functions. If you want to write workflows using variables instead of JSONPath to reference data that is passed between steps, explore Temporal or the older SWF w/ Flow framework.
Or wait a few months. There’s a very exciting project I’ve been watching which provides an exceptional developer experience for building workflows. More on that soon.
Conclusion
This post should provide you a starting point and enable you to start building effective distributed processes on AWS. Using building blocks like SNS topics, EventBridge buses, and workflow engines, it becomes far easier to manage and operate reliable distributed processes.
I’ll leave you with one last word regarding the orchestration vs choreography debate. My perspective is that both orchestration and choreography have their place. When defining processes that must be reliable and are mission critical, I tend to prefer orchestration. When integrating with other systems, I tend to prefer event-driven choreography. Our team at Foresight loves both Step Functions and EventBridge; we use the two to provide flexibility when we need flexibility, and form and structure when we want to mandate form and structure.
Finally, AWS announced EventBridge Pipes at re:Invent this year. You will notice we didn’t cover them in this blog post. That is because they are used to integrate a single source with a single destination, so they belong in part one of this guide, which I hope to update soon.
Thank you for reading, and please let me know your thoughts as always!