Elasticsearch has long been one of my favorite products, but I wear no rose-colored glasses. Operating an Elasticsearch cluster is expensive and requires work, and even operating a cluster with AWS’s managed OpenSearch Service is no picnic. Especially if you don’t want to break the bank.
I usually run production clusters with three dedicated master nodes and an even number of data nodes scheduled across two availability zones. The cluster in question is no different. What was different was the nature of this failure. Some application nodes were experiencing connection timeouts while trying to communicate with the cluster endpoint. Others were working just fine.
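For concreteness, here is roughly what reading that topology back through boto3 looks like. This is a sketch only; the domain name and region are placeholders, not the real ones.

```python
# Read the cluster topology back from the OpenSearch Service API.
# "my-domain" and the region are placeholders.
import boto3

client = boto3.client("opensearch", region_name="us-east-1")
config = client.describe_domain(DomainName="my-domain")["DomainStatus"]["ClusterConfig"]

print("dedicated masters:", config["DedicatedMasterEnabled"], config.get("DedicatedMasterCount"))
print("data nodes:", config["InstanceCount"], config["InstanceType"])
print("zone awareness:", config["ZoneAwarenessEnabled"],
      config.get("ZoneAwarenessConfig", {}).get("AvailabilityZoneCount"))
```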
Just yesterday every process was healthy, and CloudTrail confirms that no configuration changes have been made since. A quick look at CloudWatch, and the metrics report healthy. The nature of the requests is identical across nodes.
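The CloudTrail check boils down to a lookup against the OpenSearch Service event source. A minimal sketch, assuming a two-day window and using a rough prefix filter to drop read-only calls:

```python
# Look for any write calls against the OpenSearch Service API in the last two days.
from datetime import datetime, timedelta, timezone
import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventSource", "AttributeValue": "es.amazonaws.com"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=2),
    EndTime=datetime.now(timezone.utc),
)["Events"]

# Crude filter: ignore read-only calls; anything left would be a configuration change.
changes = [e for e in events if not e["EventName"].startswith(("Describe", "List", "Get"))]
for e in changes:
    print(e["EventTime"], e["EventName"], e.get("Username"))
```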
Maybe it’s a networking issue that has always existed and lain dormant. The OpenSearch nodes and the application nodes are deployed to the same subnets. Routes exist. The OpenSearch nodes are in one security group, the application nodes in a second. The security group rules look good. No relevant NACLs exist.
All nodes, healthy and unhealthy, have the expected security groups attached to their network interfaces.
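The same checks can be scripted rather than clicked through in the console. A sketch, with placeholder security group IDs standing in for the OpenSearch and application groups described above:

```python
# Verify the OpenSearch security group allows 443 from the application group,
# and list the ENIs carrying the OpenSearch group. Group IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")
OPENSEARCH_SG = "sg-0aaaaaaaaaaaaaaaa"  # attached to the OpenSearch ENIs
APP_SG = "sg-0bbbbbbbbbbbbbbbb"         # attached to the application nodes

sg = ec2.describe_security_groups(GroupIds=[OPENSEARCH_SG])["SecurityGroups"][0]
allows_https_from_app = any(
    rule.get("FromPort") == 443
    and any(pair["GroupId"] == APP_SG for pair in rule.get("UserIdGroupPairs", []))
    for rule in sg["IpPermissions"]
)
print("443 open from app SG:", allows_https_from_app)

enis = ec2.describe_network_interfaces(
    Filters=[{"Name": "group-id", "Values": [OPENSEARCH_SG]}]
)["NetworkInterfaces"]
for eni in enis:
    print(eni["NetworkInterfaceId"], eni["SubnetId"], eni["Status"],
          [g["GroupId"] for g in eni["Groups"]])
```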
I connect to the client VPN endpoint and try to telnet to the OpenSearch domain endpoint. Nothing. Okay, so this is on the OpenSearch side, not the application side.
I run nslookup against the endpoint, and all four IP addresses returned are in my private subnets. I try to telnet to each of them. Two connect, two fail. One in each subnet connects; one in each subnet fails. I validate that the failing application nodes are trying to connect to the same two network interfaces that are timing out for me.
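For what it’s worth, the whole probe fits in a few lines of Python instead of telnet by hand. The endpoint name here is a placeholder:

```python
# Resolve the domain endpoint and attempt a TCP connection to each address on 443.
import socket

ENDPOINT = "vpc-my-domain-abc123.us-east-1.es.amazonaws.com"  # placeholder

addresses = sorted({info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)})
for ip in addresses:
    try:
        with socket.create_connection((ip, 443), timeout=5):
            print(f"{ip}: connected")
    except OSError as exc:
        print(f"{ip}: failed ({exc})")
```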
Production customers are getting grumpy, as is our client.
With nothing else left to debug, I try to apply a service software update to kick off a blue-green deployment and replace every node. In parallel, I reach out to AWS support and get radio silence. The update sits in the pending state for a few hours, so I switch the data nodes to a different instance type to force a blue-green deployment.
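For reference, both attempts map to OpenSearch Service API calls. The domain name and instance type below are placeholders:

```python
# Two ways to nudge OpenSearch Service into a blue/green deployment.
import boto3

client = boto3.client("opensearch")

# Attempt 1: request a service software update, typically applied via blue/green.
client.start_service_software_update(DomainName="my-domain")

# Attempt 2: change the data node instance type, which forces a blue/green deployment.
client.update_domain_config(
    DomainName="my-domain",
    ClusterConfig={"InstanceType": "r6g.large.search"},
)

# Follow the change as it is applied.
progress = client.describe_domain_change_progress(DomainName="my-domain")
print(progress["ChangeProgressStatus"]["Status"])
```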
This second blue-green deployment immediately begins to apply. When the dust settles, all new nodes work just fine. I can connect to all of them, as can application nodes. It is as though the problem never existed.
What happened? With no access to the underlying infrastructure, I probably will never know. Do I need to worry about something like this in the future? How would I automate a response? No idea.
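At minimum, the detection half seems automatable: run the connectivity probe from earlier on a schedule and alert when only some of the endpoint addresses answer. A rough sketch, with a placeholder endpoint and SNS topic:

```python
# Probe every resolved endpoint address; alert on a partial failure.
# Intended to run on a schedule (cron, Lambda, etc.). Names are placeholders.
import socket
import boto3

ENDPOINT = "vpc-my-domain-abc123.us-east-1.es.amazonaws.com"  # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:opensearch-alerts"  # placeholder


def reachable(ip: str, port: int = 443, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


addresses = sorted({info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)})
failed = [ip for ip in addresses if not reachable(ip)]

# Some, but not all, addresses timing out is exactly the partial failure described here.
if failed and len(failed) < len(addresses):
    boto3.client("sns").publish(
        TopicArn=TOPIC_ARN,
        Subject="OpenSearch endpoint partially unreachable",
        Message=f"Unreachable addresses: {failed} (of {addresses})",
    )
```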
If this were your OpenSearch cluster, Reader, how would you respond?