Monitoring Shard Allocation in OpenSearch and Elasticsearch
Filling a critical gap in the managed CloudWatch metrics for AWS OpenSearch Service
AWS’s managed OpenSearch Service (formerly Amazon Elasticsearch Service) is one of those AWS managed services that requires quite a lot of custom tuning to work well for you. It’s also a service where not knowing what you are doing gets very expensive.
On a related note, knowing what you are doing while operating an OpenSearch domain is not always intuitive, because each domain’s health depends heavily on its data and access patterns. A cluster’s health cannot be determined simply from the health of the software and hardware that is running. The very same domain that performs some tasks quickly and without any hiccups can still fail to index a single new document.
I’ll give you an example.
You’re using OpenSearch Service as a log search engine, ingesting your logs through a typical ELK or EKK stack. Every day, a new index is created in your domain (possibly one per environment, or even one per service per environment, depending on your implementation). As each index is created, 5 primary shards are allocated to it by default (this number is configurable, but each index must be allocated at least 1 shard). Along with each primary shard you’ll typically allocate one replica shard.
Assuming you have not tuned your index configuration at all, and that dev, uat, and production environments each create a new index every day, ten shards per environment (5 primary and 5 replica) are allocated daily. If you have two data nodes, each allowing the default 1,000 shards per node, you can store 66 days’ worth of logs on your cluster: (2 nodes × 1,000 shards/node) ÷ (3 environments × 10 shards/day) ≈ 66 days.
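If you want to sanity-check that arithmetic, here is a quick back-of-the-envelope sketch in Python; the numbers are simply the ones from the example above, so substitute your own node count, per-node limit, and daily index count:

# Back-of-the-envelope shard capacity check, using the example numbers above.
data_nodes = 2                # data nodes in the domain
max_shards_per_node = 1000    # the default cluster.max_shards_per_node
environments = 3              # dev, uat, production
shards_per_daily_index = 10   # 5 primaries + 5 replicas

total_shard_capacity = data_nodes * max_shards_per_node
shards_allocated_per_day = environments * shards_per_daily_index

print(total_shard_capacity // shards_allocated_per_day)  # -> 66 days of daily indices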
The result of such a configuration is that even if you only log 1 MB worth of data per day, writes to your cluster will fail on day 67 if you haven’t cleaned up your older logs. What about reads? You’ll probably get sub-100ms response times on those. What about your OpenSearch domain metrics? Most everything will look healthy. Everything, that is, but your write request error response codes.
You might be able to monitor error responses for your OpenSearch domain and be alerted that something is wrong after the fact, but by then it’s too late. All your available shards are already allocated and some writes have already failed. Failures like this take the form of a 400 error with the following message:
{
  "type": "illegal_argument_exception",
  "reason": "Validation Failed: 1: this action would add [10] total shards, but this cluster currently has [5992]/[6000] maximum shards open;"
}
Now this is not a blog post about sizing OpenSearch clusters. I’ll write that one another time (AWS actually has some reasonably thorough documentation on that). It’s about augmenting your OpenSearch domain monitoring suite to capture shard allocation. That way, when 90% of your shards are allocated, you can scale out (if you have lots of $$$ to throw at the problem), or do some tuning to ensure your cluster is sized correctly before things go wrong.
If none of the built-in domain monitors help with this, let’s build our own.
We can grab the number of shards currently allocated in the cluster, along with the number of data nodes, by making a request to:
GET _cluster/stats?filter_path=indices.shards.total,nodes.count.data
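The response, abridged and with illustrative values, looks something like this:

{
  "nodes": {
    "count": {
      "data": 2
    }
  },
  "indices": {
    "shards": {
      "total": 1980
    }
  }
}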
We can work out the maximum number of shards the cluster will accept from the per-node limit, which we can read by making a request to:
GET _cluster/settings?include_defaults=true&flat_settings=true&pretty
From the response we grab the `cluster.max_shards_per_node` setting, taking it from the `persistent` section if it has been overridden there, and from `defaults` otherwise. Multiplying that per-node limit by the number of data nodes gives the cluster-wide shard limit, the same number reported in the error message above.
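Abridged again, and assuming the limit has been left at its default, the relevant piece of the response looks like:

{
  "persistent": {},
  "transient": {},
  "defaults": {
    "cluster.max_shards_per_node": "1000"
  }
}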
Dividing the shards in use by that cluster-wide limit gives us a shard allocation ratio between zero and one.
Performing this calculation every hour and publishing the result as a CloudWatch metric allows us to proactively warn administrators before the domain’s shards become fully allocated.
Below is a small Chalice project that deploys a serverless scheduled task to help proactively identify over-allocation of shards in an OpenSearch domain. I’ve deployed this on several occasions for our customers at Foresight Technologies and it has saved me a couple of times.
from os import environ

import boto3
import requests
from chalice import Chalice
from requests_aws4auth import AWS4Auth

app = Chalice(app_name='es-monitor')


def es_get(path, endpoint=None, payload=None, headers=None):
    # Perform a SigV4-signed GET request against the OpenSearch domain endpoint.
    if not endpoint:
        endpoint = environ.get('ES_ENDPOINT')
    if headers is None:
        headers = {'Content-Type': 'application/json'}
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(
        credentials.access_key,
        credentials.secret_key,
        environ['AWS_DEFAULT_REGION'],
        'es',
        session_token=credentials.token
    )
    return requests.get(
        url=f'{endpoint}/{path}',
        headers=headers,
        data=payload,
        auth=awsauth
    ).json()


def get_es_shard_allocation():
    # Shards currently allocated, plus the number of data nodes in the cluster.
    stats = es_get('_cluster/stats?filter_path=indices.shards.total,nodes.count.data')
    used_shards = stats['indices']['shards']['total']
    data_nodes = stats['nodes']['count']['data']

    # The per-node shard limit only appears under "defaults" unless it has been
    # overridden as a persistent cluster setting.
    settings = es_get('_cluster/settings?include_defaults=true&flat_settings=true')
    max_shards_per_node = int(
        settings.get('persistent', {}).get('cluster.max_shards_per_node')
        or settings.get('defaults', {}).get('cluster.max_shards_per_node')
    )

    return used_shards / (max_shards_per_node * data_nodes)


def emit_metric(metric_namespace, metric_name, metric_value):
    cloudwatch = boto3.client('cloudwatch')
    cloudwatch.put_metric_data(
        MetricData=[{
            'MetricName': metric_name,
            'Dimensions': [],
            'Unit': 'None',
            'Value': metric_value
        }],
        Namespace=metric_namespace
    )


@app.schedule('rate(1 hour)')
def track_elasticsearch_shard_allocation(event):
    emit_metric(
        'Elasticsearch/Reliability',
        'ElasticsearchShardAllocation',
        get_es_shard_allocation()
    )
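The only configuration the function needs is the domain endpoint, supplied through the ES_ENDPOINT environment variable, plus an IAM role that allows es:ESHttpGet against the domain and cloudwatch:PutMetricData. With Chalice, one way to set the environment variable is in .chalice/config.json; the endpoint below is a placeholder for your own domain:

{
  "version": "2.0",
  "app_name": "es-monitor",
  "stages": {
    "dev": {
      "environment_variables": {
        "ES_ENDPOINT": "https://my-domain.us-east-1.es.amazonaws.com"
      }
    }
  }
}

A chalice deploy from the project root then creates the Lambda function and the hourly schedule.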
This is a very cheap solution to run. It’s essentially free when you compare it against the AWS bill that comes alongside running a production-ready OpenSearch domain.
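To actually get warned at that 90% mark, hang a standard CloudWatch alarm off the new metric. Here is a minimal sketch using boto3, assuming you already have an SNS topic wired up to your alerting channel (the topic ARN below is a placeholder):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when average shard allocation stays above 90% for an hour.
cloudwatch.put_metric_alarm(
    AlarmName='ElasticsearchShardAllocationHigh',
    Namespace='Elasticsearch/Reliability',
    MetricName='ElasticsearchShardAllocation',
    Statistic='Average',
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.9,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='breaching',  # also alarm if the metric stops being published
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:es-alerts']  # placeholder SNS topic
)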
I’ll just stress that I’ve seen this type of failure on multiple occasions, and restoring logs from secondary data sinks is far more painful than preventing the problem in the first place.
Thank you for making it this far, Reader. I do hope that you learned something along the way.