Fixing the AWS Sagemaker SDK

And frustrating open source contribution experiences...

May 28, 2021

Part 1: Fixing the Sagemaker SDK

One of the perks of my job is that I get to spend a lot of time playing with new technology. (I lead a team of awesome cloud engineers over at Foresight Technologies.) One of the less exciting things (or more exciting things) about working with new technology is that sometimes it doesn’t work the way it is advertised.

I jumped into Sagemaker Studio excited to start deploying some models. Standard machine learning stuff, right?

Configure training environment

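import sagemaker
from sagemaker import get_execution_role

# container, bucket, prefix, and sess are assumed to be defined earlier in the notebook.
xgb = sagemaker.estimator.Estimator(
    container,
    get_execution_role(),
    instance_count=1,  # SDK v2 name; formerly train_instance_count
    instance_type='ml.m4.xlarge',  # SDK v2 name; formerly train_instance_type
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sess
)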

Set some standard hyperparameters and train a model.

xgb.set_hyperparameters(
    max_depth=3,
    gamma=0,
    eta=0.1,
    num_round=100
)
xgb.fit(
    {
        'train': s3_input_train,
        'validation': s3_input_validation
    }
)

Finally, deploy the model. And it is here that the fun starts:

xgb_predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large"
)

Despite this being a very typical Sagemaker Studio workflow, I was greeted by a cryptic error message:

ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: The provided tags "Tag(sagemaker:project-id, x-xxxxxxx),Tag(sagemaker:project-name, xxxxxxx),Tag(sagemaker:project-id, xxxxxx),Tag(sagemaker:project-name, xxxxxx)" must not have duplicate keys.

When I get a cryptic error like that on something so routine, I look to see whether anyone else has had similar issues. I asked Google and didn’t get anything useful, so I went to GitHub and searched the open issues. Lo and behold, one other person had reported the same problem a month earlier. No responses, no likes on the issue. Probably someone else who, like me, didn’t know how to use the Sagemaker SDK.

The error message was useful, however: along with the stack trace, it might help me understand what I had misconfigured. How, then, had I messed up the configuration so that the sagemaker:project-id and sagemaker:project-name tags appeared multiple times?

I started tracing the code.

I discovered that these tags are created by an _append_project_tags method in _studio.py [1]. If these tags were appearing twice, then this method was being called twice by the Sagemaker SDK, since my code did not create any tags itself. So this was probably an unintended workflow. My Estimator might be invoking this method in its constructor, or while setting the hyperparameters, or while training the model. Maybe I was meant to clear all tags from the Estimator prior to deploying, and the API was just stateful and brittle.

So I did that, and then I tried to deploy again, with the same result. This was very suspicious. Either there was some other state nested deep within the estimator’s attributes or, as I began to suspect, there was a bug in the Sagemaker SDK.

Cutting a rather long exercise in tracing through the code short, I confirmed that it was indeed a bug in the Sagemaker SDK, and that using an estimator to deploy a model would invoke this _append_project_tags() method twice, deterministically, rendering the SDK virtually unusable for my purpose. Someone had clearly introduced a bug into the SDK that had made its way into production.
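To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use the Sagemaker SDK: append_project_tags below is a hypothetical stand-in for the SDK's private helper, and the tag values are made up. It only shows how an append-style helper that runs twice on the same tag list produces the duplicate keys that CreateEndpoint rejects.

```python
from collections import Counter

# Hypothetical project tags; real values would come from the Studio project.
PROJECT_TAGS = [
    {"Key": "sagemaker:project-id", "Value": "p-example"},
    {"Key": "sagemaker:project-name", "Value": "example"},
]


def append_project_tags(tags=None):
    """Stand-in for an append-style helper: always adds the project tags."""
    return list(tags or []) + [dict(tag) for tag in PROJECT_TAGS]


def duplicate_keys(tags):
    """Return the sorted tag keys that occur more than once."""
    counts = Counter(tag["Key"] for tag in tags)
    return sorted(key for key, count in counts.items() if count > 1)


once = append_project_tags()
twice = append_project_tags(once)  # a second call on the same deploy path

print(duplicate_keys(once))   # []
print(duplicate_keys(twice))  # ['sagemaker:project-id', 'sagemaker:project-name']
```

Because the helper is not idempotent, any code path that happens to call it a second time turns a valid tag list into an invalid one, with no user-supplied tags involved.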

So I monkey-patched a fix, tested it, and deployed my model without incident.

Part 2: Frustrating Open Source Experiences

After going through the pain of discovering this bug, I wasn’t about to let other unfortunate developers do the same. So in the spirit of open source software I opened a pull request with my fixes, following the contribution guidelines.

I didn’t hear back for a couple of days, but then I noticed some merge conflicts introduced in my pull request. Who had updated the same lines of code and merged them to master?

A virtually identical fix with a virtually identical test case had been merged into master. That fix had been opened by the maintainers of the Sagemaker Python SDK only a few hours earlier. Needless to say, no comment had been posted on the open issue, and my pull request was still untouched.

I suppose I would have felt less slighted by this if they hadn’t made me fill in a checkbox requiring that I read the contribution guidelines that explicitly require that:

You check the existing open and recently merged pull requests to make sure someone else hasn't already addressed the problem.

If you maintain open source software, please know that it isn’t good form to ignore pull requests, fix the same issues yourself, and not respond to your github issues. Please treat your contributors with a little respect and follow your own contribution guidelines. Definitely leaves me feeling a little salty.

A relatively rainy day.

[1] This is rather irrelevant to the narrative here, but feel free to take a look if you’re curious.

Ning

Aug 11, 2021

Thank you for sharing the bug. I ran into the same issue. Based on your article, it seems the bug was fixed a couple of months ago, so I am not sure why I am still experiencing it now. What is the patch you made? Maybe I can do a one-off fix locally on my own?


Fun With The Cloud
