LeanerCloud - cloud optimization

AutoSpotting - Frequently Asked Questions

Looking for more info about AutoSpotting? Here are some things we're commonly asked.

What is AutoSpotting?

AutoSpotting is a tool that makes it easy to adopt EC2 Spot instances in AutoScaling groups, for up to 90% savings. It is enabled and configured using tags and requires no launch template or launch configuration changes.

What are Spot instances and how do they work?

Cloud providers such as Amazon AWS always need to keep some spare capacity available, so that any customer who wants to launch a new instance can do so without getting errors.

Amazon offers this spare capacity to customers as Spot instances at a steep discount, but it will take the capacity back within minutes when it needs to be allocated to on-demand users.

Functionally, Spot instances are identical to on-demand instances; the only difference is that they can be interrupted.

How reliable are Spot instances?

Individual instances will be interrupted occasionally, but stateless workloads can usually sustain individual instance interruptions.

If your application can run on multiple instance types, then when capacity for one of them is running low you can get capacity from the others.

AutoSpotting automatically diversifies over multiple instance types on your behalf. If there is insufficient capacity for all of them, it fails over to the on-demand instance type configured on your AutoScaling group.

Which workloads are tolerant to Spot interruptions?

Spot is a great fit for fault tolerant, stateless workloads that have some instance type flexibility.

Most workloads running on AutoScaling groups that scale dynamically, sit behind HTTP load balancers or consume data from SQS queues are also good candidates for Spot, because new requests can be routed quickly to new instances without any visible user impact. Containerized applications and big data processing are also a great fit.

How does AutoSpotting work?

AutoSpotting monitors the AutoScaling groups where it was enabled and it continuously replaces existing or new on-demand instances found in those groups with compatible and identically configured Spot instances.

By default it replaces all on-demand instances with Spot, but it can also keep some of them running as on-demand if configured to do so.

How are Spot interruptions handled by AutoSpotting?

The interruptions are published in the instance metadata and EventBridge.

AutoSpotting listens for these events and takes action, such as detaching the instances from the load balancer and terminating them gracefully, which gives the AutoScaling group the chance to provision on-demand capacity.
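As a rough illustration of the event-driven part, the sketch below parses an EventBridge Spot interruption warning and extracts the affected instance. The event fields follow the "EC2 Spot Instance Interruption Warning" event published by AWS; the function name and the sample instance ID are hypothetical, and this is not AutoSpotting's actual code.

```python
from typing import Optional

# detail-type AWS uses for the Spot interruption notice on EventBridge
INTERRUPTION_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


def interrupted_instance_id(event: dict) -> Optional[str]:
    """Return the instance ID from a Spot interruption warning, else None."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != INTERRUPTION_DETAIL_TYPE:
        return None
    return event.get("detail", {}).get("instance-id")


# Example event, trimmed to the fields used above (the ID is made up).
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {
        "instance-id": "i-0123456789abcdef0",
        "instance-action": "terminate",
    },
}
print(interrupted_instance_id(sample_event))
```

A handler built on this would then detach and drain the returned instance, as described above.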

What are the goals and design principles of AutoSpotting?

AutoSpotting is designed to be used against existing AutoScaling groups with long-running instances, and it tries to be as close to invisible as possible. Usually it just does its thing without you noticing anything.

It's also designed towards mass rollouts, such as across entire AWS organizations, where it can sometimes be executed in opt-out mode, converting the entire infrastructure to Spot instances.

The configuration is designed to be minimalist and everything should just work without much tweaking. You're not expected to need to determine which instance types are as good as your initial ones, which instance type is the cheapest in a given availability zone, and so on.

Everything should be determined based on the original instance type using publicly available information and querying the current Spot prices in real time. Your main job is to make sure your application can sustain instance failures.
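To make the idea concrete, here is a minimal sketch of how compatible Spot candidates might be derived from the original instance type. The compatibility rule (at least as many vCPUs and as much memory, and a Spot price below the original on-demand price), the type names and all prices are assumptions for illustration, not AutoSpotting's actual selection logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    spot_price: float       # current Spot price, USD/hour (made-up values)
    on_demand_price: float  # on-demand price, USD/hour (made-up values)


def spot_candidates(original: InstanceType,
                    catalog: List[InstanceType]) -> List[InstanceType]:
    """Types at least as large as the original whose Spot price is below
    the original's on-demand price, cheapest first."""
    compatible = [
        t for t in catalog
        if t.vcpus >= original.vcpus
        and t.memory_gib >= original.memory_gib
        and t.spot_price < original.on_demand_price
    ]
    return sorted(compatible, key=lambda t: t.spot_price)


# Illustrative catalog only; names and prices are not real.
catalog = [
    InstanceType("type-a.large", 2, 8.0, 0.035, 0.10),
    InstanceType("type-b.large", 2, 8.0, 0.030, 0.09),
    InstanceType("type-a.medium", 1, 4.0, 0.015, 0.05),
    InstanceType("type-c.xlarge", 4, 16.0, 0.120, 0.20),
]
original = catalog[0]
print([t.name for t in spot_candidates(original, catalog)])
```

An empty result would correspond to the case where no Spot type qualifies and the group keeps its on-demand instance.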

It also tries as much as possible to avoid locking you in. If you later decide that Spot instances aren't for you, you can disable it with just a few clicks or commands and immediately revert your environment to your initial on-demand setup, unlike most other solutions where the back-and-forth migration effort may become quite significant.

From the security perspective, it was carefully configured to use the minimum set of IAM permissions needed to get its job done, nothing more, nothing less. There is no cross-account IAM role; everything runs from within your AWS account and no information about your infrastructure ever leaves it.

What is the use case in which AutoSpotting makes most sense to use?

Any workload that can be quickly drained from soon-to-be-terminated instances.

AutoSpotting is designed to work best with relatively similar-sized, redundant and somewhat long-running stateless instances in AutoScaling groups, running workloads easy to transfer or re-do on other nodes in the event of Spot instance terminations. Here are some classical examples:

◦ Development environments where a short downtime caused by Spot terminations is not an issue, even when instances are not drained at all.

◦ Stateless web server or application server tiers with relatively fast response times (less than a minute on average), where draining is easy to ensure.

◦ Batch processing workers taking their jobs from SQS queues, in which the order of processing the items is not so important and short delays are acceptable.

◦ Docker container hosts in ECS, Kubernetes or Swarm clusters.

Note: AutoSpotting implements some termination monitoring and draining logic which can be extended if you use termination lifecycle hooks.

What are some use cases in which it's not a good fit and what to use instead?

Anything that doesn't really match the above cases:

◦ Groups that have no redundancy

If you have a single instance in the group, Spot terminations may often leave your group without any nodes. If this is a problem, you should not run AutoSpotting in such groups, but instead use reserved instances, maybe of T2 burstable instance types if your application works well on those.

◦ Instances which can't be drained quickly

If your application is expected to serve long-running requests that take longer than a couple of minutes without timing out, AutoSpotting (or any Spot automation) may not be for you, and you should be running reserved instances.

◦ Cases in which the order of processing queued items is strict

Spot instance terminations may impact such use cases, so you should run them on on-demand or reserved instances.

◦ Stateful workloads

AutoSpotting doesn't support stateful workloads out of the box, particularly when specific persistent EBS volumes need to be attached to running instances.

The replacement Spot instances will be started but they will fail to attach the volume at boot because it is still attached to the original instance. Additional configuration would have to be in place in order to re-attempt the attach operation a number of times, until the previous on-demand instance is terminated and the volume can be successfully attached to your Spot instance. The Spot instance's software configuration may need to be changed in order to accommodate this EBS volume.
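The retry behavior described above could look roughly like the sketch below, which would run on the Spot instance at boot. The helper name, the attempt count and the delay are hypothetical choices, and catching every exception is a simplification: a real script would likely retry only on the specific "volume still in use" error.

```python
import time
from typing import Callable


def attach_with_retry(
    attach: Callable[[], None],
    attempts: int = 10,
    delay_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call attach() until it succeeds; return the number of attempts used.

    Re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            attach()
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay_seconds)  # wait for the old instance to release the volume
    raise ValueError("attempts must be at least 1")


# Simulated attach call that fails twice, as if the volume were still
# attached to the previous on-demand instance, then succeeds.
calls = {"count": 0}


def fake_attach() -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("volume still attached to the previous instance")


print(attach_with_retry(fake_attach, sleep=lambda _: None))

# With boto3, attach could wrap the EC2 AttachVolume call, for example:
#   ec2 = boto3.client("ec2")
#   attach_with_retry(lambda: ec2.attach_volume(
#       VolumeId=volume_id, InstanceId=instance_id, Device="/dev/sdf"))
```

Passing the sleep function as a parameter keeps the sketch easy to test without real delays.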

How do I install AutoSpotting?

You can launch it from the AWS Marketplace using the provided CloudFormation or Terraform infrastructure code.

It only takes a couple of minutes and just needs a few clicks in the AWS console or a single execution of awscli from the command-line.

The same CloudFormation stack template can also be used for launching a StackSet against your entire AWS organization.

The below video explains in detail the installation process and the available configuration options:

Why is there Fargate, ECS and VPC in the CloudFormation stack?

These resources are unfortunately required for the AWS Marketplace billing logic, as workarounds for the fact that Lambda is not supported by the AWS Marketplace Metering API. We run a Fargate task every hour for calling this API for billing purposes.

Interfering with these resources is against the AWS Marketplace EULA terms (in particular section 2.3 - Restrictions), and may also break AutoSpotting and require it to be reinstalled from scratch.

How do I enable AutoSpotting?

The entire configuration is based on tags applied on your AutoScaling groups.

By default it runs in "opt-in" mode, so it only takes action against groups that have the "spot-enabled" tag set to "true", across all the enabled regions. Often all you need to do is apply this tag.
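For example, the tag could be applied programmatically. The sketch below builds the Tags argument for the Auto Scaling CreateOrUpdateTags API using the default "spot-enabled" key; the group name is hypothetical, and setting PropagateAtLaunch to False is an assumption (the tag is only described as living on the group itself).

```python
from typing import Dict, List


def spot_enabled_tag(group_name: str, value: str = "true") -> List[Dict[str, object]]:
    """Build the Tags argument for the Auto Scaling CreateOrUpdateTags API."""
    return [
        {
            "ResourceId": group_name,
            "ResourceType": "auto-scaling-group",
            "Key": "spot-enabled",
            "Value": value,
            # Assumption: the tag does not need to be copied to instances.
            "PropagateAtLaunch": False,
        }
    ]


tags = spot_enabled_tag("my-web-asg")  # hypothetical group name
print(tags)

# With boto3 installed and credentials configured, it could be applied with:
#   import boto3
#   boto3.client("autoscaling").create_or_update_tags(Tags=tags)
```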

For more advanced users it can also be configured to run in "opt-out" mode, so it will run against all groups except for those tagged with the "spot-enabled" tag set to "false". This mode is unique to AutoSpotting, and when combined with a StackSet deployment it is a great way to adopt Spot at any scale.

Some large companies use it in "opt-out" configuration even across AWS organizations with hundreds of AWS accounts, to ensure that the majority of their infrastructure runs on cost-effective Spot instances. They migrated to Spot in a very short time, in some cases literally overnight, without requiring any cooperation or engineering effort from their development teams.

If you are also considering such a rollout and would like support from someone who has done it repeatedly and knows how to avoid the pitfalls of such a large-scale migration project, we're happy to help, and we also offer significant discounts over the Marketplace offering. Just reach out to us using the chat feature on this page.

Note: the keys and values of the "opt-in"/"opt-out" tags are configurable in both modes, and multiple tags can be used.
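The difference between the two modes can be summarized with a small sketch. It uses the default "spot-enabled" key and the "true"/"false" values mentioned above, and handles a single tag only; the function name and signature are hypothetical and do not reflect AutoSpotting's internals.

```python
from typing import Dict


def in_scope(tags: Dict[str, str], mode: str = "opt-in",
             key: str = "spot-enabled") -> bool:
    """Decide whether a group with the given tags would be acted upon."""
    value = tags.get(key)
    if mode == "opt-in":
        # Only groups explicitly tagged "true" are touched.
        return value == "true"
    if mode == "opt-out":
        # Every group is touched unless explicitly tagged "false".
        return value != "false"
    raise ValueError(f"unknown mode: {mode!r}")


print(in_scope({"spot-enabled": "true"}))             # opt-in, tagged
print(in_scope({}, mode="opt-out"))                    # opt-out, untagged
print(in_scope({"spot-enabled": "false"}, "opt-out"))  # opt-out, excluded
```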

The below demo shows how to enable AutoSpotting on a new AutoScaling group.

Will it replace all my on-demand instances with Spot instances?

Yes, that's the default behavior (we find it quite safe), but for your peace of mind this is configurable, as you can see below.

Can I keep some on-demand instances running just in case?

Yes, you can set an absolute number or a percentage of the total capacity, using the global configuration set at install time, or you can override it on a per-AutoScaling group basis using tags set on each group. These tags are mentioned when installing the CloudFormation template.
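As a worked example of the two options, the sketch below computes how many on-demand instances would be kept in a group. The function and parameter names are hypothetical, and rounding percentages up is an assumption; the actual rounding used by AutoSpotting is not stated here.

```python
import math
from typing import Optional


def on_demand_to_keep(total: int, *, number: Optional[int] = None,
                      percentage: Optional[float] = None) -> int:
    """On-demand instances to keep, given either a count or a percentage."""
    if (number is None) == (percentage is None):
        raise ValueError("set exactly one of number or percentage")
    if number is not None:
        keep = number
    else:
        # Assumption: fractional results are rounded up.
        keep = math.ceil(total * percentage / 100)
    # Never keep more instances than the group has, and never fewer than zero.
    return max(0, min(keep, total))


print(on_demand_to_keep(10, number=2))
print(on_demand_to_keep(10, percentage=25))
print(on_demand_to_keep(3, number=5))
```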

How does AutoSpotting compare to the AutoScaling mixed groups?

AutoSpotting has a few additional capabilities:

◦ configuration using tags, without having to give a list of instance types or attributes for each AutoScaling group.

◦ automated fail-over to on-demand instances when Spot capacity is unavailable, and back to Spot as soon as Spot capacity becomes available again.

◦ automated selection of the diversified instance types from the cheapest available instance types, with a preference for the newest instance type generations.

◦ you can enable/disable it at will from CI/CD pipelines and even on a schedule.

◦ you can roll it out across your entire fleet in opt-out mode, without any configuration changes. In particular, you don't need to convert your groups to LaunchTemplates.

◦ flexible/automated instance type selection for Spot instances.

To be fair it also does have a few drawbacks:

◦ need to run an additional tool, which you need to install and may occasionally need to update.

◦ no support for the AWS regions located in China or GovCloud, mainly because I don't have access to them, but I can port it if needed.

◦ small software costs charged through your AWS bill

How does AutoSpotting compare to commercial offerings such as Spot.io?

Many of these commercial offerings have in common a number of things:

◦ SaaS model, requiring admin-like privileges and cross-account access to all target AWS accounts which usually raises eyebrows from security auditors. They can read a lot of information from your AWS account and send it back to the vendor and since they are closed source you can't tell how they make use of this data. Instead, AutoSpotting is launched within each target account so it needs no cross-account permissions, and no data is exported out of your account.

◦ Implement new constructs that mimic existing AWS services and expose them with proprietary APIs, such as clones of AutoScaling groups, maybe sometimes extended to load balancers, databases and functions, which expect custom configuration replicating the initial resources from the AWS offering. Much like with Spot fleets, this makes it quite hard and work-intensive to migrate towards but also away from them, which is a great vendor lock-in mechanism if you're a start-up, but not so nice if you are a user. Many of these resources require custom integrations with AWS services, which need to be implemented by the vendor.

Instead, AutoSpotting's goal is to be invisible, easy to install and remove, so there's no vendor lock-in. Under the hood it's all good-old AutoScaling, and all its integrations are available out of the box.

◦ They're all pay-as-you-go solutions charging a hefty percentage of the savings. For example, Spotinst charges 20-25%, mainly for the value add of having a GUI dashboard.

AutoSpotting's goal is to simply be useful, and as invisible as possible, also from the price perspective. If you need to see a savings dashboard, just look at the Bills section of the AWS console and see how the bill has dropped over time.

How much does it cost me to run it?

AutoSpotting is designed to have minimal footprint, and its execution overhead will only cost you a few pennies monthly.

It is based on AWS Lambda. The default configuration triggers the Lambda function once every 5 minutes, and most of the time it runs for just a few seconds, just enough to evaluate the current state and notice that no action needs to be taken.

When instance replacement actions are taken it may run longer, because the synchronous execution of some API calls takes more time, but most of the time it finishes in less than a minute. It should still be well within the monthly Lambda free tier, so you will only pay a few cents for logging and network traffic performed against AWS API endpoints.

The CloudWatch logs are configured by default with a 7-day retention period, which should be enough for debugging without costing you much. If desired, you can easily configure the log retention and execution frequency to any other values in the CloudFormation stack parameters.
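A back-of-the-envelope calculation shows why the default schedule fits in the free tier. The 5-minute interval comes from the default configuration; the average duration and memory size below are assumptions chosen only for illustration, and the free tier allowances are the monthly AWS Lambda ones (1 million requests and 400,000 GB-seconds).

```python
# One invocation every 5 minutes, as in the default configuration.
invocations_per_month = (60 // 5) * 24 * 30

# Assumptions for illustration only: average run time and memory size.
avg_duration_seconds = 10
memory_gb = 1.0

gb_seconds_per_month = invocations_per_month * avg_duration_seconds * memory_gb

# Monthly AWS Lambda free tier allowances.
FREE_TIER_REQUESTS = 1_000_000
FREE_TIER_GB_SECONDS = 400_000

within_free_tier = (
    invocations_per_month <= FREE_TIER_REQUESTS
    and gb_seconds_per_month <= FREE_TIER_GB_SECONDS
)
print(invocations_per_month, gb_seconds_per_month, within_free_tier)
```

Under these assumptions the function uses well under a quarter of the compute allowance and under 1% of the request allowance.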

How about the software costs?

The current version is only available from the AWS Marketplace, as prebuilt binaries that have been thoroughly tested. These cost up to 5% of the generated savings and support further development of the software.

Older versions used to be free and open source, and the code is still available on GitHub.
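To put the fee in perspective, here is a small worked example. The monthly cost figures are entirely hypothetical; only the 5% upper bound comes from the text above.

```python
# Hypothetical monthly compute costs, in USD.
on_demand_cost = 1000.0   # what the group would cost on on-demand instances
spot_cost = 300.0         # what it costs after moving to Spot

savings = on_demand_cost - spot_cost

MAX_FEE_PERCENT = 5       # "up to 5%" of the generated savings
max_fee = savings * MAX_FEE_PERCENT / 100

net_savings = savings - max_fee
print(savings, max_fee, net_savings)
```

In this made-up scenario the fee is at most 35 USD against 700 USD of savings, leaving at least 665 USD saved per month.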

Does AutoSpotting continuously search and use cheaper Spot instances?

Or in other words, if I attach AutoSpotting to an AutoScaling group that is 100% Spot instances, will it replace them with cheaper compatible ones when found later on?

The answer is no. The current logic never terminates Spot instances that are already running.

The only times when AutoSpotting interacts with your instances are at the beginning, after scaling actions, or immediately after Spot instances are terminated and on-demand instances are launched again in the group.

I enabled AutoSpotting but nothing happens. What may cause this?

Assuming the installation of AutoSpotting completed successfully, if you set it up on an existing group it may take up to 30 minutes to see the first instance replacements. You may try increasing the group capacity to see if the new instance gets replaced.

Spot instances may also fail to launch for a number of reasons, such as Spot market conditions that manifest in low capacity across all the compatible instance types.

Another common reason is enabling AutoSpotting on a group configured with a Mixed Instances Policy. Such groups are ignored by AutoSpotting in order to avoid possible race conditions. To avoid this, make sure your groups are configured without any instance type overrides and use the instance type configured on the Launch Template.
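One way to check for this yourself is to look for a MixedInstancesPolicy entry in the group description returned by the Auto Scaling DescribeAutoScalingGroups API. The sketch below is only a troubleshooting aid with a hypothetical helper name and sample data, not the check AutoSpotting performs internally.

```python
def uses_mixed_instances_policy(group: dict) -> bool:
    """True if an AutoScaling group description has a MixedInstancesPolicy."""
    return bool(group.get("MixedInstancesPolicy"))


# Trimmed, made-up group descriptions for illustration.
plain_group = {
    "AutoScalingGroupName": "my-web-asg",
    "LaunchTemplate": {"LaunchTemplateName": "web", "Version": "$Latest"},
}
mixed_group = {
    "AutoScalingGroupName": "my-mixed-asg",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {"Overrides": [{"InstanceType": "m5.large"}]},
    },
}
print(uses_mixed_instances_policy(plain_group))
print(uses_mixed_instances_policy(mixed_group))

# With boto3, real descriptions could be fetched with:
#   import boto3
#   groups = boto3.client("autoscaling").describe_auto_scaling_groups()["AutoScalingGroups"]
```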

Which IAM permissions does AutoSpotting need and why are they needed?

You can see the current IAM permissions in the CloudFormation template.

It basically boils down to the following:

◦ describing the resources you have in order to decide what needs to be done
(things such as regions, instances, Spot prices, existing Spot requests,
AutoScaling groups, etc.)
◦ launching Spot instances
◦ attaching and detaching instances to/from AutoScaling groups
◦ terminating detached instances
◦ logging all actions to CloudWatch Logs
◦ billing to the AWS Marketplace
◦ loading/saving configuration values to SSM

In addition to these, the AutoSpotting Lambda function's IAM role also needs another special IAM permission called "iam:PassRole", which is needed in order to reuse the IAM roles of the on-demand instances when launching the replacement Spot instances. This requirement is also pretty well documented by AWS.

How do I uninstall it?

You just need to remove the AutoSpotting CloudFormation or Terraform stack.

The groups will eventually revert to the original state once the Spot market price fluctuations terminate all the Spot instances. In some cases this may take months, so you can also terminate them immediately. The best way to achieve this in a controlled manner is by configuring AutoSpotting to use 100% on-demand capacity for a while before uninstalling it.

Fine-grained control at the per-group level can be achieved by removing the "spot-enabled" tag or setting it to any other value. AutoSpotting only touches groups where this tag is set to "true".

Note: this is the default tag configuration, but it is configurable so you may be using different values.