Kenji Kaneda, Chief Architect

We all know the automatic savings and optimal performance that come with leveraging AWS Spot instances. But when there aren't enough Spot instances for a business-critical workload, costs can climb quickly because the workload falls back to On-Demand nodes.

AWS EC2 Spot is one of the common ways to reduce compute costs. Since Spot instances can be preempted at any time, it is also typical for Kubernetes clusters to have a mix of Spot instances and On-Demand nodes. This ensures the workloads that cannot tolerate node preemption run on On-Demand instances, while other workloads run on Spot instances.

One common technique is to run Cluster Autoscaler with the priority-based expander. This setting allows Cluster Autoscaler to prefer Spot instances but fall back to On-Demand instances when AWS doesn't have available Spot capacity. In theory, it’s a great solution.
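The prefer-then-fall-back behavior can be illustrated with a small, self-contained sketch. This is not Cluster Autoscaler's code or its configuration format; the `NodeGroup` type, the priority values, and the `has_capacity` flag are hypothetical names used only to show the idea of picking the highest-priority node group that can actually scale.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeGroup:
    name: str
    priority: int        # higher value = preferred (convention assumed for this sketch)
    has_capacity: bool   # stand-in for "the cloud provider can launch a node right now"


def pick_node_group(groups: list[NodeGroup]) -> Optional[NodeGroup]:
    """Return the highest-priority group that currently has capacity, if any."""
    candidates = [g for g in groups if g.has_capacity]
    if not candidates:
        return None
    return max(candidates, key=lambda g: g.priority)


# Spot is preferred, but it has no capacity at peak time, so On-Demand is chosen.
spot = NodeGroup("spot", priority=50, has_capacity=False)
on_demand = NodeGroup("on-demand", priority=10, has_capacity=True)
chosen = pick_node_group([spot, on_demand])
print(chosen.name)  # on-demand
```

Note that the decision is made once, at scale-up time. Nothing in this selection logic revisits the choice when Spot capacity returns later.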

Cluster Autoscaler, however, does not guarantee the optimal allocation of On-Demand instances and Spot instances. There are often cases where a cluster has On-Demand instances occupied by Spot-eligible workloads. Let’s walk through a problematic scenario:

A Spot-eligible pod is pending.
Cluster Autoscaler attempts to scale up the cluster, but AWS is at peak time and doesn't have sufficient Spot capacity.
Cluster Autoscaler falls back to creating an On-Demand node.
Later the same day, Spot demand becomes off-peak (e.g., at midnight), and Spot capacity becomes available. The Spot-eligible pod, however, continues to run on the On-Demand node.

The end result is that the cluster cost grows unnecessarily high with these extra On-Demand nodes. Suppose that you have 100 nodes (30 On-Demand nodes and 70 Spot nodes) and 20 of the On-Demand nodes could be converted to Spot nodes. If the On-Demand price is three times the Spot price, you’re paying 33% more than you should be.
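The 33% figure can be checked with a few lines of arithmetic. The sketch below measures cost in units of one Spot node and reads the price assumption as "On-Demand costs three times as much as Spot"; the function name `cluster_cost` is made up for this example.

```python
def cluster_cost(on_demand_nodes: int, spot_nodes: int, price_ratio: float) -> float:
    """Cluster cost in units of one Spot node.

    price_ratio is the On-Demand price divided by the Spot price.
    """
    return on_demand_nodes * price_ratio + spot_nodes


current = cluster_cost(30, 70, 3.0)   # 30 * 3 + 70 = 160
optimal = cluster_cost(10, 90, 3.0)   # 10 * 3 + 90 = 120
overpay = (current - optimal) / optimal

print(f"{overpay:.0%}")  # 33%
```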

How to fix it

While in theory you could replace On-Demand nodes with Spot nodes manually, the process is time-consuming, cumbersome, and error-prone. A DevOps engineer would need to run a custom script daily or weekly to replace On-Demand nodes with Spot nodes by safely evicting pods.

Kubernetes was built to eliminate manual processes and automate resource scaling. But as with any solution, it requires continuous improvement. For that reason, we built CloudNatix Spot rebalancing technology. It maintains the optimal mix of On-Demand instances and Spot instances by monitoring the On-Demand instances in the cluster and converting them to Spot nodes as Spot capacity becomes available, in a few simple steps:

Examine the cluster for On-Demand nodes that are occupied by Spot-eligible pods.
If an On-Demand node is found, CloudNatix Spot rebalancing then checks if the pods running on the node can be moved to other nodes in the cluster.
If so, drain the node and reschedule the pods. Spot-eligible pods are rescheduled to a Spot node.
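The three steps above can be sketched against a toy in-memory model of a cluster. This is not the CloudNatix implementation, and it doesn't call any Kubernetes API; the `Pod` and `Node` types, the first-fit placement, and the simplification that a node is only drained when all of its pods are Spot-eligible and fit on existing Spot nodes are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pod:
    name: str            # assumed unique within the cluster
    cpu_millis: int
    spot_eligible: bool


@dataclass
class Node:
    name: str
    is_spot: bool
    cpu_millis: int
    pods: list[Pod] = field(default_factory=list)

    def free_cpu(self) -> int:
        return self.cpu_millis - sum(p.cpu_millis for p in self.pods)


def plan_moves(node: Node, spot_nodes: list[Node]) -> Optional[dict[str, str]]:
    """First-fit plan (pod name -> Spot node name), or None if a pod doesn't fit."""
    free = {n.name: n.free_cpu() for n in spot_nodes}
    plan: dict[str, str] = {}
    for pod in node.pods:
        target = next((name for name, cpu in free.items() if cpu >= pod.cpu_millis), None)
        if target is None:
            return None
        free[target] -= pod.cpu_millis
        plan[pod.name] = target
    return plan


def rebalance(nodes: list[Node]) -> list[str]:
    """Return the names of On-Demand nodes that were drained."""
    by_name = {n.name: n for n in nodes}
    spot_nodes = [n for n in nodes if n.is_spot]
    drained = []
    for node in nodes:
        # Step 1: look only at On-Demand nodes occupied by Spot-eligible pods.
        if node.is_spot or not node.pods or not all(p.spot_eligible for p in node.pods):
            continue
        # Step 2: check whether every pod can be moved to a Spot node.
        plan = plan_moves(node, spot_nodes)
        if plan is None:
            continue
        # Step 3: "drain" the node by moving its pods to the planned Spot nodes.
        for pod in node.pods:
            by_name[plan[pod.name]].pods.append(pod)
        node.pods = []
        drained.append(node.name)
    return drained


spot_1 = Node("spot-1", True, 4000, [Pod("a", 1000, True)])
od_1 = Node("od-1", False, 4000, [Pod("b", 2000, True), Pod("c", 1000, True)])
od_2 = Node("od-2", False, 4000, [Pod("db", 2000, False)])
drained = rebalance([spot_1, od_1, od_2])
print(drained)  # ['od-1']
```

In this example `od-2` is left alone because its pod cannot tolerate preemption, while `od-1` is emptied and could then be removed from the cluster. A real implementation would also need to respect constraints this sketch ignores, such as memory requests, pod disruption budgets, and affinity rules.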

With CloudNatix, you can instantly see how much you could save with this Spot rebalancing technology (GUI below) and then realize those savings.

