Elevating Availability and Cost Efficiency in Autoscaling

Author: Aya Ozawa

Date: Oct 12, 2023

Autoscaling is key to a cost-effective and scalable platform. However, it also increases the frequency of Pod eviction, which can lead to availability issues. Eviction occurs when the Cluster Autoscaler scales down nodes or when Vertical Pod Autoscaler (VPA) recommendations are applied. Although Kubernetes provides the Pod Disruption Budget (PDB) to limit disruption, in practice its usefulness is restricted by constraints such as a Deployment having only one replica. To mitigate this impact, we created a feature that can safely evict Pods without allocating unnecessary resources for redundancy.

What is PDB?

PDB is a resource that limits the number of concurrent voluntary disruptions to a workload in a Kubernetes cluster. Consider a Deployment with two replicas protected by a PDB with maxUnavailable=1 (that is, at most one Pod may be unavailable once an eviction request is admitted). A sample manifest looks like this:

YAML Code
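As a rough stand-in for the manifest, the following Python sketch builds a comparable PodDisruptionBudget as a plain dictionary and prints it as JSON, which kubectl also accepts. The name "web-pdb" and the label "app: web" are hypothetical placeholders, not values from the original manifest; the budget uses `maxUnavailable: 1`, the PDB field that caps the number of unavailable Pods at one.

```python
import json

# Hypothetical label shared by the Deployment's Pods (placeholder for illustration).
APP_LABELS = {"app": "web"}


def build_pdb(name: str, labels: dict, max_unavailable: int) -> dict:
    """Return a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # At most this many selected Pods may be unavailable at once.
            "maxUnavailable": max_unavailable,
            # The PDB applies to Pods matching these labels.
            "selector": {"matchLabels": labels},
        },
    }


pdb = build_pdb("web-pdb", APP_LABELS, max_unavailable=1)
print(json.dumps(pdb, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, alongside a two-replica Deployment whose Pod template carries the same label.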

If all Deployment Pods are running, an eviction request will succeed. This is because even if one Pod is temporarily stopped due to the eviction, the other Pod is still alive, so the maxUnavailable=1 condition is satisfied.

Figure 1

However, if one Pod is already down, no more Pods can be evicted while keeping the number of unavailable Pods within the threshold, so the eviction request will fail until both Pods are running again. This way, if there are two or more replicas, a Pod can be evicted from a node without causing complete disruption.
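The admission rule described above can be sketched in a few lines of Python. This is a simplified model of the budget arithmetic, not the Kubernetes implementation; the class and method names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Simplified model of a PDB that uses maxUnavailable (names are illustrative)."""
    expected_pods: int    # replicas declared by the Deployment
    max_unavailable: int  # the PDB's maxUnavailable value

    def disruptions_allowed(self, healthy_pods: int) -> int:
        # Pods that must stay healthy for the budget to hold.
        desired_healthy = self.expected_pods - self.max_unavailable
        return max(0, healthy_pods - desired_healthy)

    def admits_eviction(self, healthy_pods: int) -> bool:
        # An eviction is admitted only while at least one disruption is allowed.
        return self.disruptions_allowed(healthy_pods) > 0


budget = Budget(expected_pods=2, max_unavailable=1)
both_running = budget.admits_eviction(healthy_pods=2)  # both Pods healthy
one_down = budget.admits_eviction(healthy_pods=1)      # one Pod already down
print(both_running, one_down)
```

With both Pods healthy the eviction is admitted; with one Pod already down it is refused until the second Pod recovers.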

What's the Limitation of PDB?

As explained in the previous section, multiple replicas of a Deployment allow Pods to be evicted safely while keeping disruptions within the allowed budget. However, if a Deployment has only a single replica, any unavailable Pod means a complete loss of availability. A PDB therefore cannot usefully protect a single-replica workload: a budget that requires the one Pod to stay available blocks every eviction, while a budget that allows one unavailable Pod permits a full outage.
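To make the single-replica case concrete, here is a small Python sketch of the same budget arithmetic with one replica. It is a simplified model with illustrative names, comparing a budget that requires one available Pod against one that tolerates one unavailable Pod.

```python
def disruptions_allowed(healthy: int, desired_healthy: int) -> int:
    """Number of Pods that may be evicted right now (simplified model)."""
    return max(0, healthy - desired_healthy)


REPLICAS = 1

# Option 1: require one Pod to stay available (minAvailable=1).
# The only Pod must stay healthy, so no eviction is ever admitted.
strict = disruptions_allowed(healthy=REPLICAS, desired_healthy=1)

# Option 2: tolerate one unavailable Pod (maxUnavailable=1).
# The eviction is admitted, but it takes down the only Pod.
lenient = disruptions_allowed(healthy=REPLICAS, desired_healthy=REPLICAS - 1)
pods_left_after_eviction = REPLICAS - lenient

print(strict, lenient, pods_left_after_eviction)
```

Neither option is satisfactory: the first blocks scale-down and node maintenance, and the second leaves no Pod serving traffic during the eviction.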

In real-world scenarios, there are many cases where a workload has only one replica, such as legacy applications or development environments that do not require high availability through replication. In such cases, a PDB offers no practical protection, which means there is a risk of service interruption during autoscaling or node maintenance.

However, even in a development environment, service disruptions should be avoided as much as possible. Increasing the number of replicas will improve availability, but it will also increase cost. So, how can we maintain service levels while keeping costs down?

Introducing Dynamic Eviction Controller

To handle the above limitation, we have developed the Dynamic Eviction Controller. This controller temporarily adjusts the number of replicas during an eviction so that Pods can be evicted safely without wasting resources. The advantage of this feature is that you don't need to make any changes to the tools that trigger Pod eviction, such as the autoscaler and node maintenance tools. You can use standard Kubernetes eviction requests as usual. To apply this feature to your workload, simply add the `enable-dynamic-eviction` annotation to the Pod.

This feature is composed of two components: a webhook and a controller. The diagram below illustrates the Dynamic Eviction Controller in action.

First, the webhook intercepts every eviction request. If the Pod being evicted has the `enable-dynamic-eviction` annotation, the webhook notifies the controller of the event, and the controller takes over the eviction process. The controller temporarily increases the number of replicas for the workload (in this example, a Deployment) that manages the target Pod. After confirming that the new Pod is ready, it sets the number of replicas back to 1 and then evicts the old Pod.
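The flow above can be sketched as an in-memory Python model. It makes no Kubernetes API calls and is not the actual controller code: the `Pod` and `Deployment` classes, the `handle_eviction` function, the "-surge" name suffix, and the annotation value "true" are all assumptions for illustration. Only the annotation key comes from the description above.

```python
from dataclasses import dataclass, field

ANNOTATION = "enable-dynamic-eviction"  # key from the article; the value "true" is assumed


@dataclass
class Pod:
    name: str
    ready: bool = True
    annotations: dict = field(default_factory=dict)


@dataclass
class Deployment:
    replicas: int
    pods: list

    def ready_count(self) -> int:
        return sum(1 for p in self.pods if p.ready)


def handle_eviction(dep: Deployment, target: Pod, ready_log: list) -> bool:
    """Return True if the (modelled) controller took over the eviction."""
    # Webhook step: only annotated Pods are handed to the controller.
    if target.annotations.get(ANNOTATION) != "true":
        return False
    # 1. Temporarily add one replica; the new Pod starts out not ready.
    dep.replicas += 1
    new_pod = Pod(f"{target.name}-surge", ready=False,
                  annotations=dict(target.annotations))
    dep.pods.append(new_pod)
    ready_log.append(dep.ready_count())
    # 2. Wait until the new Pod is ready (simulated here as an instant flip).
    new_pod.ready = True
    ready_log.append(dep.ready_count())
    # 3. Restore the replica count and evict the old Pod.
    dep.replicas -= 1
    dep.pods.remove(target)
    ready_log.append(dep.ready_count())
    return True


old_pod = Pod("web-0", annotations={ANNOTATION: "true"})
deployment = Deployment(replicas=1, pods=[old_pod])
ready_log = []
handled = handle_eviction(deployment, old_pod, ready_log)
print(handled, ready_log, [p.name for p in deployment.pods])
```

In this model the number of ready Pods never drops below one during the eviction, and the extra replica exists only for the duration of the handover.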

Conclusion

Deployments with only one replica can face availability issues during eviction. To mitigate this impact, CloudNatix created the Dynamic Eviction Controller. With it, evictions can be carried out safely without allocating redundant resources, maintaining availability while minimizing the cost of doing so.

To learn more about cost optimization and how we can assist you, please feel free to reach out to us by email at contact@cloudnatix.com.
