Autoscaling is the automatic increase or decrease of computational resources that are available for assignment to workloads.
Autoscaling is closely associated with load balancing. Strictly speaking, a load balancer does not require autoscaling capabilities, but load balancers that include autoscaling are generally much more effective. Autoscaling is also ideally suited to load-balanced workloads that run in the cloud, and today the large cloud providers integrate autoscaling into their cloud load balancing capabilities.
Autoscaling is a straightforward concept. Backend servers are brought online or offline automatically, depending on the computational workloads that they must handle. Load balancing then distributes the workloads across the pool of servers. (Note that in this context, “server” can be, but is not necessarily, a physical machine. It could also be a virtual machine, a cloud instance, etc.)
Autoscaling has two primary tasks:
- To have enough servers available so that workloads are processed quickly and efficiently.
- To avoid having excess available capacity, so that operational expenses are minimized.
Generally, the first task is prioritized over the second. For example, when using reactive autoscaling (explained below), it is common for organizations to use a configuration that scales up aggressively when workloads increase, but scales down more slowly after workloads decrease.
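This asymmetry can be sketched in a few lines. The function below is a hypothetical illustration (the thresholds, step sizes, and name `desired_servers` are invented for this example, not taken from any provider's API): it adds capacity in large steps when utilization is high, but removes it one server at a time.

```python
# Illustrative sketch of an asymmetric reactive policy: scale up
# aggressively, scale down conservatively. All names and thresholds
# here are hypothetical.

def desired_servers(current: int, cpu_utilization: float) -> int:
    """Return a new server count for a given average CPU utilization (0-1)."""
    if cpu_utilization > 0.75:
        # Scale up aggressively: add 50% more capacity at once.
        return current + max(1, current // 2)
    if cpu_utilization < 0.30:
        # Scale down slowly: remove only one server at a time.
        return max(1, current - 1)
    return current

print(desired_servers(4, 0.90))  # overloaded: 4 -> 6
print(desired_servers(4, 0.10))  # idle: 4 -> 3
```

In practice, real systems also add a cooldown period after each scaling action so the pool does not oscillate, but the basic bias toward fast scale-up and slow scale-down is the same.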
A perfect autoscaler would be able to accomplish both tasks. At any given time, there would be just enough available server capacity to handle the current workload effectively. Although this perfection is impossible in the real world, much progress has been made toward this ideal, as reflected in the policy options below.
Types of Autoscaling Policies
When configuring an autoscaling system, there is usually a setting (often called its policy) that defines how scaling occurs. There are three typical choices: scheduled, reactive, and predictive.
Scheduled scaling means that servers are brought up and down automatically, according to a preset schedule. For example, an organization might have more workload requirements during business hours. Therefore, a certain number of servers can be scheduled to go offline each night. This reduces electricity usage and other operational expenses.
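A scheduled policy can be expressed as a simple lookup from time to capacity. The sketch below uses hypothetical numbers (a ten-server pool during business hours, three overnight) purely to illustrate the idea:

```python
# A minimal sketch of scheduled scaling: capacity follows the clock,
# not the load. The schedule itself is a hypothetical example.

BUSINESS_HOURS = range(8, 18)  # 08:00-17:59

def scheduled_capacity(hour: int) -> int:
    """Return the number of servers to keep online for a given hour (0-23)."""
    return 10 if hour in BUSINESS_HOURS else 3

print(scheduled_capacity(12))  # midday: full pool
print(scheduled_capacity(2))   # overnight: minimal pool
```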
Reactive scaling means that servers are brought up and down in reaction to changes in workloads. As workloads increase, the system responds by bringing more servers online. Subsequently, when workload requirements decline, servers are taken offline again.
Reactive scaling is much more effective than scheduled scaling: available server capacity is much more likely to closely track current workload requirements. However, it is also much more complicated.
The system must be able to closely and correctly assess incoming workloads, while also gauging available extra capacity within the servers that are currently online. This is closely tied into the operation of the underlying load balancer. Indeed, many of the same considerations apply; for example, there are a variety of possible metrics to use for assessing capacity (e.g., current bandwidth usage, number of connections, CPU usage, memory usage, etc.). Each metric has its advantages and disadvantages. (For more information on this, see How a Load Balancer Works.)
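The assessment step described above can be sketched as follows. This is a simplified illustration, assuming a single utilization metric sampled from each online server (the function name `assess` and the thresholds are hypothetical); a real autoscaler would typically combine several metrics and smooth them over time.

```python
# Sketch of the assessment step in reactive scaling: poll a chosen
# metric from each online server, aggregate, and compare against
# thresholds. Metric choice and thresholds are hypothetical.

from statistics import mean

def assess(pool_metrics: list[float],
           high: float = 0.75, low: float = 0.30) -> str:
    """Decide a scaling action from per-server utilization samples (0-1)."""
    load = mean(pool_metrics)
    if load > high:
        return "scale_up"
    if load < low:
        return "scale_down"
    return "hold"

print(assess([0.9, 0.8, 0.85]))  # pool is hot -> scale_up
print(assess([0.5, 0.4]))        # comfortable -> hold
```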
Reactive autoscaling can be very effective, but it has one potential flaw. It waits until workloads increase before it scales up computational resources. This means there is a slight delay: a short period in which workloads are higher, but more capacity is not yet available. When workloads leap up quickly, this can cause problems, and clients can experience degraded performance.
Predictive scaling is designed to address this problem. The system analyzes historical data to identify resource usage patterns, and applies this information to current conditions. Its goal is to predict when workloads will increase, and to expand server capacity immediately before they do.
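A toy version of this idea can be written in a few lines. The sketch below forecasts the next hour's load as the historical average for that hour of day, then provisions capacity before the load arrives. The history data, the per-server capacity figure, and the name `predicted_capacity` are all invented for illustration; production predictive scalers use far more sophisticated forecasting models.

```python
# Toy sketch of predictive scaling: forecast from historical per-hour
# averages, then pre-provision. All data here is hypothetical.

import math
from statistics import mean

# hour of day -> past utilization observations (0-1)
history = {9: [0.6, 0.7, 0.65], 3: [0.1, 0.05]}

def predicted_capacity(hour: int, per_server: float = 0.2) -> int:
    """Servers to pre-provision for `hour`, from the historical mean load."""
    predicted = mean(history[hour])
    return max(1, math.ceil(predicted / per_server))

print(predicted_capacity(9))  # busy morning hour: capacity added in advance
print(predicted_capacity(3))  # quiet overnight hour: minimal pool
```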
Of the three types of scaling, predictive scaling is the most sophisticated, and the most challenging to do correctly. It is also the newest approach: for example, AWS added it to EC2 in November 2018.
The Future of Autoscaling
Today’s autoscalers are powerful and sophisticated, but providers are continuing their efforts to make them even better.
For example, the three types of scaling described above are not mutually exclusive, and a powerful autoscaler will offer the ability to use multiple policies. AWS EC2, for instance, can be configured to use “dynamic” (i.e., reactive) scaling and predictive scaling simultaneously. Researchers are working to quantify and formalize the best practices for mixing these policies, so that organizations can achieve optimal results for specific applications.
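One common way such policies are combined, often described for setups like the AWS example above, is to compute each policy's desired capacity independently and provision whichever asks for more. The sketch below illustrates that rule; the function name and inputs are hypothetical.

```python
# Sketch of combining a reactive and a predictive policy: each policy
# proposes a target, and the larger one wins. Names are hypothetical.

def combined_target(reactive_target: int, predictive_target: int) -> int:
    """Provision whichever policy requests more capacity."""
    return max(reactive_target, predictive_target)

print(combined_target(3, 5))  # forecast expects a spike: pre-scale to 5
print(combined_target(6, 2))  # unexpected load: reactive policy wins, 6
```

Taking the maximum keeps the forecast from ever shrinking the pool below what current load demands, so a wrong prediction degrades to ordinary reactive behavior rather than causing an outage.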