A backend engineer lost in the DevOps world - Auto Scaling In Kubernetes

Introduction
Hello folks and welcome to the second part of the series where I explore DevOps concepts I've wanted to understand as a backend engineer. In this one we dive into Kubernetes autoscaling, going through the basics and testing it to make sure we understand everything that's going on. Let's start!
AutoScaler Basics
Metrics Server
An autoscaler scales automatically when certain metrics reach an agreed-upon threshold. Simple, right? There's a lot more to it, though.
An autoscaler relies on a metrics source in order to actually watch for metric changes. Kubernetes uses a component called the Metrics Server to collect resource metrics (like CPU and memory usage) for pods and nodes in a cluster. The Metrics Server aggregates these metrics and makes them available to components like the Horizontal Pod Autoscaler (HPA) and other monitoring tools.
The Metrics Server is a lightweight, cluster-wide aggregator of resource usage data (like CPU and memory) for nodes and pods.
It does not store historical data — it only provides the current resource usage (live metrics).
The Metrics Server collects data from the kubelet (the primary node agent that runs on each node).
The kubelet exposes its node's container/pod metrics on port 10250 at the /metrics/resource endpoint.
In managed Kubernetes environments (EKS, GKE) the Metrics Server is installed in the cluster by default. However, if you're using something like kind or Minikube, it isn't. To install it in your local cluster:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
Then verify that it's running:
kubectl get deployment metrics-server -n kube-system
The command kubectl top pods is only available once the Metrics Server is installed and running.
Now we have a Metrics Server pulling metrics from each node's kubelet. We need to do something with this information, and yep, you probably guessed it: autoscaling!
Nginx Deployment (Example)
Before moving on to the actual autoscaling, our example will use a simple nginx deployment. We'll monitor its CPU usage, attach an autoscaler to it, then run an infinite while loop sending requests to the nginx server and watch the autoscaling happen.
This is the deployment/service manifest file for nginx
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "200m"
              memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: LoadBalancer
Each pod has a request of 100 millicores of CPU (0.1 core) and 128 mebibytes of memory, and a limit of 200 millicores (0.2 core) and 256 mebibytes.
Mebibytes are binary units (unlike megabytes, which are decimal). In computing, memory is inherently binary (base-2); for example, RAM sizes are measured in powers of 2 (e.g., 512 MiB, 1 GiB). Megabytes can cause confusion because their decimal nature doesn't match binary-based memory calculations. 1 mebibyte = 1024² bytes.
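To make that unit difference concrete, here's a quick arithmetic sketch (a minimal illustration of my own, not part of the article's manifests):

```python
# Decimal (SI) vs binary (IEC) memory units.
MB = 1000 ** 2   # 1 megabyte  = 1,000,000 bytes
MiB = 1024 ** 2  # 1 mebibyte  = 1,048,576 bytes

# A pod requesting "128Mi" actually gets 128 * 1024^2 bytes.
request_bytes = 128 * MiB
print(request_bytes)         # 134217728
print(128 * MiB - 128 * MB)  # 6217728 bytes difference (~6.2 MB)
```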
HPA Manifest
The basic autoscaling manifest looks something like this (in this context, we're applying the HPA to the nginx deployment above):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 20
When the average CPU utilization across the nginx deployment's running pods exceeds 20% of their CPU request (an intentionally low threshold for demonstration, not a practical value), scaling up begins.
Let's say we have 2 replicas already running, both with a CPU utilization of 25%.
Thus the average CPU utilization is:
(utilization of first pod + utilization of second pod) / 2 (pod count) = 25%
To figure out how many more replicas we need so the average utilization drops back under the 20% target, we look at the ratio between the actual and target utilization: 25/20 = 1.25.
That is, the actual utilization is 1.25× the target (1.25 times higher than the target).
Multiplying this scaling factor (1.25 in this case) by the current number of replicas gives the number of replicas needed to bring the CPU utilization down to the target. In our case, 2 × 1.25 = 2.5, and since we can't create a fraction of a pod, we round up to 3. So after scaling up we should have 3 replicas instead of 2.
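The calculation above is the HPA's core formula: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A minimal sketch (the function name is mine, not part of any Kubernetes API):

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """HPA core formula: ceil(current * currentUtil / targetUtil)."""
    ratio = current_utilization / target_utilization
    return math.ceil(current_replicas * ratio)

# Our example: 2 replicas at 25% average utilization, 20% target.
print(desired_replicas(2, 25, 20))  # 3  (ceil(2 * 1.25) = ceil(2.5))
```

The same formula also drives scale-down: if utilization falls below the target, the ratio is less than 1 and the desired replica count shrinks.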
The HPA manifest above is a very simple implementation. There are many more configuration options for scaling down (e.g., how long to wait before scaling down again) and other important settings, but for the sake of this article we'll keep it simple.
Upon applying the above manifests, we should have an HPA attached to our nginx deployment.
Testing
We can test using the BusyBox image, which gives us a shell where we can run a while loop that sends requests to the nginx web server.
kubectl run busybox --image=busybox --rm -it -- /bin/sh
Then inside the shell
while true; do wget -q -O- http://nginx-service; done
If we execute kubectl get hpa, we'll see a TARGETS column showing x%/20%, which is the current utilization over the target utilization specified in the HPA manifest.
We can monitor and confirm that once the current utilization passes the threshold, new replicas are created according to the formula above!
Auto Scaling Down
When a Kubernetes HPA is configured, it checks its metrics (like CPU or memory utilization) at regular intervals (every 15 seconds by default). If the current resource usage falls below the defined target utilization threshold, the HPA will scale down the number of pods.
Scaling-down criteria:
Target utilization vs. current resource usage: if the current utilization is below the target, the HPA will scale down.
minReplicas: the HPA never scales below this number.
Cooldown period (stabilization window): Kubernetes doesn't immediately scale down when resource usage decreases slightly. It has a built-in stabilization period to avoid flapping, which is when scaling occurs rapidly back and forth.
The HPA manifest has a behavior section that lets you specify custom scaling behavior, including how quickly to scale down.
In our HPA manifest we can update it as follows:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 20
  behavior:
    scaleDown:
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
          # Scale down by at most 20% every 60 seconds
The behavior here is: scale down by at most 20% of the current replicas every 60 seconds. The default stabilizationWindowSeconds for scale-down is 5 minutes, but it can be configured too.
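To see what a 20%-per-60s Percent policy means in practice, here's a rough sketch of how it caps a single scale-down step (my own simplified model, not the actual controller code):

```python
import math

def scale_down_floor(current_replicas: int, percent: int, min_replicas: int) -> int:
    """Lowest replica count allowed after one period under a Percent policy."""
    max_removed = math.floor(current_replicas * percent / 100)
    return max(current_replicas - max_removed, min_replicas)

# With 5 replicas and a 20%/60s policy, one step can remove at most 1 pod.
print(scale_down_floor(5, 20, 1))  # 4
```

Note that at 2 replicas, 20% of the count rounds down to zero pods, so a Percent policy alone can stall; real configurations often pair it with a Pods policy for that reason.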
Summary
In this article we took a look at how the HPA works in Kubernetes based on metrics such as CPU and memory; we saw how these metrics are gathered by the Metrics Server and how they're used to scale up and down. I'll make a follow-up where we autoscale based on custom metrics such as response-time percentiles and load. That will require extra work, but it's worth it for the content, I guess. See you in the next one!




