Installing KEDA (Kubernetes Event-driven Autoscaler)
As speech workloads fluctuate throughout the day, static resource allocation inevitably leads to either wasted capacity or degraded performance. Capacity Private Cloud integrates seamlessly with KEDA (Kubernetes Event-driven Autoscaler) to deliver intelligent, metric-driven scaling that responds to real demand rather than arbitrary thresholds. By coupling KEDA with the platform's native Prometheus telemetry, operations teams gain precise, event-driven control over every microservice in the speech pipeline.
KEDA extends the Kubernetes Horizontal Pod Autoscaler (HPA) with event-driven capabilities. It connects to external metrics sources such as RabbitMQ and Prometheus, exposing their data as custom metrics that Kubernetes can act on. Unlike the default HPA, KEDA can scale workloads all the way down to zero replicas when no events are being produced, significantly reducing compute costs during periods of inactivity.
KEDA operates in two main ways:
- Metric Feeding — KEDA connects to event sources (RabbitMQ, Prometheus) and exposes their data as custom metrics in Kubernetes, enabling workloads to scale on them.
- Event Source Autoscaling — KEDA can scale workloads down to zero replicas when no events exist, eliminating compute costs when there is no demand.
Note: KEDA requires a minimum Kubernetes version of 1.27.
Installation
KEDA can be deployed into any Kubernetes cluster using its Helm chart, static manifests, or an operator. The following Helm commands add the KEDA repository and install it into a dedicated namespace:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
YAML Scaling File Examples
The following configurations demonstrate how to scale deployments based on Prometheus metrics using KEDA ScaledObjects. Each definition specifies a Prometheus query and a threshold that determines when scaling should occur.
Each configuration also defines the minimum and maximum number of replicas that can be scaled.
In KEDA, the cooldownPeriod is the number of seconds KEDA waits after the last trigger reports active before scaling the workload down to zero. This keeps the workload stable through brief fluctuations in the metrics. Note that cooldownPeriod only governs scaling to zero; scale-down between minReplicaCount and maxReplicaCount is handled by the underlying HPA and its stabilization window. When used with a Prometheus ScaledObject, it determines how long KEDA waits after the queried metrics fall below the specified threshold before scaling to zero.
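The threshold in each trigger behaves as a per-replica target: KEDA exposes the metric to the HPA, which, for average-value metrics (the default for the KEDA Prometheus scaler), computes roughly ceil(metric total / threshold) replicas, clamped to the configured bounds. A minimal sketch of that arithmetic (the function name and sample values are illustrative, not part of KEDA):

```python
import math

def desired_replicas(metric_total: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the replica count the HPA derives from an
    AverageValue metric: ceil(total / threshold), clamped to the
    configured minReplicaCount/maxReplicaCount bounds."""
    raw = math.ceil(metric_total / threshold)
    return max(min_replicas, min(max_replicas, raw))

# e.g. 120 active requests against a threshold of "55" with the
# bounds used in the examples below (min 1, max 10):
print(desired_replicas(120, 55, 1, 10))  # -> 3
```

When multiple triggers are defined, as in the ASR example, each trigger is evaluated independently and the highest resulting replica count wins.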
You can define scaling for various services, some of which are shown below.
keda-asr-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: asr-en
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: asr-en
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.prometheus.svc.cluster.local
        metricName: asr_active_asr_requests   # Grammar-based ASR interactions
        threshold: "55"
        query: sum(asr_active_asr_requests{app="asr-en"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.prometheus.svc.cluster.local
        metricName: asr_active_transcription_requests   # Transcription-based ASR interactions
        threshold: "30"
        query: sum(asr_active_transcription_requests{app="asr-en"})
keda-lumenvox-api-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: lumenvox-api
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: lumenvox-api
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: lumenvox_api_active_requests
        threshold: "100"
        query: sum(lumenvox_api_active_requests{app="lumenvox-api"})
keda-session-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: session
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: session
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: session_active_streams
        threshold: "100"
        query: sum(session_active_streams{app="session"})
keda-grammar-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: grammar
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: grammar
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: grammar_active_grammars
        threshold: "1000"
        query: sum(grammar_active_grammars{app="grammar"})
keda-vad-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vad
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: vad
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: vad_active_requests
        threshold: "100"
        query: sum(vad_active_requests{app="vad"})
keda-tts-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: neural-tts
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: neural-tts-en-us
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: tts_active_requests
        threshold: "100"
        query: sum(tts_active_requests{app="neural-tts-en-us"})
Applying a Manifest File
Use the following command to apply a KEDA ScaledObject manifest and create the autoscaling resources for a given service:
kubectl apply -f keda-<service>-prom-scale.yaml
List ScaledObjects
To list all KEDA ScaledObjects currently running in the cluster, use the following command:
kubectl get scaledobject -n lumenvox
This produces output similar to:
| Name | ScaleTargetKind | ScaleTargetName | Min | Max | Triggers | Ready | Active | Age |
|---|---|---|---|---|---|---|---|---|
| asr-en | apps/v1.Deployment | asr-en | 1 | 10 | prometheus | True | False | 44h |
| grammar | apps/v1.Deployment | grammar | 1 | 10 | prometheus | True | True | 44h |
| lumenvox-api | apps/v1.Deployment | lumenvox-api | 1 | 10 | prometheus | True | False | 44h |
| session | apps/v1.Deployment | session | 1 | 10 | prometheus | True | False | 44h |
Node Scaling
Scaling nodes in a Kubernetes cluster ensures sufficient resources (CPU, memory, storage) are available to handle growing application workloads and maintain high availability. Equally important is scaling nodes back down to manage costs effectively. The recommended method depends on your cluster setup, use case, and cost objectives.
Note: The following are examples only. We strongly recommend consulting the Kubernetes documentation to determine the best scaling method for your own use cases and budget. Do not adopt any of these approaches without carefully considering the costs and behavior. Capacity is not prescriptive about which node-scaling strategy to use; the right choice depends on your infrastructure and operational requirements.
1. Cluster Autoscaler (Best Practice for Node Scaling)
The Cluster Autoscaler is the most widely used and recommended method for scaling nodes in Kubernetes. It is an open-source project developed by the Kubernetes community.
How it works:
- Automatically adjusts the size of your Kubernetes node pool based on the pending workload.
- Adds nodes when pods cannot be scheduled due to insufficient resources on existing nodes (CPU, memory).
- Removes nodes when they are underutilized and the pods running on them can be rescheduled on other nodes.
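As an illustration only (the provider, node-group name, image tag, and size bounds below are placeholders; consult the Cluster Autoscaler documentation for your platform), the autoscaler is typically configured with per-node-group size bounds via its container arguments:

```yaml
# Fragment of a Cluster Autoscaler Deployment spec (illustrative values)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws                     # match your provider
      - --nodes=2:10:my-node-group               # min:max:node-group name
      - --scale-down-utilization-threshold=0.5   # scale down nodes below 50% utilization
      - --balance-similar-node-groups=true
```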
2. Manual Node Scaling
Manual scaling involves explicitly adding or removing nodes in your cluster. For example:
- Adding nodes by increasing the number of virtual machines or instances in your cloud provider.
- Removing nodes when they are no longer needed using the cloud dashboard, CLI, or API.
3. Node Autoscalers Provided by Cloud Providers
Many Kubernetes cloud platforms (e.g., AWS EKS, GCP GKE, Azure AKS) come with managed node autoscaling tools that abstract away the complexities of configuring a Cluster Autoscaler.
4. Using Karpenter (Alternative to Cluster Autoscaler)
Karpenter is an open-source project developed by AWS as an alternative to the Cluster Autoscaler. It is designed to scale nodes quickly and dynamically without relying on pre-defined capacity in node groups.
5. Spot Instances for Cost-Effective Scaling
For cost optimization, you can combine autoscalers with spot or low-priority instances offered by many cloud providers (e.g., AWS Spot EC2, GCP Preemptible VMs, Azure Low Priority Nodes).
6. On-Premises Clusters: DIY Solutions
For on-premises or self-managed Kubernetes clusters (e.g., via kubeadm), the process can involve:
- Manually adding or removing physical/virtual machines to support workloads.
- Using Cluster Autoscaler on custom infrastructure (e.g., using API integrations with your VM provider).
Best Practices for Scaling Nodes
- Use Cluster Autoscaler — If running in a cloud environment, set a reasonable minNodes and maxNodes for capacity limits.
- Define Proper Resource Requests and Limits — Ensure every pod in your cluster has well-defined CPU and memory requests, as autoscalers rely on these values to decide scaling.
- Use Pod Disruption Budgets (PDBs) — Ensure pods of critical workloads are not disrupted during scaling events.
- Workload-Specific Node Pools — Create separate node pools for workloads with unique requirements (e.g., GPU, memory-intensive workloads).
- Monitor Node Usage — Use monitoring tools like Prometheus, Grafana, or the cloud provider's metrics dashboards to track node utilization and autoscaler effectiveness.
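To make the first two of these recommendations concrete, the fragment below sketches per-container resource requests/limits and a PodDisruptionBudget (the names, image, and values are placeholders, not part of the Capacity deployment):

```yaml
# Illustrative resource requests/limits; autoscalers use the
# requests to decide bin-packing and when to add nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example/service:1.0
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
---
# Illustrative PodDisruptionBudget keeping at least one replica
# available during voluntary disruptions such as node scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-service
```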