Installing KEDA (Kubernetes Event-driven Autoscaler)
As speech workloads fluctuate throughout the day, static resource allocation inevitably leads to either wasted capacity or degraded performance. Capacity Private Cloud integrates seamlessly with KEDA (Kubernetes Event-driven Autoscaler) to deliver intelligent, metric-driven scaling that responds to real demand rather than arbitrary thresholds. By coupling KEDA with the platform's native Prometheus telemetry, operations teams gain precise, event-driven control over every microservice in the speech pipeline.
KEDA extends the Kubernetes Horizontal Pod Autoscaler (HPA) with event-driven capabilities. It connects to external metrics sources such as RabbitMQ and Prometheus, exposing their data as custom metrics that Kubernetes can act on. Unlike the default HPA, KEDA can scale workloads all the way down to zero replicas when no events are being produced, significantly reducing compute costs during periods of inactivity.
KEDA operates in two main ways:
- Metric Feeding — KEDA connects to event sources (RabbitMQ, Prometheus) and exposes their data as custom metrics in Kubernetes, enabling workloads to scale on them.
- Event Source Autoscaling — KEDA can scale workloads down to zero replicas when no events exist, eliminating compute costs when there is no demand.
Note: KEDA requires a minimum Kubernetes version of 1.27.
Installation
KEDA can be deployed into any Kubernetes cluster using its Helm chart, static manifests, or an operator. The following Helm commands add the KEDA repository and install it into a dedicated namespace:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
YAML Scaling File Examples
The following configurations demonstrate how to scale deployments based on Prometheus metrics using KEDA ScaledObjects. Each definition specifies a Prometheus query and a threshold that determines when scaling should occur.
Each configuration also defines the minimum and maximum number of replicas that can be scaled.
In KEDA, the cooldownPeriod is the number of seconds KEDA waits after the last trigger reports active before scaling the workload down to zero. This keeps the workload stable through brief fluctuations in the metrics. Note that cooldownPeriod only governs scaling to zero; scale-down between minReplicaCount and maxReplicaCount is handled by the underlying HPA and its stabilization window. When used with a Prometheus ScaledObject, it determines how long KEDA waits after the queried metrics fall below the specified threshold before scaling to zero.
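The threshold in each trigger behaves as a per-replica target: KEDA exposes the metric to the HPA, which, for average-value metrics (the default for the KEDA Prometheus scaler), computes roughly ceil(metric total / threshold) replicas, clamped to the configured bounds. A minimal sketch of that arithmetic (the function name and sample values are illustrative, not part of KEDA):

```python
import math

def desired_replicas(metric_total: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the replica count the HPA derives from an
    AverageValue metric: ceil(total / threshold), clamped to the
    configured minReplicaCount/maxReplicaCount bounds."""
    raw = math.ceil(metric_total / threshold)
    return max(min_replicas, min(max_replicas, raw))

# e.g. 120 active requests against a threshold of "55" with the
# bounds used in the examples below (min 1, max 10):
print(desired_replicas(120, 55, 1, 10))  # -> 3
```

When multiple triggers are defined, as in the ASR example, each trigger is evaluated independently and the highest resulting replica count wins.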
You can define scaling for various services, some of which are shown below.
keda-asr-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: asr-en
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: asr-en
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.prometheus.svc.cluster.local
        metricName: asr_active_asr_requests   # Grammar-based ASR interactions
        threshold: "55"
        query: sum(asr_active_asr_requests{app="asr-en"})
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.prometheus.svc.cluster.local
        metricName: asr_active_transcription_requests   # Transcription-based ASR interactions
        threshold: "30"
        query: sum(asr_active_transcription_requests{app="asr-en"})
keda-lumenvox-api-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: lumenvox-api
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: lumenvox-api
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: lumenvox_api_active_requests
        threshold: "100"
        query: sum(lumenvox_api_active_requests{app="lumenvox-api"})
keda-session-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: session
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: session
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: session_active_streams
        threshold: "100"
        query: sum(session_active_streams{app="session"})
keda-grammar-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: grammar
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: grammar
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: grammar_active_grammars
        threshold: "1000"
        query: sum(grammar_active_grammars{app="grammar"})
keda-vad-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vad
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: vad
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: vad_active_requests
        threshold: "100"
        query: sum(vad_active_requests{app="vad"})
keda-tts-prom-scale.yaml
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: neural-tts
spec:
  scaleTargetRef:
    kind: Deployment      # Default
    name: neural-tts-en-us
  pollingInterval: 10     # Default 30
  cooldownPeriod: 300     # Default 300
  minReplicaCount: 1      # Default 0
  maxReplicaCount: 10     # Default 100
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc.cluster.local
        metricName: tts_active_requests
        threshold: "100"
        query: sum(tts_active_requests{app="neural-tts-en-us"})
Applying a Manifest File
Use the following command to apply a KEDA ScaledObject manifest and create the autoscaling resources for a given service:
kubectl apply -f keda-<service>-prom-scale.yaml
List ScaledObjects
To list all KEDA ScaledObjects currently running in the cluster, use the following command:
kubectl get scaledobject -n lumenvox
This produces output similar to:
| Name | ScaleTargetKind | ScaleTargetName | Min | Max | Triggers | Ready | Active | Age |
|---|---|---|---|---|---|---|---|---|
| asr-en | apps/v1.Deployment | asr-en | 1 | 10 | prometheus | True | False | 44h |
| grammar | apps/v1.Deployment | grammar | 1 | 10 | prometheus | True | True | 44h |
| lumenvox-api | apps/v1.Deployment | lumenvox-api | 1 | 10 | prometheus | True | False | 44h |
| session | apps/v1.Deployment | session | 1 | 10 | prometheus | True | False | 44h |
Node Scaling
Scaling nodes in a Kubernetes cluster ensures sufficient resources (CPU, memory, storage) are available to handle growing application workloads and maintain high availability. Equally important is scaling nodes back down to manage costs effectively. The recommended method depends on your cluster setup, use case, and cost objectives.
Note: The following are examples only. We strongly recommend consulting the Kubernetes documentation to determine the best scaling method for your own use cases and budget. Do not adopt any of these approaches without carefully considering the costs and behavior. Capacity is not prescriptive about which node-scaling strategy to use; the right choice depends on your infrastructure and operational requirements.
1. Cluster Autoscaler (Best Practice for Node Scaling)
The Cluster Autoscaler is the most widely used and recommended method for scaling nodes in Kubernetes. It is an open-source project developed by the Kubernetes community.
How it works:
- Automatically adjusts the size of your Kubernetes node pool based on the pending workload.
- Adds nodes when pods cannot be scheduled due to insufficient resources on existing nodes (CPU, memory).
- Removes nodes when they are underutilized and the pods running on them can be rescheduled on other nodes.
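As an illustration only (the provider, node-group name, image tag, and size bounds below are placeholders; consult the Cluster Autoscaler documentation for your platform), the autoscaler is typically configured with per-node-group size bounds via its container arguments:

```yaml
# Fragment of a Cluster Autoscaler Deployment spec (illustrative values)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws                     # match your provider
      - --nodes=2:10:my-node-group               # min:max:node-group name
      - --scale-down-utilization-threshold=0.5   # scale down nodes below 50% utilization
      - --balance-similar-node-groups=true
```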
2. Manual Node Scaling
Manual scaling involves explicitly adding or removing nodes in your cluster. For example:
- Adding nodes by increasing the number of virtual machines or instances in your cloud provider.
- Removing nodes when they are no longer needed using the cloud dashboard, CLI, or API.
3. Node Autoscalers Provided by Cloud Providers
Many Kubernetes cloud platforms (e.g., AWS EKS, GCP GKE, Azure AKS) come with managed node autoscaling tools that abstract away the complexities of configuring a Cluster Autoscaler.
4. Using Karpenter (Alternative to Cluster Autoscaler)
Karpenter is an open-source project developed by AWS as an alternative to the Cluster Autoscaler. It is designed to scale nodes quickly and dynamically without relying on pre-defined capacity in node groups.
5. Spot Instances for Cost-Effective Scaling
For cost optimization, you can combine autoscalers with spot or low-priority instances offered by many cloud providers (e.g., AWS Spot EC2, GCP Preemptible VMs, Azure Low Priority Nodes).
6. On-Premises Clusters: DIY Solutions
For on-premises or self-managed Kubernetes clusters (e.g., via kubeadm), the process can involve:
- Manually adding or removing physical/virtual machines to support workloads.
- Using Cluster Autoscaler on custom infrastructure (e.g., using API integrations with your VM provider).
Best Practices for Scaling Nodes
- Use Cluster Autoscaler — If running in a cloud environment, set a reasonable minNodes and maxNodes for capacity limits.
- Define Proper Resource Requests and Limits — Ensure every pod in your cluster has well-defined CPU and memory requests, as autoscalers rely on these values to decide scaling.
- Use Pod Disruption Budgets (PDBs) — Ensure pods of critical workloads are not disrupted during scaling events.
- Workload-Specific Node Pools — Create separate node pools for workloads with unique requirements (e.g., GPU, memory-intensive workloads).
- Monitor Node Usage — Use monitoring tools like Prometheus, Grafana, or the cloud provider's metrics dashboards to track node utilization and autoscaler effectiveness.
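To make the first two of these recommendations concrete, the fragment below sketches per-container resource requests/limits and a PodDisruptionBudget (the names, image, and values are placeholders, not part of the Capacity deployment):

```yaml
# Illustrative resource requests/limits; autoscalers use the
# requests to decide bin-packing and when to add nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example/service:1.0
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
---
# Illustrative PodDisruptionBudget keeping at least one replica
# available during voluntary disruptions such as node scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-service
```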