For applications with slow startup times, such as Java applications, the default scale-to-zero policy in community Knative can cause high cold start latency. ACK Knative offers a reserved instance feature. This feature keeps low-cost instances running to ensure fast response times. This reduces the impact of cold starts and helps control resource costs.
How it works
To save costs, community Knative scales instances down to zero when there is no traffic. When a new request arrives, the system must schedule resources, pull the image, and start the application. This "cold start" process can cause high latency for the first request.
To address this issue, ACK Knative provides a reserved instance feature. This feature keeps one or more low-specification instances running even when there is no traffic. The workflow is as follows.
1. When there is no traffic: The Service (pod) scales in gradually. At least one instance remains online to provide immediate responses.
2. First request and scale-out trigger: When the first request arrives, two operations are triggered in parallel:
   - Immediate service: The request is immediately routed to the online reserved instance for processing. This avoids cold start latency.
   - Scale-out instruction: Knative immediately creates standard-specification instances.
3. Traffic switch: After the first standard-specification instance is ready, subsequent requests are automatically routed to it.
4. Resource removal: The original reserved instance goes offline automatically after it finishes processing the initial requests it received.
Usage
After you deploy Knative in the cluster, you can configure the reserved instance feature by adding specific annotations to the Knative Service.
knative.aliyun.com/reserve-instance: Set to enable to enable the reserved instance feature.
knative.aliyun.com/reserve-instance-type: Specifies the resource type for the reserved instance. Supported values are eci (default), ecs, and acs.
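As a minimal sketch of how the two annotations fit together, the following manifest enables the feature and sets the resource type explicitly to eci, which the description above lists as the default. The Service name hello-reserve-default is a hypothetical placeholder, and the image is borrowed from the other examples in this topic.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-default   # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        # Turn on the reserved instance feature
        knative.aliyun.com/reserve-instance: enable
        # Resource type for the reserved instance: eci (default), ecs, or acs
        knative.aliyun.com/reserve-instance-type: eci
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

Both annotations belong under spec.template.metadata.annotations, not under the top-level metadata of the Service, which is where every example in this topic places them.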
Configure ECI-based reserved instances
To use ECI computing power, you must install ACK Virtual Node. For more information, see Components.
Specify ECI instance types
To use specific instance types, you can specify them using the knative.aliyun.com/reserve-instance-eci-use-specs annotation.
The following example specifies the ecs.t6-c1m1.large and ecs.t5-lc1m2.small instance types.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
Specify CPU and memory specifications
If you are unsure about which specific instance types to use, you can directly define the required CPU and memory resources.
The following example specifies a 1-core 2 GiB instance.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
Configure ACS-based reserved instances
You can enable ACS-based reserved instances by setting knative.aliyun.com/reserve-instance-type: acs.
To use ACS computing power, you must install ACK Virtual Node. For more information, see Components.
Specify the computing power type and quality
The following example shows a basic configuration for an ACS-based reserved instance. You can specify the compute type (compute-class) and compute quality (compute-qos).
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Configure the compute type for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Configure the compute quality for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
        - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
          env:
            - name: TARGET
              value: "Knative"
Specify CPU and memory specifications
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
        - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
          env:
            - name: TARGET
              value: "Knative"
Configure ECS-based reserved instances
You can configure a reserved instance to use an ECS instance type with lower specifications than the standard instance. This helps reduce long-term running costs.
GPU
The following example configures a low-specification GPU-accelerated instance as a reserved instance for a GPU inference service.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Enable and configure ECS-based reserved instances. You can configure one or more instance types.
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration for standard instances
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset
CPU
The following example specifies a 1-core 2 GiB instance.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
        - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
          env:
            - name: TARGET
              value: "Knative"
Configure a reserved instance resource pool
To handle high traffic bursts, you can expand a single reserved instance into a resource pool. You can specify the number of replicas for the reserved instances using the knative.aliyun.com/reserve-instance-replicas annotation.
When traffic arrives, the reserved instances that are already running respond immediately. At the same time, the system scales out standard-specification instances as needed. After enough standard instances are available to handle the traffic, the system smoothly routes traffic to them. The instances in the reserved instance pool are then automatically scaled down to zero.
The following example creates a reserved pool of three low-specification instances.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
Recommendations for production environments
Choose a suitable instance type for reserved instances. Use the minimum configuration that can run the application stably and handle at least one request.
If your service might experience high traffic when scaling up from zero, use a reserved instance resource pool to increase traffic capacity.
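The two recommendations can be combined, as the following sketch illustrates: a small reserved pool whose instances use a minimal CPU and memory configuration, while the container's own resources describe the larger standard instances. It rests on assumptions not confirmed in this topic: that knative.aliyun.com/reserve-instance-replicas can be used together with the ECS resource annotations, and that a 1-core 2 GiB reserved instance is enough for the application. The Service name and the standard-instance resource values are hypothetical placeholders.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool-small   # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        # Assumption: the pool size can be combined with the resource annotations below
        knative.aliyun.com/reserve-instance-replicas: "2"
        # Smallest configuration assumed to run the application stably
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
        - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
          resources:
            # Illustrative values for the standard-specification instances
            requests:
              cpu: "4"
              memory: 8Gi
```

Before relying on such a configuration, verify under load that a single reserved instance of the chosen size can serve at least one request within your latency target, and size the pool to the burst you expect when scaling up from zero.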
Billing
Reserved instances run continuously and incur charges. For more information about the billing rules, see the following topics:
References
You can use cost-effective spot instances in Knative. For more information, see Use spot instances.
You can implement automatic scaling for workloads in Knative. For more information, see Use HPA in Knative, Automatically scale Services based on the number of traffic requests, and Use AHPA to implement scheduled automatic scaling.