Configure reserved instances for Knative Services to reduce cold start latency

For applications that start slowly, such as Java applications, the default scale-to-zero policy in community Knative can cause high cold start latency. The reserved instance feature of ACK Knative keeps one or more low-cost instances running so that the Service can respond immediately. This reduces the impact of cold starts while keeping resource costs under control.

How it works

To save costs, community Knative scales instances down to zero when there is no traffic. When a new request arrives, the system must schedule resources, pull the image, and start the application. This "cold start" process can cause high latency for the first request.

To address this issue, ACK Knative provides a reserved instance feature. This feature keeps one or more low-specification instances running even when there is no traffic. The workflow is as follows.

  1. When there is no traffic: The pods of the Service scale in gradually, but at least one reserved instance remains online to provide immediate responses.

  2. First request and scale-out trigger:

    When the first request arrives, two operations are triggered in parallel:

    • Immediate service: The request is immediately routed to the online reserved instance for processing. This avoids cold start latency.

    • Scale-out instruction: Knative immediately creates standard-specification instances.

  3. Traffic switch: After the first standard-specification instance is ready, subsequent requests are automatically routed to it.

  4. Resource removal: The original reserved instance goes offline automatically after it finishes processing the initial requests it received.

Usage

After you deploy Knative in the cluster, you can configure the reserved instance feature by adding specific annotations to the Knative Service.

  • knative.aliyun.com/reserve-instance: Set the value to enable to turn on the reserved instance feature.

  • knative.aliyun.com/reserve-instance-type: Specifies the resource type for the reserved instance. Supported values are eci (default), ecs, and acs.
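
As a minimal sketch of how the two annotations fit together, the following manifest enables a reserved instance and sets the resource type explicitly to its default value. The Service name hello-reserve is a placeholder chosen for illustration. The image is the sample image used in the other examples in this topic.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve  # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        # Turn on the reserved instance feature
        knative.aliyun.com/reserve-instance: enable
        # Optional: eci is the default value and is shown here only for clarity
        knative.aliyun.com/reserve-instance-type: eci
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

As in the other examples in this topic, the annotations are placed under spec.template.metadata rather than under the top-level metadata of the Service.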

Configure ECI-based reserved instances

To use ECI computing power, you must install ACK Virtual Node. For more information, see Components.

Specify ECI instance types

To run reserved instances on specific ECI instance types, list the instance types, separated by commas, in the knative.aliyun.com/reserve-instance-eci-use-specs annotation.

The following example specifies the ecs.t6-c1m1.large and ecs.t5-lc1m2.small instance types.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Specify CPU and memory specifications

If you are unsure which instance types to use, you can instead specify the required CPU and memory directly in the same knative.aliyun.com/reserve-instance-eci-use-specs annotation.

The following example specifies a 1-core 2 GiB instance.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Configure ACS-based reserved instances

You can enable ACS-based reserved instances by setting knative.aliyun.com/reserve-instance-type: acs.

To use ACS computing power, you must install ACK Virtual Node. For more information, see Components.

Specify the computing power type and quality

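The following example shows a basic configuration for an ACS-based reserved instance. You can optionally specify the compute class and the compute quality of service (QoS) with the knative.aliyun.com/reserve-instance-acs-compute-class and knative.aliyun.com/reserve-instance-acs-compute-qos annotations.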

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Configure the compute type for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Configure the compute quality for the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Specify CPU and memory specifications

The following example requests 1 vCPU and 2 GiB of memory for the ACS-based reserved instance.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure ECS-based reserved instances

You can enable ECS-based reserved instances by setting knative.aliyun.com/reserve-instance-type: ecs. The reserved instance can use an ECS instance type with lower specifications than the standard instances, which helps reduce long-term running costs.

GPU

The following example configures a low-specification GPU-accelerated instance as a reserved instance for a GPU inference service.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Enable and configure ECS-based reserved instances. You can configure one or more instance types.
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration for standard instances
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset

CPU

The following example sets both the resource requests and the resource limits of the reserved instance to 1 vCPU and 2 GiB of memory.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Configure a reserved instance resource pool

To handle high traffic bursts, you can expand a single reserved instance into a resource pool. You can specify the number of replicas for the reserved instances using the knative.aliyun.com/reserve-instance-replicas annotation.

When traffic arrives, the pre-provisioned reserved instances respond immediately. At the same time, the system scales out standard-specification instances as needed. After enough standard instances are available to handle the traffic, the system smoothly shifts traffic to them, and the reserved instance pool is automatically scaled down to zero.

The following example creates a reserved pool of three low-specification instances.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

Recommendations for production environments

  • Choose a suitable instance type for reserved instances. Use the minimum configuration that can run the application stably and handle at least one request.

  • If your service might experience high traffic when scaling up from zero, use a reserved instance resource pool to increase traffic capacity.
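
The two recommendations above can be combined into one manifest. The sketch below assumes that the knative.aliyun.com/reserve-instance-replicas annotation can be used together with the ECS resource annotations shown earlier. This topic does not demonstrate that combination, so verify it in a test cluster first. The Service name and the replica count of 2 are illustrative values only.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-small-pool  # hypothetical name, for illustration only
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        # Assumption: the pool size can be combined with the resource annotations below
        knative.aliyun.com/reserve-instance-replicas: "2"
        # Smallest configuration that still runs the application stably
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
```

Size the reserved instances from load tests of your own application. The 1 vCPU and 2 GiB values are copied from the earlier examples and are not a sizing recommendation.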

Billing

Reserved instances run continuously and incur charges. For the billing rules, see the billing documentation for the resource type that backs the reserved instance (ECI, ECS, or ACS).

References