Configure reserved instances for Knative Services to reduce cold start latency - Container Service for Kubernetes (ACK) - Alibaba Cloud

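For applications that start slowly (such as Java applications), the default scale-to-zero policy of open source Knative can cause high cold start latency. ACK Knative provides the reserved instance feature, which keeps low-cost instances resident so that requests get a fast response. This reduces the impact of cold starts while keeping resource costs under control.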

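How it works

To save costs, open source Knative scales a Service down to zero instances when there is no traffic. When a new request arrives, it must wait for resource scheduling, image pulling, and application startup. This process is called a cold start and can add significant latency to the first request.

To address this, ACK Knative keeps one or more low-specification reserved instances running while there is no traffic. The workflow is as follows.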

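No traffic: the Service's pods are scaled in gradually, but at least one reserved instance stays online to keep the Service responsive.
First request and scale-out trigger:
When the first request arrives, two operations are triggered in parallel:
- Immediate service: the request is routed straight to the online reserved instance, which avoids cold start latency.
- Scale-out: Knative immediately starts creating standard-specification instances.
Traffic switchover: once the first standard-specification instance is ready, subsequent new requests are forwarded to the standard-specification instances automatically.
Resource reclamation: the reserved instance goes offline automatically after it finishes processing the initial requests it received.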

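Usage

After Knative is deployed in the cluster, configure reserved instances by adding the following annotations to a Knative Service.

knative.aliyun.com/reserve-instance: set to enable to turn on the reserved instance feature.
knative.aliyun.com/reserve-instance-type: specifies the resource type of the reserved instance, for example eci. Valid values: eci (default), ecs, and acs.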
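As a minimal sketch of how these two annotations fit into a manifest, the following Service sets both of them under spec.template.metadata.annotations, which is where the examples in this document place them. The Service name hello-reserve is a hypothetical placeholder, and the image is reused from the examples in this document.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve   # hypothetical name, used only for illustration
spec:
  template:
    metadata:
      annotations:
        # Turn on the reserved instance feature
        knative.aliyun.com/reserve-instance: enable
        # Resource type of the reserved instance; eci is the default, so this line is optional
        knative.aliyun.com/reserve-instance-type: eci
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

Because eci is the default type, the second annotation is there only to make the choice explicit. Replace its value with ecs or acs to switch resource types.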

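Configure ECI-type reserved instances

To use ECI computing power, the ACK Virtual Node component must be installed. For the installation entry point, see Components.

Specify ECI instance types

To use specific instance types, specify them with knative.aliyun.com/reserve-instance-eci-use-specs.

The following example specifies the ecs.t6-c1m1.large and ecs.t5-lc1m2.small instance types.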

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-1
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

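Specify CPU and memory

If you are unsure which instance type to use, you can define the required CPU and memory directly.

The following example specifies 1 core and 2 GiB of memory, written as "1-2Gi".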

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-spec-2
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8

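Configure ACS-type reserved instances

Set knative.aliyun.com/reserve-instance-type: acs to use ACS-type reserved instances.

To use ACS computing power, the ACK Virtual Node component must be installed. For the installation entry point, see Components.

Specify the compute class and compute QoS

The following is a basic ACS reserved instance configuration. It optionally specifies the compute class (compute-class) and the compute QoS (compute-qos).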

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # (Optional) Compute class of the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-class: "general-purpose"
        # (Optional) Compute QoS of the ACS pod
        knative.aliyun.com/reserve-instance-acs-compute-qos: "default"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

Specify CPU and memory

The following example requests 1 core and 2 GiB of memory for the reserved instance.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

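Configure ECS-type reserved instances

You can give the reserved instance an ECS instance type that is smaller than the standard specification to reduce long-running costs.

GPU

The following example configures a low-specification GPU instance as the reserved instance for a GPU inference service.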

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Enable and configure an ECS-type reserved instance; one or more instance types can be specified
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        resources:
          # Resource configuration of the standard instances
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset

CPU

The following example specifies 1 core and 2 GiB of memory, with requests and limits set to the same values.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-resource
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: ecs
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
        env:
        - name: TARGET
          value: "Knative"

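Configure a reserved instance pool

To handle larger bursts of traffic, you can expand the single reserved instance into a pool. Use knative.aliyun.com/reserve-instance-replicas to specify the number of reserved instance replicas.

The following example creates a reserved pool of three low-specification instances.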

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-replicas: "3"
        knative.aliyun.com/reserve-instance-eci-use-specs: "ecs.t6-c1m1.large,ecs.t5-lc1m2.small"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
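The pool example above uses the default ECI type. As a hedged sketch, a pool on ACS might combine the replicas annotation with the ACS annotations shown earlier in this document. This assumes that knative.aliyun.com/reserve-instance-replicas behaves the same way for every resource type, which this document does not state explicitly. The Service name and the replica count are hypothetical.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-pool-acs   # hypothetical name, used only for illustration
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        knative.aliyun.com/reserve-instance-type: acs
        # Assumption: the replicas annotation also applies to ACS-type reserved instances
        knative.aliyun.com/reserve-instance-replicas: "2"
        # Low-specification resources for each reserved instance in the pool
        knative.aliyun.com/reserve-instance-cpu-resource-request: "1"
        knative.aliyun.com/reserve-instance-memory-resource-request: "2Gi"
    spec:
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:73fbdd56
```

Check the combination in a test cluster before relying on it, for example by confirming that the expected number of reserved pods stays running after traffic stops.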

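Recommendations for production environments

Choose the reserved instance specification carefully. Use the lowest viable configuration that can run the application stably and process at least one request.
If traffic can spike when the Service goes from zero to one request, use a reserved instance pool to absorb more of the initial traffic.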
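The two recommendations can be combined in one manifest. The sketch below pairs a small CPU and memory specification with a pool of reserved instances, using only annotations that appear earlier in this document. The Service name, the "1-2Gi" size, and the replica count of 3 are illustrative assumptions, not sizing guidance. Pick values based on what the application needs to serve a single request reliably.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello-reserve-minimal   # hypothetical name, used only for illustration
spec:
  template:
    metadata:
      annotations:
        knative.aliyun.com/reserve-instance: enable
        # Lowest viable size for this sample app (assumed): 1 core, 2 GiB
        knative.aliyun.com/reserve-instance-eci-use-specs: "1-2Gi"
        # Pool of reserved instances to absorb a burst when traffic resumes
        knative.aliyun.com/reserve-instance-replicas: "3"
    spec:
      containers:
        - image: registry.cn-hangzhou.aliyuncs.com/knative-sample/helloworld-go:160e4dc8
```

A larger pool absorbs more of the initial burst but raises the always-on cost, since every reserved instance in the pool runs continuously.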

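Billing

Reserved instances run continuously and incur charges. For billing rules, see the billing documentation of the resource type that the reserved instances use.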
