Deploy a production-grade Stable Diffusion service with Knative


When you deploy Stable Diffusion in an ACK cluster with Knative, you can cap the number of concurrent requests that a single pod processes to match the pod's throughput, which keeps the service stable under load. Knative can also automatically scale pods down to zero when there is no traffic, which reduces GPU resource costs.

Prerequisites

  • You have created an ACK cluster that includes GPU nodes. Your cluster must run Kubernetes 1.24 or later. For more information, see Create an ACK managed cluster.

    We recommend the following instance types: ecs.gn5-c4g1.xlarge, ecs.gn5i-c8g1.2xlarge, or ecs.gn5-c8g1.2xlarge.

  • You have deployed Knative in the cluster. For more information, see Deploy and manage Knative components.

How it works

Important

You must comply with the user agreements, usage specifications, and applicable laws and regulations of the third-party Stable Diffusion model. Alibaba Cloud does not guarantee the legality, security, or accuracy of the Stable Diffusion model and assumes no liability for any damages arising from its use.

Stable Diffusion generates target scenes and images quickly and accurately. However, production environments typically have the following requirements:

  • A single pod has limited throughput, and handling multiple concurrent requests can overload the pod. Therefore, you need to precisely control the number of concurrent requests per pod.

  • GPU resources are expensive. You need to allocate GPU resources on demand and promptly release them during off-peak hours.

ACK Knative schedules requests precisely based on the number of concurrent requests and scales pods automatically, which lets you build a production-grade Stable Diffusion service. The following figure illustrates this process.
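The scaling rule can also be approximated numerically. The following Python sketch is a simplified model, not the Knative autoscaler itself: the real Knative Pod Autoscaler (KPA) averages observed concurrency over time windows before it acts. The function name desired_pods and its parameters are hypothetical; the defaults mirror the containerConcurrency, targetUtilizationPercentage, and maxScale settings used in the sample Service in Step 1.

```python
import math


def desired_pods(in_flight: int,
                 container_concurrency: int = 1,
                 target_utilization_pct: int = 100,
                 max_scale: int = 10) -> int:
    """Estimate the pod count for a given number of in-flight requests."""
    if in_flight <= 0:
        return 0  # no traffic: scale to zero and release the GPUs
    # Concurrent requests each pod should carry before more pods are added.
    per_pod_target = container_concurrency * target_utilization_pct / 100
    # Never exceed the configured upper bound (assumed to model maxScale).
    return min(max_scale, math.ceil(in_flight / per_pod_target))


print(desired_pods(5))   # 5 concurrent requests, one per pod
print(desired_pods(0))   # idle service
print(desired_pods(25))  # demand above maxScale is capped
```

With the defaults, 5 in-flight requests map to 5 pods, an idle service maps to 0 pods, and 25 in-flight requests are capped at 10 pods.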

Step 1: Deploy the Stable Diffusion service

Important

You must ensure that the Stable Diffusion service is deployed on GPU nodes. Otherwise, the service will not work.

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Knative.

  3. Deploy the Stable Diffusion service.

    ACK Knative provides popular application templates. You can use an application template to deploy quickly or deploy the service by using YAML.

    Application template

    Click the Popular Apps tab and click Deploy on the Stable Diffusion service card.

    After the deployment is complete, click the Services tab to view the deployment progress in the service list. The service is deployed when the Status shows Success.

    YAML

    Click the Services tab, select default from the Namespace drop-down list, and then click Create from Template. Paste the following YAML sample into the template, and then click Create to create a service named stable-diffusion.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: stable-diffusion
      annotations:
        serving.knative.dev.alibabacloud/affinity: "cookie"
        serving.knative.dev.alibabacloud/cookie-name: "sd"
        serving.knative.dev.alibabacloud/cookie-timeout: "1800"
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
            autoscaling.knative.dev/maxScale: '10'
            autoscaling.knative.dev/targetUtilizationPercentage: "100"
            k8s.aliyun.com/eci-use-specs: ecs.gn5-c4g1.xlarge,ecs.gn5i-c8g1.2xlarge,ecs.gn5-c8g1.2xlarge  
        spec:
          containerConcurrency: 1
          containers:
          - args:
            - --listen
            - --skip-torch-cuda-test
            - --api
            command:
            - python3
            - launch.py
            image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion@sha256:62b3228f4b02d9e89e221abe6f1731498a894b042925ab8d4326a571b3e992bc
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 7860
              name: http1
              protocol: TCP
            name: stable-diffusion
            readinessProbe:
              tcpSocket:
                port: 7860
              initialDelaySeconds: 5
              periodSeconds: 1
              failureThreshold: 3

    On the Services page, the default domain name for the stable-diffusion service is stable-diffusion.default.example.com.
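The readinessProbe in the sample uses a tcpSocket check, which only confirms that port 7860 accepts TCP connections. As an illustration of what such a check does, here is a minimal Python sketch. It is not how the kubelet is implemented, and the helper name tcp_probe is hypothetical.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # A successful connect is all a TCP-style check verifies;
        # no HTTP request is sent and no response is inspected.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat as not ready.
        return False
```

For example, tcp_probe("127.0.0.1", 7860) returns True once the web UI is listening, assuming that you have made the pod port reachable locally (for instance, with kubectl port-forward).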

Step 2: Access the service

  1. On the Services tab, obtain the Gateway and Default Domain of the service.

    Note

    If the gateway type is ALB Gateway, you can use the curl command to access the service address. The format is as follows:

    curl -H "Host: stable-diffusion.default.example.com" http://alb-XXX.cn-hangzhou.alb.aliyuncsslb.com # Replace the placeholder with your ALB Gateway address.

    To access the service directly by using its domain name, you can configure a CNAME record for the ALB instance.

  2. Add a mapping between the gateway IP address and the default domain name to the hosts file on your machine. For example:

    47.xx.xxx.xx stable-diffusion.default.example.com # Replace the placeholder with the gateway IP address.

  3. After you configure the host mapping, on the Services tab, click the default domain name of the stable-diffusion service.

    You can now access Stable Diffusion directly using its domain name.

    If the access is successful, your browser displays the txt2img page of the Stable Diffusion web UI, and the address bar shows the Knative service domain name. In the positive prompt input box, enter text, such as cat, and then click Generate. If the corresponding AI image is generated, the Stable Diffusion inference service deployed by Knative is running correctly.
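Besides the web UI, the service can be called programmatically, because the sample Service starts the web UI with the --api flag. The following Python sketch builds a request to the /sdapi/v1/txt2img endpoint and sets the Host header in the same way as the curl example above. It assumes that the response is JSON with an images list of base64-encoded images, as in the upstream Stable Diffusion web UI API. The function names are hypothetical, and the gateway address is a placeholder that you must supply.

```python
import base64
import json
import urllib.request

# Default domain shown on the Services tab.
DOMAIN = "stable-diffusion.default.example.com"


def build_txt2img_request(gateway_url: str, prompt: str,
                          host: str = DOMAIN) -> urllib.request.Request:
    """Build a POST request for the txt2img endpoint, routed by Host header."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        gateway_url.rstrip("/") + "/sdapi/v1/txt2img",
        data=body,
        headers={"Host": host, "Content-Type": "application/json"},
    )


def decode_images(response_json: dict) -> list:
    """Decode the base64 image strings (assumed response shape) into bytes."""
    return [base64.b64decode(item) for item in response_json.get("images", [])]


def generate(gateway_url: str, prompt: str, timeout: float = 180.0) -> list:
    """Send the request and return the generated images as raw bytes."""
    request = build_txt2img_request(gateway_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_images(json.load(response))
```

Calling generate with your gateway address and a prompt such as "cat" returns a list of byte strings, one per generated image, that you can write to image files. The generous default timeout reflects that a single generation can take tens of seconds, especially when a pod has to start first.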

Step 3: Verify request-based autoscaling

  1. Use the Hey load testing tool to run a stress test.

    Note

    For more information about the Hey load testing tool, see Hey.

    hey -n 50 -c 5 -t 180 -m POST -H "Content-Type: application/json" -d '{"prompt": "pretty dog"}' http://stable-diffusion.default.example.com/sdapi/v1/txt2img

    This command sends 50 requests with a concurrency level of 5 and a timeout of 180 seconds.

    Expected output:

    Summary:
      Total:    252.1749 secs
      Slowest:    62.4155 secs
      Fastest:    9.9399 secs
      Average:    23.9748 secs
      Requests/sec:    0.1983
    Response time histogram:
      9.940 [1]    |■■
      15.187 [17]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
      20.435 [9]    |■■■■■■■■■■■■■■■■■■■■■
      25.683 [11]    |■■■■■■■■■■■■■■■■■■■■■■■■■■
      30.930 [1]    |■■
      36.178 [1]    |■■
      41.425 [3]    |■■■■■■■
      46.673 [1]    |■■
      51.920 [2]    |■■■■■
      57.168 [1]    |■■
      62.415 [3]    |■■■■■■■
    Latency distribution:
      10% in 10.4695 secs
      25% in 14.8245 secs
      50% in 20.0772 secs
      75% in 30.5207 secs
      90% in 50.7006 secs
      95% in 61.5010 secs
      0% in 0.0000 secs
    Details (average, fastest, slowest):
      DNS+dialup:    0.0424 secs, 9.9399 secs, 62.4155 secs
      DNS-lookup:    0.0385 secs, 0.0000 secs, 0.3855 secs
      req write:    0.0000 secs, 0.0000 secs, 0.0004 secs
      resp wait:    23.8850 secs, 9.9089 secs, 62.3562 secs
      resp read:    0.0471 secs, 0.0166 secs, 0.1834 secs
    Status code distribution:
      [200]    50 responses

    The output shows that all 50 requests returned HTTP 200, a success rate of 100%.

  2. Run the following command to observe pod autoscaling in real time:

    watch -n 1 'kubectl get po'

    During the stress test, the service scales out to five pods, and all of them reach the Running state.

    Each pod accepts at most one request at a time (containerConcurrency: 1), so the five concurrent requests sent by hey require five pods.
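The hey summary is consistent with five pods that each serve one request at a time. As a back-of-the-envelope check (a sketch that ignores pod startup time and scheduling delays), the figures from the expected output above can be recomputed in Python as follows:

```python
# Figures copied from the hey summary in the expected output.
requests_sent = 50
concurrency = 5            # hey -c 5, and also the observed pod count
total_secs = 252.1749
average_secs = 23.9748

# Throughput reported by hey is requests divided by wall-clock time.
throughput = requests_sent / total_secs

# With 5 requests always in flight, the run should take roughly
# (requests * average latency) / concurrency.
estimated_total = requests_sent * average_secs / concurrency

print(f"throughput: {throughput:.4f} requests/sec")
print(f"estimated total: {estimated_total:.1f} secs (measured {total_secs:.1f})")
```

The recomputed throughput matches the reported 0.1983 requests/sec, and the estimated duration of about 240 seconds is close to the measured 252 seconds. The gap is plausibly time spent waiting for new pods to become ready, although the output alone does not confirm that.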

Step 4: View service monitoring data

Knative provides out-of-the-box observability. On the Knative page, click the Monitoring Dashboards tab to view monitoring data for the Stable Diffusion service. For more information on enabling and using the Knative Monitoring Dashboard, see View the Knative Monitoring Dashboard.

Related documentation

For configuration suggestions on deploying AI model inference services with Knative, see Best practices for deploying AI model inference services in Knative.