Deploy a production-grade Stable Diffusion service with Knative

When you deploy Stable Diffusion with Knative in an ACK cluster, you can cap the number of concurrent requests that a single pod processes, based on the pod's throughput. This keeps the service stable under load. Knative can also automatically scale pods down to zero when there is no traffic, which reduces GPU resource costs.

Prerequisites

You have created an ACK cluster that includes GPU nodes. Your cluster must run Kubernetes 1.24 or later. For more information, see Create an ACK managed cluster.

We recommend the following instance types: ecs.gn5-c4g1.xlarge, ecs.gn5i-c8g1.2xlarge, or ecs.gn5-c8g1.2xlarge.
You have deployed Knative in the cluster. For more information, see Deploy and manage Knative components.

How it works

Important

You must comply with the user agreements, usage specifications, and applicable laws and regulations of the third-party Stable Diffusion model. Alibaba Cloud does not guarantee the legality, security, or accuracy of the Stable Diffusion model and assumes no liability for any damages arising from its use.

Stable Diffusion generates target scenes and images quickly and accurately. However, production environments typically have the following requirements:

A single pod has limited throughput, and handling multiple concurrent requests can overload the pod. Therefore, you need to precisely control the number of concurrent requests per pod.
GPU resources are expensive. You need to allocate GPU resources on demand and promptly release them during off-peak hours.

ACK Knative supports precise scheduling based on the number of concurrent requests and implements autoscaling to build a production-grade Stable Diffusion service. The following figure illustrates this process.
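
As a rough illustration of the per-pod limit described above, the following Python sketch models a pod that admits only a fixed number of requests at a time and makes the rest wait. It is a toy model under stated assumptions, not how Knative is implemented: Knative enforces the limit outside the application container, and the names ConcurrencyLimiter and simulate are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Admit at most `limit` requests at a time; the rest wait their turn."""

    def __init__(self, limit=1):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak_in_flight = 0

    def handle(self, work):
        with self._slots:  # blocks while the pod is at its limit
            with self._lock:
                self._in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
            try:
                return work()
            finally:
                with self._lock:
                    self._in_flight -= 1


def simulate(request_count=5, limit=1):
    """Send `request_count` simultaneous calls and return the peak in-flight count."""
    limiter = ConcurrencyLimiter(limit)
    with ThreadPoolExecutor(max_workers=request_count) as pool:
        futures = [
            pool.submit(limiter.handle, lambda: time.sleep(0.01))
            for _ in range(request_count)
        ]
        for future in futures:
            future.result()
    return limiter.peak_in_flight
```

With a limit of 1, five simultaneous callers are served one after another, so the peak in-flight count never exceeds 1. In the real service, the autoscaler reacts to the waiting requests by adding pods rather than letting the queue grow.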

Step 1: Deploy the Stable Diffusion service

Important

You must ensure that the Stable Diffusion service is deployed on GPU nodes. Otherwise, the service will not work.

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Knative.

Deploy the Stable Diffusion service.

ACK Knative provides popular application templates. You can use an application template to deploy quickly or deploy the service by using YAML.

Application template

Click the Popular Apps tab and click Deploy on the Stable Diffusion service card.

After you click Deploy, click the Services tab to view the deployment progress in the service list. The service is deployed when the Status column shows Success.

YAML

Click the Services tab, select default from the Namespace drop-down list, and then click Create from Template. Paste the following YAML sample into the template, and then click Create to create a service named stable-diffusion.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: stable-diffusion
  annotations:
    serving.knative.dev.alibabacloud/affinity: "cookie"
    serving.knative.dev.alibabacloud/cookie-name: "sd"
    serving.knative.dev.alibabacloud/cookie-timeout: "1800"
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/maxScale: '10'
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        k8s.aliyun.com/eci-use-specs: ecs.gn5-c4g1.xlarge,ecs.gn5i-c8g1.2xlarge,ecs.gn5-c8g1.2xlarge  
    spec:
      containerConcurrency: 1
      containers:
      - args:
        - --listen
        - --skip-torch-cuda-test
        - --api
        command:
        - python3
        - launch.py
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion@sha256:62b3228f4b02d9e89e221abe6f1731498a894b042925ab8d4326a571b3e992bc
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7860
          name: http1
          protocol: TCP
        name: stable-diffusion
        readinessProbe:
          tcpSocket:
            port: 7860
          initialDelaySeconds: 5
          periodSeconds: 1
          failureThreshold: 3

On the Services page, the default domain name for the stable-diffusion service is stable-diffusion.default.example.com.

Step 2: Access the service

On the Services tab, obtain the Gateway and Default Domain of the service.
Note
If the gateway type is ALB Gateway, you can use the curl command to access the service address. The format is as follows:
```
curl -H "Host: stable-diffusion.default.example.com" http://alb-XXX.cn-hangzhou.alb.aliyuncsslb.com # Replace the placeholder with your ALB Gateway address.
```
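To access the service directly by using its domain name, configure a CNAME record that points the domain name to the ALB instance.

If the gateway exposes an IP address, you can instead map the domain name to that address in your local hosts file. For example: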

47.xx.xxx.xx stable-diffusion.default.example.com # Replace the placeholder with the gateway IP address.

After you configure the host mapping, on the Services tab, click the default domain name of the stable-diffusion service.

You can now access Stable Diffusion directly using its domain name.

If the access is successful, your browser displays the txt2img page of the Stable Diffusion web UI, and the address bar shows the Knative service domain name. In the positive prompt input box, enter text, such as cat, and then click Generate. If the corresponding AI image is generated, the Stable Diffusion inference service deployed by Knative is running correctly.
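
You can also script the same check against the HTTP API that the --api argument enables. The following Python sketch is an illustration only. It assumes that the /sdapi/v1/txt2img endpoint (the one used in the load test in Step 3) returns JSON with a list of base64-encoded images in an images field, and that you reach the service through the gateway with a Host header, as in the curl example. The function names and the gateway URL are hypothetical.

```python
import base64
import json
import urllib.request

# Default domain name shown on the Services page.
SERVICE_HOST = "stable-diffusion.default.example.com"


def build_txt2img_request(gateway_url, prompt, host=SERVICE_HOST):
    """Build a POST request for /sdapi/v1/txt2img routed through the gateway."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        gateway_url.rstrip("/") + "/sdapi/v1/txt2img",
        data=body,
        headers={"Content-Type": "application/json", "Host": host},
        method="POST",
    )


def decode_images(response_body):
    """Decode the base64 images from the JSON response (assumed 'images' field)."""
    payload = json.loads(response_body)
    return [base64.b64decode(item) for item in payload.get("images", [])]


def generate(gateway_url, prompt, timeout=180):
    """Send the request and return the generated images as raw bytes."""
    request = build_txt2img_request(gateway_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_images(response.read())
```

For example, generate("http://<your-gateway-address>", "cat") would return a list of image byte strings that you can write to files. The first request after the service has scaled to zero can take considerably longer, because a pod must start before it can respond.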

Step 3: Verify request-based autoscaling

Use the Hey load testing tool to run a stress test.

Note

For more information about the Hey load testing tool, see Hey.

hey -n 50 -c 5 -t 180 -m POST -H "Content-Type: application/json" -d '{"prompt": "pretty dog"}' http://stable-diffusion.default.example.com/sdapi/v1/txt2img

This command sends 50 requests with a concurrency level of 5 and a per-request timeout of 180 seconds.

Expected output:

Summary:
  Total:    252.1749 secs
  Slowest:    62.4155 secs
  Fastest:    9.9399 secs
  Average:    23.9748 secs
  Requests/sec:    0.1983
Response time histogram:
  9.940 [1]    |■■
  15.187 [17]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  20.435 [9]    |■■■■■■■■■■■■■■■■■■■■■
  25.683 [11]    |■■■■■■■■■■■■■■■■■■■■■■■■■■
  30.930 [1]    |■■
  36.178 [1]    |■■
  41.425 [3]    |■■■■■■■
  46.673 [1]    |■■
  51.920 [2]    |■■■■■
  57.168 [1]    |■■
  62.415 [3]    |■■■■■■■
Latency distribution:
  10% in 10.4695 secs
  25% in 14.8245 secs
  50% in 20.0772 secs
  75% in 30.5207 secs
  90% in 50.7006 secs
  95% in 61.5010 secs
  0% in 0.0000 secs
Details (average, fastest, slowest):
  DNS+dialup:    0.0424 secs, 9.9399 secs, 62.4155 secs
  DNS-lookup:    0.0385 secs, 0.0000 secs, 0.3855 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0004 secs
  resp wait:    23.8850 secs, 9.9089 secs, 62.3562 secs
  resp read:    0.0471 secs, 0.0166 secs, 0.1834 secs
Status code distribution:
  [200]    50 responses

The output shows that 50 requests were sent, and the success rate is 100%.
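
The summary figures can be cross-checked with a few lines of arithmetic. The following Python sketch uses only numbers copied from the output above. The wall-clock estimate assumes that each of the five workers sends its requests back to back, so it is an approximation rather than an exact model of the load tester.

```python
# Values copied from the hey summary above.
total_seconds = 252.1749    # "Total"
average_latency = 23.9748   # "Average"
request_count = 50          # -n 50
concurrency = 5             # -c 5
histogram_counts = [1, 17, 9, 11, 1, 1, 3, 1, 2, 1, 3]

# Throughput is completed requests divided by wall-clock time.
requests_per_second = request_count / total_seconds

# With 5 workers sending requests back to back, wall-clock time is roughly
# (requests * average latency) / concurrency.
estimated_total = request_count * average_latency / concurrency

print(f"{requests_per_second:.4f} requests/sec")
print(f"{estimated_total:.1f} s estimated total")
print(f"{sum(histogram_counts)} requests in the histogram")
```

The computed throughput matches the reported Requests/sec value, the histogram buckets add up to all 50 requests, and the estimated total of about 240 seconds is close to the measured 252 seconds.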

Run the following command to observe pod autoscaling in real time:
```
watch -n 1 'kubectl get po'
```
During the stress test, the number of pods automatically scales out to five, and all pods are in the Running state. This is because hey keeps five requests in flight (-c 5), while each pod accepts only one request at a time (containerConcurrency: 1).
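
The scale-out to five pods follows from simple arithmetic on the values in the YAML sample. The following Python sketch is a deliberately simplified model and not the autoscaler's actual code: the real Knative Pod Autoscaler averages observed concurrency over time windows and waits before scaling to zero. The function name desired_pods is hypothetical, and its defaults mirror containerConcurrency, targetUtilizationPercentage, and maxScale from Step 1.

```python
import math


def desired_pods(in_flight, container_concurrency=1,
                 target_utilization_pct=100, max_scale=10):
    """Estimate the pod count for a given number of in-flight requests."""
    if in_flight <= 0:
        return 0  # no traffic: the service can scale to zero
    # Concurrency that each pod is expected to carry.
    per_pod_target = container_concurrency * target_utilization_pct / 100
    # Round up so that no request is left without capacity, then apply the cap.
    return min(max_scale, math.ceil(in_flight / per_pod_target))


print(desired_pods(5))    # the load test in this step: 5 requests in flight
print(desired_pods(25))   # more traffic than maxScale allows
print(desired_pods(0))    # idle service
```

Under this model, five in-flight requests map to five pods, a burst of 25 is capped at the maxScale value of 10, and an idle service needs no pods.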

Step 4: View service monitoring data

Knative provides out-of-the-box observability. On the Knative page, click the Monitoring Dashboards tab to view monitoring data for the Stable Diffusion service. For more information on enabling and using the Knative Monitoring Dashboard, see View the Knative Monitoring Dashboard.