When you deploy Stable Diffusion in an ACK cluster with Knative, you can precisely control the number of concurrent requests that a single pod processes, based on the pod's throughput. This keeps the service stable. Knative can also automatically scale pods down to zero when there is no traffic, which reduces GPU resource costs.
Prerequisites
- You have created an ACK cluster that includes GPU nodes. Your cluster must run Kubernetes 1.24 or later. For more information, see Create an ACK managed cluster.
  We recommend the following instance types: ecs.gn5-c4g1.xlarge, ecs.gn5i-c8g1.2xlarge, or ecs.gn5-c8g1.2xlarge.
- You have deployed Knative in the cluster. For more information, see Deploy and manage Knative components.
How it works
You must comply with the user agreements, usage specifications, and applicable laws and regulations of the third-party Stable Diffusion model. Alibaba Cloud does not guarantee the legality, security, or accuracy of the Stable Diffusion model and assumes no liability for any damages arising from its use.
Stable Diffusion generates target scenes and images quickly and accurately. However, production environments typically have the following requirements:
- A single pod has limited throughput, and handling multiple concurrent requests can overload the pod. Therefore, you need to precisely control the number of concurrent requests per pod.
- GPU resources are expensive. You need to allocate GPU resources on demand and promptly release them during off-peak hours.
ACK Knative supports precise scheduling based on the number of concurrent requests and implements autoscaling to build a production-grade Stable Diffusion service. The following figure illustrates this process.
Step 1: Deploy the Stable Diffusion service
You must ensure that the Stable Diffusion service is deployed on GPU nodes. Otherwise, the service will not work.
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, click the name of your cluster. In the left navigation pane, go to the Knative page.
- Deploy the Stable Diffusion service.
ACK Knative provides popular application templates. You can use an application template to deploy quickly or deploy the service by using YAML.
Application template
Click the Popular Apps tab and click Deploy on the Stable Diffusion service card.
After the deployment is complete, click the Services tab to view the deployment progress in the service list. The service is deployed when the Status shows Success.
YAML
Click the Services tab, select default from the Namespace drop-down list, and then click Create from Template. Paste the following YAML sample into the template, and then click Create to create a service named stable-diffusion.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: stable-diffusion
  annotations:
    serving.knative.dev.alibabacloud/affinity: "cookie"
    serving.knative.dev.alibabacloud/cookie-name: "sd"
    serving.knative.dev.alibabacloud/cookie-timeout: "1800"
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/maxScale: '10'
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        k8s.aliyun.com/eci-use-specs: ecs.gn5-c4g1.xlarge,ecs.gn5i-c8g1.2xlarge,ecs.gn5-c8g1.2xlarge
    spec:
      containerConcurrency: 1
      containers:
      - args:
        - --listen
        - --skip-torch-cuda-test
        - --api
        command:
        - python3
        - launch.py
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion@sha256:62b3228f4b02d9e89e221abe6f1731498a894b042925ab8d4326a571b3e992bc
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7860
          name: http1
          protocol: TCP
        name: stable-diffusion
        readinessProbe:
          tcpSocket:
            port: 7860
          initialDelaySeconds: 5
          periodSeconds: 1
          failureThreshold: 3

On the Services page, the default domain name for the stable-diffusion service is stable-diffusion.default.example.com.
Step 2: Access the service
- On the Services tab, obtain the Gateway and Default Domain of the service.
  Note: If the gateway type is ALB Gateway, you can use the curl command to access the service address. The format is as follows:
  curl -H "Host: stable-diffusion.default.example.com" http://alb-XXX.cn-hangzhou.alb.aliyuncsslb.com # Replace the placeholder with your ALB Gateway address.
  To access the service directly by using its domain name, you can configure a CNAME record that points the domain name to the ALB instance.
- Map the gateway IP address to the service domain name in your hosts file. For example:
  47.xx.xxx.xx stable-diffusion.default.example.com # Replace the placeholder with the gateway IP address.
- After you configure the host mapping, on the Services tab, click the default domain name of the stable-diffusion service.
You can now access Stable Diffusion directly using its domain name.
If the access is successful, your browser displays the txt2img page of the Stable Diffusion web UI, and the address bar shows the Knative service domain name. In the positive prompt input box, enter text, such as cat, and then click Generate. If the corresponding AI image is generated, the Stable Diffusion inference service deployed by Knative is running correctly.
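The same check can be scripted against the HTTP API instead of the web UI. The following is a minimal sketch, not part of the original procedure. It assumes that the service exposes the /sdapi/v1/txt2img endpoint (the image is started with --api) and that the JSON response carries base64-encoded images in an "images" field, as the upstream Stable Diffusion web UI API does. The names BASE_URL, HOST_HEADER, build_request, decode_images, and generate are hypothetical helpers introduced for illustration.

```python
import base64
import json
import urllib.request

# Assumption: the service is reachable at this address. If DNS or a hosts
# entry is not configured, put the gateway address here and set HOST_HEADER
# to the service domain name, mirroring the curl -H "Host: ..." form.
BASE_URL = "http://stable-diffusion.default.example.com"
HOST_HEADER = None


def build_request(prompt, base_url=BASE_URL, host_header=HOST_HEADER):
    """Build a POST request for the txt2img endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if host_header:
        headers["Host"] = host_header
    return urllib.request.Request(
        base_url + "/sdapi/v1/txt2img", data=body, headers=headers, method="POST"
    )


def decode_images(response_json):
    """Decode the base64 strings in the assumed "images" field to raw bytes."""
    return [base64.b64decode(img) for img in response_json.get("images", [])]


def generate(prompt, timeout=180):
    """Send the prompt to the service and return the images as raw bytes."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return decode_images(json.load(resp))
```

For example, generate("cat") would return a list of image byte strings that you can write to files. A long timeout is used because a cold start from zero pods can add to the first request's latency.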
Step 3: Verify request-based autoscaling
- Use the Hey load testing tool to run a stress test.
  Note: For more information about the Hey load testing tool, see Hey.
  hey -n 50 -c 5 -t 180 -m POST -H "Content-Type: application/json" -d '{"prompt": "pretty dog"}' http://stable-diffusion.default.example.com/sdapi/v1/txt2img
  This command sends 50 requests with a concurrency level of 5 and a timeout of 180 seconds.
  Expected output:

  Summary:
    Total: 252.1749 secs
    Slowest: 62.4155 secs
    Fastest: 9.9399 secs
    Average: 23.9748 secs
    Requests/sec: 0.1983

  Response time histogram:
    9.940 [1] |■■
    15.187 [17] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
    20.435 [9] |■■■■■■■■■■■■■■■■■■■■■
    25.683 [11] |■■■■■■■■■■■■■■■■■■■■■■■■■■
    30.930 [1] |■■
    36.178 [1] |■■
    41.425 [3] |■■■■■■■
    46.673 [1] |■■
    51.920 [2] |■■■■■
    57.168 [1] |■■
    62.415 [3] |■■■■■■■

  Latency distribution:
    10% in 10.4695 secs
    25% in 14.8245 secs
    50% in 20.0772 secs
    75% in 30.5207 secs
    90% in 50.7006 secs
    95% in 61.5010 secs
    0% in 0.0000 secs

  Details (average, fastest, slowest):
    DNS+dialup: 0.0424 secs, 9.9399 secs, 62.4155 secs
    DNS-lookup: 0.0385 secs, 0.0000 secs, 0.3855 secs
    req write: 0.0000 secs, 0.0000 secs, 0.0004 secs
    resp wait: 23.8850 secs, 9.9089 secs, 62.3562 secs
    resp read: 0.0471 secs, 0.0166 secs, 0.1834 secs

  Status code distribution:
    [200] 50 responses

  The output shows that 50 requests were sent, and the success rate is 100%.
- Run the following command to observe pod autoscaling in real time:
  watch -n 1 'kubectl get po'
  During the stress test, the pod count automatically scales to five, and the status of all pods is Running. This is because the maximum concurrency for a single pod is set to 1 (containerConcurrency: 1) and the stress test sends 5 concurrent requests.
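The pod count of five can be reproduced with simple arithmetic. The following sketch is a simplified model and not the actual Knative autoscaler, which averages observed concurrency over time windows. It assumes that each pod targets containerConcurrency multiplied by the target utilization percentage, and that the result is capped at maxScale. It also uses Little's law on the Hey summary values to estimate the average number of in-flight requests. The function names are hypothetical.

```python
import math


def desired_pods(concurrent_requests, container_concurrency,
                 target_utilization_pct, max_scale):
    """Estimate the pod count for a steady level of in-flight requests."""
    # Per-pod target concurrency, e.g. 1 * 100 / 100 = 1 for this service.
    per_pod_target = container_concurrency * target_utilization_pct / 100
    wanted = math.ceil(concurrent_requests / per_pod_target)
    # The maxScale annotation caps the number of pods.
    return min(wanted, max_scale)


def average_in_flight(total_requests, total_seconds, average_latency_seconds):
    """Little's law: in-flight requests = throughput x average latency."""
    throughput = total_requests / total_seconds
    return throughput * average_latency_seconds


# Values from the service YAML and the hey command (-c 5).
print(desired_pods(5, 1, 100, 10))  # prints 5

# Values from the Hey summary: 50 requests, 252.1749 s total, 23.9748 s average.
print(round(average_in_flight(50, 252.1749, 23.9748), 2))
```

The second value comes out at about 4.75, slightly below the configured concurrency of 5, which is consistent with the load tapering off as the last requests finish.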
Step 4: View service monitoring data
Knative provides out-of-the-box observability. On the Knative page, click the Monitoring Dashboards tab to view monitoring data for the Stable Diffusion service. For more information on enabling and using the Knative Monitoring Dashboard, see View the Knative Monitoring Dashboard.
Related documentation
For configuration suggestions on deploying AI model inference services with Knative, see Best practices for deploying AI model inference services in Knative.