The wan2.2-s2v digital human model generates natural-looking speaking, singing, or performing videos from a single image and an audio file. It supports any aspect ratio and works with portrait, full-body, or half-body images.
This document applies only to the China (Beijing) region. Use an API key from this region.
Model overview
Example output
[Example media: input image, input audio, and the resulting output video]
Model and pricing
| Model name | Description | Price per unit | RPS limit for task submission API | Maximum concurrent tasks | Free quota |
| --- | --- | --- | --- | --- | --- |
| wan2.2-s2v-detect | Checks whether the input image meets requirements, such as definition, single-person composition, and front-facing pose. | CNY 0.004 per image | 5 | No limit for synchronous APIs | 200 images |
| wan2.2-s2v | Generates a dynamic video of a person using a validated image and an audio file. | 480P: CNY 0.5 per second; 720P: CNY 0.9 per second | 5 | 1 | 100 seconds |

Rate limits are shared by the Alibaba Cloud account and its RAM users.
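As a rough illustration of the per-second pricing above, the following Python sketch estimates the cost of one wan2.2-s2v video. The function name `estimate_video_cost` is hypothetical, and the sketch assumes that the charge is simply the video duration multiplied by the listed rate. It ignores the free quota and any rounding rules, which the table does not specify.

```python
from decimal import Decimal

# Per-second prices (CNY) copied from the pricing table above.
PRICE_PER_SECOND = {
    "480P": Decimal("0.5"),
    "720P": Decimal("0.9"),
}


def estimate_video_cost(resolution: str, duration_seconds: int) -> Decimal:
    """Estimate the cost in CNY, assuming cost = duration * per-second rate."""
    if resolution not in PRICE_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return PRICE_PER_SECOND[resolution] * duration_seconds


if __name__ == "__main__":
    print(estimate_video_cost("480P", 20))  # 20 s at CNY 0.5 per second
    print(estimate_video_cost("720P", 20))  # 20 s at CNY 0.9 per second
```

`Decimal` is used instead of `float` so that the prices are represented exactly.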
To generate a digital human video, follow these steps:
- Step 1: Call the wan2.2-s2v-detect API with the image URL to verify compliance.
- Step 2: If the image passes validation, call the asynchronous wan2.2-s2v API with the image URL and audio URL to submit the video generation task. Then poll for the result.
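The control flow of the two steps can be sketched in Python as follows. This is only an outline of the ordering: `detect_image`, `submit_task`, and `wait_for_result` are hypothetical callables that you would supply yourself, and they do not correspond to any SDK function. The sketch assumes that the detection call can be reduced to a pass/fail boolean.

```python
from typing import Callable


def generate_digital_human_video(
    image_url: str,
    audio_url: str,
    detect_image: Callable[[str], bool],
    submit_task: Callable[[str, str], str],
    wait_for_result: Callable[[str], dict],
) -> dict:
    """Run detection first; submit and poll only if the image passes."""
    # Step 1: validate the image before spending a generation call.
    if not detect_image(image_url):
        raise ValueError("image did not pass wan2.2-s2v-detect validation")
    # Step 2: submit the asynchronous task, then poll until it finishes.
    task_id = submit_task(image_url, audio_url)
    return wait_for_result(task_id)
```

Passing the three operations in as parameters keeps the ordering logic separate from the HTTP details, so the gate on the detection result can be exercised with stubs.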
Getting started
Prerequisites
Before calling the API, enable the model service and obtain an API key. Then set the API key as an environment variable.
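A minimal Python sketch of reading the key from the environment is shown here. It assumes the variable is named `DASHSCOPE_API_KEY`, as in the curl samples in this topic. The helper name `auth_headers` is hypothetical.

```python
import os


def auth_headers() -> dict:
    """Build the Authorization header from the DASHSCOPE_API_KEY variable."""
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    return {"Authorization": f"Bearer {api_key}"}
```

Reading the key at call time, rather than hard-coding it, keeps the credential out of source control.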
Sample code
The sample image in this topic has already passed image detection. The following shows sample code for video generation.
The HTTP workflow has two steps: create a task, then retrieve its result. If you are new to HTTP APIs, a tool such as Postman can make these calls easier.
Step 1: Create a task and get the task ID
This request returns a task_id that you can use to query the result.
curl 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
    --header 'X-DashScope-Async: enable' \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "wan2.2-s2v",
        "input": {
            "image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg",
            "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250825/iaqpio/input_audio.MP3"
        },
        "parameters": {
            "style": "speech"
        }
    }'
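For readers who prefer Python to curl, the same request can be sketched with the standard library alone. The URL, headers, and body fields are copied from the curl command. The function names are hypothetical, and the location of the task ID in the response (`output.task_id`) is an assumption, because this topic does not show a sample response.

```python
import json
import os
import urllib.request

CREATE_TASK_URL = (
    "https://dashscope.aliyuncs.com/api/v1/services/aigc/"
    "image2video/video-synthesis/"
)


def build_create_task_request(image_url, audio_url, api_key, style="speech"):
    """Build the same POST request as the curl command, without sending it."""
    body = {
        "model": "wan2.2-s2v",
        "input": {"image_url": image_url, "audio_url": audio_url},
        "parameters": {"style": style},
    }
    return urllib.request.Request(
        CREATE_TASK_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-DashScope-Async": "enable",
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_task(image_url, audio_url):
    """Send the request and return the task ID.

    Assumption: the response JSON carries the ID at output.task_id.
    """
    request = build_create_task_request(
        image_url, audio_url, os.environ["DASHSCOPE_API_KEY"]
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["output"]["task_id"]
```

Building the request separately from sending it makes it possible to inspect the headers and body before any network call is made.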
Step 2: Query the result using the task ID
Replace 86ecf553-d340-4e21-xxxxxxxxx with your actual task_id.
If you use a model in the Singapore region, replace base_url with https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx, where WorkspaceId is your actual workspace ID.
curl -X GET https://dashscope.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"
A task_id is valid for 24 hours. After it expires, queries fail and the API returns a status of UNKNOWN.
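Polling can be sketched in Python as a bounded loop around the query request. The function names, the 15-second interval, and the attempt limit are illustrative choices, not documented values. The sketch also assumes that the status is reported at `output.task_status` and that `SUCCEEDED`, `FAILED`, and `CANCELED` are terminal states. Only `UNKNOWN` is confirmed by the text above.

```python
import json
import os
import time
import urllib.request

TASKS_URL = "https://dashscope.aliyuncs.com/api/v1/tasks/"
# Assumed terminal statuses; only UNKNOWN is confirmed by the text above.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED", "UNKNOWN"}


def fetch_task(task_id):
    """GET the task record, mirroring the curl command above."""
    request = urllib.request.Request(
        TASKS_URL + task_id,
        headers={"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_task(task_id, fetch=fetch_task, interval=15.0, max_attempts=40,
                  sleep=time.sleep):
    """Poll until the task reaches a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(task_id)
        # Assumption: the status is reported at output.task_status.
        status = result.get("output", {}).get("task_status")
        if status in TERMINAL_STATUSES:
            return result
        if attempt < max_attempts - 1:
            sleep(interval)  # Wait before the next query.
    raise TimeoutError(f"task {task_id} did not finish in {max_attempts} polls")
```

The `fetch` and `sleep` parameters exist so that the loop can be exercised with stubs. In normal use, `wait_for_task(task_id)` is enough.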
Model comparison
Model selection guidance: Use wan2.2-s2v to generate full-body or upper-body videos. For cost-effective portrait videos, choose EMO.
| Feature comparison | Digital human wan2.2-s2v | EMO |
| --- | --- | --- |
| Description | Larger, more natural motion. Supports wide aspect ratios, especially full-body shots. Works with cartoon characters. | Better for close-ups or portraits. Natural lip-sync and facial expressions. |
| Supported aspect ratios | Full-body, half-body, portrait | Portrait, half-body (recommended) |
| Calling method | Two-step process. The detection API checks compliance only. Integration is simple. | Two-step process. Coordinates returned by the detection API are required input for the generation API. |
| Style control | Scenario-driven (speaking, singing, performing) | Style-driven (moderate, calm, lively) |
| Output specifications | By resolution (480P, 720P) | By aspect ratio (1:1, 3:4) |
| Pricing | | |
Next steps
Review the API documentation to start development based on your needs.
