How to generate lip-sync videos using Wanxiang Digital Human

The wan2.2-s2v digital human model generates natural-looking videos of a person speaking, singing, or performing from a single image and an audio file. It supports any aspect ratio and works with portrait, half-body, or full-body images.

Important

This document applies only to the China (Beijing) region. Use an API key from this region.

Model overview

Example output

Input example	Output video
input_image and input audio	Generated video

Model and pricing

Model name	Description	Price per unit	RPS limit for task submission API (shared by the Alibaba Cloud account and its RAM users)	Maximum concurrent tasks (shared by the Alibaba Cloud account and its RAM users)	Free quota
wan2.2-s2v-detect	Checks whether the input image meets requirements, such as image clarity, single-person composition, and front-facing pose.	CNY 0.004 per image	5	No limit for synchronous APIs	200 images
wan2.2-s2v	Generates a dynamic video of a person using a validated image and an audio file.	480P: CNY 0.5 per second; 720P: CNY 0.9 per second	5	1	100 seconds
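
To illustrate how the per-second pricing adds up, the following sketch estimates the cost of a single video from the rates in the table. The function name estimate_cost_cny is hypothetical, and the sketch assumes that cost is simply the video duration multiplied by the per-second price, with the free quota ignored.

```shell
# Hypothetical helper: estimate the cost of one wan2.2-s2v video in CNY.
# Assumption: cost = duration in seconds x per-second price from the table.
estimate_cost_cny() {
  resolution="$1"   # "480P" or "720P"
  seconds="$2"      # video duration in seconds
  case "$resolution" in
    480P) rate="0.5" ;;
    720P) rate="0.9" ;;
    *) echo "unknown resolution: $resolution" >&2; return 1 ;;
  esac
  awk -v r="$rate" -v s="$seconds" 'BEGIN { printf "%.2f\n", r * s }'
}

estimate_cost_cny 480P 10   # prints 5.00
estimate_cost_cny 720P 10   # prints 9.00
```

Under the same assumption, the image detection call adds CNY 0.004 per image on top of the video cost.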

To generate a digital human video, follow these steps:

Step 1: Call the wan2.2-s2v-detect API with the image URL to verify compliance.
Step 2: If the image passes validation, call the asynchronous wan2.2-s2v API with the image URL and audio URL to submit the video generation task. Then poll for the result.

Getting started

Prerequisites

Before calling the API, enable the model service and obtain an API key. Then set the API key as an environment variable.
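
As a minimal sketch for Linux or macOS shells, the API key can be exported for the current session as shown below. The variable name DASHSCOPE_API_KEY matches the one used by the sample requests in this topic. The value sk-xxx is a placeholder, not a real key.

```shell
# Placeholder value: replace sk-xxx with your API key from the China (Beijing) region.
export DASHSCOPE_API_KEY="sk-xxx"

# Confirm that the variable is set before running the sample requests.
if [ -n "$DASHSCOPE_API_KEY" ]; then
  echo "DASHSCOPE_API_KEY is set"
fi
```

An export like this lasts only for the current shell session. To keep it across sessions, add the export line to your shell profile.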

Sample code

The sample image in this topic has already passed image detection. The following shows sample code for video generation.

Note

An HTTP call takes two steps: first create a task, then retrieve the result. If you are new to HTTP APIs, consider using Postman to send the requests.

Step 1: Create a task and get the task ID

This request returns a task_id that you can use to query the result.

curl 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
  --header 'X-DashScope-Async: enable' \
  --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "wan2.2-s2v",
    "input": {
      "image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg",
      "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250825/iaqpio/input_audio.MP3"
    },
    "parameters": {
      "style": "speech"
    }
  }'
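
When you script this step, you usually need to capture the task_id from the response body. The following sketch shows one way to do that with sed. The helper name extract_task_id is hypothetical, and the sample body is only an assumed response shape: this topic does not show the actual response, so check the API reference for the exact fields.

```shell
# Hypothetical helper: pull the task_id string out of a JSON response body.
# Assumption: the body contains a field of the form "task_id": "<value>".
extract_task_id() {
  printf '%s' "$1" | sed -n 's/.*"task_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Sample body in the assumed shape. The ID is the placeholder used in this topic.
response='{"output":{"task_status":"PENDING","task_id":"86ecf553-d340-4e21-xxxxxxxxx"},"request_id":"example"}'
task_id="$(extract_task_id "$response")"
echo "$task_id"
```

In a real script, response would hold the output of the curl request above, for example by running it with the -s option inside a command substitution. If jq is installed, it is a sturdier choice than sed for parsing JSON.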

Step 2: Query the result using the task ID

Replace 86ecf553-d340-4e21-xxxxxxxxx with your actual task_id.

If you use a model in the Singapore region, replace the request URL with https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx, where {WorkspaceId} is your actual workspace ID.

curl -X GET https://dashscope.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

A task_id is valid for 24 hours. After it expires, the task can no longer be queried and the API returns a task status of UNKNOWN.
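
Because video generation is asynchronous, the query in Step 2 usually has to be repeated until the task finishes. The following sketch wraps the query request in a polling loop. The helper names fetch_task and poll_task are hypothetical. The sketch assumes that the response contains a task_status field and that PENDING, RUNNING, and SUCCEEDED are among its values. Only UNKNOWN is named in this topic, so confirm the other values in the API reference.

```shell
# Hypothetical helper: fetch the raw JSON for a task from the query endpoint.
fetch_task() {
  curl -s -X GET "https://dashscope.aliyuncs.com/api/v1/tasks/$1" \
    --header "Authorization: Bearer $DASHSCOPE_API_KEY"
}

# Hypothetical helper: poll until the task leaves PENDING/RUNNING
# or the attempt budget runs out.
# Assumption: the body contains "task_status": "<value>".
poll_task() {
  task_id="$1"
  max_attempts="${2:-60}"   # number of queries before giving up
  interval="${3:-15}"       # seconds to wait between queries
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    body="$(fetch_task "$task_id")"
    status="$(printf '%s' "$body" | sed -n 's/.*"task_status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')"
    case "$status" in
      PENDING|RUNNING) sleep "$interval" ;;
      SUCCEEDED) printf '%s\n' "$body"; return 0 ;;
      *) echo "task ended with status: ${status:-none}" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
  done
  echo "gave up after $max_attempts attempts" >&2
  return 2
}

# Usage (makes network calls): poll_task "86ecf553-d340-4e21-xxxxxxxxx"
```

On success, the function prints the final response body. Any other terminal status, including UNKNOWN for an expired task_id, ends the loop with a nonzero exit code. The default interval of 15 seconds is an arbitrary choice for this sketch.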

Model comparison

Model selection guidance: Use wan2.2-s2v to generate full-body or half-body videos. For cost-effective portrait videos, choose EMO.

Feature comparison	wan2.2-s2v	EMO
Description	Larger, more natural motion. Supports a wide range of aspect ratios and is especially suited to full-body shots. Works with cartoon characters.	Better for close-ups or portraits. Natural lip-sync and facial expressions.
Supported framing	Full-body, half-body, portrait	Portrait, half-body (portrait recommended)
Calling method	Two-step process. The detection API checks compliance only. Integration is simple.	Two-step process. Coordinates returned by the detection API are required input for the generation API.
Style control	Scenario-driven (speaking, singing, performing)	Style-driven (moderate, calm, lively)
Output specifications	By resolution (480P, 720P)	By aspect ratio (1:1, 3:4)
Pricing	Image detection: CNY 0.004 per image. Video generation: 480P: CNY 0.5 per second; 720P: CNY 0.9 per second.	Image detection: CNY 0.004 per image. Video generation: 1:1 aspect ratio: CNY 0.08 per second; 3:4 aspect ratio: CNY 0.16 per second.

Next steps

Review the API documentation to start development based on your needs:

Image detection API

Video generation API