Wanxiang - Digital Human

The wan2.2-s2v digital human model generates natural-looking speaking, singing, or performing videos from a single image and an audio file. It supports any aspect ratio and works with portrait, full-body, or half-body images.

Important

This document applies only to the China (Beijing) region. Use an API key from this region.

Model overview

Example output

  • Input example: an input image (input_image) and an input audio file

  • Output: the generated video

Model and pricing

Rate limits are shared by the Alibaba Cloud account and its RAM users.

wan2.2-s2v-detect

  • Description: Checks whether the input image meets requirements, such as clarity, single-person composition, and front-facing pose.

  • Price per unit: CNY 0.004 per image

  • RPS limit for task submission API: 5

  • Maximum concurrent tasks: No limit for synchronous APIs

  • Free quota: 200 images

wan2.2-s2v

  • Description: Generates a dynamic video of a person using a validated image and an audio file.

  • Price per unit:

    • 480P: CNY 0.5 per second

    • 720P: CNY 0.9 per second

  • RPS limit for task submission API: 5

  • Maximum concurrent tasks: 1

  • Free quota: 100 seconds
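As a rough illustration of how the listed prices combine, the following sketch estimates the list-price cost of one detection call plus one generated video. The function name and the linear pricing assumption are illustrative; the free quota and any billing rounding rules are not modeled.

```python
from decimal import Decimal

# Per-second list prices in CNY for wan2.2-s2v, taken from the pricing list above.
PRICE_PER_SECOND_CNY = {
    "480P": Decimal("0.5"),
    "720P": Decimal("0.9"),
}
# Per-image list price in CNY for wan2.2-s2v-detect.
DETECT_PRICE_PER_IMAGE_CNY = Decimal("0.004")


def estimate_cost_cny(duration_seconds, resolution, detected_images=1):
    """Estimate the list-price cost of detection plus video generation.

    Assumption: cost scales linearly with video duration. Free quota and
    billing rounding are ignored.
    """
    if resolution not in PRICE_PER_SECOND_CNY:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if duration_seconds < 0 or detected_images < 0:
        raise ValueError("duration and image count must be non-negative")
    video = PRICE_PER_SECOND_CNY[resolution] * Decimal(str(duration_seconds))
    detection = DETECT_PRICE_PER_IMAGE_CNY * detected_images
    return video + detection
```

For example, a 10-second 480P video with one detection call comes to 0.5 × 10 + 0.004 = CNY 5.004 under these assumptions.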

To generate a digital human video, follow these steps:

  • Step 1: Call the wan2.2-s2v-detect API with the image URL to verify compliance.

  • Step 2: If the image passes validation, call the asynchronous wan2.2-s2v API with the image URL and audio URL to submit the video generation task. Then poll for the result.
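The two steps above can be sketched as a small orchestration function. This is only an outline: `detect`, `submit`, and `wait` are hypothetical callables that you would implement against the image detection and video generation APIs, and the assumption that detection reduces to a single pass/fail boolean is a simplification.

```python
def generate_digital_human(image_url, audio_url, detect, submit, wait):
    """Run detection first and submit a generation task only if it passes.

    detect(image_url) -> bool          hypothetical wrapper around wan2.2-s2v-detect
    submit(image_url, audio_url) -> str  hypothetical wrapper returning a task_id
    wait(task_id) -> result            hypothetical poller for the task result
    """
    # Step 1: verify that the image meets the model's requirements.
    if not detect(image_url):
        raise ValueError("image did not pass detection; no task was submitted")
    # Step 2: submit the asynchronous task, then poll for the result.
    task_id = submit(image_url, audio_url)
    return wait(task_id)
```

Gating the submission on the detection result avoids paying for a generation task that is likely to fail on a non-compliant image.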

Getting started

Prerequisites

Before calling the API, enable the model service and obtain an API key. Then set the API key as an environment variable.
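A minimal sketch of reading the key from the environment and failing fast when it is missing. The variable name DASHSCOPE_API_KEY matches the curl samples in this topic; the helper name `auth_headers` is illustrative.

```python
import os


def auth_headers(env=os.environ):
    """Build request headers from the DASHSCOPE_API_KEY environment variable."""
    api_key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not api_key:
        # Fail before any request is sent, with a message that names the variable.
        raise RuntimeError("DASHSCOPE_API_KEY is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```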

Sample code

The sample image in this topic has already passed image detection. The following shows sample code for video generation.

Note

An HTTP call takes two steps: first create a task, then retrieve the result. If you are new to HTTP APIs, a tool such as Postman can make the calls easier.

Step 1: Create a task and get the task ID

This request returns a task_id that you can use to query the result.

curl 'https://dashscope.aliyuncs.com/api/v1/services/aigc/image2video/video-synthesis/' \
 --header 'X-DashScope-Async: enable' \
 --header "Authorization: Bearer $DASHSCOPE_API_KEY" \
 --header 'Content-Type: application/json' \
 --data '{
    "model": "wan2.2-s2v",
    "input": {
        "image_url": "https://img.alicdn.com/imgextra/i3/O1CN011FObkp1T7Ttowoq4F_!!6000000002335-0-tps-1440-1797.jpg",
        "audio_url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250825/iaqpio/input_audio.MP3"
    },
    "parameters": {
        "style": "speech"
    }
}'
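The same request can be built in Python with only the standard library. This is a sketch, not an official SDK call: the function names are illustrative, and the location of the task ID in the response (`output.task_id`) is an assumption that you should confirm against the video generation API reference.

```python
import json
import urllib.request

# Endpoint copied from the curl sample above.
CREATE_TASK_URL = (
    "https://dashscope.aliyuncs.com/api/v1/services/aigc/"
    "image2video/video-synthesis/"
)


def build_create_task_request(api_key, image_url, audio_url, style="speech"):
    """Build (but do not send) the asynchronous task-creation request."""
    payload = {
        "model": "wan2.2-s2v",
        "input": {"image_url": image_url, "audio_url": audio_url},
        "parameters": {"style": style},
    }
    return urllib.request.Request(
        CREATE_TASK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-DashScope-Async": "enable",  # request asynchronous processing
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_task(request, timeout=30):
    """Send the request and return the task ID."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the task ID is returned under output.task_id.
    return body["output"]["task_id"]
```

Separating request construction from sending makes it possible to inspect the headers and body before any billable call is made.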
Step 2: Query the result using the task ID

Replace 86ecf553-d340-4e21-xxxxxxxxx with your actual task_id.

If you use a model in the Singapore region, replace the request URL with https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx, where {WorkspaceId} is your actual workspace ID.
curl -X GET https://dashscope.aliyuncs.com/api/v1/tasks/86ecf553-d340-4e21-xxxxxxxxx \
--header "Authorization: Bearer $DASHSCOPE_API_KEY"

A task_id is valid for 24 hours. After it expires, queries fail and the API returns a status of UNKNOWN.
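Because the task is asynchronous, a client typically polls the query endpoint until the task reaches a final state. The sketch below shows one way to structure that loop. `fetch_status` is a hypothetical callable that performs the GET request shown above and returns the status string; apart from UNKNOWN, the status names and the default interval are assumptions to check against the API reference.

```python
import time

# Assumption: these are the final task states. Only UNKNOWN is named in this topic.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELED", "UNKNOWN"}


def wait_for_task(task_id, fetch_status, interval_seconds=15.0,
                  max_attempts=40, sleep=time.sleep):
    """Poll until the task reaches a terminal status or attempts run out.

    fetch_status(task_id) -> str is a hypothetical wrapper around the
    task query request.
    """
    for attempt in range(max_attempts):
        status = fetch_status(task_id)
        if status in TERMINAL_STATUSES:
            return status
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Treating UNKNOWN as terminal stops the loop once a task_id has expired instead of polling indefinitely.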

Model comparison

Model selection guidance: Use wan2.2-s2v to generate full-body or upper-body videos. For cost-effective portrait videos, choose EMO.

Feature comparison

wan2.2-s2v (digital human)

  • Description: Larger, more natural motion. Supports wide aspect ratios, especially full-body shots. Works with cartoon characters.

  • Supported aspect ratios: Full-body, half-body, portrait

  • Calling method: Two-step process. The detection API checks compliance only. Integration is simple.

  • Style control: Scenario-driven (speaking, singing, performing)

  • Output specifications: By resolution (480P, 720P)

  • Pricing:

    • Image detection: CNY 0.004 per image

    • Video generation, 480P: CNY 0.5 per second

    • Video generation, 720P: CNY 0.9 per second

EMO

  • Description: Better for close-ups or portraits. Natural lip-sync and facial expressions.

  • Supported aspect ratios: Portrait, half-body (recommended)

  • Calling method: Two-step process. Coordinates returned by the detection API are required input for the generation API.

  • Style control: Style-driven (moderate, calm, lively)

  • Output specifications: By aspect ratio (1:1, 3:4)

  • Pricing:

    • Image detection: CNY 0.004 per image

    • Video generation, 1:1 aspect ratio: CNY 0.08 per second

    • Video generation, 3:4 aspect ratio: CNY 0.16 per second

Next steps

Review the API documentation to start development based on your needs:

Image detection API

Video generation API