Billing


Billing overview

The Multimodal Interactive Development Suite supports postpaid (pay-as-you-go, savings plan) and prepaid (device subscription) billing methods. If you use the postpaid method, keep your account balance sufficient to avoid service disruption due to overdue payment. Device subscription (License mode) currently applies to RTOS SDK, Linux SDK, Android SDK, and iOS SDK.

| Billing method | Description |
| --- | --- |
| Postpaid (pay-as-you-go) | Billed per actual interaction round |
| Prepaid (device subscription/License) | Subscribe to an AI interaction resource plan per device per year |
| Savings plan | Commit to a spending amount to get a discount that offsets all postpaid charges |

Free trial

New users can claim a one-time CNY 10 trial quota for the postpaid method on the free trial page. The quota is valid for three months after it is claimed.

Postpaid (pay-as-you-go)

Billing rules

The multimodal interaction pipeline includes four billable items: speech recognition (ASR), intent recognition, large language model (LLM) conversation, and speech synthesis (TTS). Each interaction round is billed separately based on the components actually used. Unused components are not billed.

  1. Speech recognition: Supports two real-time speech recognition models. Not billed when unused.

  2. Speech synthesis: Supports two speech synthesis options. Not billed when unused.

  3. Intent recognition: Classifies user intent and routes it to downstream modules. Not billed when unused.

  4. LLM conversation: Includes casual chat (with plugins, instructions, and web search), knowledge base Q&A, and various Agents. Billed per capability invoked per round.

    When invoking Agents or plugins from Alibaba Cloud Model Studio, fees are settled directly by Model Studio with no additional charge.

Billable items and standard pricing

Pricing depends on whether each component is used and which model or capability is selected. See the table below for details:

| Interaction pipeline | Item | Standard price (CNY per 1,000 calls) | Notes |
| --- | --- | --- | --- |
| Voice interaction | Multimodal lightweight speech recognition | 0.05 | Optional. Counted as one call per interaction round |
| Voice interaction | Standard speech recognition | 0.75 | Optional. Counted as one call per interaction round |
| Voice interaction | Multimodal lightweight speech synthesis | 0.09 | Optional. Counted as one call per interaction round |
| Voice interaction | Standard speech synthesis | 1.70 | Optional. Counted as one call per interaction round |
| Intent recognition | Intent recognition (optional) | 0.80 | Counted as one call per interaction round |
| LLM conversation (choose one per interaction round) | Casual chat (includes built-in plugins, instructions, music radio, multimodal notes, map and travel) | 2.20 | Counted as one call per interaction round |
| LLM conversation (choose one per interaction round) | Knowledge base Q&A | 3.70 | Counted as one call per interaction round. See the fee details after this table. |
| LLM conversation (choose one per interaction round) | Voice translation | 5.70 | Counted as one call per Agent launch |
| LLM conversation (choose one per interaction round) | News radio/children’s stories | 0.02 | Counted as one call per Agent launch |
| LLM conversation (choose one per interaction round) | Photo Q&A – Balanced Edition | 0.18 | Counted as one call per interaction round. Valid only in direct connection mode. Image size must not exceed 640×480. For higher-resolution scenarios, use Photo Q&A Premium Edition. |
| LLM conversation (choose one per interaction round) | Video conversation/photo translation/photo Q&A – Premium Edition | 12.00 | Counted as one call per interaction round |
| LLM conversation (choose one per interaction round) | Proactive tour guide | 1.20 | Counted as one call per interaction round. Each proactive image analysis also counts as one interaction. |
| LLM conversation (choose one per interaction round) | Meeting minutes from audio | Billed through Tongyi Tingwu | For details, see Product Overview. |

Fee details for knowledge base Q&A:

  • This price covers LLM inference only: retrieved knowledge base snippets are injected into the prompt, increasing token count. This token increase is already included in the CNY 3.70 per 1,000 calls.

  • The knowledge base service’s own specification fee and model invocation fee are billed separately by Model Studio and are not included in the CNY 3.70 per 1,000 calls.
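As a rough illustration of the rule that only the components actually used in a round are billed, the sketch below sums the standard prices listed above for a chosen set of components. The dictionary keys and the function name are hypothetical labels invented for this example, not identifiers from any SDK. Items billed per Agent launch (voice translation, news radio/children’s stories) are left out to keep the example to per-round items.

```python
from decimal import Decimal

# Standard prices in CNY per 1,000 calls, copied from the pricing list above.
# The keys are hypothetical labels chosen for this sketch.
PRICES = {
    "asr_lightweight": Decimal("0.05"),
    "asr_standard": Decimal("0.75"),
    "tts_lightweight": Decimal("0.09"),
    "tts_standard": Decimal("1.70"),
    "intent": Decimal("0.80"),
    "casual_chat": Decimal("2.20"),
    "knowledge_base_qa": Decimal("3.70"),
    "photo_qa_balanced": Decimal("0.18"),
    "visual_premium": Decimal("12.00"),
    "proactive_tour_guide": Decimal("1.20"),
}

# Only one LLM conversation capability may be chosen per interaction round.
LLM_CAPABILITIES = {
    "casual_chat",
    "knowledge_base_qa",
    "photo_qa_balanced",
    "visual_premium",
    "proactive_tour_guide",
}


def cost_per_1000_rounds(components):
    """Sum the standard prices of the components used in one round.

    Components that are not listed are simply not billed.
    """
    used = set(components)
    if len(used & LLM_CAPABILITIES) > 1:
        raise ValueError("choose at most one LLM capability per round")
    return sum((PRICES[name] for name in used), Decimal("0"))


print(cost_per_1000_rounds(["asr_standard", "intent", "tts_standard", "casual_chat"]))
```

The printed value, 5.45, matches the "Standard voice casual chat" estimate later in this document. `Decimal` is used instead of `float` so that the sums of two-decimal prices stay exact.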

Special cases:

  1. News radio/children’s stories: These are persistent broadcasting Agents. After launch, every spoken sentence triggers its own speech synthesis charge, in addition to the one-time launch charge.

  2. Voice translation: Billed once per Agent entry, regardless of translation length.

  3. Web search and long-term memory are free for a limited time.

  4. In postpaid mode, some premium models cost more and may count as more than one billing unit per interaction. Choosing a more expensive model increases the billing multiplier, as shown below:

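| Component | Base specification (CNY per 1,000 calls) | Model | Billing multiplier |
| --- | --- | --- | --- |
| Speech recognition | Speech recognition | Fun-ASR, Qwen3-ASR-Flash-Realtime | 3x |
| Speech synthesis | Speech synthesis | CosyVoice-v3-Plus, Qwen3-TTS series | 3x |
| LLM conversation | Casual chat and plugins | Qwen3.7-Plus, Qwen3.6-Plus, Qwen-Max, Qwen3-Coder-Plus | 2x |
| LLM conversation | Casual chat and plugins | deepseek series except deepseek-v4-pro, GLM, Kimi, MiniMax, Qwen3.5-Omni-Flash | 4x |
| LLM conversation | Casual chat and plugins | Qwen3.6-Max | 6x |
| LLM conversation | Casual chat and plugins | Qwen3.7-Max, Qwen3.5-Omni-Plus (text output), deepseek-v4-pro | 8x |
| LLM conversation | Casual chat and plugins | Qwen3.5-Omni-Plus (audio output), farui-plus | 13x |
| LLM conversation | Casual chat and plugins | Qwen-Deep-Research | 32x |
| Knowledge base Q&A | Knowledge base Q&A | The LLM selected for the application | Same billing multiplier as the selected LLM |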
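To make the multiplier concrete, here is a minimal sketch for casual chat. It assumes that a model's multiplier scales the number of billed calls, so the effective cost is the base price times the multiplier; that reading is an interpretation of the rule above. Only a few models are included, the function name is hypothetical, and models missing from the mapping are assumed to count once.

```python
from decimal import Decimal

# Base price for casual chat, in CNY per 1,000 calls.
CASUAL_CHAT_BASE = Decimal("2.20")

# A subset of the multipliers listed above, for illustration only.
LLM_MULTIPLIER = {
    "Qwen-Max": 2,
    "Qwen3.6-Max": 6,
    "Qwen3.7-Max": 8,
    "Qwen-Deep-Research": 32,
}


def casual_chat_cost_per_1000(model):
    """Effective casual chat cost per 1,000 rounds for the given model.

    Assumption: models not present in LLM_MULTIPLIER count as one call.
    """
    multiplier = LLM_MULTIPLIER.get(model, 1)
    return CASUAL_CHAT_BASE * multiplier


print(casual_chat_cost_per_1000("Qwen3.6-Max"))  # 13.20
```

Under this assumption, 1,000 casual chat rounds on Qwen3.6-Max cost CNY 13.20 for the LLM component, before adding speech recognition, intent recognition, or speech synthesis.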

Knowledge base billing details

Knowledge base Q&A involves two separate fees charged by different services:

| Fee item | Fee content | Reason for charge | Billed by |
| --- | --- | --- | --- |
| LLM inference fee | CNY 3.70 per 1,000 calls | Knowledge base snippets are appended to the prompt, increasing the token count processed by the LLM | Multimodal Interactive Development Suite |
| Knowledge base service fee | Billed separately | Costs for knowledge base operation, vectorization, and retrieval | Knowledge base service (for details, see Knowledge Base Billing Details) |

Key distinction:

  • The CNY 3.70 per 1,000 calls pays the LLM for processing extra content.

  • The knowledge base fee pays the knowledge base service for running and retrieving data.

Typical scenario cost estimates

Examples of common interaction scenarios:

| Scenario | Estimated cost (CNY per 1,000 calls) | Cost breakdown | Notes |
| --- | --- | --- | --- |
| Standard voice casual chat | 5.45 | 0.75 (standard speech recognition) + 0.8 (intent recognition) + 1.7 (standard speech synthesis) + 2.2 (casual chat, may include plugins, instructions, web search) | Counted as one call per interaction round |
| Lightweight voice casual chat | 2.34 | 0.05 (lightweight speech recognition) + 0.09 (lightweight speech synthesis) + 2.2 (casual chat) | Counted as one call per interaction round. Without intent recognition, plugins, instructions, web search, and Agents are unavailable. |
| Plain text conversation | 3.0 | 0.8 (intent recognition) + 2.2 (casual chat, may include plugins, instructions, web search) | Uses no speech recognition or synthesis; input and output are plain text only. |
| Knowledge base retrieval | 6.25 | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 3.7 (knowledge base retrieval) | Counted as one call per interaction round |
| Voice translation | 8.25 | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 5.7 (voice translation) | Counted as one call per voice translation session |
| Real-time video conversation/photo Q&A | 14.55 | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 12 (visual understanding) | Counted as one call per interaction round |
| News radio/children’s stories | Depends on the number of spoken sentences | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) × n (number of sentences) + 0.02 (news radio) | Cost includes: speech recognition + intent recognition + one-time launch + speech synthesis. Each spoken sentence triggers a speech synthesis call, billed per sentence. |
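The news radio/children’s stories estimate is the only one that varies with usage, so a small sketch of the formula may help. It assumes every launch produces the same number of spoken sentences, n, and reports the cost per 1,000 launches. The function name is hypothetical.

```python
from decimal import Decimal

ASR_LIGHTWEIGHT = Decimal("0.05")  # lightweight speech recognition
INTENT = Decimal("0.80")           # intent recognition
LAUNCH = Decimal("0.02")           # one-time news radio/children's stories launch
TTS_STANDARD = Decimal("1.70")     # speech synthesis, billed once per sentence


def news_radio_cost_per_1000(sentences):
    """Cost in CNY per 1,000 launches, each speaking `sentences` sentences."""
    if sentences < 0:
        raise ValueError("sentences must be non-negative")
    return ASR_LIGHTWEIGHT + INTENT + LAUNCH + TTS_STANDARD * sentences


print(news_radio_cost_per_1000(10))  # 17.87
```

With ten sentences per launch, speech synthesis accounts for CNY 17.00 of the CNY 17.87 total, which shows that synthesis dominates the cost of long broadcasts.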

Prepaid (device subscription)

The Multimodal Interactive Development Suite offers a prepaid device subscription model: you subscribe to an AI interaction resource plan per device per year. After selecting a tier, you receive a unified resource pool that covers the full pipeline (speech recognition, LLM conversation, and speech synthesis) and all extended features.

Tier options

Select a tier based on your business needs.

| Tier | Price (CNY per device per year) |
| --- | --- |
| Trial Edition | 2 |
| Basic Edition | 5 |
| Standard Edition | 10 |

Billing rules

  • Minimum purchase: 100 devices.

  • Resource pool mechanism: Purchase grants a unified resource pool. Because different features consume varying AI resources, actual available interaction counts depend on usage distribution. The purchase page includes a calculator to estimate available interactions based on feature configuration (for reference only; actual usage depends on feature distribution).

  • Add-on purchases: You can purchase additional subscriptions for the same application or workspace. Resource pool quotas are combined and shared.

  • Shared mode: At purchase, bind the subscription to a workspace. Multiple applications within the same workspace share the same resource pool.
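The tier prices and the minimum purchase rule combine into a simple order total. The sketch below is illustrative only: the function name is hypothetical, and it assumes a one-year term and that the 100-device minimum applies to each order, which the rules above do not spell out for add-on purchases.

```python
# Tier prices in CNY per device per year, from the tier options above.
TIER_PRICE_CNY = {
    "Trial Edition": 2,
    "Basic Edition": 5,
    "Standard Edition": 10,
}

MIN_DEVICES = 100  # minimum purchase


def subscription_cost(tier, devices):
    """Total price in CNY for one year of `devices` subscriptions."""
    if devices < MIN_DEVICES:
        raise ValueError(f"minimum purchase is {MIN_DEVICES} devices")
    return TIER_PRICE_CNY[tier] * devices


print(subscription_cost("Basic Edition", 100))  # 500
```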

Activation and validity

  • Each License must be activated within one year of purchase.

  • Usage period starts on the device activation date:

    • Example: Order 100 one-year device subscriptions on July 1, 2026, and activate them on October 1, 2026. The usage period ends on October 1, 2027.
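The usage period in the example can be reproduced with a short date calculation. This is a sketch, not the service's actual logic: the function name is hypothetical, and the handling of a February 29 activation date is an assumption, since the rules above do not cover it.

```python
from datetime import date


def usage_period_end(activation):
    """Return the date one year after the device activation date."""
    try:
        return activation.replace(year=activation.year + 1)
    except ValueError:
        # February 29 has no counterpart in a non-leap year.
        # Assumption for this sketch: fall back to February 28.
        return activation.replace(year=activation.year + 1, day=28)


# Ordered on July 1, 2026 and activated on October 1, 2026.
print(usage_period_end(date(2026, 10, 1)))  # 2027-10-01
```

The order date does not enter the calculation: the period is counted from activation, which is why the example ends on October 1, 2027 rather than July 1, 2027.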

Refund policy: No refunds after purchase.

Savings plan

The savings plan is a discount program that offsets all pay-as-you-go charges for multimodal interactions. Commit to a spending amount to receive a discount, reducing costs by 5% to 50% compared to standard pay-as-you-go rates, depending on the committed amount.

| Committed spend (CNY) | Discount | Validity |
| --- | --- | --- |
| 20 ≤ value < 100 | 0.95 | 3 months |
| 100 ≤ value ≤ 2,000 | 0.90 | 1 year |
| 2,000 < value ≤ 5,000 | 0.85 | 1 year |
| 5,000 < value ≤ 20,000 | 0.80 | 1 year |
| 20,000 < value ≤ 50,000 | 0.75 | 1 year |
| 50,000 < value ≤ 100,000 | 0.70 | 1 year |
| 100,000 < value ≤ 200,000 | 0.65 | 1 year |
| 200,000 < value ≤ 300,000 | 0.60 | 1 year |
| 300,000 < value ≤ 400,000 | 0.55 | 1 year |
| 400,000 < value ≤ 500,000 | 0.50 | 1 year |
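Because the tier boundaries mix inclusive and exclusive limits, a lookup sketch can help check which tier a given commitment falls into. The function name is hypothetical, and the code only returns the discount value and validity from the tiers above; it does not model how the discount is applied to charges.

```python
from decimal import Decimal

# (inclusive upper bound in CNY, discount) for the one-year tiers listed above.
ONE_YEAR_TIERS = [
    (2_000, "0.90"),
    (5_000, "0.85"),
    (20_000, "0.80"),
    (50_000, "0.75"),
    (100_000, "0.70"),
    (200_000, "0.65"),
    (300_000, "0.60"),
    (400_000, "0.55"),
    (500_000, "0.50"),
]


def savings_plan_terms(committed):
    """Return (discount, validity) for a committed spend in CNY."""
    amount = Decimal(str(committed))
    if amount < 20 or amount > 500_000:
        raise ValueError("committed spend is outside the listed tiers")
    if amount < 100:
        return Decimal("0.95"), "3 months"
    for upper, discount in ONE_YEAR_TIERS:
        if amount <= upper:
            return Decimal(discount), "1 year"


print(savings_plan_terms(100))    # (Decimal('0.90'), '1 year')
print(savings_plan_terms(2_000))  # (Decimal('0.90'), '1 year')
```

Note the boundaries: exactly CNY 100 already belongs to the one-year 0.90 tier, and exactly CNY 2,000 stays in that tier rather than moving to 0.85.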

How to purchase

Buy General-purpose Savings Plan

  • Payment type: One-time upfront payment

  • Validity: 3 months for a commitment of at least CNY 20 and less than CNY 100; 1 year for CNY 100 and above. Unused balance is not refunded upon expiration.