Billing overview
The Multimodal Interactive Development Suite supports postpaid (pay-as-you-go, savings plan) and prepaid (device subscription) billing methods. If you use the postpaid method, keep your account balance sufficient to avoid service disruption due to overdue payment. Device subscription (License mode) currently applies to RTOS SDK, Linux SDK, Android SDK, and iOS SDK.
| Billing method | Description |
| --- | --- |
| Postpaid (pay-as-you-go) | Billed per actual interaction round |
| Prepaid (device subscription/License) | Subscribe to an AI interaction resource plan per device per year |
| Savings plan | Commit to a spending amount to get a discount that offsets all postpaid charges |
Free trial
Go to free trial. New users receive a one-time trial quota of CNY 10 for the postpaid method. The quota is valid for three months after it is claimed.
Postpaid (pay-as-you-go)
Billing rules
The multimodal interaction pipeline includes four billable items: speech recognition (ASR), intent recognition, large language model (LLM) conversation, and speech synthesis (TTS). Each interaction round is billed separately based on the components actually used. Unused components are not billed.
- Speech recognition: Supports two real-time speech recognition models. Not billed when unused.
- Speech synthesis: Supports two speech synthesis options. Not billed when unused.
- Intent recognition: Classifies user intent and routes it to downstream modules. Not billed when unused.
- LLM conversation: Includes casual chat (with plugins, instructions, and web search), knowledge base Q&A, and various Agents. Billed per capability invoked per round.
When invoking Agents or plugins from Alibaba Cloud Model Studio, fees are settled directly by Model Studio with no additional charge.
Billable items and standard pricing
Pricing depends on whether each component is used and which model or capability is selected. See the table below for details:
| Interaction pipeline | Billable item | Standard price (CNY per 1,000 calls) | Notes |
| --- | --- | --- | --- |
| Voice interaction | Multimodal lightweight speech recognition | 0.05 | Optional. Counted as one call per interaction round |
| Voice interaction | Standard speech recognition | 0.75 | Optional. Counted as one call per interaction round |
| Voice interaction | Multimodal lightweight speech synthesis | 0.09 | Optional. Counted as one call per interaction round |
| Voice interaction | Standard speech synthesis | 1.70 | Optional. Counted as one call per interaction round |
| Intent recognition | Intent recognition (optional) | 0.80 | Counted as one call per interaction round |
| LLM conversation (choose one per interaction round) | Casual chat (includes built-in plugins, instructions, music radio, multimodal notes, map and travel) | 2.20 | Counted as one call per interaction round |
| LLM conversation (choose one per interaction round) | Knowledge base Q&A | 3.70 | Counted as one call per interaction round. For fee details, see Knowledge base billing details. |
| LLM conversation (choose one per interaction round) | Voice translation | 5.70 | Counted as one call per Agent launch |
| LLM conversation (choose one per interaction round) | News radio/children’s stories | 0.02 | Counted as one call per Agent launch |
| LLM conversation (choose one per interaction round) | Photo Q&A – Balanced Edition | 0.18 | Counted as one call per interaction round. Valid only in direct connection mode. Image size must not exceed 640×480. For higher-resolution scenarios, use Photo Q&A Premium Edition. |
| LLM conversation (choose one per interaction round) | Video conversation/photo translation/photo Q&A – Premium Edition | 12.00 | Counted as one call per interaction round |
| LLM conversation (choose one per interaction round) | Proactive tour guide | 1.20 | Counted as one call per interaction round. Each proactive image analysis also counts as one interaction. |
| LLM conversation (choose one per interaction round) | Meeting minutes from audio | Billed through Tongyi Tingwu | For details, see Product Overview. |
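As a rough illustration of the "pay only for the components used" rule, the sketch below sums the standard prices for the components in one interaction round. The dictionary keys and the function name are hypothetical labels chosen for this example, not identifiers from the product; the prices are the standard prices listed in the table above.

```python
from decimal import Decimal

# Standard prices in CNY per 1,000 calls, taken from the pricing table.
# The keys are hypothetical labels used only in this sketch.
STANDARD_PRICES = {
    "asr_lightweight": Decimal("0.05"),
    "asr_standard": Decimal("0.75"),
    "tts_lightweight": Decimal("0.09"),
    "tts_standard": Decimal("1.70"),
    "intent_recognition": Decimal("0.80"),
    "casual_chat": Decimal("2.20"),
    "knowledge_base_qa": Decimal("3.70"),
}


def cost_per_1000_rounds(components):
    """Sum the standard price of every component used in a round.

    Components that are not listed are simply not billed.
    """
    return sum((STANDARD_PRICES[name] for name in components), Decimal("0"))


standard_voice_chat = cost_per_1000_rounds(
    ["asr_standard", "intent_recognition", "tts_standard", "casual_chat"]
)
text_only_chat = cost_per_1000_rounds(["intent_recognition", "casual_chat"])

print(standard_voice_chat)  # 5.45
print(text_only_chat)       # 3.00
```

Using `Decimal` rather than floats keeps the sums exact, which matters when the per-call prices are fractions of a cent.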
Special cases:
- News radio/children’s stories: These are persistent broadcasting Agents. After launch, using news radio or children’s stories triggers multiple speech synthesis charges, one per spoken sentence.
- Voice translation: Billed once per Agent entry, regardless of translation length.
- Web search and long-term memory are free for a limited time.
- In postpaid mode, some premium models cost more and may count as more than one billing unit per interaction. Choosing a more expensive model increases the billing multiplier, as shown below:
| Component | Base billable item (standard price in CNY per 1,000 calls) | Model | Billing multiplier |
| --- | --- | --- | --- |
| Speech recognition | Speech recognition | Fun-ASR, Qwen3-ASR-Flash-Realtime | 3x |
| Speech synthesis | Speech synthesis | CosyVoice-v3-Plus, Qwen3-TTS series | 3x |
| LLM conversation | Casual chat and plugins | Qwen3.7-Plus, Qwen3.6-Plus, Qwen-Max, Qwen3-Coder-Plus | 2x |
| LLM conversation | Casual chat and plugins | DeepSeek series (except deepseek-v4-pro), GLM, Kimi, MiniMax, Qwen3.5-Omni-Flash | 4x |
| LLM conversation | Casual chat and plugins | Qwen3.6-Max | 6x |
| LLM conversation | Casual chat and plugins | Qwen3.7-Max, Qwen3.5-Omni-Plus (text output), deepseek-v4-pro | 8x |
| LLM conversation | Casual chat and plugins | Qwen3.5-Omni-Plus (audio output), farui-plus | 13x |
| LLM conversation | Casual chat and plugins | Qwen-Deep-Research | 32x |
| Knowledge base Q&A | Knowledge base Q&A | Same as the selected LLM | Matches the billing multiplier of the selected LLM |
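A minimal sketch of how the multiplier might be applied, assuming that an "Nx" multiplier means one interaction round is counted as N calls of the base billable item at its standard price. That reading, the function name, and the choice to treat unlisted models as 1x are assumptions made for illustration; only a few models from the table are included.

```python
from decimal import Decimal

# Standard casual chat price in CNY per 1,000 calls (from the pricing table).
CASUAL_CHAT_BASE = Decimal("2.20")

# A subset of the billing multipliers listed in the table above.
LLM_MULTIPLIERS = {
    "Qwen-Max": 2,
    "Qwen3.6-Max": 6,
    "Qwen-Deep-Research": 32,
}


def casual_chat_price(model):
    """Return the casual chat price per 1,000 rounds for a model.

    Assumption: models missing from the multiplier table bill at 1x.
    """
    return CASUAL_CHAT_BASE * LLM_MULTIPLIERS.get(model, 1)


print(casual_chat_price("Qwen-Max"))            # 4.40
print(casual_chat_price("Qwen-Deep-Research"))  # 70.40
```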
Knowledge base billing details
Knowledge base Q&A involves two separate fees charged by different services:
| Fee item | Fee content | Reason for charge | Billed by |
| --- | --- | --- | --- |
| LLM inference fee | CNY 3.70 per 1,000 calls | Knowledge base snippets are appended to the prompt, increasing token count processed by the LLM | Multimodal Interactive Development Suite |
| Knowledge base service fee | Billed separately | Costs for knowledge base operation, vectorization, and retrieval | Knowledge base service (for details, see Knowledge Base Billing Details) |
Key distinction:
- The CNY 3.70 per 1,000 calls pays the LLM for processing extra content.
- The knowledge base fee pays the knowledge base service for running and retrieving data.
Typical scenario cost estimates
Examples of common interaction scenarios:
| Scenario | Estimated cost (CNY per 1,000 calls) | Cost breakdown | Notes |
| --- | --- | --- | --- |
| Standard voice casual chat | 5.45 | 0.75 (standard speech recognition) + 0.8 (intent recognition) + 1.7 (standard speech synthesis) + 2.2 (casual chat, may include plugins, instructions, web search) | Counted as one call per interaction round |
| Lightweight voice casual chat | 2.34 | 0.05 (lightweight speech recognition) + 0.09 (lightweight speech synthesis) + 2.2 (casual chat) | Counted as one call per interaction round. Without intent recognition, plugins, instructions, web search, and Agents are unavailable. |
| Plain text conversation | 3.0 | 0.8 (intent recognition) + 2.2 (casual chat, may include plugins, instructions, web search) | Uses no speech recognition or synthesis. Input and output are plain text only. |
| Knowledge base retrieval | 6.25 | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 3.7 (knowledge base retrieval) | Counted as one call per interaction round |
| Voice translation | 8.25 | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 5.7 (voice translation) | Counted as one call per voice translation session |
| Real-time video conversation/photo Q&A | 14.55 | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 12 (visual understanding) | Counted as one call per interaction round |
| News radio/children’s stories | Calculated based on number of spoken sentences | 0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) × n (number of sentences) + 0.02 (news radio) | Cost includes: speech recognition + intent recognition + one-time launch + speech synthesis. Each spoken sentence triggers a speech synthesis call, billed per sentence. |
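The news radio/children’s stories formula in the last row is the only estimate that depends on a variable. As a hedged sketch, it can be written as a function of n, the number of spoken sentences per launch. The function name is hypothetical, and the result is read as CNY per 1,000 launches that each speak n sentences, which is an interpretation of the table rather than something it states outright.

```python
from decimal import Decimal

ASR_LIGHTWEIGHT = Decimal("0.05")  # lightweight speech recognition
INTENT = Decimal("0.80")           # intent recognition
TTS_STANDARD = Decimal("1.70")     # standard speech synthesis, billed per sentence
NEWS_RADIO_LAUNCH = Decimal("0.02")  # one-time Agent launch


def news_radio_cost_per_1000(sentences):
    """Estimate CNY per 1,000 launches, each speaking `sentences` sentences."""
    if sentences < 0:
        raise ValueError("sentences must be non-negative")
    return ASR_LIGHTWEIGHT + INTENT + TTS_STANDARD * sentences + NEWS_RADIO_LAUNCH


print(news_radio_cost_per_1000(1))   # 2.57
print(news_radio_cost_per_1000(10))  # 17.87
```

Because speech synthesis is charged once per sentence, it dominates the total as soon as a broadcast runs longer than a sentence or two.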
Prepaid (device subscription)
The Multimodal Interactive Development Suite offers a prepaid device subscription model, in which you subscribe to an AI interaction resource plan per device per year. After selecting a tier, you receive a unified resource pool that covers the full pipeline (speech recognition, LLM conversation, and speech synthesis) and all extended features.
Tier options
Select a tier based on your business needs.
| Tier | Price (CNY per device per year) |
| --- | --- |
| Trial Edition | 2 |
| Basic Edition | 5 |
| Standard Edition | 10 |
Billing rules
- Minimum purchase: 100 devices.
- Resource pool mechanism: Purchase grants a unified resource pool. Because different features consume varying AI resources, actual available interaction counts depend on usage distribution. The purchase page includes a calculator to estimate available interactions based on feature configuration (for reference only; actual usage depends on feature distribution).
- Add-on purchases: You can purchase additional subscriptions for the same application or workspace. Resource pool quotas are combined and shared.
- Shared mode: At purchase, bind the subscription to a workspace. Multiple applications within the same workspace share the same resource pool.
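The tier prices and the minimum purchase rule can be combined into a quick one-year cost estimate. This is a sketch only: the function name is hypothetical, and it assumes the 100-device minimum applies to every order, which the rules above do not state explicitly for add-on purchases.

```python
# Tier prices in CNY per device per year, from the tier table.
TIER_PRICES = {
    "Trial Edition": 2,
    "Basic Edition": 5,
    "Standard Edition": 10,
}
MIN_DEVICES = 100  # minimum purchase quantity


def subscription_cost(tier, devices):
    """Return the one-year cost in CNY for `devices` subscriptions of `tier`."""
    if devices < MIN_DEVICES:
        raise ValueError(f"minimum purchase is {MIN_DEVICES} devices")
    return TIER_PRICES[tier] * devices


print(subscription_cost("Basic Edition", 500))  # 2500
```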
Activation and validity
- Each License must be activated within one year.
- Usage period starts on the device activation date:
  - Example: Order 100 one-year device subscriptions on July 1, 2026, and activate them on October 1, 2026. The usage period ends on October 1, 2027.
- Refund policy: No refunds after purchase.
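The example above can be reproduced with a small date calculation. The sketch assumes the usage period ends on the same calendar day one year after activation, as in the example; the function name is hypothetical, and the handling of a February 29 activation is an assumption, since the source does not cover it.

```python
from datetime import date


def usage_period_end(activation, years=1):
    """Return the end date of a subscription activated on `activation`."""
    try:
        return activation.replace(year=activation.year + years)
    except ValueError:
        # February 29 has no counterpart in a non-leap year.
        # Assumption: fall back to February 28.
        return activation.replace(year=activation.year + years, day=28)


print(usage_period_end(date(2026, 10, 1)))  # 2027-10-01
```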
Savings plan
The savings plan is a discount program that offsets all pay-as-you-go charges for multimodal interactions. Commit to a spending amount to receive a discount, reducing costs by 5%–50% compared to standard pay-as-you-go rates, depending on the committed amount.
| Committed spend (CNY) | Discount | Validity |
| --- | --- | --- |
| 20 ≤ value < 100 | 0.95 | 3 months |
| 100 ≤ value ≤ 2,000 | 0.90 | 1 year |
| 2,000 < value ≤ 5,000 | 0.85 | 1 year |
| 5,000 < value ≤ 20,000 | 0.80 | 1 year |
| 20,000 < value ≤ 50,000 | 0.75 | 1 year |
| 50,000 < value ≤ 100,000 | 0.70 | 1 year |
| 100,000 < value ≤ 200,000 | 0.65 | 1 year |
| 200,000 < value ≤ 300,000 | 0.60 | 1 year |
| 300,000 < value ≤ 400,000 | 0.55 | 1 year |
| 400,000 < value ≤ 500,000 | 0.50 | 1 year |
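The tier boundaries are easy to misread, so the sketch below encodes them as a lookup that returns the discount and validity for a committed spend. The function name is hypothetical. Note that CNY 100 falls in the 0.90 tier and that amounts outside CNY 20 to 500,000 are rejected because the table publishes no tier for them.

```python
from decimal import Decimal

# (inclusive upper bound in CNY, discount) for the one-year tiers.
ONE_YEAR_TIERS = [
    (2_000, Decimal("0.90")),
    (5_000, Decimal("0.85")),
    (20_000, Decimal("0.80")),
    (50_000, Decimal("0.75")),
    (100_000, Decimal("0.70")),
    (200_000, Decimal("0.65")),
    (300_000, Decimal("0.60")),
    (400_000, Decimal("0.55")),
    (500_000, Decimal("0.50")),
]


def savings_plan_terms(committed_spend):
    """Return (discount, validity) for a committed spend in CNY."""
    if committed_spend < 20 or committed_spend > 500_000:
        raise ValueError("committed spend is outside the published tiers")
    if committed_spend < 100:
        return Decimal("0.95"), "3 months"
    for upper_bound, discount in ONE_YEAR_TIERS:
        if committed_spend <= upper_bound:
            return discount, "1 year"
    raise ValueError("committed spend is outside the published tiers")


print(savings_plan_terms(99))     # (Decimal('0.95'), '3 months')
print(savings_plan_terms(2_001))  # (Decimal('0.85'), '1 year')
```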
How to purchase
Buy General-purpose Savings Plan
- Payment type: One-time upfront payment
- Validity: 3 months for a committed spend from CNY 20 to less than CNY 100; 1 year for CNY 100 and above. Unused balance is not refunded upon expiration.