Billing - Alibaba Cloud Model Studio - Alibaba Cloud Help Center

Billing overview

The Multimodal Interactive Development Suite supports postpaid (pay-as-you-go, savings plan) and prepaid (device subscription) billing methods. If you use the postpaid method, keep your account balance sufficient to avoid service disruption due to overdue payment. Device subscription (License mode) currently applies to RTOS SDK, Linux SDK, Android SDK, and iOS SDK.

Billing method	Description
Postpaid (pay-as-you-go)	Billed per actual interaction round
Prepaid (device subscription/License)	Subscribe to an AI interaction resource plan per device per year
Savings plan	Commit to a spending amount to get a discount that offsets all postpaid charges

Free trial

New users receive a one-time CNY 10 trial quota for the postpaid method, claimed from the free trial page. The quota is valid for three months after it is claimed.

Postpaid (pay-as-you-go)

Billing rules

The multimodal interaction pipeline includes four billable items: speech recognition (ASR), intent recognition, large language model (LLM) conversation, and speech synthesis (TTS). Each interaction round is billed separately based on the components actually used. Unused components are not billed.

Speech recognition: Supports two real-time speech recognition models. Not billed when unused.
Speech synthesis: Supports two speech synthesis options. Not billed when unused.
Intent recognition: Classifies user intent and routes it to downstream modules. Not billed when unused.
LLM conversation: Includes casual chat (with plugins, instructions, and web search), knowledge base Q&A, and various Agents. Billed per capability invoked per round.

When you invoke Agents or plugins from Alibaba Cloud Model Studio, their fees are settled directly by Model Studio. The Multimodal Interactive Development Suite adds no additional charge.

Billable items and standard pricing

Pricing depends on whether each component is used and which model or capability is selected. See the table below for details:

Interaction pipeline	Billable item	Standard price (CNY per 1,000 calls)	Notes
Voice interaction	Multimodal lightweight speech recognition	0.05	Optional. Counted as one call per interaction round
	Standard speech recognition	0.75	Optional. Counted as one call per interaction round
	Multimodal lightweight speech synthesis	0.09	Optional. Counted as one call per interaction round
	Standard speech synthesis	1.70	Optional. Counted as one call per interaction round
Intent recognition	Intent recognition (optional)	0.80	Counted as one call per interaction round
LLM conversation (choose one per interaction round)	Casual chat (includes built-in plugins, instructions, music radio, multimodal notes, map and travel)	2.20	Counted as one call per interaction round
	Knowledge base Q&A	3.70	Counted as one call per interaction round. Fee details: this price covers LLM inference only. Retrieved knowledge base snippets are injected into the prompt, which increases the token count; that increase is already included in the CNY 3.70 per 1,000 calls. The knowledge base service’s own specification fee and model invocation fee are billed separately by Model Studio and are not included in the CNY 3.70 per 1,000 calls.
	Voice translation	5.70	Counted as one call per Agent launch
	News radio/children’s stories	0.02	Counted as one call per Agent launch
	Photo Q&A – Balanced Edition	0.18	Counted as one call per interaction round. Valid only in direct connection mode. Image size must not exceed 640×480. For higher-resolution scenarios, use Photo Q&A Premium Edition.
	Video conversation/photo translation/photo Q&A – Premium Edition	12.00	Counted as one call per interaction round
	Proactive tour guide	1.20	Counted as one call per interaction round. Each proactive image analysis also counts as one interaction.
	Meeting minutes from audio	Billed through Tongyi Tingwu	For details, see Product Overview.
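The per-round rule (only the components actually used are billed) can be sketched as a simple sum over the standard prices. This is a minimal illustration, not an official calculator: the dictionary keys and the function name `cost_per_1000_rounds` are hypothetical, and only a subset of the billable items is included.

```python
from decimal import Decimal

# Standard prices in CNY per 1,000 calls, copied from the pricing table.
# The key names are illustrative and are not product identifiers.
STANDARD_PRICES = {
    "asr_lightweight": Decimal("0.05"),
    "asr_standard": Decimal("0.75"),
    "tts_lightweight": Decimal("0.09"),
    "tts_standard": Decimal("1.70"),
    "intent_recognition": Decimal("0.80"),
    "casual_chat": Decimal("2.20"),
    "knowledge_base_qa": Decimal("3.70"),
    "voice_translation": Decimal("5.70"),
    "visual_premium": Decimal("12.00"),
}


def cost_per_1000_rounds(components):
    """Sum the standard prices of the components used in one interaction round.

    Components that are not listed are not used and therefore not billed.
    """
    return sum((STANDARD_PRICES[name] for name in components), Decimal("0"))


# Standard voice casual chat: ASR + intent recognition + TTS + casual chat.
print(cost_per_1000_rounds(
    ["asr_standard", "intent_recognition", "tts_standard", "casual_chat"]
))  # 5.45
```

`Decimal` is used instead of `float` so that the sums match the published figures exactly.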

Special cases:

News radio/children’s stories: Persistent broadcasting Agents. After launch, using news radio or children’s stories triggers multiple speech synthesis charges—one per spoken sentence.
Voice translation: Billed once per Agent entry, regardless of translation length.
Web search and long-term memory are free for a limited time.
In postpaid mode, some premium models cost more and may count as more than one billing unit per interaction. Choosing a more expensive model increases the billing multiplier, as shown below:

Component	Base specification (CNY per 1,000 calls)	Model	Billing multiplier
Speech recognition	Speech recognition	Fun-ASR, Qwen3-ASR-Flash-Realtime	3x
Speech synthesis	Speech synthesis	CosyVoice-v3-Plus, Qwen3-TTS series	3x
LLM conversation	Casual chat and plugins	Qwen3.7-Plus, Qwen3.6-Plus, Qwen-Max, Qwen3-Coder-Plus	2x
		DeepSeek series (except deepseek-v4-pro), GLM, Kimi, MiniMax, Qwen3.5-Omni-Flash	4x
		Qwen3.6-Max	6x
		Qwen3.7-Max, Qwen3.5-Omni-Plus (text output), deepseek-v4-pro	8x
		Qwen3.5-Omni-Plus (audio output), farui-plus	13x
		Qwen-Deep-Research	32x
Knowledge base Q&A	Knowledge base Q&A	Same billing multiplier as the selected LLM
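One way to read the multiplier table is that the effective price equals the base specification price times the multiplier of the selected model. The sketch below assumes exactly that interpretation. The function name `effective_price` is hypothetical, only a few models are listed, and treating unlisted models as 1x is an assumption.

```python
from decimal import Decimal

# Billing multipliers for a few LLM conversation models, from the multiplier table.
LLM_MULTIPLIERS = {
    "Qwen-Max": 2,
    "Qwen3.6-Max": 6,
    "Qwen3.7-Max": 8,
    "Qwen-Deep-Research": 32,
}


def effective_price(base_price, model, multipliers=LLM_MULTIPLIERS):
    """Return the base price (CNY per 1,000 calls) times the model's multiplier.

    Assumption: a model that is not in the table counts as one billing unit.
    """
    return Decimal(base_price) * multipliers.get(model, 1)


# Casual chat (CNY 2.20 per 1,000 calls) with a 6x model.
print(effective_price("2.20", "Qwen3.6-Max"))  # 13.20
```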

Knowledge base billing details

Knowledge base Q&A involves two separate fees charged by different services:

Fee item	Fee content	Reason for charge	Billed by
LLM inference fee	CNY 3.70 per 1,000 calls	Knowledge base snippets are appended to the prompt, increasing token count processed by the LLM	Multimodal Interactive Development Suite
Knowledge base service fee	Billed separately	Costs for knowledge base operation, vectorization, and retrieval	Knowledge base service (for details, see Knowledge Base Billing Details)

Key distinction:

The CNY 3.70 per 1,000 calls pays the LLM for processing extra content.
The knowledge base fee pays the knowledge base service for running and retrieving data.

Typical scenario cost estimates

Examples of common interaction scenarios:

Scenario	Estimated cost (CNY per 1,000 calls)	Cost breakdown	Notes
Standard voice casual chat	5.45	0.75 (standard speech recognition) + 0.8 (intent recognition) + 1.7 (standard speech synthesis) + 2.2 (casual chat, may include plugins, instructions, web search)	Counted as one call per interaction round
Lightweight voice casual chat	2.34	0.05 (lightweight speech recognition) + 0.09 (lightweight speech synthesis) + 2.2 (casual chat)	Counted as one call per interaction round. Because intent recognition is not used, plugins, instructions, web search, and Agents are unavailable.
Plain text conversation	3.0	0.8 (intent recognition) + 2.2 (casual chat, may include plugins, instructions, web search)	Uses no speech recognition or synthesis—input and output are plain text only.
Knowledge base retrieval	6.25	0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 3.7 (knowledge base retrieval)	Counted as one call per interaction round
Voice translation	8.25	0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 5.7 (voice translation)	Counted as one call per voice translation session
Real-time video conversation/photo Q&A	14.55	0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) + 12 (visual understanding)	Counted as one call per interaction round
News radio/children’s stories	Calculated based on number of spoken sentences	0.05 (lightweight speech recognition) + 0.8 (intent recognition) + 1.7 (speech synthesis) × n (number of sentences) + 0.02 (news radio)	Cost includes: speech recognition + intent recognition + one-time launch + speech synthesis. Each spoken sentence triggers a speech synthesis call, billed per sentence.
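The news radio/children’s stories row is the only scenario whose cost depends on a variable, the number of spoken sentences n. A minimal sketch of that formula follows; the function name `news_radio_cost_per_1000` is hypothetical, and the component choice (lightweight speech recognition, standard speech synthesis) simply mirrors the breakdown in the table.

```python
from decimal import Decimal

ASR_LIGHTWEIGHT = Decimal("0.05")     # speech recognition, per 1,000 calls
INTENT_RECOGNITION = Decimal("0.80")  # intent recognition, per 1,000 calls
TTS_STANDARD = Decimal("1.70")        # speech synthesis, per 1,000 calls
AGENT_LAUNCH = Decimal("0.02")        # news radio / children's stories launch


def news_radio_cost_per_1000(sentences):
    """Cost in CNY per 1,000 sessions when each session speaks `sentences` sentences."""
    if sentences < 0:
        raise ValueError("sentences must be non-negative")
    # Speech synthesis is billed once per spoken sentence; everything else once.
    return ASR_LIGHTWEIGHT + INTENT_RECOGNITION + TTS_STANDARD * sentences + AGENT_LAUNCH


print(news_radio_cost_per_1000(10))  # 17.87
```

Speech synthesis dominates as soon as a session speaks more than a sentence or two, so the sentence count is the figure to estimate carefully.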

Prepaid (device subscription)

The Multimodal Interactive Development Suite offers a prepaid device subscription model. Subscribe to an AI interaction resource plan per device per year. After selecting a tier, you receive a unified resource pool covering the full pipeline—including speech recognition, LLM conversation, speech synthesis—and all extended features.

Tier options

Select a tier based on your business needs.

Tier	Price (CNY per device per year)
Trial Edition	2
Basic Edition	5
Standard Edition	10

Billing rules

Minimum purchase: 100 devices.
Resource pool mechanism: Purchase grants a unified resource pool. Because different features consume varying AI resources, actual available interaction counts depend on usage distribution. The purchase page includes a calculator to estimate available interactions based on feature configuration (for reference only; actual usage depends on feature distribution).
Add-on purchases: You can purchase additional subscriptions for the same application or workspace. Resource pool quotas are combined and shared.
Shared mode: At purchase, bind the subscription to a workspace. Multiple applications within the same workspace share the same resource pool.
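The tier prices and the minimum purchase rule can be combined into a quick order estimate. This is a sketch under stated assumptions: the function name `subscription_cost` is hypothetical, the 100-device minimum is assumed to apply to each purchase, and no volume discount is modeled because none is described.

```python
# Tier prices in CNY per device per year, from the tier table.
TIER_PRICES = {
    "Trial Edition": 2,
    "Basic Edition": 5,
    "Standard Edition": 10,
}

MIN_DEVICES = 100  # minimum purchase, assumed here to apply per order


def subscription_cost(tier, devices):
    """Return the total price in CNY for one year of `devices` subscriptions."""
    if devices < MIN_DEVICES:
        raise ValueError(f"minimum purchase is {MIN_DEVICES} devices")
    return TIER_PRICES[tier] * devices


print(subscription_cost("Basic Edition", 100))  # 500
```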

Activation and validity

Each License must be activated within one year of purchase.
Usage period starts on the device activation date:
- Example: Order 100 one-year device subscriptions on July 1, 2026, and activate them on October 1, 2026. The usage period ends on October 1, 2027.
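The date arithmetic in the example can be sketched as follows. The function name `usage_period_end` is hypothetical, and the handling of a February 29 activation (falling back to February 28) is an assumption, since the source does not cover that case.

```python
from datetime import date


def usage_period_end(activation: date) -> date:
    """Return the end of a one-year usage period that starts on `activation`."""
    try:
        return activation.replace(year=activation.year + 1)
    except ValueError:
        # Feb 29 has no counterpart in the following year.
        # Falling back to Feb 28 is an assumption, not a documented rule.
        return activation.replace(year=activation.year + 1, day=28)


# Ordered on July 1, 2026 and activated on October 1, 2026.
print(usage_period_end(date(2026, 10, 1)))  # 2027-10-01
```

The order date does not enter the calculation; only the activation date determines when the period ends.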

Refund policy: No refunds after purchase.

Savings plan

The savings plan is a discount program that offsets all pay-as-you-go charges for multimodal interactions. Commit to a spending amount to receive a discount, reducing costs by 5%–50% compared to standard pay-as-you-go rates, depending on the tier in the table below.

Committed spend (CNY)	Discount	Validity
20 ≤ value < 100	0.95	3 months
100 ≤ value ≤ 2,000	0.90	1 year
2,000 < value ≤ 5,000	0.85	1 year
5,000 < value ≤ 20,000	0.80	1 year
20,000 < value ≤ 50,000	0.75	1 year
50,000 < value ≤ 100,000	0.70	1 year
100,000 < value ≤ 200,000	0.65	1 year
200,000 < value ≤ 300,000	0.60	1 year
300,000 < value ≤ 400,000	0.55	1 year
400,000 < value ≤ 500,000	0.50	1 year
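The tier boundaries mix inclusive and exclusive limits, so a lookup helper makes the edges explicit. The sketch below only returns the published discount value and validity for a committed spend; the function name `savings_plan_terms` is hypothetical, and rejecting amounts outside CNY 20 to 500,000 is an assumption based on the table listing no other tiers.

```python
from decimal import Decimal

# (inclusive upper bound in CNY, discount) for the one-year tiers, in ascending order.
ONE_YEAR_TIERS = [
    (Decimal("2000"), Decimal("0.90")),
    (Decimal("5000"), Decimal("0.85")),
    (Decimal("20000"), Decimal("0.80")),
    (Decimal("50000"), Decimal("0.75")),
    (Decimal("100000"), Decimal("0.70")),
    (Decimal("200000"), Decimal("0.65")),
    (Decimal("300000"), Decimal("0.60")),
    (Decimal("400000"), Decimal("0.55")),
    (Decimal("500000"), Decimal("0.50")),
]


def savings_plan_terms(committed_spend):
    """Return (discount, validity) for a committed spend in CNY."""
    amount = Decimal(str(committed_spend))
    if amount < 20 or amount > 500000:
        raise ValueError("committed spend is outside the published tiers")
    if amount < 100:
        # 20 <= value < 100 is the only tier with a three-month validity.
        return Decimal("0.95"), "3 months"
    for upper, discount in ONE_YEAR_TIERS:
        if amount <= upper:
            return discount, "1 year"


print(savings_plan_terms(100))   # (Decimal('0.90'), '1 year')
print(savings_plan_terms(2000))  # (Decimal('0.90'), '1 year')
```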

How to purchase

Buy General-purpose Savings Plan

Payment type: One-time upfront payment
Validity: 3 months for a committed spend from CNY 20 to below CNY 100; 1 year for CNY 100 and above (unused balance is not refunded upon expiration)