Choose a model for real-time speech recognition, audio file transcription, and other speech-to-text scenarios.
Use the four "Selection criteria"—real-time vs. offline, domain terminology, speaker diarization, and emotion recognition—to identify the right model. Then check "Recommended models" for top picks per scenario, "All models" for the full version list by model family, and "Audio specifications" for input constraints. Supported languages (including dialects) are listed under each model family in "All models".
Selection criteria
Evaluate the following four dimensions to narrow down the right model for your use case.
Real-time or offline?
Real-time recognition outputs results while the user is still speaking. Offline recognition transcribes a pre-recorded audio file after the recording ends.
-
Real-time (streaming recognition): Uses a WebSocket connection to stream audio in and text out. Ideal for live captions, voice assistants, and meeting transcription. Recommended models:
fun-asr-realtime(hot words, dialect support) orqwen3.5-omni-plus-realtime(prompt context, multilingual). -
Offline (file transcription): Uses an HTTP API to submit audio files and retrieve transcription results. Suited for call center recordings, podcasts, and interviews. Recommended models:
fun-asr(hot words, speaker diarization) orqwen3.5-omni-plus(prompt context, multilingual).
Fun-ASR and Qwen-ASR real-time models support the DashScope SDK (Java, Python). Fun-ASR models also support Android and iOS SDKs. All other models require direct WebSocket or HTTP API calls.
For real-time integration, see Real-time speech recognition. For offline integration, see Non-real-time speech recognition.
Domain terminology
Two approaches, ranked by flexibility:
-
Prompt context injection: Describe your domain in the system prompt. No setup required—the model adapts on each request. Trade-off: slightly higher per-request latency than dedicated ASR models. Use
qwen3.5-omni-plus-realtime(real-time) orqwen3.5-omni-plus(offline). -
Hot words: Supply a weighted vocabulary list. Best for stable, infrequently changing terminology. Use
fun-asr-realtime(real-time) orfun-asr(offline).
Qwen3.5-Omni isn't a traditional ASR engine—it's a large language model that understands audio. Inject context through the system prompt; the model adapts without a hot word list.
Speaker diarization
Only the Fun-ASR offline models (fun-asr and fun-asr-mtl) support speaker diarization. Use these models when you need to identify "who said what."
Emotion recognition
Qwen-ASR and Qwen3.5-Omni models detect emotions alongside transcription. Recommended models: qwen3-asr-flash-realtime (real-time) or qwen3-asr-flash-filetrans (offline).
Recommended models
The following table lists top picks for common scenarios. For full details, visit Model Studio.
|
Model ID |
Mode |
API |
Accuracy boost |
Emotion recognition |
Speaker diarization |
Languages |
Max audio duration/size |
|
|
Real-time |
WebSocket |
Hot words |
|
|
Chinese, English, Japanese, and dialects |
Unlimited |
|
|
Offline |
HTTP |
Hot words |
|
|
Multilingual with dialects |
12 hours / 2 GB |
|
|
Real-time |
WebSocket |
Prompt context |
|
|
Multilingual |
2 hours |
|
|
Offline |
HTTP (OpenAI-compatible) |
Prompt context |
|
|
Multilingual |
3 hours / 2 GB |
All models
Fun-ASR
|
Model ID |
Mode |
API |
Accuracy boost |
Emotion recognition |
Speaker diarization |
Max audio duration/size |
|
|
|
Real-time |
WebSocket |
Hot words |
|
|
Chinese, English, Japanese, and dialects |
Unlimited |
|
|
Real-time |
WebSocket |
Hot words |
|
|
Chinese, English, Japanese, and dialects |
Unlimited |
|
|
Real-time |
WebSocket |
Hot words |
|
|
Chinese, English, Japanese, and dialects |
Unlimited |
|
|
Real-time |
WebSocket |
Hot words |
|
|
Chinese, English |
Unlimited |
|
|
Real-time |
WebSocket |
Hot words |
|
|
Chinese |
Unlimited |
|
|
Real-time |
WebSocket |
Hot words |
|
|
Chinese |
Unlimited |
|
|
Offline |
HTTP |
Hot words |
|
|
Multilingual with dialects |
12 hours / 2 GB |
|
|
Offline |
HTTP |
Prompt context |
|
|
Multilingual with dialects |
5 minutes / 2 GB |
|
|
Offline |
HTTP |
Hot words |
|
|
Multilingual with dialects |
12 hours / 2 GB |
|
|
Offline |
HTTP |
Hot words |
|
|
Chinese, English |
12 hours / 2 GB |
|
|
Offline |
HTTP |
Hot words |
|
|
Multilingual with dialects |
12 hours / 2 GB |
|
|
Offline |
HTTP |
Hot words |
|
|
Multilingual with dialects |
12 hours / 2 GB |
Supported languages (by version):
-
Fun-ASR-Realtime main versions (
fun-asr-realtime,fun-asr-realtime-2026-02-28,fun-asr-realtime-2025-11-07): Chinese (Mandarin, Cantonese, Wu, Hokkien, Hakka, Gan, Xiang, Jin; plus regional accents including Central Plains, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan—covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia accents), English, Japanese -
fun-asr-realtime-2025-09-15: Chinese (Mandarin), English -
Fun-ASR-Flash-8K-Realtime (
fun-asr-flash-8k-realtime,fun-asr-flash-8k-realtime-2026-01-28): Chinese -
Fun-ASR / Fun-ASR-MTL main versions (
fun-asr,fun-asr-2025-11-07): Chinese (Mandarin, Cantonese, Wu, Hokkien, Hakka, Gan, Xiang, Jin; plus regional accents including Central Plains, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan—covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia accents), English, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Filipino, Hindi, Arabic, French, German, Spanish, Portuguese, Russian, Italian, Dutch, Swedish, Danish, Finnish, Norwegian, Greek, Polish, Czech, Hungarian, Romanian, Bulgarian, Croatian, Slovak -
Fun-ASR-Flash (
fun-asr-flash-2026-06-15): Chinese (Mandarin, Cantonese, Wu, Hokkien, Hakka, Gan, Xiang, Jin; plus regional accents including Central Plains, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan—covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia accents), English, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Filipino, Hindi, Arabic, French, German, Spanish, Portuguese, Russian, Italian, Dutch, Swedish, Danish, Finnish, Norwegian, Greek, Polish, Czech, Hungarian, Romanian, Bulgarian, Croatian, Slovak -
fun-asr-2025-08-25: Chinese (Mandarin), English -
Fun-ASR-MTL (
fun-asr-mtl,fun-asr-mtl-2025-08-25): Chinese (Mandarin, Cantonese), English, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Filipino, Hindi, Arabic, French, German, Spanish, Portuguese, Russian, Italian, Dutch, Swedish, Danish, Finnish, Norwegian, Greek, Polish, Czech, Hungarian, Romanian, Bulgarian, Croatian, Slovak
Qwen-ASR
|
Model ID |
Mode |
API |
Accuracy boost |
Emotion recognition |
Speaker diarization |
Max audio duration/size |
|
|
|
Real-time |
WebSocket |
|
|
|
Multilingual with dialects |
Unlimited |
|
|
Real-time |
WebSocket |
|
|
|
Multilingual with dialects |
Unlimited |
|
|
Real-time |
WebSocket |
|
|
|
Multilingual with dialects |
Unlimited |
|
|
Offline |
HTTP |
|
|
|
Multilingual with dialects |
12 hours / 2 GB |
|
|
Offline |
HTTP |
|
|
|
Multilingual with dialects |
12 hours / 2 GB |
|
|
Offline |
HTTP (OpenAI-compatible) |
|
|
|
Multilingual with dialects |
5 minutes / 10 MB |
|
|
Offline |
HTTP (OpenAI-compatible) |
|
|
|
Multilingual with dialects |
5 minutes / 10 MB |
|
|
Offline |
HTTP (OpenAI-compatible) |
|
|
|
Multilingual with dialects |
5 minutes / 10 MB |
Supported languages: All Qwen-ASR models (qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, qwen3-asr-flash, and their snapshot versions) share the same language list: Chinese (Mandarin, Sichuanese, Hokkien, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, Czech, Danish, Filipino, Finnish, Icelandic, Malay, Norwegian, Polish, Swedish.
Qwen3.5-Omni / Qwen3-Omni
|
Model ID |
Mode |
API |
Accuracy boost |
Emotion recognition |
Speaker diarization |
Languages |
Max audio duration/size |
|
|
Real-time |
WebSocket |
Prompt context |
|
|
Multilingual |
2 hours |
|
|
Offline |
HTTP (OpenAI-compatible) |
Prompt context |
|
|
Multilingual |
3 hours / 2 GB |
|
|
Real-time |
WebSocket |
Prompt context |
|
|
Multilingual |
2 hours |
|
|
Offline |
HTTP (OpenAI-compatible) |
Prompt context |
|
|
Multilingual |
3 hours / 2 GB |
|
|
Real-time |
WebSocket |
Prompt context |
|
|
Multilingual with dialects |
2 hours |
|
|
Offline |
HTTP (OpenAI-compatible) |
Prompt context |
|
|
Multilingual with dialects |
20 minutes / 100 MB |
Supported languages: Qwen3.5-Omni / Qwen3-Omni aren't dedicated ASR models. Refer to the user guide and API documentation of each model for supported languages.
Paraformer
Paraformer is an older-generation ASR model family with both real-time and offline variants. Migrate to Fun-ASR or Qwen-ASR when your workload allows.
|
Model ID |
API |
Description |
|
|
WebSocket |
Real-time recognition; Chinese, English, Japanese, Korean, German, French, Russian |
|
|
WebSocket |
Real-time recognition; Chinese, English, Japanese, Korean, German, French, Russian |
|
|
WebSocket |
Real-time recognition for 8 kHz telephony; Chinese |
|
|
WebSocket |
Real-time recognition for 8 kHz telephony; Chinese |
|
|
HTTP |
File transcription with speaker diarization; Chinese, English, Japanese, Korean, German, French, Russian |
|
|
HTTP |
File transcription for 8 kHz telephony; Chinese |
|
|
HTTP |
File transcription with speaker diarization; Chinese, English, Japanese, Korean, German, French, Russian |
|
|
HTTP |
File transcription for 8 kHz telephony; Chinese |
|
|
HTTP |
File transcription with speaker diarization; multilingual |
Supported languages (by version):
-
paraformer-realtime-v2,paraformer-v2: Chinese (Mandarin, Cantonese, Wu, Hokkien, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, Shanghai), English, Japanese, Korean, German, French, Russian -
paraformer-realtime-v1,paraformer-realtime-8k-v2,paraformer-realtime-8k-v1,paraformer-8k-v2,paraformer-8k-v1: Chinese (Mandarin) -
paraformer-v1: Chinese (Mandarin), English -
paraformer-mtl-v1: Chinese (Mandarin, Cantonese, Wu, Hokkien, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin), English, Japanese, Korean, Spanish, Indonesian, French, German, Italian, Malay
Others (deprecating)
The following models are scheduled for deprecation and listed for reference only. Migrate to the recommended models as soon as possible.
|
Model ID |
API |
Description |
|
|
WebSocket |
Real-time recognition; Chinese, English, and dialects |
|
|
WebSocket |
Short-audio real-time recognition (1-minute limit); multilingual |
|
|
HTTP |
File transcription; multilingual |
Audio specifications
The following tables list audio specifications—input method, format, sample rate, and size/duration limits—for real-time and offline modes. For supported languages (including dialects), see the corresponding model family subsection under "All models". Qwen3.5-Omni / Qwen3-Omni models aren't dedicated ASR models. For audio specifications, refer to the user guide and API documentation for each model.
Real-time
|
Model ID |
Input method |
Audio format |
Sample rate |
Size/duration |
|
Fun-ASR-Realtime / Fun-ASR-MTL-Realtime ( |
Binary stream |
|
Any |
Unlimited |
|
Fun-ASR-Flash-8K-Realtime ( |
Binary stream |
Same as Fun-ASR-Realtime |
8 kHz |
Unlimited |
|
Qwen-ASR-Realtime ( |
Binary stream |
|
8 kHz, 16 kHz |
Unlimited |
|
Paraformer-Realtime ( |
Binary stream |
Same as Fun-ASR-Realtime |
|
Unlimited |
All real-time models accept mono (single-channel) input only.
Offline
|
Model ID |
Input method |
Audio format |
Sample rate |
File size/duration |
|
Fun-ASR ( |
Publicly accessible file URL; 1 URL per request |
|
Any |
≤2 GB; ≤12 hours (≤2 hours recommended when speaker diarization is enabled) |
|
Fun-ASR-Flash ( |
URL / Base64; 1 file per request |
|
Any |
≤2 GB; ≤5 minutes |
|
Fun-ASR ( |
URL / Base64; 1 file per request |
|
Any |
≤2 GB; ≤5 minutes |
|
Paraformer ( |
Same as Fun-ASR |
Same as Fun-ASR |
paraformer-v2 /v1 any rate; |
Same as Fun-ASR |
|
Qwen3-ASR-Flash-Filetrans ( |
Publicly accessible file URL; 1 URL per request |
|
|
≤2 GB; ≤12 hours |
|
Qwen3-ASR-Flash ( |
URL / Base64 / local file absolute path; 1 file per request |
|
|
≤10 MB; ≤5 minutes |