Speech-to-text

更新时间:
复制 MD 格式

Choose a model for real-time speech recognition, audio file transcription, and other speech-to-text scenarios.

Use the four "Selection criteria"—real-time vs. offline, domain terminology, speaker diarization, and emotion recognition—to identify the right model. Then check "Recommended models" for top picks per scenario, "All models" for the full version list by model family, and "Audio specifications" for input constraints. Supported languages (including dialects) are listed under each model family in "All models".

Selection criteria

Evaluate the following four dimensions to narrow down the right model for your use case.

Real-time or offline?

Real-time recognition outputs results while the user is still speaking. Offline recognition transcribes a pre-recorded audio file after the recording ends.

  • Real-time (streaming recognition): Uses a WebSocket connection to stream audio in and text out. Ideal for live captions, voice assistants, and meeting transcription. Recommended models: fun-asr-realtime (hot words, dialect support) or qwen3.5-omni-plus-realtime (prompt context, multilingual).

  • Offline (file transcription): Uses an HTTP API to submit audio files and retrieve transcription results. Suited for call center recordings, podcasts, and interviews. Recommended models: fun-asr (hot words, speaker diarization) or qwen3.5-omni-plus (prompt context, multilingual).

Fun-ASR and Qwen-ASR real-time models support the DashScope SDK (Java, Python). Fun-ASR models also support Android and iOS SDKs. All other models require direct WebSocket or HTTP API calls.

For real-time integration, see Real-time speech recognition. For offline integration, see Non-real-time speech recognition.

Domain terminology

Two approaches, ranked by flexibility:

  1. Prompt context injection: Describe your domain in the system prompt. No setup required—the model adapts on each request. Trade-off: slightly higher per-request latency than dedicated ASR models. Use qwen3.5-omni-plus-realtime (real-time) or qwen3.5-omni-plus (offline).

  2. Hot words: Supply a weighted vocabulary list. Best for stable, infrequently changing terminology. Use fun-asr-realtime (real-time) or fun-asr (offline).

Note

Qwen3.5-Omni isn't a traditional ASR engine—it's a large language model that understands audio. Inject context through the system prompt; the model adapts without a hot word list.

Speaker diarization

Only the Fun-ASR offline models (fun-asr and fun-asr-mtl) support speaker diarization. Use these models when you need to identify "who said what."

Emotion recognition

Qwen-ASR and Qwen3.5-Omni models detect emotions alongside transcription. Recommended models: qwen3-asr-flash-realtime (real-time) or qwen3-asr-flash-filetrans (offline).

Recommended models

The following table lists top picks for common scenarios. For full details, visit Model Studio.

Model ID

Mode

API

Accuracy boost

Emotion recognition

Speaker diarization

Languages

Max audio duration/size

fun-asr-realtime

Real-time

WebSocket

Hot words

Unsupported

Unsupported

Chinese, English, Japanese, and dialects

Unlimited

fun-asr

Offline

HTTP

Hot words

Unsupported

Supported

Multilingual with dialects

12 hours / 2 GB

qwen3.5-omni-plus-realtime

Real-time

WebSocket

Prompt context

Supported

Unsupported

Multilingual

2 hours

qwen3.5-omni-plus

Offline

HTTP (OpenAI-compatible)

Prompt context

Supported

Unsupported

Multilingual

3 hours / 2 GB

All models

Fun-ASR

Model ID

Mode

API

Accuracy boost

Emotion recognition

Speaker diarization

Languages

Max audio duration/size

fun-asr-realtime

Real-time

WebSocket

Hot words

Unsupported

Unsupported

Chinese, English, Japanese, and dialects

Unlimited

fun-asr-realtime-2026-02-28

Real-time

WebSocket

Hot words

Unsupported

Unsupported

Chinese, English, Japanese, and dialects

Unlimited

fun-asr-realtime-2025-11-07

Real-time

WebSocket

Hot words

Unsupported

Unsupported

Chinese, English, Japanese, and dialects

Unlimited

fun-asr-realtime-2025-09-15

Real-time

WebSocket

Hot words

Unsupported

Unsupported

Chinese, English

Unlimited

fun-asr-flash-8k-realtime

Real-time

WebSocket

Hot words

Unsupported

Unsupported

Chinese

Unlimited

fun-asr-flash-8k-realtime-2026-01-28

Real-time

WebSocket

Hot words

Unsupported

Unsupported

Chinese

Unlimited

fun-asr

Offline

HTTP

Hot words

Unsupported

Supported

Multilingual with dialects

12 hours / 2 GB

fun-asr-flash-2026-06-15

Offline

HTTP

Prompt context

Unsupported

Unsupported

Multilingual with dialects

5 minutes / 2 GB

fun-asr-2025-11-07

Offline

HTTP

Hot words

Unsupported

Supported

Multilingual with dialects

12 hours / 2 GB

fun-asr-2025-08-25

Offline

HTTP

Hot words

Unsupported

Supported

Chinese, English

12 hours / 2 GB

fun-asr-mtl

Offline

HTTP

Hot words

Unsupported

Supported

Multilingual with dialects

12 hours / 2 GB

fun-asr-mtl-2025-08-25

Offline

HTTP

Hot words

Unsupported

Supported

Multilingual with dialects

12 hours / 2 GB

Supported languages (by version):

  • Fun-ASR-Realtime main versions (fun-asr-realtime, fun-asr-realtime-2026-02-28, fun-asr-realtime-2025-11-07): Chinese (Mandarin, Cantonese, Wu, Hokkien, Hakka, Gan, Xiang, Jin; plus regional accents including Central Plains, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan—covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia accents), English, Japanese

  • fun-asr-realtime-2025-09-15: Chinese (Mandarin), English

  • Fun-ASR-Flash-8K-Realtime (fun-asr-flash-8k-realtime, fun-asr-flash-8k-realtime-2026-01-28): Chinese

  • Fun-ASR / Fun-ASR-MTL main versions (fun-asr, fun-asr-2025-11-07): Chinese (Mandarin, Cantonese, Wu, Hokkien, Hakka, Gan, Xiang, Jin; plus regional accents including Central Plains, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan—covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia accents), English, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Filipino, Hindi, Arabic, French, German, Spanish, Portuguese, Russian, Italian, Dutch, Swedish, Danish, Finnish, Norwegian, Greek, Polish, Czech, Hungarian, Romanian, Bulgarian, Croatian, Slovak

  • Fun-ASR-Flash (fun-asr-flash-2026-06-15): Chinese (Mandarin, Cantonese, Wu, Hokkien, Hakka, Gan, Xiang, Jin; plus regional accents including Central Plains, Southwestern, Ji-Lu, Jianghuai, Lan-Yin, Jiao-Liao, Northeastern, Beijing, and Hong Kong/Taiwan—covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia accents), English, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Filipino, Hindi, Arabic, French, German, Spanish, Portuguese, Russian, Italian, Dutch, Swedish, Danish, Finnish, Norwegian, Greek, Polish, Czech, Hungarian, Romanian, Bulgarian, Croatian, Slovak

  • fun-asr-2025-08-25: Chinese (Mandarin), English

  • Fun-ASR-MTL (fun-asr-mtl, fun-asr-mtl-2025-08-25): Chinese (Mandarin, Cantonese), English, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Filipino, Hindi, Arabic, French, German, Spanish, Portuguese, Russian, Italian, Dutch, Swedish, Danish, Finnish, Norwegian, Greek, Polish, Czech, Hungarian, Romanian, Bulgarian, Croatian, Slovak

Qwen-ASR

Model ID

Mode

API

Accuracy boost

Emotion recognition

Speaker diarization

Languages

Max audio duration/size

qwen3-asr-flash-realtime

Real-time

WebSocket

Unsupported

Supported

Unsupported

Multilingual with dialects

Unlimited

qwen3-asr-flash-realtime-2026-02-10

Real-time

WebSocket

Unsupported

Supported

Unsupported

Multilingual with dialects

Unlimited

qwen3-asr-flash-realtime-2025-10-27

Real-time

WebSocket

Unsupported

Supported

Unsupported

Multilingual with dialects

Unlimited

qwen3-asr-flash-filetrans

Offline

HTTP

Unsupported

Supported

Unsupported

Multilingual with dialects

12 hours / 2 GB

qwen3-asr-flash-filetrans-2025-11-17

Offline

HTTP

Unsupported

Supported

Unsupported

Multilingual with dialects

12 hours / 2 GB

qwen3-asr-flash

Offline

HTTP (OpenAI-compatible)

Unsupported

Supported

Unsupported

Multilingual with dialects

5 minutes / 10 MB

qwen3-asr-flash-2026-02-10

Offline

HTTP (OpenAI-compatible)

Unsupported

Supported

Unsupported

Multilingual with dialects

5 minutes / 10 MB

qwen3-asr-flash-2025-09-08

Offline

HTTP (OpenAI-compatible)

Unsupported

Supported

Unsupported

Multilingual with dialects

5 minutes / 10 MB

Supported languages: All Qwen-ASR models (qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, qwen3-asr-flash, and their snapshot versions) share the same language list: Chinese (Mandarin, Sichuanese, Hokkien, Wu, Cantonese), English, Japanese, German, Korean, Russian, French, Portuguese, Arabic, Italian, Spanish, Hindi, Indonesian, Thai, Turkish, Ukrainian, Vietnamese, Czech, Danish, Filipino, Finnish, Icelandic, Malay, Norwegian, Polish, Swedish.

Qwen3.5-Omni / Qwen3-Omni

Model ID

Mode

API

Accuracy boost

Emotion recognition

Speaker diarization

Languages

Max audio duration/size

qwen3.5-omni-plus-realtime

Real-time

WebSocket

Prompt context

Supported

Unsupported

Multilingual

2 hours

qwen3.5-omni-plus

Offline

HTTP (OpenAI-compatible)

Prompt context

Supported

Unsupported

Multilingual

3 hours / 2 GB

qwen3.5-omni-flash-realtime

Real-time

WebSocket

Prompt context

Supported

Unsupported

Multilingual

2 hours

qwen3.5-omni-flash

Offline

HTTP (OpenAI-compatible)

Prompt context

Supported

Unsupported

Multilingual

3 hours / 2 GB

qwen3-omni-flash-realtime

Real-time

WebSocket

Prompt context

Supported

Unsupported

Multilingual with dialects

2 hours

qwen3-omni-flash

Offline

HTTP (OpenAI-compatible)

Prompt context

Supported

Unsupported

Multilingual with dialects

20 minutes / 100 MB

Supported languages: Qwen3.5-Omni / Qwen3-Omni aren't dedicated ASR models. Refer to the user guide and API documentation of each model for supported languages.

Paraformer

Paraformer is an older-generation ASR model family with both real-time and offline variants. Migrate to Fun-ASR or Qwen-ASR when your workload allows.

Model ID

API

Description

paraformer-realtime-v2

WebSocket

Real-time recognition; Chinese, English, Japanese, Korean, German, French, Russian

paraformer-realtime-v1

WebSocket

Real-time recognition; Chinese, English, Japanese, Korean, German, French, Russian

paraformer-realtime-8k-v2

WebSocket

Real-time recognition for 8 kHz telephony; Chinese

paraformer-realtime-8k-v1

WebSocket

Real-time recognition for 8 kHz telephony; Chinese

paraformer-v2

HTTP

File transcription with speaker diarization; Chinese, English, Japanese, Korean, German, French, Russian

paraformer-8k-v2

HTTP

File transcription for 8 kHz telephony; Chinese

paraformer-v1

HTTP

File transcription with speaker diarization; Chinese, English, Japanese, Korean, German, French, Russian

paraformer-8k-v1

HTTP

File transcription for 8 kHz telephony; Chinese

paraformer-mtl-v1

HTTP

File transcription with speaker diarization; multilingual

Supported languages (by version):

  • paraformer-realtime-v2, paraformer-v2: Chinese (Mandarin, Cantonese, Wu, Hokkien, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, Shanghai), English, Japanese, Korean, German, French, Russian

  • paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1, paraformer-8k-v2, paraformer-8k-v1: Chinese (Mandarin)

  • paraformer-v1: Chinese (Mandarin), English

  • paraformer-mtl-v1: Chinese (Mandarin, Cantonese, Wu, Hokkien, Northeastern, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin), English, Japanese, Korean, Spanish, Indonesian, French, German, Italian, Malay

Others (deprecating)

The following models are scheduled for deprecation and listed for reference only. Migrate to the recommended models as soon as possible.

Model ID

API

Description

gummy-realtime-v1

WebSocket

Real-time recognition; Chinese, English, and dialects

gummy-chat-v1

WebSocket

Short-audio real-time recognition (1-minute limit); multilingual

sensevoice-v1

HTTP

File transcription; multilingual

Audio specifications

The following tables list audio specifications—input method, format, sample rate, and size/duration limits—for real-time and offline modes. For supported languages (including dialects), see the corresponding model family subsection under "All models". Qwen3.5-Omni / Qwen3-Omni models aren't dedicated ASR models. For audio specifications, refer to the user guide and API documentation for each model.

Real-time

Model ID

Input method

Audio format

Sample rate

Size/duration

Fun-ASR-Realtime / Fun-ASR-MTL-Realtime (fun-asr-realtime, fun-asr-mtl-realtime series)

Binary stream

pcm, wav, mp3, opus, speex, aac, amr

Any

Unlimited

Fun-ASR-Flash-8K-Realtime (fun-asr-flash-8k-realtime series)

Binary stream

Same as Fun-ASR-Realtime

8 kHz

Unlimited

Qwen-ASR-Realtime (qwen3-asr-flash-realtime series)

Binary stream

pcm, opus

8 kHz, 16 kHz

Unlimited

Paraformer-Realtime (paraformer-realtime-v2/v1, paraformer-realtime-8k-v2/v1)

Binary stream

Same as Fun-ASR-Realtime

paraformer-realtime-v2 any rate; paraformer-realtime-v1 16 kHz; paraformer-realtime-8k-* 8 kHz

Unlimited

All real-time models accept mono (single-channel) input only.

Offline

Model ID

Input method

Audio format

Sample rate

File size/duration

Fun-ASR (fun-asr, fun-asr-mtl series)

Publicly accessible file URL; 1 URL per request

aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

Any

≤2 GB; ≤12 hours (≤2 hours recommended when speaker diarization is enabled)

Fun-ASR-Flash (fun-asr-flash-2026-06-15)

URL / Base64; 1 file per request

aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

Any

≤2 GB; ≤5 minutes

Fun-ASR (fun-asr-realtimefun-asr-realtime-2026-02-28)

URL / Base64; 1 file per request

aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

Any

≤2 GB; ≤5 minutes

Paraformer (paraformer-v2/v1, paraformer-mtl-v1, paraformer-8k-v2/v1)

Same as Fun-ASR

Same as Fun-ASR

paraformer-v2 /v1 any rate; paraformer-8k-* 8 kHz only ; paraformer-mtl-v1 16 kHz or higher

Same as Fun-ASR

Qwen3-ASR-Flash-Filetrans (qwen3-asr-flash-filetrans series)

Publicly accessible file URL; 1 URL per request

aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv

pcm requires 16 kHz; other formats accept any rate (the server resamples to 16 kHz before recognition)

≤2 GB; ≤12 hours

Qwen3-ASR-Flash (qwen3-asr-flash series)

URL / Base64 / local file absolute path; 1 file per request

aac, amr, avi, aiff, flac, flv, mkv, mp3, mpeg, ogg, opus, wav, webm, wma, wmv

pcm requires 16 kHz; other formats accept any rate (the server resamples to 16 kHz before recognition)

≤10 MB; ≤5 minutes