Speech-to-speech

Choose a model for speech-input-to-speech-output use cases, such as voice conversation, speech translation, and simultaneous interpretation.

This page covers the speech-to-speech use case. For broader multimodal capabilities—visual understanding, audio and video analysis, or content moderation—see the Omni-modal documentation.

S2S (speech-to-speech) vs. pipeline

There are two approaches to building voice applications:

| | S2S | Pipeline (ASR + LLM + TTS) |
| --- | --- | --- |
| Latency | Low (single-model stream processing) | Higher (three-stage serial processing) |
| Audio understanding | End-to-end: perceives tone and emotion and responds accordingly | Converts to text before processing, losing subtle audio cues |
| Voice customization | Selection of preset voices via a system prompt | Voice cloning and voice design (CosyVoice) |

  • Use S2S for low latency, audio-aware responses, and interactive conversation.

  • Use a pipeline when you need to customize voices or select best-in-class ASR, LLM, and TTS models for each stage.

This page focuses on the S2S single-model approach (the Omni and Livetranslate series). If you choose the pipeline approach, select each component from the corresponding ASR, LLM, and TTS documentation.

Real-time or file mode?

  • Real-time (WebSocket): For real-time voice interaction such as voice assistants, call centers, and simultaneous interpretation. Supports streaming audio input and speech output. Model names contain -realtime.

  • File mode (HTTP): Trades higher latency for better quality, ideal for video dubbing, podcast translation, and offline content processing. File mode also supports companion capabilities such as function calling, web search, thinking mode, and video context. For details, see Companion capabilities of the S2S single-model approach below.
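The naming rule above can be checked in code. The following is a minimal sketch that assumes only what this page states: real-time model names contain -realtime and use WebSocket, and all other models use HTTP. The function name `transport_for` is hypothetical and not part of any SDK.

```python
def transport_for(model_name: str) -> str:
    """Return the API transport implied by a model name.

    Assumption taken from this page: real-time models carry "-realtime"
    in their name and are served over WebSocket; the rest use HTTP.
    """
    return "WebSocket" if "-realtime" in model_name else "HTTP"


print(transport_for("qwen3.5-omni-plus-realtime"))  # WebSocket
print(transport_for("qwen3-livetranslate-flash"))   # HTTP
```

A check like this can catch a mismatch early, for example when a file-mode model name is passed to a WebSocket client.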

Choose a model by use case (S2S single-model approach)

The use cases below all assume the S2S single-model approach. For the pipeline approach, choose components from the ASR, LLM, and TTS guides linked above.

| Use case | Recommended model | API |
| --- | --- | --- |
| Voice assistants and customer-service conversations | qwen3.5-omni-plus-realtime | WebSocket |
| Cost-sensitive conversations | qwen3.5-omni-flash-realtime | WebSocket |
| Simultaneous interpretation and live translation | qwen3.5-livetranslate-flash-realtime | WebSocket |
| Video dubbing and podcast translation | qwen3-livetranslate-flash | HTTP |
| Video analysis and batch labeling (requires thinking mode) | qwen3-omni-flash | HTTP |
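If an application routes requests by use case, the recommendations above can be kept in a small lookup table. This is a sketch only: the model names and APIs are copied from the table, while the dictionary keys and the `recommend` function are hypothetical names chosen for illustration.

```python
# Use case -> (recommended model, API), copied from the table above.
# The short keys are illustrative labels, not official identifiers.
RECOMMENDED = {
    "voice_assistant": ("qwen3.5-omni-plus-realtime", "WebSocket"),
    "cost_sensitive_conversation": ("qwen3.5-omni-flash-realtime", "WebSocket"),
    "live_translation": ("qwen3.5-livetranslate-flash-realtime", "WebSocket"),
    "file_translation": ("qwen3-livetranslate-flash", "HTTP"),
    "video_analysis": ("qwen3-omni-flash", "HTTP"),
}


def recommend(use_case: str) -> tuple[str, str]:
    """Return (model, api) for a known use case, or raise ValueError."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}") from None


model, api = recommend("live_translation")
print(model, api)  # qwen3.5-livetranslate-flash-realtime WebSocket
```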

Companion capabilities of the S2S single-model approach

Under the S2S single-model approach, the Qwen3.5-Omni and Qwen3-Omni models provide the following capabilities directly. With a pipeline approach, equivalent functionality must come from individual components (typically the LLM).

Function calling

Allows the model to perform actions—such as querying a knowledge base, checking a schedule, or triggering a workflow—based on what it hears and sees. Use Qwen3.5-Omni (in WebSocket or HTTP mode) or Qwen3-Omni (in HTTP mode).

Not supported by the Qwen3-Omni real-time models or the Livetranslate models.

Web search

Allows the model to retrieve real-time information to answer questions about current events, stock prices, weather, and similar topics. Use Qwen3.5-Omni in WebSocket or HTTP mode (both Plus and Flash series). The model decides on its own whether to search.

Not supported by Qwen3-Omni-Flash or the Livetranslate model.

Thinking mode

When answer quality outweighs latency, use Qwen3-Omni (HTTP mode). The model reasons step by step before replying, making it ideal for video analysis and batch labeling.

Voice generation is not supported in thinking mode.

Speech translation

The following model series support speech translation:

  • Qwen3.5-Livetranslate: Translates between 60 languages, with 29 producing both audio and text output and 31 producing text only. Covers major languages such as Chinese, English, French, German, Russian, Japanese, Korean, Spanish, Portuguese, and Arabic.

  • Qwen3-Livetranslate: Supports 18 languages and 5 Chinese dialects with a latency of approximately 3 seconds. In file mode, it uses video input to deliver more accurate, context-aware translations. For 7 of these languages, the output is text-only (no speech).

  • Qwen3.5-Omni: Supports 29 output languages and 8 Chinese dialects. Offers strong audio and video understanding and web search. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes.

  • Qwen3-Omni-Flash: Supports 11 output languages and 8 Chinese dialects. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes, at lower cost.

Note

For a quick start with translation applications, use the Livetranslate series. For the highest quality and broadest language coverage, use Qwen3.5-Omni. For cost-sensitive scenarios, use Qwen3-Omni-Flash.

Supported languages

| Language | Qwen3.5-Livetranslate | Qwen3-Livetranslate | Qwen3.5-Omni | Qwen3-Omni-Flash |
| --- | --- | --- | --- | --- |
| English | Supported | Supported | Supported | Supported |
| Chinese (Mandarin) | Supported | Supported | Supported | Supported |
| Cantonese | Text-only | Supported | Supported | Supported |
| Sichuan dialect | Supported | Supported | Supported | Supported |
| Shanghainese | Supported | Supported | Supported | Supported |
| Beijing dialect | Supported | Supported | Supported | Supported |
| Tianjin dialect | Supported | Supported | Supported | Supported |
| Nanjing dialect | -- | -- | Supported | Supported |
| Shaanxi dialect | -- | -- | Supported | Supported |
| Minnan dialect | -- | -- | Supported | Supported |
| French | Supported | Supported | Supported | Supported |
| German | Supported | Supported | Supported | Supported |
| Russian | Supported | Supported | Supported | Supported |
| Italian | Supported | Supported | Supported | Supported |
| Spanish | Supported | Supported | Supported | Supported |
| Portuguese | Supported | Supported | Supported | Supported |
| Japanese | Supported | Supported | Supported | Supported |
| Korean | Supported | Supported | Supported | Supported |
| Thai | Supported | Text-only | Supported | Supported |
| Indonesian | Supported | Text-only | Supported | -- |
| Vietnamese | Supported | Text-only | Supported | -- |
| Arabic | Supported | Text-only | Supported | -- |
| Hindi | Supported | Text-only | Supported | -- |
| Turkish | Supported | Text-only | Supported | -- |
| Finnish | Supported | -- | Supported | -- |
| Polish | Supported | -- | Supported | -- |
| Dutch | Supported | -- | Supported | -- |
| Czech | Supported | -- | Supported | -- |
| Urdu | Supported | -- | Supported | -- |
| Tagalog | Supported | -- | Supported | -- |
| Swedish | Supported | -- | Supported | -- |
| Danish | Supported | -- | Supported | -- |
| Hebrew | Supported | -- | Supported | -- |
| Icelandic | Supported | -- | Supported | -- |
| Malay | Supported | -- | Supported | -- |
| Norwegian | Supported | -- | Supported | -- |
| Persian | Supported | -- | Supported | -- |
| Greek | Text-only | Text-only | -- | -- |
| Afrikaans | Text-only | -- | -- | -- |
| Asturian | Text-only | -- | -- | -- |
| Belarusian | Text-only | -- | -- | -- |
| Bulgarian | Text-only | -- | -- | -- |
| Bengali | Text-only | -- | -- | -- |
| Bosnian | Text-only | -- | -- | -- |
| Catalan | Text-only | -- | -- | -- |
| Cebuano | Text-only | -- | -- | -- |
| Estonian | Text-only | -- | -- | -- |
| Galician | Text-only | -- | -- | -- |
| Gujarati | Text-only | -- | -- | -- |
| Croatian | Text-only | -- | -- | -- |
| Hungarian | Text-only | -- | -- | -- |
| Javanese | Text-only | -- | -- | -- |
| Kazakh | Text-only | -- | -- | -- |
| Kannada | Text-only | -- | -- | -- |
| Kyrgyz | Text-only | -- | -- | -- |
| Latvian | Text-only | -- | -- | -- |
| Macedonian | Text-only | -- | -- | -- |
| Malayalam | Text-only | -- | -- | -- |
| Marathi | Text-only | -- | -- | -- |
| Punjabi | Text-only | -- | -- | -- |
| Romanian | Text-only | -- | -- | -- |
| Slovak | Text-only | -- | -- | -- |
| Slovenian | Text-only | -- | -- | -- |
| Swahili | Text-only | -- | -- | -- |
| Tajik | Text-only | -- | -- | -- |
| Azerbaijani | Text-only | -- | -- | -- |
| Ukrainian | Text-only | -- | -- | -- |

"Supported" means the model produces both speech and text output. "Text-only" means the model produces text output but no speech.

Qwen3.5-Omni supports 113 input languages and dialects.

Qwen3.5-Livetranslate supports 60 languages (29 with audio and text, 31 text only).

The legacy qwen-omni-turbo model supports only Chinese and English.
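To pick a model series for a given target language, the language table can be encoded as data and filtered by whether speech output is required. The sketch below is illustrative: it copies only six rows from the table, and the names `OUTPUT_MODES` and `series_for` are hypothetical. A real application would load the full table.

```python
# Column order matches the supported-languages table on this page.
SERIES = ("Qwen3.5-Livetranslate", "Qwen3-Livetranslate", "Qwen3.5-Omni", "Qwen3-Omni-Flash")

# A few rows copied from the table; extend with the remaining languages as needed.
OUTPUT_MODES = {
    "English":   ("Supported", "Supported", "Supported", "Supported"),
    "Cantonese": ("Text-only", "Supported", "Supported", "Supported"),
    "Thai":      ("Supported", "Text-only", "Supported", "Supported"),
    "Arabic":    ("Supported", "Text-only", "Supported", "--"),
    "Greek":     ("Text-only", "Text-only", "--", "--"),
    "Ukrainian": ("Text-only", "--", "--", "--"),
}


def series_for(language: str, need_speech: bool = True) -> list[str]:
    """List the series that can output `language`.

    With need_speech=True only "Supported" (speech and text) counts;
    otherwise "Text-only" is accepted as well.
    """
    accepted = {"Supported"} if need_speech else {"Supported", "Text-only"}
    modes = OUTPUT_MODES[language]
    return [series for series, mode in zip(SERIES, modes) if mode in accepted]


print(series_for("Arabic"))                    # ['Qwen3.5-Livetranslate', 'Qwen3.5-Omni']
print(series_for("Greek"))                     # []
print(series_for("Greek", need_speech=False))  # ['Qwen3.5-Livetranslate', 'Qwen3-Livetranslate']
```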

Recommended models

The table below lists the common entry-point model in each series. To pin a specific dated version (for regression testing or stability), see All models below.

| Model | API | Input | Function calling | Web search | Thinking mode | Translation |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image | Supported | Supported | -- | 29 languages |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | Supported | Supported | -- | 29 languages |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image | Supported | Supported | -- | 29 languages |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | Supported | Supported | -- | 29 languages |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | -- | -- | -- | 11 languages |
| qwen3-omni-flash | HTTP | Text, audio, image, video | Supported | -- | Supported | 11 languages |
| qwen3.5-livetranslate-flash-realtime | WebSocket | Audio, image | -- | -- | -- | 60 languages |
| qwen3-livetranslate-flash | HTTP | Audio, video | -- | -- | -- | 18 languages |
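The capability columns above lend themselves to a simple filter. The following sketch encodes the function calling, web search, and thinking mode flags exactly as listed in the table and returns the models that satisfy a set of requirements. The `Model` class and `find_models` function are hypothetical helpers, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    api: str               # "WebSocket" or "HTTP"
    function_calling: bool
    web_search: bool
    thinking: bool


# Flags copied from the recommended-models table on this page.
MODELS = [
    Model("qwen3.5-omni-plus-realtime", "WebSocket", True, True, False),
    Model("qwen3.5-omni-plus", "HTTP", True, True, False),
    Model("qwen3.5-omni-flash-realtime", "WebSocket", True, True, False),
    Model("qwen3.5-omni-flash", "HTTP", True, True, False),
    Model("qwen3-omni-flash-realtime", "WebSocket", False, False, False),
    Model("qwen3-omni-flash", "HTTP", True, False, True),
    Model("qwen3.5-livetranslate-flash-realtime", "WebSocket", False, False, False),
    Model("qwen3-livetranslate-flash", "HTTP", False, False, False),
]


def find_models(api: Optional[str] = None, function_calling: bool = False,
                web_search: bool = False, thinking: bool = False) -> list[str]:
    """Return names of models that offer every requested capability."""
    return [
        m.name for m in MODELS
        if (api is None or m.api == api)
        and (not function_calling or m.function_calling)
        and (not web_search or m.web_search)
        and (not thinking or m.thinking)
    ]


print(find_models(api="WebSocket", function_calling=True))
# ['qwen3.5-omni-plus-realtime', 'qwen3.5-omni-flash-realtime']
print(find_models(thinking=True))                   # ['qwen3-omni-flash']
print(find_models(web_search=True, thinking=True))  # []
```

The empty result in the last call reflects the table: no listed model combines web search with thinking mode.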

All models

Qwen3.5-Omni

| Model | API | Input | Function calling | Web search | Thinking mode |
| --- | --- | --- | --- | --- | --- |
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-plus-realtime-2026-03-15 | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-plus-2026-03-15 | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash-realtime-2026-03-15 | WebSocket | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | Supported | Supported | -- |
| qwen3.5-omni-flash-2026-03-15 | HTTP | Text, audio, image, video | Supported | Supported | -- |

Qwen3-Omni

| Model | API | Input | Function calling | Web search | Thinking mode |
| --- | --- | --- | --- | --- | --- |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | -- | -- | -- |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video | -- | -- | -- |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video | -- | -- | -- |
| qwen3-omni-flash | HTTP | Text, audio, image, video | Supported | -- | Supported |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video | Supported | -- | Supported |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video | Supported | -- | Supported |

Qwen3.5-Livetranslate

| Model | API | Input | Languages |
| --- | --- | --- | --- |
| qwen3.5-livetranslate-flash-realtime | WebSocket | Audio | 60 |
| qwen3.5-livetranslate-flash-realtime-2026-05-19 | WebSocket | Audio | 60 |

Qwen3-Livetranslate

| Model | API | Input | Languages |
| --- | --- | --- | --- |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |
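Dated versions in the tables above follow a visible naming pattern: the entry-point name followed by a -YYYY-MM-DD suffix. The sketch below relies on that observed pattern, which is an assumption drawn from the listed names rather than a documented contract. It splits a model name into its entry-point name and date, then lists the dated versions available for pinning. The function names are hypothetical.

```python
import re
from typing import Optional

# Matches "<entry-point>-YYYY-MM-DD", the pattern seen in the tables above.
_DATED = re.compile(r"^(?P<base>.+)-(?P<date>\d{4}-\d{2}-\d{2})$")


def split_snapshot(model_name: str) -> tuple[str, Optional[str]]:
    """Return (entry_point_name, date) or (model_name, None) if undated."""
    match = _DATED.match(model_name)
    if match is None:
        return model_name, None
    return match["base"], match["date"]


def snapshots_of(entry_point: str, catalog: list[str]) -> list[str]:
    """Dated versions of `entry_point` in `catalog`, oldest first."""
    dated = []
    for name in catalog:
        base, date = split_snapshot(name)
        if base == entry_point and date is not None:
            dated.append(name)
    # ISO dates sort chronologically when compared as strings.
    return sorted(dated)


CATALOG = [
    "qwen3-omni-flash",
    "qwen3-omni-flash-2025-12-01",
    "qwen3-omni-flash-2025-09-15",
    "qwen3-omni-flash-realtime-2025-12-01",
]

print(split_snapshot("qwen3-omni-flash-realtime-2025-12-01"))
# ('qwen3-omni-flash-realtime', '2025-12-01')
print(snapshots_of("qwen3-omni-flash", CATALOG))
# ['qwen3-omni-flash-2025-09-15', 'qwen3-omni-flash-2025-12-01']
```

Pinning one of the returned names in configuration keeps behavior stable across releases, while the undated entry-point name remains the default choice.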

Legacy models

These models are no longer updated. For new projects, use Qwen3.5-Omni.

| Model | Input | API |
| --- | --- | --- |
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |

What's next

After you choose a model, see the corresponding API documentation:

  • Qwen3.5-Omni and Qwen3-Omni (WebSocket, real-time): Real-time Qwen-Omni-Realtime

  • Qwen3.5-Omni and Qwen3-Omni (HTTP, file): Non-real-time Qwen-Omni

  • Qwen3.5-Livetranslate (WebSocket, real-time): Real-time speech and audio-video translation with Qwen

  • Qwen3-Livetranslate (HTTP, file): Audio and video file translation with Qwen