Choose a model for speech-input-to-speech-output use cases, such as voice conversation, speech translation, and simultaneous interpretation.
This page covers the speech-to-speech use case. For broader multimodal capabilities—visual understanding, audio and video analysis, or content moderation—see the Omni-modal documentation.
S2S (speech-to-speech) vs. pipeline
There are two approaches to building voice applications:
Aspect | S2S | Pipeline (ASR + LLM + TTS) |
Latency | Low -- single-model stream processing | Higher -- three-stage serial processing |
Audio understanding | End-to-end -- perceives tone and emotion and responds accordingly | Converts to text before processing, losing subtle audio cues |
Voice customization | Selection of preset voices via a system prompt | Voice cloning and voice design (CosyVoice) |
Use S2S for low latency, audio-aware responses, and interactive conversation.
Use a pipeline when you need to customize voices or select best-in-class ASR, LLM, and TTS models for each stage.
This page focuses on the S2S single-model approach (the Omni and Livetranslate series). If you choose the pipeline approach, select each component from the corresponding documentation:
ASR (speech recognition): Speech-to-text
LLM (large language model): Text generation
TTS (text-to-speech): Speech synthesis
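The decision rule above can be sketched as a small helper. This is only an illustration: the class name, field names, and return format are hypothetical, and the rule that pipeline-only requirements win over latency is an assumption drawn from the comparison table, not an official selection algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VoiceAppRequirements:
    """Hypothetical requirement flags, named after rows in the comparison table."""
    needs_voice_cloning_or_design: bool = False
    needs_best_in_class_per_stage: bool = False
    needs_low_latency: bool = False
    needs_tone_and_emotion_awareness: bool = False


def choose_approach(req: VoiceAppRequirements) -> Tuple[str, List[str]]:
    """Return the suggested approach plus the trade-offs it implies."""
    tradeoffs: List[str] = []
    # Voice cloning/design and per-stage model choice are only listed for the pipeline.
    if req.needs_voice_cloning_or_design or req.needs_best_in_class_per_stage:
        if req.needs_low_latency:
            tradeoffs.append("higher latency from three-stage serial processing")
        if req.needs_tone_and_emotion_awareness:
            tradeoffs.append("subtle audio cues are lost when speech is converted to text")
        return "pipeline", tradeoffs
    # Otherwise the single-model S2S approach covers the remaining needs.
    return "s2s", tradeoffs


print(choose_approach(VoiceAppRequirements(needs_low_latency=True)))
print(choose_approach(VoiceAppRequirements(needs_voice_cloning_or_design=True,
                                           needs_low_latency=True)))
```

The second call shows the case where requirements conflict: the helper still picks the pipeline, but reports the latency cost so the trade-off stays visible.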
Real-time or file mode?
Real-time (WebSocket): For real-time voice interaction such as voice assistants, call centers, and simultaneous interpretation. Supports streaming audio input and speech output. Model names contain -realtime.
File mode (HTTP): Trades higher latency for better quality, ideal for video dubbing, podcast translation, and offline content processing. File mode also supports companion capabilities such as function calling, web search, thinking mode, and video context. For details, see Companion capabilities of the S2S single-model approach below.
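Because real-time model names contain -realtime, the transport can be inferred from the model name. The sketch below assumes that naming convention holds for every model you use; the function name and the example model names are hypothetical placeholders, not real model IDs.

```python
def transport_for(model_name: str) -> str:
    """Infer the API transport from the naming convention described above."""
    # Real-time models carry "-realtime" in their name and are served over WebSocket.
    if "-realtime" in model_name:
        return "websocket"
    # Everything else is treated as a file-mode model served over HTTP.
    return "http"


# Placeholder names for illustration only.
for name in ("example-omni-realtime", "example-omni"):
    print(name, "->", transport_for(name))
```

A check like this is useful as a guard before opening a connection, so that a file-mode model name is never sent to a WebSocket client by mistake.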
Choose a model by use case (S2S single-model approach)
The use cases below all assume the S2S single-model approach. For the pipeline approach, choose components from the ASR, LLM, and TTS guides linked above.
Use case | Recommended model | API |
Voice assistants and customer-service conversations | | WebSocket |
Cost-sensitive conversations | | WebSocket |
Simultaneous interpretation and live translation | | WebSocket |
Video dubbing and podcast translation | | HTTP |
Video analysis and batch labeling (requires thinking mode) | | HTTP |
Companion capabilities of the S2S single-model approach
Under the S2S single-model approach, the Qwen3.5-Omni and Qwen3-Omni models provide the following capabilities directly. With a pipeline approach, equivalent functionality must come from individual components (typically the LLM).
Function calling
Allows the model to perform actions—such as querying a knowledge base, checking a schedule, or triggering a workflow—based on what it hears and sees. Use Qwen3.5-Omni (in WebSocket or HTTP mode) or Qwen3-Omni (in HTTP mode).
Not supported by Qwen3-Omni real-time (WebSocket) models or the Livetranslate models.
Web search
Allows the model to retrieve real-time information to answer questions about current events, stock prices, weather, and similar topics. Use Qwen3.5-Omni in WebSocket or HTTP mode (both Plus and Flash series). The model decides on its own whether to search.
Not supported by Qwen3-Omni-Flash or the Livetranslate models.
Thinking mode
When answer quality outweighs latency, use Qwen3-Omni (HTTP mode). The model reasons step by step before replying, making it ideal for video analysis and batch labeling.
Voice generation is not supported in thinking mode.
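The capability rules in this section can be written down as a lookup table and queried. This is a sketch only: the keys are series labels and capability names chosen for illustration (not API model IDs or request parameters), and the table simply restates the support rules described above.

```python
from typing import List, Set, Tuple

# (series label, API mode) -> capabilities, as described in this section.
SUPPORT = {
    ("qwen3.5-omni", "websocket"): {"function_calling", "web_search"},
    ("qwen3.5-omni", "http"): {"function_calling", "web_search"},
    ("qwen3-omni", "websocket"): set(),
    ("qwen3-omni", "http"): {"function_calling", "thinking"},
}


def candidates(required: Set[str], needs_speech_output: bool = True) -> List[Tuple[str, str]]:
    """List the (series, mode) pairs that cover every required capability."""
    # Voice generation is not available in thinking mode, so the two cannot be combined.
    if "thinking" in required and needs_speech_output:
        return []
    return [key for key, caps in SUPPORT.items() if required <= caps]


print(candidates({"web_search"}))
print(candidates({"thinking"}, needs_speech_output=False))
print(candidates({"web_search", "thinking"}, needs_speech_output=False))
```

The last call returns an empty list because, per the rules above, no single series offers both web search and thinking mode.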
Speech translation
The following model series support speech translation:
Qwen3.5-Livetranslate: Translates between 60 languages, with 29 producing both audio and text output and 31 producing text only. Covers major languages such as Chinese, English, French, German, Russian, Japanese, Korean, Spanish, Portuguese, and Arabic.
Qwen3-Livetranslate: Supports 18 languages and 5 Chinese dialects, with a latency of approximately 3 seconds. For 7 of the 18 languages, the output is text-only (no speech). In file mode, it uses video input to deliver more accurate, context-aware translations.
Qwen3.5-Omni: Supports 29 output languages and 8 Chinese dialects. Offers strong audio and video understanding and web search. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes.
Qwen3-Omni-Flash: Supports 11 output languages and 8 Chinese dialects. Use a system prompt to inject terminology and domain context. Supports both real-time and file modes, at lower cost.
For a quick start with translation applications, use the Livetranslate series. For the highest quality and broadest language coverage, use Qwen3.5-Omni. For cost-sensitive scenarios, use Qwen3-Omni-Flash.
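The language counts and the recommendation above can be captured in a few lines. In this sketch the class, field names, and priority keys are hypothetical; the numbers are the ones quoted in this section, and the speech-output count for each Livetranslate series is derived by subtracting the text-only languages from the total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LivetranslateSeries:
    name: str
    total_languages: int
    text_only_languages: int

    @property
    def speech_output_languages(self) -> int:
        # Languages that produce speech = all languages minus the text-only ones.
        return self.total_languages - self.text_only_languages


LIVETRANSLATE = [
    LivetranslateSeries("Qwen3.5-Livetranslate", total_languages=60, text_only_languages=31),
    LivetranslateSeries("Qwen3-Livetranslate", total_languages=18, text_only_languages=7),
]

# Priority -> series, restating the recommendation above.
RECOMMENDATION = {
    "quick_start": "Livetranslate series",
    "quality": "Qwen3.5-Omni",
    "cost": "Qwen3-Omni-Flash",
}


def recommend(priority: str) -> str:
    """Return the recommended series for a priority, or raise for unknown keys."""
    if priority not in RECOMMENDATION:
        raise ValueError(f"unknown priority: {priority!r}")
    return RECOMMENDATION[priority]


for series in LIVETRANSLATE:
    print(series.name, "speech output languages:", series.speech_output_languages)
print(recommend("cost"))
```

Deriving the speech-output count rather than hard-coding it keeps the figures consistent: 60 minus 31 gives the 29 audio-and-text languages stated for Qwen3.5-Livetranslate.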
Recommended models
The table below lists the common entry-point model in each series. To pin a specific dated version (for version regression or stability), see All models below.
Model | API | Input | Function calling | Web search | Thinking mode | Translation |
| WebSocket | text, audio, image | Supported | Supported | -- | 29 languages |
| HTTP | text, audio, image, video | Supported | Supported | -- | 29 languages |
| WebSocket | text, audio, image | Supported | Supported | -- | 29 languages |
| HTTP | text, audio, image, video | Supported | Supported | -- | 29 languages |
| WebSocket | text, audio, image, video | -- | -- | -- | 11 languages |
| HTTP | text, audio, image, video | Supported | -- | Supported | 11 languages |
| WebSocket | audio, image | -- | -- | -- | 60 languages |
| HTTP | audio, video | -- | -- | -- | 18 languages |
All models
Qwen3.5-Omni
Model | API | Input | Function calling | Web search | Thinking mode |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| WebSocket | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
| HTTP | Text, audio, image, video | Supported | Supported | -- |
Qwen3-Omni
Model | API | Input | Function calling | Web search | Thinking mode |
| WebSocket | Text, audio, image, video | -- | -- | -- |
| WebSocket | Text, audio, image, video | -- | -- | -- |
| WebSocket | Text, audio, image, video | -- | -- | -- |
| HTTP | Text, audio, image, video | Supported | -- | Supported |
| HTTP | Text, audio, image, video | Supported | -- | Supported |
| HTTP | Text, audio, image, video | Supported | -- | Supported |
Qwen3.5-Livetranslate
Model | API | Input | Languages |
| WebSocket | Audio | 60 |
| WebSocket | Audio | 60 |
Qwen3-Livetranslate
Model | API | Input | Languages |
| WebSocket | Audio | 18 |
| WebSocket | Audio | 18 |
| HTTP | Audio, video | 18 |
| HTTP | Audio, video | 18 |
Legacy models
These models are no longer updated. For new projects, use Qwen3.5-Omni.
Model | Input | API |
| Text, audio, image, video | HTTP |
| Text, audio, image, video | HTTP |
| Text, audio, image, video | HTTP |
| Text, audio, image, video | HTTP |
| Text, audio | WebSocket |
| Text, audio | WebSocket |
| Text, audio | WebSocket |
What's next
After you choose a model, see the corresponding API documentation:
Qwen3.5-Omni and Qwen3-Omni (WebSocket, real-time): Real-time Qwen-Omni-Realtime
Qwen3.5-Omni and Qwen3-Omni (HTTP, file): Non-real-time Qwen-Omni
Qwen3.5-Livetranslate (WebSocket, real-time): Real-time speech and audio-video translation with Qwen
Qwen3-Livetranslate (HTTP, file): Audio and video file translation with Qwen