Select the right model for multimodal understanding, audio and video analysis, voice conversation, content moderation, voice translation, and other omni-modal use cases.
Use cases
Omni-modal models can simultaneously understand text, audio, images, and video, and produce both text and speech output. Three model families are available: Qwen3.5-Omni (flagship, fullest capabilities), Qwen3-Omni-Flash (lightweight, lower cost, supports deep thinking), and Qwen3.5-Livetranslate (purpose-built translation, out-of-the-box). Select a model based on your use case:
| Use case | Recommended model | User guide |
|---|---|---|
| Real-time voice/video conversation: Interact with AI through a microphone and camera (voice assistants, customer service bots, visual Q&A, live-stream analysis) | Qwen3.5-Omni Realtime (WebSocket) | |
| Audio and video analysis: Upload audio or video files for AI-generated text or speech responses (video content moderation, meeting transcription, caption generation) | Qwen3.5-Omni (HTTP) | |
| Lightweight audio and video analysis: Analyze uploaded audio or video files at lower cost (single input capped at 150 seconds). Supports thinking mode (deep thinking) with text-only output | Qwen3-Omni-Flash (HTTP) | |
| Real-time voice translation: Simultaneous interpretation with approximately 3-second latency, supporting 60 languages (live interpretation, multilingual meetings) | Qwen3.5-Livetranslate (WebSocket) | |
| Audio and video file translation: Upload audio or video files and translate them into a target language (video dubbing, podcast translation) | Qwen3-Livetranslate (HTTP) | |
| Voice cloning: Provide a reference audio clip and the AI generates speech responses in that voice | Qwen3.5-Omni Plus / Flash (HTTP / WebSocket) | |
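The recommendations in the table above can be kept as a small lookup so that application code selects a model family by use case. This is a minimal sketch: the dictionary keys, the `Recommendation` type, and the `recommend` function are hypothetical names chosen for illustration, and the values are family names from the table, not literal model IDs to send to the API.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    model_family: str  # family name from the table, not a literal model ID
    api: str           # transport listed for that use case


# Keys are hypothetical labels; values are copied from the use case table.
RECOMMENDED = {
    "realtime_conversation": Recommendation("Qwen3.5-Omni Realtime", "WebSocket"),
    "av_analysis": Recommendation("Qwen3.5-Omni", "HTTP"),
    "lightweight_av_analysis": Recommendation("Qwen3-Omni-Flash", "HTTP"),
    "realtime_translation": Recommendation("Qwen3.5-Livetranslate", "WebSocket"),
    "file_translation": Recommendation("Qwen3-Livetranslate", "HTTP"),
    "voice_cloning": Recommendation("Qwen3.5-Omni Plus / Flash", "HTTP / WebSocket"),
}


def recommend(use_case: str) -> Recommendation:
    """Return the recommended family and API for a use case label."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
```

Calling `recommend("file_translation")` returns the Qwen3-Livetranslate entry with the HTTP API; an unrecognized label raises `ValueError` rather than silently falling back to a default.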
- For content analysis, Qwen3.5-Omni supports audio up to 3 hours and video up to 1 hour per request.
- Function calling is supported by Qwen3.5-Omni (WebSocket + HTTP) and Qwen3-Omni-Flash (HTTP only).
- Web search is supported by Qwen3.5-Omni only (HTTP / WebSocket). Web search and function calling cannot be enabled at the same time.
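The limits and feature rules above can be checked on the client before a request is sent. The sketch below is one way to do that, under stated assumptions: `check_request` and its parameters are hypothetical names, the family strings are labels rather than model IDs, and the 150-second cap for Qwen3-Omni-Flash is taken from the use case table and assumed to apply to both audio and video.

```python
from __future__ import annotations

# Per-request duration limits in seconds, copied from this page.
DURATION_LIMITS = {
    ("Qwen3.5-Omni", "audio"): 3 * 60 * 60,
    ("Qwen3.5-Omni", "video"): 1 * 60 * 60,
    ("Qwen3-Omni-Flash", "audio"): 150,  # assumption: cap applies to audio
    ("Qwen3-Omni-Flash", "video"): 150,  # and to video alike
}

# APIs on which each feature is available, per family.
FUNCTION_CALLING = {
    "Qwen3.5-Omni": {"HTTP", "WebSocket"},
    "Qwen3-Omni-Flash": {"HTTP"},
}
WEB_SEARCH = {"Qwen3.5-Omni": {"HTTP", "WebSocket"}}


def check_request(
    family: str,
    api: str,
    *,
    function_calling: bool = False,
    web_search: bool = False,
    media_kind: str | None = None,
    media_seconds: float = 0,
) -> list[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if function_calling and web_search:
        problems.append("web search and function calling cannot both be enabled")
    if function_calling and api not in FUNCTION_CALLING.get(family, set()):
        problems.append(f"function calling is not listed for {family} over {api}")
    if web_search and api not in WEB_SEARCH.get(family, set()):
        problems.append(f"web search is not listed for {family} over {api}")
    limit = DURATION_LIMITS.get((family, media_kind))
    if limit is not None and media_seconds > limit:
        problems.append(f"{media_kind} exceeds the {limit}-second limit for {family}")
    return problems
```

Returning a list of messages instead of raising on the first violation lets a caller report every problem at once. Families or media kinds that this page gives no limit for are simply not checked.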
Translation
Omni-modal models support voice translation, with different models suited to different translation scenarios.
For quick setup, use Qwen3.5-Livetranslate (60 languages, approximately 3-second latency, out-of-the-box). For the highest quality and broadest language coverage, use Qwen3.5-Omni (29 output languages, with web search and term injection). For cost-sensitive workloads, use Qwen3-Omni-Flash (11 output languages, lower cost).
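As a rough illustration of the trade-off described above, the sketch below filters the three translation options by how many languages a project needs. `TranslationOption` and `options_covering` are hypothetical names, and comparing the 60-language figure for Qwen3.5-Livetranslate directly with the output-language counts of the other two families is a simplifying assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationOption:
    family: str     # family name as used on this page, not a model ID
    languages: int  # language count quoted on this page
    best_for: str   # positioning quoted on this page


OPTIONS = [
    TranslationOption("Qwen3.5-Livetranslate", 60, "quick setup"),
    TranslationOption("Qwen3.5-Omni", 29, "highest quality"),
    TranslationOption("Qwen3-Omni-Flash", 11, "cost-sensitive workloads"),
]


def options_covering(min_languages: int) -> list[str]:
    """List the families whose quoted language count meets the requirement."""
    return [o.family for o in OPTIONS if o.languages >= min_languages]
```

For example, `options_covering(20)` keeps Qwen3.5-Livetranslate and Qwen3.5-Omni and drops Qwen3-Omni-Flash. A language count alone does not confirm that a specific language pair is supported, so the result is only a first filter.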
Recommended models
| Model ID | API | Input | Function calling | Web search | Thinking mode |
|---|---|---|---|---|---|
| | WebSocket | Text, audio, images | | | |
| | HTTP | Text, audio, images, video | | | |
| | WebSocket | Text, audio, images | | | |
| | HTTP | Text, audio, images, video | | | |
| | WebSocket | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
| | WebSocket | Audio, images | | | |
| | HTTP | Audio, video | | | |
All models
Qwen3.5-Omni
| Model ID | API | Input | Function calling | Web search | Thinking mode |
|---|---|---|---|---|---|
| | WebSocket | Text, audio, images, video | | | |
| | WebSocket | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
| | WebSocket | Text, audio, images, video | | | |
| | WebSocket | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
Qwen3-Omni
| Model ID | API | Input | Function calling | Web search | Thinking mode |
|---|---|---|---|---|---|
| | WebSocket | Text, audio, images, video | | | |
| | WebSocket | Text, audio, images, video | | | |
| | WebSocket | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
| | HTTP | Text, audio, images, video | | | |
Qwen3.5-Livetranslate
| Model ID | API | Input | Languages |
|---|---|---|---|
| | WebSocket | Audio | 60 |
| | WebSocket | Audio | 60 |
Qwen3-Livetranslate
| Model ID | API | Input | Languages |
|---|---|---|---|
| | WebSocket | Audio | 18 |
| | WebSocket | Audio | 18 |
| | HTTP | Audio, video | 18 |
| | HTTP | Audio, video | 18 |
Legacy models
The following models are no longer updated. Use Qwen3.5-Omni for new projects.
| Model ID | Input | API |
|---|---|---|
| | Text, audio, images, video | HTTP |
| | Text, audio, images, video | HTTP |
| | Text, audio, images, video | HTTP |
| | Text, audio, images, video | HTTP |
| | Text, audio | WebSocket |
| | Text, audio | WebSocket |
| | Text, audio | WebSocket |