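The real-time speech recognition service receives an audio stream over WebSocket and transcribes it into punctuated text in real time. It is suited to scenarios such as live-stream captioning, online meetings, voice chat, and intelligent assistants.
Overview
A streaming WebSocket protocol provides low-latency audio-to-text conversion.
- High-accuracy recognition of Mandarin and multiple dialects such as Cantonese and Sichuanese
- Robust in complex acoustic environments, with automatic language detection and intelligent filtering of non-speech audio
- Recognition of multiple emotional states, including surprise, calm, happiness, sadness, disgust, anger, and fear
- Customizable hotwords to improve recognition accuracy for specific terms
- Timestamp output for structured recognition results
- Flexible sampling rates and multiple audio formats to fit different recording environments
Prerequisites
Quick start
The following examples show how to call the real-time speech recognition service through the DashScope SDK.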
Fun-ASR
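Recognize speech from a microphone
Recognize speech captured by the microphone and output text in real time, so that text appears as you speak.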
Java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Main {
public static void main(String[] args) throws InterruptedException {
// This is the URL for the Beijing region. To use a model in the Singapore region, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";
ExecutorService executorService = Executors.newSingleThreadExecutor();
executorService.submit(new RealtimeRecognitionTask());
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
}
class RealtimeRecognitionTask implements Runnable {
@Override
public void run() {
RecognitionParam param = RecognitionParam.builder()
.model("fun-asr-realtime")
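// API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
// If the environment variable is not set, replace the next line with your Model Studio (Bailian) API key: .apiKey("sk-xxx")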
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.format("pcm")
.sampleRate(16000)
.build();
Recognition recognizer = new Recognition();
ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
@Override
public void onEvent(RecognitionResult result) {
if (result.isSentenceEnd()) {
System.out.println("Final Result: " + result.getSentence().getText());
} else {
System.out.println("Intermediate Result: " + result.getSentence().getText());
}
}
@Override
public void onComplete() {
System.out.println("Recognition complete");
}
@Override
public void onError(Exception e) {
System.out.println("RecognitionCallback error: " + e.getMessage());
}
};
try {
recognizer.call(param, callback);
// Create the audio format: 16 kHz, 16-bit, mono, signed, little-endian
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
// Get the default recording device that matches the format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
// Start recording
targetDataLine.start();
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
// Record for 50 seconds and transcribe in real time
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
// Send the recorded audio data to the streaming recognition service
recognizer.sendAudioFrame(buffer);
buffer = ByteBuffer.allocate(1024);
// The recording rate is limited; sleep briefly to avoid high CPU usage
Thread.sleep(20);
}
}
recognizer.stop();
} catch (Exception e) {
e.printStackTrace();
} finally {
// Close the WebSocket connection when the task ends
recognizer.getDuplexApi().close(1000, "bye");
}
System.out.println(
"[Metric] requestId: "
+ recognizer.getLastRequestId()
+ ", first package delay ms: "
+ recognizer.getFirstPackageDelay()
+ ", last package delay ms: "
+ recognizer.getLastPackageDelay());
}
}
Python
Before running the Python example, install the third-party audio capture and playback package with pip install pyaudio.
import os
import signal  # for keyboard events handling (press "Ctrl+C" to terminate recording)
import sys
import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=channels,
                          rate=sample_rate,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_complete(self) -> None:
        print('RecognitionCallback completed.')  # recognition completed

    def on_error(self, message) -> None:
        print('RecognitionCallback task_id: ', message.request_id)
        print('RecognitionCallback error: ', message.message)
        # Stop and close the audio stream if it is running
        # (PyAudio streams expose is_active() and stop_stream())
        if stream is not None and stream.is_active():
            stream.stop_stream()
            stream.close()
        # Forcefully exit the program
        sys.exit(1)

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print('RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(
                    'RecognitionCallback sentence end, request_id:%s, usage:%s'
                    % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
    print('Ctrl+C pressed, stop recognition ...')
    # Stop recognition
    recognition.stop()
    print('Recognition stopped.')
    print(
        '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
        .format(
            recognition.get_last_request_id(),
            recognition.get_first_package_delay(),
            recognition.get_last_package_delay(),
        ))
    # Forcefully exit the program
    sys.exit(0)


# main function
if __name__ == '__main__':
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
    # If the environment variable is not set, replace the next line with your Model Studio (Bailian) API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
    # This is the URL for the Beijing region. To use a model in the Singapore region, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
    dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'
    # Create the recognition callback
    callback = Callback()
    # Call the recognition service in async mode; you can customize the recognition
    # parameters, such as model, format, and sample_rate
    recognition = Recognition(
        model='fun-asr-realtime',
        format=format_pcm,
        # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'; check the documentation for the supported formats
        sample_rate=sample_rate,
        # supports 8000 and 16000
        semantic_punctuation_enabled=False,
        callback=callback)
    # Start recognition
    recognition.start()
    signal.signal(signal.SIGINT, signal_handler)
    print("Press 'Ctrl+C' to stop recording and recognition...")
    # Read from the microphone and send audio until "Ctrl+C" is pressed
    while True:
        if stream:
            data = stream.read(block_size, exception_on_overflow=False)
            recognition.send_audio_frame(data)
        else:
            break
    recognition.stop()
Recognize a local audio file
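Recognize a local audio file and output the result. Suited to shorter, near-real-time scenarios such as conversational chat, voice commands, voice input methods, and voice search.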
Java
The audio file used in this example is asr_example.wav.
import com.alibaba.dashscope.api.GeneralApi;
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionResult;
import com.alibaba.dashscope.base.HalfDuplexParamBase;
import com.alibaba.dashscope.common.GeneralListParam;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.protocol.GeneralServiceOption;
import com.alibaba.dashscope.protocol.HttpMethod;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.protocol.StreamingMode;
import com.alibaba.dashscope.utils.Constants;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class TimeUtils {
private static final DateTimeFormatter formatter =
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
public static String getTimestamp() {
return LocalDateTime.now().format(formatter);
}
}
public class Main {
public static void main(String[] args) throws InterruptedException {
// This is the URL for the Beijing region. To use a model in the Singapore region, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";
// In a real application, call this method only once at program startup; there is no need to call it repeatedly.
warmUp();
ExecutorService executorService = Executors.newSingleThreadExecutor();
executorService.submit(new RealtimeRecognitionTask(Paths.get(System.getProperty("user.dir"), "asr_example.wav")));
executorService.shutdown();
// wait for all tasks to complete
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
public static void warmUp() {
try {
// Lightweight GET request to establish connection
GeneralServiceOption warmupOption = GeneralServiceOption.builder()
.protocol(Protocol.HTTP)
.httpMethod(HttpMethod.GET)
.streamingMode(StreamingMode.OUT)
.path("assistants")
.build();
warmupOption.setBaseHttpUrl(Constants.baseHttpApiUrl);
GeneralApi<HalfDuplexParamBase> api = new GeneralApi<>();
api.get(GeneralListParam.builder().limit(1L).build(), warmupOption);
} catch (Exception e) {
// Warm-up is best-effort: ignore failures and continue
}
}
}
class RealtimeRecognitionTask implements Runnable {
private Path filepath;
public RealtimeRecognitionTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
RecognitionParam param = RecognitionParam.builder()
.model("fun-asr-realtime")
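// API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
// If the environment variable is not set, replace the next line with your Model Studio (Bailian) API key: .apiKey("sk-xxx")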
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.format("wav")
.sampleRate(16000)
.build();
Recognition recognizer = new Recognition();
String threadName = Thread.currentThread().getName();
ResultCallback<RecognitionResult> callback = new ResultCallback<RecognitionResult>() {
@Override
public void onEvent(RecognitionResult message) {
if (message.isSentenceEnd()) {
System.out.println(TimeUtils.getTimestamp()+" "+
"[process " + threadName + "] Final Result:" + message.getSentence().getText());
} else {
System.out.println(TimeUtils.getTimestamp()+" "+
"[process " + threadName + "] Intermediate Result: " + message.getSentence().getText());
}
}
@Override
public void onComplete() {
System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Recognition complete");
}
@Override
public void onError(Exception e) {
System.out.println(TimeUtils.getTimestamp()+" "+
"[" + threadName + "] RecognitionCallback error: " + e.getMessage());
}
};
try {
recognizer.call(param, callback);
// Please replace the path with your audio file path
System.out.println(TimeUtils.getTimestamp()+" "+"[" + threadName + "] Input file_path is: " + this.filepath);
// Read file and send audio by chunks
// Files.readAllBytes reads the entire file; a single FileInputStream.read() call may return fewer bytes than requested
byte[] allData = java.nio.file.Files.readAllBytes(this.filepath);
int sendFrameLength = 3200;
for (int i = 0; i * sendFrameLength < allData.length; i ++) {
int start = i * sendFrameLength;
int end = Math.min(start + sendFrameLength, allData.length);
ByteBuffer byteBuffer = ByteBuffer.wrap(allData, start, end - start);
recognizer.sendAudioFrame(byteBuffer);
Thread.sleep(100);
}
System.out.println(TimeUtils.getTimestamp()+" "+LocalDateTime.now());
recognizer.stop();
} catch (Exception e) {
e.printStackTrace();
} finally {
// Close the WebSocket connection when the task ends
recognizer.getDuplexApi().close(1000, "bye");
}
System.out.println(
"["
+ threadName
+ "][Metric] requestId: "
+ recognizer.getLastRequestId()
+ ", first package delay ms: "
+ recognizer.getFirstPackageDelay()
+ ", last package delay ms: "
+ recognizer.getLastPackageDelay());
}
}
Python
The audio file used in this example is asr_example.wav.
import os
import time
from datetime import datetime
import dashscope
from dashscope.audio.asr import *

# API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
# If the environment variable is not set, replace the next line with your Model Studio (Bailian) API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
# This is the URL for the Beijing region. To use a model in the Singapore region, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference
dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'


def get_timestamp():
    now = datetime.now()
    formatted_timestamp = now.strftime("[%Y-%m-%d %H:%M:%S.%f]")
    return formatted_timestamp


class Callback(RecognitionCallback):
    def on_complete(self) -> None:
        print(get_timestamp() + ' Recognition completed')  # recognition complete

    def on_error(self, result: RecognitionResult) -> None:
        print('Recognition task_id: ', result.request_id)
        print('Recognition error: ', result.message)
        exit(1)  # non-zero exit code signals the failure

    def on_event(self, result: RecognitionResult) -> None:
        sentence = result.get_sentence()
        if 'text' in sentence:
            print(get_timestamp() + ' RecognitionCallback text: ', sentence['text'])
            if RecognitionResult.is_sentence_end(sentence):
                print(get_timestamp() +
                      ' RecognitionCallback sentence end, request_id:%s, usage:%s'
                      % (result.get_request_id(), result.get_usage(sentence)))


callback = Callback()
recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=callback)

audio_path = "asr_example.wav"
if not os.path.getsize(audio_path):
    raise Exception('The supplied file was empty (zero bytes long)')
# Read the whole file into a buffer at once
with open(audio_path, 'rb') as f:
    file_buffer = f.read()

print("Start Recognition")
recognition.start()
# Send the buffer in 3200-byte chunks
buffer_size = len(file_buffer)
offset = 0
chunk_size = 3200
while offset < buffer_size:
    # Size of the chunk to send in this iteration
    remaining_bytes = buffer_size - offset
    current_chunk_size = min(chunk_size, remaining_bytes)
    # Slice the current chunk out of the buffer
    audio_data = file_buffer[offset:offset + current_chunk_size]
    # Send the audio frame
    recognition.send_audio_frame(audio_data)
    # Advance the offset
    offset += current_chunk_size
    # Add a delay to simulate real-time transmission
    time.sleep(0.1)
recognition.stop()

print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
        recognition.get_last_request_id(),
        recognition.get_first_package_delay(),
        recognition.get_last_package_delay(),
    ))
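The file-based examples above send 3200-byte chunks and pause 100 ms between them. For 16-bit mono PCM at 16 kHz, 3200 bytes is exactly 100 ms of audio, so the loop approximates real-time pacing. The sketch below works through that arithmetic; the helper names are illustrative and not part of the DashScope SDK, and the header bytes of a WAV file are ignored.

```python
def pcm_chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                    sample_width_bytes: int = 2, channels: int = 1) -> int:
    """Number of raw PCM bytes that cover chunk_ms milliseconds of audio."""
    bytes_per_second = sample_rate_hz * sample_width_bytes * channels
    return bytes_per_second * chunk_ms // 1000


def pcm_chunk_duration_ms(chunk_bytes: int, sample_rate_hz: int,
                          sample_width_bytes: int = 2, channels: int = 1) -> float:
    """Milliseconds of audio contained in chunk_bytes of raw PCM."""
    bytes_per_second = sample_rate_hz * sample_width_bytes * channels
    return chunk_bytes * 1000 / bytes_per_second


if __name__ == "__main__":
    print(pcm_chunk_bytes(16000, 100))        # 3200
    print(pcm_chunk_duration_ms(1024, 16000))  # 32.0
```

If you change the sample rate or the sleep interval, recompute the chunk size so that audio is not sent much faster or slower than real time. For 8 kHz audio, for example, 100 ms corresponds to 1600 bytes.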
Qwen-ASR
The sample code reads your_audio_file.pcm (PCM16, 16 kHz, mono). If you only have MP3, WAV, or another format, convert it with ffmpeg:
ffmpeg -i your_audio.mp3 -ar 16000 -ac 1 -f s16le your_audio_file.pcm
Java
import com.alibaba.dashscope.audio.omni.*;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.sound.sampled.LineUnavailableException;
import java.io.File;
import java.io.FileInputStream;
import java.util.Base64;
import java.util.Collections;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
public class Qwen3AsrRealtimeUsage {
private static final Logger log = LoggerFactory.getLogger(Qwen3AsrRealtimeUsage.class);
private static final int AUDIO_CHUNK_SIZE = 1024; // Audio chunk size in bytes
private static final int SLEEP_INTERVAL_MS = 30; // Sleep interval in milliseconds
public static void main(String[] args) throws InterruptedException, LineUnavailableException {
CountDownLatch finishLatch = new CountDownLatch(1);
OmniRealtimeParam param = OmniRealtimeParam.builder()
.model("qwen3-asr-flash-realtime")
// This is the URL for the Beijing region. To use a model in the Singapore region, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime
.url("wss://dashscope.aliyuncs.com/api-ws/v1/realtime")
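// API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
// If the environment variable is not set, replace the next line with your Model Studio (Bailian) API key: .apikey("sk-xxx")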
.apikey(System.getenv("DASHSCOPE_API_KEY"))
.build();
OmniRealtimeConversation conversation = null;
final AtomicReference<OmniRealtimeConversation> conversationRef = new AtomicReference<>(null);
conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
@Override
public void onOpen() {
System.out.println("connection opened");
}
@Override
public void onEvent(JsonObject message) {
String type = message.get("type").getAsString();
switch(type) {
case "session.created":
System.out.println("start session: " + message.get("session").getAsJsonObject().get("id").getAsString());
break;
case "conversation.item.input_audio_transcription.completed":
System.out.println("transcription: " + message.get("transcript").getAsString());
finishLatch.countDown();
break;
case "input_audio_buffer.speech_started":
System.out.println("======VAD Speech Start======");
break;
case "input_audio_buffer.speech_stopped":
System.out.println("======VAD Speech Stop======");
break;
case "conversation.item.input_audio_transcription.text":
System.out.println("transcription: " + message.get("text").getAsString() + message.get("stash").getAsString());
break;
default:
break;
}
}
@Override
public void onClose(int code, String reason) {
System.out.println("connection closed code: " + code + ", reason: " + reason);
}
});
conversationRef.set(conversation);
try {
conversation.connect();
} catch (NoApiKeyException e) {
throw new RuntimeException(e);
}
OmniRealtimeTranscriptionParam transcriptionParam = new OmniRealtimeTranscriptionParam();
transcriptionParam.setLanguage("zh");
transcriptionParam.setInputAudioFormat("pcm");
transcriptionParam.setInputSampleRate(16000);
OmniRealtimeConfig config = OmniRealtimeConfig.builder()
.modalities(Collections.singletonList(OmniRealtimeModality.TEXT))
.transcriptionConfig(transcriptionParam)
.build();
conversation.updateSession(config);
String filePath = "your_audio_file.pcm";
File audioFile = new File(filePath);
if (!audioFile.exists()) {
log.error("Audio file not found: {}", filePath);
return;
}
try (FileInputStream audioInputStream = new FileInputStream(audioFile)) {
byte[] audioBuffer = new byte[AUDIO_CHUNK_SIZE];
int bytesRead;
int totalBytesRead = 0;
log.info("Starting to send audio data from: {}", filePath);
// Read and send audio data in chunks
while ((bytesRead = audioInputStream.read(audioBuffer)) != -1) {
totalBytesRead += bytesRead;
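// Encode only the bytes actually read; the final chunk may be shorter than the buffer
String audioB64 = Base64.getEncoder().encodeToString(java.util.Arrays.copyOf(audioBuffer, bytesRead));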
// Send audio chunk to conversation
conversation.appendAudio(audioB64);
// Add small delay to simulate real-time audio streaming
Thread.sleep(SLEEP_INTERVAL_MS);
}
log.info("Finished sending audio data. Total bytes sent: {}", totalBytesRead);
} catch (Exception e) {
log.error("Error sending audio from file: {}", filePath, e);
}
//send session.finish and wait for finish and close
conversation.endSession();
log.info("task finished");
System.exit(0);
}
}
Python
import logging
import os
import base64
import signal
import sys
import time
import dashscope
from dashscope.audio.qwen_omni import *
from dashscope.audio.qwen_omni.omni_realtime import TranscriptionParams


def setup_logging():
    """Configure log output."""
    logger = logging.getLogger('dashscope')
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def init_api_key():
    """Initialize the API key."""
    # API keys differ between the Singapore and Beijing regions. Get an API key: https://help.aliyun.com/zh/model-studio/get-api-key
    # If the environment variable is not set, replace the next line with your Model Studio (Bailian) API key: dashscope.api_key = "sk-xxx"
    dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY', 'YOUR_API_KEY')
    if dashscope.api_key == 'YOUR_API_KEY':
        print('[Warning] Using placeholder API key, set DASHSCOPE_API_KEY environment variable.')


class MyCallback(OmniRealtimeCallback):
    """Callback handler for real-time recognition."""

    def __init__(self, conversation):
        self.conversation = conversation
        self.handlers = {
            'session.created': self._handle_session_created,
            'conversation.item.input_audio_transcription.completed': self._handle_final_text,
            'conversation.item.input_audio_transcription.text': self._handle_transcription_text,
            'input_audio_buffer.speech_started': lambda r: print('======Speech Start======'),
            'input_audio_buffer.speech_stopped': lambda r: print('======Speech Stop======')
        }

    def on_open(self):
        print('Connection opened')

    def on_close(self, code, msg):
        print(f'Connection closed, code: {code}, msg: {msg}')

    def on_event(self, response):
        try:
            handler = self.handlers.get(response['type'])
            if handler:
                handler(response)
        except Exception as e:
            print(f'[Error] {e}')

    def _handle_session_created(self, response):
        print(f"Start session: {response['session']['id']}")

    def _handle_final_text(self, response):
        print(f"Final recognized text: {response['transcript']}")

    def _handle_transcription_text(self, response):
        print(f"Got transcription result: {response['text'] + response['stash']}")


def read_audio_chunks(file_path, chunk_size=3200):
    """Read the audio file chunk by chunk."""
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk


def send_audio(conversation, file_path, delay=0.1):
    """Send the audio data."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Audio file {file_path} does not exist.")
    print("Processing audio file... Press 'Ctrl+C' to stop.")
    for chunk in read_audio_chunks(file_path):
        audio_b64 = base64.b64encode(chunk).decode('ascii')
        conversation.append_audio(audio_b64)
        time.sleep(delay)


def main():
    setup_logging()
    init_api_key()
    audio_file_path = "./your_audio_file.pcm"
    callback = MyCallback(conversation=None)
    conversation = OmniRealtimeConversation(
        model='qwen3-asr-flash-realtime',
        # This is the URL for the Beijing region. To use a model in the Singapore region, replace it with: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime
        url='wss://dashscope.aliyuncs.com/api-ws/v1/realtime',
        callback=callback,
    )
    # Inject the conversation into the callback so the callback can call its methods
    callback.conversation = conversation

    def handle_exit(sig, frame):
        print('Ctrl+C pressed, exiting...')
        conversation.close()
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_exit)
    conversation.connect()
    transcription_params = TranscriptionParams(
        language='zh',
        sample_rate=16000,
        input_audio_format="pcm"
    )
    conversation.update_session(
        output_modalities=[MultiModality.TEXT],
        enable_input_audio_transcription=True,
        transcription_params=transcription_params
    )
    try:
        send_audio(conversation, audio_file_path)
        # send session.finish and wait for finished and close
        conversation.end_session()
    except Exception as e:
        print(f"Error occurred: {e}")
    finally:
        conversation.close()
        print("Audio processing completed.")


if __name__ == '__main__':
    main()
Paraformer
The Paraformer sample code is similar to the Fun-ASR code; just replace model with a Paraformer model name.
Advanced features
Qwen-ASR interaction modes
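The Qwen-ASR Realtime API provides two interaction modes:
- VAD mode (default): the server automatically detects the start and end of speech (sentence segmentation). Suited to real-time conversation, meeting transcription, and similar scenarios. To enable it, configure the session.turn_detection parameter (enabled by default).
- Manual mode: the client controls segmentation by sending input_audio_buffer.commit. Suited to scenarios that need explicit control over when audio is submitted, such as sending a voice message in a chat app. To enable it, set session.turn_detection to null.
To switch the interaction mode:
- WebSocket: set the turn_detection field in the session.update event.
  { "type": "session.update", "session": { "turn_detection": null } }
- Python SDK: set the enable_turn_detection parameter in the update_session method.
  conversation.update_session(enable_turn_detection=False)
- Java SDK: set the enableTurnDetection parameter through OmniRealtimeConfig.builder().
  OmniRealtimeConfig config = OmniRealtimeConfig.builder().enableTurnDetection(false).build(); conversation.updateSession(config);
For complete SDK code examples, see Python SDK and Java SDK. For the WebSocket event lifecycle, see Event interaction flow.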
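VAD segmentation configuration
VAD (Voice Activity Detection) determines when a stretch of continuous speech has ended, which triggers the "final recognition result" event. All three model families enable server-side VAD by default, but parameter names and tuning granularity differ:
- Qwen-ASR: configured through session.turn_detection, which includes silence_duration_ms (the silence-duration threshold; a turn is considered finished once silence exceeds it; the server default is 800, and 400 is recommended for conversation, chat, and other scenarios that need fast segmentation) and threshold (VAD detection sensitivity; the server default is 0.2). Qwen-ASR also supports Manual mode, which turns VAD off and lets the client control segmentation with commit; see Qwen-ASR interaction modes above.
- Fun-ASR / Paraformer: configured through max_sentence_silence (the VAD silence threshold for segmentation, in milliseconds). When the silence after a stretch of speech exceeds this threshold, the system considers the sentence finished.
The parameter name depends on the protocol: the same concept is called silence_duration_ms in Qwen-ASR and max_sentence_silence in Fun-ASR / Paraformer. For complete field definitions, see the API reference.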
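As a rough illustration of the Qwen-ASR settings above, the following sketch builds a session.update event that tunes silence_duration_ms and threshold, or sets turn_detection to null for Manual mode. The function name is hypothetical, and only the fields named in this section are included; the service may require other fields inside turn_detection, so check the API reference before relying on this payload.

```python
import json
from typing import Optional


def build_turn_detection_update(silence_duration_ms: Optional[int] = 800,
                                threshold: float = 0.2) -> str:
    """Return a session.update event as a JSON string.

    Passing silence_duration_ms=None sets turn_detection to null,
    which corresponds to Manual mode.
    """
    if silence_duration_ms is None:
        turn_detection = None
    else:
        # Only the two fields described in this section (assumption:
        # other turn_detection fields are left at server defaults).
        turn_detection = {
            "silence_duration_ms": silence_duration_ms,
            "threshold": threshold,
        }
    event = {"type": "session.update",
             "session": {"turn_detection": turn_detection}}
    return json.dumps(event)


if __name__ == "__main__":
    # Faster segmentation for chat-style scenarios
    print(build_turn_detection_update(silence_duration_ms=400))
    # Manual mode: the client decides when to commit
    print(build_turn_detection_update(silence_duration_ms=None))
```

The resulting string would be sent as a text frame on an already-open WebSocket session; connection handling is omitted here.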
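Use hotwords to improve accuracy
The Fun-ASR and Paraformer series support hotwords, which improve recognition accuracy for specific terms such as brand names, personal names, and technical terminology.
For detailed hotword configuration and usage instructions, see Custom hotwords.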
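Get timestamps
Fun-ASR and Paraformer models output timestamps at both sentence and word granularity by default, which helps with subtitle alignment, keyword highlighting, karaoke-style read-along, and similar scenarios. Qwen-ASR Realtime (qwen3-asr-flash-realtime) does not currently return timestamps; use Fun-ASR or Paraformer if you need them. The Qwen-ASR recorded-file transcription model qwen3-asr-flash-filetrans supports word-level timestamps; see Non-real-time speech recognition.
Timestamps are in milliseconds and are returned at two levels:
- Sentence level: payload.output.sentence.begin_time and payload.output.sentence.end_time mark when the whole sentence starts and ends in the audio. In intermediate results end_time may be null; the final value is filled in when the sentence ends (sentence_end = true).
- Word level: the payload.output.sentence.words array. Each element contains begin_time, end_time, text (the text of the character or word), and punctuation (the punctuation that follows it, or an empty string if there is none).
Example response structure (excerpt):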
{
"payload": {
"output": {
"sentence": {
"begin_time": 170,
"end_time": 920,
"text": "好,我知道了",
"sentence_end": true,
"words": [
{ "begin_time": 170, "end_time": 295, "text": "好", "punctuation": "," },
{ "begin_time": 295, "end_time": 503, "text": "我", "punctuation": "" },
{ "begin_time": 503, "end_time": 711, "text": "知道", "punctuation": "" },
{ "begin_time": 711, "end_time": 920, "text": "了", "punctuation": "" }
]
}
}
}
}
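The field names above follow the WebSocket JSON paths. SDKs expose these fields under different naming conventions (dictionary keys, object attributes, getter methods, and so on); see each SDK's API reference for the full field mapping.
For complete field definitions, see the API reference.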
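To show how these millisecond timestamps might be used for subtitle alignment, here is a minimal sketch that turns a finished sentence object (shaped like the excerpt in this section) into an SRT-style cue. The function names are illustrative, and the sample dictionary simply mirrors the documented fields.

```python
from typing import Optional


def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def sentence_to_srt_cue(index: int, sentence: dict) -> Optional[str]:
    """Build one SRT cue, or return None for an intermediate result.

    Intermediate results may carry end_time = null, so only finished
    sentences (sentence_end true, end_time set) produce a cue.
    """
    if not sentence.get("sentence_end") or sentence.get("end_time") is None:
        return None
    start = ms_to_srt(sentence["begin_time"])
    end = ms_to_srt(sentence["end_time"])
    return f"{index}\n{start} --> {end}\n{sentence['text']}\n"


def words_with_punctuation(sentence: dict) -> list:
    """Join each word with the punctuation that follows it."""
    return [w["text"] + w["punctuation"] for w in sentence.get("words", [])]


if __name__ == "__main__":
    sample = {
        "begin_time": 170, "end_time": 920,
        "text": "好,我知道了", "sentence_end": True,
        "words": [
            {"begin_time": 170, "end_time": 295, "text": "好", "punctuation": ","},
            {"begin_time": 295, "end_time": 503, "text": "我", "punctuation": ""},
            {"begin_time": 503, "end_time": 711, "text": "知道", "punctuation": ""},
            {"begin_time": 711, "end_time": 920, "text": "了", "punctuation": ""},
        ],
    }
    print(sentence_to_srt_cue(1, sample))
    print(words_with_punctuation(sample))
```

With an SDK, the same values would come from the SDK's own accessors rather than a raw dictionary.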
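Emotion recognition
Qwen-ASR and some Paraformer models can attach the speaker's emotional state to the transcription result, but the two differ in output granularity and in how the feature is enabled.
Qwen-ASR (qwen3-asr-flash-realtime): always on, no configuration needed. The result is returned in a top-level emotion field in both the conversation.item.input_audio_transcription.text and conversation.item.input_audio_transcription.completed events. It takes one of seven fine-grained values: surprised, neutral (calm), happy, sad, disgusted, angry, and fearful.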
{
"type": "conversation.item.input_audio_transcription.text",
"emotion": "neutral",
"text": "今天天气不错",
"stash": ""
}
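Paraformer (paraformer-realtime-8k-v2): this is the only Paraformer model that supports emotion recognition. Results are returned in payload.output.sentence.emo_tag and payload.output.sentence.emo_confidence. emo_tag takes one of three polarity values: positive (for example happy or satisfied), negative (for example angry or dull), and neutral (no obvious emotion). The confidence is in the range [0.0, 1.0].
Emotion results are output only when all of the following conditions are met:
- The model is paraformer-realtime-8k-v2.
- Semantic segmentation is off: semantic_punctuation_enabled = false (false is the default, so no extra setting is needed).
- The event is a sentence-end event, with sentence_end = true.
If you do not want the emotion fields returned, set semantic_punctuation_enabled to true. This enables semantic segmentation, and emo_tag and emo_confidence are no longer returned.
The field names above follow the WebSocket JSON paths. SDKs expose these fields under different naming conventions (dictionary keys, object attributes, getter methods, and so on); see each SDK's API reference for the full field mapping.
For complete field definitions, value constraints, and examples, see the API reference.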
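The Paraformer conditions above can be sketched as a small helper that reads emo_tag and emo_confidence from a raw WebSocket event only when the sentence has ended. The function name is hypothetical and the event is assumed to be an already-parsed JSON dictionary with the paths given in this section.

```python
from typing import Optional, Tuple


def extract_paraformer_emotion(event: dict) -> Optional[Tuple[str, float]]:
    """Return (emo_tag, emo_confidence) for a sentence-end event, else None."""
    sentence = event.get("payload", {}).get("output", {}).get("sentence", {})
    # Emotion fields are only returned on sentence-end events
    if not sentence.get("sentence_end"):
        return None
    tag = sentence.get("emo_tag")
    if tag is None:
        # For example, semantic_punctuation_enabled was set to true
        return None
    return tag, sentence.get("emo_confidence")


if __name__ == "__main__":
    final_event = {"payload": {"output": {"sentence": {
        "text": "thanks a lot", "sentence_end": True,
        "emo_tag": "positive", "emo_confidence": 0.9}}}}
    partial_event = {"payload": {"output": {"sentence": {
        "text": "thanks", "sentence_end": False}}}}
    print(extract_paraformer_emotion(final_event))
    print(extract_paraformer_emotion(partial_event))
```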
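Calling the raw WebSocket protocol
The following example shows how to connect to the server directly over the raw WebSocket protocol, for cases where the DashScope SDK is not used. It is a minimal runnable implementation; for the WebSocket protocol itself, see each model's API reference.
Connection reuse (WebSocket)
WebSocket connections for Fun-ASR and Paraformer can be reused: after one recognition task ends, the next task can start without establishing a new connection.
Reuse flow: the client sends finish-task; once the server returns task-finished, the client can send run-task again to start a new task.
- You must wait for the server's task-finished event before starting a new task.
- Different tasks on a reused connection must use different task_id values.
- If a task fails, the server returns an error event and closes the connection; that connection cannot be reused.
- If no new task starts within 60 seconds after a task ends, the connection is closed automatically.
Qwen-ASR Realtime uses a session model: the connection must be closed explicitly after each session, and connection reuse is not supported.
For event descriptions for each model, see the corresponding API reference.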
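The reuse rules for Fun-ASR and Paraformer can be expressed as a small client-side state tracker. This is a sketch under assumptions: the class and method names are invented for illustration, the error event is called "task-failed" here because this section does not name it, and no real network I/O is performed.

```python
import time
import uuid


class ReusableConnectionState:
    """Tracks whether a new task may start on an existing connection."""

    IDLE_TIMEOUT_S = 60  # connection closes 60 s after a task ends

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._active_task_id = None
        self._broken = False
        self._idle_since = None  # set when a task finishes

    def can_start_task(self) -> bool:
        # A failed connection is never reused; one task at a time
        if self._broken or self._active_task_id is not None:
            return False
        if self._idle_since is None:
            return True
        return self._clock() - self._idle_since < self.IDLE_TIMEOUT_S

    def start_task(self) -> str:
        """Return a fresh task_id to put in the next run-task message."""
        if not self.can_start_task():
            raise RuntimeError("connection cannot be reused; open a new one")
        self._active_task_id = uuid.uuid4().hex  # unique per task
        return self._active_task_id

    def on_server_event(self, event_name: str) -> None:
        if event_name == "task-finished":
            self._active_task_id = None
            self._idle_since = self._clock()
        elif event_name == "task-failed":  # assumed name of the error event
            self._broken = True


if __name__ == "__main__":
    state = ReusableConnectionState()
    first = state.start_task()
    print(state.can_start_task())   # False: still waiting for task-finished
    state.on_server_event("task-finished")
    second = state.start_task()
    print(first != second)          # True: each task gets its own task_id
```

Injecting the clock makes the idle timeout easy to test without waiting 60 seconds.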
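High-concurrency best practices
The DashScope SDK has a built-in pooling mechanism that reuses WebSocket connections and recognition objects, avoiding the overhead of frequent creation and destruction. Currently only the Paraformer Java SDK supports this feature.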
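Going to production
Improve recognition quality
- Choose a model that matches the sampling rate: for 8 kHz telephone audio, use an 8 kHz model directly, and avoid the information distortion caused by upsampling to 16 kHz.
- Improve input audio quality: use a high-quality microphone and make sure the recording environment has a high signal-to-noise ratio and no echo. You can integrate noise suppression (such as RNNoise), acoustic echo cancellation (AEC), and similar algorithms at the application layer as preprocessing.
Set a fault-tolerance strategy
- Client reconnection: the client should implement automatic reconnection to cope with network jitter. A reference approach with the Python SDK:
  - Catch exceptions: implement the on_error method in the Callback class. The dashscope SDK calls this method when it hits a network error or another problem.
  - Signal the state: when on_error fires, set a reconnect signal. In Python you can use threading.Event, a thread-safe signal flag.
  - Reconnect loop: wrap the main logic in a for loop (for example, three retries). When the reconnect signal is detected, the current round of recognition is interrupted and resources are cleaned up; after waiting a few seconds, the loop runs again and creates a brand-new connection.
- Heartbeat to keep the connection alive: when you need a long-lived connection to the server, set the heartbeat parameter to true. The connection then stays open even if the audio is silent for a long time.
- Model rate limits: observe the model's rate-limit rules when calling the model API.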
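The reconnection steps in the list above can be sketched as follows. The run_once callable is hypothetical: it stands for your own function that creates a new Recognition object, installs a callback whose on_error calls reconnect_signal.set(), streams audio while the signal is clear, and cleans up before returning. The retry count and wait time are example values.

```python
import threading
import time


def run_with_reconnect(run_once, max_attempts: int = 3,
                       wait_seconds: float = 2.0, sleep=time.sleep) -> int:
    """Run one recognition session, retrying on failure.

    run_once(reconnect_signal) should return True on normal completion.
    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        # A fresh, thread-safe flag for each attempt
        reconnect_signal = threading.Event()
        try:
            finished = run_once(reconnect_signal)
        except Exception:
            # Treat unexpected exceptions like an on_error notification
            reconnect_signal.set()
            finished = False
        if finished and not reconnect_signal.is_set():
            return attempt
        if attempt < max_attempts:
            sleep(wait_seconds)  # wait before opening a new connection
    raise RuntimeError(f"recognition failed after {max_attempts} attempts")


if __name__ == "__main__":
    attempts = []

    def flaky_session(reconnect_signal: threading.Event) -> bool:
        # Simulated session: the first two attempts hit a network error
        attempts.append(1)
        if len(attempts) < 3:
            reconnect_signal.set()
            return False
        return True

    print(run_with_reconnect(flaky_session, wait_seconds=0.0))  # 3
```

Creating a new Event and a new connection on every attempt keeps a stale error from a previous session from leaking into the next one.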
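Scope
Supported models differ by service deployment scope:
Chinese mainland
When the deployment scope is the Chinese mainland, model inference compute is restricted to the Chinese mainland; static data is stored in the region you select. Supported region for this scope: China North 2 (Beijing).
Use an API key for the Beijing region when calling the following models:
- Fun-ASR:
  - fun-asr-realtime (stable; currently equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2026-02-28 (latest snapshot), fun-asr-realtime-2025-11-07 (snapshot), fun-asr-realtime-2025-09-15 (snapshot)
  - fun-asr-flash-8k-realtime (stable; currently equivalent to fun-asr-flash-8k-realtime-2026-01-28), fun-asr-flash-8k-realtime-2026-01-28
- Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (stable; currently equivalent to qwen3-asr-flash-realtime-2025-10-27), qwen3-asr-flash-realtime-2026-02-10 (latest snapshot), qwen3-asr-flash-realtime-2025-10-27 (snapshot)
- Paraformer: paraformer-realtime-v2, paraformer-realtime-v1, paraformer-realtime-8k-v2, paraformer-realtime-8k-v1
International
When the deployment scope is international, model inference compute is scheduled dynamically worldwide (excluding the Chinese mainland); static data is stored in the region you select. Supported region for this scope: Singapore.
Use an API key for the Singapore region when calling the following models:
- Fun-ASR: fun-asr-realtime (stable; currently equivalent to fun-asr-realtime-2025-11-07), fun-asr-realtime-2025-11-07 (snapshot)
- Qwen3-ASR-Flash-Realtime: qwen3-asr-flash-realtime (stable; currently equivalent to qwen3-asr-flash-realtime-2025-10-27), qwen3-asr-flash-realtime-2026-02-10 (latest snapshot), qwen3-asr-flash-realtime-2025-10-27 (snapshot)
API reference
FAQ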
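Which audio formats does real-time speech recognition support?
Fun-ASR and Paraformer models support the pcm, wav, mp3, opus, speex, aac, and amr formats. For Qwen-ASR models, pcm or opus is recommended; other formats (such as wav, aac, and amr) are accepted by the session.update validation layer, but the server may fail to decode them, so make sure the audio stream is in a recommended format before sending it.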
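As a small illustration of the answer above, the sketch below checks a format against the model family before a session starts. The function name and return strings are made up for this example, and the family is inferred from the model-name prefixes that appear in this document.

```python
# Formats listed for Fun-ASR and Paraformer in this document
FULL_FORMATS = {"pcm", "wav", "mp3", "opus", "speex", "aac", "amr"}
# Formats recommended for Qwen-ASR in this document
QWEN_RECOMMENDED = {"pcm", "opus"}


def check_audio_format(model: str, audio_format: str) -> str:
    """Classify a format for the given model name."""
    fmt = audio_format.lower()
    if model.startswith("qwen3-asr"):
        # Other formats may pass validation but fail to decode
        return "recommended" if fmt in QWEN_RECOMMENDED else "may fail to decode"
    if model.startswith(("fun-asr", "paraformer")):
        return "supported" if fmt in FULL_FORMATS else "unsupported"
    raise ValueError(f"unknown model family: {model}")


if __name__ == "__main__":
    print(check_audio_format("fun-asr-realtime", "mp3"))
    print(check_audio_format("qwen3-asr-flash-realtime", "wav"))
```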
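What is the difference between the SDK and the WebSocket API, and which should I choose?
The DashScope SDK wraps WebSocket connection management, authentication, reconnection, and similar details, which makes it a good fit for quick integration. Connecting directly through the WebSocket API gives finer-grained control and suits programming languages the SDK does not cover or scenarios that need custom connection management. The SDK is the recommended first choice.
How do I improve recognition accuracy for proper nouns?
Use hotwords (supported by Fun-ASR and Paraformer). Hotwords are well suited to improving the recognition rate of fixed vocabulary.
What should I do if the connection drops frequently?
Implement a client-side reconnection mechanism and enable the heartbeat parameter (heartbeat=true) to prevent the connection from dropping when there is no audio for a long time. For the detailed fault-tolerance strategy, see Going to production.
Model application listing and filing
See Application compliance filing.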