Automatic Speech Recognition (ASR) converts human speech in audio and video into text. Automatic Speech Translation combines speech recognition with machine translation to turn speech in one language directly into text in another. These capabilities suit meeting minutes, customer service, media production, market research, and a range of real-time interactive scenarios, and can markedly improve efficiency, service quality, and the human-computer interaction experience.
Speech recognition is also known as speech transcription or speech-to-text.
Recording file recognition
Recording file recognition (also called recording transcription) runs speech recognition over audio and video files, converting the speech to text. It supports both single-file and batch recognition, and suits scenarios that do not require results to be returned immediately.
Application scenarios
Meeting and classroom recordings: convert recordings into text so that key content can later be searched, analyzed, and organized quickly.
Customer service call analysis: automatically record and analyze customer calls to understand customer needs quickly, classify service requests automatically, and even detect customer sentiment, improving service quality and efficiency.
Subtitle generation: help media producers and post-production editors transcribe audio and video material and generate matching subtitles, speeding up the post-production workflow.
Market research and data analysis: run recognition models over recordings gathered in market research, such as consumer interviews and focus-group discussions, to extract consumer opinions and preferences that support business decisions.
Supported models
Qwen ASR (通义千问ASR)
Qwen ASR is a speech recognition model trained on Qwen-Audio; it supports Chinese and English and is currently in beta.
Qwen audio models are billed by the total number of input and output tokens.
Audio-to-token rule: each second of audio corresponds to 25 tokens; audio shorter than 1 second also counts as 25 tokens.
Although Qwen ASR is trained on Qwen-Audio, it does not support multi-turn conversations or custom System and User Prompts.
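As a quick illustration of the billing rule above, here is a small sketch (the helper name is ours, not part of the SDK). We assume fractional seconds round up to the next whole second, which the rule only states explicitly for clips under one second:

import math

def estimate_audio_tokens(duration_seconds: float) -> int:
    # Documented rule: each second of audio corresponds to 25 tokens,
    # and audio shorter than 1 second counts as 25 tokens.
    # Assumption (ours): fractional seconds round up to a whole second.
    return max(1, math.ceil(duration_seconds)) * 25

print(estimate_audio_tokens(0.4))   # 25
print(estimate_audio_tokens(1.84))  # 50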
| Model | Version | Languages | Input | Sample rate | Context length (tokens) | Max input (tokens) | Max output (tokens) | Input/output cost (per 1,000 tokens) | Free quota |
|---|---|---|---|---|---|---|---|---|---|
| qwen-audio-asr (currently identical to qwen-audio-asr-2024-12-04) | Stable | Chinese, English | Audio | 16 kHz | 8,192 | 6,144 | 2,048 | Currently available as a free trial only; once the free quota is used up, the model cannot be called. Watch for later announcements. | 100,000 tokens, valid for 180 days after activating Model Studio (百炼) |
| qwen-audio-asr-latest (always identical to the latest snapshot) | Latest | Same as above | Same | Same | Same | Same | Same | Same | Same |
| qwen-audio-asr-2024-12-04 | Snapshot | Same as above | Same | Same | Same | Same | Same | Same | Same |
Paraformer
Paraformer is built on Tongyi Lab's new-generation non-autoregressive end-to-end architecture, which substantially improves recognition accuracy. Several versions are available; the newer the version (higher version number), the better the results.
| Model | Languages | Sample rate | Scenario | Audio formats | Price | Free quota |
|---|---|---|---|---|---|---|
| paraformer-v2 | Mandarin Chinese, Chinese dialects (Cantonese, Wu, Hokkien, Northeastern Mandarin, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, Shanghainese), English, Japanese, Korean | Any | Live video streams | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | 0.00008 CNY/second | 36,000 seconds (10 hours), granted on the 1st of each month at 00:00, valid for 1 month |
| paraformer-8k-v2 | Mandarin Chinese | 8 kHz | Telephone audio | Same as above | Same as above | Same as above |
| paraformer-v1 | Mandarin Chinese, English | Any | Audio or video | Same as above | Same as above | Same as above |
| paraformer-8k-v1 | Mandarin Chinese | 8 kHz | Telephone audio | Same as above | Same as above | Same as above |
| paraformer-mtl-v1 | Mandarin Chinese, Chinese dialects (Cantonese, Wu, Hokkien, Northeastern Mandarin, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin), English, Japanese, Korean, Spanish, Indonesian, French, German, Italian, Malay | 16 kHz and above | Audio or video | Same as above | Same as above | Same as above |
SenseVoice
The SenseVoice speech recognition model focuses on high-accuracy multilingual recognition, emotion recognition, and audio event detection. It recognizes more than 50 languages, with relative accuracy gains of over 50% for Chinese and Cantonese.
| Model | Languages | Input formats | Price | Free quota |
|---|---|---|---|---|
| sensevoice-v1 | 50+ languages (Chinese, English, Japanese, Korean, Cantonese, etc.) | Audio or video: aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | 0.0007 CNY/second | 36,000 seconds (10 hours), granted on the 1st of each month at 00:00, valid for 1 month |
Model selection guide
- Language support:
  - For Mandarin Chinese and English, prefer Qwen ASR or Paraformer (the latest paraformer-v2) for the best results.
  - For Chinese dialects, Cantonese, Japanese, Korean, Spanish, Indonesian, French, German, Italian, or Malay, prefer Paraformer. The latest paraformer-v2 additionally lets you specify the language (Chinese including Mandarin and many dialects, Cantonese, English, Japanese, Korean); pinning the language lets the system concentrate its algorithms and language model on that language instead of guessing among several candidates, reducing misrecognition.
  - For other languages (Russian, Thai, and so on), choose SenseVoice; see its language list for details.
- File access: Paraformer, SenseVoice, and Qwen ASR can all read a recording from a URL. To read local recording files, choose Qwen ASR.
- Hot-word customization: if certain proper nouns or industry terms in your domain are recognized poorly, you can add them to a custom vocabulary to improve results; this is available with the paraformer-v2 and paraformer-8k-v2 models (see the sketch after this list, and the Paraformer hot-word customization and management documentation).
- Timestamps: to get timestamps along with the recognition result, choose Paraformer or SenseVoice.
- Emotion and event recognition: for emotion recognition (happy <HAPPY>, sad <SAD>, angry <ANGRY>, neutral <NEUTRAL>) and four common audio events (background music <BGM>, speech <Speech>, applause <Applause>, laughter <Laughter>), choose SenseVoice.
- Streaming output: to stream text while inference is still running, choose Qwen ASR.
Qwen ASR is currently in beta. Recognition of more languages (such as Korean, Japanese, Thai, and Spanish, as well as the Wu, Cantonese, and Hokkien dialects), punctuation prediction, hot-word customization, and sensitive-word filtering will be supported in later versions.
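As referenced in the hot-word item above, a rough sketch of hot-word usage with paraformer-v2 follows. The API surface shown here (VocabularyService, create_vocabulary, the vocabulary_id parameter, and the vocabulary entry fields) should be verified against the hot-word customization page; treat these names as assumptions, and the file URL as a placeholder:

from dashscope.audio.asr import VocabularyService, Transcription

# 1. Create a vocabulary of domain terms for the target model
#    (hypothetical flow; check the hot-word documentation).
service = VocabularyService()
vocabulary_id = service.create_vocabulary(
    target_model='paraformer-v2',
    prefix='demo',  # short prefix used to namespace the vocabulary
    vocabulary=[{'text': '贝叶斯', 'weight': 4, 'lang': 'zh'}],
)

# 2. Pass the vocabulary_id when submitting a transcription task.
task = Transcription.async_call(
    model='paraformer-v2',
    file_urls=['https://example.com/your-audio.wav'],  # placeholder URL
    vocabulary_id=vocabulary_id,
)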
Model feature comparison

| | Qwen ASR file recognition | Paraformer file recognition | SenseVoice file recognition |
|---|---|---|---|
| Access methods | Python, Java, RESTful | Java, Python, RESTful | Java, Python, RESTful |
| Hot-word customization | Not supported | Supported | Not supported |
| Emotion and event recognition | Not supported | Not supported | Supported: four emotions (NEUTRAL, HAPPY, ANGRY, SAD) and four common audio events (Speech, Applause, BGM, Laughter) |
| Sensitive-word filtering | Not supported | Supported | Not supported |
| Filler-word filtering | Not supported | Supported | Not supported |
| Automatic speaker diarization | Not supported | Supported | Not supported |
| Speaker-count hint | Not supported | Supported | Not supported |
| Timestamps | Not supported | Supported | Supported |
| Streaming input | Not supported | Not supported | Not supported |
| Streaming output | Supported | Not supported | Not supported |
| Local file recognition | Supported | Not supported; only publicly accessible file URLs can be passed | Not supported; only publicly accessible file URLs can be passed |
| Punctuation prediction | Not supported | Supported | Supported |
| Audio formats | wav, mp3, aac, flac, ogg, aiff, au, midi, wma, realaudio, vqf, oggvorbis, ape | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Channels | Mono | Unrestricted | Unrestricted |
| Sample rate | 16 kHz | Varies by model: paraformer-v2 and paraformer-v1: any; paraformer-8k-v2 and paraformer-8k-v1: 8 kHz; paraformer-mtl-v1: 16 kHz and above | Any |
| File size | Up to 10 MB | Up to 100 file URLs per request; each file at most 2 GB; see Input file limits | Up to 100 file URLs per request; each file at most 2 GB; see Input file limits |
| Languages | Chinese, English | Varies by model (see the Paraformer model table above): paraformer-v2: Mandarin, Chinese dialects, English, Japanese, Korean; paraformer-v1: Mandarin, English; paraformer-8k-v2 and paraformer-8k-v1: Mandarin; paraformer-mtl-v1: Mandarin, Chinese dialects, English, Japanese, Korean, Spanish, Indonesian, French, German, Italian, Malay | 50+ languages (Chinese, English, Japanese, Korean, Cantonese, etc.) |
| Price | Free trial for a limited time | 0.288 CNY/hour | 2.52 CNY/hour |
| Free quota | 100,000 tokens within 180 days of activating Model Studio | 10 hours/month | 10 hours/month |
Quick start
You can try it online first: on the speech models page, select the Paraformer speech recognition v2 model and click Try now (立即体验).
The following is sample code for calling the API.
You must have obtained an API Key and configured it as an environment variable. To call through an SDK, you also need to install the DashScope SDK.
Example code
Qwen ASR
The Qwen ASR model does not support multi-turn conversations or custom prompts (System Prompt and User Prompt).
Audio file URL
Python
import dashscope
messages = [
{
"role": "user",
"content": [
{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
]
}
]
response = dashscope.MultiModalConversation.call(
model="qwen-audio-asr",
messages=messages,
result_format="message")
print(response)
The full result is printed to the console as JSON: the status code, a unique request ID, the recognized text, and token usage for this call.
{
"status_code": 200,
"request_id": "802e87ff-1875-99cd-96c0-16a50338836a",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "欢迎使用阿里云"
}
]
}
}
]
},
"usage": {
"input_tokens": 74,
"output_tokens": 7,
"audio_tokens": 46
}
}
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder()
.role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3")))
.build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.model("qwen-audio-asr")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
The full result is printed to the console as JSON: the status code, a unique request ID, the recognized text, and token usage for this call.
{
"requestId": "9111d579-0e6f-9b78-bec5-07f01983c3b7",
"usage": {
"input_tokens": 74,
"output_tokens": 7
},
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "欢迎使用阿里云"
}
]
}
}
]
}
}
curl
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen-audio-asr",
"input":{
"messages":[
{
"role": "user",
"content": [
{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}
]
}
]
}
}'
The full result is printed to the console as JSON: the status code, a unique request ID, the recognized text, and token usage for this call.
{
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "欢迎使用阿里云"
}
]
}
}
]
},
"usage": {
"audio_tokens": 46,
"output_tokens": 7,
"input_tokens": 74
},
"request_id": "4edb7418-f01d-938c-9cac-b6f3abcd0173"
}
Local files
When using the DashScope SDK to process a local audio file, you pass a file path. Build the path according to your SDK and operating system as shown below.

| OS | SDK | File path to pass | Example |
|---|---|---|---|
| Linux or macOS | Python SDK | file://{absolute path} | file:///home/audio/welcome.mp3 |
| Linux or macOS | Java SDK | file://{absolute path} | file:///home/audio/welcome.mp3 |
| Windows | Python SDK | file://{absolute path} | file://D:/audio/welcome.mp3 |
| Windows | Java SDK | file:///{absolute path} | file:///D:/audio/welcome.mp3 |
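If you prefer not to assemble these prefixes by hand, a small helper like the one below (ours, not part of the SDK) produces the Python SDK's expected form on both platforms; note that the Java SDK on Windows expects the extra slash shown in the table:

from pathlib import Path

def to_file_uri(path: str) -> str:
    # resolve() makes the path absolute; as_posix() uses forward slashes,
    # yielding file:///home/... on Linux/macOS and file://D:/... on Windows.
    return "file://" + Path(path).resolve().as_posix()

audio_file_path = to_file_uri("welcome.mp3")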
Python
from dashscope import MultiModalConversation
# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local audio file
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
{
"role": "user",
"content": [{"audio": audio_file_path}],
}
]
response = MultiModalConversation.call(model="qwen-audio-asr", messages=messages)
print(response)
The full result is printed to the console as JSON: the status code, a unique request ID, the recognized text, and token usage for this call.
{
"status_code": 200,
"request_id": "6cd77dbd-fa2a-9167-94dc-9a395815beaa",
"code": "",
"message": "",
"output": {
"text": null,
"finish_reason": null,
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "欢迎使用阿里云"
}
]
}
}
]
},
"usage": {
"input_tokens": 74,
"output_tokens": 7,
"audio_tokens": 46
}
}
Java
import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import com.alibaba.dashscope.utils.JsonUtils;
public class Main {
public static void callWithLocalFile()
throws ApiException, NoApiKeyException, UploadFileException {
// Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path to your local file
String localFilePath = "file://ABSOLUTE_PATH/welcome.mp3";
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(new HashMap<String, Object>(){{put("audio", localFilePath);}}
))
.build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.model("qwen-audio-asr")
.message(userMessage)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(JsonUtils.toJson(result));
}
public static void main(String[] args) {
try {
callWithLocalFile();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
The full result is printed to the console as JSON: the status code, a unique request ID, the recognized text, and token usage for this call.
{
"requestId": "ae2735b1-3393-9917-af74-d9c4929b6c0f",
"usage": {
"input_tokens": 74,
"output_tokens": 7
},
"output": {
"choices": [
{
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": [
{
"text": "欢迎使用阿里云"
}
]
}
}
]
}
}
Streaming output
The model does not generate the final result in one pass; it produces intermediate results step by step, and the final result is the concatenation of those intermediate results. In non-streaming mode you wait for generation to finish before the concatenated result is returned; in streaming mode, intermediate results are returned as they are produced, so you can read along while the model is still generating and spend less time waiting. Enable streaming according to how you call the API:
- DashScope Python SDK: set the stream parameter to True.
- DashScope Java SDK: call through the streamCall interface.
- DashScope HTTP: set the X-DashScope-SSE header to enable.
Python
import dashscope
messages = [
{
"role": "user",
"content": [
{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"},
]
}
]
response = dashscope.MultiModalConversation.call(
model="qwen-audio-asr",
messages=messages,
result_format="message",
stream=True
)
full_content = ""
print("Streaming output:")
for chunk in response:
    try:
        text = chunk["output"]["choices"][0]["message"].content[0]["text"]
        print(text)
        full_content += text
    except (KeyError, IndexError, TypeError):
        # Some chunks may carry no text content; skip them
        pass
print(f"Full text: {full_content}")
The intermediate recognition results are printed to the console as strings.
Streaming output:
欢迎
使用阿里云
Full text: 欢迎使用阿里云
Java
import java.util.Arrays;
import java.util.HashMap;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
import io.reactivex.Flowable;
public class Main {
public static void streamCall()
throws ApiException, NoApiKeyException, UploadFileException {
MultiModalConversation conv = new MultiModalConversation();
// must create mutable map.
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(new HashMap<String, Object>(){{put("audio", "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3");}}
)).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
.model("qwen-audio-asr")
.message(userMessage)
.incrementalOutput(true)
.build();
Flowable<MultiModalConversationResult> result = conv.streamCall(param);
result.blockingForEach(item -> {
try {
System.out.println(item.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
} catch (Exception e){
System.exit(0);
}
});
}
public static void main(String[] args) {
try {
streamCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}
The intermediate recognition results are printed to the console as strings.
欢迎
使用阿里云
curl
curl -X POST 'https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-H 'X-DashScope-SSE: enable' \
-d '{
"model": "qwen-audio-asr",
"input":{
"messages":[
{
"role": "user",
"content": [
{"audio": "https://dashscope.oss-cn-beijing.aliyuncs.com/audios/welcome.mp3"}
]
}
]
},
"parameters": {
"incremental_output": true
}
}'
The intermediate recognition results are printed to the console as JSON.
id:1
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"欢迎"}],"role":"assistant"},"finish_reason":"null"}]},"usage":{"audio_tokens":46,"input_tokens":74,"output_tokens":4},"request_id":"74e94b68-fc16-97cc-acc3-0d6bcf5e64fe"}
id:2
event:result
:HTTP_STATUS/200
data:{"output":{"choices":[{"message":{"content":[{"text":"使用阿里云"}],"role":"assistant"},"finish_reason":"stop"}]},"usage":{"audio_tokens":46,"input_tokens":74,"output_tokens":7},"request_id":"74e94b68-fc16-97cc-acc3-0d6bcf5e64fe"}
Paraformer
Because audio and video files are usually large, and both file transfer and recognition take time, the file-transcription API submits tasks asynchronously; you obtain the recognition result from the query interface once transcription completes.
import json
from urllib import request
from http import HTTPStatus
import dashscope
# If the API Key is not configured as an environment variable, uncomment the next line and replace apiKey with your own API Key
# dashscope.api_key = "apiKey"
task_response = dashscope.audio.asr.Transcription.async_call(
model='paraformer-v2',
file_urls=[
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'
],
language_hints=['zh', 'en'])
transcription_response = dashscope.audio.asr.Transcription.wait(
task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
for transcription in transcription_response.output['results']:
url = transcription['transcription_url']
result = json.loads(request.urlopen(url).read().decode('utf8'))
print(json.dumps(result, indent=4, ensure_ascii=False))
print('transcription done!')
else:
print('Error: ', transcription_response.output.message)
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.*;
import java.net.HttpURLConnection;
import java.util.Arrays;
import java.util.List;
public class Main {
public static void main(String[] args) {
// Create the transcription request parameters
TranscriptionParam param =
TranscriptionParam.builder()
// If the API Key is not configured as an environment variable, uncomment the next line and replace apiKey with your own API Key
// .apiKey("apikey")
.model("paraformer-v2")
// "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models
.parameter("language_hints", new String[]{"zh", "en"})
.fileUrls(
Arrays.asList(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
.build();
try {
Transcription transcription = new Transcription();
// Submit the transcription request
TranscriptionResult result = transcription.asyncCall(param);
// Wait for transcription to finish
System.out.println("RequestId: " + result.getRequestId());
result = transcription.wait(
TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
// Fetch the transcription results
List<TranscriptionTaskResult> taskResultList = result.getResults();
if (taskResultList != null && taskResultList.size() > 0) {
for (TranscriptionTaskResult taskResult : taskResultList) {
String transcriptionUrl = taskResult.getTranscriptionUrl();
HttpURLConnection connection =
(HttpURLConnection) new URL(transcriptionUrl).openConnection();
connection.setRequestMethod("GET");
connection.connect();
BufferedReader reader =
new BufferedReader(new InputStreamReader(connection.getInputStream()));
Gson gson = new GsonBuilder().setPrettyPrinting().create();
System.out.println(gson.toJson(gson.fromJson(reader, JsonObject.class)));
}
}
} catch (Exception e) {
System.out.println("error: " + e);
}
System.exit(0);
}
}
The full recognition result is printed to the console as JSON, including the transcribed text and its start and end times (in milliseconds) within the audio/video file.
{
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
"properties": {
"audio_format": "pcm_s16le",
"channels": [
0
],
"original_sampling_rate": 16000,
"original_duration_in_milliseconds": 4726
},
"transcripts": [
{
"channel_id": 0,
"content_duration_in_milliseconds": 4720,
"text": "Hello world, 这里是阿里巴巴语音实验室。",
"sentences": [
{
"begin_time": 0,
"end_time": 4720,
"text": "Hello world, 这里是阿里巴巴语音实验室。",
"words": [
{
"begin_time": 0,
"end_time": 472,
"text": "Hello ",
"punctuation": ""
},
{
"begin_time": 472,
"end_time": 944,
"text": "world",
"punctuation": ", "
},
{
"begin_time": 944,
"end_time": 1573,
"text": "这里",
"punctuation": ""
},
{
"begin_time": 1573,
"end_time": 2202,
"text": "是阿",
"punctuation": ""
},
{
"begin_time": 2202,
"end_time": 2831,
"text": "里巴",
"punctuation": ""
},
{
"begin_time": 2831,
"end_time": 3460,
"text": "巴语",
"punctuation": ""
},
{
"begin_time": 3460,
"end_time": 4089,
"text": "音实",
"punctuation": ""
},
{
"begin_time": 4089,
"end_time": 4720,
"text": "验室",
"punctuation": "。"
}
]
}
]
}
]
}
{
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
"properties": {
"audio_format": "pcm_s16le",
"channels": [
0
],
"original_sampling_rate": 16000,
"original_duration_in_milliseconds": 3834
},
"transcripts": [
{
"channel_id": 0,
"content_duration_in_milliseconds": 3820,
"text": "Hello world, 这里是阿里巴巴语音实验室。",
"sentences": [
{
"begin_time": 0,
"end_time": 3820,
"text": "Hello world, 这里是阿里巴巴语音实验室。",
"words": [
{
"begin_time": 0,
"end_time": 382,
"text": "Hello ",
"punctuation": ""
},
{
"begin_time": 382,
"end_time": 764,
"text": "world",
"punctuation": ", "
},
{
"begin_time": 764,
"end_time": 1273,
"text": "这里",
"punctuation": ""
},
{
"begin_time": 1273,
"end_time": 1782,
"text": "是阿",
"punctuation": ""
},
{
"begin_time": 1782,
"end_time": 2291,
"text": "里巴",
"punctuation": ""
},
{
"begin_time": 2291,
"end_time": 2800,
"text": "巴语",
"punctuation": ""
},
{
"begin_time": 2800,
"end_time": 3309,
"text": "音实",
"punctuation": ""
},
{
"begin_time": 3309,
"end_time": 3820,
"text": "验室",
"punctuation": "。"
}
]
}
]
}
]
}
transcription done!
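For the subtitle-generation scenario mentioned at the start of this topic, the sentence-level begin_time/end_time fields in results like the one above map directly onto SRT cues. Below is a minimal sketch (our own helper, not part of the SDK) that converts one parsed transcription result into SRT text:

def ms_to_srt(ms: int) -> str:
    # SRT timestamps have the form HH:MM:SS,mmm
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def transcription_to_srt(result: dict) -> str:
    # `result` is the parsed transcription JSON shown above
    lines = []
    index = 1
    for transcript in result["transcripts"]:
        for sentence in transcript["sentences"]:
            lines.append(str(index))
            lines.append(f"{ms_to_srt(sentence['begin_time'])} --> {ms_to_srt(sentence['end_time'])}")
            lines.append(sentence["text"])
            lines.append("")
            index += 1
    return "\n".join(lines)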
SenseVoice
Because audio and video files are usually large, and both file transfer and recognition take time, the file-transcription API submits tasks asynchronously; you obtain the recognition result from the query interface once transcription completes.
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html
import re
import json
from urllib import request
from http import HTTPStatus
import dashscope
# Replace your-dashscope-api-key with your own API Key
dashscope.api_key = 'your-dashscope-api-key'
def parse_sensevoice_result(data, keep_trans=True, keep_emotions=True, keep_events=True):
    '''
    Parse a SenseVoice recognition result.
    keep_trans: whether to keep the transcribed text (default True)
    keep_emotions: whether to keep emotion tags (default True)
    keep_events: whether to keep event tags (default True)
    '''
# Tags to keep
emotion_list = ['NEUTRAL', 'HAPPY', 'ANGRY', 'SAD']
event_list = ['Speech', 'Applause', 'BGM', 'Laughter']
# All supported tags
all_tags = ['Speech', 'Applause', 'BGM', 'Laughter',
'NEUTRAL', 'HAPPY', 'ANGRY', 'SAD', 'SPECIAL_TOKEN_1']
tags_to_cleanup = []
for tag in all_tags:
tags_to_cleanup.append(f'<|{tag}|> ')
tags_to_cleanup.append(f'<|/{tag}|>')
tags_to_cleanup.append(f'<|{tag}|>')
def get_clean_text(text: str):
for tag in tags_to_cleanup:
text = text.replace(tag, '')
pattern = r"\s{2,}"
text = re.sub(pattern, " ", text).strip()
return text
for item in data['transcripts']:
for sentence in item['sentences']:
if keep_emotions:
# Extract emotions
emotions_pattern = r'<\|(' + '|'.join(emotion_list) + r')\|>'
emotions = re.findall(emotions_pattern, sentence['text'])
sentence['emotion'] = list(set(emotions))
if not sentence['emotion']:
sentence.pop('emotion', None)
if keep_events:
# Extract events
events_pattern = r'<\|(' + '|'.join(event_list) + r')\|>'
events = re.findall(events_pattern, sentence['text'])
sentence['event'] = list(set(events))
if not sentence['event']:
sentence.pop('event', None)
if keep_trans:
# Extract the plain text
sentence['text'] = get_clean_text(sentence['text'])
else:
sentence.pop('text', None)
if keep_trans:
item['text'] = get_clean_text(item['text'])
else:
item.pop('text', None)
item['sentences'] = list(filter(lambda x: 'text' in x or 'emotion' in x or 'event' in x, item['sentences']))
return data
task_response = dashscope.audio.asr.Transcription.async_call(
model='sensevoice-v1',
file_urls=[
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav'],
language_hints=['en'], )
print('task_id: ', task_response.output.task_id)
transcription_response = dashscope.audio.asr.Transcription.wait(
task=task_response.output.task_id)
if transcription_response.status_code == HTTPStatus.OK:
for transcription in transcription_response.output['results']:
if transcription['subtask_status'] == 'SUCCEEDED':
url = transcription['transcription_url']
result = json.loads(request.urlopen(url).read().decode('utf8'))
print(json.dumps(parse_sensevoice_result(result, keep_trans=False, keep_emotions=False), indent=4,
ensure_ascii=False))
else:
print('transcription failed!')
print(transcription)
print('transcription done!')
else:
print('Error: ', transcription_response.output.message)
package org.example.recognition;
import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
class SenseVoiceParser {
private static final List<String> EMOTION_LIST = Arrays.asList("NEUTRAL", "HAPPY", "ANGRY", "SAD");
private static final List<String> EVENT_LIST = Arrays.asList("Speech", "Applause", "BGM", "Laughter");
private static final List<String> ALL_TAGS = Arrays.asList(
"Speech", "Applause", "BGM", "Laughter", "NEUTRAL", "HAPPY", "ANGRY", "SAD", "SPECIAL_TOKEN_1");
/**
* Parse a SenseVoice recognition result.
* @param data the SenseVoice transcription result as JSON
* @param keepTrans whether to keep the transcribed text
* @param keepEmotions whether to keep emotion tags
* @param keepEvents whether to keep event tags
* @return the parsed result
*/
public static JsonObject parseSenseVoiceResult(JsonObject data, boolean keepTrans, boolean keepEmotions, boolean keepEvents) {
List<String> tagsToCleanup = ALL_TAGS.stream()
.flatMap(tag -> Stream.of("<|" + tag + "|> ", "<|/" + tag + "|>", "<|" + tag + "|>"))
.collect(Collectors.toList());
JsonArray transcripts = data.getAsJsonArray("transcripts");
for (JsonElement transcriptElement : transcripts) {
JsonObject transcript = transcriptElement.getAsJsonObject();
JsonArray sentences = transcript.getAsJsonArray("sentences");
for (JsonElement sentenceElement : sentences) {
JsonObject sentence = sentenceElement.getAsJsonObject();
String text = sentence.get("text").getAsString();
if (keepEmotions) {
extractTags(sentence, text, EMOTION_LIST, "emotion");
}
if (keepEvents) {
extractTags(sentence, text, EVENT_LIST, "event");
}
if (keepTrans) {
String cleanText = getCleanText(text, tagsToCleanup);
sentence.addProperty("text", cleanText);
} else {
sentence.remove("text");
}
}
if (keepTrans) {
transcript.addProperty("text", getCleanText(transcript.get("text").getAsString(), tagsToCleanup));
} else {
transcript.remove("text");
}
JsonArray filteredSentences = new JsonArray();
for (JsonElement sentenceElement : sentences) {
JsonObject sentence = sentenceElement.getAsJsonObject();
if (sentence.has("text") || sentence.has("emotion") || sentence.has("event")) {
filteredSentences.add(sentence);
}
}
transcript.add("sentences", filteredSentences);
}
return data;
}
private static void extractTags(JsonObject sentence, String text, List<String> tagList, String key) {
String pattern = "<\\|(" + String.join("|", tagList) + ")\\|>";
Pattern compiledPattern = Pattern.compile(pattern);
Matcher matcher = compiledPattern.matcher(text);
Set<String> tags = new HashSet<>();
while (matcher.find()) {
tags.add(matcher.group(1));
}
if (!tags.isEmpty()) {
JsonArray tagArray = new JsonArray();
tags.forEach(tagArray::add);
sentence.add(key, tagArray);
} else {
sentence.remove(key);
}
}
private static String getCleanText(String text, List<String> tagsToCleanup) {
for (String tag : tagsToCleanup) {
text = text.replace(tag, "");
}
return text.replaceAll("\\s{2,}", " ").trim();
}
}
public class Main {
public static void main(String[] args) {
// Create the transcription request parameters
TranscriptionParam param =
TranscriptionParam.builder()
// Replace your-dashscope-api-key with your own API Key
.apiKey("your-dashscope-api-key")
.model("sensevoice-v1")
.fileUrls(
Arrays.asList(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav"))
.parameter("language_hints", new String[] {"en"})
.build();
try {
Transcription transcription = new Transcription();
// Submit the transcription request
TranscriptionResult result = transcription.asyncCall(param);
System.out.println("requestId: " + result.getRequestId());
// Wait for transcription to finish
result = transcription.wait(
TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
// Fetch the transcription results
List<TranscriptionTaskResult> taskResultList = result.getResults();
if (taskResultList != null && taskResultList.size() > 0) {
for (TranscriptionTaskResult taskResult : taskResultList) {
String transcriptionUrl = taskResult.getTranscriptionUrl();
HttpURLConnection connection =
(HttpURLConnection) new URL(transcriptionUrl).openConnection();
connection.setRequestMethod("GET");
connection.connect();
BufferedReader reader =
new BufferedReader(new InputStreamReader(connection.getInputStream()));
Gson gson = new GsonBuilder().setPrettyPrinting().create();
JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
System.out.println(gson.toJson(jsonResult));
System.out.println(gson.toJson(SenseVoiceParser.parseSenseVoiceResult(jsonResult.getAsJsonObject(), true, true, true)));
}
}
} catch (Exception e) {
System.out.println("error: " + e);
}
System.exit(0);
}
}
The full recognition result is printed to the console as JSON, including the transcribed text and its start and end times (in milliseconds) within the audio/video file. In this example a speech event was also detected (<|Speech|> and <|/Speech|> mark the start and end of the speech segment), along with an emotion (<|ANGRY|>).
{
"file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav",
"properties": {
"audio_format": "pcm_s16le",
"channels": [
0
],
"original_sampling_rate": 16000,
"original_duration_in_milliseconds": 17645
},
"transcripts": [
{
"channel_id": 0,
"content_duration_in_milliseconds": 12710,
"text": "<|Speech|> Senior staff, Principal Doris Jackson, Wakefield faculty, and of course, my fellow classmates. <|/Speech|> <|ANGRY|><|Speech|> I am honored to have been chosen to speak before my classmates, as well as the students across America today. <|/Speech|>",
"sentences": [
{
"begin_time": 0,
"end_time": 7060,
"text": "<|Speech|> Senior staff, Principal Doris Jackson, Wakefield faculty, and of course, my fellow classmates. <|/Speech|> <|ANGRY|>"
},
{
"begin_time": 11980,
"end_time": 17630,
"text": "<|Speech|> I am honored to have been chosen to speak before my classmates, as well as the students across America today. <|/Speech|>"
}
]
}
]
}
transcription done!
Input file limits

| | Qwen ASR | Paraformer | SenseVoice |
|---|---|---|---|
| Input method | Audio file URL or local file | Audio file URL | Audio file URL |
| Number of files | 1 | Up to 100 | Up to 100 |
| File size | Each URL or local file: at most 10 MB and no longer than 3 minutes | Each URL: at most 2 GB, no duration limit | Each URL: at most 2 GB, no duration limit |
| File formats | wav, mp3, aac, flac, ogg, aiff, au, midi, wma, realaudio, vqf, oggvorbis, ape | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv | aac, amr, avi, flac, flv, m4a, mkv, mov, mp3, mp4, mpeg, ogg, opus, wav, webm, wma, wmv |
| Sample rate | 16,000 Hz | paraformer-v2 and paraformer-v1: unrestricted; other models: 16,000 Hz and above for audio/video, 8,000 Hz and above for telephone audio | Any |
File size: if a file exceeds the limits, you can preprocess it to reduce its size; see Best Practices.
Audio formats: given the many audio/video formats and their variants, correct recognition of every format cannot be guaranteed. Test your files to verify that they produce normal recognition results.
Sample rate: the sample rate is the number of times per second the sound signal is sampled. Higher sample rates carry more information and can improve recognition accuracy, but excessively high rates may introduce irrelevant information and hurt results. Choose a model that matches the actual sample rate; for example, 8,000 Hz audio should go directly to a model that supports 8,000 Hz rather than being converted to 16,000 Hz.
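As one possible preprocessing approach (not prescribed by this documentation), the sketch below shells out to ffmpeg, which is assumed to be installed, to extract a compact mono 16 kHz MP3 track from a large video before submission:

import subprocess

def to_asr_friendly_audio(src: str, dst: str = "audio_16k.mp3") -> str:
    # -vn: drop video; -ac 1: mono; -ar 16000: 16 kHz; -b:a 48k: modest bitrate
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", "-b:a", "48k", dst],
        check=True,
    )
    return dst

# to_asr_friendly_audio("meeting.mp4")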
Real-time speech recognition
Real-time speech recognition converts an audio stream into text as it arrives, producing text while the speaker is still talking. It can recognize live microphone input as well as transcribe local audio files in real time.
Application scenarios
Meetings: real-time transcription for meetings, lectures, training, and court hearings.
Live streaming: real-time subtitles for live commerce, sports broadcasts, and other streams.
Customer service: record call content in real time to help improve service quality.
Gaming: let players dictate or read chat content without pausing their gameplay.
Social chat: automatically convert voice to text in social apps and input methods.
Human-computer interaction: convert spoken dialogue into text to improve the interaction experience.
Supported models
Paraformer
模型名称 | 支持的语言 | 支持的采样率 | 适用场景 | 支持的音频格式 | 单价 | 免费额度 |
paraformer-realtime-v2 | 中文普通话、中文方言(粤语、吴语、闽南语、东北话、甘肃话、贵州话、河南话、湖北话、湖南话、宁夏话、山西话、陕西话、山东话、四川话、天津话、江西话、云南话、上海话)、英语、日语、韩语 支持多个语种自由切换 | 任意 | 视频直播、会议等 | pcm、wav、mp3、opus、speex、aac、amr | 0.00024元/秒 | 36,000秒(10小时) 每月1日0点自动发放 有效期1个月 |
paraformer-realtime-v1 | 中文 | 16kHz | ||||
paraformer-realtime-8k-v2 | 8kHz | 电话客服等 | ||||
paraformer-realtime-8k-v1 |
Gummy
| Model | Languages | Sample rate | Scenario | Audio formats | Price | Free quota |
|---|---|---|---|---|---|---|
| gummy-realtime-v1 | Chinese, English, Japanese, Korean, Cantonese, German, French, Russian, Italian, Spanish. Translation pairs: Chinese → English/Japanese/Korean; English → Chinese/Japanese/Korean; Japanese/Korean/Cantonese/German/French/Russian/Italian/Spanish → Chinese/English | 16 kHz and above | Long, uninterrupted recognition such as conference speeches and live video streams | pcm, wav, mp3, opus, speex, aac, amr | 0.00015 CNY/second | 36,000 seconds (10 hours). Model Studio activated before 2025-01-17 00:00: valid until 2025-07-15; activated afterwards: valid for 180 days from activation |
| gummy-chat-v1 | Same as above | 16 kHz | Short voice interactions such as chat, voice commands, voice input methods, and voice search | Same as above | Same as above | Same as above |
Model selection guide
- Language support:
  - For mixed-language audio, Gummy is recommended; it delivers higher recognition accuracy and is also more accurate on uncommon words.
  - For Mandarin Chinese, Cantonese, English, Japanese, or Korean, choose either Gummy or Paraformer.
  - For German, French, Russian, Italian, or Spanish, choose Gummy.
  - For Chinese dialects, choose Paraformer.
- Noisy environments: Paraformer is recommended.
- Emotion recognition and filler-word filtering: if you need these capabilities, choose Paraformer.
Model feature comparison
| | Gummy real-time recognition | Paraformer real-time recognition |
|---|---|---|
| Access methods | Python, Java, WebSocket | Python, Java, WebSocket |
| Hot-word customization | Supported | Supported |
| Emotion recognition | Not supported | Supported by paraformer-realtime-8k-v2 only |
| Filler-word filtering | Not supported | Supported |
| Timestamps | Supported | Supported |
| Streaming input | Supported | Supported |
| Streaming output | Supported | Supported |
| Local file recognition | Supported | Supported |
| Punctuation prediction | Supported | Supported |
| Audio formats | pcm, wav, mp3, opus, speex, aac, amr | pcm, wav, mp3, opus, speex, aac, amr |
| Channels | Mono | Mono |
| Sample rate | Varies by model: gummy-realtime-v1: 16 kHz and above; gummy-chat-v1: 16 kHz | Varies by model: paraformer-realtime-v2: any; paraformer-realtime-v1: 16 kHz; paraformer-realtime-8k-v2 and paraformer-realtime-8k-v1: 8 kHz |
| Audio duration | Varies by model: gummy-realtime-v1: unlimited; gummy-chat-v1: within 1 minute | Unlimited |
| Languages | Chinese, English, Japanese, Korean, Cantonese, German, French, Russian, Italian, Spanish | Varies by model: paraformer-realtime-v2: Mandarin, Chinese dialects, English, Japanese, Korean; other models: Chinese |
| Price | 0.54 CNY/hour | 0.864 CNY/hour |
| Free quota | 36,000 seconds (10 hours); Model Studio activated before 2025-01-17 00:00: valid until 2025-07-15; activated afterwards: valid for 180 days from activation | 10 hours/month |
Quick start
You can try it online first: on the speech recognition page, select the Paraformer real-time speech recognition v2 model and click Try now (立即体验).
The following is sample code for calling the API. For more code examples covering common scenarios, see GitHub.
You must have obtained an API Key and configured it as an environment variable. To call through an SDK, you also need to install the DashScope SDK.
Example code
Gummy
Real-time recognition
Real-time recognition handles long speech streams (captured from an external device such as a microphone, or read from a local file) and returns recognition results as a stream.
Recognizing speech from a microphone
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
public class Main {
public static void main(String[] args) throws NoApiKeyException {
// Create a Flowable<ByteBuffer>
String targetLanguage = "en";
Flowable<ByteBuffer> audioSource =
Flowable.create(
emitter -> {
new Thread(
() -> {
try {
// Create the audio format
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
// Match the default recording device to this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
// Start recording
targetDataLine.start();
System.out.println("Speak into the microphone to try real-time speech recognition and translation");
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
// Record for 50 s and recognize in real time
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
// Send the recorded audio to the streaming recognition service
emitter.onNext(buffer);
buffer = ByteBuffer.allocate(1024);
// Recording is rate-limited; sleep briefly to keep CPU usage low
Thread.sleep(20);
}
}
// Signal completion
emitter.onComplete();
} catch (Exception e) {
emitter.onError(e);
}
})
.start();
},
BackpressureStrategy.BUFFER);
// Create the recognizer
TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
// Create the request parameters; the Flowable<ByteBuffer> created above is passed as the audio source
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-realtime-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 8000、16000
.transcriptionEnabled(true)
.sourceLanguage("auto")
.translationEnabled(true)
.translationLanguages(new String[] {targetLanguage})
.build();
// Call the streaming interface
translator
.streamCall(param, audioSource)
// Subscribe to the results via the Flowable
.blockingForEach(
result -> {
if (result.getTranscriptionResult() == null) {
return;
}
try {
System.out.println("RequestId: " + result.getRequestId());
// Print the results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
}
}
} catch (Exception e) {
e.printStackTrace();
}
});
System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import pyaudio
import dashscope
from dashscope.audio.asr import *
# If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
# dashscope.api_key = "your-api-key"
mic = None
stream = None
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback open.")
mic = pyaudio.PyAudio()
stream = mic.open(
format=pyaudio.paInt16, channels=1, rate=16000, input=True
)
def on_close(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback close.")
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.stash is not None:
print(
"translate to english stash: ",
translation_result.get_translation("en").stash.text,
)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
if transcription_result.stash is not None:
print("transcription stash: ", transcription_result.stash.text)
callback = Callback()
translator = TranslationRecognizerRealtime(
model="gummy-realtime-v1",
format="pcm",
sample_rate=16000,
transcription_enabled=True,
translation_enabled=True,
translation_target_languages=["en"],
callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验实时语音识别和翻译功能")
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
translator.send_audio_frame(data)
else:
break
translator.stop()
Recognizing local files
The audio file used in this example: hello_world.wav.
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class RealtimeTranslateTask implements Runnable {
private Path filepath;
public RealtimeTranslateTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
String targetLanguage = "en";
// Create translation params
// you can customize the translation parameters, like model, format,
// sample_rate for more information, please refer to
// https://help.aliyun.com/document_detail/2712536.html
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-realtime-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 8000、16000
.transcriptionEnabled(true)
.sourceLanguage("auto")
.translationEnabled(true)
.translationLanguages(new String[] {targetLanguage})
.build();
TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
CountDownLatch latch = new CountDownLatch(1);
String threadName = Thread.currentThread().getName();
ResultCallback<TranslationRecognizerResult> callback =
new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
System.out.println("RequestId: " + result.getRequestId());
// Print the results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:"+result);
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
}
}
}
@Override
public void onComplete() {
System.out.println("[" + threadName + "] Translation complete");
latch.countDown();
}
@Override
public void onError(Exception e) {
e.printStackTrace();
System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
}
};
// set param & callback
translator.call(param, callback);
// Please replace the path with your audio file path
System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
// Read file and send audio by chunks
try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
// 3200 bytes ≈ 100 ms of 16 kHz, 16-bit mono audio per chunk
byte[] buffer = new byte[3200];
int bytesRead;
// Loop to read chunks of the file
while ((bytesRead = fis.read(buffer)) != -1) {
ByteBuffer byteBuffer;
// Handle the last chunk which might be smaller than the buffer size
System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
if (bytesRead < buffer.length) {
byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
} else {
byteBuffer = ByteBuffer.wrap(buffer);
}
// Send the ByteBuffer to the translation instance
translator.sendAudioFrame(byteBuffer);
buffer = new byte[3200];
Thread.sleep(100);
}
System.out.println(LocalDateTime.now());
} catch (Exception e) {
e.printStackTrace();
}
translator.stop();
// wait for the translation to complete
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
public class Main {
public static void main(String[] args)
throws NoApiKeyException, InterruptedException {
String currentDir = System.getProperty("user.dir");
// Please replace the path with your audio source
Path[] filePaths = {
Paths.get(currentDir, "hello_world.wav"),
// Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
};
// Use ThreadPool to run recognition tasks
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (Path filepath:filePaths) {
executorService.submit(new RealtimeTranslateTask(filepath));
}
executorService.shutdown();
// wait for all tasks to complete
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import os
import requests
from http import HTTPStatus
import dashscope
from dashscope.audio.asr import *
# If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
# dashscope.api_key = "your-api-key"
r = requests.get(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
f.write(r.content)
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
print("TranslationRecognizerCallback open.")
def on_close(self) -> None:
print("TranslationRecognizerCallback close.")
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.stash is not None:
print(
"translate to english stash: ",
translation_result.get_translation("en").stash.text,
)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
if transcription_result.stash is not None:
print("transcription stash: ", transcription_result.stash.text)
def on_error(self, message) -> None:
print('error: {}'.format(message))
def on_complete(self) -> None:
print('TranslationRecognizerCallback complete')
callback = Callback()
translator = TranslationRecognizerRealtime(
model="gummy-realtime-v1",
format="wav",
sample_rate=16000,
callback=callback,
)
translator.start()
try:
audio_data: bytes = None
f = open("asr_example.wav", 'rb')
if os.path.getsize("asr_example.wav"):
while True:
audio_data = f.read(12800)
if not audio_data:
break
else:
translator.send_audio_frame(audio_data)
else:
raise Exception(
'The supplied file was empty (zero bytes long)')
f.close()
except Exception as e:
raise e
translator.stop()
One-sentence recognition
One-sentence recognition recognizes a speech stream of up to one minute (captured from an external device such as a microphone, or read from a local file) and returns results as a stream.
Recognizing speech from a microphone
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranscriptionResult;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
public class Main {
public static void main(String[] args) throws NoApiKeyException, InterruptedException {
// Create the recognizer
TranslationRecognizerChat translator = new TranslationRecognizerChat();
// Create the request parameters
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-chat-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 16000
.transcriptionEnabled(true)
.translationEnabled(true)
.translationLanguages(new String[] {"en"})
.build();
// 创建一个Flowable<ByteBuffer>
Thread thread = new Thread(
() -> {
try {
// Create the audio format
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
// Match the default recording device to this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
// Start recording
targetDataLine.start();
System.out.println("Speak into the microphone to try one-sentence speech recognition and translation");
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
// Record for up to 50 s (50,000 ms below) and recognize in real time
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
// Send the recorded audio to the streaming recognition service
if (!translator.sendAudioFrame(buffer)) {
System.out.println("sentence end, stop sending");
break;
}
buffer = ByteBuffer.allocate(1024);
// Recording is rate-limited; sleep briefly to keep CPU usage low
Thread.sleep(20);
}
}
} catch (LineUnavailableException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
});
translator.call(param, new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
if (result.getTranscriptionResult() == null) {
return;
}
try {
System.out.println("RequestId: " + result.getRequestId());
// Print the results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
} else {
TranscriptionResult transcriptionResult = result.getTranscriptionResult();
System.out.println("\tTemp Result:" + transcriptionResult.getText());
if (result.getTranscriptionResult().isVadPreEnd()) {
System.out.printf("VadPreEnd: start:%d, end:%d, time:%d\n", transcriptionResult.getPreEndStartTime(), transcriptionResult.getPreEndEndTime(), transcriptionResult.getPreEndTimemillis());
}
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
@Override
public void onComplete() {
System.out.println("Translation complete");
}
@Override
public void onError(Exception e) {
}
});
thread.start();
thread.join();
translator.stop();
// System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import pyaudio
import dashscope
from dashscope.audio.asr import *
# If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
# dashscope.api_key = "your-api-key"
mic = None
stream = None
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback open.")
mic = pyaudio.PyAudio()
stream = mic.open(
format=pyaudio.paInt16, channels=1, rate=16000, input=True
)
def on_close(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback close.")
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.vad_pre_end:
print("vad pre end {}, {}, {}".format(transcription_result.pre_end_start_time, transcription_result.pre_end_end_time, transcription_result.pre_end_timemillis))
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
callback = Callback()
translator = TranslationRecognizerChat(
model="gummy-chat-v1",
format="pcm",
sample_rate=16000,
transcription_enabled=True,
translation_enabled=True,
translation_target_languages=["en"],
callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验一句话语音识别和翻译功能")
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
if not translator.send_audio_frame(data):
print("sentence end, stop sending")
break
else:
break
translator.stop()
Recognizing local files
The audio file used in this example: hello_world.wav.
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class RealtimeTranslateChatTask implements Runnable {
private Path filepath;
private TranslationRecognizerChat translator = null;
public RealtimeTranslateChatTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
for (int i=0; i<1; i++) {
// Create translation params
// you can customize the translation parameters, like model, format,
// sample_rate for more information, please refer to
// https://help.aliyun.com/document_detail/2712536.html
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-chat-v1")
.format("wav") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 16000
.transcriptionEnabled(true)
.translationEnabled(true)
.translationLanguages(new String[] {"en"})
.build();
if (translator == null) {
translator = new TranslationRecognizerChat();
}
CountDownLatch latch = new CountDownLatch(1);
String threadName = Thread.currentThread().getName();
ResultCallback<TranslationRecognizerResult> callback =
new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
System.out.println("RequestId: " + result.getRequestId());
// Print the results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:"+result);
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
}
}
}
@Override
public void onComplete() {
System.out.println("[" + threadName + "] Translation complete");
latch.countDown();
}
@Override
public void onError(Exception e) {
e.printStackTrace();
System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
}
};
// set param & callback
translator.call(param, callback);
// Please replace the path with your audio file path
System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
// Read file and send audio by chunks
try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
// 3200 bytes ≈ 100 ms of 16 kHz, 16-bit mono audio per chunk
byte[] buffer = new byte[3200];
int bytesRead;
// Loop to read chunks of the file
while ((bytesRead = fis.read(buffer)) != -1) {
ByteBuffer byteBuffer;
// Handle the last chunk which might be smaller than the buffer size
System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
if (bytesRead < buffer.length) {
byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
} else {
byteBuffer = ByteBuffer.wrap(buffer);
}
// Send the ByteBuffer to the translation instance
if (!translator.sendAudioFrame(byteBuffer)) {
System.out.println("sentence end, stop sending");
break;
}
buffer = new byte[3200];
Thread.sleep(100);
}
fis.close();
System.out.println(LocalDateTime.now());
} catch (Exception e) {
e.printStackTrace();
}
translator.stop();
// wait for the translation to complete
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
public class Main {
public static void main(String[] args)
throws NoApiKeyException, InterruptedException {
String currentDir = System.getProperty("user.dir");
// Please replace the path with your audio source
Path[] filePaths = {
Paths.get(currentDir, "hello_world.wav"),
// Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
};
// Use ThreadPool to run recognition tasks
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (Path filepath:filePaths) {
executorService.submit(new RealtimeTranslateChatTask(filepath));
}
executorService.shutdown();
// wait for all tasks to complete
executorService.awaitTermination(1, TimeUnit.MINUTES);
// System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import os
import requests
from http import HTTPStatus
import dashscope
from dashscope.audio.asr import *
# If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
# dashscope.api_key = "your-api-key"
r = requests.get(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
f.write(r.content)
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
print("TranslationRecognizerCallback open.")
def on_close(self) -> None:
print("TranslationRecognizerCallback close.")
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
def on_error(self, message) -> None:
print('error: {}'.format(message))
def on_complete(self) -> None:
print('TranslationRecognizerCallback complete')
callback = Callback()
translator = TranslationRecognizerChat(
model="gummy-chat-v1",
format="wav",
sample_rate=16000,
callback=callback,
)
translator.start()
try:
audio_data: bytes = None
f = open("asr_example.wav", 'rb')
if os.path.getsize("asr_example.wav"):
while True:
audio_data = f.read(12800)
if not audio_data:
break
else:
if translator.send_audio_frame(audio_data):
print("send audio frame success")
else:
print("sentence end, stop sending")
break
else:
raise Exception(
'The supplied file was empty (zero bytes long)')
f.close()
except Exception as e:
raise e
translator.stop()
Paraformer
Recognizing speech from a microphone
Real-time recognition can recognize speech coming from the microphone and output results, producing text as the user speaks.
Python
Before running the Python example, install the third-party audio playback and capture package via pip install pyaudio.
import pyaudio
from dashscope.audio.asr import (Recognition, RecognitionCallback,
RecognitionResult)
# If the API Key is not configured as an environment variable, uncomment the next lines and replace apiKey with your own API Key
# import dashscope
# dashscope.api_key = "apiKey"
mic = None
stream = None
class Callback(RecognitionCallback):
def on_open(self) -> None:
global mic
global stream
print('RecognitionCallback open.')
mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16,
channels=1,
rate=16000,
input=True)
def on_close(self) -> None:
global mic
global stream
print('RecognitionCallback close.')
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(self, result: RecognitionResult) -> None:
print('RecognitionCallback sentence: ', result.get_sentence())
callback = Callback()
recognition = Recognition(model='paraformer-realtime-v2',
format='pcm',
sample_rate=16000,
callback=callback)
recognition.start()
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
recognition.send_audio_frame(data)
else:
break
recognition.stop()
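If the example prints no recognition results, the default input device may not be your microphone. The following minimal sketch (not part of the official sample) uses PyAudio's standard device-query API to list the available input devices:
import pyaudio

# Print every device that can capture audio, with its default sample rate
pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(i, info["name"], int(info["defaultSampleRate"]))
pa.terminate()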
Java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import java.nio.ByteBuffer;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
public class Main {
public static void main(String[] args) throws NoApiKeyException {
    // Create a Flowable<ByteBuffer> as the audio source
Flowable<ByteBuffer> audioSource =
Flowable.create(
emitter -> {
new Thread(
() -> {
try {
                    // Define the audio format: 16 kHz, 16-bit, mono PCM
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                    // Match the default recording device for this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
                    // Start recording
targetDataLine.start();
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
                    // Record for 30 seconds and transcribe in real time
                    while (System.currentTimeMillis() - start < 30000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
                        // Send the recorded audio to the streaming recognition service
emitter.onNext(buffer);
buffer = ByteBuffer.allocate(1024);
                        // Recording speed is limited; sleep briefly to avoid high CPU usage
Thread.sleep(20);
}
}
                    // Signal the end of transcription
emitter.onComplete();
} catch (Exception e) {
emitter.onError(e);
}
})
.start();
},
BackpressureStrategy.BUFFER);
    // Create the recognizer
Recognition recognizer = new Recognition();
    // Create RecognitionParam; the Flowable<ByteBuffer> created above is passed to streamCall as the audio source
RecognitionParam param =
RecognitionParam.builder()
.model("paraformer-realtime-v2")
.format("pcm")
.sampleRate(16000)
            // If you have not set the API key as an environment variable, uncomment the line below and replace apikey with your own API key
// .apiKey("apikey")
.build();
    // Call the streaming API
recognizer
.streamCall(param, audioSource)
        // Consume results from the returned Flowable
.blockingForEach(
result -> {
            // Print results; isSentenceEnd() marks a finalized sentence
if (result.isSentenceEnd()) {
System.out.println("Fix:" + result.getSentence().getText());
} else {
System.out.println("Result:" + result.getSentence().getText());
}
});
System.exit(0);
}
}
Recognize a local audio file
Real-time speech recognition can recognize a local audio file and output the result. This interface suits short, near-real-time recognition scenarios such as chat, voice commands, voice input methods, and voice search.
Recognize Chinese and English
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html
import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition
# If you have not set the API key as an environment variable, uncomment the two lines below and replace apiKey with your own API key
# import dashscope
# dashscope.api_key = "apiKey"
# You can skip the URL download step below and use a local file for recognition directly
r = requests.get(
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'
)
with open('asr_example.wav', 'wb') as f:
f.write(r.content)
recognition = Recognition(model='paraformer-realtime-v2',
format='wav',
sample_rate=16000,
                          # "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models
language_hints=['zh', 'en'],
callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
print(result.get_sentence())
else:
print('Error: ', result.message)
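The sentences returned by result.get_sentence() above can be post-processed further, for example to build subtitles. A minimal sketch, assuming each sentence is a dict carrying 'begin_time' and 'end_time' in milliseconds alongside 'text' (verify these field names against your SDK version):
for sentence in result.get_sentence():
    # begin_time/end_time are assumed to be millisecond offsets into the audio
    begin_s = (sentence.get('begin_time') or 0) / 1000
    end_s = (sentence.get('end_time') or 0) / 1000
    print(f"[{begin_s:.2f}s - {end_s:.2f}s] {sentence.get('text', '')}")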
Java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
public class Main {
public static void main(String[] args) {
    // You can skip the URL download step below and call the API with a local file directly
String exampleWavUrl =
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav";
try {
InputStream in = new URL(exampleWavUrl).openStream();
Files.copy(in, Paths.get("asr_example.wav"), StandardCopyOption.REPLACE_EXISTING);
} catch (IOException e) {
System.out.println("error: " + e);
System.exit(1);
}
    // Create a Recognition instance
Recognition recognizer = new Recognition();
    // Create RecognitionParam
RecognitionParam param =
RecognitionParam.builder()
            // If you have not set the API key as an environment variable, uncomment the line below and replace apikey with your own API key
// .apiKey("apikey")
.model("paraformer-realtime-v2")
.format("wav")
.sampleRate(16000)
            // "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models
.parameter("language_hints", new String[]{"zh", "en"})
.build();
try {
System.out.println("识别结果:" + recognizer.call(param, new File("asr_example.wav")));
} catch (Exception e) {
e.printStackTrace();
}
System.exit(0);
}
}
Recognize Japanese
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html
import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition
# If you have not set the API key as an environment variable, uncomment the two lines below and replace apiKey with your own API key
# import dashscope
# dashscope.api_key = "apiKey"
# You can skip the URL download step below and use a local file for recognition directly
r = requests.get(
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav'
)
with open('asr_japanese_example.wav', 'wb') as f:
f.write(r.content)
recognition = Recognition(model='paraformer-realtime-v2',
format='wav',
sample_rate=16000,
                          # "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models
language_hints=['ja'],
callback=None)
result = recognition.call('asr_japanese_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
print(result.get_sentence())
else:
print('Error: ', result.message)
Java
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
public class Main {
public static void main(String[] args) {
    // You can skip the URL download step below and use a local file for recognition directly
String exampleWavUrl =
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav";
try {
InputStream in = new URL(exampleWavUrl).openStream();
Files.copy(in, Paths.get("asr_japanese_example.wav"), StandardCopyOption.REPLACE_EXISTING);
} catch (IOException e) {
System.out.println("error: " + e);
System.exit(1);
}
    // Create a Recognition instance
Recognition recognizer = new Recognition();
    // Create RecognitionParam
RecognitionParam param =
RecognitionParam.builder()
            // If you have not set the API key as an environment variable, uncomment the line below and replace apikey with your own API key
// .apiKey("apikey")
.model("paraformer-realtime-v2")
.format("wav")
.sampleRate(16000)
            // "language_hints" is only supported by the paraformer-v2 and paraformer-realtime-v2 models
.parameter("language_hints", new String[]{"ja"})
.build();
try {
System.out.println("识别结果:" + recognizer.call(param, new File("asr_japanese_example.wav")));
} catch (Exception e) {
e.printStackTrace();
}
System.exit(0);
}
}
Input file constraints
When recognizing a local audio file:
Input method: upload a local file.
Number of files: at most one file per call.
File size: no limit.
File formats: pcm, wav, opus, speex, aac, and amr are supported; pcm and wav are recommended.
Because there are many audio formats and variants, correct recognition of every format cannot be guaranteed. Verify through testing that your files produce normal recognition results.
Sample rate: 16000 Hz or above for general audio; 8000 Hz or above for telephony audio.
The sample rate is the number of times per second the sound signal is sampled. A higher sample rate carries more information and helps recognition accuracy, but an excessively high rate may introduce irrelevant information and hurt results. Choose a model that matches the actual sample rate of your audio; for example, 8000 Hz audio should be sent to a model that supports 8000 Hz directly rather than resampled to 16000 Hz (see the sketch below).
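This model-selection rule can be checked programmatically before submitting a file. A minimal pre-flight sketch for WAV input using only the Python standard library (the two model names are illustrative; substitute whichever 8 kHz/16 kHz model pair you actually use):
import wave

def pick_model(path: str) -> str:
    # Read the actual sample rate from the WAV header instead of resampling
    with wave.open(path, 'rb') as w:
        rate = w.getframerate()
    # Telephony-rate audio goes to an 8k model; everything else to a 16k model
    return 'paraformer-realtime-8k-v2' if rate <= 8000 else 'paraformer-realtime-v2'

print(pick_model('asr_example.wav'))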
Real-time speech translation
Real-time speech translation translates an audio stream into text in a specified target language as the audio arrives, so speech is translated into text while it is being spoken. It works with both microphone input and local audio files.
Application scenarios
International conferences and business communication: in multilingual settings, real-time speech translation helps participants immediately understand speech in other languages, facilitating cross-border communication and collaboration.
Travel: when traveling or on business trips abroad, real-time speech translation helps users communicate with locals, removing language barriers when asking for directions, ordering food, or shopping.
Supported models
Click to view supported models
Model name | Supported languages | Supported sample rates | Use cases | Supported audio formats | Unit price | Free quota |
gummy-realtime-v1 | Chinese, English, Japanese, Korean, Cantonese, German, French, Russian, Italian, Spanish. Translation pairs: Chinese → English/Japanese/Korean; English → Chinese/Japanese/Korean; Japanese/Korean/Cantonese/German/French/Russian/Italian/Spanish → Chinese/English | 16 kHz and above | Long, uninterrupted recognition such as conference speeches and live streaming | pcm, wav, mp3, opus, speex, aac, amr | CNY 0.00015/second | 36,000 seconds (10 hours). If Model Studio (百炼) was activated before 00:00 on January 17, 2025: valid until July 15, 2025; if activated after that time: valid for 180 days from activation |
gummy-chat-v1 | 16 kHz | Short voice-interaction scenarios such as chat, command control, voice input methods, and voice search |
Quick start
Below is sample code for calling the API. For more code samples covering common scenarios, see GitHub.
You must already have an API key and have configured it as an environment variable. If you call through an SDK, you also need to install the DashScope SDK.
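Before running the samples, you can fail fast if the key is missing. A minimal sketch; DASHSCOPE_API_KEY is the environment variable conventionally read by the DashScope SDK:
import os

# Raise a clear error up front instead of an authentication failure mid-call
if not os.getenv('DASHSCOPE_API_KEY'):
    raise RuntimeError('DASHSCOPE_API_KEY is not set; export it, '
                       'or pass the API key explicitly in code.')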
Click to view sample code
Real-time speech translation
Real-time speech translation can translate long speech streams (whether captured from an external device such as a microphone or read from a local file) and return results as a stream.
Translate speech from the microphone
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
public class Main {
public static void main(String[] args) throws NoApiKeyException {
    // Create a Flowable<ByteBuffer> as the audio source
String targetLanguage = "en";
Flowable<ByteBuffer> audioSource =
Flowable.create(
emitter -> {
new Thread(
() -> {
try {
                    // Define the audio format: 16 kHz, 16-bit, mono PCM
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                    // Match the default recording device for this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
                    // Start recording
targetDataLine.start();
System.out.println("请您通过麦克风讲话体验实时语音识别和翻译功能");
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
                    // Record for 50 seconds and recognize in real time
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
                        // Send the recorded audio to the streaming recognition service
emitter.onNext(buffer);
buffer = ByteBuffer.allocate(1024);
                        // Recording speed is limited; sleep briefly to avoid high CPU usage
Thread.sleep(20);
}
}
                    // Signal completion
emitter.onComplete();
} catch (Exception e) {
emitter.onError(e);
}
})
.start();
},
BackpressureStrategy.BUFFER);
    // Create the translator
TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
    // Create TranslationRecognizerParam; the Flowable<ByteBuffer> created above is passed to streamCall as the audio source
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
            // If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
// .apiKey("your-api-key")
.model("gummy-realtime-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 8000、16000
.transcriptionEnabled(true)
.sourceLanguage("auto")
.translationEnabled(true)
.translationLanguages(new String[] {targetLanguage})
.build();
    // Call the streaming API
translator
.streamCall(param, audioSource)
        // Consume results from the returned Flowable
.blockingForEach(
result -> {
if (result.getTranscriptionResult() == null) {
return;
}
try {
System.out.println("RequestId: " + result.getRequestId());
            // Print results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
}
}
} catch (Exception e) {
e.printStackTrace();
}
});
System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import pyaudio
import dashscope
from dashscope.audio.asr import *
# If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"
mic = None
stream = None
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback open.")
mic = pyaudio.PyAudio()
stream = mic.open(
format=pyaudio.paInt16, channels=1, rate=16000, input=True
)
def on_close(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback close.")
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.stash is not None:
print(
"translate to english stash: ",
translation_result.get_translation("en").stash.text,
)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
if transcription_result.stash is not None:
print("transcription stash: ", transcription_result.stash.text)
callback = Callback()
translator = TranslationRecognizerRealtime(
model="gummy-realtime-v1",
format="pcm",
sample_rate=16000,
transcription_enabled=True,
translation_enabled=True,
translation_target_languages=["en"],
callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验实时语音识别和翻译功能")
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
translator.send_audio_frame(data)
else:
break
translator.stop()
Translate a local audio file
The audio file used in this example: hello_world.wav.
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class RealtimeTranslateTask implements Runnable {
private Path filepath;
public RealtimeTranslateTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
String targetLanguage = "en";
// Create translation params
// you can customize the translation parameters, like model, format,
// sample_rate for more information, please refer to
// https://help.aliyun.com/document_detail/2712536.html
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
            // If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
// .apiKey("your-api-key")
.model("gummy-realtime-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 8000、16000
.transcriptionEnabled(true)
.sourceLanguage("auto")
.translationEnabled(true)
.translationLanguages(new String[] {targetLanguage})
.build();
TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
CountDownLatch latch = new CountDownLatch(1);
String threadName = Thread.currentThread().getName();
ResultCallback<TranslationRecognizerResult> callback =
new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
System.out.println("RequestId: " + result.getRequestId());
            // Print results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:"+result);
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
}
}
}
@Override
public void onComplete() {
System.out.println("[" + threadName + "] Translation complete");
latch.countDown();
}
@Override
public void onError(Exception e) {
e.printStackTrace();
System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
}
};
// set param & callback
translator.call(param, callback);
// Please replace the path with your audio file path
System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
// Read file and send audio by chunks
try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
      // A 3200-byte chunk is 100 ms of 16 kHz, 16-bit mono PCM
byte[] buffer = new byte[3200];
int bytesRead;
// Loop to read chunks of the file
while ((bytesRead = fis.read(buffer)) != -1) {
ByteBuffer byteBuffer;
// Handle the last chunk which might be smaller than the buffer size
System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
if (bytesRead < buffer.length) {
byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
} else {
byteBuffer = ByteBuffer.wrap(buffer);
}
// Send the ByteBuffer to the translation instance
translator.sendAudioFrame(byteBuffer);
buffer = new byte[3200];
Thread.sleep(100);
}
System.out.println(LocalDateTime.now());
} catch (Exception e) {
e.printStackTrace();
}
translator.stop();
// wait for the translation to complete
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
public class Main {
public static void main(String[] args)
throws NoApiKeyException, InterruptedException {
String currentDir = System.getProperty("user.dir");
// Please replace the path with your audio source
Path[] filePaths = {
Paths.get(currentDir, "hello_world.wav"),
// Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
};
// Use ThreadPool to run recognition tasks
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (Path filepath:filePaths) {
executorService.submit(new RealtimeTranslateTask(filepath));
}
executorService.shutdown();
// wait for all tasks to complete
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import os
import requests
from http import HTTPStatus
import dashscope
from dashscope.audio.asr import *
# If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"
r = requests.get(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
f.write(r.content)
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
print("TranslationRecognizerCallback open.")
def on_close(self) -> None:
print("TranslationRecognizerCallback close.")
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.stash is not None:
print(
"translate to english stash: ",
translation_result.get_translation("en").stash.text,
)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
if transcription_result.stash is not None:
print("transcription stash: ", transcription_result.stash.text)
def on_error(self, message) -> None:
print('error: {}'.format(message))
def on_complete(self) -> None:
print('TranslationRecognizerCallback complete')
callback = Callback()
translator = TranslationRecognizerRealtime(
model="gummy-realtime-v1",
format="wav",
sample_rate=16000,
callback=callback,
)
translator.start()
try:
audio_data: bytes = None
f = open("asr_example.wav", 'rb')
if os.path.getsize("asr_example.wav"):
while True:
audio_data = f.read(12800)
if not audio_data:
break
else:
translator.send_audio_frame(audio_data)
else:
raise Exception(
'The supplied file was empty (zero bytes long)')
f.close()
except Exception as e:
raise e
translator.stop()
One-sentence translation
One-sentence translation can translate speech streams of up to one minute (whether captured from an external device such as a microphone or read from a local file) and return results as a stream.
Translate speech from the microphone
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranscriptionResult;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
public class Main {
public static void main(String[] args) throws NoApiKeyException, InterruptedException {
    // Create the translator
TranslationRecognizerChat translator = new TranslationRecognizerChat();
    // Create TranslationRecognizerParam
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
            // If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
// .apiKey("your-api-key")
.model("gummy-chat-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 16000
.transcriptionEnabled(true)
.translationEnabled(true)
.translationLanguages(new String[] {"en"})
.build();
    // Create a recording thread that feeds audio frames to the translator
Thread thread = new Thread(
() -> {
try {
            // Define the audio format: 16 kHz, 16-bit, mono PCM
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Match the default recording device for this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
            // Start recording
targetDataLine.start();
System.out.println("请您通过麦克风讲话体验一句话语音识别和翻译功能");
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
            // Record for up to 50 seconds and recognize in real time
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
                // Send the recorded audio to the streaming recognition service
if (!translator.sendAudioFrame(buffer)) {
System.out.println("sentence end, stop sending");
break;
}
buffer = ByteBuffer.allocate(1024);
                // Recording speed is limited; sleep briefly to avoid high CPU usage
Thread.sleep(20);
}
}
} catch (LineUnavailableException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
});
translator.call(param, new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
if (result.getTranscriptionResult() == null) {
return;
}
try {
System.out.println("RequestId: " + result.getRequestId());
          // Print results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
} else {
TranscriptionResult transcriptionResult = result.getTranscriptionResult();
System.out.println("\tTemp Result:" + transcriptionResult.getText());
if (result.getTranscriptionResult().isVadPreEnd()) {
System.out.printf("VadPreEnd: start:%d, end:%d, time:%d\n", transcriptionResult.getPreEndStartTime(), transcriptionResult.getPreEndEndTime(), transcriptionResult.getPreEndTimemillis());
}
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
@Override
public void onComplete() {
System.out.println("Translation complete");
}
@Override
      public void onError(Exception e) {
        // Surface errors instead of silently swallowing them
        e.printStackTrace();
      }
});
thread.start();
thread.join();
translator.stop();
// System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import pyaudio
import dashscope
from dashscope.audio.asr import *
# If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"
mic = None
stream = None
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback open.")
mic = pyaudio.PyAudio()
stream = mic.open(
format=pyaudio.paInt16, channels=1, rate=16000, input=True
)
def on_close(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback close.")
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.vad_pre_end:
print("vad pre end {}, {}, {}".format(transcription_result.pre_end_start_time, transcription_result.pre_end_end_time, transcription_result.pre_end_timemillis))
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
callback = Callback()
translator = TranslationRecognizerChat(
model="gummy-chat-v1",
format="pcm",
sample_rate=16000,
transcription_enabled=True,
translation_enabled=True,
translation_target_languages=["en"],
callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验一句话语音识别和翻译功能")
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
if not translator.send_audio_frame(data):
print("sentence end, stop sending")
break
else:
break
translator.stop()
Translate a local audio file
The audio file used in this example: hello_world.wav.
Java
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class RealtimeTranslateChatTask implements Runnable {
private Path filepath;
private TranslationRecognizerChat translator = null;
public RealtimeTranslateChatTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
for (int i=0; i<1; i++) {
// Create translation params
// you can customize the translation parameters, like model, format,
// sample_rate for more information, please refer to
// https://help.aliyun.com/document_detail/2712536.html
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
                // If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
// .apiKey("your-api-key")
.model("gummy-chat-v1")
.format("wav") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 16000
.transcriptionEnabled(true)
.translationEnabled(true)
.translationLanguages(new String[] {"en"})
.build();
if (translator == null) {
translator = new TranslationRecognizerChat();
}
CountDownLatch latch = new CountDownLatch(1);
String threadName = Thread.currentThread().getName();
ResultCallback<TranslationRecognizerResult> callback =
new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
System.out.println("RequestId: " + result.getRequestId());
              // Print results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:"+result);
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
}
}
}
@Override
public void onComplete() {
System.out.println("[" + threadName + "] Translation complete");
latch.countDown();
}
@Override
public void onError(Exception e) {
e.printStackTrace();
System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
}
};
// set param & callback
translator.call(param, callback);
// Please replace the path with your audio file path
System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
// Read file and send audio by chunks
try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
        // A 3200-byte chunk is 100 ms of 16 kHz, 16-bit mono PCM
byte[] buffer = new byte[3200];
int bytesRead;
// Loop to read chunks of the file
while ((bytesRead = fis.read(buffer)) != -1) {
ByteBuffer byteBuffer;
// Handle the last chunk which might be smaller than the buffer size
System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
if (bytesRead < buffer.length) {
byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
} else {
byteBuffer = ByteBuffer.wrap(buffer);
}
// Send the ByteBuffer to the translation instance
if (!translator.sendAudioFrame(byteBuffer)) {
System.out.println("sentence end, stop sending");
break;
}
buffer = new byte[3200];
Thread.sleep(100);
}
System.out.println(LocalDateTime.now());
} catch (Exception e) {
e.printStackTrace();
}
translator.stop();
// wait for the translation to complete
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
public class Main {
public static void main(String[] args)
throws NoApiKeyException, InterruptedException {
String currentDir = System.getProperty("user.dir");
// Please replace the path with your audio source
Path[] filePaths = {
Paths.get(currentDir, "hello_world.wav"),
// Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
};
// Use ThreadPool to run recognition tasks
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (Path filepath:filePaths) {
executorService.submit(new RealtimeTranslateChatTask(filepath));
}
executorService.shutdown();
// wait for all tasks to complete
executorService.awaitTermination(1, TimeUnit.MINUTES);
// System.exit(0);
}
}
Python
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import os
import requests
from http import HTTPStatus
import dashscope
from dashscope.audio.asr import *
# If you have not set the API key as an environment variable, uncomment the line below and replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"
r = requests.get(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
f.write(r.content)
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
print("TranslationRecognizerCallback open.")
def on_close(self) -> None:
print("TranslationRecognizerCallback close.")
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
def on_error(self, message) -> None:
print('error: {}'.format(message))
def on_complete(self) -> None:
print('TranslationRecognizerCallback complete')
callback = Callback()
translator = TranslationRecognizerChat(
model="gummy-chat-v1",
format="wav",
sample_rate=16000,
callback=callback,
)
translator.start()
try:
audio_data: bytes = None
f = open("asr_example.wav", 'rb')
if os.path.getsize("asr_example.wav"):
while True:
audio_data = f.read(12800)
if not audio_data:
break
else:
if translator.send_audio_frame(audio_data):
print("send audio frame success")
else:
print("sentence end, stop sending")
break
else:
raise Exception(
'The supplied file was empty (zero bytes long)')
f.close()
except Exception as e:
raise e
translator.stop()
Input file constraints
When recognizing a local audio file:
Input method: upload a local file.
Number of files: at most one file per call.
File size: no limit.
File formats: pcm, wav, opus, speex, aac, and amr are supported; pcm and wav are recommended.
Because there are many audio formats and variants, correct recognition of every format cannot be guaranteed. Verify through testing that your files produce normal recognition results.
Sample rate: 16000 Hz or above for general audio; 8000 Hz or above for telephony audio.
The sample rate is the number of times per second the sound signal is sampled. A higher sample rate carries more information and helps recognition accuracy, but an excessively high rate may introduce irrelevant information and hurt results. Choose a model that matches the actual sample rate of your audio; for example, 8000 Hz audio should be sent to a model that supports 8000 Hz directly rather than resampled to 16000 Hz.
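The streaming examples above send fixed-size chunks: 3200 bytes in Java (about 100 ms of 16 kHz, 16-bit mono PCM) and 12800 bytes in Python (about 400 ms). A small helper, as a sketch for uncompressed PCM input, to derive a chunk size from the audio parameters:
def chunk_bytes(duration_ms: int, sample_rate: int = 16000,
                sample_width: int = 2, channels: int = 1) -> int:
    # Bytes per second = sample_rate * sample_width * channels
    return sample_rate * sample_width * channels * duration_ms // 1000

print(chunk_bytes(100))   # 3200, the chunk size used in the Java examples
print(chunk_bytes(400))   # 12800, the chunk size used in the Python examples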
API reference
Gummy: speech recognition and translation with Gummy.
Paraformer: speech recognition with Paraformer.
SenseVoice: speech recognition with SenseVoice.
FAQ
1. What factors can affect recognition accuracy?
Audio quality: recording devices, the environment, and similar factors affect how clear the speech is, and therefore recognition accuracy. High-quality audio input is the prerequisite for accurate recognition.
Speaker characteristics: voices differ widely in pitch, speaking rate, accent, and dialect. These individual differences challenge recognition systems, especially for accents or dialects the model has not been sufficiently trained on.
Language and vocabulary: recognition models are usually trained for specific languages. Mixed-language speech, technical terms, slang, and internet expressions increase recognition difficulty. You can use the hot-word feature to steer recognition results; see Paraformer speech recognition hot-word customization and management.
Context understanding: without conversational context, the system may misinterpret speech, especially when the meaning is ambiguous or context-dependent.
2. What are the model rate-limiting rules?
(In the tables below, blank cells inherit the value from the row above.)
Qwen ASR (通义千问ASR) speech recognition:
Model name | Rate-limit conditions (throttling is triggered when either value is exceeded) |
| Calls per minute (QPM) | Tokens per minute (TPM) |
qwen-audio-asr | 60 | 100,000 |
qwen-audio-asr-latest | | |
qwen-audio-asr-2024-12-04 | | |
Gummy speech recognition and translation:
Model name | Job-submission API RPS limit |
gummy-realtime-v1 | 10 |
gummy-chat-v1 | |
Paraformer speech recognition:
Model name | Job-submission API RPS limit |
paraformer-realtime-v2 | 20 |
paraformer-realtime-v1 | |
paraformer-realtime-8k-v2 | |
paraformer-realtime-8k-v1 | |
Model name | Job-submission API RPS limit | Task-query API RPS limit |
paraformer-v2 | 20 | 20 |
paraformer-v1 | 10 | |
paraformer-8k-v2 | 20 | |
paraformer-8k-v1 | 10 | |
paraformer-mtl-v1 | 10 | |
SenseVoice speech recognition:
Model name | Job-submission API RPS limit | Task-query API RPS limit |
sensevoice-v1 | 10 | 20 |
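When a limit is exceeded, requests are throttled and fail. A minimal client-side retry sketch with exponential backoff (the 'Throttling' string match is an assumption; check the actual error code your SDK returns before relying on it):
import random
import time

def call_with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            # Assumption: throttling surfaces as an error mentioning 'Throttling';
            # inspect the real error code in production code.
            if 'Throttling' not in str(e) or attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter, capped at 30 seconds
            time.sleep(min(2 ** attempt + random.random(), 30))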