Real-time speech recognition converts an audio stream into text as it arrives, producing transcripts while the speaker is still talking. It can recognize live microphone input as well as transcribe local audio files in real time.
Application scenarios
Meetings: real-time transcripts for meetings, lectures, training sessions, and court hearings.
Live streaming: real-time subtitles for live-commerce and sports broadcasts.
Customer service: records calls in real time to help improve service quality.
Gaming: lets players use voice input or read chat content without pausing gameplay.
Social chat: automatic speech-to-text while using social apps or input methods.
Human-computer interaction: converts spoken dialogue into text for a better interaction experience.
Supported models
Model selection guide
Language support:
For mixed-language audio, Gummy is recommended; it delivers higher recognition accuracy in multilingual scenarios and is also more accurate on uncommon words.
For Chinese (Mandarin), Cantonese, English, Japanese, and Korean, choose either Gummy or Paraformer.
For German, French, Russian, Italian, and Spanish, choose Gummy.
For Chinese dialects, choose Paraformer.
Noisy environments: Paraformer is recommended.
Emotion recognition and filler-word filtering: if you need these capabilities, choose a Paraformer speech recognition model. (The sketch below condenses these rules into code.)
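For reference only, a minimal Python sketch of the decision table above; the helper, its argument names, and the language mapping are illustrative and not part of the SDK:

```python
# Illustrative helper encoding the selection advice above; not an official API.
GUMMY_ONLY = {"de", "fr", "ru", "it", "es"}   # languages that require Gummy
SHARED = {"zh", "yue", "en", "ja", "ko"}      # supported by both model families

def pick_model(language: str, mixed: bool = False,
               noisy: bool = False, need_emotion: bool = False) -> str:
    """Return a model name following the guidance in this section."""
    if noisy or need_emotion or language == "zh-dialect":
        return "paraformer-realtime-v2"  # noise robustness, emotion, dialects
    if mixed or language in GUMMY_ONLY:
        return "gummy-realtime-v1"       # multilingual or Gummy-only languages
    if language in SHARED:
        return "gummy-realtime-v1"       # either family works; Gummy chosen here
    raise ValueError(f"unsupported language: {language}")

print(pick_model("fr"))               # gummy-realtime-v1
print(pick_model("zh", noisy=True))   # paraformer-realtime-v2
```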
Quick start
You can try it out online first: on the speech recognition page, select the Paraformer real-time speech recognition (v2) model and click Try it now.
Below is sample code for calling the API. For more code examples covering common scenarios, see GitHub.
You must first obtain an API Key and configure it as an environment variable. If you call through an SDK, you also need to install the DashScope SDK. (A quick pre-flight check follows below.)
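As a quick sanity check before running the samples, you can verify the key is visible to your process. This assumes the SDK's documented fallback of reading the DASHSCOPE_API_KEY environment variable when no key is set in code:

```python
import os

# The DashScope SDKs fall back to the DASHSCOPE_API_KEY environment variable
# when the API key is not assigned in code; fail fast if it is missing.
if not os.getenv("DASHSCOPE_API_KEY"):
    raise RuntimeError(
        "DASHSCOPE_API_KEY is not set; export it or assign dashscope.api_key in code")
```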
Real-time speech recognition: suited to long, uninterrupted recognition scenarios such as conference talks and live video streams.
One-sentence recognition: more sensitive to pauses; accurately recognizes short utterances within one minute, suited to short voice-interaction scenarios such as chat, voice commands, voice input methods, and voice search.
Real-time speech recognition handles long audio streams (whether captured from an external device such as a microphone or read from a local file), recognizing them and streaming results back as they are produced.
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
public class Main {
public static void main(String[] args) throws InterruptedException {
ExecutorService executorService = Executors.newSingleThreadExecutor();
executorService.submit(new RealtimeRecognitionTask());
executorService.shutdown();
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
}
class RealtimeRecognitionTask implements Runnable {
@Override
public void run() {
String targetLanguage = "en";
// Initialize the request parameters
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-realtime-v1") // model name
.format("pcm") // audio format; supported: pcm, wav, mp3, opus, speex, aac, amr
.sampleRate(16000) // sample rate in Hz; 16000 Hz and above are supported
.transcriptionEnabled(true) // enable real-time transcription
.sourceLanguage("auto") // source language code (the language to recognize/translate)
.translationEnabled(true) // enable real-time translation
.translationLanguages(new String[] {targetLanguage}) // target translation language(s)
.build();
// Initialize the result callback
ResultCallback<TranslationRecognizerResult> callback =
new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
System.out.println("RequestId: " + result.getRequestId());
// Print the recognition results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:"+result);
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
}
}
}
@Override
public void onComplete() {
System.out.println("Translation complete");
}
@Override
public void onError(Exception e) {
e.printStackTrace();
System.out.println("TranslationCallback error: " + e.getMessage());
}
};
// Initialize the streaming recognition service
TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
// Start streaming recognition/translation, binding the request parameters and the callback
translator.call(param, callback);
try {
// Define the audio format
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
// Get the default recording device matching this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
// Start recording
targetDataLine.start();
System.out.println("Speak into the microphone to try real-time speech recognition and translation");
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
// Record for 50 seconds while recognizing in real time
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
// Send the recorded audio to the streaming recognition service
translator.sendAudioFrame(buffer);
buffer = ByteBuffer.allocate(1024);
// The recording rate is limited; sleep briefly to avoid high CPU usage
Thread.sleep(20);
}
}
// Signal the end of the stream
translator.stop();
} catch (Exception e) {
e.printStackTrace();
}
System.out.println(
"[Metric] requestId: "
+ translator.getLastRequestId()
+ ", first package delay ms: "
+ translator.getFirstPackageDelay()
+ ", last package delay ms: "
+ translator.getLastPackageDelay());
}
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import pyaudio
import dashscope
from dashscope.audio.asr import *
# If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
# dashscope.api_key = "your-api-key"
mic = None
stream = None
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback open.")
mic = pyaudio.PyAudio()
stream = mic.open(
format=pyaudio.paInt16, channels=1, rate=16000, input=True
)
def on_close(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback close.")
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.stash is not None:
print(
"translate to english stash: ",
translation_result.get_translation("en").stash.text,
)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
if transcription_result.stash is not None:
print("transcription stash: ", transcription_result.stash.text)
callback = Callback()
translator = TranslationRecognizerRealtime(
model="gummy-realtime-v1",
format="pcm",
sample_rate=16000,
transcription_enabled=True,
translation_enabled=True,
translation_target_languages=["en"],
callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验实时语音识别和翻译功能")
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
translator.send_audio_frame(data)
else:
break
translator.stop()
The audio file used in this example: hello_world.wav.
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class RealtimeTranslateTask implements Runnable {
private Path filepath;
public RealtimeTranslateTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
String targetLanguage = "en";
// Create translation params
// You can customize the translation parameters (model, format, sample_rate, ...).
// For more information, see
// https://help.aliyun.com/document_detail/2712536.html
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-realtime-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000)
.transcriptionEnabled(true)
.sourceLanguage("auto")
.translationEnabled(true)
.translationLanguages(new String[] {targetLanguage})
.build();
TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
CountDownLatch latch = new CountDownLatch(1);
String threadName = Thread.currentThread().getName();
ResultCallback<TranslationRecognizerResult> callback =
new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
System.out.println("RequestId: " + result.getRequestId());
// Print the recognition results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:"+result);
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
}
}
}
@Override
public void onComplete() {
System.out.println("[" + threadName + "] Translation complete");
latch.countDown();
}
@Override
public void onError(Exception e) {
e.printStackTrace();
System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
}
};
// set param & callback
translator.call(param, callback);
// Please replace the path with your audio file path
System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
// Read file and send audio by chunks
try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
// 3200-byte chunks ≈ 100 ms of 16 kHz, 16-bit mono audio
byte[] buffer = new byte[3200];
int bytesRead;
// Loop to read chunks of the file
while ((bytesRead = fis.read(buffer)) != -1) {
ByteBuffer byteBuffer;
// Handle the last chunk which might be smaller than the buffer size
System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
if (bytesRead < buffer.length) {
byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
} else {
byteBuffer = ByteBuffer.wrap(buffer);
}
// Send the ByteBuffer to the translation instance
translator.sendAudioFrame(byteBuffer);
buffer = new byte[3200];
Thread.sleep(100);
}
System.out.println(LocalDateTime.now());
} catch (Exception e) {
e.printStackTrace();
}
translator.stop();
// wait for the translation to complete
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
public class Main {
public static void main(String[] args)
throws NoApiKeyException, InterruptedException {
String currentDir = System.getProperty("user.dir");
// Please replace the path with your audio source
Path[] filePaths = {
Paths.get(currentDir, "hello_world.wav"),
// Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
};
// Use ThreadPool to run recognition tasks
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (Path filepath:filePaths) {
executorService.submit(new RealtimeTranslateTask(filepath));
}
executorService.shutdown();
// wait for all tasks to complete
executorService.awaitTermination(1, TimeUnit.MINUTES);
System.exit(0);
}
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import os
import requests
from http import HTTPStatus
import dashscope
from dashscope.audio.asr import *
# If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
# dashscope.api_key = "your-api-key"
r = requests.get(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
f.write(r.content)
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
print("TranslationRecognizerCallback open.")
def on_close(self) -> None:
print("TranslationRecognizerCallback close.")
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.stash is not None:
print(
"translate to english stash: ",
translation_result.get_translation("en").stash.text,
)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
if transcription_result.stash is not None:
print("transcription stash: ", transcription_result.stash.text)
def on_error(self, message) -> None:
print('error: {}'.format(message))
def on_complete(self) -> None:
print('TranslationRecognizerCallback complete')
callback = Callback()
translator = TranslationRecognizerRealtime(
model="gummy-realtime-v1",
format="wav",
sample_rate=16000,
callback=callback,
)
translator.start()
try:
audio_data: bytes = None
f = open("asr_example.wav", 'rb')
if os.path.getsize("asr_example.wav"):
while True:
audio_data = f.read(12800)
if not audio_data:
break
else:
translator.send_audio_frame(audio_data)
else:
raise Exception(
'The supplied file was empty (zero bytes long)')
f.close()
except Exception as e:
raise e
translator.stop()
One-sentence recognition identifies an audio stream of up to one minute (whether captured from an external device such as a microphone or read from a local file) and streams results back as they are produced.
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranscriptionResult;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;
public class Main {
public static void main(String[] args) throws NoApiKeyException, InterruptedException {
// Create the recognizer
TranslationRecognizerChat translator = new TranslationRecognizerChat();
// Create the request parameters
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-chat-v1")
.format("pcm") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 16000
.transcriptionEnabled(true)
.translationEnabled(true)
.translationLanguages(new String[] {"en"})
.build();
// Record from the microphone on a separate thread
Thread thread = new Thread(
() -> {
try {
// Define the audio format
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
// Get the default recording device matching this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
// Start recording
targetDataLine.start();
System.out.println("Speak into the microphone to try one-sentence recognition and translation");
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
// Record for up to 50 seconds while recognizing in real time
while (System.currentTimeMillis() - start < 50000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
// Send the recorded audio to the streaming recognition service
if (!translator.sendAudioFrame(buffer)) {
System.out.println("sentence end, stop sending");
break;
}
buffer = ByteBuffer.allocate(1024);
// The recording rate is limited; sleep briefly to avoid high CPU usage
Thread.sleep(20);
}
}
} catch (LineUnavailableException e) {
throw new RuntimeException(e);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
});
translator.call(param, new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
if (result.getTranscriptionResult() == null) {
return;
}
try {
System.out.println("RequestId: " + result.getRequestId());
// Print the recognition results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
} else {
TranscriptionResult transcriptionResult = result.getTranscriptionResult();
System.out.println("\tTemp Result:" + transcriptionResult.getText());
if (result.getTranscriptionResult().isVadPreEnd()) {
System.out.printf("VadPreEnd: start:%d, end:%d, time:%d\n", transcriptionResult.getPreEndStartTime(), transcriptionResult.getPreEndEndTime(), transcriptionResult.getPreEndTimemillis());
}
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
@Override
public void onComplete() {
System.out.println("Translation complete");
}
@Override
public void onError(Exception e) {
}
});
thread.start();
thread.join();
translator.stop();
// System.exit(0);
}
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import pyaudio
import dashscope
from dashscope.audio.asr import *
# If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
# dashscope.api_key = "your-api-key"
mic = None
stream = None
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback open.")
mic = pyaudio.PyAudio()
stream = mic.open(
format=pyaudio.paInt16, channels=1, rate=16000, input=True
)
def on_close(self) -> None:
global mic
global stream
print("TranslationRecognizerCallback close.")
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if english_translation.vad_pre_end:
print("vad pre end {}, {}, {}".format(transcription_result.pre_end_start_time, transcription_result.pre_end_end_time, transcription_result.pre_end_timemillis))
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
callback = Callback()
translator = TranslationRecognizerChat(
model="gummy-chat-v1",
format="pcm",
sample_rate=16000,
transcription_enabled=True,
translation_enabled=True,
translation_target_languages=["en"],
callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验一句话语音识别和翻译功能")
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
if not translator.send_audio_frame(data):
print("sentence end, stop sending")
break
else:
break
translator.stop()
The audio file used in this example: hello_world.wav.
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;
import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
class RealtimeTranslateChatTask implements Runnable {
private Path filepath;
private TranslationRecognizerChat translator = null;
public RealtimeTranslateChatTask(Path filepath) {
this.filepath = filepath;
}
@Override
public void run() {
for (int i=0; i<1; i++) {
// Create translation params
// You can customize the translation parameters (model, format, sample_rate, ...).
// For more information, see
// https://help.aliyun.com/document_detail/2712536.html
TranslationRecognizerParam param =
TranslationRecognizerParam.builder()
// If the API Key is not configured as an environment variable, replace your-api-key with your own API Key
// .apiKey("your-api-key")
.model("gummy-chat-v1")
.format("wav") // 'pcm'、'wav'、'mp3'、'opus'、'speex'、'aac'、'amr', you
// can check the supported formats in the document
.sampleRate(16000) // supported 16000
.transcriptionEnabled(true)
.translationEnabled(true)
.translationLanguages(new String[] {"en"})
.build();
if (translator == null) {
translator = new TranslationRecognizerChat();
}
CountDownLatch latch = new CountDownLatch(1);
String threadName = Thread.currentThread().getName();
ResultCallback<TranslationRecognizerResult> callback =
new ResultCallback<TranslationRecognizerResult>() {
@Override
public void onEvent(TranslationRecognizerResult result) {
System.out.println("RequestId: " + result.getRequestId());
// Print the recognition results
if (result.getTranscriptionResult() != null) {
System.out.println("Transcription Result:"+result);
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranscriptionResult().getText());
} else {
System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
}
}
if (result.getTranslationResult() != null) {
System.out.println("English Translation Result:");
if (result.isSentenceEnd()) {
System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
} else {
System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
}
}
}
@Override
public void onComplete() {
System.out.println("[" + threadName + "] Translation complete");
latch.countDown();
}
@Override
public void onError(Exception e) {
e.printStackTrace();
System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
}
};
// set param & callback
translator.call(param, callback);
// Please replace the path with your audio file path
System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
// Read file and send audio by chunks
try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
// 3200-byte chunks ≈ 100 ms of 16 kHz, 16-bit mono audio
byte[] buffer = new byte[3200];
int bytesRead;
// Loop to read chunks of the file
while ((bytesRead = fis.read(buffer)) != -1) {
ByteBuffer byteBuffer;
// Handle the last chunk which might be smaller than the buffer size
System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
if (bytesRead < buffer.length) {
byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
} else {
byteBuffer = ByteBuffer.wrap(buffer);
}
// Send the ByteBuffer to the translation instance
if (!translator.sendAudioFrame(byteBuffer)) {
System.out.println("sentence end, stop sending");
break;
}
buffer = new byte[3200];
Thread.sleep(100);
}
System.out.println(LocalDateTime.now());
} catch (Exception e) {
e.printStackTrace();
}
translator.stop();
// wait for the translation to complete
try {
latch.await();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
}
}
public class Main {
public static void main(String[] args)
throws NoApiKeyException, InterruptedException {
String currentDir = System.getProperty("user.dir");
// Please replace the path with your audio source
Path[] filePaths = {
Paths.get(currentDir, "hello_world.wav"),
// Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
};
// Use ThreadPool to run recognition tasks
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (Path filepath:filePaths) {
executorService.submit(new RealtimeTranslateChatTask(filepath));
}
executorService.shutdown();
// wait for all tasks to complete
executorService.awaitTermination(1, TimeUnit.MINUTES);
// System.exit(0);
}
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html
import os
import requests
from http import HTTPStatus
import dashscope
from dashscope.audio.asr import *
# 若没有将API Key配置到环境变量中,需将your-api-key替换为自己的API Key
# dashscope.api_key = "your-api-key"
r = requests.get(
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
f.write(r.content)
class Callback(TranslationRecognizerCallback):
def on_open(self) -> None:
print("TranslationRecognizerCallback open.")
def on_close(self) -> None:
print("TranslationRecognizerCallback close.")
def on_event(
self,
request_id,
transcription_result: TranscriptionResult,
translation_result: TranslationResult,
usage,
) -> None:
print("request id: ", request_id)
print("usage: ", usage)
if translation_result is not None:
print(
"translation_languages: ",
translation_result.get_language_list(),
)
english_translation = translation_result.get_translation("en")
print("sentence id: ", english_translation.sentence_id)
print("translate to english: ", english_translation.text)
if transcription_result is not None:
print("sentence id: ", transcription_result.sentence_id)
print("transcription: ", transcription_result.text)
def on_error(self, message) -> None:
print('error: {}'.format(message))
def on_complete(self) -> None:
print('TranslationRecognizerCallback complete')
callback = Callback()
translator = TranslationRecognizerChat(
model="gummy-chat-v1",
format="wav",
sample_rate=16000,
callback=callback,
)
translator.start()
try:
audio_data: bytes = None
f = open("asr_example.wav", 'rb')
if os.path.getsize("asr_example.wav"):
while True:
audio_data = f.read(12800)
if not audio_data:
break
else:
if translator.send_audio_frame(audio_data):
print("send audio frame success")
else:
print("sentence end, stop sending")
break
else:
raise Exception(
'The supplied file was empty (zero bytes long)')
f.close()
except Exception as e:
raise e
translator.stop()
Real-time speech recognition can recognize speech captured from the microphone and output results as you speak, producing text while you talk.
Before running the Python example, install the third-party audio capture library with pip install pyaudio. (A device-listing sketch follows below.)
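If the default input device is not the microphone you want, you can first enumerate the available capture devices. A small sketch using PyAudio's standard device-query calls:

```python
import pyaudio

# List audio input devices so you can verify which microphone PyAudio will use.
pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(i, info["name"], int(info["defaultSampleRate"]))
pa.terminate()
```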
import pyaudio
from dashscope.audio.asr import (Recognition, RecognitionCallback,
RecognitionResult)
# If the API Key is not configured as an environment variable, uncomment the lines below and replace apiKey with your own API Key
# import dashscope
# dashscope.api_key = "apiKey"
mic = None
stream = None
class Callback(RecognitionCallback):
def on_open(self) -> None:
global mic
global stream
print('RecognitionCallback open.')
mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16,
channels=1,
rate=16000,
input=True)
def on_close(self) -> None:
global mic
global stream
print('RecognitionCallback close.')
stream.stop_stream()
stream.close()
mic.terminate()
stream = None
mic = None
def on_event(self, result: RecognitionResult) -> None:
print('RecognitionCallback sentence: ', result.get_sentence())
callback = Callback()
recognition = Recognition(model='paraformer-realtime-v2',
format='pcm',
sample_rate=16000,
callback=callback)
recognition.start()
while True:
if stream:
data = stream.read(3200, exception_on_overflow=False)
recognition.send_audio_frame(data)
else:
break
recognition.stop()
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import java.nio.ByteBuffer;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;
public class Main {
public static void main(String[] args) throws NoApiKeyException {
// Create a Flowable<ByteBuffer> as the audio source
Flowable<ByteBuffer> audioSource =
Flowable.create(
emitter -> {
new Thread(
() -> {
try {
// Define the audio format
AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
// Get the default recording device matching this format
TargetDataLine targetDataLine =
AudioSystem.getTargetDataLine(audioFormat);
targetDataLine.open(audioFormat);
// Start recording
targetDataLine.start();
ByteBuffer buffer = ByteBuffer.allocate(1024);
long start = System.currentTimeMillis();
// Record for 300 seconds while transcribing in real time
while (System.currentTimeMillis() - start < 300000) {
int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
if (read > 0) {
buffer.limit(read);
// Send the recorded audio to the streaming recognition service
emitter.onNext(buffer);
buffer = ByteBuffer.allocate(1024);
// The recording rate is limited; sleep briefly to avoid high CPU usage
Thread.sleep(20);
}
}
// Signal the end of transcription
emitter.onComplete();
} catch (Exception e) {
emitter.onError(e);
}
})
.start();
},
BackpressureStrategy.BUFFER);
// Create the recognizer
Recognition recognizer = new Recognition();
// Create the RecognitionParam; the Flowable<ByteBuffer> created above is passed to streamCall below
RecognitionParam param =
RecognitionParam.builder()
.model("paraformer-realtime-v2")
.format("pcm")
.sampleRate(16000)
// If the API Key is not configured as an environment variable, uncomment the line below and replace apikey with your own API Key
// .apiKey("apikey")
.build();
// Call the streaming API
recognizer
.streamCall(param, audioSource)
// Consume the results from the returned Flowable
.blockingForEach(
result -> {
// Print the recognition results
if (result.isSentenceEnd()) {
System.out.println("Fix:" + result.getSentence().getText());
} else {
System.out.println("Result:" + result.getSentence().getText());
}
});
System.exit(0);
}
}
Real-time speech recognition can also recognize a local audio file and output the result. For shorter, near-real-time recognition scenarios such as chat, voice commands, voice input methods, and voice search, consider using this interface.
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html
import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition
# If the API Key is not configured as an environment variable, uncomment the lines below and replace apiKey with your own API Key
# import dashscope
# dashscope.api_key = "apiKey"
# You can skip this URL download step and use a local file directly for recognition
r = requests.get(
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'
)
with open('asr_example.wav', 'wb') as f:
f.write(r.content)
recognition = Recognition(model='paraformer-realtime-v2',
format='wav',
sample_rate=16000,
# language_hints is supported only by the paraformer-v2 and paraformer-realtime-v2 models
language_hints=['zh', 'en'],
callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
print('Recognition result:')
print(result.get_sentence())
else:
print('Error: ', result.message)
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
public class Main {
public static void main(String[] args) {
// You can skip this URL download step and call the API with a local file directly
String exampleWavUrl =
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav";
try {
InputStream in = new URL(exampleWavUrl).openStream();
Files.copy(in, Paths.get("asr_example.wav"), StandardCopyOption.REPLACE_EXISTING);
} catch (IOException e) {
System.out.println("error: " + e);
System.exit(1);
}
// Create the Recognition instance
Recognition recognizer = new Recognition();
// Create the RecognitionParam
RecognitionParam param =
RecognitionParam.builder()
// If the API Key is not configured as an environment variable, uncomment the line below and replace apikey with your own API Key
// .apiKey("apikey")
.model("paraformer-realtime-v2")
.format("wav")
.sampleRate(16000)
// language_hints is supported only by the paraformer-v2 and paraformer-realtime-v2 models
.parameter("language_hints", new String[]{"zh", "en"})
.build();
try {
System.out.println("识别结果:" + recognizer.call(param, new File("asr_example.wav")));
} catch (Exception e) {
e.printStackTrace();
}
System.exit(0);
}
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html
import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition
# If the API Key is not configured as an environment variable, uncomment the lines below and replace apiKey with your own API Key
# import dashscope
# dashscope.api_key = "apiKey"
# You can skip this URL download step and use a local file directly for recognition
r = requests.get(
'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav'
)
with open('asr_japanese_example.wav', 'wb') as f:
f.write(r.content)
recognition = Recognition(model='paraformer-realtime-v2',
format='wav',
sample_rate=16000,
# language_hints is supported only by the paraformer-v2 and paraformer-realtime-v2 models
language_hints=['ja'],
callback=None)
result = recognition.call('asr_japanese_example.wav')
if result.status_code == HTTPStatus.OK:
print('Recognition result:')
print(result.get_sentence())
else:
print('Error: ', result.message)
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
public class Main {
public static void main(String[] args) {
// You can skip this URL download step and use a local file directly for recognition
String exampleWavUrl =
"https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav";
try {
InputStream in = new URL(exampleWavUrl).openStream();
Files.copy(in, Paths.get("asr_japanese_example.wav"), StandardCopyOption.REPLACE_EXISTING);
} catch (IOException e) {
System.out.println("error: " + e);
System.exit(1);
}
// Create the Recognition instance
Recognition recognizer = new Recognition();
// Create the RecognitionParam
RecognitionParam param =
RecognitionParam.builder()
// If the API Key is not configured as an environment variable, uncomment the line below and replace apikey with your own API Key
// .apiKey("apikey")
.model("paraformer-realtime-v2")
.format("wav")
.sampleRate(16000)
// language_hints is supported only by the paraformer-v2 and paraformer-realtime-v2 models
.parameter("language_hints", new String[]{"ja"})
.build();
try {
System.out.println("识别结果:" + recognizer.call(param, new File("asr_japanese_example.wav")));
} catch (Exception e) {
e.printStackTrace();
}
System.exit(0);
}
}
Input file limits
When recognizing a local audio file:
Input method: pass the local file path as a parameter.
Number of files: at most one file per call.
File size: unlimited.
Audio duration:
Paraformer: unlimited
Gummy real-time speech recognition: unlimited
Gummy one-sentence recognition: up to one minute
File formats: pcm, pcm-encoded wav, mp3, opus in an ogg container, speex in an ogg container, aac, and amr. pcm and wav are recommended.
Because audio formats and their variants are numerous, correct recognition of every format cannot be guaranteed. Test to verify that your files produce normal recognition results.
Bit depth: 16-bit.
Sample rate: varies by model.
The sample rate is the number of times the audio signal is sampled per second. A higher sample rate carries more information and can help recognition accuracy, but an excessively high rate may introduce irrelevant information and hurt results. Choose a model that matches the audio's actual sample rate: for example, 8000 Hz audio should use a model that supports 8000 Hz directly, rather than being upsampled to 16000 Hz. (A minimal routing sketch follows below.)
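As a sketch only (model names are those listed in this document; the helper itself is illustrative), routing audio to a model by its native sample rate might look like:

```python
# Illustrative only: pick a model that matches the audio's native sample rate
# instead of upsampling 8 kHz audio to 16 kHz.
def model_for_sample_rate(sample_rate_hz: int) -> str:
    if sample_rate_hz == 8000:
        return "paraformer-realtime-8k-v2"  # telephony-style 8 kHz audio
    if sample_rate_hz >= 16000:
        return "paraformer-realtime-v2"     # 16 kHz and above
    raise ValueError(f"unsupported sample rate: {sample_rate_hz} Hz")

print(model_for_sample_rate(8000))    # paraformer-realtime-8k-v2
print(model_for_sample_rate(16000))   # paraformer-realtime-v2
```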
API reference
FAQ
1. What factors can affect recognition accuracy?
Audio quality: the recording device, environment, and other factors affect speech clarity and therefore recognition accuracy. High-quality audio input is a prerequisite for accurate recognition.
Speaker characteristics: voices differ widely in pitch, speaking rate, accent, and dialect; these individual differences challenge recognition systems, especially for accents or dialects the model was not sufficiently trained on.
Language and vocabulary: recognition models are usually trained for specific languages. Mixed-language speech, technical terms, slang, and internet expressions increase recognition difficulty. If the model supports hotwords, you can use them to steer the recognition results.
Context understanding: without an understanding of the conversational context, the system may misinterpret speech, especially when the meaning is ambiguous or context-dependent.
2. What are the model rate limits?
Gummy:
Model name | RPS limit for job submission
gummy-realtime-v1 | 10
gummy-chat-v1 | 10
Paraformer:
Model name | RPS limit for job submission
paraformer-realtime-v2 | 20
paraformer-realtime-v1 | 20
paraformer-realtime-8k-v2 | 20
paraformer-realtime-8k-v1 | 20
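If you start many sessions concurrently, a simple client-side throttle can help you stay under these RPS limits. A minimal sketch (not part of the SDK); call acquire() before each session start:

```python
import threading
import time

class RpsThrottle:
    """Naive client-side limiter: at most `rps` acquisitions per second."""

    def __init__(self, rps: int):
        self._interval = 1.0 / rps
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def acquire(self) -> None:
        # Reserve the next start slot, then sleep until it arrives.
        with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait > 0:
            time.sleep(wait)

throttle = RpsThrottle(rps=10)  # e.g., the gummy-realtime-v1 limit
# throttle.acquire()            # call before each translator.start() / recognizer.call()
```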