Real-Time Speech Recognition

Last updated: 2025-04-15 09:59:07

Real-time speech recognition converts audio streams into text in real time, producing a "text appears as you speak" effect. It is suitable for real-time recognition of microphone input as well as real-time transcription of local audio files.

Application Scenarios

  • Meetings: real-time transcripts for meetings, lectures, training sessions, and court hearings.

  • Livestreaming: real-time subtitles for live-commerce streams, sports broadcasts, and more.

  • Customer service: real-time transcription of calls to help improve service quality.

  • Gaming: lets players dictate input or read chat messages without pausing gameplay.

  • Social chat: automatic speech-to-text in social apps and input methods.

  • Human-computer interaction: converts spoken dialogue to text for a smoother interaction experience.

Supported Models

Paraformer

| Model | Supported languages | Supported sample rates | Use cases | Supported audio formats | Unit price | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| paraformer-realtime-v2 | Mandarin Chinese; Chinese dialects (Cantonese, Wu, Minnan, Northeastern Mandarin, Gansu, Guizhou, Henan, Hubei, Hunan, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Jiangxi, Yunnan, Shanghainese); English; Japanese; Korean. Supports free switching among languages | Any | Video livestreams, meetings, etc. | pcm, wav, mp3, opus, speex, aac, amr | CNY 0.00024/second | 36,000 seconds (10 hours); granted automatically at 00:00 on the 1st of each month; valid for 1 month |
| paraformer-realtime-v1 | Chinese | 16 kHz | Video livestreams, meetings, etc. | pcm, wav, mp3, opus, speex, aac, amr | CNY 0.00024/second | Same as above |
| paraformer-realtime-8k-v2 | Chinese | 8 kHz | Telephone customer service, etc. | pcm, wav, mp3, opus, speex, aac, amr | CNY 0.00024/second | Same as above |
| paraformer-realtime-8k-v1 | Chinese | 8 kHz | Telephone customer service, etc. | pcm, wav, mp3, opus, speex, aac, amr | CNY 0.00024/second | Same as above |

Gummy

| Model | Supported languages | Supported sample rates | Use cases | Supported audio formats | Unit price | Free quota |
| --- | --- | --- | --- | --- | --- | --- |
| gummy-realtime-v1 | Chinese, English, Japanese, Korean, Cantonese, German, French, Russian, Italian, Spanish. Translation pairs: Chinese → English/Japanese/Korean; English → Chinese/Japanese/Korean; Japanese/Korean/Cantonese/German/French/Russian/Italian/Spanish → Chinese/English | 16 kHz and above | Long, uninterrupted recognition such as meeting speeches and video livestreams | pcm, wav, mp3, opus, speex, aac, amr | CNY 0.00015/second | 36,000 seconds (10 hours); if Model Studio (百炼) was activated before 00:00 on January 17, 2025, valid until July 15, 2025; if activated after that, valid for 180 days from the activation date |
| gummy-chat-v1 | Same as above | 16 kHz only | Short voice interactions such as chat, command control, voice input methods, and voice search | pcm, wav, mp3, opus, speex, aac, amr | CNY 0.00015/second | Same as above |

Model Selection Guide

  • Language support

    In multilingual scenarios, Gummy is recommended for higher recognition accuracy. Gummy also recognizes uncommon words more accurately.

    • For Mandarin Chinese, Cantonese, English, Japanese, and Korean, you can choose either a Gummy or a Paraformer model.

    • For German, French, Russian, Italian, and Spanish, choose a Gummy model.

    • For Chinese dialects, choose a Paraformer model.

  • Noisy environments: Paraformer is recommended.

  • Emotion recognition and filler-word filtering: if you need these capabilities, choose a Paraformer speech recognition model. (A small selection helper is sketched after this list.)
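
The following is a minimal, hypothetical helper that encodes the guidance above as code. The function pick_model and its rules are illustrative assumptions distilled from this guide, not part of the DashScope SDK.

def pick_model(languages, sample_rate_hz=16000, noisy=False,
               need_emotion=False, need_filler_filter=False):
    """Illustrative chooser based on the model selection guide above."""
    european = {"de", "fr", "ru", "it", "es"}
    langs = set(languages)
    # Only Paraformer handles Chinese dialects, noisy audio, emotion
    # recognition (paraformer-realtime-8k-v2), and filler-word filtering.
    if "zh-dialect" in langs or noisy or need_emotion or need_filler_filter:
        return ("paraformer-realtime-8k-v2" if sample_rate_hz <= 8000
                else "paraformer-realtime-v2")
    # Gummy covers the European languages and mixed-language audio best.
    if langs & european or len(langs) > 1:
        return "gummy-realtime-v1"
    return "paraformer-realtime-v2"

# e.g. pick_model({"zh", "en"}) -> "gummy-realtime-v1"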

Model feature comparison:

| Feature | Gummy real-time speech recognition | Paraformer real-time speech recognition |
| --- | --- | --- |
| Access methods | Python, Java, WebSocket | Python, Java, WebSocket |
| Custom hotwords | Supported | Supported |
| Emotion recognition | Not supported | Supported by paraformer-realtime-8k-v2 |
| Filler-word filtering | Not supported | Supported |
| Timestamps | Supported | Supported |
| Streaming input | Supported | Supported |
| Streaming output | Supported | Supported |
| Local file recognition | Supported | Supported |
| Punctuation prediction | Supported | Supported |
| Input audio formats | pcm, pcm-encoded wav, mp3, ogg-wrapped opus, ogg-wrapped speex, aac, amr | pcm, pcm-encoded wav, mp3, ogg-wrapped opus, ogg-wrapped speex, aac, amr |
| Input bit depth | 16-bit | 16-bit |
| Input channels | Mono | Mono |
| Input sample rate | Varies by model: gummy-realtime-v1 supports 16 kHz and above; gummy-chat-v1 supports 16 kHz only | Varies by model: paraformer-realtime-v2 supports any sample rate; paraformer-realtime-v1 supports 16 kHz only; paraformer-realtime-8k-v2 and paraformer-realtime-8k-v1 support 8 kHz only |
| Input audio duration | gummy-realtime-v1: unlimited; gummy-chat-v1: up to 60 seconds | Unlimited |
| Recognizable languages | Chinese, English, Japanese, Korean, Cantonese, German, French, Russian, Italian, Spanish | Varies by model: paraformer-realtime-v2 supports Chinese (Mandarin plus dialects: Shanghainese, Wu, Minnan, Northeastern Mandarin, Gansu, Guizhou, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shanxi, Shaanxi, Shandong, Sichuan, Tianjin, Yunnan, Cantonese), English, Japanese, Korean; paraformer-realtime-v1, paraformer-realtime-8k-v2, and paraformer-realtime-8k-v1 support Chinese only |
| Unit price | CNY 0.54/hour | CNY 0.864/hour |
| Free quota | 36,000 seconds (10 hours); if Model Studio (百炼) was activated before 00:00 on January 17, 2025, valid until July 15, 2025; otherwise valid for 180 days from the activation date | 10 hours/month |

Quick Start

You can try it online first: on the Speech Recognition page, select the "Paraformer real-time speech recognition - v2" model and click Try Now (立即体验).

Below is sample code for calling the API. For more code samples covering common scenarios, see GitHub.

You must have obtained an API key and configured it as an environment variable. If you call the service through an SDK, you also need to install the DashScope SDK.
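
Before running the samples, the short sketch below checks that the environment is ready. It assumes the SDK reads the key from the DASHSCOPE_API_KEY environment variable (the variable name used by the official DashScope SDKs) and that the Python SDK was installed with pip install dashscope.

import os
import sys

# The DashScope SDKs pick up the API key from this environment variable
# when no key is set explicitly in code.
if not os.environ.get("DASHSCOPE_API_KEY"):
    sys.exit("DASHSCOPE_API_KEY is not set; export it before running the samples.")

try:
    import dashscope  # installed via: pip install dashscope
except ImportError:
    sys.exit("The DashScope SDK is missing; install it with: pip install dashscope")

print("Environment looks ready.")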

Gummy
Paraformer

Real-time speech recognition: for long, uninterrupted recognition scenarios such as meeting speeches and video livestreams.

Sentence recognition: more sensitive to pauses; accurately recognizes short utterances of up to one minute, suited to short voice interactions such as chat, command control, voice input methods, and voice search.

Real-time speech recognition
Sentence recognition

Real-time speech recognition recognizes long audio streams (whether captured from an external device such as a microphone or read from a local file) and returns results as a stream.

Recognize microphone input
Recognize a local file
Java
Python
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executorService = Executors.newSingleThreadExecutor();
        executorService.submit(new RealtimeRecognitionTask());
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}

class RealtimeRecognitionTask implements Runnable {
    @Override
    public void run() {
        String targetLanguage = "en";
        // Initialize the request parameters
        TranslationRecognizerParam param =
                TranslationRecognizerParam.builder()
                        // If the API key is not set in an environment variable, replace your-api-key with your own API key
                        // .apiKey("your-api-key")
                        .model("gummy-realtime-v1") // model name
                        .format("pcm") // input audio format; supported: pcm, wav, mp3, opus, speex, aac, amr
                        .sampleRate(16000) // input sample rate in Hz; 16000 Hz and above supported
                        .transcriptionEnabled(true) // enable real-time recognition
                        .sourceLanguage("auto") // source (recognition/translation) language code
                        .translationEnabled(true) // enable real-time translation
                        .translationLanguages(new String[] {targetLanguage}) // translation target languages
                        .build();

        // Initialize the result callback
        ResultCallback<TranslationRecognizerResult> callback =
                new ResultCallback<TranslationRecognizerResult>() {
                    @Override
                    public void onEvent(TranslationRecognizerResult result) {
                        System.out.println("RequestId: " + result.getRequestId());
                        // Print the final result
                        if (result.getTranscriptionResult() != null) {
                            System.out.println("Transcription Result:"+result);
                            if (result.isSentenceEnd()) {
                                System.out.println("\tFix:" + result.getTranscriptionResult().getText());
                                System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
                            } else {
                                System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
                            }
                        }
                        if (result.getTranslationResult() != null) {
                            System.out.println("English Translation Result:");
                            if (result.isSentenceEnd()) {
                                System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
                                System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
                            } else {
                                System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
                            }
                        }
                    }

                    @Override
                    public void onComplete() {
                        System.out.println("Translation complete");
                    }

                    @Override
                    public void onError(Exception e) {
                        e.printStackTrace();
                        System.out.println("TranslationCallback error: " + e.getMessage());
                    }
                };

        // Initialize the streaming recognition service
        TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
        // Start streaming recognition/translation with the request parameters and callback
        translator.call(param, callback);

        try {
            // Create the audio format
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            // Open the default recording device that matches this format
            TargetDataLine targetDataLine =
                    AudioSystem.getTargetDataLine(audioFormat);
            targetDataLine.open(audioFormat);
            // Start recording
            targetDataLine.start();
            System.out.println("Speak into the microphone to try real-time speech recognition and translation");
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            long start = System.currentTimeMillis();
            // Record for 50 seconds while recognizing in real time
            while (System.currentTimeMillis() - start < 50000) {
                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                if (read > 0) {
                    buffer.limit(read);
                    // Send the recorded audio to the streaming recognition service
                    translator.sendAudioFrame(buffer);
                    buffer = ByteBuffer.allocate(1024);
                    // Recording is rate-limited; sleep briefly to avoid excessive CPU usage
                    Thread.sleep(20);
                }
            }
            // Signal the end of the stream
            translator.stop();
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println(
                "[Metric] requestId: "
                        + translator.getLastRequestId()
                        + ", first package delay ms: "
                        + translator.getFirstPackageDelay()
                        + ", last package delay ms: "
                        + translator.getLastPackageDelay());
    }
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html

import pyaudio
import dashscope
from dashscope.audio.asr import *


# If the API key is not set in an environment variable, replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"

mic = None
stream = None

class Callback(TranslationRecognizerCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print("TranslationRecognizerCallback open.")
        mic = pyaudio.PyAudio()
        stream = mic.open(
            format=pyaudio.paInt16, channels=1, rate=16000, input=True
        )

    def on_close(self) -> None:
        global mic
        global stream
        print("TranslationRecognizerCallback close.")
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(
        self,
        request_id,
        transcription_result: TranscriptionResult,
        translation_result: TranslationResult,
        usage,
    ) -> None:
        print("request id: ", request_id)
        print("usage: ", usage)
        if translation_result is not None:
            print(
                "translation_languages: ",
                translation_result.get_language_list(),
            )
            english_translation = translation_result.get_translation("en")
            print("sentence id: ", english_translation.sentence_id)
            print("translate to english: ", english_translation.text)
            if english_translation.stash is not None:
                print(
                    "translate to english stash: ",
                    translation_result.get_translation("en").stash.text,
                )
        if transcription_result is not None:
            print("sentence id: ", transcription_result.sentence_id)
            print("transcription: ", transcription_result.text)
            if transcription_result.stash is not None:
                print("transcription stash: ", transcription_result.stash.text)


callback = Callback()


translator = TranslationRecognizerRealtime(
    model="gummy-realtime-v1",
    format="pcm",
    sample_rate=16000,
    transcription_enabled=True,
    translation_enabled=True,
    translation_target_languages=["en"],
    callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验实时语音识别和翻译功能")
while True:
    if stream:
        data = stream.read(3200, exception_on_overflow=False)
        translator.send_audio_frame(data)
    else:
        break

translator.stop()
Java
Python

The audio file used in this sample: hello_world.wav

import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerRealtime;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class RealtimeTranslateTask implements Runnable {
    private Path filepath;

    public RealtimeTranslateTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        String targetLanguage = "en";
        // Create translation params
        // you can customize the translation parameters, like model, format,
        // sample_rate for more information, please refer to
        // https://help.aliyun.com/document_detail/2712536.html
        TranslationRecognizerParam param =
                TranslationRecognizerParam.builder()
                        // If the API key is not set in an environment variable, replace your-api-key with your own API key
                        // .apiKey("your-api-key")
                        .model("gummy-realtime-v1")
                        .format("pcm") // 'pcm', 'wav', 'mp3', 'opus', 'speex', 'aac', 'amr';
                        // see the documentation for the supported formats
                        .sampleRate(16000)
                        .transcriptionEnabled(true)
                        .sourceLanguage("auto")
                        .translationEnabled(true)
                        .translationLanguages(new String[] {targetLanguage})
                        .build();
        TranslationRecognizerRealtime translator = new TranslationRecognizerRealtime();
        CountDownLatch latch = new CountDownLatch(1);

        String threadName = Thread.currentThread().getName();

        ResultCallback<TranslationRecognizerResult> callback =
                new ResultCallback<TranslationRecognizerResult>() {
                    @Override
                    public void onEvent(TranslationRecognizerResult result) {
                        System.out.println("RequestId: " + result.getRequestId());
                        // Print the final result
                        if (result.getTranscriptionResult() != null) {
                            System.out.println("Transcription Result:"+result);
                            if (result.isSentenceEnd()) {
                                System.out.println("\tFix:" + result.getTranscriptionResult().getText());
                                System.out.println("\tStash:" + result.getTranscriptionResult().getStash());
                            } else {
                                System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
                            }
                        }
                        if (result.getTranslationResult() != null) {
                            System.out.println("English Translation Result:");
                            if (result.isSentenceEnd()) {
                                System.out.println("\tFix:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
                                System.out.println("\tStash:" + result.getTranslationResult().getTranslation(targetLanguage).getStash());
                            } else {
                                System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation(targetLanguage).getText());
                            }
                        }
                    }

                    @Override
                    public void onComplete() {
                        System.out.println("[" + threadName + "] Translation complete");
                        latch.countDown();
                    }

                    @Override
                    public void onError(Exception e) {
                        e.printStackTrace();
                        System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
                    }
                };
        // set param & callback
        translator.call(param, callback);
        // Please replace the path with your audio file path
        System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
        // Read file and send audio by chunks
        try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
            // 3200-byte chunks = 100 ms of 16 kHz, 16-bit, mono audio
            byte[] buffer = new byte[3200];
            int bytesRead;
            // Loop to read chunks of the file
            while ((bytesRead = fis.read(buffer)) != -1) {
                ByteBuffer byteBuffer;
                // Handle the last chunk which might be smaller than the buffer size
                System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
                if (bytesRead < buffer.length) {
                    byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
                } else {
                    byteBuffer = ByteBuffer.wrap(buffer);
                }
                // Send the ByteBuffer to the translation instance
                translator.sendAudioFrame(byteBuffer);
                buffer = new byte[3200];
                Thread.sleep(100);
            }
            System.out.println(LocalDateTime.now());
        } catch (Exception e) {
            e.printStackTrace();
        }

        translator.stop();
        // wait for the translation to complete
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

}

public class Main {
    public static void main(String[] args)
            throws NoApiKeyException, InterruptedException {

        String currentDir = System.getProperty("user.dir");
        // Please replace the path with your audio source
        Path[] filePaths = {
                Paths.get(currentDir, "hello_world.wav"),
//                Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
        };
        // Use ThreadPool to run recognition tasks
        ExecutorService executorService = Executors.newFixedThreadPool(10);
        for (Path filepath:filePaths) {
            executorService.submit(new RealtimeTranslateTask(filepath));
        }
        executorService.shutdown();
        // wait for all tasks to complete
        executorService.awaitTermination(1, TimeUnit.MINUTES);
        System.exit(0);
    }
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html

import os
import requests
from http import HTTPStatus

import dashscope
from dashscope.audio.asr import *

# If the API key is not set in an environment variable, replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"

r = requests.get(
    "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
    f.write(r.content)


class Callback(TranslationRecognizerCallback):
    def on_open(self) -> None:
        print("TranslationRecognizerCallback open.")

    def on_close(self) -> None:
        print("TranslationRecognizerCallback close.")

    def on_event(
        self,
        request_id,
        transcription_result: TranscriptionResult,
        translation_result: TranslationResult,
        usage,
    ) -> None:
        print("request id: ", request_id)
        print("usage: ", usage)
        if translation_result is not None:
            print(
                "translation_languages: ",
                translation_result.get_language_list(),
            )
            english_translation = translation_result.get_translation("en")
            print("sentence id: ", english_translation.sentence_id)
            print("translate to english: ", english_translation.text)
            if english_translation.stash is not None:
                print(
                    "translate to english stash: ",
                    translation_result.get_translation("en").stash.text,
                )
        if transcription_result is not None:
            print("sentence id: ", transcription_result.sentence_id)
            print("transcription: ", transcription_result.text)
            if transcription_result.stash is not None:
                print("transcription stash: ", transcription_result.stash.text)
    
    def on_error(self, message) -> None:
        print('error: {}'.format(message))
    
    def on_complete(self) -> None:
        print('TranslationRecognizerCallback complete')


callback = Callback()


translator = TranslationRecognizerRealtime(
    model="gummy-realtime-v1",
    format="wav",
    sample_rate=16000,
    callback=callback,
)

translator.start()

try:
    audio_data: bytes = None
    f = open("asr_example.wav", 'rb')
    if os.path.getsize("asr_example.wav"):
        while True:
            audio_data = f.read(12800)
            if not audio_data:
                break
            else:
                translator.send_audio_frame(audio_data)
    else:
        raise Exception(
            'The supplied file was empty (zero bytes long)')
    f.close()
except Exception as e:
    raise e

translator.stop()

Sentence recognition recognizes audio streams up to one minute long (whether captured from an external device such as a microphone or read from a local file) and returns results as a stream.

Recognize microphone input
Recognize a local file
Java
Python
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranscriptionResult;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import java.nio.ByteBuffer;

public class Main {

    public static void main(String[] args) throws NoApiKeyException, InterruptedException {

        // Create the recognizer
        TranslationRecognizerChat translator = new TranslationRecognizerChat();
        // Create the request parameters
        TranslationRecognizerParam param =
                TranslationRecognizerParam.builder()
                        // If the API key is not set in an environment variable, replace your-api-key with your own API key
                        // .apiKey("your-api-key")
                        .model("gummy-chat-v1")
                        .format("pcm") // 'pcm', 'wav', 'mp3', 'opus', 'speex', 'aac', 'amr';
                        // see the documentation for the supported formats
                        .sampleRate(16000) // only 16000 Hz is supported
                        .transcriptionEnabled(true)
                        .translationEnabled(true)
                        .translationLanguages(new String[] {"en"})
                        .build();


        // Start a recording thread that captures microphone audio
        Thread thread = new Thread(
                () -> {
                    try {
                        // Create the audio format
                        AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                        // Open the default recording device that matches this format
                        TargetDataLine targetDataLine =
                                AudioSystem.getTargetDataLine(audioFormat);
                        targetDataLine.open(audioFormat);
                        // Start recording
                        targetDataLine.start();
                        System.out.println("Speak into the microphone to try one-sentence speech recognition and translation");
                        ByteBuffer buffer = ByteBuffer.allocate(1024);
                        long start = System.currentTimeMillis();
                        // Record for up to 50 seconds while recognizing in real time
                        while (System.currentTimeMillis() - start < 50000) {
                            int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                            if (read > 0) {
                                buffer.limit(read);
                                // Send the recorded audio to the streaming recognition service
                                if (!translator.sendAudioFrame(buffer)) {
                                    System.out.println("sentence end, stop sending");
                                    break;
                                }
                                buffer = ByteBuffer.allocate(1024);
                                // Recording is rate-limited; sleep briefly to avoid excessive CPU usage
                                Thread.sleep(20);
                            }
                        }
                    } catch (LineUnavailableException e) {
                        throw new RuntimeException(e);
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                });

        translator.call(param, new ResultCallback<TranslationRecognizerResult>() {
            @Override
            public void onEvent(TranslationRecognizerResult result) {
                if (result.getTranscriptionResult() == null) {
                    return;
                }
                try {
                    System.out.println("RequestId: " + result.getRequestId());
                    // Print the final result
                    if (result.getTranscriptionResult() != null) {
                        System.out.println("Transcription Result:");
                        if (result.isSentenceEnd()) {
                            System.out.println("\tFix:" + result.getTranscriptionResult().getText());
                        } else {
                            TranscriptionResult transcriptionResult = result.getTranscriptionResult();
                            System.out.println("\tTemp Result:" + transcriptionResult.getText());
                            if (result.getTranscriptionResult().isVadPreEnd()) {
                                System.out.printf("VadPreEnd: start:%d, end:%d, time:%d\n", transcriptionResult.getPreEndStartTime(), transcriptionResult.getPreEndEndTime(), transcriptionResult.getPreEndTimemillis());
                            }
                        }
                    }
                    if (result.getTranslationResult() != null) {
                        System.out.println("English Translation Result:");
                        if (result.isSentenceEnd()) {
                            System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
                        } else {
                            System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }

            }

            @Override
            public void onComplete() {
                System.out.println("Translation complete");
            }

            @Override
            public void onError(Exception e) {

            }
        });

        thread.start();
        thread.join();
        translator.stop();
//        System.exit(0);
    }
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html

import pyaudio
import dashscope
from dashscope.audio.asr import *


# If the API key is not set in an environment variable, replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"

mic = None
stream = None

class Callback(TranslationRecognizerCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print("TranslationRecognizerCallback open.")
        mic = pyaudio.PyAudio()
        stream = mic.open(
            format=pyaudio.paInt16, channels=1, rate=16000, input=True
        )

    def on_close(self) -> None:
        global mic
        global stream
        print("TranslationRecognizerCallback close.")
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(
        self,
        request_id,
        transcription_result: TranscriptionResult,
        translation_result: TranslationResult,
        usage,
    ) -> None:
        print("request id: ", request_id)
        print("usage: ", usage)
        if translation_result is not None:
            print(
                "translation_languages: ",
                translation_result.get_language_list(),
            )
            english_translation = translation_result.get_translation("en")
            print("sentence id: ", english_translation.sentence_id)
            print("translate to english: ", english_translation.text)
            if english_translation.vad_pre_end:
                print("vad pre end {}, {}, {}".format(transcription_result.pre_end_start_time, transcription_result.pre_end_end_time, transcription_result.pre_end_timemillis))
        if transcription_result is not None:
            print("sentence id: ", transcription_result.sentence_id)
            print("transcription: ", transcription_result.text)


callback = Callback()


translator = TranslationRecognizerChat(
    model="gummy-chat-v1",
    format="pcm",
    sample_rate=16000,
    transcription_enabled=True,
    translation_enabled=True,
    translation_target_languages=["en"],
    callback=callback,
)
translator.start()
print("请您通过麦克风讲话体验一句话语音识别和翻译功能")
while True:
    if stream:
        data = stream.read(3200, exception_on_overflow=False)
        if not translator.send_audio_frame(data):
            print("sentence end, stop sending")
            break
    else:
        break

translator.stop()
Java
Python

The audio file used in this sample: hello_world.wav

import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerChat;
import com.alibaba.dashscope.audio.asr.translation.TranslationRecognizerParam;
import com.alibaba.dashscope.audio.asr.translation.results.TranslationRecognizerResult;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.exception.NoApiKeyException;

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class RealtimeTranslateChatTask implements Runnable {
    private Path filepath;
    private TranslationRecognizerChat translator = null;

    public RealtimeTranslateChatTask(Path filepath) {
        this.filepath = filepath;
    }

    @Override
    public void run() {
        for (int i=0; i<1; i++) {
            // Create translation params
            // you can customize the translation parameters, like model, format,
            // sample_rate for more information, please refer to
            // https://help.aliyun.com/document_detail/2712536.html
            TranslationRecognizerParam param =
                    TranslationRecognizerParam.builder()
                            // If the API key is not set in an environment variable, replace your-api-key with your own API key
                            // .apiKey("your-api-key")
                            .model("gummy-chat-v1")
                            .format("wav") // 'pcm', 'wav', 'mp3', 'opus', 'speex', 'aac', 'amr';
                            // see the documentation for the supported formats
                            .sampleRate(16000) // only 16000 Hz is supported
                            .transcriptionEnabled(true)
                            .translationEnabled(true)
                            .translationLanguages(new String[] {"en"})
                            .build();
            if (translator == null) {
                translator = new TranslationRecognizerChat();
            }
            CountDownLatch latch = new CountDownLatch(1);

            String threadName = Thread.currentThread().getName();

            ResultCallback<TranslationRecognizerResult> callback =
                    new ResultCallback<TranslationRecognizerResult>() {
                        @Override
                        public void onEvent(TranslationRecognizerResult result) {
                            System.out.println("RequestId: " + result.getRequestId());
                            // Print the final result
                            if (result.getTranscriptionResult() != null) {
                                System.out.println("Transcription Result:"+result);
                                if (result.isSentenceEnd()) {
                                    System.out.println("\tFix:" + result.getTranscriptionResult().getText());
                                } else {
                                    System.out.println("\tTemp Result:" + result.getTranscriptionResult().getText());
                                }
                            }
                            if (result.getTranslationResult() != null) {
                                System.out.println("English Translation Result:");
                                if (result.isSentenceEnd()) {
                                    System.out.println("\tFix:" + result.getTranslationResult().getTranslation("en").getText());
                                } else {
                                    System.out.println("\tTemp Result:" + result.getTranslationResult().getTranslation("en").getText());
                                }
                            }
                        }

                        @Override
                        public void onComplete() {
                            System.out.println("[" + threadName + "] Translation complete");
                            latch.countDown();
                        }

                        @Override
                        public void onError(Exception e) {
                            e.printStackTrace();
                            System.out.println("[" + threadName + "] TranslationCallback error: " + e.getMessage());
                        }
                    };
            // set param & callback
            translator.call(param, callback);
            // Please replace the path with your audio file path
            System.out.println("[" + threadName + "] Input file_path is: " + this.filepath);
            // Read file and send audio by chunks
            try (FileInputStream fis = new FileInputStream(this.filepath.toFile())) {
                // 3200-byte chunks = 100 ms of 16 kHz, 16-bit, mono audio
                byte[] buffer = new byte[3200];
                int bytesRead;
                // Loop to read chunks of the file
                while ((bytesRead = fis.read(buffer)) != -1) {
                    ByteBuffer byteBuffer;
                    // Handle the last chunk which might be smaller than the buffer size
                    System.out.println("[" + threadName + "] bytesRead: " + bytesRead);
                    if (bytesRead < buffer.length) {
                        byteBuffer = ByteBuffer.wrap(buffer, 0, bytesRead);
                    } else {
                        byteBuffer = ByteBuffer.wrap(buffer);
                    }
                    // Send the ByteBuffer to the translation instance
                    if (!translator.sendAudioFrame(byteBuffer)) {
                        System.out.println("sentence end, stop sending");
                        break;
                    }
                    buffer = new byte[3200];
                    Thread.sleep(100);
                }
                System.out.println(LocalDateTime.now());
            } catch (Exception e) {
                e.printStackTrace();
            }

            translator.stop();
            // wait for the translation to complete
            try {
                latch.await();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
    }

}


public class Main {
    public static void main(String[] args)
            throws NoApiKeyException, InterruptedException {
        String currentDir = System.getProperty("user.dir");
        // Please replace the path with your audio source
        Path[] filePaths = {
                Paths.get(currentDir, "hello_world.wav"),
//                Paths.get(currentDir, "hello_world_male_16k_16bit_mono.wav"),
        };
        // Use ThreadPool to run recognition tasks
        ExecutorService executorService = Executors.newFixedThreadPool(10);
        for (Path filepath:filePaths) {
            executorService.submit(new RealtimeTranslateChatTask(filepath));
        }
        executorService.shutdown();
        // wait for all tasks to complete
        executorService.awaitTermination(1, TimeUnit.MINUTES);
//        System.exit(0);
    }
}
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/xxxxx.html

import os
import requests
from http import HTTPStatus

import dashscope
from dashscope.audio.asr import *

# If the API key is not set in an environment variable, replace your-api-key with your own API key
# dashscope.api_key = "your-api-key"

r = requests.get(
    "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav"
)
with open("asr_example.wav", "wb") as f:
    f.write(r.content)


class Callback(TranslationRecognizerCallback):
    def on_open(self) -> None:
        print("TranslationRecognizerCallback open.")

    def on_close(self) -> None:
        print("TranslationRecognizerCallback close.")

    def on_event(
            self,
            request_id,
            transcription_result: TranscriptionResult,
            translation_result: TranslationResult,
            usage,
    ) -> None:
        print("request id: ", request_id)
        print("usage: ", usage)
        if translation_result is not None:
            print(
                "translation_languages: ",
                translation_result.get_language_list(),
            )
            english_translation = translation_result.get_translation("en")
            print("sentence id: ", english_translation.sentence_id)
            print("translate to english: ", english_translation.text)
        if transcription_result is not None:
            print("sentence id: ", transcription_result.sentence_id)
            print("transcription: ", transcription_result.text)

    def on_error(self, message) -> None:
        print('error: {}'.format(message))

    def on_complete(self) -> None:
        print('TranslationRecognizerCallback complete')


callback = Callback()

translator = TranslationRecognizerChat(
    model="gummy-chat-v1",
    format="wav",
    sample_rate=16000,
    callback=callback,
)

translator.start()

try:
    audio_data: bytes = None
    f = open("asr_example.wav", 'rb')
    if os.path.getsize("asr_example.wav"):
        while True:
            audio_data = f.read(12800)
            if not audio_data:
                break
            else:
                if translator.send_audio_frame(audio_data):
                    print("send audio frame success")
                else:
                    print("sentence end, stop sending")
                    break
    else:
        raise Exception(
            'The supplied file was empty (zero bytes long)')
    f.close()
except Exception as e:
    raise e

translator.stop()
Recognize microphone input
Recognize a local audio file

Real-time speech recognition can recognize speech coming from a microphone and output results as the user speaks, producing a "text appears as you speak" effect.

Python
Java

Before running the Python sample, install the third-party audio playback and capture library with pip install pyaudio.

import pyaudio
from dashscope.audio.asr import (Recognition, RecognitionCallback,
                                 RecognitionResult)

# If the API key is not set in an environment variable, uncomment the lines below and replace apiKey with your own API key
# import dashscope
# dashscope.api_key = "apiKey"

mic = None
stream = None


class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(self, result: RecognitionResult) -> None:
        print('RecognitionCallback sentence: ', result.get_sentence())


callback = Callback()
recognition = Recognition(model='paraformer-realtime-v2',
                          format='pcm',
                          sample_rate=16000,
                          callback=callback)
recognition.start()

while True:
    if stream:
        data = stream.read(3200, exception_on_overflow=False)
        recognition.send_audio_frame(data)
    else:
        break

recognition.stop()
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import java.nio.ByteBuffer;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

public class Main {

    public static void main(String[] args) throws NoApiKeyException {
        // Create a Flowable<ByteBuffer> audio source
        Flowable<ByteBuffer> audioSource =
                Flowable.create(
                        emitter -> {
                            new Thread(
                                    () -> {
                                        try {
                                            // Create the audio format
                                            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                                            // Open the default recording device that matches this format
                                            TargetDataLine targetDataLine =
                                                    AudioSystem.getTargetDataLine(audioFormat);
                                            targetDataLine.open(audioFormat);
                                            // Start recording
                                            targetDataLine.start();
                                            ByteBuffer buffer = ByteBuffer.allocate(1024);
                                            long start = System.currentTimeMillis();
                                            // Record for 5 minutes while transcribing in real time
                                            while (System.currentTimeMillis() - start < 300000) {
                                                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                                                if (read > 0) {
                                                    buffer.limit(read);
                                                    // Send the recorded audio to the streaming recognition service
                                                    emitter.onNext(buffer);
                                                    buffer = ByteBuffer.allocate(1024);
                                                    // Recording is rate-limited; sleep briefly to avoid excessive CPU usage
                                                    Thread.sleep(20);
                                                }
                                            }
                                            // Signal the end of transcription
                                            emitter.onComplete();
                                        } catch (Exception e) {
                                            emitter.onError(e);
                                        }
                                    })
                                    .start();
                        },
                        BackpressureStrategy.BUFFER);

        // Create the recognizer
        Recognition recognizer = new Recognition();
        // Create RecognitionParam; the Flowable<ByteBuffer> created above is passed to streamCall
        RecognitionParam param =
                RecognitionParam.builder()
                        .model("paraformer-realtime-v2")
                        .format("pcm")
                        .sampleRate(16000)
                        // If the API key is not set in an environment variable, uncomment the line below and replace apikey with your own API key
                        // .apiKey("apikey")
                        .build();

        // Call the streaming interface
        recognizer
                .streamCall(param, audioSource)
                // Consume the result stream (blockingForEach blocks until the stream completes)
                .blockingForEach(
                        result -> {
                            // Print the final result
                            if (result.isSentenceEnd()) {
                                System.out.println("Fix:" + result.getSentence().getText());
                            } else {
                                System.out.println("Result:" + result.getSentence().getText());
                            }
                        });
        System.exit(0);
    }
}

Real-time speech recognition can also recognize local audio files and output the result. This interface is suitable for shorter, near-real-time recognition scenarios such as chat, command control, voice input methods, and voice search.

Recognize Chinese and English
Recognize Japanese
Python
Java
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html

import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition

# If the API key is not set in an environment variable, uncomment the lines below and replace apiKey with your own API key
# import dashscope
# dashscope.api_key = "apiKey"

# You can skip this download step and recognize a local file directly
r = requests.get(
    'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'
)
with open('asr_example.wav', 'wb') as f:
    f.write(r.content)

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # 'language_hints' is only supported by the paraformer-v2 and paraformer-realtime-v2 models
                          language_hints=['zh', 'en'],
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Main {
    public static void main(String[] args) {
        // You can skip this download step and recognize a local file directly
        String exampleWavUrl =
                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav";
        try {
            InputStream in = new URL(exampleWavUrl).openStream();
            Files.copy(in, Paths.get("asr_example.wav"), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            System.out.println("error: " + e);
            System.exit(1);
        }

        // Create a Recognition instance
        Recognition recognizer = new Recognition();
        // Create RecognitionParam
        RecognitionParam param =
                RecognitionParam.builder()
                        // If the API key is not set in an environment variable, uncomment the line below and replace apikey with your own API key
                        // .apiKey("apikey")
                        .model("paraformer-realtime-v2")
                        .format("wav")
                        .sampleRate(16000)
                        // 'language_hints' is only supported by the paraformer-v2 and paraformer-realtime-v2 models
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .build();

        try {
            System.out.println("识别结果:" + recognizer.call(param, new File("asr_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}
Python
Java
# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html

import requests
from http import HTTPStatus
from dashscope.audio.asr import Recognition

# If the API key is not set in an environment variable, uncomment the lines below and replace apiKey with your own API key
# import dashscope
# dashscope.api_key = "apiKey"

# You can skip this download step and recognize a local file directly
r = requests.get(
    'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav'
)
with open('asr_japanese_example.wav', 'wb') as f:
    f.write(r.content)

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # 'language_hints' is only supported by the paraformer-v2 and paraformer-realtime-v2 models
                          language_hints=['ja'],
                          callback=None)
result = recognition.call('asr_japanese_example.wav')
if result.status_code == HTTPStatus.OK:
    print('Recognition result:')
    print(result.get_sentence())
else:
    print('Error: ', result.message)
import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Main {
    public static void main(String[] args) {
        // You can skip this download step and recognize a local file directly
        String exampleWavUrl =
                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav";
        try {
            InputStream in = new URL(exampleWavUrl).openStream();
            Files.copy(in, Paths.get("asr_japanese_example.wav"), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            System.out.println("error: " + e);
            System.exit(1);
        }

        // Create a Recognition instance
        Recognition recognizer = new Recognition();
        // Create RecognitionParam
        RecognitionParam param =
                RecognitionParam.builder()
                        // If the API key is not set in an environment variable, uncomment the line below and replace apikey with your own API key
                        // .apiKey("apikey")
                        .model("paraformer-realtime-v2")
                        .format("wav")
                        .sampleRate(16000)
                        // 'language_hints' is only supported by the paraformer-v2 and paraformer-realtime-v2 models
                        .parameter("language_hints", new String[]{"ja"})
                        .build();

        try {
            System.out.println("识别结果:" + recognizer.call(param, new File("asr_japanese_example.wav")));
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}

Input File Limits

When recognizing a local audio file:

  • Input method: pass the local file path as a parameter.

  • Number of files: at most one file per call.

  • File size: unlimited.

  • Audio duration:

    • Paraformer: unlimited

    • Gummy real-time speech recognition: unlimited

    • Gummy sentence recognition: up to one minute

  • File formats: pcm, pcm-encoded wav, mp3, ogg-wrapped opus, ogg-wrapped speex, aac, and amr. pcm or wav is recommended.

    Because audio formats and their variants are numerous, correct recognition of every format cannot be guaranteed. Test your files to verify that they produce normal recognition results.

  • Bit depth: 16-bit.

  • Sample rate: varies by model.

    The sample rate is the number of times per second the sound signal is sampled. A higher sample rate carries more information and helps recognition accuracy, but an excessively high rate may introduce irrelevant information and hurt results. Choose a model that matches the actual sample rate; for example, 8000 Hz audio should use a model that supports 8000 Hz directly rather than being resampled to 16000 Hz.
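
As a worked example of the rule above, this hypothetical helper maps an input's native sample rate to a Paraformer model. The function name and threshold are illustrative only, not part of the SDK.

def paraformer_model_for(sample_rate_hz: int) -> str:
    """Pick the Paraformer real-time model matching the audio's native rate."""
    if sample_rate_hz <= 8000:
        # 8 kHz telephony audio: use the 8k model directly instead of upsampling
        return "paraformer-realtime-8k-v2"
    # paraformer-realtime-v2 accepts any sample rate
    return "paraformer-realtime-v2"

# e.g. paraformer_model_for(8000) -> "paraformer-realtime-8k-v2"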

API Reference

FAQ

1. What factors can affect recognition accuracy?

  1. Audio quality: recording devices and the environment affect how clear the speech is, and therefore recognition accuracy. High-quality audio input is a prerequisite for accurate recognition.

  2. Speaker characteristics: voices differ widely in pitch, speaking rate, accent, and dialect. These individual differences challenge recognition systems, especially for accents or dialects the model was not sufficiently trained on.

  3. Language and vocabulary: speech recognition models are usually trained for specific languages. Mixed-language speech, professional terminology, slang, and internet expressions increase recognition difficulty. If the model supports hotwords, you can use them to adjust the recognition results (see the sketch after this list).

  4. Context understanding: a lack of conversational context can lead to misinterpretation, especially where meaning is ambiguous or context-dependent.
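
Hotwords bias the recognizer toward domain-specific terms. The sketch below assumes a hotword vocabulary has already been created and its ID obtained (vocab-xxx is a placeholder); the vocabulary_id parameter follows the custom hotword feature for Paraformer v2-series models, but confirm the exact parameter name and the vocabulary-creation API in the custom hotword documentation.

from dashscope.audio.asr import Recognition

recognition = Recognition(
    model='paraformer-realtime-v2',
    format='wav',
    sample_rate=16000,
    vocabulary_id='vocab-xxx',  # placeholder ID of a pre-created hotword vocabulary
    callback=None,
)
result = recognition.call('asr_example.wav')  # hotwords bias output toward your terms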

2. What are the model rate limits?

Gummy:

| Model | RPS limit for job submission |
| --- | --- |
| gummy-realtime-v1 | 10 |
| gummy-chat-v1 | 10 |

Paraformer:

| Model | RPS limit for job submission |
| --- | --- |
| paraformer-realtime-v2 | 20 |
| paraformer-realtime-v1 | 20 |
| paraformer-realtime-8k-v2 | 20 |
| paraformer-realtime-8k-v1 | 20 |
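
If these RPS limits are exceeded, requests are rejected by the service. A generic client-side mitigation, sketched under the assumption that the SDK raises an exception when throttled, is to retry with exponential backoff:

import time
import random

def call_with_backoff(fn, max_retries=5):
    """Retry fn() with exponential backoff; a generic sketch for RPS throttling."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # ideally, catch the SDK's specific throttling error here
            if attempt == max_retries - 1:
                raise
            # back off 1 s, 2 s, 4 s, ... plus jitter to avoid synchronized retries
            time.sleep(2 ** attempt + random.random())

# usage: result = call_with_backoff(lambda: recognition.call('asr_example.wav'))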
