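Quick Start

Paraformer Speech Recognition

Note

Supported domain / task: audio / asr (automatic speech recognition)

The Paraformer speech recognition API is built on a new-generation non-autoregressive end-to-end model from Tongyi Lab. It recognizes speech both from real-time audio streams and from audio and video files in a variety of formats. Typical uses:

  • Real-time scenarios with strict latency requirements for recognition results, such as live meeting transcription, live-stream captioning, and telephone customer service.

  • Recognizing the speech in audio and video files for content understanding and analysis, subtitle generation, and similar tasks.

  • Recognizing call-center recordings for customer service quality inspection and similar tasks.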

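Quick Start

Prerequisites

Real-time speech recognition sample code

Real-time speech recognition transcribes an audio stream of unlimited duration as it arrives, so that text appears while the speaker is still talking. It has built-in intelligent sentence segmentation and reports the start and end time of each sentence. It suits live video captioning, real-time meeting minutes, real-time court transcripts, intelligent voice assistants, and similar scenarios.

For code examples covering more common scenarios, see GitHub.

Real-time speech recognition also supports recognizing local files through synchronous calls.

Streaming speech to on-screen text from a microphone

The following example uses the real-time speech recognition API to recognize speech streamed from a microphone and display the text as it is recognized.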

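Note
  • Replace your-dashscope-api-key in the examples with your own API-KEY before running the code.

  • Before running the Python example, install the third-party audio capture and playback package with the command pip install pyaudio.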
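Hard-coding the key is fine for a quick test. A common alternative is to read it from an environment variable so that it stays out of source control. The sketch below assumes a variable named DASHSCOPE_API_KEY; the helper load_api_key is hypothetical and not part of the SDK.

```python
import os


def load_api_key(env=None, var_name="DASHSCOPE_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is an assumed variable name and `load_api_key` is a
    hypothetical helper, not a DashScope SDK function.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    # Reject a missing key and the placeholder used in the samples.
    if not key or key == "your-dashscope-api-key":
        raise RuntimeError(f"Set {var_name} to a real API key first")
    return key
```

With this helper, the assignment in the samples could become dashscope.api_key = load_api_key().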

# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html

import pyaudio
import dashscope
from dashscope.audio.asr import (Recognition, RecognitionCallback,
                                 RecognitionResult)

dashscope.api_key='<your-dashscope-api-key>'

mic = None
stream = None

class Callback(RecognitionCallback):
    def on_open(self) -> None:
        global mic
        global stream
        print('RecognitionCallback open.')
        mic = pyaudio.PyAudio()
        stream = mic.open(format=pyaudio.paInt16,
                          channels=1,
                          rate=16000,
                          input=True)

    def on_close(self) -> None:
        global mic
        global stream
        print('RecognitionCallback close.')
        stream.stop_stream()
        stream.close()
        mic.terminate()
        stream = None
        mic = None

    def on_event(self, result: RecognitionResult) -> None:
        print('RecognitionCallback sentence: ', result.get_sentence())

callback = Callback()
recognition = Recognition(model='paraformer-realtime-v2',
                          format='pcm',
                          sample_rate=16000,
                          callback=callback)
recognition.start()

while True:
    if stream:
        data = stream.read(3200, exception_on_overflow = False)
        recognition.send_audio_frame(data)
    else:
        break

recognition.stop()

// Java version of the microphone streaming example above.
package org.example.recognition;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import java.nio.ByteBuffer;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

public class Main {

    public static void main(String[] args) throws NoApiKeyException {
        // Create a Flowable<ByteBuffer> that emits the recorded audio
        Flowable<ByteBuffer> audioSource =
                Flowable.create(
                        emitter -> {
                            new Thread(
                                    () -> {
                                        try {
                                            // Create the audio format: 16 kHz, 16-bit, mono, signed, little-endian
                                            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
                                            // Get the default recording device that matches this format
                                            TargetDataLine targetDataLine =
                                                    AudioSystem.getTargetDataLine(audioFormat);
                                            targetDataLine.open(audioFormat);
                                            // Start recording
                                            targetDataLine.start();
                                            ByteBuffer buffer = ByteBuffer.allocate(1024);
                                            long start = System.currentTimeMillis();
                                            // Record for 300 s (300000 ms) and transcribe in real time
                                            while (System.currentTimeMillis() - start < 300000) {
                                                int read = targetDataLine.read(buffer.array(), 0, buffer.capacity());
                                                if (read > 0) {
                                                    buffer.limit(read);
                                                    // Send the recorded audio to the streaming recognition service
                                                    emitter.onNext(buffer);
                                                    buffer = ByteBuffer.allocate(1024);
                                                    // The recording rate is limited; sleep briefly to avoid high CPU usage
                                                    Thread.sleep(20);
                                                }
                                            }
                                            // Signal that the audio stream has ended
                                            emitter.onComplete();
                                        } catch (Exception e) {
                                            emitter.onError(e);
                                        }
                                    })
                                    .start();
                        },
                        BackpressureStrategy.BUFFER);

        // Create the Recognition instance
        Recognition recognizer = new Recognition();
        // Create the RecognitionParam; the Flowable<ByteBuffer> created above is passed to streamCall below
        RecognitionParam param =
                RecognitionParam.builder()
                        .model("paraformer-realtime-v2")
                        .format("pcm")
                        .sampleRate(16000)
                        .apiKey("your-dashscope-api-key")
                        .build();

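        // Call the streaming interface
        recognizer
                .streamCall(param, audioSource)
                // Block and handle each result emitted by the returned Flowable
                .blockingForEach(
                        result -> {
                            // "Fix:" marks a finished sentence; "Result:" marks an intermediate result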
                            if (result.isSentenceEnd()) {
                                System.out.println("Fix:" + result.getSentence().getText());
                            } else {
                                System.out.println("Result:" + result.getSentence().getText());
                            }
                        });
        System.exit(0);
    }
}

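For more detailed examples, see "Using speech recognition to transcribe audio and video files and display real-time on-screen text".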
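Both samples send raw PCM in small chunks: the Python loop calls stream.read(3200, ...) and the Java loop fills 1024-byte buffers. The sketch below shows how chunk size relates to audio duration for the 16 kHz, 16-bit, mono format used in the samples. The function names are illustrative only. Note that PyAudio's read() takes a frame count rather than a byte count.

```python
def chunk_duration_ms(num_bytes, sample_rate=16000, sample_width_bytes=2, channels=1):
    """Duration in milliseconds of a raw PCM chunk of `num_bytes` bytes."""
    bytes_per_second = sample_rate * sample_width_bytes * channels
    return num_bytes * 1000 / bytes_per_second


def frames_to_bytes(num_frames, sample_width_bytes=2, channels=1):
    """Size in bytes of `num_frames` PCM frames (one frame = one sample per channel)."""
    return num_frames * sample_width_bytes * channels


# PyAudio's stream.read() counts frames, so read(3200) returns 6400 bytes here.
python_chunk_bytes = frames_to_bytes(3200)
print(python_chunk_bytes, chunk_duration_ms(python_chunk_bytes))  # 6400 200.0

# The Java sample reads up to 1024 bytes per call.
print(chunk_duration_ms(1024))  # 32.0
```

Smaller chunks lower the delay before audio reaches the service at the cost of more send calls; the values in the samples are one reasonable trade-off, not a requirement stated by this page.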

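Transcribing a file with the synchronous interface

The following example uses the synchronous speech recognition API to transcribe a file. Consider this interface for short, near-real-time recognition scenarios such as conversational chat, voice commands, voice input methods, and voice search.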

# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html

import requests
from http import HTTPStatus

import dashscope
from dashscope.audio.asr import Recognition

dashscope.api_key = '<your-dashscope-api-key>'

r = requests.get(
    'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav'
)
with open('asr_example.wav', 'wb') as f:
    f.write(r.content)

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
    # Write as UTF-8 so the Chinese text is saved correctly on any platform
    with open('asr_result.txt', 'w', encoding='utf-8') as f:
        for sentence in result.get_sentence():
            f.write(str(sentence) + '\n')
    print('Recognition done!')
else:
    print('Error: ', result.message)

// Java version of the synchronous file transcription example above.
package com.alibaba.dashscope.sample.recognition.quickstart;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Main {

  public static void main(String[] args) {
    // You can skip the URL download and call the API directly on a local file
    String exampleWavUrl =
        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav";
    try {
      InputStream in = new URL(exampleWavUrl).openStream();
      Files.copy(in, Paths.get("asr_example.wav"), StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
      System.out.println("error: " + e);
      System.exit(1);
    }

    // Create the Recognition instance
    Recognition recognizer = new Recognition();
    // Create the RecognitionParam; replace the placeholder with your real API key
    RecognitionParam param =
        RecognitionParam.builder()
            .model("paraformer-realtime-v2")
            .format("wav")
            .sampleRate(16000)
            .apiKey("your-dashscope-api-key")
            .build();
    // Save the result to asr_result.txt
    try (FileOutputStream fos = new FileOutputStream("asr_result.txt")) {
      String result = recognizer.call(param, new File("asr_example.wav"));
      System.out.println(result);
      fos.write(result.getBytes());
    } catch (Exception e) {
      e.printStackTrace();
    }
    System.exit(0);
  }
}

A successful call returns a real-time recognition result like the following example:

{
	"begin_time": 280,
	"end_time": 4000,
	"text": "hello word, 这里是阿里巴巴语音实验室。",
	"words": [{
		"begin_time": 280,
		"end_time": 776,
		"text": "hello ",
		"punctuation": ""
	}, {
		"begin_time": 776,
		"end_time": 1024,
		"text": "word",
		"punctuation": ", "
	}, {
		"begin_time": 1024,
		"end_time": 1520,
		"text": "这里",
		"punctuation": ""
	}, {
		"begin_time": 1520,
		"end_time": 1768,
		"text": "是",
		"punctuation": ""
	}, {
		"begin_time": 1768,
		"end_time": 2760,
		"text": "阿里巴巴",
		"punctuation": ""
	}, {
		"begin_time": 2760,
		"end_time": 3256,
		"text": "语音",
		"punctuation": ""
	}, {
		"begin_time": 3256,
		"end_time": 4000,
		"text": "实验室",
		"punctuation": "。"
	}]
}
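Each sentence result carries per-word timing. A minimal sketch, assuming a dictionary with the shape shown above, rebuilds the sentence text from the words array and computes how long each word lasts. The function names are illustrative, and the time unit is assumed to be milliseconds because this sample does not state it.

```python
def join_words(sentence):
    """Rebuild the sentence text from its `words` list (text + punctuation)."""
    return "".join(w["text"] + w["punctuation"] for w in sentence["words"])


def word_durations_ms(sentence):
    """Return (text, duration) pairs; the unit is assumed to be milliseconds."""
    return [(w["text"], w["end_time"] - w["begin_time"]) for w in sentence["words"]]


# Shortened, hand-edited copy of the sample result above (first two words only).
sentence = {
    "begin_time": 280,
    "end_time": 1024,
    "text": "hello word, ",
    "words": [
        {"begin_time": 280, "end_time": 776, "text": "hello ", "punctuation": ""},
        {"begin_time": 776, "end_time": 1024, "text": "word", "punctuation": ", "},
    ],
}

print(repr(join_words(sentence)))   # 'hello word, '
print(word_durations_ms(sentence))  # [('hello ', 496), ('word', 248)]
```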

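Transcribing a multilingual file with the synchronous interface

The following example uses the synchronous speech recognition API to transcribe a Japanese audio file. Consider this interface for short, near-real-time recognition scenarios such as conversational chat, voice commands, voice input methods, and voice search.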

# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html

import requests
from http import HTTPStatus

import dashscope
from dashscope.audio.asr import Recognition

dashscope.api_key = '<your-dashscope-api-key>'

r = requests.get(
    'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav'
)
with open('asr_japanese_example.wav', 'wb') as f:
    f.write(r.content)

recognition = Recognition(model='paraformer-realtime-v2',
                          format='wav',
                          sample_rate=16000,
                          # "language_hints" is supported only by the paraformer-v2 and paraformer-realtime-v2 models
                          language_hints=['ja'],
                          callback=None)
result = recognition.call('asr_japanese_example.wav')
if result.status_code == HTTPStatus.OK:
    # Write as UTF-8 so the Japanese text is saved correctly on any platform
    with open('asr_japanese_result.txt', 'w', encoding='utf-8') as f:
        for sentence in result.get_sentence():
            f.write(str(sentence) + '\n')
    print('Recognition done!')
else:
    print('Error: ', result.message)

// Java version of the Japanese file transcription example above.
package com.alibaba.dashscope.sample.recognition.quickstart;

import com.alibaba.dashscope.audio.asr.recognition.Recognition;
import com.alibaba.dashscope.audio.asr.recognition.RecognitionParam;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class Main {

  public static void main(String[] args) {
    // You can skip the URL download and call the API directly on a local file
    String exampleWavUrl =
        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/welcome_female_16k_mono_japanese.wav";
    try {
      InputStream in = new URL(exampleWavUrl).openStream();
      Files.copy(in, Paths.get("asr_japanese_example.wav"), StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
      System.out.println("error: " + e);
      System.exit(1);
    }

    // Create the Recognition instance
    Recognition recognizer = new Recognition();
    // Create the RecognitionParam; replace the placeholder with your real API key
    RecognitionParam param =
        RecognitionParam.builder()
            .model("paraformer-realtime-v2")
            .format("wav")
            .sampleRate(16000)
            .apiKey("your-dashscope-api-key")
            // "language_hints" is supported only by the paraformer-v2 and paraformer-realtime-v2 models
            .parameter("language_hints", new String[]{"ja"})
            .build();
    // Save the result to asr_japanese_result.txt
    try (FileOutputStream fos = new FileOutputStream("asr_japanese_result.txt")) {
      String result = recognizer.call(param, new File("asr_japanese_example.wav"));
      System.out.println(result);
      fos.write(result.getBytes());
    } catch (Exception e) {
      e.printStackTrace();
    }
    System.exit(0);
  }
}

A successful call returns a real-time recognition result like the following example:

{
    "begin_time": 220,
    "end_time": 4280,
    "text": "アリババクラウドボイスサービスへようこそ。",
    "words": [
        {
            "begin_time": 220,
            "end_time": 626,
            "text": "アリ",
            "punctuation": ""
        },
        {
            "begin_time": 626,
            "end_time": 1032,
            "text": "ババ",
            "punctuation": ""
        },
        {
            "begin_time": 1032,
            "end_time": 1438,
            "text": "クラ",
            "punctuation": ""
        },
        {
            "begin_time": 1438,
            "end_time": 1844,
            "text": "ウド",
            "punctuation": ""
        },
        {
            "begin_time": 1844,
            "end_time": 2250,
            "text": "ボイ",
            "punctuation": ""
        },
        {
            "begin_time": 2250,
            "end_time": 2656,
            "text": "スサ",
            "punctuation": ""
        },
        {
            "begin_time": 2656,
            "end_time": 3062,
            "text": "ービ",
            "punctuation": ""
        },
        {
            "begin_time": 3062,
            "end_time": 3468,
            "text": "スへ",
            "punctuation": ""
        },
        {
            "begin_time": 3468,
            "end_time": 3874,
            "text": "よう",
            "punctuation": ""
        },
        {
            "begin_time": 3874,
            "end_time": 4280,
            "text": "こそ",
            "punctuation": "。"
        }
    ]
}
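Subtitle generation is one of the uses this page names for the service, so the sketch below turns a sentence result into an SRT-style cue. It assumes begin_time and end_time are in milliseconds. The function names are illustrative and not part of the SDK.

```python
def to_srt_time(ms):
    """Format a millisecond offset as an SRT timestamp HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt_cue(index, sentence):
    """Build one SRT cue (index, time range, text) from a sentence result."""
    start = to_srt_time(sentence["begin_time"])
    end = to_srt_time(sentence["end_time"])
    return f"{index}\n{start} --> {end}\n{sentence['text']}\n"


# Sentence-level fields taken from the Japanese sample above.
sentence = {"begin_time": 220, "end_time": 4280,
            "text": "アリババクラウドボイスサービスへようこそ。"}
print(to_srt_cue(1, sentence))
```

For a longer recording, one cue per finished sentence, numbered from 1, would give a basic subtitle file.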

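Recorded file recognition sample code

The following example calls the asynchronous Paraformer file transcription API to run speech recognition as a batch over several audio files given by URL.

Recorded file recognition does not currently support local files.

Note

Replace your-dashscope-api-key in the example with your own API-KEY before running the code.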

# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html

import json
from urllib import request
from http import HTTPStatus

import dashscope

dashscope.api_key = 'your-dashscope-api-key'

task_response = dashscope.audio.asr.Transcription.async_call(
    model='paraformer-v2',
    file_urls=[
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav',
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav'
    ],
    # "language_hints" is supported only by the paraformer-v2 and paraformer-realtime-v2 models
    language_hints=['zh', 'en'])

transcription_response = dashscope.audio.asr.Transcription.wait(
    task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        url = transcription['transcription_url']
        result = json.loads(request.urlopen(url).read().decode('utf8'))
        print(json.dumps(result, indent=4, ensure_ascii=False))
    print('transcription done!')
else:
    print('Error: ', transcription_response.output.message)

// Java version of the asynchronous file transcription example above.
package com.alibaba.dashscope.sample.transcription;

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.*;
import java.net.HttpURLConnection;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // Create the transcription request; replace your-dashscope-api-key with your real API key
        TranscriptionParam param =
                TranscriptionParam.builder()
                        .apiKey("your-dashscope-api-key")
                        .model("paraformer-v2")
                        // "language_hints" is supported only by the paraformer-v2 and paraformer-realtime-v2 models
                        .parameter("language_hints", new String[]{"zh", "en"})
                        .fileUrls(
                            Arrays.asList(
                                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
                                "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav"))
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request
            TranscriptionResult result = transcription.asyncCall(param);
            // Wait for the transcription to finish
            result = transcription.wait(
                            TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Fetch the transcription results
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && taskResultList.size() > 0) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                  String transcriptionUrl = taskResult.getTranscriptionUrl();
                  HttpURLConnection connection =
                          (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                  connection.setRequestMethod("GET");
                  connection.connect();
                  BufferedReader reader =
                          new BufferedReader(new InputStreamReader(connection.getInputStream()));
                  Gson gson = new GsonBuilder().setPrettyPrinting().create();
                  System.out.println(gson.toJson(gson.fromJson(reader, JsonObject.class)));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}
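
Note
  • A file specified by URL for transcription must not exceed 2 GB.

  • The file_urls parameter accepts multiple file URLs; the example transcribes several files in a single request.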
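The size limit can be checked on the client before a batch is submitted. The sketch below assumes that file sizes are already known (for example from a Content-Length header) and that "2 GB" means 2 * 1024**3 bytes, which this page does not specify. The function names and URLs are hypothetical.

```python
def within_size_limit(content_length, limit_bytes=2 * 1024**3):
    """True if a file of `content_length` bytes fits the 2 GB limit.

    Assumes "2 GB" means 2 * 1024**3 bytes; the page does not say
    whether the limit is binary or decimal.
    """
    return 0 <= content_length <= limit_bytes


def oversized(files):
    """Return the URLs whose known size exceeds the limit.

    `files` maps URL -> size in bytes.
    """
    return [url for url, size in files.items() if not within_size_limit(size)]


# Hypothetical URLs and sizes, for illustration only.
sizes = {
    "https://example.com/a.wav": 3_834_000,
    "https://example.com/b.wav": 3 * 1024**3,
}
print(oversized(sizes))  # ['https://example.com/b.wav']
```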

A successful call returns file transcription results like the following example.

{
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_male2.wav",
    "properties": {
        "audio_format": "pcm_s16le",
        "channels": [
            0
        ],
        "original_sampling_rate": 16000,
        "original_duration_in_milliseconds": 4726
    },
    "transcripts": [
        {
            "channel_id": 0,
            "content_duration_in_milliseconds": 4570,
            "text": "Hello world, 这里是阿里巴巴语音实验室。",
            "sentences": [
                {
                    "begin_time": 140,
                    "end_time": 4710,
                    "text": "Hello world, 这里是阿里巴巴语音实验室。",
                    "words": [
                        {
                            "begin_time": 140,
                            "end_time": 597,
                            "text": "Hello ",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 597,
                            "end_time": 1054,
                            "text": "world",
                            "punctuation": ", "
                        },
                        {
                            "begin_time": 1054,
                            "end_time": 1663,
                            "text": "这里",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 1663,
                            "end_time": 2272,
                            "text": "是阿",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 2272,
                            "end_time": 2881,
                            "text": "里巴",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 2881,
                            "end_time": 3490,
                            "text": "巴语",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 3490,
                            "end_time": 4099,
                            "text": "音实",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 4099,
                            "end_time": 4710,
                            "text": "验室",
                            "punctuation": "。"
                        }
                    ]
                }
            ]
        }
    ]
}
{
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "properties": {
        "audio_format": "pcm_s16le",
        "channels": [
            0
        ],
        "original_sampling_rate": 16000,
        "original_duration_in_milliseconds": 3834
    },
    "transcripts": [
        {
            "channel_id": 0,
            "content_duration_in_milliseconds": 3530,
            "text": "Hello world, 这里是阿里巴巴语音实验室。",
            "sentences": [
                {
                    "begin_time": 280,
                    "end_time": 3810,
                    "text": "Hello world, 这里是阿里巴巴语音实验室。",
                    "words": [
                        {
                            "begin_time": 280,
                            "end_time": 633,
                            "text": "Hello ",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 633,
                            "end_time": 986,
                            "text": "world",
                            "punctuation": ", "
                        },
                        {
                            "begin_time": 986,
                            "end_time": 1456,
                            "text": "这里",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 1456,
                            "end_time": 1926,
                            "text": "是阿",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 1926,
                            "end_time": 2396,
                            "text": "里巴",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 2396,
                            "end_time": 2866,
                            "text": "巴语",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 2866,
                            "end_time": 3336,
                            "text": "音实",
                            "punctuation": ""
                        },
                        {
                            "begin_time": 3336,
                            "end_time": 3810,
                            "text": "验室",
                            "punctuation": "。"
                        }
                    ]
                }
            ]
        }
    ]
}
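Each result file nests sentences under transcripts, one transcript per channel. A minimal sketch, assuming the structure shown above, flattens a result into one row per sentence. The function name is illustrative. In both samples, content_duration_in_milliseconds equals end_time minus begin_time of the single sentence (3810 - 280 = 3530), which suggests the sentence times are in milliseconds.

```python
def iter_sentences(result):
    """Yield (channel_id, begin_time, end_time, text) for every sentence."""
    for transcript in result.get("transcripts", []):
        for sentence in transcript.get("sentences", []):
            yield (transcript["channel_id"], sentence["begin_time"],
                   sentence["end_time"], sentence["text"])


# Trimmed copy of the second sample result above (properties and words omitted).
result = {
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/paraformer/hello_world_female2.wav",
    "transcripts": [
        {
            "channel_id": 0,
            "text": "Hello world, 这里是阿里巴巴语音实验室。",
            "sentences": [
                {"begin_time": 280, "end_time": 3810,
                 "text": "Hello world, 这里是阿里巴巴语音实验室。"},
            ],
        }
    ],
}

for channel, begin, end, text in iter_sentences(result):
    print(f"[ch{channel}] {begin}-{end} ms: {text}")
```

In the Python sample, the dictionary returned by json.loads for each transcription_url could be passed to this function in place of the trimmed copy.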

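Learn more

For detailed usage of the real-time speech recognition API and the recorded file transcription API of the Paraformer speech recognition service, see the Real-Time Speech Recognition API Details and Recorded File Recognition API Details pages.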