Quick Start

SenseVoice Large Speech Recognition Model

Note

Supported domain/tasks: audio / asr (speech recognition), SER (speech emotion recognition), AED (audio event detection)

Model Introduction

The SenseVoice large speech recognition model focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection. It supports recognition of more than 50 languages, outperforms the Whisper model overall, and improves recognition accuracy for Chinese and Cantonese by a relative margin of more than 50%.

  • High-accuracy multilingual speech recognition: SenseVoice supports speech recognition for 50+ languages, including Chinese (zh), English (en), Cantonese (yue), Japanese (ja), Korean (ko), French (fr), German (de), Russian (ru), Italian (it), Spanish (es), Thai (th), Indonesian (id), and more.

  • Emotion and audio event detection: SenseVoice provides emotion recognition (for example happy, sad, angry) and can detect specific events in the audio, such as background music, singing, applause, and laughter.

Multilingual Recognition

Speech recognition is supported for 50+ languages in total, with Chinese, English, Japanese, Korean, and Cantonese as the primary languages. You can pass the language_hints parameter to specify the language and obtain more accurate results; see the appendix "Supported Languages" for the full list.
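As a minimal sketch, the language hint is passed as a list when the transcription task is submitted. The file URL below is a placeholder, and 'yue' is used here only as an example of hinting Cantonese:

import dashscope

dashscope.api_key = 'your-dashscope-api-key'  # replace with your own API-KEY

# Sketch: hint that the audio is Cantonese ('yue').
# Other codes include 'zh', 'en', 'ja', 'ko'; see the supported-language appendix.
task = dashscope.audio.asr.Transcription.async_call(
    model='sensevoice-v1',
    file_urls=['https://example.com/your-audio.wav'],  # placeholder URL
    language_hints=['yue'],
)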

Emotion Recognition

Four emotions are supported: angry (ANGRY), happy (HAPPY), sad (SAD), and neutral (NEUTRAL). If none of these emotions appears in the recognition result, or the result contains <|SPECIAL_TOKEN_1|>, no specific emotion was detected in the audio. The emotion tag usually appears at the very end of the recognized text, in a form such as 今天天气好棒啊!<|HAPPY|>.
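For illustration (this helper is not part of the SDK), a short regular expression is enough to pull the trailing emotion tag out of a recognized sentence and strip it from the text:

import re

# Illustrative sketch: split a SenseVoice sentence into plain text and an emotion label.
text = '今天天气好棒啊!<|HAPPY|>'
match = re.search(r'<\|(NEUTRAL|HAPPY|ANGRY|SAD)\|>', text)
emotion = match.group(1) if match else None             # 'HAPPY', or None if no emotion tag
plain_text = re.sub(r'<\|[^|]+\|>', '', text).strip()   # '今天天气好棒啊!'
print(emotion, plain_text)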

Audio Event Detection

Four common audio events are supported: applause (Applause), background music (BGM), laughter (Laughter), and speech (Speech). Event markers come in two kinds, a start tag and an end tag. Taking applause as an example, suppose the recognition result is <|Applause|>今天<|/Applause|>天气好棒啊!. Here <|Applause|> and <|/Applause|> mark the start and end of the applause event, so the result means that applause was detected while the word "今天" was being spoken.
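For illustration, the start and end markers can be located with a small regular expression (this snippet is a sketch, not part of the SDK):

import re

# Illustrative sketch: list event markers found in a SenseVoice result string.
text = '<|Applause|>今天<|/Applause|>天气好棒啊!'
events = []
for m in re.finditer(r'<\|(/?)(Speech|Applause|BGM|Laughter)\|>', text):
    kind = 'end' if m.group(1) else 'start'
    events.append((m.group(2), kind, m.start()))
print(events)  # [('Applause', 'start', 0), ('Applause', 'end', 14)]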

Sample Code

Prerequisites

Asynchronous File Transcription Sample Code

The following example calls the SenseVoice asynchronous file-transcription API to batch-transcribe audio files specified by URL. By default the model recognizes Chinese and English; as shown in the example, you can pass the language_hints parameter to specify a particular language. For more code examples covering common scenarios, see the Github repository.

Note

Replace your-dashscope-api-key in the example with your own API-KEY; otherwise the code will not run.

  • Each audio file specified by URL for transcription must not exceed 2 GB in size.

  • The file_urls parameter accepts multiple file URLs, so a single task can transcribe several files in one batch; a minimal multi-file sketch follows this list.
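The sketch below shows the same async_call with two entries in file_urls; the second URL is a hypothetical placeholder for a file of your own:

import dashscope

dashscope.api_key = 'your-dashscope-api-key'  # replace with your own API-KEY

# Sketch only: batch two files in one transcription task.
# The second URL is a placeholder; substitute a real, publicly accessible file.
task = dashscope.audio.asr.Transcription.async_call(
    model='sensevoice-v1',
    file_urls=[
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav',
        'https://example.com/your-second-audio.wav',
    ],
    language_hints=['en'],
)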

Python

# For prerequisites running the following sample, visit https://help.aliyun.com/document_detail/611472.html

import re
import json
from urllib import request
from http import HTTPStatus

import dashscope

# Replace your-dashscope-api-key with your own API-KEY
dashscope.api_key = 'your-dashscope-api-key'


def parse_sensevoice_result(data, keep_trans=True, keep_emotions=True, keep_events=True):
    '''
    Parse a SenseVoice recognition result.
    keep_trans: whether to keep the transcribed text (default True)
    keep_emotions: whether to keep emotion labels (default True)
    keep_events: whether to keep event labels (default True)
    '''
    # Tags to keep as structured fields
    emotion_list = ['NEUTRAL', 'HAPPY', 'ANGRY', 'SAD']
    event_list = ['Speech', 'Applause', 'BGM', 'Laughter']

    # All supported tags (stripped from the plain text)
    all_tags = ['Speech', 'Applause', 'BGM', 'Laughter',
                'NEUTRAL', 'HAPPY', 'ANGRY', 'SAD', 'SPECIAL_TOKEN_1']
    tags_to_cleanup = []
    for tag in all_tags:
        tags_to_cleanup.append(f'<|{tag}|> ')
        tags_to_cleanup.append(f'<|/{tag}|>')
        tags_to_cleanup.append(f'<|{tag}|>')

    def get_clean_text(text: str):
        for tag in tags_to_cleanup:
            text = text.replace(tag, '')
        pattern = r"\s{2,}"
        text = re.sub(pattern, " ", text).strip()
        return text

    for item in data['transcripts']:
        for sentence in item['sentences']:
            if keep_emotions:
                # Extract emotion tags
                emotions_pattern = r'<\|(' + '|'.join(emotion_list) + r')\|>'
                emotions = re.findall(emotions_pattern, sentence['text'])
                sentence['emotion'] = list(set(emotions))
                if not sentence['emotion']:
                    sentence.pop('emotion', None)

            if keep_events:
                # Extract event tags
                events_pattern = r'<\|(' + '|'.join(event_list) + r')\|>'
                events = re.findall(events_pattern, sentence['text'])
                sentence['event'] = list(set(events))
                if not sentence['event']:
                    sentence.pop('event', None)

            if keep_trans:
                # Keep only the plain text
                sentence['text'] = get_clean_text(sentence['text'])
            else:
                sentence.pop('text', None)

        if keep_trans:
            item['text'] = get_clean_text(item['text'])
        else:
            item.pop('text', None)
        item['sentences'] = list(filter(lambda x: 'text' in x or 'emotion' in x or 'event' in x, item['sentences']))
    return data


task_response = dashscope.audio.asr.Transcription.async_call(
    model='sensevoice-v1',
    file_urls=[
        'https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav'
    ],
    language_hints=['en'],
)

transcription_response = dashscope.audio.asr.Transcription.wait(
    task=task_response.output.task_id)

if transcription_response.status_code == HTTPStatus.OK:
    for transcription in transcription_response.output['results']:
        url = transcription['transcription_url']
        result = json.loads(request.urlopen(url).read().decode('utf8'))
        # Parse the raw result, keeping the transcript text, emotion, and event labels
        print(json.dumps(parse_sensevoice_result(result), indent=4, ensure_ascii=False))
    print('transcription done!')
else:
    print('Error: ', transcription_response.output.message)
Java

package org.example.recognition;

import com.alibaba.dashscope.audio.asr.transcription.*;
import com.google.gson.*;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SenseVoiceParser {

    private static final List<String> EMOTION_LIST = Arrays.asList("NEUTRAL", "HAPPY", "ANGRY", "SAD");
    private static final List<String> EVENT_LIST = Arrays.asList("Speech", "Applause", "BGM", "Laughter");
    private static final List<String> ALL_TAGS = Arrays.asList(
            "Speech", "Applause", "BGM", "Laughter", "NEUTRAL", "HAPPY", "ANGRY", "SAD", "SPECIAL_TOKEN_1");

    /**
     * Parse a SenseVoice recognition result.
     * @param data SenseVoice transcription result in JSON format
     * @param keepTrans whether to keep the transcribed text
     * @param keepEmotions whether to keep emotion labels
     * @param keepEvents whether to keep event labels
     * @return the parsed result
     */
    public static JsonObject parseSenseVoiceResult(JsonObject data, boolean keepTrans, boolean keepEmotions, boolean keepEvents) {

        List<String> tagsToCleanup = ALL_TAGS.stream()
                .flatMap(tag -> Stream.of("<|" + tag + "|> ", "<|/" + tag + "|>", "<|" + tag + "|>"))
                .collect(Collectors.toList());

        JsonArray transcripts = data.getAsJsonArray("transcripts");

        for (JsonElement transcriptElement : transcripts) {
            JsonObject transcript = transcriptElement.getAsJsonObject();
            JsonArray sentences = transcript.getAsJsonArray("sentences");

            for (JsonElement sentenceElement : sentences) {
                JsonObject sentence = sentenceElement.getAsJsonObject();
                String text = sentence.get("text").getAsString();

                if (keepEmotions) {
                    extractTags(sentence, text, EMOTION_LIST, "emotion");
                }

                if (keepEvents) {
                    extractTags(sentence, text, EVENT_LIST, "event");
                }

                if (keepTrans) {
                    String cleanText = getCleanText(text, tagsToCleanup);
                    sentence.addProperty("text", cleanText);
                } else {
                    sentence.remove("text");
                }
            }

            if (keepTrans) {
                transcript.addProperty("text", getCleanText(transcript.get("text").getAsString(), tagsToCleanup));
            } else {
                transcript.remove("text");
            }

            JsonArray filteredSentences = new JsonArray();
            for (JsonElement sentenceElement : sentences) {
                JsonObject sentence = sentenceElement.getAsJsonObject();
                if (sentence.has("text") || sentence.has("emotion") || sentence.has("event")) {
                    filteredSentences.add(sentence);
                }
            }
            transcript.add("sentences", filteredSentences);
        }
        return data;
    }

    private static void extractTags(JsonObject sentence, String text, List<String> tagList, String key) {
        String pattern = "<\\|(" + String.join("|", tagList) + ")\\|>";
        Pattern compiledPattern = Pattern.compile(pattern);
        Matcher matcher = compiledPattern.matcher(text);
        Set<String> tags = new HashSet<>();

        while (matcher.find()) {
            tags.add(matcher.group(1));
        }

        if (!tags.isEmpty()) {
            JsonArray tagArray = new JsonArray();
            tags.forEach(tagArray::add);
            sentence.add(key, tagArray);
        } else {
            sentence.remove(key);
        }
    }

    private static String getCleanText(String text, List<String> tagsToCleanup) {
        for (String tag : tagsToCleanup) {
            text = text.replace(tag, "");
        }
        return text.replaceAll("\\s{2,}", " ").trim();
    }
}

public class Main {
    public static void main(String[] args) {
        // Build the transcription request parameters; replace your-dashscope-api-key with a real API key
        TranscriptionParam param =
                TranscriptionParam.builder()
                        // Replace your-dashscope-api-key with your own API-KEY
                        .apiKey("your-dashscope-api-key")
                        .model("sensevoice-v1")
                        .fileUrls(
                                Arrays.asList(
                                        "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav"))
                        .parameter("language_hints", new String[] {"en"})
                        .build();
        try {
            Transcription transcription = new Transcription();
            // Submit the transcription request
            TranscriptionResult result = transcription.asyncCall(param);
            // Wait for the transcription to complete
            result = transcription.wait(
                    TranscriptionQueryParam.FromTranscriptionParam(param, result.getTaskId()));
            // Retrieve the transcription results
            List<TranscriptionTaskResult> taskResultList = result.getResults();
            if (taskResultList != null && taskResultList.size() > 0) {
                for (TranscriptionTaskResult taskResult : taskResultList) {
                    String transcriptionUrl = taskResult.getTranscriptionUrl();
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(transcriptionUrl).openConnection();
                    connection.setRequestMethod("GET");
                    connection.connect();
                    BufferedReader reader =
                            new BufferedReader(new InputStreamReader(connection.getInputStream()));
                    Gson gson = new GsonBuilder().setPrettyPrinting().create();
                    JsonElement jsonResult = gson.fromJson(reader, JsonObject.class);
                    System.out.println(gson.toJson(jsonResult));
                    System.out.println(gson.toJson(SenseVoiceParser.parseSenseVoiceResult(jsonResult.getAsJsonObject(), true, true, true)));
                }
            }
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
        System.exit(0);
    }
}

After the call succeeds, a file transcription result like the following example is returned.

{
    "file_url": "https://dashscope.oss-cn-beijing.aliyuncs.com/samples/audio/sensevoice/rich_text_example_1.wav",
    "properties": {
        "audio_format": "pcm_s16le",
        "channels": [
            0
        ],
        "original_sampling_rate": 16000,
        "original_duration_in_milliseconds": 17645
    },
    "transcripts": [
        {
            "channel_id": 0,
            "content_duration_in_milliseconds": 13240,
            "text": "Senior staff, principal doris jackson, wakefield faculty, and of course, my fellow classmates. I am honored to have been chosen to speak before my classmates, as well as the students across America today.",
            "sentences": [
                {
                    "begin_time": 0,
                    "end_time": 7480,
                    "text": "Senior staff, principal doris jackson, wakefield faculty, and of course, my fellow classmates.",
                    "emotion": [
                        "NEUTRAL"
                    ],
                    "event": [
                        "Speech"
                    ]
                },
                {
                    "begin_time": 11880,
                    "end_time": 17640,
                    "text": "I am honored to have been chosen to speak before my classmates, as well as the students across America today.",
                    "event": [
                        "Speech",
                        "Applause"
                    ]
                }
            ]
        }
    ]
}
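As a follow-up sketch, the loop below walks the structure above and prints per-sentence timing together with any emotion and event labels. It assumes result holds the parsed JSON, as in the Python example:

# Sketch: iterate over the parsed transcription JSON shown above.
# `result` is the dict loaded from transcription_url in the Python example.
for transcript in result['transcripts']:
    print('channel:', transcript['channel_id'])
    for sentence in transcript['sentences']:
        span = f"{sentence['begin_time']}-{sentence['end_time']} ms"
        print(span, sentence.get('text', ''),
              sentence.get('emotion', []), sentence.get('event', []))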

Learn More

For details on calling the SenseVoice speech recognition service for recording-file transcription, see the Recording File Recognition API reference page.