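Sambert Speech Synthesis API Reference - Alibaba Cloud Help Center

Speech Synthesis

Note

Supported domain / task: audio / tts (text-to-speech).

The speech synthesis service provides a real-time speech synthesis API that converts text into audio. In addition to the audio data, you can optionally enable character-level and phoneme-level timestamps, which are useful for generating subtitles or driving the lip movements of a digital human.

Choose a model that suits your scenario, such as customer service, live streaming, dialects, or children's voices. For details, see the model list. The sample rate matters as well: 8 kHz is generally recommended for customer service scenarios, and 16 kHz, 24 kHz, or 48 kHz for other scenarios. A higher sample rate produces fuller audio that sounds better.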
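The sample rate guidance above can be captured in a small lookup. This is a minimal sketch: the scenario names and the helper recommended_sample_rates are hypothetical and are not part of the DashScope SDK.

```python
from typing import List

# Hypothetical scenario key; only the 8 kHz recommendation for customer
# service comes from the guidance above.
SCENARIO_SAMPLE_RATES = {
    "customer_service": [8000],
}

# Rates suggested for every other scenario, in Hz.
DEFAULT_SAMPLE_RATES = [16000, 24000, 48000]


def recommended_sample_rates(scenario: str) -> List[int]:
    """Return the sample rates (Hz) suggested for a scenario name."""
    return SCENARIO_SAMPLE_RATES.get(scenario, DEFAULT_SAMPLE_RATES)
```

The model list remains the authority on which sample rate a given model uses by default.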

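Prerequisites

The service is activated and you have obtained an API-KEY: see Obtaining and configuring an API-KEY.
The latest SDK is installed: see Installing the DashScope SDK.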
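Instead of hard-coding the API-KEY in source files, it can be read from the environment. The sketch below assumes the key is stored in an environment variable named DASHSCOPE_API_KEY; both that name and the helper load_api_key are assumptions made for illustration.

```python
import os


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Read the API-KEY from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set the %s environment variable first" % env_var)
    return key
```

With such a helper, the examples on this page could use dashscope.api_key = load_api_key() in place of the literal placeholder.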

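Synchronous call

Submit a single speech synthesis task without a callback function. No intermediate results are streamed; the complete result is returned in one piece when synthesis finishes.

The following example uses the synchronous interface to call the voice model Zhichu (sambert-zhichu-v1), synthesizes the text “今天天气怎么样” ("What's the weather like today") into WAV audio at a 48 kHz sample rate, and saves it to a file named output.wav.

Note

Replace your-dashscope-api-key in the example with your own API-KEY so that the code runs correctly.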

Python

# coding=utf-8

import dashscope
from dashscope.audio.tts import SpeechSynthesizer

dashscope.api_key = 'your-dashscope-api-key'

result = SpeechSynthesizer.call(model='sambert-zhichu-v1',
                                text='今天天气怎么样',
                                sample_rate=48000,
                                format='wav')

if result.get_audio_data() is not None:
    with open('output.wav', 'wb') as f:
        f.write(result.get_audio_data())
print(' get response: %s' % (result.get_response()))

Java

package com.alibaba.dashscope.sample;

import com.alibaba.dashscope.audio.tts.SpeechSynthesizer;
import com.alibaba.dashscope.audio.tts.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.tts.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.common.Status;

import java.io.*;
import java.nio.ByteBuffer;

public class Main {

    public static void SyncAudioDataToFile() {
        SpeechSynthesizer synthesizer = new SpeechSynthesizer();
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
          .model("sambert-zhichu-v1")
          .text("今天天气怎么样")
          .sampleRate(48000)
          .format(SpeechSynthesisAudioFormat.WAV)
          .apiKey("your-dashscope-api-key")
          .build();

        File file = new File("output.wav");
        // Call the call method with param to obtain the synthesized audio
        ByteBuffer audio = synthesizer.call(param);
        try (FileOutputStream fos = new FileOutputStream(file)) {
            fos.write(audio.array());
            System.out.println("synthesis done!");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        SyncAudioDataToFile();
        System.exit(0);
    }
}

Important

The synchronous interface blocks the current thread until synthesis completes or an error occurs.
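Because the synchronous interface blocks, one common pattern is to run each call on a worker thread. The following is a minimal sketch using the standard library: blocking_synthesize is a hypothetical stand-in that only simulates a blocking call and returns fake bytes, so the example runs without the SDK or network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def blocking_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a blocking synthesis call; returns fake audio."""
    time.sleep(0.05)  # simulate the time spent waiting for the service
    return text.encode("utf-8")


def synthesize_all(texts: List[str], max_workers: int = 4) -> List[bytes]:
    """Run one blocking call per text on worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(blocking_synthesize, texts))
```

In real code the stand-in would be replaced by a function that wraps the synchronous call shown above; whether the service allows concurrent requests for your account is not covered here.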

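Request parameters

Parameter	Type	Default	Description
model	string	-	Name of the voice model used for speech synthesis. For the full list, see the model list.
text	string	-	Text to synthesize. It must be UTF-8 encoded and non-empty. A single request can synthesize at most 10,000 characters, where each Chinese character, English letter, and punctuation mark counts as 1 character. SSML is supported; for usage, see Introduction to the SSML markup language.
format	string	WAV	Encoding format of the returned audio. PCM, WAV, and MP3 are supported.
sample_rate	int	16000	Sample rate of the returned audio. The model's default sample rate is recommended (see the model list). If the values do not match, the service resamples up or down as needed.
volume	int	50	Volume of the returned audio, in the range 0 to 100.
rate	double	1.0	Speech rate of the returned audio, in the range 0.5 to 2. 0.5: half the default rate. (The default rate is the rate at which the model synthesizes speech by default; it varies slightly between voices and is roughly 4 characters per second.) 1: the default rate. 2: twice the default rate.
pitch	double	1.0	Pitch of the returned audio, in the range 0.5 to 2.
word_timestamp_enabled	bool	false	Whether to enable character-level timestamps.
phoneme_timestamp_enabled	bool	false	Whether to also return phoneme-level timestamps, on top of character-level timestamps being enabled.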
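The ranges in the table above can be checked on the client before a request is sent. This is a minimal sketch: validate_request is a hypothetical helper, not an SDK function, and it counts characters with len(), which only approximates the service's counting rule (for example, how SSML tags are counted is not specified here).

```python
from typing import List

SUPPORTED_FORMATS = {"pcm", "wav", "mp3"}
MAX_TEXT_CHARS = 10000


def validate_request(text: str, format: str = "wav", volume: int = 50,
                     rate: float = 1.0, pitch: float = 1.0) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not text:
        problems.append("text must not be empty")
    elif len(text) > MAX_TEXT_CHARS:
        problems.append("text exceeds %d characters" % MAX_TEXT_CHARS)
    if format.lower() not in SUPPORTED_FORMATS:
        problems.append("format must be one of PCM, WAV, MP3")
    if not 0 <= volume <= 100:
        problems.append("volume must be between 0 and 100")
    if not 0.5 <= rate <= 2.0:
        problems.append("rate must be between 0.5 and 2")
    if not 0.5 <= pitch <= 2.0:
        problems.append("pitch must be between 0.5 and 2")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem at once.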

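Returned results

The result contains binary audio data. If timestamps are enabled, timestamp information is returned as well. Taking the Java SDK as an example, the SpeechSynthesizer class provides the following methods.

Method	Signature	Description
getAudioData	ByteBuffer getAudioData()	The binary audio data.
getTimestamps	List<Sentence> getTimestamps()	A List of sentence-level timestamp objects (Sentence). The timestamp classes are described below.

Timestamp fields in the returned results

Field	Type	Description
begin_time	int	Start time of the sentence, character, or phoneme, in ms.
end_time	int	End time of the sentence, character, or phoneme, in ms.
words	List[]	The character-level timestamps it contains. Requires word_timestamp_enabled to be set to true in the request.
text	string	Text that the timestamp refers to.
phonemes	List[]	The phoneme-level timestamps it contains. Requires phoneme_timestamp_enabled to also be set to true in the request.
tone	string	Tone. For English, 0/1/2 denote unstressed / stressed / secondary stress. For Pinyin, 1/2/3/4/5 denote the first / second / third / fourth / neutral tone.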
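Since the timestamps are intended for uses such as subtitles, the fields above can be turned into SubRip (SRT) cues. The sketch below assumes each sentence is a dict with begin_time, end_time, and a words list whose items carry a text field; that nesting is an assumption based on the table, and the helper names are hypothetical.

```python
from typing import Dict, List


def ms_to_srt(ms: int) -> str:
    """Format milliseconds as an SRT time code, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def sentences_to_srt(sentences: List[Dict]) -> str:
    """Build one SRT cue per sentence-level timestamp."""
    cues = []
    for index, sentence in enumerate(sentences, start=1):
        # Joining without spaces suits Chinese text; other languages may
        # need a separator between words.
        text = "".join(word["text"] for word in sentence.get("words", []))
        cues.append("%d\n%s --> %s\n%s\n" % (
            index,
            ms_to_srt(sentence["begin_time"]),
            ms_to_srt(sentence["end_time"]),
            text,
        ))
    return "\n".join(cues)
```

Word-level begin_time and end_time values could be used in the same way for finer-grained cues, such as karaoke-style highlighting.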

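Streaming call

Submit a single speech synthesis task and receive intermediate results as a stream: the results are delivered through the callback functions of ResultCallback. After the completion callback fires, you can also retrieve the complete audio and timestamp results in one piece.

The following example uses the streaming interface to call the voice model Zhichu (sambert-zhichu-v1), synthesizes the text “今天天气怎么样” into streamed audio at a 48 kHz sample rate in the default audio format (wav), and obtains the corresponding timestamps.

Example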

Python

# coding=utf-8

import dashscope
from dashscope.api_entities.dashscope_response import SpeechSynthesisResponse
from dashscope.audio.tts import ResultCallback, SpeechSynthesizer, SpeechSynthesisResult

dashscope.api_key = 'your-dashscope-api-key'

class Callback(ResultCallback):
    def on_open(self):
        print('Speech synthesizer is opened.')

    def on_complete(self):
        print('Speech synthesizer is completed.')

    def on_error(self, response: SpeechSynthesisResponse):
        print('Speech synthesizer failed, response is %s' % (str(response)))

    def on_close(self):
        print('Speech synthesizer is closed.')

    def on_event(self, result: SpeechSynthesisResult):
        if result.get_audio_frame() is not None:
            # len() reports the number of audio bytes in this frame
            print('audio result length:', len(result.get_audio_frame()))

        if result.get_timestamp() is not None:
            print('timestamp result:', str(result.get_timestamp()))

callback = Callback()
SpeechSynthesizer.call(model='sambert-zhichu-v1',
                       text='今天天气怎么样',
                       sample_rate=48000,
                       callback=callback,
                       word_timestamp_enabled=True,
                       phoneme_timestamp_enabled=True)

Java

package com.alibaba.dashscope.sample;

import com.alibaba.dashscope.audio.tts.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.tts.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.common.Status;

import java.util.concurrent.CountDownLatch;

public class Main {
    public static void main(String[] args) {
        CountDownLatch latch = new CountDownLatch(1);
        SpeechSynthesizer synthesizer = new SpeechSynthesizer();
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .model("sambert-zhichu-v1")
                .text("今天天气怎么样")
                .sampleRate(48000)
                .enableWordTimestamp(true)
                .enablePhonemeTimestamp(true)
                .apiKey("your-dashscope-api-key")
                .build();

        class ReactCallback extends ResultCallback<SpeechSynthesisResult> {

            @Override
            public void onEvent(SpeechSynthesisResult result) {
                if (result.getAudioFrame() != null) {
                    // do something with the audio frame
                    System.out.println("audio result length: " + result.getAudioFrame().array().length);
                }
                if (result.getTimestamp() != null) {
                    // do something with the timestamp
                    System.out.println("timestamp: " + result.getTimestamp());
                }
            }

            @Override
            public void onComplete() {
                // do something when the synthesis is done
                System.out.println("onComplete!");
                latch.countDown();
            }

            @Override
            public void onError(Exception e) {
                // do something when an error occurs
                System.out.println("onError:" + e);
                latch.countDown();
            }
        }

        synthesizer.call(param, new ReactCallback());
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.exit(0);
    }
}

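Request parameters

Parameter	Type	Default	Description
model	string	-	Name of the voice model used for speech synthesis. For the full list, see the model list.
text	string	-	Text to synthesize. It must be UTF-8 encoded and non-empty. A single request can synthesize at most 10,000 characters, where each Chinese character, English letter, and punctuation mark counts as 1 character. SSML is supported; for usage, see Introduction to the SSML markup language.
format	string	WAV	Encoding format of the returned audio. PCM, WAV, and MP3 are supported.
sample_rate	int	16000	Sample rate of the returned audio. The model's default sample rate is recommended (see the model list). If the values do not match, the service resamples up or down as needed.
volume	int	50	Volume of the returned audio, in the range 0 to 100.
rate	double	1	Speech rate of the returned audio, in the range 0.5 to 2. 0.5: half the default rate. (The default rate is the rate at which the model synthesizes speech by default; it varies slightly between voices and is roughly 4 characters per second.) 1: the default rate. 2: twice the default rate.
pitch	double	1	Pitch of the returned audio, in the range 0.5 to 2.
word_timestamp_enabled	bool	false	Whether to enable character-level timestamps.
phoneme_timestamp_enabled	bool	false	Whether to also return phoneme-level timestamps, on top of character-level timestamps being enabled.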
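Text longer than the 10,000-character limit has to be split across several requests. One possible approach, sketched below, is to cut at sentence-ending punctuation so that no chunk exceeds the limit. split_text is a hypothetical helper; it uses len() as an approximation of the service's character count and does not account for SSML markup.

```python
import re
from typing import List

# A run of ordinary characters plus any trailing sentence-ending punctuation,
# or a run of punctuation on its own; together the matches cover the text.
_SENTENCE = re.compile(r"[^。！？!?.\n]+[。！？!?.\n]*|[。！？!?.\n]+")


def split_text(text: str, limit: int = 10000) -> List[str]:
    """Split text into chunks of at most `limit` characters at sentence ends."""
    chunks, current = [], ""
    for sentence in _SENTENCE.findall(text):
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if len(current) + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks reproduces the original text, so each chunk can be synthesized in order and the resulting audio concatenated.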

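Callback functions

Taking the Java SDK as an example, ResultCallback exposes the following callback methods, which can be overridden to handle a streaming response.

Method signature	Must override	Description
void onOpen(Status status)	No	Called as soon as the WebSocket connection is established.
void onEvent(SpeechSynthesisResult result)	Yes	Called when the service sends a response. The SpeechSynthesisResult type is described below.
void onComplete()	No	Called after all synthesized data has been returned.
void onError(Exception e)	No	Called when an exception occurs during the call or the service returns an error.
void doClose(Status status)	No	Called while the service is closing the connection.
void onClose(Status status)	No	Called after the service has closed the connection.

Returned results

Taking the Java SDK as an example, a SpeechSynthesisResult represents one piece of streamed synthesis data. It contains a synthesized audio fragment and the sentence-level timestamp of the text currently being synthesized.

Important

getAudioFrame() or getTimestamp() on a SpeechSynthesisResult may return null, because the service replies with two kinds of data frames, binary audio frames and text timestamp frames, which are returned to the client alternately.

Member method	Signature	Description
getAudioFrame	ByteBuffer getAudioFrame()	Returns the incremental binary audio data of one streamed synthesis fragment. May be null.
getTimestamp	Sentence getTimestamp()	Returns the sentence-level timestamp of the sentence currently being synthesized.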
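Because each streamed result may carry an audio frame, a timestamp, or neither, a handler has to check both before using them. The sketch below accumulates the frames in Python, using the get_audio_frame() and get_timestamp() method names from the Python example above. FrameCollector and FakeResult are hypothetical; FakeResult only stands in for SpeechSynthesisResult so the code runs without the SDK.

```python
class FrameCollector:
    """Accumulates streamed audio frames and timestamps, skipping empty ones."""

    def __init__(self):
        self._audio = bytearray()
        self.timestamps = []

    def on_event(self, result):
        frame = result.get_audio_frame()
        if frame is not None:
            self._audio.extend(frame)  # append the incremental audio bytes
        timestamp = result.get_timestamp()
        if timestamp is not None:
            self.timestamps.append(timestamp)

    @property
    def audio(self) -> bytes:
        """All audio bytes received so far, in arrival order."""
        return bytes(self._audio)


class FakeResult:
    """Hypothetical stand-in for SpeechSynthesisResult, for illustration only."""

    def __init__(self, frame=None, timestamp=None):
        self._frame = frame
        self._timestamp = timestamp

    def get_audio_frame(self):
        return self._frame

    def get_timestamp(self):
        return self._timestamp
```

A real callback could delegate its on_event to such a collector and write collector.audio to a file once the completion callback fires.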

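Streaming call through Flowable

The Java SDK additionally supports speech synthesis through a Flowable-based streaming call. After the Flowable's onComplete(), you can obtain the complete results through getTimestamps() and getAudioData() on the SpeechSynthesizer object.

The following example uses the blockingForEach interface of the Flowable object to block while receiving each streamed SpeechSynthesisResult, named msg.

Example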

Java

package com.alibaba.dashscope.sample;

import com.alibaba.dashscope.audio.tts.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.tts.SpeechSynthesizer;
import io.reactivex.Flowable;

public class Main {
    public static void main(String[] args) {
        SpeechSynthesizer synthesizer = new SpeechSynthesizer();
        SpeechSynthesisParam param = SpeechSynthesisParam.builder()
                .model("sambert-zhichu-v1")
                .text("今天天气怎么样")
                .sampleRate(48000)
                .enableWordTimestamp(true)
                .apiKey("your-dashscope-api-key")
                .build();

        Flowable<SpeechSynthesisResult> flowable = synthesizer.streamCall(param);
        flowable.blockingForEach(
                msg -> {
                    if (msg.getAudioFrame() != null) {
                        // do something with the audio frame
                        System.out.println("getAudioFrame");
                    }
                    if (msg.getTimestamp() != null) {
                        // do something with the timestamp
                        System.out.println("getTimestamp");
                    }
                }
        );
        System.exit(0);
    }
}

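Returned results

This interface delivers the streamed results mainly through the returned Flowable<SpeechSynthesisResult>. Once the Flowable has returned all of its streamed data, you can also call getAudioData and getTimestamps on the corresponding SpeechSynthesizer object to obtain the complete synthesized audio and the complete timestamps. For how to use Flowable, see the rxjava API.

Status codes

For the general DashScope status codes, see Return status codes.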