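Note

Supported domain / task: audio / ttsv2 (speech synthesis).

CosyVoice speech synthesis is built on CosyVoice, the generative speech large model from Tongyi Lab. It relies on a large-scale pre-trained language model to combine text understanding with speech generation, so it can interpret a wide range of text and render it as natural, human-like speech. Both streaming text input and streaming audio output are supported.

In addition to the conventional interaction mode (submit a complete text and receive the audio either in one piece or as a stream), CosyVoice offers a fully streaming mode that accepts text as a stream and returns audio as a stream, which makes it possible to synthesize the text an LLM generates in real time.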
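
The fully streaming mode can be pictured as a loop that forwards each text fragment to the synthesizer as soon as the LLM produces it. The following is a minimal sketch, not the SDK's own code: FakeSynthesizer and speak_llm_stream are hypothetical names introduced here, and the only assumption about the real synthesizer is that it exposes streaming_call(text) and streaming_complete(), as the streaming input example in this document does.

```python
from typing import Iterable, List


class FakeSynthesizer:
    """Hypothetical stand-in that records calls instead of contacting the service."""

    def __init__(self) -> None:
        self.received: List[str] = []
        self.completed = False

    def streaming_call(self, text: str) -> None:
        self.received.append(text)

    def streaming_complete(self) -> None:
        self.completed = True


def speak_llm_stream(synthesizer, fragments: Iterable[str]) -> int:
    """Forward each non-empty text fragment as soon as it arrives.

    Returns the number of fragments sent. The task is finished with
    streaming_complete() once the fragment source is exhausted.
    """
    sent = 0
    for fragment in fragments:
        if not fragment:
            continue  # skip empty fragments, which carry nothing to synthesize
        synthesizer.streaming_call(fragment)
        sent += 1
    synthesizer.streaming_complete()
    return sent


if __name__ == "__main__":
    fake = FakeSynthesizer()
    count = speak_llm_stream(fake, ["今天", "", "天气", "怎么样?"])
    print(count, fake.received, fake.completed)
```

In real use, the fragments argument would be the generator that yields LLM output, and the fake object would be replaced by a SpeechSynthesizer created with a callback.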

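Prerequisites

Synchronous call

Submits a single speech synthesis task without a callback. No intermediate results are streamed; the complete result is returned in one piece when synthesis finishes.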
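
Because a synchronous call returns only when the whole result is ready, an application that must stay responsive may prefer to run it on a worker thread. The sketch below shows one way to do that with the standard library. submit_synthesis and fake_blocking_call are hypothetical names; fake_blocking_call merely imitates a blocking call such as synthesizer.call, and whether one SDK synthesizer object may be shared across threads is not covered here.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

# A small shared pool; the size of 2 is an arbitrary illustrative choice.
_pool = ThreadPoolExecutor(max_workers=2)


def submit_synthesis(blocking_call: Callable[[str], bytes], text: str) -> "Future[bytes]":
    """Run a blocking synthesis call on a worker thread and return a Future."""
    return _pool.submit(blocking_call, text)


def fake_blocking_call(text: str) -> bytes:
    """Hypothetical stand-in for a blocking call: sleeps, then returns bytes."""
    time.sleep(0.05)
    return text.encode("utf-8")


if __name__ == "__main__":
    future = submit_synthesis(fake_blocking_call, "今天天气怎么样?")
    print("main thread is free while synthesis runs")
    audio = future.result(timeout=5)  # blocks only here, when the result is needed
    print(len(audio), "bytes received")
```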

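Request example

The following example uses the synchronous interface to call the CosyVoice voice Longxiaochun (龙小淳, longxiaochun), synthesizes the text "今天天气怎么样" ("How is the weather today?") into MP3 audio with a 22050 Hz sampling rate, and saves it to a file named output.mp3.

Note
  • Replace your-dashscope-api-key in the example with your own API-KEY; otherwise the code will not run.

  • The synchronous interface blocks the current thread until synthesis completes or an error occurs.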

Python:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

# Replace your-dashscope-api-key with your own API-KEY
dashscope.api_key = "your-dashscope-api-key"
model = "cosyvoice-v1"
voice = "longxiaochun"


synthesizer = SpeechSynthesizer(model=model, voice=voice)
audio = synthesizer.call("今天天气怎么样?")
print('requestId: ', synthesizer.get_last_request_id())
with open('output.mp3', 'wb') as f:
    f.write(audio)

Java:

package SpeechSynthesisDemo;

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Tts2File {
  /**
   * Replace your-dashscope-api-key with your own API-KEY
   */
  private static String apikey = "your-dashscope-api-key";
  private static String model = "cosyvoice-v1";
  private static String voice = "longxiaochun";

  public static void StreamAuidoDataToSpeaker() {
    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            .apiKey(apikey)
            .model(model)
            .voice(voice)
            .build();
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    ByteBuffer audio = synthesizer.call("今天天气怎么样?");
    File file = new File("output.mp3");
    System.out.print("requestId: " + synthesizer.getLastRequestId());
    try (FileOutputStream fos = new FileOutputStream(file)) {
      fos.write(audio.array());
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    StreamAuidoDataToSpeaker();
    System.exit(0);
  }
}

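Request parameters

Parameter | Type | Required | Default | Description
model | string | | | Name of the model used for speech synthesis (set this to cosyvoice-v1).
voice | string | | | Name of the voice used for speech synthesis. For more information, see the voice list.
text | string | | | Text to synthesize.
format | AudioFormat | | The default sampling rate and audio format of the selected voice, as given in the model list. | Encoding format of the synthesized audio. The supported values are listed after this table.
volume | int | | 50 | Volume of the synthesized audio. Value range: 0 to 100. The field name depends on the SDK; see the Important note after this table.
speech_rate | double | | 1.0 | Speech rate of the synthesized audio. Value range: 0.5 to 2. The values are explained after this table.
pitch_rate | double | | 1.0 | Pitch of the synthesized audio. Value range: 0.5 to 2.

Supported values of format:

  • WAV_8000HZ_MONO_16BIT
  • WAV_16000HZ_MONO_16BIT
  • WAV_22050HZ_MONO_16BIT
  • WAV_24000HZ_MONO_16BIT
  • WAV_44100HZ_MONO_16BIT
  • WAV_48000HZ_MONO_16BIT
  • MP3_8000HZ_MONO_128KBPS
  • MP3_16000HZ_MONO_128KBPS
  • MP3_22050HZ_MONO_256KBPS
  • MP3_24000HZ_MONO_256KBPS
  • MP3_44100HZ_MONO_256KBPS
  • MP3_48000HZ_MONO_256KBPS
  • PCM_8000HZ_MONO_16BIT
  • PCM_16000HZ_MONO_16BIT
  • PCM_22050HZ_MONO_16BIT
  • PCM_24000HZ_MONO_16BIT
  • PCM_44100HZ_MONO_16BIT
  • PCM_48000HZ_MONO_16BIT

Important

The name of the volume field differs across versions and programming languages of the DashScope SDK:

  • Java SDK: volume
  • Python SDK 1.20.10 and later: volume
  • Python SDK earlier than 1.20.10: volumn

Values of speech_rate:

  • 0.5: 0.5 times the default speech rate.
  • 1: the default speech rate, that is, the rate the model produces by default. It varies slightly by voice and is about 4 characters per second.
  • 2: 2 times the default speech rate.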
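
The value ranges and the version-dependent name of the volume field can be checked before a request is sent. The sketch below is illustrative only: volume_key, tuning_kwargs and estimate_seconds are hypothetical helpers, the version string is assumed to be purely numeric and dot-separated, and the duration estimate simply applies the figure of about 4 characters per second at the default rate.

```python
from typing import Dict, Tuple, Union

Number = Union[int, float]


def _parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.20.10' into (1, 20, 10). Assumes numeric, dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def volume_key(python_sdk_version: str) -> str:
    """Return the volume field name for a given Python SDK version."""
    if _parse_version(python_sdk_version) >= (1, 20, 10):
        return "volume"
    return "volumn"  # spelling used by Python SDK versions before 1.20.10


def tuning_kwargs(
    python_sdk_version: str,
    volume: int = 50,
    speech_rate: float = 1.0,
    pitch_rate: float = 1.0,
) -> Dict[str, Number]:
    """Validate the documented ranges and build keyword arguments."""
    if not 0 <= volume <= 100:
        raise ValueError("volume must be within 0 to 100")
    if not 0.5 <= speech_rate <= 2:
        raise ValueError("speech_rate must be within 0.5 to 2")
    if not 0.5 <= pitch_rate <= 2:
        raise ValueError("pitch_rate must be within 0.5 to 2")
    return {
        volume_key(python_sdk_version): volume,
        "speech_rate": speech_rate,
        "pitch_rate": pitch_rate,
    }


def estimate_seconds(text: str, speech_rate: float = 1.0,
                     chars_per_second: float = 4.0) -> float:
    """Rough duration estimate; the real rate varies slightly by voice."""
    return len(text) / (chars_per_second * speech_rate)


if __name__ == "__main__":
    print(tuning_kwargs("1.20.3", volume=80))
    print(estimate_seconds("今天天气怎么样", speech_rate=2.0))
```

The resulting dictionary could then be expanded into the synthesizer's keyword arguments, assuming the constructor accepts the parameter names listed in the table.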

Return value

The return value is the synthesized audio as binary data.
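
When one of the PCM_* formats is requested, the returned bytes are assumed here to be raw samples without a container header, so most players cannot open them directly. The sketch below wraps such bytes in a WAV container using only the standard library. pcm_to_wav is a hypothetical helper, and its defaults mirror PCM_22050HZ_MONO_16BIT (22050 Hz, one channel, 2 bytes per sample).

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 22050,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


if __name__ == "__main__":
    silence = b"\x00\x00" * 22050  # one second of 16-bit mono silence
    with open("silence.wav", "wb") as f:
        f.write(pcm_to_wav(silence))
```

MP3_* and WAV_* results need no such step and can be written to a file as they are, as the request example does for MP3.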

Interface details

Python:

def call(self, text: str):
    """
    Speech synthesis.
    If callback is set, the audio will be returned in real-time through the on_event interface.
    Otherwise, this function blocks until all audio is received and then returns the complete audio data.

    Parameters:
    -----------
    text: str
        utf-8 encoded text
    return: bytes
        If a callback is not set during initialization, the complete audio is returned as the function's return value. Otherwise, the return value is None.
    """

Java:

/**
 * Speech synthesis.<br>
 * If callback is set, the audio will be returned in real-time through the on_event interface.<br>
 * Otherwise, this function blocks until all audio is received and then returns the complete audio data.
 *
 * @param text utf-8 encoded text
 * @return If a callback is not set during initialization, the complete audio is returned as the function's return value. Otherwise, the return value is null.
 */
public ByteBuffer call(String text)

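Asynchronous call

Submits a single speech synthesis task and streams intermediate results through callbacks: the synthesized audio is delivered incrementally by the callback functions of ResultCallback.

Example

The following example uses the asynchronous interface to call the CosyVoice voice Longxiaochun (龙小淳, longxiaochun) and synthesizes the text "今天天气怎么样" ("How is the weather today?") into MP3 audio with a 22050 Hz sampling rate.

Note
  • Replace your-dashscope-api-key in the example with your own API-KEY; otherwise the code will not run.

  • The asynchronous interface does not block the current thread. Listen for the onComplete event (on_complete in Python) to know when all audio has been received.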

Python:

# coding=utf-8

import dashscope
from dashscope.audio.tts_v2 import *

# Replace your-dashscope-api-key with your own API-KEY
dashscope.api_key = "your-dashscope-api-key"
model = "cosyvoice-v1"
voice = "longxiaochun"


class Callback(ResultCallback):
    # Output file handle; opened in on_open and closed in on_close
    file = None

    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("websocket is open.")

    def on_complete(self):
        print("speech synthesis task complete successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        self.file.close()

    def on_event(self, message):
        print(f"recv speech synthesis message {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self.file.write(data)


callback = Callback()

synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
)

synthesizer.call("今天天气怎么样?")
print('requestId: ', synthesizer.get_last_request_id())

Java:

package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;
import java.util.concurrent.CountDownLatch;

public class StreamInputTtsPlayableDemo {
  /**
   * Replace your-dashscope-api-key with your own API-KEY
   */
  private static String apikey = "your-dashscope-api-key";
  private static String model = "cosyvoice-v1";
  private static String voice = "longxiaochun";

  public static void StreamAuidoDataToSpeaker() {
    CountDownLatch latch = new CountDownLatch(1);

    // Configure the callback
    ResultCallback<SpeechSynthesisResult> callback =
        new ResultCallback<SpeechSynthesisResult>() {
          @Override
          public void onEvent(SpeechSynthesisResult result) {
            System.out.println("Received message: " + result);
            if (result.getAudioFrame() != null) {
              // TODO: process the audio
              System.out.println("Received audio");
            }
          }

          @Override
          public void onComplete() {
            System.out.println("Received Complete");
            latch.countDown();
          }

          @Override
          public void onError(Exception e) {
            System.out.println("Received error: " + e.toString());
            latch.countDown();
          }
        };

    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            .apiKey(apikey)
            .model(model)
            .voice(voice)
            .build();
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
    // With a callback set, call() does not block the current thread
    synthesizer.call("今天天气怎么样?");
    System.out.print("requestId: " + synthesizer.getLastRequestId());
    // Wait for synthesis to finish
    try {
      latch.await();
    } catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    StreamAuidoDataToSpeaker();
    System.exit(0);
  }
}

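Request parameters

Same as the synchronous call. If a callback is set at initialization, call becomes an asynchronous interface and returns null immediately; the audio is delivered in real time through the callback functions. For details, see Request parameters.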
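
Since the asynchronous call returns at once, the caller needs its own way to wait for the last audio chunk. The sketch below collects chunks and signals completion with a threading.Event, which plays the same role as the CountDownLatch in the Java example. CollectingCallback is a hypothetical stand-alone class that only mirrors the method names of the Python callback shown above; in real code it would presumably subclass ResultCallback.

```python
import threading
from typing import List, Optional


class CollectingCallback:
    """Collects audio chunks and lets another thread wait for completion."""

    def __init__(self) -> None:
        self._chunks: List[bytes] = []
        self._done = threading.Event()
        self.error: Optional[str] = None

    def on_open(self) -> None:
        pass

    def on_data(self, data: bytes) -> None:
        if data:  # ignore empty chunks
            self._chunks.append(data)

    def on_complete(self) -> None:
        self._done.set()

    def on_error(self, message) -> None:
        self.error = str(message)
        self._done.set()  # release the waiter on failure too

    def on_close(self) -> None:
        pass

    def on_event(self, message) -> None:
        pass

    def wait(self, timeout_s: float = 30.0) -> bytes:
        """Block until completion or error, then return all audio received."""
        if not self._done.wait(timeout_s):
            raise TimeoutError("synthesis did not finish in time")
        if self.error is not None:
            raise RuntimeError(self.error)
        return b"".join(self._chunks)


if __name__ == "__main__":
    callback = CollectingCallback()

    def simulate_service() -> None:
        # Stands in for the SDK thread that would invoke the callback.
        callback.on_open()
        callback.on_data(b"ab")
        callback.on_data(b"cd")
        callback.on_complete()
        callback.on_close()

    threading.Thread(target=simulate_service).start()
    print(len(callback.wait(5.0)), "bytes collected")
```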

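Return value

The data is carried by the SpeechSynthesisResult object passed to the on_event callback. It provides the following member method for reading the data:

Member method | Method signature | Description
getAudioFrame | ByteBuffer getAudioFrame() | Returns the incremental binary audio data of one streaming synthesis chunk; may be empty.

The call function returns no data.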

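Streaming input call

Example

Text is submitted in several parts within the same speech synthesis task, and the synthesized audio is received in real time through callbacks.

The following example uses the streaming input interface to call the CosyVoice voice Longxiaochun (龙小淳, longxiaochun), sends the text in several parts, synthesizes PCM audio with a 22050 Hz sampling rate, and plays it in real time with a player.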

Python:

# coding=utf-8
#
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import time
import pyaudio
import dashscope
from dashscope.api_entities.dashscope_response import SpeechSynthesisResponse
from dashscope.audio.tts_v2 import *

# Replace your-dashscope-api-key with your own API-KEY
dashscope.api_key = "your-dashscope-api-key"
model = "cosyvoice-v1"
voice = "longxiaochun"


class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("websocket is open.")
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("speech synthesis task complete successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        # Stop the player
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        print(f"recv speech synthesis message {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self._stream.write(data)


callback = Callback()

test_text = [
    "流式文本语音合成SDK,",
    "可以将输入的文本",
    "合成为语音二进制数据,",
    "相比于非流式语音合成,",
    "流式合成的优势在于实时性",
    "更强。用户在输入文本的同时",
    "可以听到接近同步的语音输出,",
    "极大地提升了交互体验,",
    "减少了用户等待时间。",
    "适用于调用大规模",
    "语言模型(LLM),以",
    "流式输入文本的方式",
    "进行语音合成的场景。",
]

synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,  
    callback=callback,
)


for text in test_text:
    synthesizer.streaming_call(text)
    time.sleep(0.5)
synthesizer.streaming_complete()
print('requestId: ', synthesizer.get_last_request_id())

Java:

package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import com.alibaba.dashscope.utils.Constants;
import java.util.concurrent.CountDownLatch;

public class StreamInputTtsPlayableDemo {
    private static String[] textArray = {"流式文本语音合成SDK,",
            "可以将输入的文本", "合成为语音二进制数据,", "相比于非流式语音合成,",
            "流式合成的优势在于实时性", "更强。用户在输入文本的同时",
            "可以听到接近同步的语音输出,", "极大地提升了交互体验,",
            "减少了用户等待时间。", "适用于调用大规模", "语言模型(LLM),以",
            "流式输入文本的方式", "进行语音合成的场景。"};
    /**
     * Replace your-dashscope-api-key with your own API-KEY
     */
    private static String apikey = "your-dashscope-api-key";
    private static String model = "cosyvoice-v1";
    private static String voice = "longxiaochun";

    public static void StreamAuidoDataToSpeaker() {
        CountDownLatch latch = new CountDownLatch(1);

        // Configure the callback
        ResultCallback<SpeechSynthesisResult> callback =
                new ResultCallback<SpeechSynthesisResult>() {
                    @Override
                    public void onEvent(SpeechSynthesisResult result) {
                        System.out.println("Received message: " + result);
                        if (result.getAudioFrame() != null) {
                            // TODO: process the audio
                            System.out.println("Received audio");
                        }
                    }

                    @Override
                    public void onComplete() {
                        System.out.println("Received Complete");
                        latch.countDown();
                    }

                    @Override
                    public void onError(Exception e) {
                        System.out.println("Received error: " + e.toString());
                        latch.countDown();
                    }
                };

        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        .apiKey(apikey)
                        .model(model)
                        .voice(voice)          
                        .format(SpeechSynthesisAudioFormat
                                .PCM_22050HZ_MONO_16BIT) // Use PCM or MP3 for streaming synthesis
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // Send the text fragments one by one; streamingCall can be called multiple times
        for (String text : textArray) {
            synthesizer.streamingCall(text);
        }
        synthesizer.streamingComplete();
        System.out.print("requestId: " + synthesizer.getLastRequestId());
        // Wait for synthesis to finish
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        StreamAuidoDataToSpeaker();
        System.exit(0);
    }
}
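
In the Python example above, on_data writes each chunk straight to the player, so a slow player holds up the thread that delivers the callbacks. One possible refinement, offered as a sketch rather than as part of the SDK, is to queue the chunks and let a separate thread drain them. BufferedSink is a hypothetical class, and its write argument is any callable that consumes bytes, for example the write method of the audio stream opened in on_open.

```python
import queue
import threading
from typing import Callable, List, Optional


class BufferedSink:
    """Decouples chunk arrival from chunk consumption with a queue."""

    def __init__(self, write: Callable[[bytes], None]) -> None:
        self._write = write
        self._queue: "queue.Queue[Optional[bytes]]" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, data: bytes) -> None:
        """Called from on_data; returns immediately."""
        self._queue.put(data)

    def close(self) -> None:
        """Called from on_close; waits until every queued chunk is written."""
        self._queue.put(None)  # sentinel that stops the worker
        self._worker.join()

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:
                break
            self._write(item)


if __name__ == "__main__":
    played: List[bytes] = []
    sink = BufferedSink(played.append)  # a list stands in for the player
    for chunk in (b"\x00\x01", b"\x02\x03"):
        sink.put(chunk)
    sink.close()
    print(len(played), "chunks written in arrival order")
```

Chunks are written in the order they arrive because a single worker thread reads from a FIFO queue.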

Interface details

  • Send text

    Python:

    def streaming_call(self, text: str):
        """
        Streaming input mode: You can call the streaming_call function multiple times to send text. A session will be created on the first call.
        The session ends after calling streaming_complete.

        Parameters:
        -----------
        text: str
            utf-8 encoded text
        """

    Java:

    /**
     * Streaming input mode: You can call the streamingCall function multiple times to send text. A session will be created on the first call.
     * The session ends after calling streamingComplete.
     * @param text utf-8 encoded text
     */
    public void streamingCall(String text)

  • Finish the task stream synchronously

    Python:

    def streaming_complete(self, complete_timeout_millis=10000):
        """
        Synchronously stop the streaming input speech synthesis task. Wait for all remaining synthesized audio before returning.

        Parameters:
        -----------
        complete_timeout_millis: int
            Timeout in milliseconds. Throws TimeoutError exception if it times out.
        """

    Java:

    /**
     * Synchronously stop the streaming input speech synthesis task. Wait for all remaining synthesized audio before returning.
     * If it does not complete within 10 seconds, a timeout occurs and a TimeoutError exception is thrown.
     */
    public void streamingComplete()

    /**
     * Synchronously stop the streaming input speech synthesis task. Wait for all remaining synthesized audio before returning.
     * @param completeTimeoutMillis The timeout period for await. Throws TimeoutError exception if it times out.
     */
    public void streamingComplete(long completeTimeoutMillis)

  • Finish the task stream asynchronously

    Python:

    def async_streaming_complete(self):
        """
        Asynchronously stop the streaming input speech synthesis task, returns immediately.
        You need to listen and handle the STREAM_INPUT_TTS_EVENT_SYNTHESIS_COMPLETE event in the on_event callback.
        Do not destroy the object and callback before this event.
        """

    Java:

    /**
     * Asynchronously stop the streaming input speech synthesis task, returns immediately.
     * You need to listen and handle the STREAM_INPUT_TTS_EVENT_SYNTHESIS_COMPLETE event in the on_event callback.
     * Do not destroy the object and callback before this event.
     */
    public void asyncStreamingComplete()

  • Cancel the current task

    Python:

    def streaming_cancel(self):
        """
        Immediately terminate the streaming input speech synthesis task and discard any remaining audio that is not yet delivered.
        """

    Java:

    /**
     * Immediately terminate the streaming input speech synthesis task and discard any remaining audio that is not yet delivered.
     */
    public void streamingCancel()
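
Taken together, these methods suggest a simple session life cycle: send fragments, finish with a timeout, and cancel if anything goes wrong. The sketch below illustrates that flow under stated assumptions. run_streaming_session and RecordingSynthesizer are hypothetical names, the stand-in only records calls, and whether cancelling after a failed send is always appropriate for the real service is an assumption of this sketch.

```python
from typing import Iterable, List, Optional


class RecordingSynthesizer:
    """Hypothetical stand-in that logs method calls; it can be told to fail."""

    def __init__(self, fail_on: Optional[str] = None) -> None:
        self.log: List[str] = []
        self._fail_on = fail_on

    def streaming_call(self, text: str) -> None:
        if text == self._fail_on:
            raise RuntimeError("simulated send failure")
        self.log.append("call:" + text)

    def streaming_complete(self, complete_timeout_millis: int = 10000) -> None:
        self.log.append("complete:%d" % complete_timeout_millis)

    def streaming_cancel(self) -> None:
        self.log.append("cancel")


def run_streaming_session(synthesizer, fragments: Iterable[str],
                          complete_timeout_millis: int = 10000) -> int:
    """Send all fragments, then finish; cancel the task if any step raises."""
    sent = 0
    try:
        for fragment in fragments:
            synthesizer.streaming_call(fragment)
            sent += 1
        synthesizer.streaming_complete(
            complete_timeout_millis=complete_timeout_millis
        )
    except Exception:
        # Discard undelivered audio, then let the caller see the error.
        synthesizer.streaming_cancel()
        raise
    return sent


if __name__ == "__main__":
    synthesizer = RecordingSynthesizer()
    run_streaming_session(synthesizer, ["流式文本", "语音合成"], 5000)
    print(synthesizer.log)
```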

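Calling through Flowable

The Java SDK additionally supports streaming speech synthesis through Flowable. After onComplete() of the Flowable object, the complete result can be obtained with getAudioData() of the SpeechSynthesizer object.

Non-streaming input example

The following example uses the blockingForEach interface of the Flowable object to receive, in a blocking manner, each streamed item of type SpeechSynthesisResult (named result in the code).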

package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.utils.Constants;

public class StreamInputTtsPlayableDemo {
  /**
   * Replace your-dashscope-api-key with your own API-KEY
   */
  private static String apikey = "your-dashscope-api-key";
  private static String model = "cosyvoice-v1";
  private static String voice = "longxiaochun";

  public static void StreamAuidoDataToSpeaker() throws NoApiKeyException {
    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            .apiKey(apikey)
            .model(model)
            .voice(voice)
            .build();
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    synthesizer.callAsFlowable("今天天气怎么样?").blockingForEach(result -> {
      System.out.println("Received message: " + result);
      if (result.getAudioFrame() != null) {
        // TODO: process the audio
        System.out.println("Received audio");
      }
    });
  }

  public static void main(String[] args) throws NoApiKeyException {
    StreamAuidoDataToSpeaker();
    System.exit(0);
  }
}

Interface details

/**
 * Stream output speech synthesis using Flowable features (non-streaming input)
 * @param text Text to be synthesized
 * @return The output event stream, including real-time audio
 * @throws ApiException
 * @throws NoApiKeyException
 */
public Flowable<SpeechSynthesisResult> callAsFlowable(String text)
        throws ApiException, NoApiKeyException

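Streaming input example

The following example passes a Flowable object as the input parameter to supply the text stream, and uses the blockingForEach interface of the returned Flowable object to receive, in a blocking manner, each streamed item of type SpeechSynthesisResult (named result in the code).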

package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

public class StreamInputTtsPlayableDemo {
  private static String[] textArray = {"流式文本语音合成SDK,",
      "可以将输入的文本", "合成为语音二进制数据,", "相比于非流式语音合成,",
      "流式合成的优势在于实时性", "更强。用户在输入文本的同时",
      "可以听到接近同步的语音输出,", "极大地提升了交互体验,",
      "减少了用户等待时间。", "适用于调用大规模", "语言模型(LLM),以",
      "流式输入文本的方式", "进行语音合成的场景。"};
  /**
   * Replace your-dashscope-api-key with your own API-KEY
   */
  private static String apikey = "your-dashscope-api-key";
  private static String model = "cosyvoice-v1";
  private static String voice = "longxiaochun";

  public static void StreamAuidoDataToSpeaker() throws NoApiKeyException {
    // Simulate streaming input
    Flowable<String> textSource = Flowable.create(emitter -> {
      new Thread(() -> {
        for (int i = 0; i < textArray.length; i++) {
          emitter.onNext(textArray[i]);
          try {
            Thread.sleep(1000);
          } catch (InterruptedException e) {
            throw new RuntimeException(e);
          }
        }
        emitter.onComplete();
      }).start();
    }, BackpressureStrategy.BUFFER);

    SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
            .apiKey(apikey)
            .model(model)
            .voice(voice)      
            .build();
    SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
    synthesizer.streamingCallAsFlowable(textSource).blockingForEach(result -> {
      if (result.getAudioFrame() != null) {
        // TODO: send the audio chunk to a player
        System.out.println(
            "audio result length: " + result.getAudioFrame().capacity());
      }
    });
  }

  public static void main(String[] args) throws NoApiKeyException {
    StreamAuidoDataToSpeaker();
    System.exit(0);
  }
}

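Return value

This interface delivers streaming results mainly through the returned Flowable<SpeechSynthesisResult>. After the Flowable has returned all streamed data, the complete synthesized audio can also be obtained with getAudioData of the corresponding SpeechSynthesizer object. For how to use Flowable, see the RxJava API documentation.

Interface details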

/**
 * Stream input and output speech synthesis using Flowable features
 * @param textStream The text stream to be synthesized
 * @return The output event stream, including real-time audio
 * @throws ApiException
 * @throws NoApiKeyException
 */
public Flowable<SpeechSynthesisResult> streamingCallAsFlowable(
    Flowable<String> textStream)