服务端Java SDK

本文介绍了如何使用阿里云百炼大模型服务提供的实时多模交互服务端 Java SDK,包括SDK下载安装、关键接口及代码示例。

多模态实时交互服务架构

(图:多模态实时交互服务接入架构流程图)

前提条件

开通阿里云百炼实时多模交互应用,获取Workspace ID、APP ID和API Key。

环境和安装

您可以通过Maven或者Gradle集成阿里云DashScope官方SDK(版本号 >= 2.20.5)。

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>dashscope-sdk-java</artifactId>
    <version>2.20.5</version>
</dependency>
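
若使用 Gradle,可在 build.gradle 的 dependencies 中添加等效依赖(与下文完整配置一致):

implementation 'com.alibaba:dashscope-sdk-java:2.20.5'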

交互模式说明

SDK支持 Push2Talk、Tap2Talk、Duplex(全双工)三种交互模式。

  • Push2Talk: 长按说话,抬起结束(或者点击开始,点击结束)的收音方式。

  • Tap2Talk: 点击开始说话,自动判断用户说话结束的收音方式。

  • Duplex: 全双工交互,连接后支持在任意时刻说话,支持语音打断。视频交互建议使用duplex模式。

音频格式说明

服务端接入方式只支持 WebSocket 传输协议。

  • 上行:支持 pcm 和 opus 格式音频进行语音识别。

  • 下行:支持 pcm 和 mp3 音频流。
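
SDK中的WebSocket接入地址通过 Constants.baseWebsocketApiUrl 指定(下文完整调用示例中也有体现):

Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";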

接口说明

MultiModalDialog

服务入口类

1、MultiModalDialog(MultiModalRequestParam param, MultiModalDialogCallback callback)

初始化服务对象,设置对话参数和回调。

/**
 * Constructor initializes service options and creates a duplex communication API instance.
 *
 * @param param Request parameter
 * @param callback Callback interface
 */
public MultiModalDialog(MultiModalRequestParam param, MultiModalDialogCallback callback) 

2、start

启动对话服务,成功后触发onStarted回调。注意onStarted回调中会返回dialogId。

/**
 * Starts the conversation session.
 */
public void start() 

3、startSpeech

通知服务端开始上传音频,注意需要在Listening状态才可以调用。在Push2Talk和Tap2Talk模式下需要调用,Duplex模式无需调用。

/**
 * Starts upload speech.
 */
public void startSpeech()

4、sendAudioData

通知服务端上传音频。

/**
 * Sends audio frame.
 *
 * @param audioFrame Audio frame data
 */
public void sendAudioData(ByteBuffer audioFrame)  

5、stopSpeech

通知服务端结束上传音频。只需要在Push2Talk(长按说话)模式下调用。

/**
 * Stops upload speech.
 */
public void stopSpeech()

6、interrupt

通知服务端:客户端需要打断当前交互并开始说话。服务端接受后会返回RequestAccepted事件。

/**
 * Interrupts current operation.
 */
public void interrupt()

7、localRespondingStarted

通知服务端,客户端开始播放tts音频。

/**
 * Local player start broadcast tts
 */
public void localRespondingStarted()

8、localRespondingEnded

通知服务端,客户端结束播放tts音频。

/**
 * Local player broadcast tts end
 */
public void localRespondingEnded()

9、stop

结束当前轮次对话。

/**
 * Stops the conversation.
 */ 
public void stop()

10、getDialogState

获得当前对话服务状态,返回DialogState枚举值。

/**
 * Gets current dialogue state.
 *
 * @return Current dialogue state
 */
public State.DialogState getDialogState()
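
一个常见用法是在调用start()后轮询对话状态,等待进入Listening后再发送音频(与下文完整示例一致,仅为示意写法):

try {
    while (conversation.getDialogState() != State.DialogState.LISTENING) {
        Thread.sleep(100); // 也可以改为在onStateChanged回调中记录状态,避免轮询
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}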

11、requestToRespond

端侧主动通过文本直接发起tts语音合成,或者向服务端发起图片等其他请求。

/**
 * Requests response.
 *
 * @param type Response type
 * @param text Response text
 * @param updateParams Update parameters
 */
public void requestToRespond(String type, String text, MultiModalRequestParam.UpdateParams updateParams)
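
例如,直接将一段文本合成为语音,或将文本作为prompt发送给大模型(取自下文调用示例):

// type="transcript":直接把文本转语音
conversation.requestToRespond("transcript", "幸福是一种技能。", null);
// type="prompt":把文本发送给大模型并获取回答
conversation.requestToRespond("prompt", "拍照看看前面有什么东西", null);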

12、updateInfo

更新参数信息等操作。

/**
 * Updates information.
 *
 * @param updateParams Update parameters
 */
public void updateInfo(MultiModalRequestParam.UpdateParams updateParams)

13、sendHeartBeat

发送心跳信息。在需要保持长连接的场景下使用:请每20秒调用一次,否则连接会在60秒无数据后超时报错。

/**
 * Send heart beat.
 *
 */
public void sendHeartBeat()
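
需要保持长连接时,可以用定时任务每20秒发送一次心跳。以下为一个示意写法(假设conversation已完成建联,heartBeatExecutor为本示例自拟的变量名):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService heartBeatExecutor = Executors.newSingleThreadScheduledExecutor();
// 每20秒发送一次心跳,避免60秒空闲超时
heartBeatExecutor.scheduleAtFixedRate(conversation::sendHeartBeat, 20, 20, TimeUnit.SECONDS);
// 会话结束时关闭定时任务:heartBeatExecutor.shutdown();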

MultiModalDialogCallback

回调函数

/**
 * Abstract class representing callbacks for multi-modal conversation events.
 *
 * @author songsong.shao
 * @date 2025/4/27
 */
public abstract class MultiModalDialogCallback {

    /**
     * Called when the conversation is connected.
     */
    public abstract void onConnected();

    /**
     * Called when a conversation starts with a specific dialog ID.
     *
     * @param dialogId The unique identifier for the dialog.
     */
    public abstract void onStarted(String dialogId);

    /**
     * Called when a conversation stops with a specific dialog ID.
     *
     * @param dialogId The unique identifier for the dialog.
     */
    public abstract void onStopped(String dialogId);

    /**
     * Called when speech starts in a specific dialog.
     *
     * @param dialogId The unique identifier for the dialog.
     */
    public abstract void onSpeechStarted(String dialogId);

    /**
     * Called when speech ends in a specific dialog.
     *
     * @param dialogId The unique identifier for the dialog.
     */
    public abstract void onSpeechEnded(String dialogId);

    /**
     * Called when an error occurs during a conversation.
     *
     * @param dialogId The unique identifier for the dialog.
     * @param errorCode The error code associated with the error.
     * @param errorMsg The error message associated with the error.
     */
    public abstract void onError(String dialogId, String errorCode, String errorMsg);

    /**
     * Called when the conversation state changes.
     *
     * @param state The new state of the conversation.
     */
    public abstract void onStateChanged(State.DialogState state);

    /**
     * Called when speech audio data is available.
     *
     * @param audioData The audio data as a ByteBuffer.
     */
    public abstract void onSpeechAudioData(ByteBuffer audioData);

    /**
     * Called when responding starts in a specific dialog.
     *
     * @param dialogId The unique identifier for the dialog.
     */
    public abstract void onRespondingStarted(String dialogId);

    /**
     * Called when responding ends in a specific dialog.
     *
     * @param dialogId The unique identifier for the dialog.
     * @param content The content of the response as a JsonObject.
     */
    public abstract void onRespondingEnded(String dialogId, JsonObject content);

    /**
     * Called when responding content is available in a specific dialog.
     *
     * @param dialogId The unique identifier for the dialog.
     * @param content The content of the response as a JsonObject.
     */
    public abstract void onRespondingContent(String dialogId, JsonObject content);

    /**
     * Called when speech content is available in a specific dialog.
     *
     * @param dialogId The unique identifier for the dialog.
     * @param content The content of the speech as a JsonObject.
     */
    public abstract void onSpeechContent(String dialogId, JsonObject content);

    /**
     * Called when a request is accepted in a specific dialog.
     *
     * @param dialogId The unique identifier for the dialog.
     */
    public abstract void onRequestAccepted(String dialogId);

    /**
     * Called when the conversation closes.
     */
    public abstract void onClosed();
}

MultiModalRequestParam

请求参数类

请求参数均支持builder模式设置参数,参数的值和说明参考如下。

Start建联请求参数

| 一级参数 | 二级参数 | 三级参数 | 四级参数 | 类型 | 说明 |
| --- | --- | --- | --- | --- | --- |
| task_group | | | | string | 任务组名称,固定为"aigc" |
| task | | | | string | 任务名称,固定为"multimodal-generation" |
| function | | | | string | 调用功能,固定为"generation" |
| model | | | | string | 服务名称,固定为"multimodal-dialog" |
| input | workspace_id | | | string | 用户业务空间ID |
| | app_id | | | string | 客户在管控台创建的应用ID,可以根据值规律确定使用哪个对话系统 |
| | directive | | | string | 指令名称:Start |
| | dialog_id | | | string | 对话ID,如果传入表示接着上次对话继续聊 |
| parameters | upstream | type | | string | 上行类型:AudioOnly(仅语音通话)、AudioAndVideo(上传视频) |
| | | mode | | string | 客户端使用的模式,可选项:push2talk、tap2talk、duplex,默认tap2talk |
| | | audio_format | | string | 音频格式,支持pcm、opus,默认为pcm |
| | downstream | voice | | string | 合成语音的音色 |
| | | sample_rate | | int | 合成语音的采样率,默认采样率24k |
| | | intermediate_text | | string | 控制返回给用户哪些中间文本:transcript(返回用户语音识别结果)、dialog(返回对话系统回答中间结果);可以设置多种,以逗号分隔,默认为transcript |
| | | audio_format | | string | 音频格式,支持pcm、mp3,默认为pcm |
| | client_info | user_id | | string | 终端用户ID,用来做用户相关的处理 |
| | | device | uuid | string | 客户端全局唯一的ID,需要用户自己生成并传入SDK |
| | | network | ip | string | 调用方公网IP |
| | | location | latitude | string | 调用方纬度信息 |
| | | | longitude | string | 调用方经度信息 |
| | | | city_name | string | 调用方所在城市 |
| | biz_params | user_defined_params | | object | 其他需要透传给agent的参数 |
| | | user_defined_tokens | | object | 透传agent所需鉴权信息 |
| | | tool_prompts | | object | 透传agent所需prompt |
| | | user_query_params | | object | 透传用户请求自定义参数 |
| | | user_prompt_params | | object | 透传用户prompt自定义参数 |
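
按照上表的层级,Start指令中payload部分的参数组织方式大致如下(仅为示意,省略了WebSocket协议的外层字段,取值请按实际业务填写):

{
  "task_group": "aigc",
  "task": "multimodal-generation",
  "function": "generation",
  "model": "multimodal-dialog",
  "input": {
    "workspace_id": "your_workspace_id",
    "app_id": "your_app_id",
    "directive": "Start"
  },
  "parameters": {
    "upstream": { "type": "AudioOnly", "mode": "duplex", "audio_format": "pcm" },
    "downstream": { "voice": "longxiaochun_v2", "sample_rate": 24000, "audio_format": "pcm" },
    "client_info": {
      "user_id": "1234",
      "device": { "uuid": "1234" }
    }
  }
}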

RequestToRespond 请求参数

| 一级参数 | 二级参数 | 类型 | 说明 |
| --- | --- | --- | --- |
| input | directive | string | 指令名称:RequestToRespond |
| | dialog_id | string | 对话ID |
| | type | string | 服务应该采取的交互类型:transcript(直接把文本转语音)、prompt(把文本发送给大模型并获取回答) |
| | text | string | 要处理的文本,可以是空字符串"",非null即可 |
| parameters | images | list[] | 需要分析的图片信息 |
| | biz_params | object | 与Start消息中的biz_params相同,传递对话系统自定义参数;RequestToRespond中的biz_params参数只在本次请求中生效 |

UpdateInfo 请求参数

| 一级参数 | 二级参数 | 三级参数 | 类型 | 说明 |
| --- | --- | --- | --- | --- |
| input | directive | | string | 指令名称:UpdateInfo |
| | dialog_id | | string | 对话ID |
| parameters | images | | list[] | 图片数据 |
| | client_info | status | object | 客户端当前状态 |
| | biz_params | | object | 与Start消息中的biz_params相同,传递对话系统自定义参数 |

对话状态说明(DialogState)

voicechat服务有IDLE、LISTENING、THINKING、RESPONDING四个状态,分别代表:

IDLE("Idle"):系统尚未建立连接。
LISTENING("Listening"):表示机器人正在监听用户输入,用户可以发送音频。
THINKING("Thinking"):表示机器人正在思考。
RESPONDING("Responding"):表示机器人正在生成语音回复。

对话LLM输出结果

onRespondingContent

| 一级参数 | 二级参数 | 类型 | 说明 |
| --- | --- | --- | --- |
| output | event | string | 事件名称,如:RequestAccepted |
| | dialog_id | string | 对话ID |
| | round_id | string | 本轮交互的ID |
| | llm_request_id | string | 调用LLM的request_id |
| | text | string | 系统对外输出的文本,流式全量输出 |
| | spoken | string | 合成语音时使用的文本,流式全量输出 |
| | finished | bool | 输出是否结束 |
| | extra_info | object | 其他扩展信息,目前支持:commands(命令字符串)、agent_info(智能体信息)、tool_calls(插件返回的信息)、dialog_debug(对话debug信息)、timestamps(链路中各节点时间戳) |
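
onRespondingContent回调中收到的content大致形如(仅为示意,具体字段以实际返回为准):

{
  "event": "...",
  "dialog_id": "...",
  "round_id": "...",
  "llm_request_id": "...",
  "text": "你好,有什么可以帮你?",
  "spoken": "你好,有什么可以帮你?",
  "finished": false,
  "extra_info": { "commands": "[...]" }
}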

调用交互时序图

(图:调用交互时序图)

调用示例

1、初始化对话参数

调用MultiModalRequestParam类中的各子类的builder方法构建参数。

MultiModalRequestParam params =
        MultiModalRequestParam.builder()
            .customInput(
                MultiModalRequestParam.CustomInput.builder()
                    .build())
            .upStream(
                MultiModalRequestParam.UpStream.builder()
                    .mode("push2talk")
                    .audioFormat("pcm")
                    .build())
            .downStream(
                MultiModalRequestParam.DownStream.builder()
                    .voice("longxiaochun_v2")
                    .sampleRate(48000)
                    .build())
            .dialogAttributes(
                MultiModalRequestParam.DialogAttributes.builder().prompt("你好").build())
            .clientInfo(
                MultiModalRequestParam.ClientInfo.builder()
                    .userId("1234")
                    .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                    .build())
            .apiKey("sk-9d11****************************")
            .model("multimodal-dialog")
            .build();

2、实例化回调函数

public static MultiModalDialogCallback getCallback() {
    return new MultiModalDialogCallbackImpl();
  }

  public static class MultiModalDialogCallbackImpl extends MultiModalDialogCallback {
    ...
  }

3、创建MultiModalDialog对象

conversation = new MultiModalDialog(params, getCallback());

完整调用示例

建议使用 Maven 或 Gradle 构建工具运行代码示例,并在相应的配置文件(Maven 的 pom.xml 或 Gradle 的 build.gradle)中添加所需的依赖项,包括 Gson、OkHttp、RxJava、Apache Log4j、Lombok 和 JUnit。

配置文件

具体配置可能因所使用的 Maven 或 Gradle 版本而异,以下内容仅供参考,请根据实际环境进行相应调整。

Maven

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.alibaba.nls</groupId>
    <artifactId>bailian-demo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>dashscope-sdk-java</artifactId>
            <version>2.20.5</version>
        </dependency>
        <dependency>
            <groupId>io.reactivex.rxjava2</groupId>
            <artifactId>rxjava</artifactId>
            <version>2.2.21</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.26</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-engine</artifactId>
            <version>5.10.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-api</artifactId>
            <version>5.10.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit-pioneer</groupId>
            <artifactId>junit-pioneer</artifactId>
            <version>1.9.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>io.javalin</groupId>
            <artifactId>javalin</artifactId>
            <version>4.4.0</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.9</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>2.0.7</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>2.0.7</version>
        </dependency>
        <!--  dependent by okhttp -->
        <dependency>
            <groupId>org.jetbrains.kotlin</groupId>
            <artifactId>kotlin-stdlib-jdk8</artifactId>
            <version>1.8.21</version>
        </dependency>
        <!--  dependent by okhttp -->
        <dependency>
            <groupId>com.squareup.okio</groupId>
            <artifactId>okio</artifactId>
            <version>3.6.0</version>
        </dependency>
        <dependency>
            <groupId>com.squareup.okhttp3</groupId>
            <artifactId>mockwebserver</artifactId>
            <version>4.12.0</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.squareup.okhttp3</groupId>
            <artifactId>logging-interceptor</artifactId>
            <version>4.12.0</version>
        </dependency>
        <dependency>
            <groupId>com.squareup.okhttp3</groupId>
            <artifactId>okhttp-sse</artifactId>
            <version>4.12.0</version>
        </dependency>
        <dependency>
            <groupId>com.squareup.okhttp3</groupId>
            <artifactId>okhttp</artifactId>
            <version>4.12.0</version>
        </dependency>
        <!--  function call support  -->
        <dependency>
            <groupId>com.github.victools</groupId>
            <artifactId>jsonschema-generator</artifactId>
            <version>4.31.1</version>
        </dependency>
    </dependencies>
    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

</project>

Gradle

plugins {
    id 'java'
}

group = 'com.test'
version = '1.0-SNAPSHOT'

repositories {
    mavenCentral()
    maven {
        url 'https://maven.aliyun.com/repository/public'
    }
    maven {
        url 'https://plugins.gradle.org/m2/'
    }
}


dependencies {
    // Dashscope SDK
    implementation 'com.alibaba:dashscope-sdk-java:2.20.5'
    // 第三方库
    // https://mvnrepository.com/artifact/com.google.code.gson/gson
    implementation 'com.google.code.gson:gson:2.13.1'
    // https://mvnrepository.com/artifact/io.reactivex.rxjava2/rxjava
    implementation 'io.reactivex.rxjava2:rxjava:2.2.21'
    // https://mvnrepository.com/artifact/com.squareup.okhttp3/okhttp
    implementation 'com.squareup.okhttp3:okhttp:4.12.0'          // 核心 OkHttp
    implementation 'com.squareup.okhttp3:logging-interceptor:4.12.0' // 日志拦截器


    // 日志相关依赖
    implementation 'org.apache.logging.log4j:log4j-core:2.17.1'
    implementation 'org.apache.logging.log4j:log4j-api:2.17.1'
    // 如果你使用 SLF4J 门面
    implementation 'org.apache.logging.log4j:log4j-slf4j-impl:2.17.1'

    // 测试相关依赖
    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.10.0'
    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.10.0'
    testImplementation 'org.apache.logging.log4j:log4j-core:2.17.1'
    testImplementation 'org.apache.logging.log4j:log4j-slf4j-impl:2.17.1'
    testAnnotationProcessor 'org.projectlombok:lombok:1.18.30'

    // Lombok 支持
    implementation 'org.projectlombok:lombok:1.18.30'
    annotationProcessor 'org.projectlombok:lombok:1.18.30'
}

test {
    useJUnitPlatform()
}

Java代码

TestMultiModalDialog.java

import com.alibaba.dashscope.multimodal.MultiModalDialog;
import com.alibaba.dashscope.multimodal.MultiModalDialogCallback;
import com.alibaba.dashscope.multimodal.MultiModalRequestParam;
import com.alibaba.dashscope.multimodal.State;
import com.alibaba.dashscope.utils.Constants;
import com.alibaba.dashscope.utils.JsonUtils;
import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.parallel.Execution;
import org.junit.jupiter.api.parallel.ExecutionMode;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

import static java.lang.Thread.sleep;

@Execution(ExecutionMode.SAME_THREAD)
class TestMultiModalDialog {
    static State.DialogState currentState;
    static MultiModalDialog conversation;
    static int enterListeningTimes = 0;
    static FileWriterUtil fileWriterUtil;
    static boolean vqaUseUrl = false;
    static String workspaceId = "your_workspace_id"; // 替换为您的Workspace ID
    static String appId = "your_app_id"; // 替换为您的 APP ID
    static String dialogId = "";
    static String apiKey = "your_api_key"; // 替换为您的API Key
    static String model = "multimodal-dialog";
    static Logger log = LoggerFactory.getLogger(TestMultiModalDialog.class);

    @BeforeAll
    public static void before() {
        System.out.println("############ before Test Multimodal ############");
        Constants.baseWebsocketApiUrl = "wss://dashscope.aliyuncs.com/api-ws/v1/inference";
        log.info("baseWebsocketApiUrl: {}", Constants.baseWebsocketApiUrl);
        fileWriterUtil = new FileWriterUtil();
    }

    @BeforeEach
    void setUp() {
        enterListeningTimes = 0;
    }


    @Test
    void testDuplicate(){
        for (int i = 0; i < 2; i++) {
            testMultimodalPush2Talk();
        }
    }

    @Test
    void testMultimodalPush2Talk() {
    /*
      step1. 设置push2talk
      step2. 等待Listening
      step3. startSpeech
      step4. 发送音频
      step5. stopSpeech
      step6. 等待Responding
      */
        System.out.println("############ Start Test Push2Talk ############");


        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(
                                MultiModalRequestParam.CustomInput.builder()
                                        .workspaceId(workspaceId)
                                        .appId(appId)
                                        .build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("push2talk")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234333333344432")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.debug("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.startSpeech();
        // 发送音频
        String filePath = "./src/test/resources/1_plus_1.wav";
        // 循环读取3200bytes 直到文件结束
        File fileIn = new File(filePath);
        try {
            AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(fileIn);
            byte[] audioBytes = new byte[3200];
            int numBytesRead;
            // 每次读取3200字节,以100ms间隔发送,模拟实时音频流
            while ((numBytesRead = audioInputStream.read(audioBytes)) != -1) {
                sleep(100);
                conversation.sendAudioData(ByteBuffer.wrap(audioBytes, 0, numBytesRead));
            }
            audioInputStream.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        conversation.stopSpeech();

        while (enterListeningTimes < 2) {
            try {
                sleep(2000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.stop();
        try {
            sleep(10000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("############ End Test Push2Talk ############");
    }

    @Test
    void testMultimodalTap2Talk() {
    /*
      step1. 设置tap2talk
      step2. 等待Listening
      step3. startSpeech ## 注意tap2talk需要手动start_speech
      step4. 发送音频
      step6. 等待Responding
      */
        System.out.println("############ Start Test Tap2talk ############");

        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(MultiModalRequestParam.CustomInput.builder().dialogId("").build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("tap2talk")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.debug("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }

        conversation.startSpeech();
        // 发送音频
        String filePath = "./src/test/resources/1_plus_1.wav";
        // 循环读取3200bytes 直到文件结束
        File fileIn = new File(filePath);
        try {
            AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(fileIn);
            byte[] audioBytes = new byte[3200];
            int numBytesRead;
            // 每次读取3200字节,以100ms间隔发送,模拟实时音频流
            while ((numBytesRead = audioInputStream.read(audioBytes)) != -1) {
                sleep(100);
                conversation.sendAudioData(ByteBuffer.wrap(audioBytes, 0, numBytesRead));
            }
            audioInputStream.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }

        while (enterListeningTimes < 2) {
            try {
                sleep(2000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.stop();
        try {
            sleep(1000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("############ End Test Tap2talk ############");
    }


    @Test
    void testMultimodalDuplex() {
    /*
      step1. 设置duplex
      step2. 等待Listening
      step4. 发送音频
      step6. 等待Responding
      */
        System.out.println("############ Start Test duplex ############");

        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(MultiModalRequestParam.CustomInput.builder().dialogId("").build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("duplex")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.info("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }


        // 发送音频
        String filePath = "./src/test/resources/1_plus_1.wav";
        // 循环读取3200bytes 直到文件结束
        File fileIn = new File(filePath);
        try {
            AudioInputStream audioInputStream = AudioSystem.getAudioInputStream(fileIn);
            byte[] audioBytes = new byte[3200];
            int numBytesRead;
            // 每次读取3200字节,以100ms间隔发送,模拟实时音频流
            while ((numBytesRead = audioInputStream.read(audioBytes)) != -1) {
                sleep(100);
                conversation.sendAudioData(ByteBuffer.wrap(audioBytes, 0, numBytesRead));
            }
            audioInputStream.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }

        while (enterListeningTimes < 2) {
            try {
                sleep(2000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.stop();
        try {
            sleep(1000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("############ End Test duplex ############");
    }

    @Test
    void testMultimodalTextSynthesizer() {
    /*
      step1. 发送文本合成tts
      step2. 保存tts音频
      */
        System.out.println("############ Start Test TextSynthesizer ############");

        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(
                                MultiModalRequestParam.CustomInput.builder()
                                        .workspaceId(workspaceId)
                                        .appId(appId)
                                        //                        .dialogId("a3e34a6f9cfd4995aaa64e8362b0efea")
                                        .build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("push2talk")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.debug("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        try {
            fileWriterUtil.createFile("./test.pcm");
        } catch (IOException e) {
            log.error(e.getMessage());
            throw new RuntimeException(e);
        }
        conversation.requestToRespond("transcript","幸福是一种技能,是你摒弃了外在多余欲望后的内心平和。",null);


        // 增加交互流程等待
        while (enterListeningTimes < 2) {
            try {
                sleep(2000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        try {
            fileWriterUtil.finishWriting();
        } catch (IOException e) {
            log.error(e.getMessage());
            throw new RuntimeException(e);
        }
        conversation.stop();

        System.out.println("############ End Test TextSynthesizer ############");
    }



    @Test
    void testMultimodalVQA() {
    /*
      step1. 发送”看看前面有什么东西“,onRespondingContent 返回visual_qa 指令
      step2. 发送图片列表
      step3. 返回图片的对话结果
      */
        System.out.println("############ Start Test VQA ############");
        vqaUseUrl = true;

        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(
                                MultiModalRequestParam.CustomInput.builder()
                                        .workspaceId(workspaceId)
                                        .appId(appId)
                                        .build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("push2talk")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.debug("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.requestToRespond("prompt","拍照看看前面有什么东西",null);
        // 增加交互流程等待
        while (enterListeningTimes < 3) {
            try {
                sleep(2000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.stop();
        try {
            sleep(1000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("############ End Test Push2Talk ############");
    }

    @Test
    void testMultimodalVQABase64() {
    /*
      step1. 发送”看看前面有什么东西“,onRespondingContent 返回visual_qa 指令
      step2. 发送图片列表
      step3. 返回图片的对话结果
      */
        System.out.println("############ Start Test testMultimodalVQABase64 ############");
        vqaUseUrl = false;

        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(
                                MultiModalRequestParam.CustomInput.builder()
                                        .workspaceId(workspaceId)
                                        .appId(appId)
                                        .build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("push2talk")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.debug("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.requestToRespond("prompt","拍照看看前面有什么东西",null);
        // 增加交互流程等待
        while (enterListeningTimes < 3) {
            try {
                sleep(2000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.stop();
        try {
            sleep(1000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("############ End Test testMultimodalVQABase64 ############");
    }

    @Test
    void testMultimodalAdjustBrightness() {
    /*
      step1. 发送"调高亮度到10",触发亮度调节指令
      step2. 等待交互流程结束
      */
        System.out.println("############ Start Test AdjustBrightness ############");
        vqaUseUrl = true;

        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(
                                MultiModalRequestParam.CustomInput.builder()
                                        .workspaceId(workspaceId)
                                        .appId(appId)
                                        .build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("push2talk")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.debug("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.requestToRespond("prompt","调高亮度到10",null);


        JsonObject status = new JsonObject();
        JsonObject bluetooth_announcement = new JsonObject();
        bluetooth_announcement.addProperty("status","stopped");
        status.add("bluetooth_announcement",bluetooth_announcement);

        MultiModalRequestParam.UpdateParams updateParams = MultiModalRequestParam.UpdateParams.builder()
                .clientInfo(MultiModalRequestParam.ClientInfo.builder()
                        .status(status)
                        .build())
                .build();

//    conversation.updateInfo(updateParams);
        // 增加交互流程等待
        while (enterListeningTimes < 2) {
            try {
                sleep(1000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        System.out.println("############ before stop ############");
        conversation.stop();
        try {
            sleep(1000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("############ End Test Push2Talk ############");
    }


    @Test
    void testMultimodalAgentDJ() {
    /*
      step1. 发送”播放新闻电台“
      */
        System.out.println("############ Start Test testMultimodalAgentDJ ############");
        vqaUseUrl = false;

        MultiModalRequestParam params =
                MultiModalRequestParam.builder()
                        .customInput(
                                MultiModalRequestParam.CustomInput.builder()
                                        .workspaceId(workspaceId)
                                        .appId(appId)
                                        .build())
                        .upStream(
                                MultiModalRequestParam.UpStream.builder()
                                        .mode("push2talk")
                                        .audioFormat("pcm")
                                        .build())
                        .downStream(
                                MultiModalRequestParam.DownStream.builder()
                                        .voice("longxiaochun_v2")
                                        .sampleRate(48000)
                                        .build())
                        .clientInfo(
                                MultiModalRequestParam.ClientInfo.builder()
                                        .userId("1234")
                                        .device(MultiModalRequestParam.ClientInfo.Device.builder().uuid("1234").build())
                                        .build())
                        .apiKey(apiKey)
                        .model(model)
                        .build();

        log.debug("params: {}", JsonUtils.toJson(params));

        conversation = new MultiModalDialog(params, getCallback());

        conversation.start();
        while (currentState != State.DialogState.LISTENING) {
            try {
                sleep(100);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.requestToRespond("prompt","打开雷鸟电台",null);
        // 增加交互流程等待
        while (enterListeningTimes < 3) {
            try {
                sleep(2000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        conversation.stop();
        try {
            sleep(1000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("############ End Test testMultimodalAgentDJ ############");
    }


    public static MultiModalDialogCallback getCallback() {
        return new MultiModalDialogCallbackImpl();
    }

    public static class MultiModalDialogCallbackImpl extends MultiModalDialogCallback {
        @Override
        public void onConnected() {}

        @Override
        public void onStarted(String dialogId) {

            log.info("onStarted: {}, self {}", dialogId, this);
        }

        @Override
        public void onStopped(String dialogId) {
            log.info("onStopped: {}", dialogId);
        }

        @Override
        public void onSpeechStarted(String dialogId) {
            log.info("onSpeechStarted: {}", dialogId);
        }

        @Override
        public void onSpeechEnded(String dialogId) {
            log.info("onSpeechEnded: {}", dialogId);
        }
        @Override
        public void onError(String dialogId, String errorCode, String errorMsg) {
            log.error("onError: {}, {}, {}, self {}", dialogId, errorCode, errorMsg, this);
            enterListeningTimes++ ; //force quit dialog test
        }

        @Override
        public void onStateChanged(State.DialogState state) {
            log.info("onStateChanged: {}", state);
            currentState = state;
            if (currentState == State.DialogState.LISTENING) {
                enterListeningTimes++;
                log.info("enterListeningTimes: {}", enterListeningTimes);
            }
        }

        @Override
        public void onSpeechAudioData(ByteBuffer audioData) {
            try {
                if (fileWriterUtil!= null){
                    fileWriterUtil.Writing(audioData);
                }
            } catch (IOException e) {
                log.error(e.getMessage());
                throw new RuntimeException(e);
            }
        }

        @Override
        public void onRespondingStarted(String dialogId) {
            log.info("onRespondingStarted: {}", dialogId);
            conversation.localRespondingStarted();
        }

        @Override
        public void onRespondingEnded(String dialogId, JsonObject content) {
            log.info("onRespondingEnded: {}, {}", dialogId, content);
            conversation.localRespondingEnded();
        }

        @Override
        public void onRespondingContent(String dialogId, JsonObject content) {
            log.info("onRespondingContent: {}, {}", dialogId, content);
            if (content.has("extra_info")) {
                JsonObject extraInfo = content.getAsJsonObject("extra_info");
                if (extraInfo.has("commands")) {
                    String commandsStr = extraInfo.get("commands").getAsString();
                    log.info("commandsStr: {}", commandsStr);
                    //"[{\"name\":\"visual_qa\",\"params\":[{\"name\":\"shot\",\"value\":\"拍照看看\",\"normValue\":\"True\"}]}]"
                    JsonArray commands = new Gson().fromJson(commandsStr, JsonArray.class);
                    for (JsonElement command : commands) {
                        JsonObject commandObj = command.getAsJsonObject();
                        if (commandObj.has("name")) {
                            String commandStr = commandObj.get("name").getAsString();
                            if (commandStr.equals("visual_qa")) {
                                log.info("拍照了!!!!");
                                MultiModalRequestParam.UpdateParams updateParams = MultiModalRequestParam.UpdateParams.builder()
                                        .images(getMockOSSImage())
                                        .build();
                                conversation.requestToRespond("prompt","",updateParams);
                            }
                        }
                    }
                }
            }
        }

        @Override
        public void onSpeechContent(String dialogId, JsonObject content) {
            log.info("onSpeechContent: {}, {}", dialogId, content);
        }

        @Override
        public void onRequestAccepted(String dialogId) {
            log.info("onRequestAccepted: {}", dialogId);
        }

        @Override
        public void onClosed() {
            log.info("onClosed , self {}", this);
            enterListeningTimes++ ;
        }
    }

    public static List<Object> getMockOSSImage() {
        JsonObject imageObject = new JsonObject();
        JsonObject extraObject = new JsonObject();
        List<Object> images = new ArrayList<>();
        try{
            if (vqaUseUrl){
                imageObject.addProperty("type", "url");
                imageObject.addProperty("value", "https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7043267371/p909896.png");
                imageObject.addProperty("bucket", "bucketName");
                imageObject.add("extra", extraObject);
            }else {
                imageObject.addProperty("type", "base64");
                imageObject.addProperty("value", getLocalImageBase64());
            }
            images.add(imageObject);

        }catch (Exception e){
            e.printStackTrace();
        }
        return images;
    }

    public static String getLocalImageBase64() {
        // 图片文件路径
        String imagePath = "./src/test/resources/jpeg-bridge.jpg";

        try {
            FileInputStream fileInputStream = new FileInputStream(new File(imagePath));
            byte[] bytes = new byte[fileInputStream.available()];
            fileInputStream.read(bytes);
            fileInputStream.close();

            return Base64.getEncoder().encodeToString(bytes);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

FileWriterUtil.java

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/**
 * 文件写入工具类。
 * 支持创建文件、开始写入数据、追加写入数据和结束写入操作。
 * 使用 ByteBuffer 存储数据并将数据写入指定的文件中。
 */
public class FileWriterUtil {

    private File file;
    private FileOutputStream fos;
    private FileChannel fc;

    /**
     * 创建一个新的文件。
     *
     * @param filePath 文件路径
     */
    public void createFile(String filePath) throws IOException {
        file = new File(filePath);
        if (!file.exists()) {
            file.createNewFile();
        }

        fos = new FileOutputStream(file, true);
        fc = fos.getChannel();
    }

    /**
     * 写入数据。
     *
     * @param data 要追加的数据
     */
    public void Writing(ByteBuffer data) throws IOException {
        fc.position(fc.size());
        fc.write(data);
    }

    /**
     * 结束写入操作。
     */
    public void finishWriting() throws IOException {
        if (fc != null) {
            fc.close();
        }

        if (fos != null) {
            fos.close();
        }
    }
}

更多SDK接口使用说明

VQA交互

VQA 是在对话过程中通过发送图片,实现"图片+语音"多模交互的功能。

核心过程是:语音或者文本请求中的拍照意图会触发"visual_qa"拍照指令。

当收到拍照指令后,发送图片链接或者base64数据(图片大小需小于180KB)。

  • 建联后发起拍照请求。

@Test
  void testMultimodalVQA() {
    /*
      step1. 发送”看看前面有什么东西“,onRespondingContent 返回visual_qa 指令
      step2. 发送图片列表
      step3. 返回图片的对话结果
      */
    System.out.println("############ Start Test VQA ############");
    vqaUseUrl = true;

    log.debug("params: {}", JsonUtils.toJson(params));

    conversation = new MultiModalDialog(params, getCallback());

    conversation.start();
    while (currentState != State.DialogState.LISTENING) {
      try {
        sleep(100);
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
    conversation.requestToRespond("prompt","拍照看看前面有什么东西",null);
    // 增加交互流程等待
    while (enterListeningTimes < 3) {
      try {
        sleep(2000);
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
    conversation.stop();
    try {
      sleep(1000);
    } catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
    System.out.println("############ End Test Push2Talk ############");
  }
  • 处理"visual_qa"指令并上传拍照图片。

@Override
    public void onRespondingContent(String dialogId, JsonObject content) {
      log.info("onRespondingContent: {}, {}", dialogId, content);
      if (content.has("extra_info")) {
        JsonObject extraInfo = content.getAsJsonObject("extra_info");
        if (extraInfo.has("commands")) {
          String commandsStr = extraInfo.get("commands").getAsString();
          log.info("commandsStr: {}", commandsStr);
          //"[{\"name\":\"visual_qa\",\"params\":[{\"name\":\"shot\",\"value\":\"拍照看看\",\"normValue\":\"True\"}]}]"
          JsonArray commands = new Gson().fromJson(commandsStr, JsonArray.class);
          for (JsonElement command : commands) {
            JsonObject commandObj = command.getAsJsonObject();
            if (commandObj.has("name")) {
              String commandStr = commandObj.get("name").getAsString();
              if (commandStr.equals("visual_qa")) {
                log.info("拍照了!!!!");
                MultiModalRequestParam.UpdateParams updateParams = MultiModalRequestParam.UpdateParams.builder()
                        .images(getMockOSSImage())
                        .build();
                conversation.requestToRespond("prompt","",updateParams);
              }
            }
          }
        }
      }
    }

//上传拍照
public static List<Object> getMockOSSImage() {
    JsonObject imageObject = new JsonObject();
    JsonObject extraObject = new JsonObject();
    List<Object> images = new ArrayList<>();
    try{
      if (vqaUseUrl){
        imageObject.addProperty("type", "url");
        imageObject.addProperty("value", "https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7043267371/p909896.png");
        imageObject.addProperty("bucket", "bucketName");
        imageObject.add("extra", extraObject);
      }else {
        imageObject.addProperty("type", "base64");
        imageObject.addProperty("value", getLocalImageBase64());
      }
      images.add(imageObject);

    }catch (Exception e){
      e.printStackTrace();
    }
    return images;
  }

文本合成TTS

SDK支持通过文本直接请求服务端合成音频。

您需要在客户端处于Listening状态下发送requestToRespond请求。

若当前状态非Listening,需要先调用interrupt接口打断当前播报。
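
一个示意写法(假设conversation已完成建联):

if (conversation.getDialogState() != State.DialogState.LISTENING) {
    conversation.interrupt(); // 打断当前播报,等待状态回到Listening
}
conversation.requestToRespond("transcript", "要合成的文本", null);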

  @Test
  void testMultimodalTextSynthesizer() {
    /*
      step1. 发送文本合成tts
      step2. 保存tts音频
      */
    System.out.println("############ Start Test TextSynthesizer ############");

    // params 构建与前文完整示例相同,此处省略
    conversation = new MultiModalDialog(params, getCallback());

    conversation.start();
    while (currentState != State.DialogState.LISTENING) {
      try {
        sleep(100);
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
    try {
      fileWriterUtil.createFile("./test.pcm");
    } catch (IOException e) {
      log.error(e.getMessage());
      throw new RuntimeException(e);
    }
    conversation.requestToRespond("transcript","幸福是一种技能,是你摒弃了外在多余欲望后的内心平和。",null);


    // 增加交互流程等待
    while (enterListeningTimes < 2) {
      try {
        sleep(2000);
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
    try {
      fileWriterUtil.finishWriting();
    } catch (IOException e) {
      log.error(e.getMessage());
      throw new RuntimeException(e);
    }
    conversation.stop();

    System.out.println("############ End Test TextSynthesizer ############");
  }

通过Websocket请求LiveAI

LiveAI(视频通话)是百炼多模交互提供的官方Agent。通过Java SDK发送图片序列,可以实现视频通话功能。我们推荐您的服务端和客户端(网页或者APP)通过RTC传输视频和音频,再将服务端采集到的视频帧以每500毫秒一帧的速度发送给SDK,同时保持实时的音频输入。

注意:LiveAI发送图片只支持base64编码,每张图片的大小需在180KB以下。
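
发送前可以先校验图片大小,超限时建议先压缩再编码。以下为示意写法(isImageSizeOk为本示例自拟的方法名):

// 示意:校验本地图片是否小于180KB
static boolean isImageSizeOk(String imagePath) throws java.io.IOException {
    byte[] bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(imagePath));
    return bytes.length < 180 * 1024;
}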

  • LiveAI调用时序

(图:LiveAI调用时序图)

  • 关键代码示例

完整代码请参考 Github示例代码

public void testMultimodalLiveAI() {
        log.info("############ Starting LiveAI Test ############");

        try {
            // Build request parameters for duplex mode
            MultiModalRequestParam params = buildBaseRequestParams("duplex");
            log.info("Request parameters: {}", JsonUtils.toJson(params));

            // Initialize conversation
            conversation = new MultiModalDialog(params, getCallback());
            conversation.start();

            // Start send video frame loop
            startVideoFrameStreaming();
            // Wait for listening state
            waitForListeningState();
            conversation.sendHeartBeat();
            // Send video channel connect request; the service responds with command: switch_video_call_success
            conversation.requestToRespond("prompt", "", connectVideoChannelRequest());

            // Send audio data directly (no manual start needed for duplex)
            sendAudioFromFile("./src/main/resources/what_in_picture.wav");

            // Wait for listening state twice
            waitForListeningState();

            sendAudioFromFile("./src/main/resources/what_color.wav");

            // Wait for conversation completion
            waitForConversationCompletion(3);

            // Clean up
            stopConversation();

            isVideoStreamingActive = false;
            if (videoStreamingThread != null && videoStreamingThread.isAlive()) {
                videoStreamingThread.interrupt();
            }

        } catch (Exception e) {
            log.error("Error in LiveAI test: ", e);
        } finally {
            log.info("############ LiveAI Test Completed ############");
        }
    }

/**
     * Build video channel connection request for LiveAI
     *
     * @return UpdateParams containing video channel connection configuration
     */
    private MultiModalRequestParam.UpdateParams connectVideoChannelRequest(){

        Map<String, String> video = new HashMap<>();
        video.put("action", "connect");
        video.put("type", "voicechat_video_channel");
        ArrayList<Map<String, String>> videos = new ArrayList<>();
        videos.add(video);
        MultiModalRequestParam.BizParams bizParams = MultiModalRequestParam.BizParams.builder().videos(videos).build();

        return MultiModalRequestParam.UpdateParams.builder().bizParams(bizParams).build();
    }

/**
     * Start continuous video frame streaming for LiveAI
     */
    private void startVideoFrameStreaming() {
        log.info("Starting continuous video frame streaming for LiveAI...");

        vqaUseUrl = false;
        isVideoStreamingActive = true;

        videoStreamingThread = new Thread(() -> {
            try {
                while (isVideoStreamingActive && !Thread.currentThread().isInterrupted()) {
                    Thread.sleep(VIDEO_FRAME_INTERVAL_MS);

                    MultiModalRequestParam.UpdateParams videoUpdate =
                            MultiModalRequestParam.UpdateParams.builder()
                                    .images(getMockImageRequest())
                                    .build();

                    if (conversation != null && isVideoStreamingActive) {
                        conversation.updateInfo(videoUpdate);
                        log.debug("Video frame sent to LiveAI");
                    }
                }
            } catch (InterruptedException e) {
                log.info("Video streaming thread interrupted - stopping video stream");
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                log.error("Error in video streaming thread: ", e);
            } finally {
                log.info("Video streaming thread terminated");
            }
        });

        videoStreamingThread.setDaemon(true);
        videoStreamingThread.setName("LiveAI-VideoStreaming");
        videoStreamingThread.start();

        log.info("Video streaming thread started successfully");
    }

自定义提示词变量和传值

  • 在管控台项目的【提示词】中配置自定义变量。

如下图示例,定义了一个user_name字段代表用户昵称,并将变量user_name以占位符形式${user_name}插入到Prompt中。

(图:提示词自定义变量配置示例)

  • 在代码中设置变量。

如下示例,设置 "user_name" = "大米":

//在请求参数构建中传入biz_params.user_prompt_params
MultiModalRequestParam params =
        MultiModalRequestParam.builder()
                .bizParams(
                        MultiModalRequestParam.BizParams.builder()
                                .userPromptParams(setUserPromptParams())
                                .build())
                .apiKey(apiKey)
                .model(model)
                .build();

private HashMap<String, String> setUserPromptParams() {
    HashMap<String, String> promptSettings = new HashMap<>();
    promptSettings.put("user_name", "大米");
    return promptSettings;
}
  • 请求回复

(图:请求回复效果示例)

更多使用示例

请参考 Github示例代码