AIAgentConfig - Alibaba Cloud Help Center

object

The parameters for the AI agent template.

Greeting

string

The welcome message the AI agent plays when joining the session. Changes apply to subsequent sessions. If omitted, no welcome message is played.

你好

WakeUpQuery

string

A user-provided command that the AI agent responds to immediately after the call starts.

今天天气怎么样？

MaxIdleTime

integer

The maximum idle duration in seconds before the AI agent disconnects. If the agent receives no user interaction within this period, it ends the task. Default: 600.

600

UserOnlineTimeout

integer

The duration in seconds the AI agent waits for a user to join. If the user does not join within this time, the agent terminates the task. Default: 60.

60

UserOfflineTimeout

integer

The duration in seconds the AI agent waits before terminating the task after a user leaves the session. Default: 5.

5

EnablePushToTalk

boolean

Specifies whether to enable push-to-talk mode. Default: false.

false

GracefulShutdown

boolean

Specifies whether to enable graceful shutdown. Default: false.

If enabled, the AI agent completes its current utterance before disconnecting when the task is stopped. The agent will not speak for more than 10 seconds.

false

Volume

integer

The speaking volume of the AI agent.

If not set, the adaptive volume mode recommended by Alibaba Cloud is used by default.
If set, the value must be in the range of 0 to 400. The final output volume is calculated as: (Workflow volume) * (volume / 100). For example:

If volume is 0, the output volume is 0.
If volume is 100, the output volume is the same as the original volume.
If volume is 200, the output volume is twice the original volume.

100
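The volume formula above can be checked with a short calculation. This is a minimal sketch: the function name output_volume and the sample workflow volume of 80.0 are illustrative assumptions, not part of the API.

```python
def output_volume(workflow_volume: float, volume: int) -> float:
    """Scale the workflow volume by the Volume parameter (0 to 400)."""
    if not 0 <= volume <= 400:
        raise ValueError("Volume must be in the range 0 to 400")
    # Formula from the description: (Workflow volume) * (volume / 100)
    return workflow_volume * (volume / 100)


# Hypothetical workflow volume of 80.0, scaled by three Volume settings.
for v in (0, 100, 200):
    print(v, output_volume(80.0, v))
```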

WorkflowOverrideParams

string

A JSON string containing parameters to override the default workflow configuration.

{}

AvatarUrl

string

The URL of the avatar to display during voice calls. If omitted, no avatar is displayed.

http://example.com/a.jpg

AvatarUrlType

string

The type of the avatar URL. By default, this parameter is not set.

USER

EnableIntelligentSegment

boolean

Specifies whether to enable intelligent segmentation. When enabled, short user utterances are merged into a single sentence. Default: true.

true

AsrConfig

object

Configuration for automatic speech recognition (ASR).

AsrLanguageId

string

The language for ASR. Valid values:

zh_mandarin: Chinese (Mandarin)
en: English
zh_en: Chinese-English mixed
es: Spanish
jp: Japanese

zh_mandarin

AsrMaxSilence

integer

The maximum duration of silence in milliseconds before the ASR engine finalizes an utterance. A pause longer than this value signals a sentence break. Range: 200–1200. Default: 400.

400

AsrHotWords

array

A list of hotwords to improve ASR accuracy. You can specify a maximum of 128 hotwords.

string

A hotword string. Length: 1 to 10 characters.

检查

VadLevel

integer

The Voice Activity Detection (VAD) threshold for interruptions. Range: 0–11. Default: 11.

0: Disables VAD.
1–10: Sets the interruption sensitivity. A higher value makes the agent harder to interrupt.
11: An enhanced mode with lower audio distortion and stronger noise resistance.

11

CustomParams

string

Passthrough parameters for proprietary ASR integrations.

mode=fast&sample=16000&format=wav

VadDuration

integer

The minimum duration in milliseconds of continuous user speech required to trigger an interruption. This controls interruption sensitivity. Range: 200–2000, or 0 to disable this feature. A common setting is 200–500 ms, which typically corresponds to 1 to 4 Chinese characters. If omitted, this feature is disabled.

300
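The ranges documented for the AsrConfig fields can be checked on the client before a request is sent. The following is a minimal sketch; the helper validate_asr_config is hypothetical, and the service may apply additional checks that are not described here.

```python
def validate_asr_config(cfg: dict) -> list:
    """Return a list of problems found in an AsrConfig dictionary."""
    errors = []

    silence = cfg.get("AsrMaxSilence")
    if silence is not None and not 200 <= silence <= 1200:
        errors.append("AsrMaxSilence must be 200-1200 ms")

    hot_words = cfg.get("AsrHotWords", [])
    if len(hot_words) > 128:
        errors.append("AsrHotWords accepts at most 128 hotwords")
    for word in hot_words:
        if not 1 <= len(word) <= 10:
            errors.append(f"hotword {word!r} must be 1-10 characters")

    vad_level = cfg.get("VadLevel")
    if vad_level is not None and not 0 <= vad_level <= 11:
        errors.append("VadLevel must be 0-11")

    # 0 disables VadDuration; other values must fall within 200-2000 ms.
    vad_duration = cfg.get("VadDuration")
    if vad_duration not in (None, 0) and not 200 <= vad_duration <= 2000:
        errors.append("VadDuration must be 0 or 200-2000 ms")

    return errors


asr_config = {
    "AsrLanguageId": "zh_mandarin",
    "AsrMaxSilence": 400,
    "AsrHotWords": ["检查"],
    "VadLevel": 11,
    "VadDuration": 300,
}
print(validate_asr_config(asr_config))
```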

TtsConfig

object

Configuration for text-to-speech (TTS).

VoiceId

string

The ID of the preset TTS voice. Changes apply to the next utterance. If omitted, the voice from the AI agent template is used. The ID can be a maximum of 64 characters. For available voices, see Intelligent Voice Samples.

longcheng_v2

VoiceIdList

array

A list of available voices.

string

A voice.

zhixiaoxia

PronunciationRules

array

A list of TTS pronunciation rules, executed in order. You can specify a maximum of 20 rules.

object

A TTS pronunciation rule.

Word

string

The word to be replaced. It must be 1 to 9 Chinese characters long and cannot contain spaces.

一一零

Pronunciation

string

The replacement pronunciation. It must be 1 to 9 Chinese characters long and cannot contain spaces.

幺幺零

Type

string

The type of pronunciation rule. Valid value:

replacement: Replaces the specified Word with the Pronunciation.

replacement
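To illustrate how an ordered list of replacement rules could transform text before synthesis, here is a minimal sketch. It assumes plain substring replacement; the actual matching performed by the TTS engine is not documented here, and apply_pronunciation_rules is a hypothetical helper.

```python
def apply_pronunciation_rules(text: str, rules: list) -> str:
    """Apply replacement rules in list order (assumed plain substring matching)."""
    if len(rules) > 20:
        raise ValueError("PronunciationRules accepts at most 20 rules")
    for rule in rules:
        if rule["Type"] == "replacement":
            text = text.replace(rule["Word"], rule["Pronunciation"])
    return text


rules = [{"Word": "一一零", "Pronunciation": "幺幺零", "Type": "replacement"}]
print(apply_pronunciation_rules("请拨打一一零", rules))
```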

ModelId

string

This parameter applies only to the Minimax provider. Valid values: speech-01-turbo, speech-02-turbo.

speech-01-turbo

LanguageId

string

This parameter applies only to the Minimax provider. It enhances recognition for specific low-resource languages and dialects. If the language is unknown, set this to auto for automatic detection. By default, this parameter is not set. Supported values include:

Chinese
Chinese,Yue: Cantonese
English
Arabic
Russian
Spanish
French
Portuguese
German
Turkish
Dutch
Ukrainian
Vietnamese
Indonesian
Japanese
Italian
Korean
Thai
Polish
Romanian
Greek
Czech
Finnish
Hindi
auto

Chinese

Emotion

string

This parameter applies only to the Minimax provider. Supported emotions include:

happy
sad
angry
fearful
disgusted
surprised
calm

happy

SpeechRate

number

The speech rate, where a value of 1.0 is normal speed. The supported range can vary by provider. For CosyVoice, the range is 0.5 to 2.0 (default: 1.0). For Minimax, the range is 0.5 to 2.0 (default: 1.0).

1.0

LlmConfig

object

Configuration for the large language model (LLM).

LlmHistory

array

The conversation history context for the LLM/MLLM.

object

A single conversational turn.

Role

string

The role of the participant in the conversation. Valid values:

user
assistant
system
function
plugin
tool

user

Content

string

The text content of the message from this role.

你好

LlmHistoryLimit

integer

The maximum number of recent conversational turns to include in the LLM/MLLM context. Default: 10.

10

LlmSystemPrompt

string

The system prompt for the LLM after the call starts.

你是一位友好且乐于助人的助手，专注于为用户提供准确的信息和建议。

BailianAppParams

string

Parameters for Alibaba Cloud Model Studio, provided as a JSON string. For the parameter format, see Alibaba Cloud Model Studio Parameters.

"{\"biz_params\":{\"user_defined_params\":{\"your_plugin_id\":{\"article_index\":2}}},\"memory_id\":\"your_memory_id\",\"image_list\":[\"https://your_image_url\"],\"rag_options\":{\"pipeline_ids\":[\"your_id\"],\"file_ids\":[\"文档ID1\",\"文档ID2\"],\"metadata_filter\":{\"name\":\"张三\"},\"structured_filter\":{\"key1\":\"value1\",\"key2\":\"value2\"},\"tags\":[\"标签1\",\"标签2\"]}}"

OpenAIExtraQuery

string

Additional query parameters for an OpenAI-compatible LLM. Parameters must be provided as a URL query string (e.g., key1=value1&key2=value2). All values must be strings.

api-version=2024-02-01&api-key=sk-xxx
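BailianAppParams and OpenAIExtraQuery are both plain strings that carry structured data, so it is usually safer to build them with a serializer than by hand. The sketch below uses only the Python standard library; the parameter values are the placeholders from the examples above, and the two strings are shown independently because they target different LLM types.

```python
import json
from urllib.parse import urlencode

# BailianAppParams: serialize a dictionary to a JSON string.
# ensure_ascii=False keeps non-ASCII values such as tags readable.
bailian_params = {
    "memory_id": "your_memory_id",
    "rag_options": {"pipeline_ids": ["your_id"], "tags": ["标签1", "标签2"]},
}
bailian_app_params = json.dumps(bailian_params, ensure_ascii=False)

# OpenAIExtraQuery: encode string values as a URL query string.
extra_query = urlencode({"api-version": "2024-02-01", "api-key": "sk-xxx"})

print(bailian_app_params)
print(extra_query)
```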

LlmCompleteReply

boolean

When set to true, the agent sends the entire LLM response in a single message after it is fully generated, rather than streaming it. This setting does not affect the streaming of subtitles.

true

FunctionMap

array

Maps built-in agent functions to custom LLM functions. Currently, this only supports function calling for custom, OpenAI-compatible LLMs.

object

A single mapping rule.

Function

string

The name of a built-in function provided by the AI agent system. Currently, only hangup is supported.

hangup

MatchFunction

string

The name of the custom LLM function that maps to the agent's built-in function. For details on the custom LLM protocol, see LLM Standard Interface.

hangup

OutputMinLength

integer

The minimum number of characters in a text chunk before it is sent to the TTS engine. Shorter chunks are buffered. Range: 0–100. A value of 0 or omitting this parameter disables buffering. Default: Not set.

5

OutputMaxDelay

integer

The maximum delay in milliseconds before buffered text is sent to the TTS engine, even if OutputMinLength is not met. Range: 1000–10000. A value of 0 or omitting this parameter disables the delay limit. Default: Not set.

2000
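The interaction between OutputMinLength and OutputMaxDelay can be pictured as a small buffer that flushes when either condition is met. This is an illustrative sketch only: the class TtsChunkBuffer and its methods are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the agent's real buffering logic may differ.

```python
class TtsChunkBuffer:
    """Buffer text until it is long enough or has waited long enough."""

    def __init__(self, min_length: int, max_delay_ms: int):
        self.min_length = min_length      # OutputMinLength (0 disables buffering)
        self.max_delay_ms = max_delay_ms  # OutputMaxDelay (0 disables the delay limit)
        self._text = ""
        self._first_ts = None

    def push(self, text: str, now_ms: int):
        """Add text; return a chunk for TTS, or None to keep buffering."""
        if not self._text:
            self._first_ts = now_ms  # start timing from the first buffered piece
        self._text += text
        return self.poll(now_ms)

    def poll(self, now_ms: int):
        """Flush if the length or delay condition is met at time now_ms."""
        if not self._text:
            return None
        long_enough = len(self._text) >= self.min_length
        waited = (self.max_delay_ms > 0
                  and now_ms - self._first_ts >= self.max_delay_ms)
        if long_enough or waited:
            chunk, self._text, self._first_ts = self._text, "", None
            return chunk
        return None


buf = TtsChunkBuffer(min_length=5, max_delay_ms=2000)
print(buf.push("Hi", 0))       # too short, stays buffered
print(buf.push("! ok", 500))   # reaches the minimum length, flushed
```

With min_length set to 0, every push flushes immediately, which matches the documented behavior that 0 disables buffering.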

HistorySyncWithTTS

boolean

Specifies whether the LLM message history is synchronized with the content played by the TTS. Default: false. When enabled, the saved LLM messages match the content actually played by the TTS.

Note

When a user interrupts the agent, the <ims_agent_interrupted> tag is inserted into the message history at the point of interruption. This affects the next message sent to the LLM. For example:

[
  {"role": "user", "content": "Tell me a story."},
  {"role": "assistant", "content": "Okay, I can tell you a story about the Three Kingdoms. Would you<ims_agent_interrupted> like that?"},
  {"role": "user", "content": "Tell me a different one."}
]

false
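If your own LLM service receives this history, it may need to detect the tag. The sketch below assumes that the text before <ims_agent_interrupted> is the part that was played, as the note implies; the helper names are hypothetical.

```python
INTERRUPT_TAG = "<ims_agent_interrupted>"


def was_interrupted(message: dict) -> bool:
    """Return True if the message content contains the interruption tag."""
    return INTERRUPT_TAG in message.get("content", "")


def spoken_part(content: str) -> str:
    """Return the text before the tag (assumed to be what the user heard)."""
    return content.split(INTERRUPT_TAG, 1)[0]


history = [
    {"role": "user", "content": "Tell me a story."},
    {"role": "assistant", "content": "Okay, I can tell you a story about the "
     "Three Kingdoms. Would you<ims_agent_interrupted> like that?"},
    {"role": "user", "content": "Tell me a different one."},
]

for message in history:
    if was_interrupted(message):
        print(spoken_part(message["content"]))
```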

AvatarConfig

object

Configuration for the avatar. This takes effect only if the workflow includes an avatar node.

AvatarId

string

The model ID of the avatar.

5257

InterruptConfig

object

Configuration for the speech interruption policy.

EnableVoiceInterrupt

boolean

Specifies whether to enable speech interruption. Default: true.

true

InterruptWords

array

A list of specific words or phrases that trigger an interruption.

string

A specific word or phrase that triggers an interruption.

打断一下

NoInterruptMode

string

Specifies how to handle user speech that occurs during a non-interruptible section of the agent's utterance.

cache: Caches the user's speech and processes it in the next conversational turn.
discard: Discards the user's speech.

Default: cache.

cache

KeepInterruptWordsForLLM

boolean

Specifies whether to include the interrupt words in the text sent to the LLM. Default: false (words are discarded).

Note

For example, if "hold on" is an interrupt word and the user says "hold on, what is the weather like today?", setting this to false results in only "what is the weather like today?" being sent to the LLM.

true
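The effect of KeepInterruptWordsForLLM on the example in the note can be sketched as follows. This assumes the interrupt word appears at the start of the utterance and that the punctuation right after it is dropped; the exact stripping rules used by the agent are not documented here, and text_for_llm is a hypothetical helper.

```python
def text_for_llm(user_text: str, interrupt_words: list, keep: bool) -> str:
    """Return the text forwarded to the LLM for a given setting."""
    if keep:
        return user_text
    for word in interrupt_words:
        if user_text.startswith(word):
            # Drop the interrupt word and any separator that follows it.
            return user_text[len(word):].lstrip(" ,，")
    return user_text


utterance = "hold on, what is the weather like today?"
print(text_for_llm(utterance, ["hold on"], keep=False))
print(text_for_llm(utterance, ["hold on"], keep=True))
```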

VoiceprintConfig

object

Configuration for voiceprint recognition.

UseVoiceprint

boolean

Specifies whether to enable voiceprint recognition. Default: false. If set to true, you must also provide a valid VoiceprintId.

false

VoiceprintId

string

The unique identifier for the voiceprint. This is not set by default. The ID must correspond to a voiceprint registered using the voiceprint registration API. For more information, see Register a voiceprint.

zhixiaoxia

RegistrationMode

string

The voiceprint registration mode. Default: Explicit.

Explicit: The user must register their voiceprint in advance by using the voiceprint registration API.
Implicit: The system automatically collects user speech during the conversation to register a voiceprint.

Explicit

TurnDetectionConfig

object

Configuration for conversational turn detection.

TurnEndWords

array

A list of keywords used to determine the end of a user's conversational turn.

string

A keyword used to determine the end of a user's conversational turn.

我说完了

Mode

string

The conversational turn detection mode.

Normal (Default): The agent relies on pauses to detect the end of a user's turn.
Semantic: The agent uses AI to analyze conversational context to determine if the user has finished speaking.

Semantic

SemanticWaitDuration

integer

The pause detection time in Semantic mode, in milliseconds. Default: -1.

-1: The AI automatically determines a suitable wait time.
0–10000: A custom wait time. A range of 0–1500 ms is recommended.

Note

This parameter has no effect in Normal mode.

-1

Eagerness

string

Controls the agent's response speed after detecting a user pause. This parameter applies only in Semantic mode. A higher setting results in a faster response but increases the risk of interrupting the user:

Low: Waits patiently with a maximum wait time of 6 seconds, reducing the risk of interruption.
Medium: A balanced wait time (up to 4 seconds), suitable for most scenarios.
High: Responds quickly (up to 2 seconds), which improves speed but may increase the risk of incorrect turn-taking.

This field is empty by default.

High
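The TurnDetectionConfig constraints described above can be checked with a small client-side validator. This is a sketch under the documented ranges only; validate_turn_detection and MAX_WAIT_SECONDS are hypothetical names, and the service may enforce further rules.

```python
# Maximum wait time in seconds per Eagerness level, as listed above.
MAX_WAIT_SECONDS = {"Low": 6, "Medium": 4, "High": 2}


def validate_turn_detection(cfg: dict) -> list:
    """Return a list of problems found in a TurnDetectionConfig dictionary."""
    errors = []

    mode = cfg.get("Mode", "Normal")
    if mode not in ("Normal", "Semantic"):
        errors.append("Mode must be Normal or Semantic")

    # -1 lets the system choose; otherwise the value must be 0-10000 ms.
    wait = cfg.get("SemanticWaitDuration", -1)
    if wait != -1 and not 0 <= wait <= 10000:
        errors.append("SemanticWaitDuration must be -1 or 0-10000 ms")

    eagerness = cfg.get("Eagerness")
    if eagerness is not None and eagerness not in MAX_WAIT_SECONDS:
        errors.append("Eagerness must be Low, Medium, or High")

    return errors


config = {"Mode": "Semantic", "SemanticWaitDuration": -1, "Eagerness": "High"}
print(validate_turn_detection(config))
print(MAX_WAIT_SECONDS[config["Eagerness"]])
```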

ExperimentalConfig

string

Parameters for experimental features. Contact support for assistance.

""

VcrConfig

object

Configuration for video content recognition. This enables the system to send callbacks to the client about events detected in the video stream.

StillFrameMotion

object

Configuration for still frame detection.

Enabled

boolean

Specifies whether to enable still frame detection. Default: false.

false

CallbackDelay

integer

The duration in milliseconds that a frame must remain still before a notification is sent. If not specified, the setting from the console is used. Range: 200–5000.

3000

InvalidFrameMotion

object

Configuration for invalid frame detection.

Enabled

boolean

Specifies whether to enable invalid frame detection.

false

CallbackDelay

integer

The duration in milliseconds that an invalid frame must persist before a notification is sent. If not specified, the setting from the console is used. Range: 200–5000.

3000

PeopleCount

object

Configuration for the people counting feature.

Enabled

boolean

Specifies whether to enable people counting. Default: false.

false

Equipment

object

Configuration for device identification.

Enabled

boolean

Specifies whether to enable device identification. Default: false.

false

HeadMotion

object

Configuration for head motion detection.

Enabled

boolean

Specifies whether to enable head motion detection. Default: false.

false

LookAway

object

Configuration for look-away detection.

Enabled

boolean

Specifies whether to enable look-away detection. Default: false.

true

AmbientSoundConfig

object

Configuration for ambient sound during the call.

ResourceId

string

The ID of the ambient sound resource. You can obtain this ID from the advanced settings of the agent configuration in the console.

f67901c595834************

Volume

integer

The volume of the ambient sound. Range: 0–100. A value of 0 disables the sound.

50

AutoSpeechConfig

object

Configuration for the agent's automatic speech, including prompts for LLM latency and long periods of user silence.

UserIdle

object

Configuration for prompts to play when the user is silent for an extended period.

WaitTime

integer

The silence duration threshold in milliseconds. If the user is silent for longer than this period, a prompt is triggered. Range: 5000–600000. This is a required field.

5000

MaxRepeats

integer

The maximum number of times the prompt can be repeated. Range: 0–10. This is a required field. If the limit is exceeded, the call is terminated.

5

Messages

array

A collection of prompt messages. A maximum of 10 messages are supported, each up to 100 characters. The sum of all probabilities must be 100%.

object

The structure of a prompt message.

Text

string

The text of the prompt message, up to 100 characters.

您还在吗？

Probability

number

The probability of this message being selected. Range: 0–1, corresponding to 0%–100%.

0.5

HangupEndWord

string

A farewell message played before hanging up due to user inactivity.
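The Messages constraints for UserIdle (at most 10 messages, each up to 100 characters, probabilities summing to 100%) and the weighted selection they imply can be sketched as follows. The helper names are hypothetical, the second message text is an invented sample, and the agent's actual selection algorithm is not documented here.

```python
import math
import random


def validate_messages(messages: list) -> None:
    """Raise ValueError if the Messages list violates the documented limits."""
    if len(messages) > 10:
        raise ValueError("at most 10 messages are supported")
    for message in messages:
        if len(message["Text"]) > 100:
            raise ValueError("each message may be up to 100 characters")
        if not 0 <= message["Probability"] <= 1:
            raise ValueError("Probability must be in the range 0 to 1")
    total = sum(message["Probability"] for message in messages)
    if not math.isclose(total, 1.0):
        raise ValueError("probabilities must sum to 1 (100%)")


def pick_message(messages: list, rng: random.Random) -> str:
    """Pick one message text, weighted by its Probability."""
    validate_messages(messages)
    texts = [message["Text"] for message in messages]
    weights = [message["Probability"] for message in messages]
    return rng.choices(texts, weights=weights, k=1)[0]


idle_messages = [
    {"Text": "您还在吗？", "Probability": 0.5},
    {"Text": "Are you still there?", "Probability": 0.5},  # invented sample
]
print(pick_message(idle_messages, random.Random()))
```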

LlmPending

object

Configuration for prompts to play during LLM response latency.

WaitTime

integer

The wait time threshold in milliseconds for LLM responses. If the LLM takes longer than this to respond, a prompt is played. Range: 500–10000. This is a required field. Set this value based on the actual performance of your LLM.

3000

Mode

string

The mode for handling LLM latency prompts. This is a required field. Valid values:

random: Plays a random message from the list.
sequence: Plays messages in order.

Messages

array

A collection of prompt messages. A maximum of 10 messages are supported, each up to 100 characters. The sum of all probabilities must be 100%.

object

The structure of a prompt message.

Text

string

The text of the prompt message, up to 100 characters.

稍等一下

Probability

number

The probability of this message being selected. Range: 0–1, corresponding to 0%–100%.

0.5

BackChannelingConfigs

array

Configuration for back-channeling. When enabled, the system plays short, responsive phrases at specific trigger points.

object

A single back-channeling configuration.

Enabled

boolean

Specifies whether to enable this back-channeling rule. This is a required field.

true

TriggerStage

string

The trigger for the back-channeling. Valid value:

pause_detected: Triggered when a short pause in speech is detected.

pause_detected

Probability

number

The trigger probability. Range: 0.0–1.0. This is a required field.

0.5

Words

array

A collection of acknowledgment phrases. You can specify a maximum of 10 phrases. Each phrase must be 20 characters or less, and the sum of their probabilities must be 1.0.

object

Configuration for a responsive phrase.

Text

string

The text of the phrase. Maximum length: 20 characters. Multiple languages are supported. This is a required field.

嗯嗯

Probability

number

The trigger probability of this phrase. Range: 0.0–1.0. This is a required field.

0.3
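A back-channeling rule carries two levels of probability: the rule-level Probability and the per-phrase Probability in Words. One plausible reading is a two-stage draw, sketched below. This interpretation is an assumption, maybe_back_channel is a hypothetical helper, and the second and third phrases are invented samples.

```python
import random


def maybe_back_channel(config: dict, rng: random.Random):
    """Return a phrase to play at a detected pause, or None."""
    if not config.get("Enabled"):
        return None
    # Stage 1 (assumed): decide whether the rule fires at all.
    if rng.random() >= config["Probability"]:
        return None
    # Stage 2 (assumed): pick one phrase, weighted by its Probability.
    words = config["Words"]
    texts = [word["Text"] for word in words]
    weights = [word["Probability"] for word in words]
    return rng.choices(texts, weights=weights, k=1)[0]


rule = {
    "Enabled": True,
    "TriggerStage": "pause_detected",
    "Probability": 0.5,
    "Words": [
        {"Text": "嗯嗯", "Probability": 0.3},
        {"Text": "好的", "Probability": 0.3},   # invented sample
        {"Text": "I see", "Probability": 0.4},  # invented sample
    ],
}
print(maybe_back_channel(rule, random.Random()))
```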

BackChannelingConfig

array

Important: This parameter is deprecated. Use BackChannelingConfigs instead.

object

A single back-channeling configuration.

Enabled

boolean

Specifies whether to enable back-channeling. This is a required field. Valid values: true, false.

true

TriggerStage

string

The trigger for the back-channeling. Valid value:

pause_detected: Triggered when a short pause in speech is detected.

pause_detected

Probability

number

The trigger probability. Range: 0.0–1.0. This is a required field.

0.5

Words

array

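A collection of back-channeling phrases. You can specify a maximum of 10 phrases. Each phrase must be 20 characters or less, and the sum of their probabilities must be 1.0.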

object

Configuration for a back-channeling phrase.

Text

string

The text of the phrase. Maximum length: 20 characters. Multiple languages are supported. This is a required field.

嗯嗯

Probability

number

The trigger probability of this phrase. Range: 0.0–1.0. This is a required field.

0.3