(Charged the first time a cloned voice is used for speech synthesis.)
None
MiniMax/speech-02-hd | CNY 3.5
MiniMax/speech-2.8-turbo | CNY 2
MiniMax/speech-02-turbo | CNY 2
URL
China (Beijing)
HTTP request URL: POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
Base URL for SDK calls: https://dashscope.aliyuncs.com/api/v1
Important
Model Studio has released a workspace-specific domain for the Singapore region: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from https://dashscope-intl.aliyuncs.com to the new domain.
{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.
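As a minimal sketch, the workspace-specific domain can be assembled from a workspace ID as shown here. The function name workspace_base_url is hypothetical, and the snippet builds only the domain quoted above; this section does not state which request path to append, so none is assumed.

```python
def workspace_base_url(workspace_id: str) -> str:
    """Build the Singapore workspace-specific domain from a workspace ID."""
    workspace_id = workspace_id.strip()
    if not workspace_id:
        raise ValueError("workspace_id must not be empty")
    # Domain pattern quoted in the note above; no path is appended here.
    return f"https://{workspace_id}.ap-southeast-1.maas.aliyuncs.com"
```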
Headers
Parameter | Type | Required | Description
Authorization | string | Yes | The authentication token. The format is Bearer <your_api_key>. Replace <your_api_key> with your actual API key.
Content-Type | string | Yes | The media type of the request body. This must be set to application/json.
X-DashScope-SSE | string | No | Enables Server-Sent Events (SSE) for streaming output. Set to enable to receive results incrementally as a stream, which is ideal for real-time applications. If this header is omitted, the API responds synchronously.
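The headers described above could be assembled along these lines. This is a sketch only: build_headers and its stream argument are hypothetical names, and the function creates a plain dictionary without sending a request.

```python
def build_headers(api_key: str, stream: bool = False) -> dict:
    """Return the request headers: Authorization, Content-Type, optional SSE."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if stream:
        # Opt in to Server-Sent Events; omit the header for a synchronous reply.
        headers["X-DashScope-SSE"] = "enable"
    return headers
```

Passing stream=True adds X-DashScope-SSE: enable; leaving it at the default omits the header, which corresponds to the synchronous behavior.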
Controls whether the audio field in the final synthesis-complete frame returns the full concatenated audio of the entire synthesis (i.e., the concatenation of all in-progress chunks as a single hex string).
Valid values:
false: The audio field of the final frame returns the full concatenated audio. The client can directly consume the complete audio at stream end without having to concatenate the preceding chunks itself.
true: The audio field of the final frame is an empty string. The client must concatenate the audio from all preceding in-progress chunks in order to obtain the complete audio. This significantly reduces the size and transfer time of the final frame, and is suitable for clients that already play chunks as they arrive or accumulate audio on their own.
Note
This parameter takes effect only in streaming output mode; it has no effect in non-streaming mode.
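A client that wants to handle both settings of this flag might collect audio as sketched below. The function and argument names are hypothetical, and the sketch assumes the caller has already extracted the hex-encoded audio strings from each streamed frame.

```python
def assemble_audio(chunk_hex: list, final_frame_hex: str) -> bytes:
    """Return the complete audio bytes once a stream has finished.

    chunk_hex: hex strings from the in-progress frames, in arrival order.
    final_frame_hex: the audio field of the synthesis-complete frame.
    """
    if final_frame_hex:
        # Flag was false: the final frame already carries the full audio.
        return bytes.fromhex(final_frame_hex)
    # Flag was true: the final frame is empty, so join the chunks in order.
    return b"".join(bytes.fromhex(chunk) for chunk in chunk_hex)
```

Decoding each chunk before joining keeps the result correct even if the chunks are later processed individually, for example for playback as they arrive.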
voice_setting object (Required)
Properties
voice_id string (Required)
Specifies the voice ID.
To blend multiple voices, configure the timbre_weights parameter and leave this parameter empty.
speed float (Optional) Default: 1.0
Specifies the speech rate.
Value range: [0.5, 2.0].
vol float (Optional) Default: 1.0
Specifies the volume.
Value range: (0.0, 10.0].
pitch integer (Optional) Default: 0
Specifies the pitch.
Value range: [-12, 12].
emotion string (Optional) No default value
Specifies the emotion. The model automatically detects the appropriate emotion from the input text, so you typically do not need to set this parameter manually.
Valid values:
happy: Happy
sad: Sad
angry: Angry
fearful: Fearful
disgusted: Disgusted
surprised: Surprised
calm: Calm
whisper: Whisper
Note
The speech-2.8-hd and speech-2.8-turbo models do not support whisper.
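The ranges and the whisper restriction listed above can be checked on the client before a request is sent. The following is a sketch under stated assumptions: build_voice_setting is a hypothetical helper, and it assumes a model name may arrive with a "MiniMax/" prefix that should be ignored when comparing against the note.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful",
            "disgusted", "surprised", "calm", "whisper"}
NO_WHISPER_MODELS = {"speech-2.8-hd", "speech-2.8-turbo"}


def build_voice_setting(voice_id, model, speed=1.0, vol=1.0, pitch=0, emotion=None):
    """Return a voice_setting dict after checking the documented ranges."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be in [0.5, 2.0]")
    if not 0.0 < vol <= 10.0:
        raise ValueError("vol must be in (0.0, 10.0]")
    if not -12 <= pitch <= 12:
        raise ValueError("pitch must be in [-12, 12]")
    setting = {"voice_id": voice_id, "speed": speed, "vol": vol, "pitch": pitch}
    if emotion is not None:
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion: {emotion}")
        # Assumption: strip an optional "MiniMax/" prefix before comparing.
        if emotion == "whisper" and model.split("/")[-1] in NO_WHISPER_MODELS:
            raise ValueError(f"{model} does not support whisper")
        setting["emotion"] = emotion
    return setting
```

Because the model detects emotion automatically, the helper leaves the emotion key out unless a value is passed explicitly.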
Specifies whether to enable text normalization for Chinese and English. Enabling this improves performance in scenarios that involve reading numbers, but it might slightly increase latency.
Valid values:
true: Enable
false: Disable
latex_read boolean (Optional) Default: false
Specifies whether to enable reading of LaTeX formulas.
Valid values:
true: Enable
false: Disable
Example:
x = (-b ± √(b² - 4ac)) / (2a)
The preceding formula should be represented as: $$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$.
Note
This feature is supported only for Chinese. When enabled, the language_boost parameter is automatically set to Chinese.
Formulas in the request must be enclosed in $$.
If a formula in the request contains a backslash ("\"), it must be escaped as "\\".
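When the request body is produced by a JSON serializer, the backslash escaping happens automatically. The sketch below illustrates this with Python's standard json module; the field name text and the helper name are assumptions for illustration, not taken from this section.

```python
import json


def encode_formula_text(text: str) -> str:
    """Serialize text containing LaTeX into a JSON fragment."""
    count = text.count("$$")
    if count < 2 or count % 2:
        raise ValueError("each formula must be enclosed in a pair of $$")
    # json.dumps writes every backslash as a doubled backslash in the output.
    return json.dumps({"text": text})


# A raw string keeps the single backslashes of the LaTeX source intact.
formula = r"$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$"
print(encode_formula_text(formula))
```

The printed body contains the doubled backslashes required by the note, so manual escaping is only needed when the JSON is written by hand.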
audio_setting object (Optional)
Properties
sample_rate integer (Optional) Default: 32000
Specifies the sample rate (in Hz) of the generated audio.
Valid values:
8000
16000
22050
24000
32000
44100
bitrate integer (Optional) Default: 128000
Specifies the bitrate (in bps) of the generated audio.
Valid values:
32000
64000
128000
256000
Note
This parameter applies only when the format parameter is set to mp3. For other formats, this parameter is ignored.
format string (Optional) Default: mp3
Specifies the format of the generated audio.
Valid values:
mp3
pcm
flac
wav
channel integer (Optional) Default: 1
Specifies the number of audio channels for the generated audio.
Valid values:
1: Mono
2: Stereo
force_cbr boolean (Optional) Default: false
Specifies whether to encode the audio at a constant bitrate (CBR).
Valid values:
true: Yes
false: No
Note
This parameter is effective only for streaming output and when the audio format is mp3.
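The interactions among format, bitrate, and force_cbr can be captured in a small builder. This is an illustrative sketch: build_audio_setting and its stream argument are hypothetical, and it simply drops the fields that the notes above say are ignored.

```python
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "pcm", "flac", "wav"}


def build_audio_setting(fmt="mp3", sample_rate=32000, bitrate=128000,
                        channel=1, force_cbr=False, stream=False):
    """Return an audio_setting dict containing only the fields that apply."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")
    setting = {"format": fmt, "sample_rate": sample_rate, "channel": channel}
    if fmt == "mp3":
        # bitrate applies only to mp3 output.
        if bitrate not in BITRATES:
            raise ValueError(f"unsupported bitrate: {bitrate}")
        setting["bitrate"] = bitrate
        # force_cbr applies only to streaming mp3 output.
        if force_cbr and stream:
            setting["force_cbr"] = True
    return setting
```

Omitting inapplicable fields is a design choice for readability; according to the notes, sending them would have no effect.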
pronunciation_dict object (Optional)
Properties
tone string[] (Optional)
Defines custom pronunciation or replacement rules for specific words or symbols.
Use a forward slash (/) as a separator.
For Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second, 3 for the third, 4 for the fourth, and 5 for the neutral tone.
Example:
["Yan Shaofei/(yan4)(shao3)(fei1)", "omg/oh my god"]
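The rule strings in the example above follow a simple pattern that can be generated rather than typed by hand. The helpers below are hypothetical and serve only to illustrate the separator and tone-number conventions.

```python
def pinyin_with_tones(syllables):
    """Format (syllable, tone) pairs as '(yan4)(shao3)...'; tone 5 is neutral."""
    parts = []
    for syllable, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):
            raise ValueError(f"tone must be 1-5, got {tone}")
        parts.append(f"({syllable}{tone})")
    return "".join(parts)


def tone_rule(source: str, replacement: str) -> str:
    """Join the original text and its pronunciation with a forward slash."""
    if not source or not replacement:
        raise ValueError("source and replacement must be non-empty")
    return f"{source}/{replacement}"


pronunciation_dict = {
    "tone": [
        tone_rule("Yan Shaofei",
                  pinyin_with_tones([("yan", 4), ("shao", 3), ("fei", 1)])),
        tone_rule("omg", "oh my god"),
    ]
}
```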
timbre_weights object[] (Optional)
Specifies the weights for blending multiple voices. You can blend up to four voices.
Properties
voice_id string (Required)
Specifies the voice ID.
weight integer (Required)
Specifies the weight of the voice. This must be set along with voice_id. A higher weight makes the synthesized voice sound more similar to the corresponding voice.
Value range: [1, 100].
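A blend of voices could be validated and built as sketched here. The helper name and the voice IDs in the usage line are placeholders, not real identifiers.

```python
def build_timbre_weights(weights: dict) -> list:
    """Turn {voice_id: weight} into the timbre_weights list, with checks."""
    if not 1 <= len(weights) <= 4:
        raise ValueError("blend between one and four voices")
    result = []
    for voice_id, weight in weights.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(weight, bool) or not isinstance(weight, int):
            raise ValueError("weight must be an integer")
        if not 1 <= weight <= 100:
            raise ValueError("weight must be in [1, 100]")
        result.append({"voice_id": voice_id, "weight": weight})
    return result


# Placeholder voice IDs; a higher weight pulls the blend toward that voice.
blend = build_timbre_weights({"voice-a": 70, "voice-b": 30})
```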
language_boost string (Optional) Default: null
Improves synthesis quality for the specified low-resource language or dialect. Set to auto to let the model decide automatically.
Valid values:
Chinese
Chinese,Yue
English
Arabic
Russian
Spanish
French
Portuguese
German
Turkish
Dutch
Ukrainian
Vietnamese
Indonesian
Japanese
Italian
Korean
Thai
Polish
Romanian
Greek
Czech
Finnish
Hindi
Bulgarian
Danish
Hebrew
Malay
Persian
Slovak
Swedish
Croatian
Filipino
Hungarian
Norwegian
Slovenian
Catalan
Nynorsk
Tamil
Afrikaans
auto
voice_modify object (Optional)
Modifies audio characteristics or applies sound effects. This parameter is supported for the following audio formats:
Non-streaming: mp3, wav, flac
Streaming: mp3
Properties
pitch integer (Optional) No default value
Sets the pitch (deep/bright).
Lower values make the voice deeper; higher values make it brighter.
Value range: [-100, 100].
intensity integer (Optional) No default value
Sets the intensity (powerful/soft).
Lower values make the voice more powerful; higher values make it softer.
Value range: [-100, 100].
timbre integer (Optional) No default value
Sets the timbre brightness (richer/crisp).
Lower values make the voice richer; higher values make it crisper.
Value range: [-100, 100].
sound_effects string (Optional) No default value
Sets the sound effect.
Valid values:
spacious_echo: Spacious Echo
auditorium_echo: Auditorium Echo
lofi_telephone: Lo-Fi Telephone
robotic: Robotic
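The format restriction and value ranges for voice_modify can be enforced before sending a request. The sketch below uses a hypothetical build_voice_modify helper that takes the audio format and a flag for streaming as inputs.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}
# Audio formats that support voice_modify, keyed by whether output is streamed.
MODIFY_FORMATS = {False: {"mp3", "wav", "flac"}, True: {"mp3"}}


def build_voice_modify(fmt, stream, pitch=None, intensity=None,
                       timbre=None, sound_effects=None):
    """Return a voice_modify dict holding only the values that were set."""
    if fmt not in MODIFY_FORMATS[bool(stream)]:
        mode = "streaming" if stream else "non-streaming"
        raise ValueError(f"voice_modify is not supported for {mode} {fmt}")
    modify = {}
    for name, value in (("pitch", pitch), ("intensity", intensity),
                        ("timbre", timbre)):
        if value is None:
            continue  # No default value: leave the key out entirely.
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be in [-100, 100]")
        modify[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unknown sound effect: {sound_effects}")
        modify["sound_effects"] = sound_effects
    return modify
```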
subtitle_enable boolean (Optional) Default: false
Specifies whether to enable subtitle generation.
Valid values:
true: Yes
false: No
Note
This parameter is valid only in non-streaming output scenarios and only for the speech-2.8-hd, speech-2.8-turbo, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, and speech-01-turbo models.
output_format string (Optional) Default: hex
Specifies the format of the output result.
Valid values:
url: The synthesized audio is returned as a URL, which is valid for 24 hours.
hex: The synthesized audio is returned in hexadecimal-encoded binary format.
Note
This parameter applies only to non-streaming mode. Streaming mode always returns results in hex format.
aigc_watermark boolean (Optional) Default: false
Specifies whether to add an inaudible AIGC watermark to the end of the synthesized audio.
Valid values:
true: Yes
false: No
Note
This parameter applies only to non-streaming output.
The percentage of invalid characters. If the percentage is 10% or less, the audio is generated successfully and this value is returned. If it exceeds 10%, the request fails.
The number of spoken words, including Chinese characters, numbers, and letters, but excluding punctuation marks.
trace_id string
Include this ID in feedback or support requests to aid troubleshooting.
base_resp object
audio string
The synthesized audio data is hex-encoded, and its format is consistent with the output format specified by the output_format parameter in the request.
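Turning the hex-encoded audio field back into playable bytes takes a single standard-library call. The helper name below is hypothetical, and the sketch assumes the hex string has already been read from the response.

```python
def decode_audio_hex(audio_hex: str) -> bytes:
    """Convert the hex-encoded audio field into raw audio bytes."""
    try:
        return bytes.fromhex(audio_hex)
    except ValueError as exc:
        raise ValueError("audio field is not valid hexadecimal") from exc
```

The returned bytes can be written to a file opened in binary mode, with an extension that matches the requested audio format.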