MiniMax Synchronous Speech Synthesis API Reference

Supported models

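Model name	Synthesis price (per 10,000 characters)	Voice cloning price	Free quota (Note)
MiniMax/speech-2.8-hd	CNY 3.5	CNY 9.9 (Charged the first time a cloned voice is used for speech synthesis.)	None
MiniMax/speech-02-hd	CNY 3.5	CNY 9.9 (Charged the first time a cloned voice is used for speech synthesis.)	None
MiniMax/speech-2.8-turbo	CNY 2	CNY 9.9 (Charged the first time a cloned voice is used for speech synthesis.)	None
MiniMax/speech-02-turbo	CNY 2	CNY 9.9 (Charged the first time a cloned voice is used for speech synthesis.)	None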

URL

China (Beijing)

HTTP request URL: POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Base URL for SDK calls: https://dashscope.aliyuncs.com/api/v1

Important

Model Studio has released a workspace-specific domain for the Singapore region: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from https://dashscope-intl.aliyuncs.com to the new domain.

{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.

Headers

Parameter	Type	Required	Description
Authorization	string	Yes	The authentication token. The format is `Bearer <your_api_key>`. Replace `<your_api_key>` with your actual API key.
Content-Type	string	Yes	The media type of the request body. This must be set to `application/json`.
X-DashScope-SSE	string	No	Enables Server-Sent Events (SSE) for streaming output. Set to `enable` to receive results incrementally as a stream, which is ideal for real-time applications. If this header is omitted, the API responds synchronously.
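
As a rough illustration of the header table, the following Python sketch assembles the three headers. The function name `build_headers` is hypothetical and not part of any SDK; only the header names and values come from the table above.

```python
def build_headers(api_key: str, stream: bool = False) -> dict:
    """Return the request headers described in the Headers table."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if stream:
        # Omitting this header makes the API respond synchronously.
        headers["X-DashScope-SSE"] = "enable"
    return headers
```

In practice the key would typically be read from an environment variable such as `DASHSCOPE_API_KEY`, as the curl examples in this document do.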

Request body

Non-streaming

curl -X POST "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "Are you happy today? (laughs), Of course!",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    },
    "pronunciation_dict": {
      "tone": [
        "处理/(chu3)(li3)",
        "危险/dangerous"
      ]
    },
    "subtitle_enable": false
  }
}'

Streaming

curl -X POST "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "Are you happy today? (laughs), Of course!",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    },
    "pronunciation_dict": {
      "tone": [
        "处理/(chu3)(li3)",
        "危险/dangerous"
      ]
    },
    "subtitle_enable": false
  }
}'
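
With the X-DashScope-SSE header enabled, the response arrives as a Server-Sent Events stream. The Python sketch below shows one way to turn the lines of such a stream into JSON frames. It assumes that each event carries a single-line JSON document in its `data:` field, which is common SSE practice but is not spelled out in this reference; the helper name `iter_sse_json` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed JSON payload of every `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, `id:` and `event:` fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)
```

Any HTTP client that exposes the response body line by line can feed this generator.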

model string (Required)

The name of the model to use. Supported models:

MiniMax/speech-2.8-hd

MiniMax/speech-02-hd

MiniMax/speech-2.8-turbo

MiniMax/speech-02-turbo

input object (Required)

Properties

text string (Required)

The text to synthesize into speech.

The text must be less than 10,000 characters long. For text longer than 3,000 characters, streaming output is recommended.
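
The length rule can be checked on the client before sending a request. The sketch below is only an approximation: it assumes that Python's `len()` (Unicode code points) matches how the service counts characters, which this reference does not confirm. The function name `plan_request` is hypothetical.

```python
MAX_CHARS = 10_000          # text must be shorter than this
STREAM_HINT_CHARS = 3_000   # streaming is recommended above this


def plan_request(text: str) -> bool:
    """Validate the text length; return True when streaming is recommended."""
    if not text:
        raise ValueError("text is required")
    if len(text) >= MAX_CHARS:
        raise ValueError(f"text has {len(text)} characters; it must be under {MAX_CHARS}")
    return len(text) > STREAM_HINT_CHARS
```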

stream_options object (Optional)

Streaming output configuration. Effective only when the request header X-DashScope-SSE: enable is set.

Properties

exclude_aggregated_audio boolean (Optional) Default: false

Controls whether the audio field in the final synthesis-complete frame returns the full concatenated audio of the entire synthesis (i.e., the concatenation of all in-progress chunks as a single hex string).

Valid values:

false: The audio field of the final frame returns the full concatenated audio. The client can directly consume the complete audio at stream end without having to concatenate the preceding chunks itself.
true: The audio field of the final frame is an empty string. The client must concatenate the audio from all preceding in-progress chunks in order to obtain the complete audio. This significantly reduces the size and transfer time of the final frame, and is suitable for clients that already play chunks as they arrive or accumulate audio on their own.

Note

This parameter takes effect only in streaming output mode; it has no effect in non-streaming mode.
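
A client can handle both settings of exclude_aggregated_audio with the same code. The sketch below assumes the frames have already been parsed into dictionaries shaped like the response body (`output.data.audio` and `output.data.status`); the function name `assemble_audio` is hypothetical.

```python
def assemble_audio(frames: list) -> bytes:
    """Return the complete audio bytes from a list of streaming frames."""
    chunks = []
    for frame in frames:
        data = (frame.get("output") or {}).get("data") or {}
        audio_hex = data.get("audio") or ""
        status = data.get("status")
        if status == 2 and audio_hex:
            # exclude_aggregated_audio=false: the final frame holds the full audio.
            return bytes.fromhex(audio_hex)
        if status == 1 and audio_hex:
            # In-progress chunk: keep it in arrival order.
            chunks.append(bytes.fromhex(audio_hex))
    # exclude_aggregated_audio=true: the final frame was empty, so join the chunks.
    return b"".join(chunks)
```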

voice_setting object (Required)

Properties

voice_id string (Required)

Specifies the voice ID.

To blend multiple voices, configure the timbre_weights parameter and leave this parameter empty.

speed float (Optional) Default: 1.0

Specifies the speech rate.

Value range: [0.5, 2.0].

vol float (Optional) Default: 1.0

Specifies the volume.

Value range: (0.0, 10.0].

pitch integer (Optional) Default: 0

Specifies the pitch.

Value range: [-12, 12].

emotion string (Optional) No default value

Specifies the emotion. The model automatically detects the appropriate emotion from the input text, so you typically do not need to set this parameter manually.

Valid values:

happy: Happy
sad: Sad
angry: Angry
fearful: Fearful
disgusted: Disgusted
surprised: Surprised
calm: Calm
whisper: Whisper

Note

The speech-2.8-hd and speech-2.8-turbo models do not support whisper.
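
The ranges and the whisper restriction above can be checked locally before a request is sent. This is a minimal sketch; the function name `validate_voice_setting` is hypothetical, and the service remains the authority on what it accepts.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm", "whisper"}


def validate_voice_setting(model: str, setting: dict) -> list:
    """Return a list of problems; an empty list means the values are in range."""
    problems = []
    if not 0.5 <= setting.get("speed", 1.0) <= 2.0:
        problems.append("speed must be in [0.5, 2.0]")
    if not 0.0 < setting.get("vol", 1.0) <= 10.0:
        problems.append("vol must be in (0.0, 10.0]")
    if not -12 <= setting.get("pitch", 0) <= 12:
        problems.append("pitch must be in [-12, 12]")
    emotion = setting.get("emotion")
    if emotion is not None:
        if emotion not in EMOTIONS:
            problems.append(f"unknown emotion: {emotion}")
        elif emotion == "whisper" and "speech-2.8" in model:
            problems.append("whisper is not supported by speech-2.8 models")
    return problems
```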

text_normalization boolean (Optional) Default: false

Specifies whether to enable text normalization for Chinese and English. Enabling this improves performance in scenarios that involve reading numbers, but it might slightly increase latency.

Valid values:

true: Enable
false: Disable

latex_read boolean (Optional) Default: false

Specifies whether to enable reading of LaTeX formulas.

Valid values:

true: Enable
false: Disable

Example:

$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

The preceding formula should be represented as: $$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$.

Note

This feature is supported only for Chinese. When enabled, the language_boost parameter is automatically set to Chinese.
Formulas in the request must be enclosed in $$.
If a formula in the request contains a backslash ("\"), it must be escaped as "\\".
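
The sketch below shows one way to prepare such text in Python. It assumes the backslash rule refers to ordinary JSON string escaping, in which case `json.dumps` produces the doubled backslashes automatically and the formula should not be escaped by hand a second time. The helper name `latex_text` is hypothetical.

```python
import json


def latex_text(formula: str) -> str:
    """Wrap a LaTeX formula in the $$ delimiters required by latex_read."""
    return f"$${formula}$$"


# A raw string keeps single backslashes in the Python value.
text = latex_text(r"x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}")

# Serializing to JSON doubles each backslash in the request body.
encoded = json.dumps(text)
```

Here `encoded` contains `\\frac`, `\\pm`, and `\\sqrt`, matching the escaped form shown in the example above.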

audio_setting object (Optional)

Properties

sample_rate integer (Optional) Default: 32000

Specifies the sample rate (in Hz) of the generated audio.

Valid values:

8000
16000
22050
24000
32000
44100

bitrate integer (Optional) Default: 128000

Specifies the bitrate (in bps) of the generated audio. For example, 128000 corresponds to 128 kbps.

Valid values:

32000
64000
128000
256000

Note

This parameter applies only when the format parameter is set to mp3. For other formats, this parameter is ignored.

format string (Optional) Default: mp3

Specifies the format of the generated audio.

Valid values:

mp3
pcm
flac
wav

channel integer (Optional) Default: 1

Specifies the number of audio channels for the generated audio.

Valid values:

1: Mono
2: Stereo

force_cbr boolean (Optional) Default: false

Specifies whether to encode the audio at a constant bitrate (CBR).

Valid values:

true: Yes
false: No

Note

This parameter is effective only for streaming output and when the audio format is mp3.
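
The interaction between format, bitrate, and force_cbr can be captured in a small builder. This is a sketch only: the function name `build_audio_setting` is hypothetical, and placing force_cbr inside audio_setting follows the order in which the parameters are listed here.

```python
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "pcm", "flac", "wav"}


def build_audio_setting(fmt="mp3", sample_rate=32000, bitrate=128000,
                        channel=1, stream=False, force_cbr=False) -> dict:
    """Build an audio_setting object, omitting fields that would be ignored."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")
    setting = {"format": fmt, "sample_rate": sample_rate, "channel": channel}
    if fmt == "mp3":
        if bitrate not in BITRATES:
            raise ValueError(f"unsupported bitrate: {bitrate}")
        setting["bitrate"] = bitrate      # bitrate applies to mp3 only
        if stream and force_cbr:
            setting["force_cbr"] = True   # effective only for streaming mp3
    return setting
```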

pronunciation_dict object (Optional)

Properties

tone string[] (Optional)

Defines custom pronunciation or replacement rules for specific words or symbols.

Use a forward slash (/) as a separator.

For Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second, 3 for the third, 4 for the fourth, and 5 for the neutral tone.

Example:

["Yan Shaofei/(yan4)(shao3)(fei1)", "omg/oh my god"]

timbre_weights object[] (Optional)

Specifies the weights for blending multiple voices. You can blend up to four voices.

Properties

voice_id string (Required)

Specifies the voice ID.

weight integer (Required)

Specifies the weight of the voice. This must be set along with voice_id. A higher weight makes the synthesized voice sound more similar to the corresponding voice.

Value range: [1, 100].
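
A blend can be built from a simple mapping of voice IDs to weights. In this sketch the function name `build_timbre_weights` is hypothetical, and the limits enforced are the ones stated above (at most four voices, integer weights from 1 to 100).

```python
def build_timbre_weights(weights: dict) -> list:
    """Convert {voice_id: weight} into the timbre_weights array."""
    if not 1 <= len(weights) <= 4:
        raise ValueError("timbre_weights accepts between one and four voices")
    items = []
    for voice_id, weight in weights.items():
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError(f"weight for {voice_id} must be an integer in [1, 100]")
        items.append({"voice_id": voice_id, "weight": weight})
    return items
```

For example, `build_timbre_weights({"male-qn-qingse": 70, "voice-b": 30})` blends two voices, where `voice-b` stands in for a second voice ID of your own.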

language_boost string (Optional) Default: null

Specifies whether to improve synthesis quality for specific low-resource languages and dialects. Set to auto to let the model decide automatically.

Valid values:

Chinese
Chinese,Yue
English
Arabic
Russian
Spanish
French
Portuguese
German
Turkish
Dutch
Ukrainian
Vietnamese
Indonesian
Japanese
Italian
Korean
Thai
Polish
Romanian
Greek
Czech
Finnish
Hindi
Bulgarian
Danish
Hebrew
Malay
Persian
Slovak
Swedish
Croatian
Filipino
Hungarian
Norwegian
Slovenian
Catalan
Nynorsk
Tamil
Afrikaans
auto

voice_modify object (Optional)

Modifies audio characteristics or applies sound effects. This parameter is supported for the following audio formats:

Non-streaming: mp3, wav, flac
Streaming: mp3

Properties

pitch integer (Optional) No default value

Sets the pitch (deep/bright).

Lower values make the voice deeper; higher values make it brighter.

Value range: [-100, 100].

intensity integer (Optional) No default value

Sets the intensity (powerful/soft).

Lower values make the voice more powerful; higher values make it softer.

Value range: [-100, 100].

timbre integer (Optional) No default value

Sets the timbre brightness (richer/crisp).

Lower values make the voice richer; higher values make it crisper.

Value range: [-100, 100].

sound_effects string (Optional) No default value

Sets the sound effect.

Valid values:

spacious_echo: Spacious Echo
auditorium_echo: Auditorium Echo
lofi_telephone: Lo-Fi Telephone
robotic: Robotic
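
Because voice_modify depends on both the audio format and the output mode, a client-side check can catch unsupported combinations early. The following is a sketch under the constraints listed above; the function name `build_voice_modify` is hypothetical.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def build_voice_modify(audio_format: str, stream: bool, pitch=None,
                       intensity=None, timbre=None, sound_effects=None) -> dict:
    """Build a voice_modify object, rejecting unsupported formats and ranges."""
    allowed = {"mp3"} if stream else {"mp3", "wav", "flac"}
    if audio_format not in allowed:
        raise ValueError(f"voice_modify does not support {audio_format} in this mode")
    result = {}
    for name, value in (("pitch", pitch), ("intensity", intensity), ("timbre", timbre)):
        if value is None:
            continue  # no default value: leave the field out
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be in [-100, 100]")
        result[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unknown sound effect: {sound_effects}")
        result["sound_effects"] = sound_effects
    return result
```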

subtitle_enable boolean (Optional) Default: false

Specifies whether to enable subtitle generation.

Valid values:

true: Yes
false: No

Note

This parameter is valid only in non-streaming output scenarios and only for the speech-2.8-hd, speech-2.8-turbo, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, and speech-01-turbo models.

output_format string (Optional) Default: hex

Specifies the format of the output result.

Valid values:

url: The synthesized audio is returned as a URL, which is valid for 24 hours.
hex: The synthesized audio is returned in hexadecimal-encoded binary format.

Note

This parameter applies only to non-streaming mode. Streaming mode always returns results in hex format.
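
The two output formats need different handling on the client. A minimal sketch, with the hypothetical helper name `decode_audio_field`:

```python
def decode_audio_field(audio: str, output_format: str = "hex"):
    """Return raw audio bytes for hex output, or the URL string for url output."""
    if output_format == "url":
        # The URL is valid for 24 hours, so download the file before it expires.
        return audio
    # hex: every two hexadecimal characters encode one byte of audio.
    return bytes.fromhex(audio)
```

The decoded bytes can be written directly to a file whose extension matches the requested audio format.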

aigc_watermark boolean (Optional) Default: false

Specifies whether to add an inaudible AIGC watermark to the end of the synthesized audio.

Valid values:

true: Yes
false: No

Note

This parameter applies only to non-streaming output.

Response body

Non-streaming

{
  "output": {
    "base_resp": {
      "status_code": 0,
      "status_msg": "success"
    },
    "data": {
      "audio": "<hex-encoded-audio>",
      "status": 2
    },
    "extra_info": {
      "audio_channel": 1,
      "audio_format": "mp3",
      "audio_length": 3528,
      "audio_sample_rate": 16000,
      "audio_size": 58164,
      "bitrate": 128000,
      "invisible_character_ratio": 0,
      "usage_characters": 26,
      "word_count": 14
    },
    "trace_id": "05fdef92e4c1b32283e3d1c456971a80"
  },
  "usage": {
    "characters": 26
  },
  "request_id": "233b9516-1038-9697-b458-87e95a1f8108"
}

Streaming

{
  "output": {
    "base_resp": {
      "status_code": 0,
      "status_msg": "success"
    },
    "data": {
      "audio": "<hex-encoded-audio>",
      "status": 2
    },
    "extra_info": {
      "audio_channel": 1,
      "audio_format": "mp3",
      "audio_length": 3528,
      "audio_sample_rate": 16000,
      "audio_size": 58164,
      "bitrate": 128000,
      "invisible_character_ratio": 0,
      "usage_characters": 26,
      "word_count": 14
    },
    "trace_id": "05fdef92e4c1b32283e3d1c456971a80"
  },
  "usage": {
    "characters": 26
  },
  "request_id": "233b9516-1038-9697-b458-87e95a1f8108"
}

request_id string

A unique identifier for this API call.

output object

The data returned by the model.

Properties

base_resp object

The status code and details of this request.

Properties

status_code integer

The status code.

0: The request was successful.
1000: An unknown error occurred.
1001: The request timed out.
1002: The request was throttled.
1004: Authentication failed.
1039: TPM limit reached.
1042: The percentage of invalid characters exceeded 10%.
2013: The input parameters are invalid.

For more information, see the Error Code List.

status_msg string

Details about the status.

data object

The synthesized data object. This value can be null, so perform a null check before reading it.

Properties

audio string

The synthesized audio data. Its content depends on the output_format request parameter: a URL for url, or hexadecimal-encoded audio data for hex.

status integer

The current status of the audio stream.

1: Synthesis is in progress.
2: Synthesis is complete.

extra_info object

Properties

audio_length integer

The audio duration in milliseconds.

audio_sample_rate integer

The audio sample rate.

audio_size integer

The audio file size in bytes.

bitrate integer

The audio bitrate.

audio_format string

The format of the generated audio file.

Valid values:

mp3
pcm
flac

audio_channel integer

The number of audio channels in the generated audio.

1: Mono
2: Stereo

invisible_character_ratio float

The percentage of invalid characters. If the percentage is 10% or less, the audio is generated successfully and this value is returned. If it exceeds 10%, the request fails.

usage_characters integer

The number of billable characters.

word_count integer

The number of spoken words, including Chinese characters, numbers, and letters, but excluding punctuation marks.

trace_id string

Include this ID in feedback or support requests to aid troubleshooting.

usage object

The character usage for this request.

Properties

characters integer

The number of characters in the input text.
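
Since the status code is reported in the response body and the data object can be null, a client should check both before touching the audio. The sketch below illustrates one way to do so; `SynthesisError` and `extract_data` are hypothetical names, not part of any SDK.

```python
class SynthesisError(Exception):
    """Raised when the response does not contain usable audio data."""


def extract_data(response: dict) -> dict:
    """Return output.data after checking base_resp and the null case."""
    output = response.get("output") or {}
    base = output.get("base_resp") or {}
    code = base.get("status_code")
    if code != 0:
        raise SynthesisError(
            f"status_code={code}: {base.get('status_msg')} "
            f"(request_id={response.get('request_id')})"
        )
    data = output.get("data")
    if data is None:
        # The reference notes that data can be null even in a parsed response.
        raise SynthesisError("response contained no data object")
    return data
```

Including the request_id in the error message makes it easier to quote in a support request.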