MiniMax Synchronous Speech Synthesis API Reference

Supported models

Model name                  Synthesis price (per 10,000 characters)
MiniMax/speech-2.8-hd       CNY 3.5
MiniMax/speech-02-hd        CNY 3.5
MiniMax/speech-2.8-turbo    CNY 2
MiniMax/speech-02-turbo     CNY 2

Voice cloning price: CNY 9.9 (charged the first time a cloned voice is used for speech synthesis).

Free quota: None.
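As a rough illustration of the per-10,000-character synthesis prices listed above, the sketch below estimates the synthesis cost of a request. It assumes that billing is linear in the character count with no rounding or minimum charge, which this page does not confirm, and it ignores the voice cloning fee. The names PRICE_PER_10K and estimate_cost are hypothetical.

```python
from decimal import Decimal

# Synthesis prices in CNY per 10,000 characters, copied from the list above.
PRICE_PER_10K = {
    "MiniMax/speech-2.8-hd": Decimal("3.5"),
    "MiniMax/speech-02-hd": Decimal("3.5"),
    "MiniMax/speech-2.8-turbo": Decimal("2"),
    "MiniMax/speech-02-turbo": Decimal("2"),
}


def estimate_cost(model: str, characters: int) -> Decimal:
    """Estimate the synthesis cost in CNY.

    Assumption: cost scales linearly with the number of billable characters.
    """
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return PRICE_PER_10K[model] * characters / Decimal(10_000)


print(estimate_cost("MiniMax/speech-2.8-hd", 26))
```

The characters value would typically come from usage.characters in the response.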

URL

China (Beijing)

HTTP request URL: POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation

Base URL for SDK calls: https://dashscope.aliyuncs.com/api/v1

Important

Model Studio has released a workspace-specific domain for the Singapore region: https://{WorkspaceId}.ap-southeast-1.maas.aliyuncs.com. The new dedicated domain delivers superior performance and higher stability for inference requests. We recommend migrating from https://dashscope-intl.aliyuncs.com to the new domain.

{WorkspaceId} is your workspace ID, which can be found on the Workspace Details page in the Model Studio console. The existing domain remains fully functional.

Headers

Authorization string (Required)

The authentication token. The format is Bearer <your_api_key>. Replace <your_api_key> with your actual API key.

Content-Type string (Required)

The media type of the request body. This must be set to application/json.

X-DashScope-SSE string (Optional)

Enables Server-Sent Events (SSE) for streaming output. Set this header to enable to receive results incrementally as a stream, which suits real-time applications. If the header is omitted, the API responds synchronously.
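The headers described above can be assembled as in the following minimal sketch. The helper name build_headers and the placeholder key are hypothetical; only the header names and values come from this page.

```python
import os


def build_headers(api_key: str, stream: bool = False) -> dict:
    """Build the request headers; add the SSE header only for streaming."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if stream:
        # Omitting this header makes the API respond synchronously.
        headers["X-DashScope-SSE"] = "enable"
    return headers


# Read the key from the environment; the fallback is a placeholder, not a real key.
api_key = os.environ.get("DASHSCOPE_API_KEY", "your_api_key")
streaming_headers = build_headers(api_key, stream=True)
```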

Request body

Non-streaming

curl -X POST "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "Are you happy today? (laughs), Of course!",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    },
    "pronunciation_dict": {
      "tone": [
        "处理/(chu3)(li3)",
        "危险/dangerous"
      ]
    },
    "subtitle_enable": false
  }
}'

Streaming

curl -X POST "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-SSE: enable" \
-d '{
  "model": "MiniMax/speech-2.8-hd",
  "input": {
    "text": "Are you happy today? (laughs), Of course!",
    "voice_setting": {
      "voice_id": "male-qn-qingse",
      "speed": 1,
      "vol": 1,
      "pitch": 0,
      "emotion": "happy"
    },
    "audio_setting": {
      "sample_rate": 32000,
      "bitrate": 128000,
      "format": "mp3",
      "channel": 1
    },
    "pronunciation_dict": {
      "tone": [
        "处理/(chu3)(li3)",
        "危险/dangerous"
      ]
    },
    "subtitle_enable": false
  }
}'

model string (Required)

The name of the model to use. Supported models:

MiniMax/speech-2.8-hd

MiniMax/speech-02-hd

MiniMax/speech-2.8-turbo

MiniMax/speech-02-turbo

input object (Required)

Properties

text string (Required)

The text to synthesize into speech.

The text must be less than 10,000 characters long. For text longer than 3,000 characters, streaming output is recommended.
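The length limits above could be checked on the client before sending a request, as in this sketch. It assumes that characters are counted the way Python's len counts them (Unicode code points); the service may count differently. The function name choose_mode is hypothetical.

```python
MAX_TEXT_CHARS = 10_000    # the text must be shorter than this
STREAM_HINT_CHARS = 3_000  # streaming is recommended above this length


def choose_mode(text: str) -> str:
    """Return the suggested output mode, or raise if the text is too long."""
    length = len(text)
    if length >= MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {length} characters; it must be shorter than {MAX_TEXT_CHARS}"
        )
    return "streaming" if length > STREAM_HINT_CHARS else "non-streaming"
```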

stream_options object (Optional)

Streaming output configuration. Effective only when the request header X-DashScope-SSE: enable is set.

Properties

exclude_aggregated_audio boolean (Optional) Default: false

Controls whether the audio field in the final synthesis-complete frame returns the full concatenated audio of the entire synthesis (i.e., the concatenation of all in-progress chunks as a single hex string).

Valid values:

  • false: The audio field of the final frame returns the full concatenated audio. The client can directly consume the complete audio at stream end without having to concatenate the preceding chunks itself.

  • true: The audio field of the final frame is an empty string. The client must concatenate the audio from all preceding in-progress chunks in order to obtain the complete audio. This significantly reduces the size and transfer time of the final frame, and is suitable for clients that already play chunks as they arrive or accumulate audio on their own.

Note

This parameter takes effect only in streaming output mode; it has no effect in non-streaming mode.
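The two behaviors of exclude_aggregated_audio can be handled by one client-side routine, sketched below. It assumes each SSE event has already been parsed into a dict shaped like the response body documented on this page, and that frames arrive in order. The function name assemble_audio is hypothetical.

```python
def assemble_audio(frames: list) -> bytes:
    """Return the complete audio from a sequence of parsed streaming frames."""
    chunks = []
    for frame in frames:
        data = (frame.get("output") or {}).get("data") or {}
        audio_hex = data.get("audio") or ""
        if data.get("status") == 2:
            # Final frame: with exclude_aggregated_audio=false it already
            # carries the full concatenated audio, so use it directly.
            if audio_hex:
                return bytes.fromhex(audio_hex)
            break  # exclude_aggregated_audio=true: fall back to the chunks
        chunks.append(audio_hex)
    # Concatenate the in-progress chunks in arrival order.
    return bytes.fromhex("".join(chunks))
```

A player that consumes chunks as they arrive would decode each in-progress chunk immediately instead of collecting them, which is the case where setting exclude_aggregated_audio to true saves bandwidth.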

voice_setting object (Required)

Properties

voice_id string (Required)

Specifies the voice ID.

To blend multiple voices, configure the timbre_weights parameter and leave this parameter empty.

speed float (Optional) Default: 1.0

Specifies the speech rate.

Value range: [0.5, 2.0].

vol float (Optional) Default: 1.0

Specifies the volume.

Value range: (0.0, 10.0].

pitch integer (Optional) Default: 0

Specifies the pitch.

Value range: [-12, 12].

emotion string (Optional) No default value

Specifies the emotion. The model automatically detects the appropriate emotion from the input text, so you typically do not need to set this parameter manually.

Valid values:

  • happy: Happy

  • sad: Sad

  • angry: Angry

  • fearful: Fearful

  • disgusted: Disgusted

  • surprised: Surprised

  • calm: Calm

  • whisper: Whisper

Note

The speech-2.8-hd and speech-2.8-turbo models do not support whisper.
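The value ranges and the whisper restriction above can be checked before a request is sent. The following is a sketch, not part of any SDK; the function name validate_voice_setting is hypothetical, and it assumes the restricted models are identified by the MiniMax/-prefixed names used in the model parameter.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted",
            "surprised", "calm", "whisper"}
# Assumption: the note about whisper refers to these full model names.
NO_WHISPER_MODELS = {"MiniMax/speech-2.8-hd", "MiniMax/speech-2.8-turbo"}


def validate_voice_setting(model: str, setting: dict) -> list:
    """Return a list of problems found in a voice_setting object."""
    errors = []
    if not 0.5 <= setting.get("speed", 1.0) <= 2.0:
        errors.append("speed must be in [0.5, 2.0]")
    if not 0.0 < setting.get("vol", 1.0) <= 10.0:
        errors.append("vol must be in (0.0, 10.0]")
    pitch = setting.get("pitch", 0)
    if not (isinstance(pitch, int) and -12 <= pitch <= 12):
        errors.append("pitch must be an integer in [-12, 12]")
    emotion = setting.get("emotion")
    if emotion is not None:
        if emotion not in EMOTIONS:
            errors.append(f"unknown emotion: {emotion}")
        elif emotion == "whisper" and model in NO_WHISPER_MODELS:
            errors.append(f"{model} does not support whisper")
    return errors
```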

text_normalization boolean (Optional) Default: false

Specifies whether to enable text normalization for Chinese and English. Enabling this improves performance in scenarios that involve reading numbers, but it might slightly increase latency.

Valid values:

  • true: Enable

  • false: Disable

latex_read boolean (Optional) Default: false

Specifies whether to enable reading of LaTeX formulas.

Valid values:

  • true: Enable

  • false: Disable

Example:

The quadratic formula, for instance, should be written in the request as: $$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$.

Note
  • This feature is supported only for Chinese. When enabled, the language_boost parameter is automatically set to Chinese.

  • Formulas in the request must be enclosed in $$.

  • If a formula in the request contains a backslash ("\"), it must be escaped as "\\".
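When the request body is built programmatically, a JSON serializer applies the backslash escaping described above automatically. The sketch below illustrates this with Python's json module. It assumes that latex_read belongs to voice_setting, which the flattened parameter list on this page suggests but does not state outright; the model and voice ID are taken from the examples on this page.

```python
import json

# A raw string keeps each backslash literal; in a normal string, "\f" in
# "\frac" would be interpreted as a form-feed character.
formula = r"$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$"

body = {
    "model": "MiniMax/speech-02-hd",
    "input": {
        "text": "公式:" + formula,
        "voice_setting": {
            "voice_id": "male-qn-qingse",
            "latex_read": True,  # assumed location of this parameter
        },
    },
}

# json.dumps writes every backslash as "\\" in the serialized payload.
payload = json.dumps(body, ensure_ascii=False)
print(payload)
```

Hand-written JSON, such as the body of a curl command, needs the doubled backslashes typed explicitly.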

audio_setting object (Optional)

Properties

sample_rate integer (Optional) Default: 32000

Specifies the sample rate (in Hz) of the generated audio.

Valid values:

  • 8000

  • 16000

  • 22050

  • 24000

  • 32000

  • 44100

bitrate integer (Optional) Default: 128000

Specifies the bitrate (in bps) of the generated audio.

Valid values:

  • 32000

  • 64000

  • 128000

  • 256000

Note

This parameter applies only when the format parameter is set to mp3. For other formats, this parameter is ignored.

format string (Optional) Default: mp3

Specifies the format of the generated audio.

Valid values:

  • mp3

  • pcm

  • flac

  • wav

channel integer (Optional) Default: 1

Specifies the number of audio channels for the generated audio.

Valid values:

  • 1: Mono

  • 2: Stereo

force_cbr boolean (Optional) Default: false

Specifies whether to encode the audio at a constant bitrate (CBR).

Valid values:

  • true: Yes

  • false: No

Note

This parameter takes effect only for streaming output in mp3 format.
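The allowed values for audio_setting can be encoded in a small builder, sketched below. The function name build_audio_setting is hypothetical. It leaves out bitrate for formats other than mp3, since the parameter is ignored in that case.

```python
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "pcm", "flac", "wav"}


def build_audio_setting(fmt: str = "mp3", sample_rate: int = 32000,
                        bitrate: int = 128000, channel: int = 1) -> dict:
    """Build an audio_setting object using only the documented values."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")
    setting = {"format": fmt, "sample_rate": sample_rate, "channel": channel}
    if fmt == "mp3":
        # bitrate applies only to mp3, so it is validated and sent only here.
        if bitrate not in BITRATES:
            raise ValueError(f"unsupported bitrate: {bitrate}")
        setting["bitrate"] = bitrate
    return setting
```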

pronunciation_dict object (Optional)

Properties

tone string[] (Optional)

Defines custom pronunciation or replacement rules for specific words or symbols.

Use a forward slash (/) as a separator.

For Chinese text, tones are represented by numbers: 1 for the first tone, 2 for the second, 3 for the third, 4 for the fourth, and 5 for the neutral tone.

Example:

["Yan Shaofei/(yan4)(shao3)(fei1)", "omg/oh my god"]
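Entries for the tone array follow a simple source/replacement pattern, so they can be generated rather than typed by hand. The helpers below are a sketch with hypothetical names (pinyin, tone_entry); the check on the trailing tone digit reflects the 1 to 5 numbering described above.

```python
def pinyin(*syllables: str) -> str:
    """Format pinyin syllables with tone digits, e.g. ("chu3", "li3")."""
    for syllable in syllables:
        if not syllable or syllable[-1] not in "12345":
            raise ValueError(f"syllable must end with a tone digit 1-5: {syllable!r}")
    return "".join(f"({s})" for s in syllables)


def tone_entry(source: str, replacement: str) -> str:
    """Join the source text and its pronunciation or replacement with '/'."""
    if "/" in source:
        raise ValueError("source text must not contain the '/' separator")
    return f"{source}/{replacement}"


pronunciation_dict = {
    "tone": [
        tone_entry("处理", pinyin("chu3", "li3")),
        tone_entry("omg", "oh my god"),
    ]
}
```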

timbre_weights object[] (Optional)

Specifies the weights for blending multiple voices. You can blend up to four voices.

Properties

voice_id string (Required)

Specifies the voice ID.

weight integer (Required)

Specifies the weight of the voice. This must be set along with voice_id. A higher weight makes the synthesized voice sound more similar to the corresponding voice.

Value range: [1, 100].
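The constraints on timbre_weights (at most four voices, integer weights from 1 to 100) can be enforced when building the array. This is a sketch; the function name build_timbre_weights and the second voice ID in the usage line are hypothetical.

```python
def build_timbre_weights(weights: dict) -> list:
    """Convert {voice_id: weight} into the timbre_weights array."""
    if not 1 <= len(weights) <= 4:
        raise ValueError("blend between one and four voices")
    for voice_id, weight in weights.items():
        if not (isinstance(weight, int) and 1 <= weight <= 100):
            raise ValueError(f"weight for {voice_id} must be an integer in [1, 100]")
    return [{"voice_id": v, "weight": w} for v, w in weights.items()]


# "example-voice-b" is a placeholder, not a real voice ID.
blend = build_timbre_weights({"male-qn-qingse": 70, "example-voice-b": 30})
```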

language_boost string (Optional) Default: null

Specifies whether to improve synthesis quality for specific low-resource languages and dialects. Set to auto to let the model decide automatically.

Valid values:

  • Chinese

  • Chinese,Yue

  • English

  • Arabic

  • Russian

  • Spanish

  • French

  • Portuguese

  • German

  • Turkish

  • Dutch

  • Ukrainian

  • Vietnamese

  • Indonesian

  • Japanese

  • Italian

  • Korean

  • Thai

  • Polish

  • Romanian

  • Greek

  • Czech

  • Finnish

  • Hindi

  • Bulgarian

  • Danish

  • Hebrew

  • Malay

  • Persian

  • Slovak

  • Swedish

  • Croatian

  • Filipino

  • Hungarian

  • Norwegian

  • Slovenian

  • Catalan

  • Nynorsk

  • Tamil

  • Afrikaans

  • auto

voice_modify object (Optional)

Modifies audio characteristics or applies sound effects. This parameter is supported for the following audio formats:

  • Non-streaming: mp3, wav, flac

  • Streaming: mp3

Properties

pitch integer (Optional) No default value

Sets the pitch (deep/bright).

Lower values make the voice deeper; higher values make it brighter.

Value range: [-100, 100].

intensity integer (Optional) No default value

Sets the intensity (powerful/soft).

Lower values make the voice more powerful; higher values make it softer.

Value range: [-100, 100].

timbre integer (Optional) No default value

Sets the timbre brightness (richer/crisp).

Lower values make the voice richer; higher values make it crisper.

Value range: [-100, 100].

sound_effects string (Optional) No default value

Sets the sound effect.

Valid values:

  • spacious_echo: Spacious Echo

  • auditorium_echo: Auditorium Echo

  • lofi_telephone: Lo-Fi Telephone

  • robotic: Robotic
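Because voice_modify depends on both the output mode and the audio format, a pre-flight check can catch unsupported combinations early. The sketch below uses a hypothetical function name, validate_voice_modify, and encodes only the ranges and format lists given above.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}
# Formats that support voice_modify, keyed by whether the request streams.
SUPPORTED_FORMATS = {False: {"mp3", "wav", "flac"}, True: {"mp3"}}


def validate_voice_modify(modify: dict, audio_format: str = "mp3",
                          stream: bool = False) -> list:
    """Return a list of problems found in a voice_modify object."""
    errors = []
    if audio_format not in SUPPORTED_FORMATS[stream]:
        mode = "streaming" if stream else "non-streaming"
        errors.append(f"voice_modify is not supported for {audio_format} in {mode} mode")
    for key in ("pitch", "intensity", "timbre"):
        value = modify.get(key)
        if value is not None and not (isinstance(value, int) and -100 <= value <= 100):
            errors.append(f"{key} must be an integer in [-100, 100]")
    effect = modify.get("sound_effects")
    if effect is not None and effect not in SOUND_EFFECTS:
        errors.append(f"unknown sound effect: {effect}")
    return errors
```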

subtitle_enable boolean (Optional) Default: false

Specifies whether to enable subtitle generation.

Valid values:

  • true: Yes

  • false: No

Note

This parameter is valid only in non-streaming output scenarios and only for the speech-2.8-hd, speech-2.8-turbo, speech-2.6-hd, speech-2.6-turbo, speech-02-hd, speech-02-turbo, speech-01-hd, and speech-01-turbo models.

output_format string (Optional) Default: hex

Specifies the format of the output result.

Valid values:

  • url: The synthesized audio is returned as a URL, which is valid for 24 hours.

  • hex: The synthesized audio is returned in hexadecimal-encoded binary format.

Note

This parameter applies only to non-streaming mode. Streaming mode always returns results in hex format.
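How the audio field is consumed depends on output_format. The following sketch writes the audio to a file in either case; the function name save_audio is hypothetical, and the url branch simply downloads the returned link using the standard library.

```python
import urllib.request


def save_audio(audio: str, output_format: str, path: str) -> None:
    """Save the audio field of a non-streaming response to a file."""
    if output_format == "hex":
        # Hex-encoded binary: decode it back to raw bytes.
        with open(path, "wb") as f:
            f.write(bytes.fromhex(audio))
    elif output_format == "url":
        # The URL is valid for 24 hours, so download it promptly.
        urllib.request.urlretrieve(audio, path)
    else:
        raise ValueError(f"unknown output_format: {output_format}")
```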

aigc_watermark boolean (Optional) Default: false

Specifies whether to add an inaudible AIGC watermark to the end of the synthesized audio.

Valid values:

  • true: Yes

  • false: No

Note

This parameter applies only to non-streaming output.

Response body

Non-streaming

{
    "output": {
        "base_resp": {
            "status_code": 0,
            "status_msg": "success"
        },
        "data": {
            "audio": "<hex-encoded-audio>",
            "status": 2
        },
        "extra_info": {
            "audio_channel": 1,
            "audio_format": "mp3",
            "audio_length": 3528,
            "audio_sample_rate": 16000,
            "audio_size": 58164,
            "bitrate": 128000,
            "invisible_character_ratio": 0,
            "usage_characters": 26,
            "word_count": 14
        },
        "trace_id": "05fdef92e4c1b32283e3d1c456971a80"
    },
    "usage": {
        "characters": 26
    },
    "request_id": "233b9516-1038-9697-b458-87e95a1f8108"
}

Streaming

{
    "output": {
        "base_resp": {
            "status_code": 0,
            "status_msg": "success"
        },
        "data": {
            "audio": "<hex-encoded-audio>",
            "status": 2
        },
        "extra_info": {
            "audio_channel": 1,
            "audio_format": "mp3",
            "audio_length": 3528,
            "audio_sample_rate": 16000,
            "audio_size": 58164,
            "bitrate": 128000,
            "invisible_character_ratio": 0,
            "usage_characters": 26,
            "word_count": 14
        },
        "trace_id": "05fdef92e4c1b32283e3d1c456971a80"
    },
    "usage": {
        "characters": 26
    },
    "request_id": "233b9516-1038-9697-b458-87e95a1f8108"
}

request_id string

A unique identifier for this API call.

output object

The data returned by the model.

Properties

base_resp object

The status code and details of this request.

Properties

status_code integer

The status code.

  • 0: The request was successful.

  • 1000: An unknown error occurred.

  • 1001: The request timed out.

  • 1002: The request was throttled.

  • 1004: Authentication failed.

  • 1039: TPM limit reached.

  • 1042: The percentage of invalid characters exceeded 10%.

  • 2013: The input parameters are invalid.

For more information, see the Error Code List.
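A client can turn a non-zero status_code into an exception, as sketched below. The class and function names are hypothetical. The sketch treats a missing base_resp as an error, which is an assumption rather than documented behavior.

```python
STATUS_MESSAGES = {
    0: "The request was successful.",
    1000: "An unknown error occurred.",
    1001: "The request timed out.",
    1002: "The request was throttled.",
    1004: "Authentication failed.",
    1039: "TPM limit reached.",
    1042: "The percentage of invalid characters exceeded 10%.",
    2013: "The input parameters are invalid.",
}


class SpeechSynthesisError(Exception):
    """Raised when base_resp.status_code is not 0."""

    def __init__(self, code, message):
        super().__init__(f"[{code}] {message}")
        self.code = code


def check_base_resp(response: dict) -> None:
    """Raise SpeechSynthesisError unless the response reports success."""
    base = (response.get("output") or {}).get("base_resp") or {}
    code = base.get("status_code")
    if code == 0:
        return
    # Prefer the server's message; fall back to the descriptions listed above.
    message = base.get("status_msg") or STATUS_MESSAGES.get(code, "Unrecognized status code.")
    raise SpeechSynthesisError(code, message)
```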

status_msg string

Details about the status.

data object

The synthesized data object. This value can be null, so you should perform a null check.

Properties

audio string

The synthesized audio data. Its content depends on the output_format request parameter, containing a URL for url or hexadecimal-encoded data for hex.

status integer

The current status of the audio stream: 1 indicates that synthesis is in progress, and 2 indicates that synthesis is complete.
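The null check on data and the meaning of status can be combined in a small accessor, sketched here with a hypothetical name, extract_audio. It returns None for the audio when data is null so that callers cannot mistake a missing object for empty audio.

```python
from typing import Optional, Tuple


def extract_audio(response: dict) -> Tuple[Optional[str], bool]:
    """Return (audio, is_complete) from a parsed response or stream frame."""
    data = (response.get("output") or {}).get("data")
    if data is None:
        # data can be null, for example when the request failed.
        return None, False
    # status 1: synthesis in progress; status 2: synthesis complete.
    return data.get("audio") or "", data.get("status") == 2
```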

extra_info object

Properties

audio_length integer

The audio duration in milliseconds.

audio_sample_rate integer

The audio sample rate.

audio_size integer

The audio file size in bytes.

bitrate integer

The audio bitrate.

audio_format string

The format of the generated audio file. Valid values: mp3, pcm, and flac.

audio_channel integer

The number of audio channels in the generated audio. A value of 1 indicates mono, and 2 indicates stereo.

invisible_character_ratio float

The percentage of invalid characters. If the percentage is 10% or less, the audio is generated successfully and this value is returned. If it exceeds 10%, the request fails.

usage_characters integer

The number of billable characters.

word_count integer

The number of spoken words, including Chinese characters, numbers, and letters, but excluding punctuation marks.

trace_id string

Include this ID in feedback or support requests to aid troubleshooting.

base_resp object

audio string

The synthesized audio data is hex-encoded, and its format is consistent with the output format specified by the output_format parameter in the request.

usage object

The character usage for this request.

Properties

characters integer

The number of characters in the input text.