Access TTS models - Intelligent Media Services (IMS) - Alibaba Cloud Help Center

How it works

Configure the TTS node in the console with your HTTPS URL, token, and sample rate.
Start the real-time workflow. For each TTS request, the workflow sends a POST with the text, voice, sample rate, and session metadata to your endpoint.
Your TTS server generates audio and streams it back as an HTTP response.
The workflow forwards audio chunks to downstream nodes in real time.

Prerequisites

Before you begin:

Your TTS model is accessible over the Internet via HTTPS.
Your HTTP server supports streaming responses.
A workflow template with a TTS node is created.

Configure the TTS node

Configure these parameters in the TTS node:

Parameter	Type	Required	Description	Example
Request URL	String	Yes	HTTPS URL of your TTS model endpoint.	`https://www.abc.com`
Token	String	No	Authorization token sent with each request.	`AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI`
Sample rate	Integer	Yes	Audio sample rate in Hz. Valid values: `8000`, `16000`, `24000`, `48000`.	`48000`

Note

Only mono S16LE (signed 16-bit little-endian) PCM audio is supported. If your model produces a different sample format or channel layout, convert its output to mono S16LE before streaming it.
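As a minimal sketch of that conversion, assuming your model emits floating-point samples in the range -1.0 to 1.0 (a common but not universal convention), the helpers below downmix stereo to mono and pack the samples as S16LE bytes. The function names are hypothetical and not part of any IMS API.

```python
import struct


def stereo_to_mono(left, right):
    """Downmix two channels to one by averaging each sample pair."""
    return [(l + r) / 2 for l, r in zip(left, right)]


def float_to_s16le(samples):
    """Pack float samples (assumed to lie in [-1.0, 1.0]) as S16LE bytes."""
    ints = []
    for s in samples:
        # Clip first so out-of-range values saturate instead of wrapping.
        s = max(-1.0, min(1.0, s))
        ints.append(int(round(s * 32767)))
    # "<" forces little-endian byte order; "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)
```

If your model already produces 16-bit integers in a different byte order or channel layout, only the packing or downmix step is needed.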

Request parameters

At runtime, the workflow sends a POST request to your endpoint with the following JSON body:

Parameter	Type	Required	Description	Example
Text	String	Yes	Text to synthesize into speech.	`Hello`
VoiceId	String	No	Voice identifier for the TTS model.	`<your-voice-id>`
SampleRate	Integer	Yes	Sample rate in Hz. Matches the value configured in the console.	`48000`
Token	String	No	Authorization token. Matches the value configured in the console.	`AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI`
ExtendData	String	Yes	JSON string with session metadata and custom business data. For the fields it contains, see ExtendData fields.	See below
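To exercise your endpoint before connecting it to a workflow, you can build a request with the same body shape. The sketch below is a hypothetical test helper, not the workflow's actual client: the function name, the `Content-Type` header, and the local URL in the usage note are assumptions. Note that `ExtendData` is serialized to a JSON string before it is placed in the body.

```python
import json
import urllib.request


def build_tts_request(url, text, sample_rate, extend_data,
                      voice_id=None, token=None):
    """Build a POST request whose JSON body mirrors the table above."""
    body = {
        "Text": text,
        "SampleRate": sample_rate,
        # ExtendData is a JSON string, so it is encoded separately.
        "ExtendData": json.dumps(extend_data),
    }
    # Optional parameters are included only when provided.
    if voice_id is not None:
        body["VoiceId"] = voice_id
    if token is not None:
        body["Token"] = token
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )
```

For example, `urllib.request.urlopen(build_tts_request("http://localhost:8080/stream-audio", "Hello", 48000, {"ChannelId": "123"}))` would send the request to a server listening locally on port 8080.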

ExtendData fields

ExtendData contains the following fields:

Field	Type	Required	Description	Example
InstanceId	String	Yes	ID of the intelligent agent instance.	`68e00b6640e*****3e943332fee7`
ChannelId	String	Yes	ID of the communication channel.	`123`
SentenceId	Integer	Yes	Q&A session ID. All responses to a single user inquiry share the same `SentenceId`.	`3`
Emotion	String	No	Emotion for the synthesized speech. Valid values: `neutral`, `happy`, `sad`. If omitted, no emotion is applied.	`happy`
UserData	String	No	Custom business data passed at intelligent agent instance startup.	`{"aaaa":"bbbb"}`

Example ExtendData value:

{
  "InstanceId": "68e00b6640e*****3e943332fee7",
  "ChannelId": "123",
  "SentenceId": 3,
  "Emotion": "happy",
  "UserData": "{\"aaaa\":\"bbbb\"}"
}
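Because `ExtendData` arrives as a JSON string, your server has to decode it before reading individual fields. The sketch below uses a hypothetical helper name and assumes that `UserData`, when it happens to contain JSON as in the example, is worth decoding too. If `UserData` is not valid JSON, it is left as an opaque string.

```python
import json


def parse_extend_data(raw):
    """Decode the ExtendData JSON string into a dict of fields."""
    fields = json.loads(raw) if raw else {}
    user_data = fields.get("UserData")
    if isinstance(user_data, str) and user_data:
        try:
            # UserData is itself a string; decode it when it holds JSON.
            fields["UserData"] = json.loads(user_data)
        except json.JSONDecodeError:
            pass  # Not JSON: keep the original string untouched.
    return fields
```

A handler could then read optional fields with a fallback, for example `parse_extend_data(extend_data).get("Emotion", "neutral")`, if treating a missing emotion as neutral suits your model.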

Response requirements

Your TTS server must return audio that matches the requested voice and sample rate as an HTTP streaming response. Lower streaming latency directly reduces the end-to-end response delay, so send each audio chunk as soon as it is generated.
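Since the audio is mono S16LE, every sample occupies two bytes, so chunk size and chunk duration convert directly into each other. The sketch below shows that arithmetic. The function names are hypothetical, and the 20 ms figure in the usage note is only an illustrative value, not a documented requirement.

```python
BYTES_PER_SAMPLE = 2  # S16LE: signed 16-bit samples
CHANNELS = 1          # mono


def chunk_size_bytes(sample_rate, chunk_ms):
    """Bytes needed to hold chunk_ms milliseconds of audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def chunk_duration_ms(sample_rate, num_bytes):
    """Playback duration, in milliseconds, of num_bytes of audio."""
    samples = num_bytes / (BYTES_PER_SAMPLE * CHANNELS)
    return samples / sample_rate * 1000
```

For example, `chunk_size_bytes(48000, 20)` gives 1920 bytes, while a 4096-byte chunk at 48000 Hz carries roughly 42.7 ms of audio. Smaller chunks let downstream nodes start sooner at the cost of more writes.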

Sample TTS server

Python

The following sample uses aiohttp to implement a streaming TTS server that serves the /stream-audio endpoint.

from aiohttp import web


async def stream_audio(request):
    data = await request.json()
    text = data.get('Text', "")
    token = data.get('Token', None)
    sample_rate = data.get('SampleRate', 48000)
    extend_data = data.get('ExtendData', "")
    print(f"text:{text}, token:{token}, sample_rate:{sample_rate}, extend_data:{extend_data}")
    # Validate the token here.

    response = web.StreamResponse(
        status=200,
        reason='OK',
        headers={'Content-Type': 'audio/mpeg'}
    )

    # Start the streaming response.
    await response.prepare(request)

    # generate_tts_data is an async generator that yields audio chunks.
    async for chunk in generate_tts_data(text, sample_rate):
        await response.write(chunk)

    # Signal the end of the response.
    await response.write_eof()

    return response


async def generate_tts_data(text: str, sample_rate: int):
    # Replace this with your TTS model inference logic.
    # This example reads audio data from a local PCM file.
    file_path = '/your_dir/sample.pcm'
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(4096)  # Read 4 KB per chunk.
            if not chunk:
                break
            yield chunk

app = web.Application()
app.add_routes([web.post('/stream-audio', stream_audio)])

if __name__ == '__main__':
    web.run_app(app)

In production, replace the body of generate_tts_data with your model's inference logic. To run the sample as is, point /your_dir/sample.pcm at a mono S16LE PCM file whose sample rate matches the value configured in the console.
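The sample leaves token validation as a comment. One possible way to fill it in, assuming the expected token is a single shared secret known to your server, is a constant-time comparison with the standard library. `EXPECTED_TOKEN` and `is_token_valid` are hypothetical names.

```python
import hmac

# Hypothetical placeholder: load the real value from your own configuration.
EXPECTED_TOKEN = "replace-with-your-token"


def is_token_valid(received, expected=EXPECTED_TOKEN):
    """Return True only if the received token equals the expected one."""
    if not isinstance(received, str) or not received:
        return False
    # compare_digest takes time independent of where the values differ.
    return hmac.compare_digest(
        received.encode("utf-8"), expected.encode("utf-8")
    )
```

Inside stream_audio, you could call `is_token_valid(token)` before preparing the response and return `web.Response(status=401)` when it fails. How the workflow reacts to a non-200 status is not described here, so treat that choice as an assumption.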

References

Create and manage a workflow template