Integrate a self-developed TTS model into a real-time workflow by deploying an HTTPS streaming endpoint that accepts text and returns audio.
How it works
- Configure the TTS node in the console with your HTTPS URL, token, and sample rate.
- Start the real-time workflow. For each TTS request, the workflow sends a POST request containing the text, voice ID, sample rate, and session metadata to your endpoint.
- Your TTS server generates audio and streams it back in the HTTP response.
- The workflow forwards the audio chunks to downstream nodes in real time.
Prerequisites
Before you begin:
- Your TTS model is accessible over the Internet through HTTPS.
- Your HTTP server supports streaming responses.
- A workflow template that contains a TTS node has been created.
Configure the TTS node
Configure these parameters in the TTS node:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| Request URL | String | Yes | HTTPS URL of your TTS model endpoint. | https://www.abc.com |
| Token | String | No | Authorization token sent with each request. | AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI |
| Sample rate | Integer | Yes | Audio sample rate in Hz. Valid values: 8000, 16000, 24000, 48000. | 48000 |
Only mono S16LE (signed 16-bit little-endian) PCM audio is supported. If your model outputs a different format, convert the audio to mono S16LE at the configured sample rate before you return it.
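The exact conversion depends on what your model produces. As a minimal sketch, assuming the model outputs float samples in the range -1.0 to 1.0 that are already at the configured sample rate, the following stdlib-only helpers (hypothetical names, not part of any product API) downmix stereo to mono and pack samples as S16LE. Resampling to a different rate is not shown; use a dedicated resampling library for that.

```python
import struct
from typing import List, Sequence


def downmix_stereo(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Average two channels into a single mono channel."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def floats_to_s16le(samples: Sequence[float]) -> bytes:
    """Convert mono float samples in [-1.0, 1.0] to S16LE PCM bytes.

    Out-of-range values are clipped before scaling to 16-bit integers.
    """
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    # '<' selects little-endian byte order; 'h' is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


def bytes_per_ms(sample_rate: int) -> int:
    """Mono S16LE uses 2 bytes per sample."""
    return sample_rate * 2 // 1000
```

For example, at 48000 Hz one millisecond of mono S16LE audio is 96 bytes, which is useful when you size the chunks you stream.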
Request parameters
At runtime, the workflow sends a POST request to your endpoint with the following JSON body:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
| Text | String | Yes | Text to synthesize into speech. | Hello |
| VoiceId | String | No | Voice identifier for the TTS model. | <your-voice-id> |
| SampleRate | Integer | Yes | Sample rate in Hz. Matches the value configured in the console. | 48000 |
| Token | String | No | Authorization token. Matches the value configured in the console. | AUJH-pfnTNMPBm6iWXcJAcWsrscb5KYaLitQhHBLKrI |
| ExtendData | String | Yes | JSON string with session metadata and custom business data. For details, see ExtendData fields. | See below |
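To exercise your endpoint before you connect it to a workflow, you can send a request shaped like the one described above. The following stdlib-only sketch is an assumption-based test client, not the workflow's actual implementation: the helper names build_tts_body and stream_tts are hypothetical, and the `Content-Type: application/json` header is assumed. Note that ExtendData is serialized into a JSON string inside the body rather than sent as a nested object.

```python
import json
import urllib.request
from typing import Iterator, Optional


def build_tts_body(text: str, sample_rate: int, extend_data: dict,
                   voice_id: Optional[str] = None,
                   token: Optional[str] = None) -> dict:
    """Build a request body shaped like the one the workflow sends (hypothetical helper)."""
    body = {
        "Text": text,
        "SampleRate": sample_rate,
        # ExtendData travels as a JSON string, not as a nested object.
        "ExtendData": json.dumps(extend_data),
    }
    if voice_id is not None:
        body["VoiceId"] = voice_id
    if token is not None:
        body["Token"] = token
    return body


def stream_tts(url: str, body: dict, chunk_size: int = 4096) -> Iterator[bytes]:
    """POST the body to a TTS endpoint and yield audio bytes as they arrive."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # Assumed header.
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

For example, `b"".join(stream_tts("https://<your-host>/stream-audio", build_tts_body("Hello", 48000, {"InstanceId": "test", "ChannelId": "123", "SentenceId": 3})))` collects the raw audio your server returns. `resp.read(chunk_size)` may wait until `chunk_size` bytes have accumulated, so the chunk boundaries the client sees do not necessarily match the ones your server writes.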
ExtendData fields
ExtendData contains the following fields:
| Field | Type | Required | Description | Example |
|---|---|---|---|---|
| InstanceId | String | Yes | ID of the intelligent agent instance. | 68e00b6640e*****3e943332fee7 |
| ChannelId | String | Yes | ID of the communication channel. | 123 |
| SentenceId | Integer | Yes | Q&A session ID. All responses to a single user inquiry share the same SentenceId. | 3 |
| Emotion | String | No | Emotion for the synthesized speech. Valid values: neutral, happy, sad. If omitted, no emotion is applied. | happy |
| UserData | String | No | Custom business data passed at intelligent agent instance startup. | {"aaaa":"bbbb"} |
Example ExtendData value, shown as decoded JSON (in the request body, it is serialized into a string):
```json
{
  "InstanceId": "68e00b6640e*****3e943332fee7",
  "ChannelId": "123",
  "SentenceId": "3",
  "Emotion": "happy",
  "UserData": "{\"aaaa\":\"bbbb\"}"
}
```
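Because ExtendData arrives as a JSON string, and UserData is itself a string nested inside it, your server has to decode it before using the fields. The following is a minimal sketch with hypothetical names (ExtendData, parse_extend_data). Two assumptions are made: SentenceId may arrive either as a number or as a numeric string (the table lists Int while the example shows "3"), and UserData may or may not contain JSON.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ExtendData:
    instance_id: str
    channel_id: str
    sentence_id: int
    emotion: Optional[str] = None
    user_data: Any = None


def _decode_user_data(value: Optional[str]) -> Any:
    """Decode UserData if it holds JSON; otherwise keep the raw string."""
    if not value:
        return None
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return value


def parse_extend_data(raw: str) -> ExtendData:
    """Decode the ExtendData JSON string (raises KeyError if a required field is missing)."""
    obj = json.loads(raw)
    return ExtendData(
        instance_id=obj["InstanceId"],
        channel_id=obj["ChannelId"],
        # Accept both 3 and "3" (assumption, see the lead-in).
        sentence_id=int(obj["SentenceId"]),
        emotion=obj.get("Emotion"),
        user_data=_decode_user_data(obj.get("UserData")),
    )
```

You could, for example, use `sentence_id` to group the audio for a single user inquiry in your logs, and pass `emotion` to your model when it supports emotional speech.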
Response requirements
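Your TTS server must stream the audio back in the HTTP response as mono S16LE data that matches the requested voice (VoiceId) and sample rate (SampleRate). Lower streaming latency, especially a shorter delay before the first audio chunk is sent, reduces the end-to-end latency of the real-time workflow.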
Sample TTS server
The following Python example uses aiohttp to implement a streaming TTS server that handles POST requests on the /stream-audio endpoint.
```python
from aiohttp import web


async def stream_audio(request):
    data = await request.json()
    text = data.get('Text', "")
    token = data.get('Token', None)
    sample_rate = data.get('SampleRate', 48000)
    extend_data = data.get('ExtendData', "")
    print(f"text:{text}, token:{token}, sample_rate:{sample_rate}, extend_data:{extend_data}")

    # Validate the token here.

    response = web.StreamResponse(
        status=200,
        reason='OK',
        headers={'Content-Type': 'audio/mpeg'}
    )
    # Start the streaming response (sends the status line and headers).
    await response.prepare(request)

    # generate_tts_data is an async generator that yields audio chunks.
    async for chunk in generate_tts_data(text, sample_rate):
        await response.write(chunk)

    # Signal the end of the response.
    await response.write_eof()
    return response


async def generate_tts_data(text: str, sample_rate: int):
    # Replace this with your TTS model inference logic.
    # This example reads audio data from a local PCM file.
    file_path = '/your_dir/sample.pcm'
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(4096)  # Read 4 KB per chunk.
            if not chunk:
                break
            yield chunk


app = web.Application()
app.add_routes([web.post('/stream-audio', stream_audio)])

if __name__ == '__main__':
    web.run_app(app)
```
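To try the sample, set file_path to a mono S16LE PCM file recorded at the sample rate configured in the console. In production, replace generate_tts_data with your model's inference logic. By default, web.run_app serves plain HTTP on port 8080. Because the TTS node requires an HTTPS URL, expose the server over HTTPS, for example through a TLS-terminating reverse proxy or by passing an ssl.SSLContext to web.run_app through its ssl_context parameter.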
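The sample leaves token validation as a comment. As a minimal sketch, assuming you store the token configured in the console in your own configuration (EXPECTED_TOKEN and is_authorized are hypothetical names), you can compare it with the Token field from the request body by using hmac.compare_digest, which does not stop at the first mismatching byte and so avoids leaking the token through timing differences.

```python
import hmac
from typing import Optional

# Hypothetical placeholder: load the token configured in the console from your own config.
EXPECTED_TOKEN = "replace-with-the-token-configured-in-the-console"


def is_authorized(received: Optional[str], expected: str = EXPECTED_TOKEN) -> bool:
    """Return True only if the request's Token matches the expected value."""
    if not received or not expected:
        return False
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))
```

In stream_audio, you could call this where the sample says to validate the token and return an error response, such as `web.Response(status=401)`, before calling `response.prepare`. How the workflow reacts to such an error is not described here, so verify that behavior in your environment.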