Terms - Intelligent Speech Interaction (ISI)

This page defines the key terms and concepts you need to understand before integrating Intelligent Speech Interaction (ISI) into your application.

Audio sample rate

The audio sample rate is the number of audio samples captured per second. A higher sample rate produces more natural, faithful audio reproduction.

ISI supports two sample rates:

Workload	Sample rate
Telephone	8 kHz
All others	16 kHz

If your audio is recorded at a rate higher than 16 kHz, convert it to 16 kHz before sending it to ISI. If your audio is at 8 kHz, do not convert it to 16 kHz — instead, configure your project to use an 8 kHz model.
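The rule above can be sketched as a small decision helper. This is only an illustration: the function name `plan_sample_rate` and its return shape are hypothetical, and the rates the guidance does not mention (for example 11.025 kHz) are rejected rather than guessed at.

```python
def plan_sample_rate(source_hz: int) -> tuple[int, str]:
    """Return (rate to send in Hz, action to take) for a given source rate.

    Hypothetical helper that only encodes the guidance in this section.
    """
    if source_hz == 8000:
        # Do not upsample: select an 8 kHz model in the project instead.
        return 8000, "send as is and configure the project to use an 8 kHz model"
    if source_hz == 16000:
        return 16000, "send as is"
    if source_hz > 16000:
        return 16000, "convert to 16 kHz before sending"
    # Rates below 16 kHz other than 8 kHz are not covered by this section.
    raise ValueError(f"{source_hz} Hz is not covered by the guidance above")


rate, action = plan_sample_rate(44100)
print(rate, action)
```

The helper only decides what to do. The actual resampling would be done with an audio tool or library of your choice.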

Audio bit depth

Audio bit depth determines how many discrete amplitude values can represent a sound sample. More bits mean higher resolution and better sound quality.

ISI uses 16-bit audio by default. Each sample is stored as two bytes, so 16 kHz mono audio occupies 32,000 bytes per second (16,000 samples × 2 bytes per sample). The following table shows how many amplitude values each bit depth can represent:

Bit depth	Possible amplitude values
8-bit	256
16-bit	65,536

CD audio also uses a 16-bit depth.
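The arithmetic behind these figures is easy to check. The sketch below assumes uncompressed PCM audio; the function names are illustrative only.

```python
def amplitude_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values a sample of this depth can hold."""
    return 2 ** bit_depth


def bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Raw data rate of uncompressed PCM audio."""
    return sample_rate_hz * (bit_depth // 8) * channels


print(amplitude_levels(8), amplitude_levels(16))
print(bytes_per_second(16000, 16))  # 16 kHz, 16-bit, mono
```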

Audio coding format

The audio coding format defines how audio data is encoded for storage and transmission. It is separate from the audio file format — for example, a WAV file can encode audio using PCM (pulse-code modulation) or AMR (Adaptive Multi-Rate), depending on what is specified in the file header.

Important

Before calling an ISI service, verify that the service supports your audio's coding format.
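One way to check the coding format of a WAV file is to read the format tag from its header. The following is a minimal sketch that assumes a standard RIFF/WAVE layout, where a format tag of 1 means linear PCM. The function name `wav_format_tag` is hypothetical, and the in-memory sample file exists only to make the example self-contained.

```python
import io
import struct
import wave


def wav_format_tag(data: bytes) -> int:
    """Return the format tag stored in the 'fmt ' chunk of a RIFF/WAVE byte string."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            # The first 16-bit field of the chunk body is the format tag.
            (tag,) = struct.unpack_from("<H", data, pos + 8)
            return tag
        pos += 8 + size + (size & 1)  # chunk bodies are padded to an even length
    raise ValueError("no 'fmt ' chunk found")


# Build a tiny 16 kHz, 16-bit, mono PCM file in memory to try the function on.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(16000)
    writer.writeframes(b"\x00\x00" * 10)
sample = buffer.getvalue()

print("PCM" if wav_format_tag(sample) == 1 else "not PCM")
```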

Sound channel

A sound channel carries audio from a single spatial source. The number of channels equals the number of independent audio sources recorded. Common audio is either mono (one channel) or stereo (two channels).

Note

Except for the recording file recognition service, all ISI services support mono audio only. Convert stereo or multi-channel audio to mono before sending it to ISI.
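As an illustration of the conversion, the sketch below downmixes interleaved 16-bit little-endian stereo PCM to mono by averaging the left and right samples. Averaging is only one possible downmix strategy, and the function name `stereo_to_mono` is hypothetical. For production use, an audio library or tool is usually the better choice.

```python
import struct


def stereo_to_mono(pcm: bytes) -> bytes:
    """Average left/right pairs of interleaved 16-bit little-endian PCM."""
    if len(pcm) % 4:
        raise ValueError("expected whole stereo frames of 4 bytes each")
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm)
    # Even indexes hold the left channel, odd indexes the right channel.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count, 2)]
    return struct.pack(f"<{len(mono)}h", *mono)


stereo = struct.pack("<4h", 100, 300, -200, -400)  # two stereo frames
print(struct.unpack("<2h", stereo_to_mono(stereo)))
```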

Inverse text normalization

Inverse text normalization (ITN) formats raw speech recognition output into readable text. ITN applies standard conventions for numbers, dates, currency, phone numbers, and similar patterns.

The following examples show how ITN transforms spoken input:

Spoken input	Recognition result with ITN
Twenty percent	20%
One thousand six hundred eighty yuan	CNY 1680
May the eleventh	May 11
Please dial one one zero.	Please dial 110.
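To make the idea concrete, the toy sketch below handles just one of the patterns in the table: a run of two or more spoken digits. It is not how ISI implements ITN, which runs on the server and covers far more cases. The function name `collapse_digit_runs` is hypothetical.

```python
import re

_DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
_WORD = "|".join(_DIGIT_WORDS)
# Match two or more digit words in a row so that a lone "one" is left alone.
_RUN = re.compile(rf"\b(?:{_WORD})(?: (?:{_WORD}))+\b", re.IGNORECASE)


def collapse_digit_runs(text: str) -> str:
    """Replace runs of spoken digits with the corresponding digit string."""
    def to_digits(match: re.Match) -> str:
        return "".join(_DIGIT_WORDS[word.lower()] for word in match.group().split())

    return _RUN.sub(to_digits, text)


print(collapse_digit_runs("Please dial one one zero."))
```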

Appkey

An appkey uniquely identifies a project in the ISI console. When calling an ISI service, provide the appkey for the project — the service uses it to retrieve that project's configuration.

ISI supports multiple business scenarios, such as customer service hotlines and mobile device inputs. Each scenario has different service capabilities. Configure your project to match the requirements of your specific scenario to get the best results.

AccessKey pair

An AccessKey pair is the identity credential your application uses to call Alibaba Cloud APIs. Create and view your AccessKey pair on the Security Management page.

An AccessKey pair consists of two components:

AccessKey ID: identifies your account
AccessKey secret: used to sign each API request so that the server can verify the caller and detect tampering

Always use the AccessKey ID and AccessKey secret together. Treat the AccessKey secret like a password — keep it confidential.
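One common way to keep the secret out of source code is to read both values from environment variables at startup. The sketch below assumes the variable names `ALIBABA_CLOUD_ACCESS_KEY_ID` and `ALIBABA_CLOUD_ACCESS_KEY_SECRET`; adjust them to whatever your deployment uses. The helper and type names are illustrative.

```python
import os
from typing import Mapping, NamedTuple


class AccessKeyPair(NamedTuple):
    access_key_id: str
    access_key_secret: str


def load_access_key_pair(env: Mapping[str, str] = os.environ) -> AccessKeyPair:
    """Read both halves of the pair together; fail if either is missing."""
    try:
        return AccessKeyPair(
            env["ALIBABA_CLOUD_ACCESS_KEY_ID"],      # assumed variable name
            env["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],  # assumed variable name
        )
    except KeyError as exc:
        # Report only the variable name, never a credential value.
        raise RuntimeError(f"missing environment variable {exc.args[0]}") from None
```

Passing a plain dictionary as `env` makes the function easy to test without touching the real environment.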

Access token

An access token is a credential for calling ISI services. It has a validity period. Obtain one using your AccessKey ID and AccessKey secret.

For client-side applications (such as mobile apps), use the server-proxy pattern: generate the access token on your server, then pass it to the client. This prevents your AccessKey pair from being disclosed.
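On the server side, the pattern above often comes with a small cache so that a new token is requested only when the current one is about to expire. The following is a minimal sketch, not the ISI SDK: `fetch` is a hypothetical callable that you would implement with your AccessKey pair and that returns the token together with its expiry time as a Unix timestamp.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuse an access token until shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # hypothetical: returns (token, expiry timestamp)
        margin_s: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin_s = margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh when there is no token yet or it is within the safety margin.
        if self._token is None or self._clock() >= self._expires_at - self._margin_s:
            self._token, self._expires_at = self._fetch()
        return self._token
```

A server endpoint would call `cache.get()` and return only the token to the client, so the AccessKey pair never leaves the server.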

Intermediate result

By default, ISI returns a final recognition result after it finishes processing the entire utterance. Enable intermediate results to also receive partial results while the user is still speaking.

Intermediate results are useful in scenarios where you want to display real-time transcription as the user speaks — for example, live captions, real-time feedback for pronunciation, or customer service assist tools.

Parameter value	Behavior
`false` (default)	Returns the final result only
`true`	Returns intermediate results during speech, plus the final result

For example, if the final result is "Hello welcome to Alibaba Group", the server may return these intermediate results while the user speaks:

Hello
Hello welcome
Hello welcome to
Hello welcome to Alibaba
Hello welcome to Alibaba Group

Note

The server may revise a previous intermediate result when it returns the next one.
The number of new words added between consecutive intermediate results is not fixed.
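Because an intermediate result can revise earlier text, a client should replace what it currently displays for the utterance instead of appending to it. The sketch below shows that idea with a hypothetical `LiveTranscript` class; the method names are illustrative and do not correspond to ISI SDK callbacks.

```python
class LiveTranscript:
    """Keep finished utterances plus the latest intermediate result."""

    def __init__(self) -> None:
        self.finalized: list[str] = []
        self.pending = ""

    def on_intermediate(self, text: str) -> None:
        # Overwrite rather than append: the server may have revised earlier words.
        self.pending = text

    def on_final(self, text: str) -> None:
        self.finalized.append(text)
        self.pending = ""

    def display(self) -> str:
        parts = self.finalized + ([self.pending] if self.pending else [])
        return " ".join(parts)


transcript = LiveTranscript()
for partial in ["Hello", "Hello welcome", "Hello welcome to"]:
    transcript.on_intermediate(partial)
print(transcript.display())
```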

task_id

Each API call to ISI is assigned a unique task ID, generated by the Alibaba Cloud SDK. If an error occurs, you can use the task ID for troubleshooting.