This page defines the key terms and concepts you need to understand before integrating Intelligent Speech Interaction (ISI) into your application.
Audio sample rate
The audio sample rate is the number of audio samples captured per second. A higher sample rate produces more natural, faithful audio reproduction.
ISI supports two sample rates:
| Workload | Sample rate |
|---|---|
| Telephone | 8 kHz |
| All others | 16 kHz |
If your audio is recorded at a rate higher than 16 kHz, convert it to 16 kHz before sending it to ISI. If your audio is at 8 kHz, do not convert it to 16 kHz — instead, configure your project to use an 8 kHz model.
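The rule above can be sketched as a small pre-flight check. This is only an illustration using Python's standard `wave` module; the function names and the returned action labels are hypothetical, not part of any ISI SDK, and the handling of rates other than 8 kHz and 16 kHz or higher is an assumption because the text does not cover them.

```python
import io
import wave


def plan_sample_rate(wav_bytes: bytes) -> str:
    """Suggest an action based on the sample rate in a PCM WAV header.

    The action labels are hypothetical names used for illustration only.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
    if rate == 8000:
        return "use-8k-model"       # keep the audio as is; do not upsample
    if rate == 16000:
        return "send-as-is"
    if rate > 16000:
        return "downsample-to-16k"  # convert before sending
    return "unsupported-rate"       # assumption: not covered by the text


def make_silent_wav(rate: int, seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit mono WAV in memory, for trying the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()


if __name__ == "__main__":
    for rate in (8000, 16000, 44100):
        print(rate, plan_sample_rate(make_silent_wav(rate)))
```

The check only inspects the header; the actual resampling would be done with an audio tool or library of your choice.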
Audio bit depth
Audio bit depth determines how many discrete amplitude values can represent a sound sample. More bits mean higher resolution and better sound quality.
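ISI captures audio in 16-bit format by default, so each sample is stored as two 8-bit bytes. At 16,000 samples per second and 2 bytes per sample, one second of mono audio takes 32,000 bytes. The following table shows how many amplitude values each bit depth can represent: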
| Bit depth | Possible amplitude values |
|---|---|
| 8-bit | 256 |
| 16-bit | 65,536 |
CD audio also uses a 16-bit depth.
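The figures in this section follow from simple arithmetic, which the sketch below works through. The helper names are illustrative only; the calculation assumes uncompressed PCM audio.

```python
def amplitude_levels(bit_depth: int) -> int:
    """Number of discrete amplitude values a sample of this depth can hold."""
    return 2 ** bit_depth


def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Raw data rate of uncompressed PCM audio, in bytes per second."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels


if __name__ == "__main__":
    print(amplitude_levels(8))               # 256
    print(amplitude_levels(16))              # 65536
    print(pcm_bytes_per_second(16000, 16))   # 32000
    print(pcm_bytes_per_second(8000, 16))    # 16000
```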
Audio coding format
The audio coding format defines how audio data is encoded for storage and transmission. It is separate from the audio file format — for example, a WAV file can encode audio using PCM (pulse-code modulation) or AMR (Adaptive Multi-Rate), depending on what is specified in the file header.
Before calling an ISI service, verify that the service supports your audio's coding format.
Sound channel
A sound channel carries audio from a single spatial source. The number of channels equals the number of independent audio sources recorded. Common audio is either mono (one channel) or stereo (two channels).
Except for the recording file recognition service, all ISI services support mono audio only. Convert stereo or multi-channel audio to mono before sending it to ISI.
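One common way to convert stereo to mono is to average the left and right samples of each frame. The sketch below does this for raw 16-bit little-endian interleaved PCM using only the standard library. It is one possible approach, not an ISI-provided utility, and the function name is hypothetical.

```python
import sys
from array import array


def stereo_to_mono(pcm: bytes) -> bytes:
    """Downmix interleaved 16-bit little-endian stereo PCM to mono.

    Each output sample is the floor of the average of the left and
    right samples of one frame.
    """
    samples = array("h")  # signed 16-bit integers on common platforms
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian
    if len(samples) % 2:
        raise ValueError("stereo PCM must contain an even number of samples")
    mono = array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
    )
    if sys.byteorder == "big":
        mono.byteswap()  # emit little-endian output
    return mono.tobytes()
```

Averaging avoids clipping because the mean of two 16-bit values always fits in 16 bits. If only one channel carries speech, selecting that channel instead of averaging may give better recognition.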
Inverse text normalization
Inverse text normalization (ITN) formats raw speech recognition output into readable text. ITN applies standard conventions for numbers, dates, currency, phone numbers, and similar patterns.
The following examples show how ITN transforms spoken input:
| Spoken input | Recognition result with ITN |
|---|---|
| Twenty percent | 20% |
| One thousand six hundred eighty yuan | CNY 1680 |
| May the eleventh | May 11 |
| Please dial one one zero. | Please dial 110. |
Appkey
An appkey uniquely identifies a project in the ISI console. When calling an ISI service, provide the appkey for the project — the service uses it to retrieve that project's configuration.
ISI supports multiple business scenarios, such as customer service hotlines and mobile device inputs. Each scenario has different service capabilities. Configure your project to match the requirements of your specific scenario to get the best results.
AccessKey pair
An AccessKey pair is the identity credential your application uses to call Alibaba Cloud APIs. Create and view your AccessKey pair on the Security Management page.
An AccessKey pair consists of two components:
AccessKey ID: identifies your account
AccessKey secret: encrypts the signature on each API request, preventing tampering
Always use the AccessKey ID and AccessKey secret together. Treat the AccessKey secret like a password — keep it confidential.
Access token
An access token is a credential for calling ISI services. It has a validity period. Obtain one using your AccessKey ID and AccessKey secret.
For client-side applications (such as mobile apps), use the server-proxy pattern: generate the access token on your server, then pass it to the client. This prevents your AccessKey pair from being disclosed.
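On the server side of this pattern, it is usually worth caching the token and refreshing it shortly before it expires. The sketch below shows that idea only. `fetch_token` is a hypothetical callable that you would implement with your AccessKey pair; it is assumed to return the token string and its expiry time in epoch seconds. No real ISI API call is shown.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Server-side cache that reuses an access token until it is near expiry."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],  # hypothetical: returns (token, expiry epoch seconds)
        refresh_margin_s: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch_token
        self._margin = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, fetching a new one if needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            # The AccessKey pair is used only inside fetch_token, on the server.
            self._token, self._expires_at = self._fetch()
        return self._token
```

Your server endpoint would return `cache.get()` to authenticated clients, so the client only ever sees the short-lived token.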
Intermediate result
By default, ISI returns a final recognition result after it finishes processing the entire utterance. Enable intermediate results to also receive partial results while the user is still speaking.
Intermediate results are useful in scenarios where you want to display real-time transcription as the user speaks — for example, live captions, real-time feedback for pronunciation, or customer service assist tools.
| Parameter value | Behavior |
|---|---|
| false (default) | Returns the final result only |
| true | Returns intermediate results during speech, plus the final result |
For example, if the final result is "Hello welcome to Alibaba Group", the server may return these intermediate results while the user speaks:
Hello
Hello welcome
Hello welcome to
Hello welcome to Alibaba
Hello welcome to Alibaba Group
The server may revise a previous intermediate result when it returns the next one.
The number of new words added between consecutive intermediate results is not fixed.
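Because an intermediate result can revise earlier words, a client should replace the text it is showing for the current utterance rather than append to it. The sketch below illustrates this with a hypothetical `LiveTranscript` class. The callback names are assumptions and do not correspond to a specific ISI SDK interface.

```python
from typing import List


class LiveTranscript:
    """Keeps finished sentences plus the latest partial result for display."""

    def __init__(self) -> None:
        self.final_sentences: List[str] = []
        self.partial = ""

    def on_intermediate(self, text: str) -> None:
        # Replace, do not append: the server may have revised earlier words.
        self.partial = text

    def on_final(self, text: str) -> None:
        # The final result supersedes any partial text for this utterance.
        self.final_sentences.append(text)
        self.partial = ""

    def render(self) -> str:
        """Text to show on screen right now."""
        parts = list(self.final_sentences)
        if self.partial:
            parts.append(self.partial)
        return " ".join(parts)


if __name__ == "__main__":
    transcript = LiveTranscript()
    for partial in ("Hello", "Hello welcome", "Hello welcome to Alibaba"):
        transcript.on_intermediate(partial)
        print(transcript.render())
    transcript.on_final("Hello welcome to Alibaba Group")
    print(transcript.render())
```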
task_id
Each API call to ISI is assigned a unique task ID, generated by the Alibaba Cloud SDK. If an error occurs, you can use the task ID for troubleshooting.