The recording file recognition service transcribes audio files submitted as HTTP or HTTPS URLs. It supports two retrieval methods—polling and callback—and processes files asynchronously rather than in real time.
Supported features
Single-track WAV and MP3 audio files
Two retrieval methods: polling and callback
Custom linguistic models and hotword vocabularies
Multiple languages: Chinese Mandarin, Chinese dialects, and English
Limitations
| Constraint | Details |
|---|---|
| URL format | Must be publicly accessible via domain name. IP addresses and spaces are not allowed. |
| File size | 512 MB maximum |
| Processing time (free trial) | Recognition completes within 24 hours |
| Processing time (Commercial Edition) | Recognition completes within 3 hours |
| Result retention | 72 hours |
| Daily quota (free trial) | Up to 2 hours of audio per calendar day |
Valid URL example
https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav
Invalid URL examples
http://127.0.0.1/sample.wav
D:\files\sample.wav
The 24-hour and 3-hour processing limits do not apply if files uploaded within 30 minutes exceed 500 hours in total length. For large-scale audio processing, contact Alibaba Cloud pre-sales.
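The URL constraints listed under Limitations can be partly pre-checked on the client before a task is submitted. The following is a minimal sketch under stated assumptions: `is_valid_file_link` is a hypothetical helper name, and it only covers the checks that work offline (scheme, no spaces, domain name instead of IP address). Whether the URL is actually reachable from the public internet still has to be verified separately.

```python
import ipaddress
from urllib.parse import urlparse


def is_valid_file_link(url: str) -> bool:
    """Return True if url passes the URL constraints that can be checked offline."""
    # Spaces are not allowed anywhere in the URL.
    if " " in url:
        return False
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    # Only HTTP and HTTPS URLs are accepted.
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname
    if not host:
        return False
    # The host must be a domain name, not an IP address.
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True  # not an IP literal, so treat it as a domain name
    return False


print(is_valid_file_link("http://127.0.0.1/sample.wav"))  # False
print(is_valid_file_link(r"D:\files\sample.wav"))         # False
```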
How it works
The recording file recognition service uses the Alibaba Cloud POP API (remote procedure call style). The client sends requests over HTTP, and recording files must be accessible via a public URL.
Two operations are available:
Submit a recognition task (POST): Send the recording file URL and configuration parameters. The server returns a task ID.
Query the recognition result (GET): Use the task ID to poll the result, or receive it via callback if you enable the callback method.
The query operation supports up to 100 queries per second (QPS). If you exceed this limit, the server returns Throttling.User : Request was denied due to user flow control. Set a polling interval to stay within the limit.
Choose a retrieval method
| Method | How it works | When to use |
|---|---|---|
| Polling | Submit a task, then periodically query the result using the task ID. | Default. Works in all environments. |
| Callback | Submit a task with a callback URL. The server POSTs the result to that URL when processing is complete. | Use when you can expose a publicly reachable HTTP or HTTPS endpoint. |
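The polling method can be sketched as a loop that queries at a fixed interval until the task leaves the RUNNING or QUEUEING state. This is an illustrative sketch, not the official SDK: `poll_until_done`, its parameters, and the default interval are assumptions, and `query` stands in for whatever function you use to call the GET operation and return the parsed JSON response.

```python
import time
from typing import Callable

# Status messages that mean the task has not finished yet.
PENDING_STATUSES = {"RUNNING", "QUEUEING"}


def poll_until_done(
    query: Callable[[str], dict],
    task_id: str,
    interval_seconds: float = 10.0,
    max_attempts: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call query(task_id) until the task is no longer RUNNING or QUEUEING."""
    for attempt in range(max_attempts):
        response = query(task_id)
        if response.get("StatusText") not in PENDING_STATUSES:
            return response  # SUCCESS or an error status
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # fixed interval keeps the query rate low
    raise TimeoutError(f"task {task_id} still pending after {max_attempts} queries")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. With a single task and an interval of several seconds, the query rate stays far below the 100 QPS limit.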
Prerequisites
Before submitting a recognition task:
Check the format and sampling rate of your audio file. Select an appropriate scenario and model in the Intelligent Speech Interaction console based on your use case.
Store the audio file in Alibaba Cloud Object Storage Service (OSS) or on a publicly accessible file server.
Public OSS file: Get the OSS URL directly. See Public read object.
Private OSS file: Generate a presigned URL with a validity period using the SDK. See Private object.
Custom file server: Make sure the Content-Length field in the HTTP response header matches the actual file size, or the download will fail.
Submit a recognition task
Method: POST
Set the request parameters as a JSON string in the request body.
Request body example
{
"appkey": "your-appkey",
"file_link": "https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav",
"version": "4.0",
"auto_split": false,
"enable_words": false,
"enable_sample_rate_adaptive": true,
"valid_times": [
{
"begin_time": 200,
"end_time": 2000,
"channel_id": 0
}
]
}
Request parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| appkey | String | Yes | The appkey of your project in the Intelligent Speech Interaction console. |
| file_link | String | Yes | The URL of the recording file. Make sure the project's scenario and model match the recording file. |
| version | String | Yes | The service version. Default value: 2.0. Set to 4.0 for new integrations. |
| enable_words | Boolean | No | Specifies whether to return word-level recognition results. Default value: false. Requires version: 4.0. |
| enable_sample_rate_adaptive | Boolean | No | Specifies whether to automatically downsample audio with a sampling rate above 16,000 Hz. Default value: false. Requires version: 4.0. |
| enable_callback | Boolean | No | Specifies whether to use the callback method. Default value: false. |
| callback_url | String | No | The callback URL. Required when enable_callback is true. Must be an HTTP or HTTPS URL with a domain name (not an IP address). |
| auto_split | Boolean | No | Specifies whether to enable automatic track splitting. When enabled, the server identifies the speaker of each sentence using the ChannelId field. Supports mono audio at 8,000 Hz only. Cannot be set to true when enable_unify_post is true. |
| enable_unify_post | Boolean | No | Specifies whether to enable post-processing. Default value: false. Cannot be set to true when auto_split is true. |
| enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN), which converts Chinese numerals to Arabic numerals. Default value: false. Requires version: 4.0 and enable_unify_post: true. ITN does not apply to word-level results. |
| enable_disfluency | Boolean | No | Specifies whether to enable disfluency detection. Default value: false. Requires version: 4.0 and enable_unify_post: true. |
| valid_times | List\<ValidTime\> | No | The time ranges within the audio track that require speech recognition. |
| max_end_silence | Integer | No | The maximum end-of-sentence silence duration. Default value: 450. Unit: milliseconds. |
| max_single_segment_time | Integer | No | The maximum duration of a single sentence. Minimum value: 10000. Default value: 20000. Unit: milliseconds. |
| customization_id | String | No | The ID of the custom linguistic model created via the POP API. |
| class_vocabulary_id | String | No | The ID of the categorized hotword vocabulary. |
| vocabulary_id | String | No | The ID of the extensive hotword vocabulary. |
ValidTime object
| Parameter | Type | Required | Description |
|---|---|---|---|
| begin_time | Int | Yes | The start time offset of the time range. Unit: milliseconds. |
| end_time | Int | Yes | The end time offset of the time range. Unit: milliseconds. |
| channel_id | Int | Yes | The audio track to which the time range applies. Values start from 0. |
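Several request parameters constrain each other, so it can help to check the combinations before sending the request. The sketch below builds the JSON request body and rejects the invalid combinations described in the parameter table. `build_task_body` is a hypothetical helper, not part of any SDK, and it defaults `version` to 4.0 on the assumption that this is a new integration.

```python
import json


def build_task_body(appkey: str, file_link: str, **options) -> str:
    """Return the JSON request body, rejecting documented invalid combinations."""
    task = {"appkey": appkey, "file_link": file_link, "version": "4.0", **options}
    is_v4 = task["version"] == "4.0"

    if task.get("auto_split") and task.get("enable_unify_post"):
        raise ValueError("auto_split and enable_unify_post cannot both be true")
    if task.get("enable_callback") and not task.get("callback_url"):
        raise ValueError("callback_url is required when enable_callback is true")
    # These options are only available in version 4.0.
    for flag in ("enable_words", "enable_sample_rate_adaptive"):
        if task.get(flag) and not is_v4:
            raise ValueError(f"{flag} requires version 4.0")
    # ITN and disfluency detection need version 4.0 plus post-processing.
    for flag in ("enable_inverse_text_normalization", "enable_disfluency"):
        if task.get(flag) and not (is_v4 and task.get("enable_unify_post")):
            raise ValueError(f"{flag} requires version 4.0 and enable_unify_post")
    if task.get("max_single_segment_time", 10000) < 10000:
        raise ValueError("max_single_segment_time must be at least 10000 ms")
    return json.dumps(task)


body = build_task_body(
    "your-appkey",
    "https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav",
    enable_words=True,
    valid_times=[{"begin_time": 200, "end_time": 2000, "channel_id": 0}],
)
```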
Response parameters
An HTTP 200 status code indicates the request was accepted.
| Parameter | Type | Description |
|---|---|---|
| TaskId | String | The task ID. Use this to query the recognition result. |
| RequestId | String | The request ID, for debugging. |
| StatusCode | Int | The status code. |
| StatusText | String | The status message. |
Response example
{
"TaskId": "4b56f0c4b7e611e88f34c33c2a60****",
"RequestId": "E4B183CC-6CFE-411E-A547-D877F7BD****",
"StatusText": "SUCCESS",
"StatusCode": 21050000
}
Query the recognition result
Method: GET
Pass the task ID returned by the POST operation as a request parameter. Poll at a reasonable interval to stay within the 100 QPS limit.
Request parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| TaskId | String | Yes | The task ID returned by the submit operation. |
Response parameters
An HTTP 200 status code indicates the query request was received.
| Parameter | Type | Description |
|---|---|---|
| TaskId | String | The task ID. |
| StatusCode | Int | The status code. |
| StatusText | String | The status message. |
| RequestId | String | The request ID, for debugging. |
| Result | Object | The recognition result. Present only when StatusText is SUCCESS. |
| Sentences | List\<SentenceResult\> | The sentence-level recognition results. Present only when StatusText is SUCCESS. |
| Words | List\<WordResult\> | The word-level recognition results. Present only when enable_words is true and version is 4.0. |
| BizDuration | Long | The total duration of the recognized audio. Unit: milliseconds. |
| SolveTime | Long | The timestamp when the recognition task completed. Unit: milliseconds. |
SentenceResult object
| Parameter | Type | Description |
|---|---|---|
| ChannelId | Int | The audio track the sentence belongs to. |
| BeginTime | Int | The start time offset of the sentence. Unit: milliseconds. |
| EndTime | Int | The end time offset of the sentence. Unit: milliseconds. |
| Text | String | The recognized text of the sentence. |
| EmotionValue | Int | The emotion intensity, calculated as volume decibels divided by 10. Valid values: 1–10. Higher values indicate stronger emotion. |
| SilenceDuration | Int | The silence duration between this sentence and the previous one. Unit: seconds. |
| SpeechRate | Int | The average speech rate of the sentence. Unit: words per minute. |
WordResult object
| Parameter | Type | Description |
|---|---|---|
| ChannelId | Int | The audio track the word belongs to. |
| BeginTime | Int | The start time of the word. Unit: milliseconds. |
| EndTime | Int | The end time of the word. Unit: milliseconds. |
| Word | String | The recognized word. |
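Because sentences and words are returned as two separate lists, a client that wants per-sentence word timings has to match them up itself. The sketch below shows one possible approach, assuming that a word belongs to a sentence when both share a ChannelId and the word's time range lies inside the sentence's time range. That rule is an assumption for illustration, and `words_in_sentence` is a hypothetical helper.

```python
def words_in_sentence(sentence: dict, words: list) -> list:
    """Return the words on the sentence's channel that fall inside its time range."""
    return [
        word["Word"]
        for word in words
        if word["ChannelId"] == sentence["ChannelId"]
        and sentence["BeginTime"] <= word["BeginTime"]
        and word["EndTime"] <= sentence["EndTime"]
    ]


sentence = {"ChannelId": 0, "BeginTime": 340, "EndTime": 2365, "Text": "Weather in Beijing"}
words = [
    {"ChannelId": 0, "Word": "Weather", "BeginTime": 640, "EndTime": 940},
    {"ChannelId": 0, "Word": "in", "BeginTime": 940, "EndTime": 1120},
    {"ChannelId": 0, "Word": "Beijing", "BeginTime": 1120, "EndTime": 2020},
]
print(words_in_sentence(sentence, words))  # ['Weather', 'in', 'Beijing']
```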
Response examples
Task completed successfully (single-track file nls-sample-16k.wav)
{
"TaskId": "d429dd7dd75711e89305ab6170fe****",
"RequestId": "9240D669-6485-4DCC-896A-F8B31F94****",
"StatusText": "SUCCESS",
"BizDuration": 2956,
"SolveTime": 1540363288472,
"StatusCode": 21050000,
"Result": {
"Sentences": [{
"EndTime": 2365,
"SilenceDuration": 0,
"BeginTime": 340,
"Text": "Weather in Beijing",
"ChannelId": 0,
"SpeechRate": 177,
"EmotionValue": 5.0
}]
}
}
Callback response (version 4.0, enable_callback: true)
The callback response format matches the polling response format.
{
"Result": {
"Sentences": [{
"EndTime": 2365,
"SilenceDuration": 0,
"BeginTime": 340,
"Text": "Weather in Beijing",
"ChannelId": 0,
"SpeechRate": 177,
"EmotionValue": 5.0
}]
},
"TaskId": "36d01b244ad811e9952db7bb7ed2****",
"StatusCode": 21050000,
"StatusText": "SUCCESS",
"RequestTime": 1553062810452,
"SolveTime": 1553062810831,
"BizDuration": 2956
}
RequestTime is the timestamp when the recognition request was submitted, in milliseconds. For example, a value of 1553062810452 indicates 14:20:10 on March 20, 2019, UTC+8. SolveTime is the timestamp when the task completed, in milliseconds.
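The millisecond timestamps can be converted to readable times with the standard library. The sketch below is only an illustration: `ms_to_utc8` is a hypothetical helper name, and UTC+8 is chosen to match the example above.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))


def ms_to_utc8(timestamp_ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to an aware datetime in UTC+8."""
    seconds, millis = divmod(timestamp_ms, 1000)
    # Add the milliseconds separately to avoid floating-point rounding.
    return datetime.fromtimestamp(seconds, tz=UTC_PLUS_8) + timedelta(milliseconds=millis)


request_time = ms_to_utc8(1553062810452)
solve_time = ms_to_utc8(1553062810831)
print(request_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-03-20 14:20:10
print(solve_time - request_time)                   # 0:00:00.379000
```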
Task queuing
{
"TaskId": "c7274235b7e611e88f34c33c2a60****",
"RequestId": "981AD922-0655-46B0-8C6A-5C836822****",
"StatusText": "QUEUEING",
"StatusCode": 21050002
}
Task running
{
"TaskId": "c7274235b7e611e88f34c33c2a60****",
"RequestId": "8E908ED2-867F-457E-82BF-4756194A****",
"StatusText": "RUNNING",
"BizDuration": 0,
"StatusCode": 21050001
}
File download failed
{
"TaskId": "4cf25b7eb7e711e88f34c33c2a60****",
"RequestId": "098BF27C-4CBA-45FF-BD11-3F532F26****",
"StatusText": "FILE_DOWNLOAD_FAILED",
"BizDuration": 0,
"SolveTime": 1536906469146,
"StatusCode": 41050002
}
Word-level results (enable_words: true, version: 4.0)
Word-level results are included alongside sentence-level results. The polling and callback responses use the same format.
{
"StatusCode": 21050000,
"Result": {
"Sentences": [{
"SilenceDuration": 0,
"EmotionValue": 5.0,
"ChannelId": 0,
"Text": "Weather in Beijing",
"BeginTime": 340,
"EndTime": 2365,
"SpeechRate": 177
}],
"Words": [{
"ChannelId": 0,
"Word": "Weather",
"BeginTime": 640,
"EndTime": 940
}, {
"ChannelId": 0,
"Word": "in",
"BeginTime": 940,
"EndTime": 1120
}, {
"ChannelId": 0,
"Word": "Beijing",
"BeginTime": 1120,
"EndTime": 2020
}]
},
"SolveTime": 1553236968873,
"StatusText": "SUCCESS",
"RequestId": "027B126B-4AC8-4C98-9FEC-A031158F****",
"TaskId": "b505e78c4c6d11e9a213e11db149****",
"BizDuration": 2956
}
Service status codes
Normal status codes
| Status code | Status message | Description | Action |
|---|---|---|---|
| 21050000 | SUCCESS | The task completed successfully. | None required. |
| 21050001 | RUNNING | The task is running. | Query again later. |
| 21050002 | QUEUEING | The task is queuing. | Query again later. |
| 21050003 | SUCCESS_WITH_NO_VALID_FRAGMENT | The task succeeded, but no speech data was detected. | Check whether the audio contains speech or whether the speech duration is too short. |
Error codes
Status codes starting with 4 are client errors. Status codes starting with 5 are server errors.
| Status code | Status message | Description | Solution |
|---|---|---|---|
| 41050001 | USER_BIZDURATION_QUOTA_EXCEED | Daily audio quota exceeded. | Email nls_support@service.aliyun.com to increase your quota. |
| 41050002 | FILE_DOWNLOAD_FAILED | The audio file could not be downloaded. | Check that the URL is correct and publicly accessible. |
| 41050003 | FILE_CHECK_FAILED | The audio file format is invalid. | Check whether the recording file is a single-track or dual-track file in WAV or MP3 format. |
| 41050004 | FILE_TOO_LARGE | The audio file exceeds the size limit. | Check that the file is no larger than 512 MB. |
| 41050005 | FILE_NORMALIZE_FAILED | Audio normalization failed. | Check that the file is not damaged and can be played. |
| 41050006 | FILE_PARSE_FAILED | Audio parsing failed. | Check that the file is not damaged and can be played. |
| 41050007 | MKV_PARSE_FAILED | MKV parsing failed. | Check that the file is not damaged and can be played. |
| 41050008 | UNSUPPORTED_SAMPLE_RATE | The audio sampling rate is not supported. | Check that the file's sampling rate matches the automatic speech recognition (ASR) model bound to your appkey in the Intelligent Speech Interaction console. |
| 41050009 | UNSUPPORTED_ASR_GROUP | The ASR group is not supported. | Check that the appkey belongs to the same Alibaba Cloud account as the AccessKey pair. |
| 41050010 | FILE_TRANS_TASK_EXPIRED | The recognition task has expired. | Check that the task ID is valid and has not expired. |
| 41050011 | REQUEST_INVALID_FILE_URL_VALUE | The file_link parameter is invalid. | Check the file_link format. |
| 41050012 | REQUEST_INVALID_CALLBACK_VALUE | The callback_url parameter is invalid. | Check the callback_url format. |
| 41050013 | REQUEST_PARAMETER_INVALID | The request parameters are invalid. | Check that the request body is a valid JSON string. |
| 41050014 | REQUEST_EMPTY_APPKEY_VALUE | The appkey parameter is missing. | Add the appkey parameter to the request. |
| 41050015 | REQUEST_APPKEY_UNREGISTERED | The appkey parameter is invalid. | Check that the appkey is valid and belongs to the same Alibaba Cloud account as the AccessKey ID. |
| 41050021 | RAM_CHECK_FAILED | RAM user authentication failed. | Check that the RAM user has permission to call the Intelligent Speech Interaction API. |
| 41050023 | CONTENT_LENGTH_CHECK_FAILED | The Content-Length header is invalid. | Check that the Content-Length value in the HTTP response header matches the actual file size. |
| 41050024 | FILE_404_NOT_FOUND | The audio file does not exist at the specified URL. | Check that the file exists at the URL. |
| 41050025 | FILE_403_FORBIDDEN | Access to the audio file is denied. | Check that the file is publicly accessible. |
| 41050026 | FILE_SERVER_ERROR | A file server error occurred. | Check that the file server is operating correctly. |
| 51050000 | INTERNAL_ERROR | An internal error occurred. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050001 | VAD_FAILED | Voice activity detection (VAD) failed. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050002 | RECOGNIZE_FAILED | ASR failed. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050003 | RECOGNIZE_INTERRUPT | ASR was interrupted. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050004 | OFFER_INTERRUPT | The task could not be written to the queue. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050005 | FILE_TRANS_TIMEOUT | The task timed out. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050006 | FRAGMENT_FAILED | Multi-channel audio conversion to mono failed. | Ignore if intermittent. Submit a ticket if the error recurs. |
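A client usually needs to decide whether to keep polling, read the result, or report an error. One way to sketch that decision from the status codes above is shown below. `classify_status` and its return labels are hypothetical, and treating SUCCESS_WITH_NO_VALID_FRAGMENT as a finished task is a design choice made for this example.

```python
def classify_status(status_code: int) -> str:
    """Map a service status code to a coarse category."""
    if status_code in (21050000, 21050003):
        return "done"          # SUCCESS or SUCCESS_WITH_NO_VALID_FRAGMENT
    if status_code in (21050001, 21050002):
        return "pending"       # RUNNING or QUEUEING: query again later
    leading_digit = str(status_code)[0]
    if leading_digit == "4":
        return "client_error"  # fix the request or the file, then resubmit
    if leading_digit == "5":
        return "server_error"  # retry; submit a ticket if the error recurs
    return "unknown"


print(classify_status(21050002))  # pending
print(classify_status(41050002))  # client_error
```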
Version notes
If you activated the recording file recognition service without setting version to 4.0, the service uses version 2.0 by default. You can continue to use this version. If you are a new user, set version to 4.0.
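Key difference: In version 2.0, the callback response uses snake_case field names (for example, task_id) and a different structure from the polling response. In version 4.0, the callback and polling responses use the same format, with PascalCase field names (for example, TaskId).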
Version 2.0 callback response example
{
"result": [{
"begin_time": 340,
"channel_id": 0,
"emotion_value": 5.0,
"end_time": 2365,
"silence_duration": 0,
"speech_rate": 177,
"text": "Weather in Beijing"
}],
"task_id": "3f5d4c0c399511e98dc025f34473****",
"status_code": 21050000,
"status_text": "SUCCESS",
"request_time": 1551164878830,
"solve_time": 1551164879230,
"biz_duration": 2956
}
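If a callback endpoint has to accept both versions, the version 2.0 payload can be converted to the version 4.0 layout before further processing. The sketch below infers the field mapping from the examples in this document (snake_case names become PascalCase, and the top-level result list becomes Result.Sentences). Both function names are hypothetical, and fields that do not appear in the examples are not covered.

```python
def snake_to_pascal(name: str) -> str:
    """Convert a snake_case name such as task_id to PascalCase (TaskId)."""
    return "".join(part.capitalize() for part in name.split("_"))


def normalize_v2_callback(payload: dict) -> dict:
    """Convert a version 2.0 callback payload to the version 4.0 field layout."""
    normalized = {
        snake_to_pascal(key): value
        for key, value in payload.items()
        if key != "result"
    }
    # Version 2.0 puts sentences in a top-level list; 4.0 nests them in Result.
    if "result" in payload:
        normalized["Result"] = {
            "Sentences": [
                {snake_to_pascal(key): value for key, value in sentence.items()}
                for sentence in payload["result"]
            ]
        }
    return normalized


v2_payload = {
    "result": [{"begin_time": 340, "channel_id": 0, "end_time": 2365, "text": "Weather in Beijing"}],
    "task_id": "3f5d4c0c399511e98dc025f34473****",
    "status_code": 21050000,
    "status_text": "SUCCESS",
}
v4_payload = normalize_v2_callback(v2_payload)
print(v4_payload["Result"]["Sentences"][0]["Text"])  # Weather in Beijing
```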