API reference for recording file recognition

The recording file recognition service transcribes audio files submitted as HTTP or HTTPS URLs. It supports two retrieval methods—polling and callback—and processes files asynchronously rather than in real time.

Supported features

Single-track and dual-track WAV and MP3 audio files
Two retrieval methods: polling and callback
Custom linguistic models and hotword vocabularies
Multiple languages: Chinese Mandarin, Chinese dialects, and English

Limitations

Constraint	Details
URL format	Must be publicly accessible via domain name. IP addresses and spaces are not allowed.
File size	512 MB maximum
Processing time (free trial)	Recognition completes within 24 hours
Processing time (Commercial Edition)	Recognition completes within 3 hours
Result retention	72 hours
Daily quota (free trial)	Up to 2 hours of audio per calendar day

Valid URL example

https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav

Invalid URL examples

http://127.0.0.1/sample.wav
D:\files\sample.wav
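
A client-side pre-check can catch these invalid forms before a task is submitted. The following is a minimal sketch only: `is_valid_file_link` is a hypothetical helper, not part of any Alibaba Cloud SDK, and it cannot confirm that a URL is actually reachable from the public Internet.

```python
import ipaddress
from urllib.parse import urlsplit


def is_valid_file_link(url: str) -> bool:
    """Return True if the URL looks acceptable: HTTP(S), domain-name host, no spaces."""
    # Spaces (or any other whitespace) are not allowed anywhere in the URL.
    if any(ch.isspace() for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    # Local paths such as D:\files\sample.wav fail the scheme check.
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    # Reject hosts that parse as IPv4 or IPv6 literals.
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True  # Not an IP literal, so treat it as a domain name.
    return False
```

Running this check before submission avoids spending a request on a URL that the service would reject with a `file_link` error.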

The 24-hour and 3-hour processing limits do not apply if the files that you upload within a 30-minute period total more than 500 hours of audio. For large-scale audio processing, contact Alibaba Cloud pre-sales.

How it works

The recording file recognition service uses the Alibaba Cloud POP API (remote procedure call style). The client sends requests over HTTP, and recording files must be accessible via a public URL.

Two operations are available:

Submit a recognition task (POST): Send the recording file URL and configuration parameters. The server returns a task ID.
Query the recognition result (GET): Use the task ID to poll the result, or receive it via callback if you enable the callback method.

The query operation supports up to 100 queries per second (QPS). If you exceed this limit, the server returns the error `Throttling.User : Request was denied due to user flow control.` Set a polling interval to stay within the limit.
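
As a rough way to size the polling interval, the arithmetic can be sketched as follows. This assumes that every in-flight task is queried once per interval and that queries are spread evenly over time. `min_poll_interval_seconds` is a hypothetical helper name.

```python
def min_poll_interval_seconds(active_tasks: int, qps_limit: int = 100) -> float:
    """Smallest per-task polling interval that keeps total queries at or under qps_limit.

    Total query rate is active_tasks / interval, so interval >= active_tasks / qps_limit.
    """
    if active_tasks < 0 or qps_limit <= 0:
        raise ValueError("active_tasks must be >= 0 and qps_limit must be > 0")
    return active_tasks / qps_limit


# 2,000 tasks in flight: each task should be polled at most once every 20 seconds.
print(min_poll_interval_seconds(2000))
```

In practice a longer interval than this lower bound leaves headroom for bursts.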

Choose a retrieval method

Method	How it works	When to use
Polling	Submit a task, then periodically query the result using the task ID.	Default. Works in all environments.
Callback	Submit a task with a callback URL. The server POSTs the result to that URL when processing is complete.	Use when you can expose a publicly reachable HTTP or HTTPS endpoint.

Prerequisites

Before submitting a recognition task:

Check the format and sampling rate of your audio file. Select an appropriate scenario and model in the Intelligent Speech Interaction console based on your use case.
Store the audio file in Alibaba Cloud Object Storage Service (OSS) or on a publicly accessible file server.
- Public OSS file: Get the OSS URL directly. See Public read object.
- Private OSS file: Generate a presigned URL with a validity period using the SDK. See Private object.
- Custom file server: Make sure the Content-Length field in the HTTP response header matches the actual file size, or the download will fail.

Submit a recognition task

Method: POST

Set the request parameters as a JSON string in the request body.

Request body example

{
    "appkey": "your-appkey",
    "file_link": "https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav",
    "version": "4.0",
    "auto_split": false,
    "enable_words": false,
    "enable_sample_rate_adaptive": true,
    "valid_times": [
        {
            "begin_time": 200,
            "end_time": 2000,
            "channel_id": 0
        }
    ]
}

Request parameters

Parameter	Type	Required	Description
`appkey`	String	Yes	The appkey of your project in the Intelligent Speech Interaction console.
`file_link`	String	Yes	The URL of the recording file. Make sure the project's scenario and model match the recording file.
`version`	String	Yes	The service version. Default value: `2.0`. Set to `4.0` for new integrations.
`enable_words`	Boolean	No	Specifies whether to return word-level recognition results. Default value: `false`. Requires `version: 4.0`.
`enable_sample_rate_adaptive`	Boolean	No	Specifies whether to automatically downsample audio with a sampling rate above 16,000 Hz. Default value: `false`. Requires `version: 4.0`.
`enable_callback`	Boolean	No	Specifies whether to use the callback method. Default value: `false`.
`callback_url`	String	No	The callback URL. Required when `enable_callback` is `true`. Must be an HTTP or HTTPS URL with a domain name (not an IP address).
`auto_split`	Boolean	No	Specifies whether to enable automatic track splitting. When enabled, the server identifies the speaker of each sentence using the `ChannelId` field. Supports mono audio at 8,000 Hz only. Cannot be set to `true` when `enable_unify_post` is `true`.
`enable_unify_post`	Boolean	No	Specifies whether to enable post-processing. Default value: `false`. Cannot be set to `true` when `auto_split` is `true`.
`enable_inverse_text_normalization`	Boolean	No	Specifies whether to enable inverse text normalization (ITN), which converts Chinese numerals to Arabic numerals. Default value: `false`. Requires `version: 4.0` and `enable_unify_post: true`. ITN does not apply to word-level results.
`enable_disfluency`	Boolean	No	Specifies whether to enable disfluency detection. Default value: `false`. Requires `version: 4.0` and `enable_unify_post: true`.
`valid_times`	List\<ValidTime\>	No	The time ranges within the audio track that require speech recognition.
`max_end_silence`	Integer	No	The maximum end-of-sentence silence duration. Default value: `450`. Unit: milliseconds.
`max_single_segment_time`	Integer	No	The maximum duration of a single sentence. Minimum value: `10000`. Default value: `20000`. Unit: milliseconds.
`customization_id`	String	No	The ID of the custom linguistic model created via the POP API.
`class_vocabulary_id`	String	No	The ID of the categorized hotword vocabulary.
`vocabulary_id`	String	No	The ID of the extensive hotword vocabulary.

ValidTime object

Parameter	Type	Required	Description
`begin_time`	Int	Yes	The start time offset of the time range. Unit: milliseconds.
`end_time`	Int	Yes	The end time offset of the time range. Unit: milliseconds.
`channel_id`	Int	Yes	The audio track to which the time range applies. Values start from `0`.
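
Several of the parameters above constrain one another. The sketch below builds the JSON request body and checks those constraints locally before anything is sent. `build_submit_body` is a hypothetical helper, and the checks only mirror the rules stated in the tables above. The service remains the authority on what it accepts.

```python
import json


def build_submit_body(appkey: str, file_link: str, *, version: str = "4.0", **options) -> str:
    """Return the request body as a JSON string, or raise ValueError on a known conflict."""
    body = {"appkey": appkey, "file_link": file_link, "version": version, **options}

    if body.get("auto_split") and body.get("enable_unify_post"):
        raise ValueError("auto_split and enable_unify_post cannot both be true")
    if body.get("enable_callback") and not body.get("callback_url"):
        raise ValueError("callback_url is required when enable_callback is true")

    # These flags require version 4.0.
    for flag in ("enable_words", "enable_sample_rate_adaptive"):
        if body.get(flag) and version != "4.0":
            raise ValueError(f"{flag} requires version 4.0")
    # These flags require version 4.0 and enable_unify_post.
    for flag in ("enable_inverse_text_normalization", "enable_disfluency"):
        if body.get(flag) and not (version == "4.0" and body.get("enable_unify_post")):
            raise ValueError(f"{flag} requires version 4.0 and enable_unify_post")

    segment = body.get("max_single_segment_time")
    if segment is not None and segment < 10000:
        raise ValueError("max_single_segment_time must be at least 10000 ms")

    # Every ValidTime entry needs all three fields.
    for entry in body.get("valid_times", []):
        missing = {"begin_time", "end_time", "channel_id"} - entry.keys()
        if missing:
            raise ValueError(f"valid_times entry is missing {sorted(missing)}")

    return json.dumps(body)
```

For example, `build_submit_body("your-appkey", url, enable_words=True)` returns a body with `version` set to `4.0`, while passing both `auto_split=True` and `enable_unify_post=True` raises `ValueError`.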

Response parameters

An HTTP 200 status code indicates the request was accepted.

Parameter	Type	Description
`TaskId`	String	The task ID. Use this to query the recognition result.
`RequestId`	String	The request ID, for debugging.
`StatusCode`	Int	The status code.
`StatusText`	String	The status message.

Response example

{
    "TaskId": "4b56f0c4b7e611e88f34c33c2a60****",
    "RequestId": "E4B183CC-6CFE-411E-A547-D877F7BD****",
    "StatusText": "SUCCESS",
    "StatusCode": 21050000
}
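
Client code typically needs only the task ID from this response. The following sketch, with the hypothetical helper name `extract_task_id`, parses the response text and fails loudly when the status is not `SUCCESS`. It assumes that a successful submission always carries status code `21050000`, as in the example above.

```python
import json

SUCCESS_CODE = 21050000  # StatusCode shown in the response example


def extract_task_id(response_text: str) -> str:
    """Return TaskId from a submit response, or raise RuntimeError with the status details."""
    payload = json.loads(response_text)
    if payload.get("StatusCode") != SUCCESS_CODE or not payload.get("TaskId"):
        raise RuntimeError(
            "submit failed: "
            f"{payload.get('StatusCode')} {payload.get('StatusText')} "
            f"(RequestId={payload.get('RequestId')})"
        )
    return payload["TaskId"]
```

Including `RequestId` in the error message keeps the value available for debugging.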

Query the recognition result

Method: GET

Pass the task ID returned by the POST operation as a request parameter. Poll at a reasonable interval to stay within the 100 QPS limit.

Request parameters

Parameter	Type	Required	Description
`TaskId`	String	Yes	The task ID returned by the submit operation.
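
A polling loop built on this operation can be sketched as follows. The signed GET request itself is outside the scope of this sketch: `query_fn` is a hypothetical callable that takes a task ID and returns the parsed JSON response as a dictionary. The interval and attempt limit are illustrative defaults, not documented values.

```python
import time
from typing import Callable

# RUNNING and QUEUEING, as listed under "Service status codes".
PENDING_CODES = {21050001, 21050002}


def wait_for_result(
    task_id: str,
    query_fn: Callable[[str], dict],
    interval_seconds: float = 10.0,
    max_attempts: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task leaves the RUNNING/QUEUEING states, then return the response."""
    for _ in range(max_attempts):
        response = query_fn(task_id)
        if response.get("StatusCode") not in PENDING_CODES:
            return response  # Success or an error code; the caller decides what to do.
        sleep(interval_seconds)
    raise TimeoutError(f"task {task_id} still pending after {max_attempts} queries")
```

Passing `sleep` as a parameter keeps the loop testable without real delays.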

Response parameters

An HTTP 200 status code indicates the query request was received.

Parameter	Type	Description
`TaskId`	String	The task ID.
`StatusCode`	Int	The status code.
`StatusText`	String	The status message.
`RequestId`	String	The request ID, for debugging.
`Result`	Object	The recognition result. Present only when `StatusText` is `SUCCESS`.
`Sentences`	List\<SentenceResult\>	The sentence-level recognition results. Present only when `StatusText` is `SUCCESS`.
`Words`	List\<WordResult\>	The word-level recognition results. Present only when `enable_words` is `true` and `version` is `4.0`.
`BizDuration`	Long	The total duration of the recognized audio. Unit: milliseconds.
`SolveTime`	Long	The timestamp when the recognition task completed. Unit: milliseconds.

SentenceResult object

Parameter	Type	Description
`ChannelId`	Int	The audio track the sentence belongs to.
`BeginTime`	Int	The start time offset of the sentence. Unit: milliseconds.
`EndTime`	Int	The end time offset of the sentence. Unit: milliseconds.
`Text`	String	The recognized text of the sentence.
`EmotionValue`	Int	The emotion intensity, calculated as volume decibels divided by 10. Valid values: 1–10. Higher values indicate stronger emotion. The response examples show the value serialized as a decimal, such as `5.0`.
`SilenceDuration`	Int	The silence duration between this sentence and the previous one. Unit: seconds.
`SpeechRate`	Int	The average speech rate of the sentence. Unit: words per minute.

WordResult object

Parameter	Type	Description
`ChannelId`	Int	The audio track the word belongs to.
`BeginTime`	Int	The start time of the word. Unit: milliseconds.
`EndTime`	Int	The end time of the word. Unit: milliseconds.
`Word`	String	The recognized word.

Response examples

Task completed successfully (single-track file nls-sample-16k.wav)

{
    "TaskId": "d429dd7dd75711e89305ab6170fe****",
    "RequestId": "9240D669-6485-4DCC-896A-F8B31F94****",
    "StatusText": "SUCCESS",
    "BizDuration": 2956,
    "SolveTime": 1540363288472,
    "StatusCode": 21050000,
    "Result": {
        "Sentences": [{
            "EndTime": 2365,
            "SilenceDuration": 0,
            "BeginTime": 340,
            "Text": "Weather in Beijing",
            "ChannelId": 0,
            "SpeechRate": 177,
            "EmotionValue": 5.0
        }]
    }
}
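
To turn a response like this into plain text, the sentences can be ordered by `BeginTime` and grouped by `ChannelId`. The following is a minimal sketch with a hypothetical helper name. The separator is a parameter because a space suits English, while an empty string may suit Chinese better.

```python
from typing import Dict, List


def transcript_by_channel(response: dict, separator: str = " ") -> Dict[int, str]:
    """Join sentence texts per audio track, ordered by start time."""
    sentences = response.get("Result", {}).get("Sentences", [])
    grouped: Dict[int, List[str]] = {}
    for sentence in sorted(sentences, key=lambda s: s["BeginTime"]):
        grouped.setdefault(sentence["ChannelId"], []).append(sentence["Text"])
    return {channel: separator.join(texts) for channel, texts in grouped.items()}
```

For the response above, the function returns a dictionary with one entry for track `0`. A response without a `Result` field, such as a `RUNNING` response, yields an empty dictionary.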

Callback response (version 4.0, enable_callback: true)

The callback response format matches the polling response format.

{
    "Result": {
        "Sentences": [{
            "EndTime": 2365,
            "SilenceDuration": 0,
            "BeginTime": 340,
            "Text": "Weather in Beijing",
            "ChannelId": 0,
            "SpeechRate": 177,
            "EmotionValue": 5.0
        }]
    },
    "TaskId": "36d01b244ad811e9952db7bb7ed2****",
    "StatusCode": 21050000,
    "StatusText": "SUCCESS",
    "RequestTime": 1553062810452,
    "SolveTime": 1553062810831,
    "BizDuration": 2956
}

RequestTime is the timestamp when the recognition request was submitted, in milliseconds. For example, a value of 1553062810452 indicates 14:20:10 on March 20, 2019, UTC+8. SolveTime is the timestamp when the task completed, in milliseconds.
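
The conversion can be reproduced with the Python standard library. The sketch below uses integer arithmetic for the millisecond part to avoid floating-point rounding. `ms_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))


def ms_to_datetime(ms: int, tz: timezone = UTC_PLUS_8) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware datetime."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=tz) + timedelta(milliseconds=millis)


request_time = ms_to_datetime(1553062810452)
solve_time = ms_to_datetime(1553062810831)
print(request_time.isoformat())                      # 2019-03-20T14:20:10.452000+08:00
print((solve_time - request_time).total_seconds())   # 0.379
```

The difference between the two timestamps gives the elapsed time between submission and completion, 379 milliseconds in this example.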

Task queuing

{
    "TaskId": "c7274235b7e611e88f34c33c2a60****",
    "RequestId": "981AD922-0655-46B0-8C6A-5C836822****",
    "StatusText": "QUEUEING",
    "StatusCode": 21050002
}

Task running

{
    "TaskId": "c7274235b7e611e88f34c33c2a60****",
    "RequestId": "8E908ED2-867F-457E-82BF-4756194A****",
    "StatusText": "RUNNING",
    "BizDuration": 0,
    "StatusCode": 21050001
}

File download failed

{
    "TaskId": "4cf25b7eb7e711e88f34c33c2a60****",
    "RequestId": "098BF27C-4CBA-45FF-BD11-3F532F26****",
    "StatusText": "FILE_DOWNLOAD_FAILED",
    "BizDuration": 0,
    "SolveTime": 1536906469146,
    "StatusCode": 41050002
}

Word-level results (enable_words: true, version: 4.0)

Word-level results are included alongside sentence-level results. The polling and callback responses use the same format.

{
    "StatusCode": 21050000,
    "Result": {
        "Sentences": [{
            "SilenceDuration": 0,
            "EmotionValue": 5.0,
            "ChannelId": 0,
            "Text": "Weather in Beijing",
            "BeginTime": 340,
            "EndTime": 2365,
            "SpeechRate": 177
        }],
        "Words": [{
            "ChannelId": 0,
            "Word": "Weather",
            "BeginTime": 640,
            "EndTime": 940
        }, {
            "ChannelId": 0,
            "Word": "in",
            "BeginTime": 940,
            "EndTime": 1120
        }, {
            "ChannelId": 0,
            "Word": "Beijing",
            "BeginTime": 1120,
            "EndTime": 2020
        }]
    },
    "SolveTime": 1553236968873,
    "StatusText": "SUCCESS",
    "RequestId": "027B126B-4AC8-4C98-9FEC-A031158F****",
    "TaskId": "b505e78c4c6d11e9a213e11db149****",
    "BizDuration": 2956
}
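
Because `Words` is a flat list, a client that needs per-sentence word timings has to match words to sentences itself. The sketch below does so by track and time range. It assumes that each word's timestamps fall inside the range of its sentence, as they do in the example above. `words_in_sentence` is a hypothetical helper name.

```python
from typing import List


def words_in_sentence(sentence: dict, words: List[dict]) -> List[str]:
    """Return the words on the same track whose time range lies within the sentence."""
    return [
        w["Word"]
        for w in words
        if w["ChannelId"] == sentence["ChannelId"]
        and sentence["BeginTime"] <= w["BeginTime"]
        and w["EndTime"] <= sentence["EndTime"]
    ]
```

Applied to the sentence and word list above, this returns the three words of the sentence in order.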

Service status codes

Normal status codes

Status code	Status message	Description	Action
21050000	SUCCESS	The task completed successfully.	None required.
21050001	RUNNING	The task is running.	Query again later.
21050002	QUEUEING	The task is queuing.	Query again later.
21050003	SUCCESS_WITH_NO_VALID_FRAGMENT	The task succeeded, but no speech data was detected.	Check whether the audio contains speech or whether the speech duration is too short.

Error codes

Status codes starting with 4 are client errors. Status codes starting with 5 are server errors.

Status code	Status message	Description	Solution
41050001	USER_BIZDURATION_QUOTA_EXCEED	Daily audio quota exceeded.	Email nls_support@service.aliyun.com to increase your quota.
41050002	FILE_DOWNLOAD_FAILED	The audio file could not be downloaded.	Check that the URL is correct and publicly accessible.
41050003	FILE_CHECK_FAILED	The audio file format is invalid.	Check whether the recording file is a single-track or dual-track file in WAV or MP3 format.
41050004	FILE_TOO_LARGE	The audio file exceeds the size limit.	Check that the file is no larger than 512 MB.
41050005	FILE_NORMALIZE_FAILED	Audio normalization failed.	Check that the file is not damaged and can be played.
41050006	FILE_PARSE_FAILED	Audio parsing failed.	Check that the file is not damaged and can be played.
41050007	MKV_PARSE_FAILED	MKV parsing failed.	Check that the file is not damaged and can be played.
41050008	UNSUPPORTED_SAMPLE_RATE	The audio sampling rate is not supported.	Check that the file's sampling rate matches the automatic speech recognition (ASR) model bound to your appkey in the Intelligent Speech Interaction console.
41050009	UNSUPPORTED_ASR_GROUP	The ASR group is not supported.	Check that the appkey belongs to the same Alibaba Cloud account as the AccessKey pair.
41050010	FILE_TRANS_TASK_EXPIRED	The recognition task has expired.	Check that the task ID is valid and has not expired.
41050011	REQUEST_INVALID_FILE_URL_VALUE	The `file_link` parameter is invalid.	Check the `file_link` format.
41050012	REQUEST_INVALID_CALLBACK_VALUE	The `callback_url` parameter is invalid.	Check the `callback_url` format.
41050013	REQUEST_PARAMETER_INVALID	The request parameters are invalid.	Check that the request body is a valid JSON string.
41050014	REQUEST_EMPTY_APPKEY_VALUE	The `appkey` parameter is missing.	Add the `appkey` parameter to the request.
41050015	REQUEST_APPKEY_UNREGISTERED	The `appkey` parameter is invalid.	Check that the appkey is valid and belongs to the same Alibaba Cloud account as the AccessKey ID.
41050021	RAM_CHECK_FAILED	RAM user authentication failed.	Check that the RAM user has permission to call the Intelligent Speech Interaction API.
41050023	CONTENT_LENGTH_CHECK_FAILED	The `Content-Length` header is invalid.	Check that the `Content-Length` value in the HTTP response header matches the actual file size.
41050024	FILE_404_NOT_FOUND	The audio file does not exist at the specified URL.	Check that the file exists at the URL.
41050025	FILE_403_FORBIDDEN	Access to the audio file is denied.	Check that the file is publicly accessible.
41050026	FILE_SERVER_ERROR	A file server error occurred.	Check that the file server is operating correctly.
51050000	INTERNAL_ERROR	An internal error occurred.	Ignore if intermittent. Submit a ticket if the error recurs.
51050001	VAD_FAILED	Voice activity detection (VAD) failed.	Ignore if intermittent. Submit a ticket if the error recurs.
51050002	RECOGNIZE_FAILED	ASR failed.	Ignore if intermittent. Submit a ticket if the error recurs.
51050003	RECOGNIZE_INTERRUPT	ASR was interrupted.	Ignore if intermittent. Submit a ticket if the error recurs.
51050004	OFFER_INTERRUPT	The task could not be written to the queue.	Ignore if intermittent. Submit a ticket if the error recurs.
51050005	FILE_TRANS_TIMEOUT	The task timed out.	Ignore if intermittent. Submit a ticket if the error recurs.
51050006	FRAGMENT_FAILED	Multi-channel audio conversion to mono failed.	Ignore if intermittent. Submit a ticket if the error recurs.
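
The status codes in both tables can be reduced to a handful of client-side outcomes. The following sketch, with the hypothetical function name `classify_status`, uses the specific normal codes and the leading digit for errors. How to react to each outcome, for example whether to retry server errors, is left to the caller.

```python
PENDING_CODES = {21050001, 21050002}   # RUNNING, QUEUEING
FINISHED_CODES = {21050000, 21050003}  # SUCCESS, SUCCESS_WITH_NO_VALID_FRAGMENT


def classify_status(code: int) -> str:
    """Map a StatusCode to 'pending', 'done', 'client_error', 'server_error', or 'unknown'."""
    if code in PENDING_CODES:
        return "pending"
    if code in FINISHED_CODES:
        return "done"
    leading_digit = str(code)[:1]
    if leading_digit == "4":
        return "client_error"  # Fix the request or the file before resubmitting.
    if leading_digit == "5":
        return "server_error"  # May be intermittent.
    return "unknown"
```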

Version notes

If you activated the recording file recognition service without setting `version` to `4.0`, the service runs as version 2.0 by default, and you can continue to use that version. If you are a new user, set `version` to `4.0`.

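Key difference: In version 2.0, the callback response uses a snake_case JSON format that differs from the polling response and returns the sentence list directly in a `result` array. In version 4.0, the callback and polling responses use the same PascalCase format, with sentences nested under `Result.Sentences`.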

Version 2.0 callback response example

{
    "result": [{
        "begin_time": 340,
        "channel_id": 0,
        "emotion_value": 5.0,
        "end_time": 2365,
        "silence_duration": 0,
        "speech_rate": 177,
        "text": "Weather in Beijing"
    }],
    "task_id": "3f5d4c0c399511e98dc025f34473****",
    "status_code": 21050000,
    "status_text": "SUCCESS",
    "request_time": 1551164878830,
    "solve_time": 1551164879230,
    "biz_duration": 2956
}
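
A callback endpoint that still receives version 2.0 payloads can convert them to the version 4.0 shape so that downstream code handles a single format. The sketch below assumes that the field names differ only in casing and that the `result` array maps to `Result.Sentences`, as the examples in this document suggest. `normalize_v2_callback` is a hypothetical helper name.

```python
def _to_pascal_case(name: str) -> str:
    """Convert snake_case to PascalCase, for example task_id -> TaskId."""
    return "".join(part.capitalize() for part in name.split("_"))


def normalize_v2_callback(payload: dict) -> dict:
    """Return a version 4.0-style dictionary built from a version 2.0 callback payload."""
    normalized = {
        _to_pascal_case(key): value
        for key, value in payload.items()
        if key != "result"
    }
    normalized["Result"] = {
        "Sentences": [
            {_to_pascal_case(key): value for key, value in sentence.items()}
            for sentence in payload.get("result", [])
        ]
    }
    return normalized
```

With the payload above, the output contains `TaskId`, `StatusCode`, `StatusText`, `RequestTime`, `SolveTime`, and `BizDuration` at the top level and one sentence under `Result.Sentences`.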