API reference

The recording file recognition service transcribes audio files submitted as HTTP or HTTPS URLs. Files are processed asynchronously rather than in real time, and you retrieve the result either by polling or through a callback.

Supported features

  • Single-track WAV and MP3 audio files

  • Two retrieval methods: polling and callback

  • Custom linguistic models and hotword vocabularies

  • Multiple languages: Chinese Mandarin, Chinese dialects, and English

Limitations

| Constraint | Details |
| --- | --- |
| URL format | Must be publicly accessible via domain name. IP addresses and spaces are not allowed. |
| File size | 512 MB maximum |
| Processing time (free trial) | Recognition completes within 24 hours |
| Processing time (Commercial Edition) | Recognition completes within 3 hours |
| Result retention | 72 hours |
| Daily quota (free trial) | Up to 2 hours of audio per calendar day |

Valid URL example

https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav

Invalid URL examples

http://127.0.0.1/sample.wav
D:\files\sample.wav

The 24-hour and 3-hour processing limits do not apply if files uploaded within 30 minutes exceed 500 hours in total length. For large-scale audio processing, contact Alibaba Cloud pre-sales.
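
The URL constraints listed under Limitations can be checked on the client before a task is submitted. The following is a minimal sketch using only the Python standard library; the function name `is_acceptable_file_url` is hypothetical and not part of the service. It cannot verify that the URL is actually reachable from the public internet (for example, it would accept `localhost`), so treat it as a pre-check only.

```python
import ipaddress
from urllib.parse import urlsplit


def is_acceptable_file_url(url: str) -> bool:
    """Pre-check a recording file URL against the documented constraints."""
    # Spaces are not allowed anywhere in the URL.
    if " " in url:
        return False
    parts = urlsplit(url)
    # Only HTTP and HTTPS URLs are accepted.
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    # The host must be a domain name, not an IP address.
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True  # not parseable as an IP address, so treat it as a domain name
    return False
```

With the examples above, the OSS URL passes the check, while `http://127.0.0.1/sample.wav` (IP address) and `D:\files\sample.wav` (not an HTTP or HTTPS URL) are rejected.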

How it works

The recording file recognition service uses the Alibaba Cloud POP API (remote procedure call style). The client sends requests over HTTP, and recording files must be accessible via a public URL.

Two operations are available:

  • Submit a recognition task (POST): Send the recording file URL and configuration parameters. The server returns a task ID.

  • Query the recognition result (GET): Use the task ID to poll the result, or receive it via callback if you enable the callback method.

The query operation supports up to 100 queries per second (QPS). If you exceed this limit, the server returns Throttling.User : Request was denied due to user flow control. Set a polling interval to stay within the limit.
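
As a rough worked example of choosing a polling interval: if N tasks are each queried once every T seconds, the query rate is N / T per second, so staying within the limit requires T ≥ N / 100. The sketch below assumes the 100 QPS limit applies to all query calls from one account combined, which the text above does not state explicitly; the function name is hypothetical.

```python
def min_poll_interval_seconds(pending_tasks: int, qps_limit: int = 100) -> float:
    """Smallest per-task polling interval that keeps the total query rate within qps_limit."""
    if pending_tasks < 0 or qps_limit <= 0:
        raise ValueError("pending_tasks must be >= 0 and qps_limit must be > 0")
    # N tasks polled every T seconds produce N / T queries per second.
    return pending_tasks / qps_limit
```

For 1,500 pending tasks this gives 15 seconds; in practice a longer interval leaves headroom for retries.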

Choose a retrieval method

| Method | How it works | When to use |
| --- | --- | --- |
| Polling | Submit a task, then periodically query the result using the task ID. | Default. Works in all environments. |
| Callback | Submit a task with a callback URL. The server POSTs the result to that URL when processing is complete. | Use when you can expose a publicly reachable HTTP or HTTPS endpoint. |
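
For the callback method, the receiving endpoint can be sketched with Python's built-in `http.server`. This is an illustration under assumptions, not the service's prescribed receiver: it assumes the POST body is UTF-8 JSON in the version 4.0 format, and replying with an empty HTTP 200 is a guess at what the service expects. The names `parse_callback_body`, `CallbackHandler`, and `serve` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback_body(raw: bytes) -> dict:
    """Decode the JSON body POSTed to the callback URL (assumed UTF-8)."""
    return json.loads(raw.decode("utf-8"))


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the sender.
        length = int(self.headers.get("Content-Length", 0))
        result = parse_callback_body(self.rfile.read(length))
        # Version 4.0 callbacks carry TaskId and StatusText, like polling responses.
        print(result.get("TaskId"), result.get("StatusText"))
        # Assumption: an empty 200 response acknowledges the callback.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

A production receiver would sit behind a public domain name, since `callback_url` cannot be an IP address.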

Prerequisites

Before submitting a recognition task:

  1. Check the format and sampling rate of your audio file. Select an appropriate scenario and model in the Intelligent Speech Interaction console based on your use case.

  2. Store the audio file in Alibaba Cloud Object Storage Service (OSS) or on a publicly accessible file server.

    • Public OSS file: Get the OSS URL directly. See Public read object.

    • Private OSS file: Generate a presigned URL with a validity period using the SDK. See Private object.

    • Custom file server: Make sure the Content-Length field in the HTTP response header matches the actual file size, or the download will fail.

Submit a recognition task

Method: POST

Set the request parameters as a JSON string in the request body.

Request body example

{
    "appkey": "your-appkey",
    "file_link": "https://aliyun-nls.oss-cn-hangzhou.aliyuncs.com/asr/fileASR/examples/nls-sample-16k.wav",
    "version": "4.0",
    "auto_split": false,
    "enable_words": false,
    "enable_sample_rate_adaptive": true,
    "valid_times": [
        {
            "begin_time": 200,
            "end_time": 2000,
            "channel_id": 0
        }
    ]
}

Request parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| appkey | String | Yes | The appkey of your project in the Intelligent Speech Interaction console. |
| file_link | String | Yes | The URL of the recording file. Make sure the project's scenario and model match the recording file. |
| version | String | Yes | The service version. Default value: 2.0. Set to 4.0 for new integrations. |
| enable_words | Boolean | No | Specifies whether to return word-level recognition results. Default value: false. Requires version: 4.0. |
| enable_sample_rate_adaptive | Boolean | No | Specifies whether to automatically downsample audio with a sampling rate above 16,000 Hz. Default value: false. Requires version: 4.0. |
| enable_callback | Boolean | No | Specifies whether to use the callback method. Default value: false. |
| callback_url | String | No | The callback URL. Required when enable_callback is true. Must be an HTTP or HTTPS URL with a domain name (not an IP address). |
| auto_split | Boolean | No | Specifies whether to enable automatic track splitting. When enabled, the server identifies the speaker of each sentence using the ChannelId field. Supports mono audio at 8,000 Hz only. Cannot be set to true when enable_unify_post is true. |
| enable_unify_post | Boolean | No | Specifies whether to enable post-processing. Default value: false. Cannot be set to true when auto_split is true. |
| enable_inverse_text_normalization | Boolean | No | Specifies whether to enable inverse text normalization (ITN), which converts Chinese numerals to Arabic numerals. Default value: false. Requires version: 4.0 and enable_unify_post: true. ITN does not apply to word-level results. |
| enable_disfluency | Boolean | No | Specifies whether to enable disfluency detection. Default value: false. Requires version: 4.0 and enable_unify_post: true. |
| valid_times | List\<ValidTime\> | No | The time ranges within the audio track that require speech recognition. |
| max_end_silence | Integer | No | The maximum end-of-sentence silence duration. Default value: 450. Unit: milliseconds. |
| max_single_segment_time | Integer | No | The maximum duration of a single sentence. Minimum value: 10000. Default value: 20000. Unit: milliseconds. |
| customization_id | String | No | The ID of the custom linguistic model created via the POP API. |
| class_vocabulary_id | String | No | The ID of the categorized hotword vocabulary. |
| vocabulary_id | String | No | The ID of the extensive hotword vocabulary. |

ValidTime object

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| begin_time | Int | Yes | The start time offset of the time range. Unit: milliseconds. |
| end_time | Int | Yes | The end time offset of the time range. Unit: milliseconds. |
| channel_id | Int | Yes | The audio track to which the time range applies. Values start from 0. |
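
Several request parameters depend on one another (for example, `auto_split` and `enable_unify_post` are mutually exclusive). A minimal sketch of a helper that assembles the JSON request body and rejects combinations the tables above rule out; `build_submit_body` is a hypothetical name, and the server remains the authority on what is valid.

```python
import json


def build_submit_body(appkey: str, file_link: str, version: str = "4.0", **options) -> str:
    """Build the JSON request body, enforcing the documented parameter dependencies."""
    body = {"appkey": appkey, "file_link": file_link, "version": version}
    body.update(options)

    # auto_split and enable_unify_post cannot both be true.
    if body.get("auto_split") and body.get("enable_unify_post"):
        raise ValueError("auto_split and enable_unify_post cannot both be true")
    # callback_url is required when enable_callback is true.
    if body.get("enable_callback") and not body.get("callback_url"):
        raise ValueError("callback_url is required when enable_callback is true")
    # These options require version 4.0.
    for flag in ("enable_words", "enable_sample_rate_adaptive"):
        if body.get(flag) and version != "4.0":
            raise ValueError(f"{flag} requires version 4.0")
    # These options require version 4.0 and enable_unify_post: true.
    for flag in ("enable_inverse_text_normalization", "enable_disfluency"):
        if body.get(flag) and not (version == "4.0" and body.get("enable_unify_post")):
            raise ValueError(f"{flag} requires version 4.0 and enable_unify_post: true")
    # max_single_segment_time has a documented minimum of 10000 ms.
    if "max_single_segment_time" in body and body["max_single_segment_time"] < 10000:
        raise ValueError("max_single_segment_time must be at least 10000")
    return json.dumps(body)
```

For example, `build_submit_body("your-appkey", file_link, enable_words=True)` returns a JSON string with `version` set to `"4.0"`, while passing both `auto_split=True` and `enable_unify_post=True` raises `ValueError`.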

Response parameters

An HTTP 200 status code indicates the request was accepted.

| Parameter | Type | Description |
| --- | --- | --- |
| TaskId | String | The task ID. Use this to query the recognition result. |
| RequestId | String | The request ID, for debugging. |
| StatusCode | Int | The status code. |
| StatusText | String | The status message. |

Response example

{
    "TaskId": "4b56f0c4b7e611e88f34c33c2a60****",
    "RequestId": "E4B183CC-6CFE-411E-A547-D877F7BD****",
    "StatusText": "SUCCESS",
    "StatusCode": 21050000
}

Query the recognition result

Method: GET

Pass the task ID returned by the POST operation as a request parameter. Poll at a reasonable interval to stay within the 100 QPS limit.

Request parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| TaskId | String | Yes | The task ID returned by the submit operation. |
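
The polling flow can be sketched as a loop that queries until the task leaves the `RUNNING` or `QUEUEING` state. The sketch deliberately takes the query call as a parameter (`query`), because the actual request signing and transport depend on the SDK you use; `wait_for_result`, the 10-second default interval, and the 3-hour default timeout (borrowed from the Commercial Edition processing limit) are assumptions for illustration.

```python
import time
from typing import Callable

# Status messages that mean "not finished yet, query again later".
PENDING = {"RUNNING", "QUEUEING"}


def wait_for_result(query: Callable[[str], dict], task_id: str,
                    interval: float = 10.0, timeout: float = 3 * 3600,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll query(task_id) until the task finishes, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        response = query(task_id)
        # Any status other than RUNNING or QUEUEING is final (success or error).
        if response.get("StatusText") not in PENDING:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still pending after {timeout} seconds")
        sleep(interval)
```

The caller inspects `StatusText` or `StatusCode` on the returned response to tell success from failure. The `sleep` parameter exists only so that the loop can be exercised in tests without waiting.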

Response parameters

An HTTP 200 status code indicates the query request was received.

| Parameter | Type | Description |
| --- | --- | --- |
| TaskId | String | The task ID. |
| StatusCode | Int | The status code. |
| StatusText | String | The status message. |
| RequestId | String | The request ID, for debugging. |
| Result | Object | The recognition result. Present only when StatusText is SUCCESS. |
| Sentences | List\<SentenceResult\> | The sentence-level recognition results, nested in Result. Present only when StatusText is SUCCESS. |
| Words | List\<WordResult\> | The word-level recognition results, nested in Result. Present only when enable_words is true and version is 4.0. |
| BizDuration | Long | The total duration of the recognized audio. Unit: milliseconds. |
| SolveTime | Long | The timestamp when the recognition task completed. Unit: milliseconds. |

SentenceResult object

| Parameter | Type | Description |
| --- | --- | --- |
| ChannelId | Int | The audio track the sentence belongs to. |
| BeginTime | Int | The start time offset of the sentence. Unit: milliseconds. |
| EndTime | Int | The end time offset of the sentence. Unit: milliseconds. |
| Text | String | The recognized text of the sentence. |
| EmotionValue | Int | The emotion intensity, calculated as volume decibels divided by 10. Valid values: 1–10. Higher values indicate stronger emotion. |
| SilenceDuration | Int | The silence duration between this sentence and the previous one. Unit: seconds. |
| SpeechRate | Int | The average speech rate of the sentence. Unit: words per minute. |

WordResult object

| Parameter | Type | Description |
| --- | --- | --- |
| ChannelId | Int | The audio track the word belongs to. |
| BeginTime | Int | The start time of the word. Unit: milliseconds. |
| EndTime | Int | The end time of the word. Unit: milliseconds. |
| Word | String | The recognized word. |

Response examples

Task completed successfully (single-track file nls-sample-16k.wav)

{
    "TaskId": "d429dd7dd75711e89305ab6170fe****",
    "RequestId": "9240D669-6485-4DCC-896A-F8B31F94****",
    "StatusText": "SUCCESS",
    "BizDuration": 2956,
    "SolveTime": 1540363288472,
    "StatusCode": 21050000,
    "Result": {
        "Sentences": [{
            "EndTime": 2365,
            "SilenceDuration": 0,
            "BeginTime": 340,
            "Text": "Weather in Beijing",
            "ChannelId": 0,
            "SpeechRate": 177,
            "EmotionValue": 5.0
        }]
    }
}
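
A successful response like the one above can be turned into a readable transcript by walking `Result.Sentences`. A minimal sketch; `format_transcript` and the output layout are illustrative choices, not part of the API.

```python
from typing import List


def format_transcript(response: dict) -> List[str]:
    """Render each sentence as '[ch<channel> <begin>-<end> ms] <text>', ordered by start time."""
    sentences = response.get("Result", {}).get("Sentences", [])
    lines = []
    for s in sorted(sentences, key=lambda s: (s["BeginTime"], s["ChannelId"])):
        lines.append(f"[ch{s['ChannelId']} {s['BeginTime']}-{s['EndTime']} ms] {s['Text']}")
    return lines
```

Applied to the example response, this yields a single line: `[ch0 340-2365 ms] Weather in Beijing`. For responses without a `Result` field (for example `RUNNING`), it returns an empty list.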

Callback response (version 4.0, enable_callback: true)

The callback response format matches the polling response format.

{
    "Result": {
        "Sentences": [{
            "EndTime": 2365,
            "SilenceDuration": 0,
            "BeginTime": 340,
            "Text": "Weather in Beijing",
            "ChannelId": 0,
            "SpeechRate": 177,
            "EmotionValue": 5.0
        }]
    },
    "TaskId": "36d01b244ad811e9952db7bb7ed2****",
    "StatusCode": 21050000,
    "StatusText": "SUCCESS",
    "RequestTime": 1553062810452,
    "SolveTime": 1553062810831,
    "BizDuration": 2956
}

RequestTime is the timestamp when the recognition request was submitted, in milliseconds. For example, a value of 1553062810452 indicates 14:20:10 on March 20, 2019, UTC+8. SolveTime is the timestamp when the task completed, in milliseconds.
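
The timestamp conversion described above can be reproduced with the standard library. This sketch assumes the values are Unix epoch milliseconds, which is consistent with the worked example in the text; `ms_to_utc8` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc8(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware datetime in UTC+8."""
    # Integer timedelta arithmetic avoids floating-point rounding of the millisecond part.
    return (EPOCH + timedelta(milliseconds=ms)).astimezone(UTC8)


request_time = 1553062810452
solve_time = 1553062810831
print(ms_to_utc8(request_time).isoformat())  # 2019-03-20T14:20:10.452000+08:00
print(solve_time - request_time)             # 379 (milliseconds between submission and completion)
```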

Task queuing

{
    "TaskId": "c7274235b7e611e88f34c33c2a60****",
    "RequestId": "981AD922-0655-46B0-8C6A-5C836822****",
    "StatusText": "QUEUEING",
    "StatusCode": 21050002
}

Task running

{
    "TaskId": "c7274235b7e611e88f34c33c2a60****",
    "RequestId": "8E908ED2-867F-457E-82BF-4756194A****",
    "StatusText": "RUNNING",
    "BizDuration": 0,
    "StatusCode": 21050001
}

File download failed

{
    "TaskId": "4cf25b7eb7e711e88f34c33c2a60****",
    "RequestId": "098BF27C-4CBA-45FF-BD11-3F532F26****",
    "StatusText": "FILE_DOWNLOAD_FAILED",
    "BizDuration": 0,
    "SolveTime": 1536906469146,
    "StatusCode": 41050002
}

Word-level results (enable_words: true, version: 4.0)

Word-level results are included alongside sentence-level results. The polling and callback responses use the same format.

{
    "StatusCode": 21050000,
    "Result": {
        "Sentences": [{
            "SilenceDuration": 0,
            "EmotionValue": 5.0,
            "ChannelId": 0,
            "Text": "Weather in Beijing",
            "BeginTime": 340,
            "EndTime": 2365,
            "SpeechRate": 177
        }],
        "Words": [{
            "ChannelId": 0,
            "Word": "Weather",
            "BeginTime": 640,
            "EndTime": 940
        }, {
            "ChannelId": 0,
            "Word": "in",
            "BeginTime": 940,
            "EndTime": 1120
        }, {
            "ChannelId": 0,
            "Word": "Beijing",
            "BeginTime": 1120,
            "EndTime": 2020
        }]
    },
    "SolveTime": 1553236968873,
    "StatusText": "SUCCESS",
    "RequestId": "027B126B-4AC8-4C98-9FEC-A031158F****",
    "TaskId": "b505e78c4c6d11e9a213e11db149****",
    "BizDuration": 2956
}
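
Because `Words` is a flat list alongside `Sentences`, a client that wants the words of each sentence has to match them itself. The sketch below matches on `ChannelId` and on the word's time range falling inside the sentence's time range. That containment rule is an assumption that holds for the example above but is not stated by the API; `words_by_sentence` is a hypothetical name.

```python
from typing import List, Tuple


def words_by_sentence(result: dict) -> List[Tuple[str, List[str]]]:
    """Pair each sentence text with the words whose time range lies inside the sentence."""
    grouped = []
    for sentence in result.get("Sentences", []):
        words = [
            w["Word"]
            for w in result.get("Words", [])
            if w["ChannelId"] == sentence["ChannelId"]
            and sentence["BeginTime"] <= w["BeginTime"]
            and w["EndTime"] <= sentence["EndTime"]
        ]
        grouped.append((sentence["Text"], words))
    return grouped
```

Passing the `Result` object from the example returns one pair: the sentence "Weather in Beijing" with the words "Weather", "in", and "Beijing".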

Service status codes

Normal status codes

| Status code | Status message | Description | Action |
| --- | --- | --- | --- |
| 21050000 | SUCCESS | The task completed successfully. | None required. |
| 21050001 | RUNNING | The task is running. | Query again later. |
| 21050002 | QUEUEING | The task is queuing. | Query again later. |
| 21050003 | SUCCESS_WITH_NO_VALID_FRAGMENT | The task succeeded, but no speech data was detected. | Check whether the audio contains speech or whether the speech duration is too short. |

Error codes

Status codes starting with 4 are client errors. Status codes starting with 5 are server errors.

| Status code | Status message | Description | Solution |
| --- | --- | --- | --- |
| 41050001 | USER_BIZDURATION_QUOTA_EXCEED | Daily audio quota exceeded. | Email nls_support@service.aliyun.com to increase your quota. |
| 41050002 | FILE_DOWNLOAD_FAILED | The audio file could not be downloaded. | Check that the URL is correct and publicly accessible. |
| 41050003 | FILE_CHECK_FAILED | The audio file format is invalid. | Check whether the recording file is a single-track or dual-track file in WAV or MP3 format. |
| 41050004 | FILE_TOO_LARGE | The audio file exceeds the size limit. | Check that the file is no larger than 512 MB. |
| 41050005 | FILE_NORMALIZE_FAILED | Audio normalization failed. | Check that the file is not damaged and can be played. |
| 41050006 | FILE_PARSE_FAILED | Audio parsing failed. | Check that the file is not damaged and can be played. |
| 41050007 | MKV_PARSE_FAILED | MKV parsing failed. | Check that the file is not damaged and can be played. |
| 41050008 | UNSUPPORTED_SAMPLE_RATE | The audio sampling rate is not supported. | Check that the file's sampling rate matches the automatic speech recognition (ASR) model bound to your appkey in the Intelligent Speech Interaction console. |
| 41050009 | UNSUPPORTED_ASR_GROUP | The ASR group is not supported. | Check that the appkey belongs to the same Alibaba Cloud account as the AccessKey pair. |
| 41050010 | FILE_TRANS_TASK_EXPIRED | The recognition task has expired. | Check that the task ID is valid and has not expired. |
| 41050011 | REQUEST_INVALID_FILE_URL_VALUE | The file_link parameter is invalid. | Check the file_link format. |
| 41050012 | REQUEST_INVALID_CALLBACK_VALUE | The callback_url parameter is invalid. | Check the callback_url format. |
| 41050013 | REQUEST_PARAMETER_INVALID | The request parameters are invalid. | Check that the request body is a valid JSON string. |
| 41050014 | REQUEST_EMPTY_APPKEY_VALUE | The appkey parameter is missing. | Add the appkey parameter to the request. |
| 41050015 | REQUEST_APPKEY_UNREGISTERED | The appkey parameter is invalid. | Check that the appkey is valid and belongs to the same Alibaba Cloud account as the AccessKey ID. |
| 41050021 | RAM_CHECK_FAILED | RAM user authentication failed. | Check that the RAM user has permission to call the Intelligent Speech Interaction API. |
| 41050023 | CONTENT_LENGTH_CHECK_FAILED | The Content-Length header is invalid. | Check that the Content-Length value in the HTTP response header matches the actual file size. |
| 41050024 | FILE_404_NOT_FOUND | The audio file does not exist at the specified URL. | Check that the file exists at the URL. |
| 41050025 | FILE_403_FORBIDDEN | Access to the audio file is denied. | Check that the file is publicly accessible. |
| 41050026 | FILE_SERVER_ERROR | A file server error occurred. | Check that the file server is operating correctly. |
| 51050000 | INTERNAL_ERROR | An internal error occurred. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050001 | VAD_FAILED | Voice activity detection (VAD) failed. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050002 | RECOGNIZE_FAILED | ASR failed. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050003 | RECOGNIZE_INTERRUPT | ASR was interrupted. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050004 | OFFER_INTERRUPT | The task could not be written to the queue. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050005 | FILE_TRANS_TIMEOUT | The task timed out. | Ignore if intermittent. Submit a ticket if the error recurs. |
| 51050006 | FRAGMENT_FAILED | Multi-channel audio conversion to mono failed. | Ignore if intermittent. Submit a ticket if the error recurs. |
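
The status codes above fall into a few groups that a client typically handles differently. A minimal sketch of that grouping, based on the leading-digit rule and the normal status codes listed in this section; the function name and the category labels are hypothetical.

```python
def classify_status(status_code: int) -> str:
    """Map a StatusCode to a coarse category for client-side handling."""
    if status_code == 21050000:
        return "done"              # SUCCESS
    if status_code == 21050003:
        return "done_no_speech"    # SUCCESS_WITH_NO_VALID_FRAGMENT
    if status_code in (21050001, 21050002):
        return "pending"           # RUNNING or QUEUEING: query again later
    leading = str(status_code)[0]
    if leading == "4":
        return "client_error"      # fix the request or the file before resubmitting
    if leading == "5":
        return "server_error"      # may be intermittent; resubmit or open a ticket
    return "unknown"
```

For example, `classify_status(41050002)` returns `"client_error"` and `classify_status(51050005)` returns `"server_error"`.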

Version notes

If you activated the recording file recognition service and do not set version to 4.0, the service uses version 2.0 by default. Existing integrations can continue to use version 2.0. New users should set version to 4.0.

Key difference: In version 2.0, the callback response uses a snake_case JSON format (for example, task_id) that differs from the polling response. In version 4.0, the callback and polling responses use the same PascalCase format (for example, TaskId).

Version 2.0 callback response example

{
    "result": [{
        "begin_time": 340,
        "channel_id": 0,
        "emotion_value": 5.0,
        "end_time": 2365,
        "silence_duration": 0,
        "speech_rate": 177,
        "text": "Weather in Beijing"
    }],
    "task_id": "3f5d4c0c399511e98dc025f34473****",
    "status_code": 21050000,
    "status_text": "SUCCESS",
    "request_time": 1551164878830,
    "solve_time": 1551164879230,
    "biz_duration": 2956
}
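
An integration that still receives version 2.0 callbacks can normalize them into the version 4.0 shape so that one code path handles both. The sketch below is inferred from the two callback examples in this document only: it renames snake_case keys to PascalCase and moves the top-level `result` list into `Result.Sentences`. The function names are hypothetical, and fields not shown in the examples are untested.

```python
def to_pascal(name: str) -> str:
    """Convert a snake_case key such as 'task_id' to PascalCase ('TaskId')."""
    return "".join(part.capitalize() for part in name.split("_"))


def normalize_v2_callback(payload: dict) -> dict:
    """Reshape a version 2.0 callback body into the version 4.0 response layout."""
    # Rename every top-level key except 'result', which changes position as well as name.
    normalized = {to_pascal(k): v for k, v in payload.items() if k != "result"}
    normalized["Result"] = {
        "Sentences": [
            {to_pascal(k): v for k, v in sentence.items()}
            for sentence in payload.get("result", [])
        ]
    }
    return normalized
```

After normalization, the example above exposes `TaskId`, `StatusText`, and `Result.Sentences[0].Text` exactly as a version 4.0 callback would.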