Frequently asked questions about the Speech Recognition service - Intelligent Speech Interaction (ISI)

This topic provides answers to frequently asked questions (FAQs) about the speech recognition service.

FAQs about speech recognition fall into the following categories:

Features
Performance
Recognition quality
SDK usage
Billing
- Is a trial available for the Flash Version of audio file transcription?

Features

Failed sentence breaking in real-time transcription

If you are using VAD sentence breaking, the real-time transcription service relies on silence data to detect pauses. If the upstream does not send silence data, the server cannot identify when the speaker has paused. To diagnose this, compare the audio duration processed on the server with the actual audio duration. If the processed duration is shorter, it indicates that the upstream did not send silence data during pauses.
If you use semantic sentence breaking, the performance of the post-processing model might be the cause.

Solution: Continuously send silence data to the server when the user pauses.
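A minimal sketch of producing silence data, assuming 16-bit mono PCM, where silence is simply an all-zero buffer. The function name make_silence and the 100 ms frame length are illustrative choices rather than part of the service API, and sending the bytes is left to whichever SDK or WebSocket client you use.

```python
def make_silence(duration_ms: int, sample_rate: int = 16000, sample_width: int = 2) -> bytes:
    """Return all-zero PCM bytes for mono audio of the given duration."""
    if duration_ms < 0:
        raise ValueError("duration_ms must be non-negative")
    num_samples = sample_rate * duration_ms // 1000
    # bytes(n) creates n zero bytes, which is digital silence for signed PCM.
    return bytes(num_samples * sample_width)


# 100 ms of 16 kHz, 16-bit mono silence is 1600 samples, or 3200 bytes.
frame = make_silence(100)
```

Sending frames like this at the normal audio pace during pauses keeps the processed duration on the server in step with the real audio duration.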

Sentence segmentation

The real-time speech recognition service performs sentence segmentation, while the one-shot speech recognition service processes only one sentence per request.

Offline feature support

No. This service does not support offline speech recognition. All audio data must be sent to the server for recognition.

Offline SDK and local offline recognition

No. The Alibaba Cloud speech recognition service does not provide an offline SDK or local offline recognition. All audio data must be transmitted to the cloud server for processing. Local offline human-machine dialogue with speech-to-text is not supported.

If your use case requires fully offline operation, note that the current speech recognition product does not offer this capability. The device must be connected to the internet so that audio can be sent to the server for recognition.

Supported models

The Intelligent Speech Interaction console lists the available model types in the project function configuration. Models are available for 8 kHz and 16 kHz sampling rates, and each sampling rate includes multiple domain models.

Chinese-English mixed recognition

Yes. The Chinese Mandarin model can recognize audio that contains a mix of Chinese and English.

ITN conversion of Chinese numerals

The system's model determines whether to convert a number to an Arabic numeral. The model does not convert all numbers, as its goal is to match the formats commonly used in standard written text.

`enable_sample_rate_adaptive` vs. `sample_rate`

No, the two parameters are different. For the Flash Version, sample_rate specifies the audio sampling rate, whereas enable_sample_rate_adaptive is an on/off switch. The Flash Version uses a default sampling rate of 16 kHz and does not require this switch.

Distinguishing agent and customer

A speech recognition engine can distinguish between different speakers, but it cannot determine their specific roles, such as an agent or a customer. You need to map the speakers to their roles based on your specific use case. We recommend storing recordings categorized by role.

Punctuation for short utterance recognition

The service adds punctuation by analyzing the audio's acoustic features and running linguistic analysis on the recognized text.

Distinguishing left and right channels

The speech recognition engine cannot distinguish between left and right channels. When you submit multi-channel audio to the speech recognition service, the response uses the channel_id field to identify the audio tracks. If the recording order is consistent, you can use channel_id to map each track to its corresponding channel. For details, see the API reference.
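As an illustration of mapping tracks to roles, the sketch below attaches a role label to each result based on channel_id. The result shape (a list of dicts with channel_id and text keys) and the assignment of channel 0 to the agent are assumptions for this example. Check the API reference for the actual response fields and your own recording setup for the channel order.

```python
def label_roles(sentences: list[dict], channel_roles: dict[int, str]) -> list[dict]:
    """Return copies of the sentences with a 'role' key derived from channel_id."""
    return [
        {**s, "role": channel_roles.get(s["channel_id"], "unknown")}
        for s in sentences
    ]


# Assumed mapping: this only holds if every recording uses the same channel order.
channel_roles = {0: "agent", 1: "customer"}

# Hypothetical recognition results.
results = [
    {"channel_id": 0, "text": "How can I help?"},
    {"channel_id": 1, "text": "My order is late."},
]
labeled = label_roles(results, channel_roles)
```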

Multiple vocabularies

You can pass only one vocabulary_id per request. If you need to use multiple vocabularies, use a custom model. For more information, see Create business-specific hotwords by using the POP API.

Differences between versions 4.0 and 2.0

In early releases of the audio file recognition service (version 2.0 by default), recognition results from the callback and polling methods differed in JSON string style and fields. Version 4.0 updates the callback method, making its recognition results consistent with those of the polling method. Both methods now return a camel case JSON string. For more information, see the interface description.

Supported telephony languages

Currently, 8 kHz telephony speech supports only English, while 16 kHz non-telephony speech supports additional languages. For the list of supported language models, see the features documentation.

Transcribe audio from a URL

Yes, you can use the audio file transcription feature. See the API reference for details.

Audio track ID in real-time transcription

No. The audio track ID is specific to audio file recognition. Real-time transcription processes only single-channel audio, so channel differentiation is unnecessary.

SRT subtitle file generation

Not currently. You need to assemble the subtitle file yourself from the timestamp information in the returned results.

To generate subtitles with accurate timestamps and control line length, use the following parameters:

Set enable_words to true to obtain word-level timestamps. Word-level timestamps are required to build individual SRT entries.
Set sentence_max_length to control the maximum number of characters per subtitle line. The valid range is [4, 50]; the default value of 0 disables this limit.
When enable_subtitle is set to True, the returned subtitle indices (begin_index / end_index) correspond to character positions in the plain-text output (SSML tags are stripped). These indices can be directly mapped back to positions in the original transcript.
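Assembling an SRT file from timestamps can be sketched as follows. This is a minimal example under stated assumptions: the input keys begin_ms, end_ms, and text are hypothetical names for millisecond timestamps and sentence text, so map them from the fields that your recognition result actually returns.

```python
def ms_to_srt_time(ms: int) -> str:
    """Format milliseconds as the SRT timestamp HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(sentences: list[dict]) -> str:
    """Build SRT text from dicts with begin_ms, end_ms, and text keys."""
    entries = []
    for index, s in enumerate(sentences, start=1):
        timing = f"{ms_to_srt_time(s['begin_ms'])} --> {ms_to_srt_time(s['end_ms'])}"
        entries.append(f"{index}\n{timing}\n{s['text']}\n")
    # SRT entries are separated by a blank line.
    return "\n".join(entries)


# Hypothetical input; real field names depend on the API response.
sample = [
    {"begin_ms": 0, "end_ms": 1500, "text": "Hello."},
    {"begin_ms": 1500, "end_ms": 3725, "text": "Welcome to the demo."},
]
srt_text = build_srt(sample)
```

Each entry consists of a running index, a timing line, and the subtitle text. Line length is best controlled upstream with sentence_max_length rather than by re-wrapping text here.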

Supported audio encoding formats

Supported formats vary by service. See the documentation for your specific service for details. You can also use common audio editing software, such as Audacity, to check an audio file's encoding format.

Supported sampling rates

The speech recognition service supports sampling rates of 8,000 Hz and 16,000 Hz. An 8,000 Hz sampling rate is common for call center scenarios. For mobile apps, PC tools, and H5 pages, a sampling rate of 16,000 Hz is standard. If your source audio is recorded at a different sampling rate, such as 32 kHz or 44 kHz, you must first transcode it to 16 kHz.

The exception is audio file transcription, which also accepts higher sampling rates, such as 32 kHz and 44 kHz.

Check audio file sampling rate

Use common audio editing software, such as Audacity, or the open-source command-line tool FFmpeg.
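For uncompressed PCM WAV files, the sampling rate can also be read programmatically. The sketch below uses Python's standard wave module. The helper name wav_info is illustrative, and the in-memory file exists only to make the example self-contained. Compressed formats such as MP3 are not handled by this module.

```python
import io
import wave


def wav_info(source) -> dict:
    """Read sample rate, bit depth, and channel count from a WAV header."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
        }


# Build a tiny in-memory 16 kHz, 16-bit mono WAV to demonstrate.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(bytes(3200))
buffer.seek(0)
info = wav_info(buffer)
```

In practice you would pass a file path, for example wav_info("recording.wav"), instead of the in-memory buffer.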

Supported languages and dialect models

Speech recognition currently supports the following languages and dialect models:

Language

Language	Model name	Sample rate	Punctuation	ITN	Output smoothing	Semantic segmentation	Word-level alignment
English	General - English, Education Livestream - English, Educational Content Analysis - English	16k	Supported	Supported	Supported	Not Supported	Supported
	General Customer Service - English	8k	Supported	Supported	Supported	Not Supported	Not Supported
	Southeast Asian multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Japanese	General - Japanese	16k	Supported	Supported	Not Supported	Not Supported	Supported
Spanish	General - Spanish	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
Spanish	General Customer Service - Spanish	8k	Supported	Supported	Not Supported	Not Supported	Not Supported
Arabic	General - Arabic	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Kazakh	General - Kazakh	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Korean	General - Korean	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
Thai	General - Thai	16k	Not Supported	Not Supported	Not Supported	Not Supported	Not Supported
	General Customer Service - Thai	8k	Not Supported	Not Supported	Not Supported	Not Supported	Not Supported
	Southeast Asian multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Indonesian	General - Indonesian	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
	General Customer Service - Indonesian	8k	Supported	Supported	Not Supported	Not Supported	Not Supported
	Southeast Asian multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Russian	General - Russian	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
Vietnamese	General - Vietnamese	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
	General Customer Service - Vietnamese	8k	Supported	Supported	Not Supported	Not Supported	Not Supported
	Southeast Asian multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
French	General - French	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
German	General - German	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
Italian	General - Italian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Hindi	General - Hindi	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Malay	General - Malay	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
	General Customer Service - Malay	8k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
	Southeast Asian multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Filipino	General - Filipino	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
	General Customer Service - Filipino	8k	Supported	Supported	Not Supported	Not Supported	Not Supported
	Southeast Asian multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Tamil	General - Tamil	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Portuguese	General - Portuguese	16k	Supported	Supported	Not Supported	Not Supported	Not Supported
Turkish	General - Turkish	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Polish	General - Polish	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Ukrainian	General - Ukrainian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Romanian	General - Romanian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Dutch	General - Dutch	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Greek	General - Greek	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Hungarian	General - Hungarian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Javanese	General - Javanese	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Bengali	General - Bengali	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Burmese	General - Burmese	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Lao	General - Lao	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Swahili	General - Swahili	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Azerbaijani	General - Azerbaijani	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Persian	General - Persian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Sinhala	General - Sinhala	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Catalan	General - Catalan	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Khmer	General - Khmer	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Hebrew	General - Hebrew	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Croatian	General - Croatian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Hausa	General - Hausa	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Marathi	General - Marathi	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Telugu	General - Telugu	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Punjabi	General - Punjabi	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Swedish	General - Swedish	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Bulgarian	General - Bulgarian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Danish	General - Danish	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Norwegian	General - Norwegian	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Kannada	General - Kannada	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Malayalam	General - Malayalam	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Czech	General - Czech	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Urdu	General - Urdu	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Nepali	General - Nepali	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Mongolian (Cyrillic)	General - Mongolian (Cyrillic)	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Uzbek	General - Uzbek	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported

Dialect

Language	Model name	Sample rate	Punctuation	ITN	Smoothing	Semantic segmentation	Audio-text alignment
Cantonese	General - Cantonese	16k	Supported	Supported	Supported	Not Supported	Supported
	Telephony Customer Service (General)	8k	Supported	Supported	Supported	Not Supported	Supported
	Cantonese-Mandarin Code-switching	8k	Supported	Supported	Supported	Not Supported	Not Supported
	Southeast Asian Multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Cantonese (Traditional Chinese)	General - Cantonese (Traditional Chinese)	8k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Cantonese (Traditional Chinese)	General - Cantonese (Traditional Chinese)	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported
Sichuanese	General - Sichuanese	16k	Supported	Supported	Supported	Supported	Supported
Sichuanese	Telephony Customer Service (General)	8k	Supported	Supported	Supported	Supported	Supported
Hubei dialect	General - Hubei dialect	16k	Supported	Supported	Supported	Supported	Supported
Hubei dialect	General - Hubei dialect	8k	Supported	Supported	Supported	Supported	Supported
Shanghainese	General - Shanghainese	16k	Supported	Supported	Supported	Supported	Not Supported
Hunan dialect	General - Hunan dialect	16k	Supported	Supported	Supported	Supported	Supported
Henan dialect	General - Henan dialect	16k	Supported	Supported	Supported	Supported	Supported
Henan dialect	General - Henan dialect	8k	Supported	Supported	Supported	Supported	Supported
Zhejiang dialect	General - Zhejiang dialect	16k	Supported	Supported	Supported	Supported	Not Supported
Northeastern Mandarin	General - Northeastern Mandarin	16k	Supported	Supported	Supported	Supported	Supported
Shandong dialect	General - Shandong dialect	16k	Supported	Supported	Supported	Supported	Supported
Tianjin dialect	General - Tianjin dialect	16k	Supported	Supported	Supported	Supported	Supported
Shaanxi dialect	General - Shaanxi dialect	16k	Supported	Supported	Supported	Supported	Supported
Shanxi dialect	General - Shanxi dialect	16k	Supported	Supported	Supported	Supported	Supported
Guizhou dialect	General - Guizhou dialect	16k	Supported	Supported	Supported	Supported	Supported
Yunnan dialect	General - Yunnan dialect	16k	Supported	Supported	Supported	Supported	Supported
Gansu dialect	General - Gansu dialect	16k	Supported	Supported	Supported	Supported	Supported
Uyghur	General - Uyghur	16k	Not Supported	Not Supported	Not Supported	Not Supported	Not Supported
Uyghur	General - Uyghur	8k	Not Supported	Not Supported	Not Supported	Not Supported	Not Supported
Suzhou dialect	General - Suzhou dialect	16k	Supported	Supported	Supported	Supported	Not Supported
Minnan	General - Minnan	16k	Supported	Supported	Supported	Supported	Not Supported
Jiangxi dialect	General - Jiangxi dialect	16k	Supported	Supported	Supported	Supported	Supported
Ningxia dialect	General - Ningxia dialect	16k	Supported	Supported	Supported	Supported	Supported
Guangxi dialect	General - Guangxi dialect	16k	Supported	Supported	Supported	Supported	Supported
Guangxi dialect	General - Guangxi dialect	8k	Supported	Supported	Supported	Supported	Supported
Mandarin Chinese	Shiyinshi V1 - end-to-end model; Education Content Analysis; Medical Content Analysis; News Media Content Analysis; Entertainment Video Content Analysis; Offline Audio/Video Transcription (Upgraded); New Retail Domain Recognition Model; Travel Domain Recognition Model; Automotive Domain	16k	Supported	Supported	Supported	Supported	Supported
	Chinese-English Code-switching	16k	Supported	Supported	Supported	Supported	Not Supported
	Shiyinshi V1 - end-to-end model	8k	Supported	Supported	Supported	Supported	Supported
	Southeast Asian Multilingual	16k	Supported	Not Supported	Not Supported	Not Supported	Not Supported

To view the latest supported models, log on to the Intelligent Speech Interaction console. On the My Projects page, find your project and click Project Feature Configuration in the Actions column. On the Project Feature Configuration page, go to the ASR panel and click Modify Configuration to view or change the speech recognition model.

In addition to the model listings above, note the following capability boundaries:

The Shiyinshi V1 end-to-end model supports Chinese Mandarin only. Cantonese is not supported by this model.
Recognizing English audio using a Chinese Mandarin model is non-standard behavior and accuracy is not guaranteed. Conversely, English models cannot recognize Chinese.
The Nantong city dialect is not currently supported. As an alternative, you can try similar Wu-dialect models, such as the Suzhou dialect or Shanghainese models, to evaluate suitability.
The fun-asr-far-field model is currently the best-performing far-field recognition model. The TINGWU test environment uses the same model set as the commercial environment (default model or the fun-asr series).

Automatic sentence breaking

Real-time speech recognition can identify sentence breaks. In contrast, short sentence recognition processes only one sentence per request and does not identify sentence breaks.

Supported audio formats

Audio file recognition supports all combinations of sampling rate and bit depth (8 kHz 8-bit, 8 kHz 16-bit, 16 kHz 8-bit, and 16 kHz 16-bit). It supports both mono and stereo channels and the PCM format (uncompressed PCM or WAV). Real-time speech recognition supports only 8 kHz 16-bit and 16 kHz 16-bit audio, requires a mono channel, and accepts WAV and MP3 formats.

WebSocket reconnection and keep-alive for real-time speech recognition

The following guidance covers reconnection, keep-alive, and instance reuse:

Reconnection mechanism — After a task disconnects, the streaming URL remains valid for 24 hours. If the connection drops because no audio data was sent (or the task enters a PAUSED state), you can reconnect using the original URL within 24 hours to resume streaming without creating a new task.
Keep-alive method — An idle connection is automatically closed after 10 seconds. To maintain a long-running connection when no valid speech is being captured, continuously send silence data (an all-zero buffer array). This prevents idle-timeout disconnections.
Instance reuse — A single SpeechTranscriber instance supports multiple rounds of recognition. After calling stop(), as long as the connection has not been closed, you do not need to recreate the instance or call start() again — simply send the new audio stream to begin the next recognition round.
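The keep-alive method above can be sketched as a generator that substitutes an all-zero frame whenever the capture source has nothing to offer, so the stream never goes idle. This is a minimal sketch: the name fill_gaps_with_silence and the 3,200-byte frame (100 ms of 16 kHz, 16-bit mono PCM) are illustrative assumptions, and actually sending each frame is left to your SDK or WebSocket client.

```python
from typing import Iterable, Iterator, Optional


def fill_gaps_with_silence(
    chunks: Iterable[Optional[bytes]], frame_bytes: int = 3200
) -> Iterator[bytes]:
    """Yield captured audio, replacing empty or missing chunks with zeros."""
    silence = bytes(frame_bytes)  # all-zero buffer, i.e. digital silence
    for chunk in chunks:
        yield chunk if chunk else silence


# Hypothetical capture output: None or b"" marks a tick with no audio captured.
captured = [b"\x01\x02" * 1600, None, b"", b"\x03\x04" * 1600]
frames = list(fill_gaps_with_silence(captured))
```

Because every tick produces a frame, the connection keeps receiving data even when the microphone yields nothing.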

Recorded file recognition submission succeeded but results are missing or errors occur

The following are common causes and solutions:

Server cannot download the audio file — Ensure the audio URL is publicly accessible and does not involve a 302 redirect (OSS URLs are recommended). If a redirect URL must be used, add the parameter download_method: curl to your request.
Calling too soon after generating the URL — If the API is called immediately after the audio URL is generated, the server may not yet be ready and the request may fail. Wait at least 1 minute before making the request.
Empty result with status code 40270002 (punctagger silent) — This typically indicates that the audio consists entirely of background noise or an invalid voicemail recording that the model cannot recognize.
Cannot view queue status in the console — The console does not support viewing the number of queued recognition tasks. For paid users, results are normally returned within 3 hours.

Supported audio capture methods and speaker separation limitations

Web audio capture — Web browsers support only microphone capture. Direct capture of system audio (for example, audio playing through speakers) is not supported. For online meeting scenarios, you must mix the microphone and system audio streams before sending the combined stream to the recognition service.
Gender identification — Single-sentence recognition (one-shot recognition) does not support gender identification.
Single-channel speaker separation (diarization) — Speaker separation on single-channel recordings cannot guarantee 100% accuracy, especially for overlapping speech or very short utterances. For higher accuracy, use physical multi-channel recording (separate audio tracks per speaker) or the Model Studio Fun-ASR product.
Real-time vs. offline diarization — When real-time speaker role separation does not perform well, consider applying offline speaker separation after the recording is complete to improve accuracy.

Data privacy, logging, and audio storage policies

Recognition logs — Recognition logs contain the transcribed text. Alibaba Cloud staff access these logs only when a customer reports an issue and provides the task ID for troubleshooting. Proactive access to customer data does not occur.
Audio file storage — The real-time speech transcription service does not retain audio files on the server. If you need to keep a recording of the audio, you must capture and save it on the client side before sending it to the service.
Command filtering — The service does not support restricting recognition to specific commands or avatar names. Any such filtering must be implemented at the application layer by the developer.

Emotion recognition, segmented model switching, and edition differences for recorded file recognition

Emotion recognition — Yes. Emotion and intonation recognition is supported. The EmotionValue parameter is included in the recognition result.
Segmented model switching — Specifying different AppKeys for different time segments within the valid_times parameter is not supported. If you need to use different models for different segments, split the audio file into segments and submit each segment separately with the corresponding AppKey.
Standard vs. ultra-speed editions — The standard and ultra-speed editions of recorded file recognition are functionally equivalent. The primary difference is recognition speed — the ultra-speed edition processes audio significantly faster.
Shared model set — Real-time recognition and recorded file recognition use the same underlying model set.

Performance

Calculating recognition accuracy

The industry typically measures recognition performance using an error rate: the character error rate (CER) for Chinese and the word error rate (WER) for English. The formula is: (Number of insertions + Number of deletions + Number of substitutions) / Total number of words or characters in the reference transcript. For example, consider the following scoring output:

%WER  15.07  [  2165 /  14365,  74  ins,  385  del,  1706  sub  ]
%SER  67.75  [  645 /  952  ]
Scored  952  sentences,  0  not  present  in  hyp.

The recognition accuracy is (14365 - 74 - 385 - 1706) / 14365 = 84.93%. Equivalently, subtract the WER from 100%: 100% - 15.07% = 84.93%.
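The same arithmetic can be reproduced in a few lines. The function name error_rate is illustrative; the counts are taken from the scoring output above.

```python
def error_rate(insertions: int, deletions: int, substitutions: int, reference_length: int) -> float:
    """Return the error rate (WER or CER) as a percentage of the reference length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return 100.0 * (insertions + deletions + substitutions) / reference_length


wer = error_rate(insertions=74, deletions=385, substitutions=1706, reference_length=14365)
accuracy = 100.0 - wer
print(f"WER {wer:.2f}%, accuracy {accuracy:.2f}%")  # WER 15.07%, accuracy 84.93%
```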

Speech recognition model accuracy

DAMO Academy's Intelligent Speech Interaction models are certified by the China National Software Testing Center (CNAS) to achieve over 98% recognition accuracy. These tests involved reading a 1,207-character Mandarin sample at 240 characters per minute into a headset microphone from 1 cm away, in an environment with noise levels below 60 dB. This result is the average of five test rounds.

In real-world scenarios, factors such as microphone quality, background noise, and accent variations can affect accuracy. For audio in PCM or WAV format with an 8 kHz sampling rate, 16-bit depth, dual-channel tracks (separate tracks for user and agent), and a signal-to-noise ratio (SNR) above 20 dB, we guarantee an accuracy of at least 85% in most commercial use cases, ensuring effective performance.

File transcription Express Edition latency

The file transcription Express Edition completes the recognition of a 30-minute audio file within 10 seconds. This duration is measured from when the service receives the entire audio file until recognition is complete. Total processing time may vary, as it is also affected by factors such as your client's network bandwidth. The latency field in the server response indicates the server-side processing time.

8 kHz models for 16 kHz audio

No. The 8 kHz and 16 kHz models only support audio with their corresponding sampling rates.

Call limits for file transcription Express Edition

There is no call frequency limit, but there is a concurrency limit. You can view your concurrency limit in the console.

Cantonese recognition accuracy

The recognition accuracy for our 8 kHz and 16 kHz Cantonese models ranges from 80% to 95%. The corpus data and the speaker's pronunciation affect actual results. Accuracy may also vary slightly (by about 5%) between different services. Service accuracy is ranked as follows: file transcription > short speech recognition > real-time speech recognition.

Transcription time for short audio files

File transcription is an offline API. For free-tier users, the service completes recognition tasks and returns the transcript within 24 hours. For paid users, the service completes tasks within 3 hours. For short audio clips under 60 seconds, we recommend using short speech recognition for faster results.

Task prioritization

This feature is not currently supported. However, our file transcription service is designed to process tasks quickly.

Recommended model for two-party calls

We recommend using the new-generation end-to-end Shiyinshi recognition model, which offers excellent overall performance.

Service request duration limits

Short speech recognition supports real-time audio up to 60 seconds long.
Real-time speech recognition does not have a duration limit.

Streaming and non-streaming modes

Non-streaming mode, also known as standard mode, returns a single recognition result after the service detects that the speaker has finished a complete utterance. In contrast, streaming mode returns intermediate results as the user speaks, followed by a final result at the end of the utterance.

Endpoint latency

Endpoint latency is the time from sending the last byte of audio to receiving the final recognition result.

For real-time services like short speech recognition and real-time transcription, the latency is typically around 300 ms, though this can vary slightly based on the model and audio characteristics.

The RESTful API for short speech recognition processes audio in batches, so its recognition time is proportional to the audio duration. Excluding network overhead, you can estimate the processing time with the formula: Processing Time = Audio Duration × 0.2. For example, a 1-minute audio file takes approximately 12 seconds to process. Actual performance may vary depending on the model and server load.
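The estimate above is a single multiplication, shown here as a small helper. The function name is illustrative, and the 0.2 factor is the rough figure quoted in this section rather than a guaranteed value.

```python
def estimate_processing_seconds(audio_seconds: float, factor: float = 0.2) -> float:
    """Estimate server-side processing time, excluding network overhead."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds * factor


# A 1-minute clip is estimated at roughly 12 seconds.
estimate = estimate_processing_seconds(60)
```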

Slow real-time speech recognition

Check if you have set the enable_semantic_sentence_detection parameter to true.

Setting this parameter to true enables semantic sentence detection, which improves recognition accuracy but slightly increases latency. If low latency is a high priority, set this parameter to false.

Recognition quality

Optimizing misrecognized words

First, consider optimization without labeled data:
1. Use business-specific text data to optimize a custom language model. This data can include business keywords, related sentences, and documents. In your training data, generalize the target words as much as possible. For example, add related phrases like "What is Yinshui-e-Dai?" and "How do I apply for Yinshui-e-Dai?" to the training data.
2. For business keywords that are still misrecognized, improve the custom language model by duplicating the terms in your corpus or increasing the model weight.
3. For individual problematic keywords, use generalized hotwords.
Next, consider optimization with labeled data:
1. If optimization without labeled data proves insufficient and overall recognition accuracy remains low (especially due to accents), optimize the acoustic model.
2. Acoustic model optimization requires labeled data. This labeled data can also be added to your business-related corpus to further optimize the language model.

Single-character recognition failure

Single characters are difficult to recognize because they have short pronunciation units and lack semantic context. Adding them as hotwords may not be effective. We recommend using a custom language model.

Adjusting hotword weights

You cannot set weights for name and location hotwords, but you can set them for business-specific hotwords. The console does not currently let you set a weight when you create a business-specific hotword, so you must use the API to adjust the weight.

Fixing inaccurate timestamps

Set the enable_timestamp_alignment parameter to true to calibrate timestamps. For more information, see the API Reference.

Handling sensitive recognition and noise

You can adjust the VAD noise threshold by setting the speech_noise_threshold parameter.

The speech_noise_threshold parameter accepts values from -1 to 1. A lower value increases sensitivity, which may cause the service to misrecognize background noise as speech. A higher value may cause the service to misidentify speech segments as noise and fail to recognize them. For example, if you set the parameter to 0.6 and find it is still too sensitive, try increasing the value to 0.7. If you notice dropped words or missed speech, decrease the value to 0.5, 0.2, or even -0.2.

Code examples:

Java

transcriber.addCustomedParam("speech_noise_threshold", -0.1);

C++

request->setPayloadParam("speech_noise_threshold",-0.1);

Improving far-field recognition

Far-field recognition problems typically occur because the VAD thresholds suited to far-field and near-field audio are different. We recommend that you adjust the speech_noise_threshold parameter, whose value range is [-1, 1]. A smaller value increases sensitivity, which may cause more noise to be misidentified as speech. A larger value decreases sensitivity, which may cause more speech segments to be misidentified as noise and not be recognized. For example, if you set the parameter to -0.2 but speech is still frequently dropped, decrease the value to -0.3 or -0.4. If a large amount of noise is being misidentified as speech, increase the value to -0.1 or 0.
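The tuning rule for speech_noise_threshold can be written as a small helper that moves the value one step in the right direction and keeps it inside [-1, 1]. The function name, the symptom labels, and the 0.1 step are illustrative assumptions; only the parameter name and its range come from this section.

```python
def adjust_threshold(current: float, symptom: str, step: float = 0.1) -> float:
    """Suggest the next speech_noise_threshold value for an observed symptom."""
    if symptom == "speech_dropped":
        candidate = current - step  # lower value: more sensitive, fewer dropped words
    elif symptom == "noise_as_speech":
        candidate = current + step  # higher value: less sensitive, less noise accepted
    else:
        raise ValueError(f"unknown symptom: {symptom}")
    # Clamp to the documented range and round away floating-point residue.
    return round(max(-1.0, min(1.0, candidate)), 2)


next_value = adjust_threshold(-0.2, "speech_dropped")
```

Change the value in small steps and re-test with representative audio after each change, because the best setting depends on the recording environment.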

Handling inaccurate ASR recognition

ASR models typically provide a stable level of accuracy. If speech recognition is consistently inaccurate or the accuracy rate is very low, check for configuration errors. Ensure that the following three values are consistent: the actual sample rate of your audio (ASR supports only 8 kHz 16-bit or 16 kHz 16-bit for real-time transcription), the sample rate parameter set in your API call (8000 or 16000), and the server-side ASR model (8 kHz or 16 kHz).

Important

If you are using the public cloud ASR service, check the model sample rate selected for your appkey in the Alibaba Cloud console. If you are in a private cloud ASR environment, check the sample rate defined in the service/resource/asr/default/models/readme.txt file.
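The three-way consistency check described above can be sketched as follows. The function and variable names are hypothetical; you supply the values yourself, for example from the audio file header, your request parameters, and the model configured for your appkey.

```python
SUPPORTED_RATES = (8000, 16000)


def sample_rate_problems(audio_hz: int, request_hz: int, model_hz: int) -> list[str]:
    """Return human-readable mismatches among the three sample-rate settings."""
    problems = []
    if audio_hz not in SUPPORTED_RATES:
        problems.append(f"audio is {audio_hz} Hz; expected 8000 or 16000")
    if request_hz != audio_hz:
        problems.append(f"request sample rate {request_hz} != audio {audio_hz}")
    if model_hz != audio_hz:
        problems.append(f"model sample rate {model_hz} != audio {audio_hz}")
    return problems


# Example: 8 kHz audio sent with 16 kHz settings yields two mismatches.
issues = sample_rate_problems(audio_hz=8000, request_hz=16000, model_hz=16000)
```

An empty list means that the three values agree.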

Fixing consistently misrecognized words

Try the following solutions, using "Yinshui-e-Dai" as an example:

Optimize a custom language model by training it on the self-learning platform. For example, add related phrases like "What is Yinshui-e-Dai?" and "How do I apply for Yinshui-e-Dai?" to the text corpus.
Use hotword optimization on the self-learning platform. For example, add "Yinshui-e-Dai" as a hotword and set an appropriate weight.
In a private cloud environment, use the ASR allowlist. For example, if "Yinshui-e-Dai" is often misrecognized as "yin shui yi dai", you can add an entry to the service/resource/asr/default/models/nn_itn/correct.list file in the specified format. The first column contains the misrecognized text and the second column contains the correct text. Note: You must restart the ASR service for this change to take effect.

Improving punctuation and sentence segmentation

By default, the service uses VAD for sentence segmentation. To use semantic sentence detection, set the enable_semantic_sentence_detection parameter to true. For real-time transcription, also enable intermediate results by setting the enable_intermediate_result parameter to true.

Poor real-time punctuation and segmentation

Ensure that intermediate results are enabled. In real-time scenarios, semantic sentence detection requires intermediate results.
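The dependency between the two parameters can be made explicit in client code. The following is a sketch only: the class `SentenceDetectionParams` is hypothetical, and how the resulting map is attached to a request depends on the SDK in use. The two parameter names are the ones given in this topic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds and validates sentence-detection parameters. */
final class SentenceDetectionParams {
    static final String SEMANTIC = "enable_semantic_sentence_detection";
    static final String INTERMEDIATE = "enable_intermediate_result";

    /** Returns the parameters to set; an empty map keeps the default VAD segmentation. */
    static Map<String, Object> build(boolean useSemanticDetection) {
        Map<String, Object> params = new LinkedHashMap<>();
        if (useSemanticDetection) {
            params.put(SEMANTIC, true);
            // Semantic detection in real-time scenarios requires intermediate results.
            params.put(INTERMEDIATE, true);
        }
        return params;
    }

    /** A configuration is invalid if semantic detection is on without intermediate results. */
    static boolean isValid(Map<String, Object> params) {
        boolean semantic = Boolean.TRUE.equals(params.get(SEMANTIC));
        boolean intermediate = Boolean.TRUE.equals(params.get(INTERMEDIATE));
        return !semantic || intermediate;
    }
}
```

Calling `isValid` before opening a session catches the common case where semantic detection is enabled but intermediate results were left off.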

Handling duplicate transcription results

This typically occurs with dual-channel audio files in which both channels contain identical content. Each channel is transcribed separately, so the duplicated output is expected behavior. To avoid it, set the first_channel_only parameter to true so that only the first channel is transcribed.
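To confirm that a file really carries the same content on both channels before setting first_channel_only, the raw samples can be compared. This is a minimal sketch that assumes headerless, interleaved 16-bit stereo PCM. The class `ChannelCompare` is hypothetical and not part of the service.

```java
/** Hypothetical helper: checks whether both channels of interleaved 16-bit stereo PCM match. */
final class ChannelCompare {
    static boolean channelsIdentical(byte[] stereoPcm) {
        // Each stereo frame is 4 bytes: 2 bytes for the left sample, 2 for the right.
        if (stereoPcm.length == 0 || stereoPcm.length % 4 != 0) {
            throw new IllegalArgumentException("Expected a non-empty buffer of whole stereo frames");
        }
        for (int i = 0; i < stereoPcm.length; i += 4) {
            if (stereoPcm[i] != stereoPcm[i + 2] || stereoPcm[i + 1] != stereoPcm[i + 3]) {
                return false;
            }
        }
        return true;
    }
}
```

A byte-exact comparison only detects channels that are true copies. Recordings of the same conversation from two different microphones will differ and should be treated as genuine dual-channel audio.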

Troubleshooting slow performance and timeouts

Troubleshooting steps:

Run the sample code provided by Alibaba Cloud, compare its behavior with your service, and collect the logs.
Record the taskid of the request for easier troubleshooting.
Use a packet capture tool on the client, such as tcpdump on Linux or Wireshark on Windows, to check the network status.

Handling low recognition accuracy

Ensure the audio sample rate matches the model selected for your application in the console, and that the audio is mono (single-channel).

Important

Only audio file transcription supports dual-channel audio.

Further steps for inaccurate recognition

You can improve recognition accuracy in two ways:

Use the hotword feature to quickly improve accuracy in real time. For more information, see Hotword Overview.
Enable model training on the self-learning platform to improve recognition accuracy for a large volume of text through model customization. For more information, see Language Model Customization Overview.

Streaming intermediate results differ from SentenceEnd final results, or hotwords not taking effect

Model difference (intermediate vs. final results) — The streaming intermediate results (TranscriptionResultChanged events) and the final sentence-end results (SentenceEnd events) are produced by different models. Discrepancies between them are normal behavior. The frontend should replace the intermediate display with the final result after the SentenceEnd event is received.
Hotwords not matched — Verify that the correct AppKey is associated with the active hotword configuration, and confirm that the spoken words are clearly enunciated. Note that hotwords may not take effect in Japanese-language scenarios or when speech contains heavy Chinese-English code-switching.
Model alignment is not supported — It is not possible to configure the streaming model and the final-result model to use the same underlying model.
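The recommended frontend behavior, replacing the intermediate text with the final text once the sentence ends, can be sketched as follows. The class `TranscriptView` and its methods are hypothetical. Only the event names TranscriptionResultChanged and SentenceEnd come from this topic, and wiring them to the SDK callbacks is left to the application.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical display model: final sentences are kept, intermediate text is provisional. */
final class TranscriptView {
    private final List<String> finalized = new ArrayList<>();
    private String pending = "";

    /** Call for each TranscriptionResultChanged event: overwrite the provisional text. */
    void onTranscriptionResultChanged(String intermediateText) {
        pending = intermediateText;
    }

    /** Call for each SentenceEnd event: discard the provisional text and keep the final one. */
    void onSentenceEnd(String finalText) {
        finalized.add(finalText);
        pending = "";
    }

    /** Text to show: all final sentences followed by the current provisional text, if any. */
    String render() {
        StringBuilder sb = new StringBuilder(String.join(" ", finalized));
        if (!pending.isEmpty()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(pending);
        }
        return sb.toString();
    }
}
```

Because the final text always replaces the provisional text rather than being merged with it, differences between the two models never appear in the finished transcript.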

Optimizing speech recognition in noisy environments

The following approaches can improve accuracy in noisy conditions:

Adjust the VAD noise threshold — Tune the speech_noise_threshold parameter (range: [-1, 1]). For far-field scenarios or frequent dropped words, decrease the value (for example, to -0.2 or -0.4). If excessive noise is being misrecognized as speech, increase the value.
Apply client-side audio pre-processing — Integrate noise reduction (for example, RNNoise) and acoustic echo cancellation (AEC) at the application layer. Place the microphone as close to the speaker as possible to improve the signal-to-noise ratio before the audio reaches the recognition service.
Use the Tongyi Multimodal Interactive Development Kit — This kit provides enhanced on-device audio algorithms that can improve recognition quality in challenging acoustic environments.
Verify audio channel configuration — If the sample rate configuration is correct but accuracy is still poor, check whether the audio is recorded in mono. Only recorded file recognition supports dual-channel audio; real-time recognition requires mono audio.
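If the capture device delivers stereo audio but real-time recognition needs mono, one option is to downmix on the client before sending. The following sketch averages the two channels and assumes interleaved 16-bit little-endian PCM. The class `MonoDownmix` is hypothetical, and taking only one channel is an equally valid alternative.

```java
/** Hypothetical helper: averages interleaved 16-bit little-endian stereo PCM into mono. */
final class MonoDownmix {
    static byte[] toMono(byte[] stereoPcm) {
        if (stereoPcm.length % 4 != 0) {
            throw new IllegalArgumentException("Expected whole 4-byte stereo frames");
        }
        byte[] mono = new byte[stereoPcm.length / 2];
        for (int in = 0, out = 0; in < stereoPcm.length; in += 4, out += 2) {
            // Reassemble signed 16-bit samples from little-endian byte pairs.
            int left = (short) ((stereoPcm[in] & 0xFF) | (stereoPcm[in + 1] << 8));
            int right = (short) ((stereoPcm[in + 2] & 0xFF) | (stereoPcm[in + 3] << 8));
            int mixed = (left + right) / 2;
            mono[out] = (byte) mixed;            // low byte
            mono[out + 1] = (byte) (mixed >> 8); // high byte
        }
        return mono;
    }
}
```

Averaging cannot overflow the 16-bit range, so no clipping step is needed.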

Abnormal recognition results, unknown values, or specific words consistently misrecognized

The following scenarios and solutions are provided:

score: -1000.0 and type: 0 returned — The audio is too short or the vocal characteristics are too ambiguous for the model to produce a reliable result. Try using a different audio sample with clearer, longer speech.
Digit recognition anomalies (for example, saying "1 through 9" returns "1 through 10") — This is a known issue with no current fix planned for the standard model. The Model Studio fun-asr-realtime model is recommended as an alternative.
Phonetic confusion between similar-sounding words (for example, a phrase is misrecognized as a phonetically similar phrase) — This may be caused by a mismatched audio bit depth. Consider using the recorded file recognition API, which may produce more accurate results in such cases.
voice parameter produces incorrect output — Cross-reference the official voice list in the documentation to verify that the parameter value is correct and matches the intended voice profile.

SDK usage

Does short-sentence recognition use WebSocket?

Yes. The SDK uses the WebSocket protocol, while the RESTful API uses the HTTP protocol. For more details, see the API Reference.

Python SDK for real-time speech recognition

Not at this time. The real-time speech recognition service does not currently provide a Python SDK.

Meaning of endtime=-1 in JSON response

It indicates that the current utterance has not ended. The service returns intermediate results only when the speech recognition mode is set to "streaming".

HarmonyOS Next SDK: Change the sampling rate

The sample code uses a 16,000 Hz sampling rate for both recording and recognition. To change this to 8,000 Hz, modify the sampling rate in two places, the recording interface and the SDK configuration:

In the AudioCapture.ets file, change samplingRate:audio.AudioSamplingRate.SAMPLE_RATE_16000 to samplingRate:audio.AudioSamplingRate.SAMPLE_RATE_8000.
In the corresponding recognition sample code file (.ets), modify the genInitParams() function to add the sample_rate setting: object.set("sample_rate", "8000"). No other changes are required.

Errors or connection anomalies when stopping a real-time speech recognition task

The following error scenarios and solutions are provided:

Error: "Got stop directive while task is stopping" — A Stop command was sent more than once. Wait for the TranscriptionCompleted event to be received before closing the connection.
Error: "Got stream data while task is stopping" — Audio data was still being sent when the stop command was issued. Stop sending audio data before issuing the stop command.
TranscriptionCompleted not received and connection drops — The client closed the connection before the completion event was returned. Maintain the connection until TranscriptionCompleted is received.
Error: "Close received after close" — An idle-timeout auto-close occurred and the client did not correctly handle the resulting state. Ensure the connection is properly closed after the task ends, or use silence keep-alive packets to prevent idle-timeout disconnection in the first place.
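The ordering rules above (stop sending audio, send the stop command once, wait for TranscriptionCompleted, then close) can be enforced with a small state machine. This is a sketch under stated assumptions: the class `StopSequencer` is hypothetical, and the actual send and close calls belong to whichever SDK or WebSocket client is in use.

```java
/** Hypothetical guard that enforces the shutdown order for a real-time transcription task. */
final class StopSequencer {
    enum State { STREAMING, STOPPING, COMPLETED }

    private State state = State.STREAMING;

    /** Audio may be sent only before the stop command has been issued. */
    synchronized boolean maySendAudio() {
        return state == State.STREAMING;
    }

    /** Returns true only on the first call, so the stop command is sent exactly once. */
    synchronized boolean requestStop() {
        if (state != State.STREAMING) {
            return false;
        }
        state = State.STOPPING;
        return true;
    }

    /** Call when the TranscriptionCompleted event arrives. */
    synchronized void onTranscriptionCompleted() {
        state = State.COMPLETED;
    }

    /** The connection should stay open until the completion event has been received. */
    synchronized boolean mayCloseConnection() {
        return state == State.COMPLETED;
    }
}
```

The audio sender checks `maySendAudio()` before each frame, the stop path sends the command only when `requestStop()` returns true, and the connection is closed only after `mayCloseConnection()` returns true.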

Punctuation prediction and hotword configuration in Nui SDK, and hotword limitations of the old API

Punctuation prediction — To include punctuation such as commas in recognition results when using Nui SDK, enable punctuation prediction by setting enable_punctuation_prediction to true.
Hotword feature and SDK version — The hotword feature is independent of the SDK version and is supported across all NLS ASR services. Hotwords are created in the Intelligent Speech Interaction console and take effect per AppKey.
Old API (standard edition) hotword limitations — The old API (standard edition) does not support directly specifying model parameters for hotwords. For proper noun optimization in this scenario, use a combination of hotwords and language model customization to achieve the best results.

Billing

Trial for Audio File Transcription Express Edition

No, this is a paid service.
