This topic provides answers to frequently asked questions (FAQs) about the speech recognition service.
FAQs about speech recognition fall into the following categories:
- Features
  - Can speech recognition automatically segment multiple sentences?
  - Does the speech recognition service support offline functionality?
  - Can speech recognition handle audio that contains a few English words and letters?
  - Can audio file transcription distinguish between an agent and a customer?
  - How does offline file transcription differentiate between left and right audio channels?
  - Which countries are supported for speech recognition over the phone?
  - Can real-time transcription include a track ID, similar to audio file transcription?
  - What audio encoding formats does the speech recognition service support?
  - What sample rates does the speech recognition service support?
  - What dialect models and languages does the speech recognition service support?
  - What audio formats do real-time transcription and audio file transcription support?
- Performance
  - What character accuracy rate can a speech recognition model achieve?
  - What is the latency of the Flash Version of audio file transcription?
  - Is there a request rate limit for the Flash Version of audio file transcription?
  - What is the difference between recognition in streaming mode and non-streaming mode?
- Recognition quality
  - If the hotword feature performs poorly, can I adjust the weights myself?
  - How do I fix inaccurate timestamps in audio file transcription?
  - How can I prevent speech recognition from transcribing background noise?
  - In real-time scenarios, why is punctuation and sentence segmentation poor even when enabled?
  - Can a single audio file transcription request return the same result twice?
  - How do I troubleshoot slow performance and timeout issues in real-time transcription?
  - Why is speech recognition accuracy low, sometimes recognizing only a few words?
- SDK usage
- Billing
Features
Failed sentence breaking in real-time transcription
- If you are using VAD sentence breaking, the real-time transcription service relies on silence data to detect pauses. If the upstream does not send silence data, the server cannot identify when the speaker has paused. To diagnose this, compare the audio duration processed on the server with the actual audio duration. If the processed duration is shorter, it indicates that the upstream did not send silence data during pauses.
- If you use semantic sentence breaking, the performance of the post-processing model might be the cause.
Solution: Continuously send silence data to the server when the user pauses.
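As a rough sketch of the solution above, the following Python snippet builds a block of 16-bit mono PCM silence that a client could keep sending while the speaker pauses. The function name make_silence, the 16 kHz default, and the 100 ms chunk length are illustrative assumptions and not part of the service's SDK. How the bytes are actually sent depends on the SDK you use.

```python
def make_silence(duration_ms: int, sample_rate: int = 16000) -> bytes:
    """Return 16-bit mono PCM silence (all-zero samples) of the given duration."""
    if duration_ms < 0:
        raise ValueError("duration_ms must be non-negative")
    num_samples = sample_rate * duration_ms // 1000
    # Each 16-bit sample occupies two bytes; zero-valued samples are silence.
    return b"\x00\x00" * num_samples


# Hypothetical usage: a 100 ms chunk to send repeatedly during a pause.
chunk = make_silence(100)
print(len(chunk))  # 3200 bytes = 1600 samples * 2 bytes
```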
Sentence segmentation
The real-time speech recognition service performs sentence segmentation, while the one-shot speech recognition service processes only one sentence per request.
Offline feature support
No. This service does not support offline speech recognition. All audio data must be sent to the server for recognition.
Supported models
The Intelligent Speech Interaction console lists the available model types in the project function configuration. Models are available for 8 kHz and 16 kHz sampling rates, and each sampling rate includes multiple domain models.
Chinese-English mixed recognition
Yes. The Chinese Mandarin model can recognize audio that contains a mix of Chinese and English.
ITN conversion of Chinese numerals
The system's model determines whether to convert a number to an Arabic numeral. The model does not convert all numbers, as its goal is to match the formats commonly used in standard written text.
enable_sample_rate_adaptive vs. sample_rate
No, they are different parameters. For the Flash Version, sample_rate specifies the sampling rate of the audio, while enable_sample_rate_adaptive is a switch. The Flash Version uses a default sampling rate of 16 kHz and does not require this switch.
Distinguishing agent and customer
The speech recognition engine can distinguish between different speakers, but it cannot determine their specific roles, such as agent or customer. You need to map the speakers to their roles based on your specific use case. We recommend storing recordings categorized by role.
Punctuation for short utterance recognition
The service adds punctuation by analyzing the audio's acoustic features and running linguistic analysis on the recognized text.
Distinguishing left and right channels
The speech recognition engine cannot distinguish between left and right channels. When you submit multi-channel audio to the speech recognition service, the response uses the channel_id field to identify the audio tracks. If the recording order is consistent, you can use channel_id to map each track to its corresponding channel. For details, see the API reference.
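The mapping described above can be sketched in a few lines of Python. The channel_id field comes from the response. Everything else here is an assumption made for illustration: the list-of-dicts shape, the text field, and the choice that track 0 holds the agent and track 1 the customer. Check the API reference for the real response layout.

```python
# Assumed for this example: which role was recorded on which track.
CHANNEL_ROLES = {0: "agent", 1: "customer"}


def label_sentences(sentences, channel_roles=CHANNEL_ROLES):
    """Attach a role to each sentence based on its channel_id."""
    labeled = []
    for sentence in sentences:
        role = channel_roles.get(sentence["channel_id"], "unknown")
        labeled.append({**sentence, "role": role})
    return labeled


# Hypothetical, simplified recognition results.
sample = [
    {"channel_id": 0, "text": "Hello, how can I help you?"},
    {"channel_id": 1, "text": "I have a question about my bill."},
]
for item in label_sentences(sample):
    print(item["role"], ":", item["text"])
```

This only works if the recording setup always places the same party on the same track, which is why a consistent recording order matters.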
Multiple vocabularies
You can pass only one vocabulary_id per request. If you need to use multiple vocabularies, use a custom model. For more information, see Create business-specific hotwords by using the POP API.
Differences between versions 4.0 and 2.0
In early releases of the audio file recognition service (version 2.0 by default), recognition results from the callback and polling methods differed in JSON string style and fields. Version 4.0 updates the callback method, making its recognition results consistent with those of the polling method. Both methods now return a camel case JSON string. For more information, see the interface description.
Supported telephony languages
Currently, 8 kHz telephony speech supports only English, while 16 kHz non-telephony speech supports additional languages. See features for a list of supported language models.
Transcribe audio from a URL
Yes, you can use the audio file transcription feature. See the API reference for details.
Audio track ID in real-time transcription
No. The audio track ID is specific to audio file recognition. Real-time transcription processes only single-channel audio, so channel differentiation is unnecessary.
SRT subtitle file generation
Not at this time. You will need to assemble the file yourself using the timestamps in the response.
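A minimal sketch of assembling an SRT file yourself is shown below. It assumes each recognized sentence is available as a dictionary with begin_time and end_time in milliseconds and a text field. These field names are hypothetical placeholders, so map them to the fields your response actually contains.

```python
def ms_to_srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(sentences) -> str:
    """Build SRT text: a counter, a time range, and the text for each cue."""
    blocks = []
    for index, sentence in enumerate(sentences, start=1):
        start = ms_to_srt_time(sentence["begin_time"])
        end = ms_to_srt_time(sentence["end_time"])
        blocks.append(f"{index}\n{start} --> {end}\n{sentence['text']}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Hypothetical sentences with millisecond timestamps.
print(build_srt([
    {"begin_time": 0, "end_time": 1500, "text": "Hello."},
    {"begin_time": 2000, "end_time": 4250, "text": "How can I help you?"},
]))
```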
Supported audio encoding formats
Supported formats vary by service. See the documentation for your specific service for details. You can also use common audio editing software, such as Audacity, to check an audio file's encoding format.
Supported sampling rates
The speech recognition service supports sampling rates of 8,000 Hz and 16,000 Hz. An 8,000 Hz sampling rate is common for call center scenarios. For mobile apps, PC tools, and H5 pages, a sampling rate of 16,000 Hz is standard. If your source audio is recorded at a different sampling rate, such as 32 kHz or 44 kHz, you must first transcode it to 16 kHz.
Audio file transcription also supports higher sampling rates, such as 32 kHz and 44 kHz.
Check audio file sampling rate
Use common audio editing software, such as Audacity, or the open-source command-line tool FFmpeg.
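If you prefer to check the rate programmatically, the sketch below reads it from a WAV header with Python's standard wave module. This is only one possible approach and it handles uncompressed WAV files only. For other formats, use a tool such as FFmpeg. The helper name wav_sample_rate and the demo file are assumptions made for this example.

```python
import os
import tempfile
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate, in Hz, stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Demo: write a tiny 16 kHz, 16-bit mono WAV file, then read its rate back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)       # mono
    wav_file.setsampwidth(2)       # 16-bit samples
    wav_file.setframerate(16000)   # 16 kHz
    wav_file.writeframes(b"\x00\x00" * 160)

print(wav_sample_rate(demo_path))  # 16000
```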
Supported languages and dialect models
Speech recognition currently supports the following languages and dialect models:
To view the latest supported models, log on to the Intelligent Speech Interaction console. On the My Projects page, find your project and click Project Feature Configuration in the Actions column. On the Project Feature Configuration page, go to the ASR panel and click modify configuration to view or change the speech recognition model.
Automatic sentence breaking
Real-time speech recognition can identify sentence breaks. In contrast, short sentence recognition processes only one sentence per request and does not identify sentence breaks.
Supported audio formats
Audio file recognition supports all combinations of sampling rate and bit depth (8 kHz 8-bit, 8 kHz 16-bit, 16 kHz 8-bit, and 16 kHz 16-bit). It supports both mono and stereo channels and the PCM format (uncompressed PCM or WAV). Real-time speech recognition supports only 8 kHz 16-bit and 16 kHz 16-bit audio, requires a mono channel, and accepts WAV and MP3 formats.
Performance
Calculating recognition accuracy
The industry typically measures recognition performance using an error rate. This is the character error rate (CER) for Chinese and the word error rate (WER) for English. The formula is: (Number of insertions + Number of deletions + Number of substitutions) / Total number of words or characters. For example, consider the following data:
%WER 15.07 [ 2165 / 14365, 74 ins, 385 del, 1706 sub ]
%SER 67.75 [ 645 / 952 ]
Scored 952 sentences, 0 not present in hyp.
The recognition accuracy is (14365 - 74 - 385 - 1706) / 14365 = 84.93%. A quick calculation is: 100 - 15.07 = 84.93.
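The same arithmetic can be reproduced with a short Python sketch. The function name error_rate is an illustrative choice. The counts are the ones from the scoring output above.

```python
def error_rate(insertions: int, deletions: int, substitutions: int, total: int) -> float:
    """Return the error rate (WER or CER) as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (insertions + deletions + substitutions) / total


# Counts taken from the %WER line: 74 ins, 385 del, 1706 sub, 14365 words.
wer = error_rate(74, 385, 1706, 14365)
accuracy = 100.0 - wer
print(f"WER {wer:.2f}%, accuracy {accuracy:.2f}%")  # WER 15.07%, accuracy 84.93%
```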
Speech recognition model accuracy
DAMO Academy's Intelligent Speech Interaction models are certified by the China National Software Testing Center (CNAS) to achieve over 98% recognition accuracy. These tests involved reading a 1,207-character Mandarin sample at 240 characters per minute into a headset microphone from 1 cm away, in an environment with noise levels below 60 dB. This result is the average of five test rounds.
In real-world scenarios, factors such as microphone quality, background noise, and accent variations can affect accuracy. For audio in PCM or WAV format with an 8 kHz sampling rate, 16-bit depth, dual-channel tracks (separate tracks for user and agent), and a signal-to-noise ratio (SNR) above 20 dB, we guarantee an accuracy of at least 85% in most commercial use cases, ensuring effective performance.
File transcription Express Edition latency
The file transcription Express Edition completes the recognition of a 30-minute audio file within 10 seconds. This duration is measured from when the service receives the entire audio file until recognition is complete. Total processing time may vary, as it is also affected by factors such as your client's network bandwidth. The latency field in the server response indicates the server-side processing time.
8 kHz models for 16 kHz audio
No. The 8 kHz and 16 kHz models only support audio with their corresponding sampling rates.
Call limits for file transcription Express Edition
There is no call frequency limit, but there is a concurrency limit. You can view your concurrency limit in the console.
Cantonese recognition accuracy
The recognition accuracy for our 8 kHz and 16 kHz Cantonese models ranges from 80% to 95%. The corpus data and the speaker's pronunciation affect actual results. Accuracy may also vary slightly (by about 5%) between different services. Service accuracy is ranked as follows: file transcription > short speech recognition > real-time speech recognition.
Transcription time for short audio files
File transcription is an offline API. For free-tier users, the service completes recognition tasks and returns the transcript within 24 hours. For paid users, the service completes tasks within 3 hours. For short audio clips under 60 seconds, we recommend using short speech recognition for faster results.
Task prioritization
This feature is not currently supported. However, our file transcription service is designed to process tasks quickly.
Recommended model for two-party calls
We recommend using the new-generation end-to-end Shiyinshi recognition model, which offers excellent overall performance.
Service request duration limits
- Short speech recognition supports real-time audio up to 60 seconds long.
- Real-time speech recognition does not have a duration limit.
Streaming and non-streaming modes
Non-streaming mode, also known as standard mode, returns a single recognition result after the service detects that the speaker has finished a complete utterance. In contrast, streaming mode returns intermediate results as the user speaks, followed by a final result at the end of the utterance.
Endpoint latency
Endpoint latency is the time from sending the last byte of audio to receiving the final recognition result.
For real-time services like short speech recognition and real-time transcription, the latency is typically around 300 ms, though this can vary slightly based on the model and audio characteristics.
The RESTful API for short speech recognition processes audio in batches, so its recognition time is proportional to the audio duration. Excluding network overhead, you can estimate the processing time with the formula: Processing Time = Audio Duration × 0.2. For example, a 1-minute audio file takes approximately 12 seconds to process. Actual performance may vary depending on the model and server load.
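As a quick illustration of the estimate above, the following Python sketch applies the 0.2 factor to an audio duration. The function and constant names are hypothetical, and the result excludes network overhead, as stated in the formula.

```python
PROCESSING_FACTOR = 0.2  # Processing Time = Audio Duration x 0.2


def estimate_processing_seconds(audio_seconds: float) -> float:
    """Estimate server-side processing time for the RESTful API."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds * PROCESSING_FACTOR


# A 1-minute clip is estimated at about 12 seconds.
print(f"{estimate_processing_seconds(60):.1f} s")
```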
Slow real-time speech recognition
Check if you have set the enable_semantic_sentence_detection parameter to true.
Setting this parameter to true enables semantic sentence detection, which improves recognition accuracy but slightly increases latency. If low latency is a high priority, set this parameter to false.
Recognition quality
Optimizing misrecognized words
- First, consider optimization without labeled data:
  - Use business-specific text data to optimize a custom language model. This data can include business keywords, related sentences, and documents. In your training data, generalize the target words as much as possible. For example, add related phrases like "What is Yinshui-e-Dai?" and "How do I apply for Yinshui-e-Dai?" to the training data.
  - For business keywords that are still misrecognized, improve the custom language model by duplicating the terms in your corpus or increasing the model weight.
  - For individual problematic keywords, use generalized hotwords.
- Next, consider optimization with labeled data:
  - If optimization without labeled data proves insufficient and overall recognition accuracy remains low (especially due to accents), optimize the acoustic model.
  - Acoustic model optimization requires labeled data. This labeled data can also be added to your business-related corpus to further optimize the language model.
Single-character recognition failure
Single characters are difficult to recognize because they have short pronunciation units and lack semantic context. Adding them as hotwords may not be effective. We recommend using a custom language model.
Adjusting hotword weights
You cannot set weights for name and location hotwords, but you can set them for business-specific hotwords. The console does not currently let you set a weight when you create a business-specific hotword, so you must use the API to adjust the weight.
Fixing inaccurate timestamps
Set the enable_timestamp_alignment parameter to true to calibrate timestamps. For more information, see the API Reference.
Handling sensitive recognition and noise
You can adjust the VAD noise threshold by setting the speech_noise_threshold parameter.
The speech_noise_threshold parameter accepts values from -1 to 1. A lower value increases sensitivity, which may cause the service to misrecognize background noise as speech. A higher value may cause the service to misidentify speech segments as noise and fail to recognize them. For example, if you set the parameter to 0.6 and find it is still too sensitive, try increasing the value to 0.7. If you notice dropped words or missed speech, decrease the value to 0.5, 0.2, or even -0.2.
Code examples:
- Java
  transcriber.addCustomedParam("speech_noise_threshold", -0.1);
- C++
  request->setPayloadParam("speech_noise_threshold",-0.1);
Improving far-field recognition
This occurs because the VAD thresholds for far-field and near-field audio are different. We recommend that you adjust the speech_noise_threshold parameter. The value range for the speech_noise_threshold parameter is [-1, 1]. A smaller value increases sensitivity, which may cause more noise to be misidentified as speech. A larger value decreases sensitivity, which may cause more speech segments to be misidentified as noise and not be recognized. For example, if you set the parameter to -0.2 but speech is still frequently dropped, you can decrease the value to -0.3 or -0.4. If you find that a large amount of noise is being misidentified as speech, you can increase the value to -0.1 or 0.
Handling inaccurate ASR recognition
ASR models typically provide a stable level of accuracy. If speech recognition is consistently inaccurate or the accuracy rate is very low, check for configuration errors. Ensure that the following three values are consistent: the actual sample rate of your audio (ASR supports only 8 kHz 16-bit or 16 kHz 16-bit for real-time transcription), the sample rate parameter set in your API call (8000 or 16000), and the server-side ASR model (8 kHz or 16 kHz).
If you are using the public cloud ASR service, check the model sample rate selected for your appkey in the Alibaba Cloud console. If you are in a private cloud ASR environment, check the sample rate defined in the service/resource/asr/default/models/readme.txt file.
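The three-way consistency check described above can be expressed as a small Python sketch. The function name check_sample_rates and the wording of the messages are illustrative assumptions. You would supply the three values yourself from your audio file, your API call, and your console or readme.txt configuration.

```python
SUPPORTED_RATES = (8000, 16000)


def check_sample_rates(audio_rate: int, request_rate: int, model_rate: int) -> list:
    """Return a list of problems; an empty list means the setup is consistent."""
    problems = []
    named_rates = (
        ("audio", audio_rate),
        ("request parameter", request_rate),
        ("model", model_rate),
    )
    for name, rate in named_rates:
        if rate not in SUPPORTED_RATES:
            problems.append(f"{name} sample rate {rate} Hz is not 8000 or 16000")
    # All three values must be identical.
    if len({audio_rate, request_rate, model_rate}) != 1:
        problems.append("audio, request parameter, and model sample rates differ")
    return problems


# Example: 8 kHz audio sent to a 16 kHz model with sample_rate=16000.
for problem in check_sample_rates(8000, 16000, 16000):
    print(problem)
```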
Fixing consistently misrecognized words
Try the following solutions, using "Yinshui-e-Dai" as an example:
- Optimize a custom language model by training it on the self-learning platform. For example, add related phrases like "What is Yinshui-e-Dai?" and "How do I apply for Yinshui-e-Dai?" to the text corpus.
- Use hotword optimization on the self-learning platform. For example, add "Yinshui-e-Dai" as a hotword and set an appropriate weight.
- In a private cloud environment, use the ASR allowlist. For example, if "Yinshui-e-Dai" is often misrecognized as "yin shui yi dai", you can add an entry to the service/resource/asr/default/models/nn_itn/correct.list file in the specified format. The first column contains the misrecognized text and the second column contains the correct text. Note: You must restart the ASR service for this change to take effect.
Improving punctuation and sentence segmentation
By default, the service uses VAD for sentence segmentation. To use semantic sentence detection, set the enable_semantic_sentence_detection parameter to true. For real-time transcription, also enable intermediate results by setting the enable_intermediate_result parameter to true.
Poor real-time punctuation and segmentation
Ensure that intermediate results are enabled. In real-time scenarios, semantic sentence detection requires intermediate results.
Handling duplicate transcription results
This typically occurs with dual-channel audio files where both channels contain identical content. Each channel is transcribed separately, so this is expected behavior. To avoid duplicate results, set the first_channel_only parameter to true to transcribe only the first channel.
Troubleshooting slow performance and timeouts
Troubleshooting steps:
- Run the sample code provided by Alibaba Cloud, compare its behavior with your service, and collect the logs.
- Record the taskid of the request for easier troubleshooting.
- Use a packet capture tool on the client, such as TCPDump for Linux or Wireshark for Windows, to check the network status.
Handling low recognition accuracy
Ensure the audio sample rate matches the model selected for your application in the console, and that the audio is mono (single-channel).
Only audio file transcription supports dual-channel audio.
Further steps for inaccurate recognition
You can improve recognition accuracy in two ways:
- Use the hotword feature to quickly improve accuracy in real time. For more information, see Hotword Overview.
- Enable model training on the self-learning platform to improve recognition accuracy for a large volume of text through model customization. For more information, see Language Model Customization Overview.
SDK usage
Does short-sentence recognition use WebSocket?
Yes. The SDK uses the WebSocket protocol, while the RESTful API uses the HTTP protocol. For more details, see the API Reference.
Python SDK for real-time speech recognition
Not at this time.
Meaning of endtime=-1 in JSON response
It indicates that the current utterance has not ended. The service returns intermediate results only when the speech recognition mode is set to "streaming".
HarmonyOS Next SDK: Change the sampling rate
The sample code uses a 16,000 Hz sampling rate for both recording and recognition. To change this to 8,000 Hz, modify the sampling rate in two locations, the recording interface and the SDK configuration:
- In the AudioCapture.ets file, change samplingRate:audio.AudioSamplingRate.SAMPLE_RATE_16000 to samplingRate:audio.AudioSamplingRate.SAMPLE_RATE_8000.
- In the corresponding recognition sample code file (.ets), modify the genInitParams() function to add the sample_rate setting: object.set("sample_rate", "8000"). No other changes are required.
Billing
Trial for Audio File Transcription Express Edition
No, this is a paid service.