This topic describes the audio input format requirements for the Speech Recognition services of Intelligent Speech Interaction. It also provides solutions for common issues with incompatible audio formats.
Supported audio formats
| Service | Requirements |
| Short Speech Recognition | |
| Real-time Speech Recognition | |
| Audio File Transcription | |
| Audio File Transcription (Offline Edition) | |
| Audio File Transcription Express Edition | |
Check your audio format
For definitions of common audio format terms, see Key concepts.
- Sample rate: 8000 Hz (8 kHz) or 16000 Hz (16 kHz), which means 8,000 or 16,000 samples per second.
- Bit depth: 16 bit, which means each audio sample is stored using 16 bits (2 bytes).
- Channels: mono (one channel) or stereo (two channels).
To calculate the size of an uncompressed audio file from its duration:
File size (MiB) = (sample rate × bit depth × number of channels × audio duration in seconds) / (8 × 1024 × 1024)
For example, 60 seconds of 16 kHz, 16-bit, mono audio:
(16000 Hz × 16 bit × 1 channel × 60 s) / (8 × 1024 × 1024) ≈ 1.83 MiB
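The formula above can be sketched as a small Python helper. The function name `pcm_size_mib` is illustrative, and the result covers only the uncompressed audio data, not any container header.

```python
def pcm_size_mib(sample_rate_hz: int, bit_depth: int, channels: int, duration_s: float) -> float:
    """Return the size in MiB of uncompressed PCM audio data (header excluded)."""
    total_bits = sample_rate_hz * bit_depth * channels * duration_s
    # 8 bits per byte, 1024 * 1024 bytes per MiB
    return total_bits / (8 * 1024 * 1024)


if __name__ == "__main__":
    # One minute of 16 kHz, 16-bit, mono audio (the example from the text)
    print(f"{pcm_size_mib(16000, 16, 1, 60):.2f} MiB")  # prints "1.83 MiB"
    # The same minute in stereo takes twice the space
    print(f"{pcm_size_mib(16000, 16, 2, 60):.2f} MiB")  # prints "3.66 MiB"
```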
- Check the audio format on Linux
  - Run the following command:
    file input.wav
  - Example output:
    - For an uncompressed WAV file with an 8000 Hz sample rate, a 16-bit bit depth, and a single channel (mono), the output is:
      input.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
    - For an uncompressed WAV file with a 16000 Hz sample rate, a 16-bit bit depth, and a single channel (mono), the output is:
      input.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
- Check the audio format on Windows
  - Right-click the audio file and select Properties.
  - Example:
    - For an uncompressed WAV file with an 8000 Hz sample rate, a 16-bit bit depth, and a single channel (mono), you can view the audio properties on the Details tab of the Properties window.
    - The same details are available for files with other supported sample rates, such as 16000 Hz.
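As a cross-platform alternative to the checks above, the following minimal sketch reads a WAV header with Python's standard `wave` module. The function names are illustrative, and the target checked here (8 kHz or 16 kHz, 16-bit, mono) is an assumption taken from the conversion advice elsewhere in this topic, not the full per-service requirements. Note that `wave` only reads uncompressed PCM WAV files; other encodings such as A-law raise `wave.Error`.

```python
import sys
import wave


def describe_wav(path: str) -> dict:
    """Read the header of an uncompressed PCM WAV file and return its key properties."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def format_problems(info: dict) -> list:
    """List deviations from the assumed target: 8/16 kHz, 16-bit, mono."""
    problems = []
    if info["sample_rate_hz"] not in (8000, 16000):
        problems.append(f"sample rate is {info['sample_rate_hz']} Hz, expected 8000 or 16000")
    if info["bit_depth"] != 16:
        problems.append(f"bit depth is {info['bit_depth']}, expected 16")
    if info["channels"] != 1:
        problems.append(f"channel count is {info['channels']}, expected 1 (mono)")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    info = describe_wav(sys.argv[1])
    print(info)
    for problem in format_problems(info):
        print("Problem:", problem)
```

Saved under a hypothetical name such as `check_wav.py`, the script would be run as `python check_wav.py input.wav`.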
Convert your audio format
If your audio file uses an unsupported sample rate, bit depth, channel count, or encoding, the service returns an error. Convert the file to a supported format before you test it.
- Convert the audio format on Linux
  Use the following common FFmpeg commands to convert your audio. To install FFmpeg, see Download FFmpeg.
# Note: -f s16le writes headerless raw PCM, so it is used only for .pcm output; WAV output uses -f wav.
# Check audio format details such as sample rate, channels, and codec
ffmpeg -i input.mp3
# Convert a WAV file to 8 kHz, 16-bit, mono WAV
ffmpeg -i input.wav -ar 8000 -ac 1 -acodec pcm_s16le -f wav output.wav
# Convert a WAV file to 16 kHz, 16-bit, mono WAV
ffmpeg -i input.wav -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
# Convert a raw PCM file to 16 kHz, 16-bit, mono WAV
# (raw PCM has no header, so its format is declared before -i; this assumes
# 16 kHz, 16-bit, mono input, so adjust the first -ar and -ac to match your file)
ffmpeg -f s16le -ar 16000 -ac 1 -i input.pcm -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
# Convert a WAV file to 16 kHz, 16-bit, mono PCM
ffmpeg -y -i input.wav -acodec pcm_s16le -f s16le -ac 1 -ar 16000 output.pcm
# Convert an MP3 file to 16 kHz, 16-bit, mono WAV
ffmpeg -y -i input.mp3 -acodec pcm_s16le -f wav -ac 1 -ar 16000 output.wav
# Convert a 44.1 kHz, 16-bit WAV file to 16 kHz, 16-bit, mono WAV
ffmpeg -y -i input.wav -acodec pcm_s16le -f wav -ac 1 -ar 16000 output.wav
# Convert a headerless 8 kHz A-law file to 8 kHz, 16-bit, mono WAV
# (omit "-f alaw -ar 8000" if the input is a WAV container that FFmpeg can detect)
ffmpeg -f alaw -ar 8000 -i input.wav -ar 8000 -ac 1 -acodec pcm_s16le -f wav output.wav
# Convert a headerless 8 kHz μ-law file to 8 kHz, 16-bit, mono WAV
ffmpeg -f mulaw -ar 8000 -i input.wav -ar 8000 -ac 1 -acodec pcm_s16le -f wav output.wav
# Convert an AMR file to 16 kHz, 16-bit, mono WAV
ffmpeg -i input.amr -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
- Convert the audio format on Windows
  On Windows, you can use an audio conversion tool such as Adobe Audition or Cool Edit Pro, or another online or offline converter.
  Open the audio file in the conversion tool, change the format in Export Settings, and then run the export. For example, to export 16 kHz audio: in the Export Audio Mixdown dialog box, under Sample Type, set Sample Rate to 16000 Hz, Channel to Mono, and Bit Depth to 16 Bit.
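If you need to convert many files, the FFmpeg conversion can also be scripted. The following is a minimal Python sketch, assuming FFmpeg is installed and on your PATH. The function names `build_convert_command` and `convert` are illustrative; the FFmpeg options are the ones used in the Linux commands in this section.

```python
import shlex
import subprocess


def build_convert_command(src: str, dst: str, sample_rate_hz: int = 16000) -> list:
    """Build an FFmpeg argument list that converts src to a 16-bit mono PCM WAV file."""
    if sample_rate_hz not in (8000, 16000):
        raise ValueError("sample_rate_hz must be 8000 or 16000")
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-i", src,                   # input file; FFmpeg detects its format
        "-ar", str(sample_rate_hz),  # output sample rate
        "-ac", "1",                  # mono
        "-acodec", "pcm_s16le",      # 16-bit little-endian PCM
        dst,                         # a .wav extension selects the WAV container
    ]


def convert(src: str, dst: str, sample_rate_hz: int = 16000) -> None:
    """Run the conversion; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_convert_command(src, dst, sample_rate_hz), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first
    print(shlex.join(build_convert_command("input.mp3", "output.wav")))
    # prints "ffmpeg -y -i input.mp3 -ar 16000 -ac 1 -acodec pcm_s16le output.wav"
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.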
FAQ
Sample audio works, but my file returns no result
Check if your audio file's format meets the input requirements. For more information, see Supported audio formats for Speech Recognition services.
Convert your audio to an uncompressed WAV file with an 8 kHz or 16 kHz sample rate, a 16-bit bit depth, and a single channel (mono). For instructions, see Convert your audio format.
UNSUPPORT_SAMPLE_RATE error
- Make sure the sample rate of the model you selected in the console matches the sample rate of your audio file. An 8 kHz model requires 8 kHz audio, and a 16 kHz model requires 16 kHz audio.
- Add the sample rate adaptation parameter to the request: enable_sample_rate_adaptive=true. For more information, see the API reference.
- If the issue persists, convert your audio format and try again. For instructions, see Convert your audio format.
Real-time Speech Recognition returns null
- Real-time Speech Recognition is designed for streaming audio and only supports uncompressed PCM or WAV files with an 8 kHz or 16 kHz sample rate, a 16-bit bit depth, and a single channel (mono). For more information, see Supported audio formats for Speech Recognition services.
- If you are testing the service with a pre-recorded audio file, you must first convert it to a supported format. For instructions, see Convert your audio format.
MP3 file error in Short Speech Recognition
Short Speech Recognition does support the MP3 format, but this error can occur if other properties like sample rate or channel count are incorrect. If the error persists, use the Audio File Transcription or Audio File Transcription (Offline Edition) service and add the enable_sample_rate_adaptive=true parameter to your request. For more information on format requirements, see Supported audio formats for Speech Recognition services.
TOO_LONG_SPEECH error in Short Speech Recognition
As described in the service requirements, Short Speech Recognition supports audio up to 60 seconds long. If your audio is longer than 60 seconds, use Real-time Speech Recognition, Audio File Transcription, or Audio File Transcription (Offline Edition). For more information, see Supported audio formats for Speech Recognition services.
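Raw PCM files carry no header, so their duration cannot be read from the file itself. As a rough sketch, the duration can be derived from the byte count by inverting the file size formula, assuming you know the sample rate, bit depth, and channel count of the data. The function names and the `SHORT_SPEECH_LIMIT_S` constant are illustrative; the 60-second value is the limit stated above.

```python
SHORT_SPEECH_LIMIT_S = 60  # limit for Short Speech Recognition stated in the text


def pcm_duration_s(num_bytes: int, sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> float:
    """Return the duration in seconds of headerless PCM data of the given size."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return num_bytes / bytes_per_second


def fits_short_speech(num_bytes: int, sample_rate_hz: int, bit_depth: int = 16, channels: int = 1) -> bool:
    """Return True if the audio is no longer than the 60-second limit."""
    return pcm_duration_s(num_bytes, sample_rate_hz, bit_depth, channels) <= SHORT_SPEECH_LIMIT_S


if __name__ == "__main__":
    print(pcm_duration_s(1_920_000, 16000))     # prints 60.0
    print(fits_short_speech(1_920_000, 16000))  # prints True
    print(fits_short_speech(2_000_000, 16000))  # prints False (62.5 seconds)
```

For a file on disk, `os.path.getsize(path)` gives the byte count to pass in.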
Audio file fails to transcribe in the console demo
Convert your audio to an uncompressed WAV file with an 8 kHz or 16 kHz sample rate, a 16-bit bit depth, and a single channel (mono). For instructions, see Convert your audio format.
AUDIO_DURATION_TOO_LONG error
The maximum supported audio duration for Audio File Transcription and Audio File Transcription (Offline Edition) is 12 hours. This error occurs if your audio file exceeds this limit. You can use an FFmpeg command to split the long audio file into smaller segments and transcribe them individually.
Download FFmpeg from its official website: https://ffmpeg.org/download.html
Example FFmpeg command:
ffmpeg -i input_audio.wav -ss 00:10:00 -to 05:10:00 -c copy output_audio.wav
Parameter descriptions:
-i input_audio.wav: Specifies the input file, input_audio.wav.
-ss 00:10:00: Sets the start time. Processing begins 10 minutes into the original audio.
-to 05:10:00: Sets the end time. Processing stops 5 hours and 10 minutes into the original audio.
-c copy: Sets the codec to copy, which instructs FFmpeg to copy the audio stream directly without re-encoding.
output_audio.wav: Specifies the output file, output_audio.wav. The resulting audio is 5 hours long (from 00:10:00 to 05:10:00).
Note: If you are transcribing a dual-channel audio file and need to process both channels, the maximum supported audio duration is 6 hours. The same logic applies to audio files with more than two channels.
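To split a long recording into several consecutive segments, you can generate one FFmpeg command per segment. The following Python sketch only builds the commands, using the same -ss, -to, and -c copy options as the example above. The function names, the output file naming scheme, and the chosen segment length are illustrative assumptions; pick a segment length below the limit that applies to your audio.

```python
import shlex


def format_hms(total_seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def split_commands(src: str, total_s: int, max_segment_s: int, prefix: str = "part") -> list:
    """Return one FFmpeg argument list per consecutive segment of at most max_segment_s."""
    if max_segment_s <= 0:
        raise ValueError("max_segment_s must be positive")
    commands = []
    start = 0
    index = 1
    while start < total_s:
        end = min(start + max_segment_s, total_s)
        commands.append([
            "ffmpeg", "-i", src,
            "-ss", format_hms(start),  # segment start in the original audio
            "-to", format_hms(end),    # segment end in the original audio
            "-c", "copy",              # copy the stream without re-encoding
            f"{prefix}_{index:03d}.wav",
        ])
        start = end
        index += 1
    return commands


if __name__ == "__main__":
    # A 13-hour recording cut into segments of at most 5 hours gives three commands
    for command in split_commands("input_audio.wav", 13 * 3600, 5 * 3600):
        print(shlex.join(command))
```

Each printed line can be run in a shell, or the argument lists can be passed to `subprocess.run`.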