Speech Recognition input format FAQ

This topic describes the audio input format requirements for the Speech Recognition services of Intelligent Speech Interaction. It also provides solutions for common issues with incompatible audio formats.

Supported audio formats

Requirements by service:

Short Speech Recognition

  • Supported formats: PCM, WAV (PCM-encoded), OPUS (OGG container), SPEEX (OGG container), AMR, MP3, and AAC. All audio must have a single channel (mono) and a 16-bit bit depth.

  • Sample rates: 8000 Hz and 16000 Hz.

  • Audio duration: Must not exceed 60 seconds.

  • File size: Must not exceed 2 MiB.

Real-time Speech Recognition

  • Supported formats: PCM, WAV (PCM-encoded), OPUS (OGG container), SPEEX (OGG container), AMR, MP3, and AAC. All audio must have a single channel (mono) and a 16-bit bit depth.

  • Sample rates: 8000 Hz and 16000 Hz.

Audio File Transcription

  • Supports single-channel (mono) and dual-channel audio in .wav, .mp3, .m4a, .wma, .aac, .ogg, .amr, and .flac formats.

  • File size must not exceed 512 MiB.

Audio File Transcription (Offline Edition)

  • Supports single-channel (mono) and dual-channel audio in .wav, .mp3, .m4a, .wma, .aac, .ogg, .amr, and .flac formats.

  • File size must not exceed 512 MiB.

Audio File Transcription Express Edition

  • Audio formats: Supports audio encoded in AAC, MP3, OPUS, and WAV formats.

  • Limitations: Supports audio files up to 100 MiB and 2 hours in duration. For files longer than 2 hours, use the Audio File Transcription service.

  • Model types: 8000 Hz (telephony) and 16000 Hz (non-telephony).
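The size and duration limits above can be checked before a file is uploaded. The following is a minimal sketch, not an official client: the dictionary keys and the function name `limit_violations` are illustrative labels, and only limits stated in the requirements above are encoded (Real-time Speech Recognition is omitted because no size or duration limit is listed for it).

```python
MIB = 1024 * 1024

# Limits copied from the requirements above. None means no limit is stated there.
# The dictionary keys are illustrative labels, not official service identifiers.
LIMITS = {
    "short_speech_recognition": {"max_bytes": 2 * MIB, "max_seconds": 60},
    "audio_file_transcription": {"max_bytes": 512 * MIB, "max_seconds": None},
    "audio_file_transcription_express": {"max_bytes": 100 * MIB, "max_seconds": 2 * 60 * 60},
}


def limit_violations(service: str, size_bytes: int, duration_s: float) -> list[str]:
    """Return a message for each limit that the file exceeds (empty list if none)."""
    limits = LIMITS[service]
    violations = []
    if size_bytes > limits["max_bytes"]:
        violations.append(f"file size exceeds {limits['max_bytes'] // MIB} MiB")
    if limits["max_seconds"] is not None and duration_s > limits["max_seconds"]:
        violations.append(f"duration exceeds {limits['max_seconds']} seconds")
    return violations


# A 60-second, 1,920,000-byte file fits Short Speech Recognition.
print(limit_violations("short_speech_recognition", 1_920_000, 60))
# A 3 MiB, 75-second file exceeds both limits.
print(limit_violations("short_speech_recognition", 3 * MIB, 75))
```

The check covers only size and duration. Sample rate, bit depth, and channel count still need to be verified separately.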

Check your audio format

Note

For definitions of common audio format terms, see Key concepts.

  • Sample rate: 8000 Hz (8 kHz) or 16000 Hz (16 kHz), which means 8,000 or 16,000 samples per second.

  • Bit depth: 16 bit, which means each audio sample is stored using 16 bits (2 bytes).

  • Channel: mono or stereo.

Calculate the file size of uncompressed (PCM) audio from its duration:

File size (MiB) = (sample rate × bit depth × number of channels × audio duration in seconds) / (8 × 1024 × 1024)

For example, 60 seconds of 16000 Hz, 16-bit, mono audio: (16000 × 16 × 1 × 60) / (8 × 1024 × 1024) ≈ 1.83 MiB
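The same calculation can be written as a small helper, which makes it easy to try other sample rates or durations. This is a sketch for uncompressed PCM only, and the function name `pcm_size_mib` is illustrative. A WAV file adds a small header on top of this figure, and compressed formats such as MP3 or AAC are smaller.

```python
def pcm_size_mib(sample_rate_hz: int, bit_depth: int, channels: int, duration_s: float) -> float:
    """Return the size of uncompressed PCM audio in MiB."""
    total_bits = sample_rate_hz * bit_depth * channels * duration_s
    return total_bits / (8 * 1024 * 1024)


# 16 kHz, 16-bit, mono, 60 seconds
print(round(pcm_size_mib(16000, 16, 1, 60), 2))  # 1.83
# 8 kHz, 16-bit, mono, 60 seconds
print(round(pcm_size_mib(8000, 16, 1, 60), 2))  # 0.92
```

Both results are below the 2 MiB limit of Short Speech Recognition, so a 60-second mono PCM recording at either supported sample rate fits within the size limit.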

  • Check the audio format on Linux

    1. Run the following command:

      file input.wav
    2. Example output:

      • For an uncompressed WAV file with an 8000 Hz sample rate, 16-bit bit depth, and a single channel (mono), the output is:

        input.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
      • For an uncompressed WAV file with a 16000 Hz sample rate, 16-bit bit depth, and a single channel (mono), the output is:

        input.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
  • Check the audio format on Windows

    1. In Windows, right-click the audio file and select Properties.

    2. Example:

      • For an uncompressed WAV file with an 8000 Hz sample rate, 16-bit bit depth, and a single channel (mono), you can view the audio properties on the Details tab of the Properties window.

      • The same details are available for files with other supported sample rates, such as 16000 Hz.
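As a cross-platform alternative to the checks above, the header of an uncompressed WAV file can be read with Python's standard `wave` module. The sketch below is illustrative: the function names `describe_wav` and `format_problems` are not part of any SDK, and the check assumes the target is 8000 Hz or 16000 Hz, 16-bit, mono audio. The `wave` module reads PCM-encoded WAV files only and raises `wave.Error` for other encodings.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read sample rate, bit depth, channel count, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "sample_rate_hz": sample_rate,
            "bit_depth": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / sample_rate,
        }


def format_problems(info: dict) -> list[str]:
    """List the ways a file deviates from 8000/16000 Hz, 16-bit, mono audio."""
    problems = []
    if info["sample_rate_hz"] not in (8000, 16000):
        problems.append(f"sample rate is {info['sample_rate_hz']} Hz, expected 8000 or 16000 Hz")
    if info["bit_depth"] != 16:
        problems.append(f"bit depth is {info['bit_depth']}, expected 16")
    if info["channels"] != 1:
        problems.append(f"{info['channels']} channels, expected 1 (mono)")
    return problems
```

For example, `format_problems(describe_wav("input.wav"))` returns an empty list when the file already matches the required format, and one message per mismatch otherwise.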

Convert your audio format

If your audio file uses an unsupported sample rate, bit depth, channel, or encoding, the service returns an error. You must convert the file to a supported format before testing.

  • Convert the audio format on Linux

Use the following common FFmpeg commands to convert your audio. For more information, see Download FFmpeg.

# Check audio format details like sample rate, channels, and codec
ffmpeg -i input.mp3

# Convert a WAV file to 8 kHz, 16-bit, mono WAV
# (-f wav writes a WAV container with a header; -f s16le writes headerless raw PCM)
ffmpeg -i input.wav -ar 8000 -ac 1 -acodec pcm_s16le -f wav output.wav

# Convert a WAV file to 16 kHz, 16-bit, mono WAV
ffmpeg -i input.wav -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav

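# Convert a raw PCM file (assumed here to be 16 kHz, 16-bit, mono) to WAV
# Raw PCM has no header, so the input format, sample rate, and channel count go before -i
ffmpeg -f s16le -ar 16000 -ac 1 -i input.pcm -acodec pcm_s16le -f wav output.wav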

# Convert a WAV file to 16 kHz, 16-bit, mono PCM
ffmpeg -y -i input.wav -acodec pcm_s16le -f s16le -ac 1 -ar 16000 output.pcm

# Convert an MP3 file to 16 kHz, 16-bit, mono WAV
ffmpeg -y -i input.mp3 -acodec pcm_s16le -f wav -ac 1 -ar 16000 output.wav

# Convert a 44.1 kHz, 16-bit WAV file to 16 kHz, 16-bit, mono WAV
# FFmpeg reads the source sample rate and channel count from the WAV header
ffmpeg -y -i input.wav -acodec pcm_s16le -f wav -ac 1 -ar 16000 output.wav

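# Convert an 8 kHz A-law file to 8 kHz, 16-bit, mono WAV
# -f alaw treats the input as headerless raw A-law data; for an A-law WAV file that has a header, omit "-f alaw -ar 8000"
ffmpeg -f alaw -ar 8000 -i input.wav -ar 8000 -ac 1 -acodec pcm_s16le -f wav output.wav

# Convert an 8 kHz μ-law file to 8 kHz, 16-bit, mono WAV
# -f mulaw treats the input as headerless raw μ-law data; for a μ-law WAV file that has a header, omit "-f mulaw -ar 8000"
ffmpeg -f mulaw -ar 8000 -i input.wav -ar 8000 -ac 1 -acodec pcm_s16le -f wav output.wav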

# Convert an AMR file to 16 kHz, 16-bit, mono WAV
ffmpeg -i input.amr -ar 16000 -ac 1 -acodec pcm_s16le -f wav output.wav
  • Convert the audio format on Windows

On Windows, you can use an audio conversion tool such as Adobe Audition, Cool Edit Pro, or other online or offline converters.

Use the conversion tool to open an audio file, modify the format in Export Settings, and then run the export. The following example shows how to export 16 kHz data. In the Export Audio Mixdown dialog box, under Sample Type, set Sample Rate to 16000 Hz, Channel to Mono, and Bit Depth to 16 Bit.
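When many files need converting, the FFmpeg conversion can be scripted so that the same settings are applied on Linux and Windows. The following is a minimal sketch that assumes `ffmpeg` is installed and on the PATH. The function names `build_convert_cmd` and `convert` are illustrative. The command uses the `-ar`, `-ac`, and `-acodec` options shown in the FFmpeg examples, plus `-f wav` so that the output is a WAV file with a header.

```python
import subprocess


def build_convert_cmd(src: str, dst: str, sample_rate_hz: int = 16000) -> list[str]:
    """Build an FFmpeg command that writes 16-bit, mono PCM WAV at the given sample rate."""
    if sample_rate_hz not in (8000, 16000):
        raise ValueError("sample rate must be 8000 or 16000 Hz")
    return [
        "ffmpeg", "-y",               # overwrite the output file if it exists
        "-i", src,
        "-ar", str(sample_rate_hz),   # output sample rate
        "-ac", "1",                   # mono
        "-acodec", "pcm_s16le",       # 16-bit little-endian PCM
        "-f", "wav",                  # WAV container with a header
        dst,
    ]


def convert(src: str, dst: str, sample_rate_hz: int = 16000) -> None:
    """Run the conversion and raise CalledProcessError if FFmpeg fails."""
    subprocess.run(build_convert_cmd(src, dst, sample_rate_hz), check=True)
```

For example, `convert("input.mp3", "output.wav", 8000)` produces an 8 kHz file suitable for a telephony model. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.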

FAQ

Sample audio works, but my file returns no result

Check if your audio file's format meets the input requirements. For more information, see Supported audio formats for Speech Recognition services.

Convert your audio to an uncompressed WAV file with an 8 kHz or 16 kHz sample rate, a 16-bit bit depth, and a single channel (mono). For instructions, see Convert your audio format.

UNSUPPORT_SAMPLE_RATE error

  • Make sure the sample rate of the model you selected in the console matches the sample rate of your audio file. An 8 kHz model requires 8 kHz audio, and a 16 kHz model requires 16 kHz audio.

  • Add the sample rate adaptation parameter to the request: enable_sample_rate_adaptive=true. For more information, see the API reference.

  • If the issue persists, convert your audio format and try again. For instructions, see Convert your audio format.

Real-time Speech Recognition returns null

  • Real-time Speech Recognition is designed for streaming audio and only supports uncompressed PCM or WAV files with an 8 kHz or 16 kHz sample rate, a 16-bit bit depth, and a single channel (mono). For more information, see Supported audio formats for Speech Recognition services.

  • If you are testing the service with a pre-recorded audio file, you must first convert it to a supported format. For instructions, see Convert your audio format.

MP3 file error in Short Speech Recognition

Short Speech Recognition does support the MP3 format, but this error can occur if other properties like sample rate or channel count are incorrect. If the error persists, use the Audio File Transcription or Audio File Transcription (Offline Edition) service and add the enable_sample_rate_adaptive=true parameter to your request. For more information on format requirements, see Supported audio formats for Speech Recognition services.

TOO_LONG_SPEECH error in Short Speech Recognition

As described in the service requirements, Short Speech Recognition supports audio up to 60 seconds long. If your audio is longer than 60 seconds, use Real-time Speech Recognition, Audio File Transcription, or Audio File Transcription (Offline Edition). For more information, see Supported audio formats for Speech Recognition services.

Audio file fails to transcribe in the console demo

Convert your audio to an uncompressed WAV file with an 8 kHz or 16 kHz sample rate, a 16-bit bit depth, and a single channel (mono). For instructions, see Convert your audio format.

AUDIO_DURATION_TOO_LONG error

The maximum supported audio duration for Audio File Transcription and Audio File Transcription (Offline Edition) is 12 hours. This error occurs if your audio file exceeds this limit. You can use an FFmpeg command to split the long audio file into smaller segments and transcribe them individually.

Download FFmpeg from its official website: https://ffmpeg.org/download.html

Example FFmpeg command:

ffmpeg -i input_audio.wav -ss 00:10:00 -to 5:10:00 -c copy output_audio.wav

Parameter descriptions:

-i input_audio.wav: Specifies the input file, input_audio.wav.

-ss 00:10:00: Sets the start time. Processing begins at 10 minutes into the original audio.

-to 5:10:00: Sets the end time. Processing stops at 5 hours and 10 minutes into the original audio.

-c copy: Specifies the codec option. The copy value instructs FFmpeg to copy the audio stream directly, bypassing re-encoding.

output_audio.wav: Specifies the output file, output_audio.wav. The duration of the resulting audio is 5 hours.

Note: If you are transcribing a dual-channel audio file and need to process both channels, the maximum supported audio duration is 6 hours. This logic applies to other multi-channel audio formats.
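To split a long recording into several segments rather than one, the start and end times for each FFmpeg call can be computed from the total duration. The sketch below is illustrative: the function names and the `segment_NN.wav` output names are hypothetical, and the total duration in seconds is assumed to be known already. Pass 12 hours as the maximum segment length for mono audio, or 6 hours when both channels of a dual-channel file are processed.

```python
def split_points(total_s: int, max_segment_s: int) -> list[tuple[int, int]]:
    """Return (start, end) pairs in seconds that cover the audio in chunks of at most max_segment_s."""
    if max_segment_s <= 0:
        raise ValueError("max_segment_s must be positive")
    return [
        (start, min(start + max_segment_s, total_s))
        for start in range(0, total_s, max_segment_s)
    ]


def hms(seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS for the -ss and -to options."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def split_commands(src: str, total_s: int, max_segment_s: int) -> list[list[str]]:
    """Build one FFmpeg stream-copy command per segment."""
    commands = []
    for index, (start, end) in enumerate(split_points(total_s, max_segment_s), start=1):
        commands.append([
            "ffmpeg", "-i", src,
            "-ss", hms(start), "-to", hms(end),
            "-c", "copy",
            f"segment_{index:02d}.wav",  # hypothetical output name
        ])
    return commands


# A 13-hour mono recording split at the 12-hour limit yields two segments.
for command in split_commands("input_audio.wav", 13 * 3600, 12 * 3600):
    print(" ".join(command))
```

A shorter maximum, such as 11 hours, leaves a margin below the limit. Fixed split times can cut through a spoken word, so results near segment boundaries may need review.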