Video formats, codecs, and video transcoding - ApsaraVideo VOD - Alibaba Cloud Help Center

Common video and audio terms used in ApsaraVideo VOD, including file format, container format, codec, and transcoding.

File format

Every file on your computer has an extension—such as .doc, .jpg, or .avi—that tells the operating system which application can open it. Common video file extensions include .avi, .mpg, and .mp4.

Container format

A container format—also called a multimedia container—bundles compressed video streams, audio streams, and metadata (such as titles and captions) into a single file according to a defined specification.

Container formats fall into two categories:

Storage-oriented formats include AVI, ASF (WMA or WMV), MP4, MKV, and RMVB (RM or RA).
Streaming-oriented formats include Flash Video (FLV), Transport Stream (TS), and MP4. FLV and TS are delivered through a streaming protocol: TS segments are typically served over HTTP Live Streaming (HLS), and FLV over Real-Time Messaging Protocol (RTMP). MP4 supports streaming over HTTP.

Common streaming-oriented formats and their associated streaming protocols include:

MP4: a widely supported container format compatible with mobile devices (iOS and Android) and desktop browsers. MP4 files store all media metadata—including arrangement and timing information—in a header. For long videos, this header grows large and slows loading. MP4 is best suited for short videos.

An MP4 file is organized into boxes (formerly called atoms). All media metadata is stored in these boxes and references the actual media data, such as video frames. Because a player must read this metadata before playback can begin, a longer video means a larger header and a slower start.
HLS (HTTP Live Streaming): an HTTP-based streaming protocol developed by Apple. HLS uses the TS container format by default, splitting the stream into small TS fragments and using an M3U8 index file to control playback. HLS avoids long header-buffering delays and works well for on-demand video. It has broad support on mobile devices (iOS and Android), though compatibility with Internet Explorer on PCs can be limited. Use ApsaraVideo Player for Web for best browser compatibility.
FLV: a format developed by Adobe with strong Flash Player support on PCs. On mobile devices, FLV requires a dedicated player app and is not supported by most mobile browsers, including those on Apple devices. Use ApsaraVideo Player for FLV playback.
DASH (Dynamic Adaptive Streaming over HTTP): uses fragmented MP4 (fMP4) to split a video into multiple independently encodable segments. Each segment can use different encoding settings—such as resolution or bitrate—letting players select segments dynamically for adaptive bitrate streaming and seamless quality switching. The Media Presentation Description (MPD) file in DASH serves the same role as the M3U8 file in HLS. DASH is widely used by major streaming platforms such as YouTube and Netflix.
HLS with fMP4: an extension announced by Apple at WWDC 2016, enabling HLS to use fMP4 in addition to TS. This means a single transcoding job can produce outputs compatible with both DASH and HLS.

HLS (including HLS with fMP4) and DASH are the most widely adopted adaptive streaming technologies. Use one of these formats for production deployments.
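
To make the role of the M3U8 index file concrete, the following minimal sketch builds an HLS media playlist for a handful of TS fragments. The function name, segment file names, and durations are hypothetical and chosen only for illustration; a real packager or ApsaraVideo VOD itself produces this file for you.

```python
import math


def build_media_playlist(segment_names, segment_durations):
    """Return the text of a minimal HLS VOD media playlist (M3U8)."""
    if len(segment_names) != len(segment_durations):
        raise ValueError("each segment needs exactly one duration")
    # EXT-X-TARGETDURATION is an integer upper bound on segment duration.
    target = math.ceil(max(segment_durations))
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for name, duration in zip(segment_names, segment_durations):
        lines.append(f"#EXTINF:{duration:.3f},")  # duration of the next segment
        lines.append(name)                        # URI of the TS fragment
    lines.append("#EXT-X-ENDLIST")                # no more segments will follow
    return "\n".join(lines) + "\n"


# Hypothetical segment names and durations.
playlist = build_media_playlist(["seg0.ts", "seg1.ts", "seg2.ts"], [10.0, 10.0, 4.5])
print(playlist)
```

The player downloads this small text file first and then fetches only the fragments it needs, which is why HLS avoids the long header-loading delay described for MP4.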

Codec

A codec—short for coder-decoder—is a program or device that compresses and decompresses digital video or audio. Compression is typically lossy. Codecs also define the compression technology used when converting a video from one format to another. Common codec families include:

H.26X series: developed by the International Telecommunication Union (ITU). Includes H.261, H.262, H.263, H.264, and H.265.
- H.261: used in early video conferencing and video calling products.
- H.262: used in broadcasting, DVD, and digital TV for standard definition (SD) video.
- H.263: used in video conferencing, video calls, and online video.
- H.264: also known as MPEG-4 Part 10 or Advanced Video Coding (AVC). The most widely adopted video compression standard for high-precision recording, compression, and distribution.
- H.265: also known as High Efficiency Video Coding (HEVC) and the successor to H.264. HEVC delivers roughly twice the compression ratio of H.264 (about a 50% bitrate reduction at equivalent visual quality) and supports resolutions up to 8192 × 4320 (8K). The industry is steadily moving toward it.
MPEG series: developed by the Moving Picture Experts Group (MPEG), a working group under the International Organization for Standardization (ISO). Key video coding standards include:
- MPEG-1 Part 2: used on VCDs and some early online videos, with quality roughly equivalent to VHS.
- MPEG-2 Part 2: equivalent to H.262. Used in DVDs, SVCDs, digital video broadcasting, and cable distribution.
- MPEG-4 Part 2: used for network transmission, broadcasting, and media storage. Offers better compression than MPEG-2 and early H.263.
- MPEG-4 Part 10: technically identical to ITU-T H.264, developed jointly by ITU-T and MPEG. ITU-T named it H.264; ISO/IEC named it MPEG-4 AVC.
Audio Video Coding Standard (AVS): a family of digital audio and video coding standards developed by China's Audio Video Coding Standards Workgroup. Two generations have been finalized.
- The first generation includes AVS1 ("Information Technology—Advanced Audio and Video Coding—Part 2: Video") and AVS+ ("Part 16: Broadcasting Video"). AVS+ achieves compression efficiency comparable to H.264/MPEG-4 AVC High Profile.
- The second generation (AVS2) targets ultra-high-definition (UHD) video (4K and beyond) and high dynamic range (HDR) content. Its compression efficiency is double that of AVS+ and H.264/MPEG-4 AVC, and exceeds HEVC/H.265.
Other codecs—such as VP8 and VP9 (Google) and RealVideo (RealNetworks)—are rarely used for online video and are not covered here.

When choosing a codec, prioritize compatibility with your target playback clients—mobile apps and web browsers. Use widely supported codecs. ApsaraVideo VOD supports the following codecs:

Video: H.264/AVC (default) and H.265/HEVC
Audio: MP3 (default), AAC, VORBIS, and FLAC

Transcoding

Transcoding converts a compressed video stream into another video stream to match different network bandwidths, device capabilities, or output requirements. The process involves decoding the input stream and re-encoding it into the target format. In VOD scenarios, transcoding is commonly used to generate multiple renditions at different bitrates and resolutions to support adaptive bitrate (ABR) streaming. The input and output streams may use the same or different codecs.

Container format conversion

Container format conversion changes the container of a video or audio file—for example, converting an AVI file to MP4—without decoding or re-encoding the audio and video streams. The process extracts the compressed streams from the source container and repackages them into the target container.

Compared to transcoding, container format conversion has two key advantages:

Fast processing: encoding and decoding are the most time-consuming steps in transcoding. Container format conversion skips both entirely.
Lossless quality: without decompression or re-compression, no audio or video quality is lost.

The converted file retains nearly identical resolution and bitrate to the original and is considered original quality.
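
The difference between the two operations can be sketched with FFmpeg command lines. FFmpeg is an assumption here: the text does not say which tool performs the work, and ApsaraVideo VOD exposes these operations through its own transcoding templates. The helper names, file names, bitrate, and height are hypothetical.

```python
import shlex


def remux_command(src, dst):
    # "-c copy" copies every stream as-is: nothing is decoded or re-encoded.
    return ["ffmpeg", "-i", src, "-c", "copy", dst]


def transcode_command(src, dst, video_bitrate="1000k", height=720):
    # Decode the input, scale to the target height (width follows the aspect
    # ratio, rounded to an even number), and re-encode as H.264 video + AAC audio.
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-b:v", video_bitrate,
        "-vf", f"scale=-2:{height}",
        "-c:a", "aac",
        dst,
    ]


# Print the commands instead of running them, so the sketch has no side effects.
print(shlex.join(remux_command("input.avi", "output.mp4")))
print(shlex.join(transcode_command("input.avi", "output_720p.mp4")))
```

Stream copy only succeeds when the target container can hold the source codecs; otherwise a full transcode is required.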

Bitrate

Bitrate is the amount of data a video uses per unit of time, measured in bits per second (bps)—commonly expressed as kilobits per second (Kbps) or megabits per second (Mbps). Bitrate is the primary lever for controlling video quality during encoding.

For videos at the same resolution, higher bitrate means less compression and higher quality, but also larger file size. Use this formula to estimate file size:

File size (bytes) = Duration (seconds) × Bitrate (bps) / 8

For example, a 60-minute (3,600-second) 720p video encoded at 1 Mbps produces a file of approximately 3,600 × 1,000,000 / 8 = 450,000,000 bytes, or about 450 MB.
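
The formula above can be written as a short helper. This is a rough estimate only: it ignores container overhead, and the function name is illustrative.

```python
def estimate_file_size_bytes(duration_seconds, bitrate_bps):
    """Estimate file size in bytes: duration x bitrate, divided by 8 bits per byte."""
    return duration_seconds * bitrate_bps / 8


# The 60-minute, 1 Mbps example from the text.
size = estimate_file_size_bytes(60 * 60, 1_000_000)
print(f"{size / 1_000_000:.0f} MB")  # 450 MB
```

If the audio track is encoded separately, add its bitrate to the video bitrate before estimating.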

Each resolution has a recommended bitrate range. Below this range, visual quality degrades noticeably. Above it, quality gains are minimal while storage and bandwidth costs increase.

Resolution

Resolution describes a video's level of detail, expressed as pixel count per dimension—such as 1280 × 720. Higher resolution means more pixels and sharper images.

Resolution directly influences bitrate requirements. Higher resolutions generally need higher bitrates to maintain acceptable quality, though the relationship is not strictly linear.

Frame rate

Frame rate is the number of video frames displayed per second, measured in frames per second (FPS) or hertz (Hz).

Higher frame rates produce smoother, more lifelike motion. 25–30 FPS is acceptable for most use cases. 60 FPS significantly improves perceived interactivity and realism. Beyond 75 FPS, improvements are imperceptible to most viewers. At a fixed resolution, higher frame rates demand more GPU processing power.

Group of Pictures (GOP)

A Group of Pictures (GOP) is a sequence of consecutive frames in an MPEG-encoded video. Each GOP starts with an I-frame and can contain three frame types:

I-frame (intra coded picture): the keyframe. An I-frame is self-contained—it encodes a complete image without referencing other frames, similar to a standalone JPEG. Every GOP begins with an I-frame, and every video sequence starts with one.
P-frame (predictive coded picture): stores only the difference from the previous I-frame or P-frame. During decoding, this difference is added to the cached reference frame to reconstruct the image. P-frames use fewer bits than I-frames but are sensitive to transmission errors due to their dependency on prior frames.
B-frame (bidirectionally predictive coded picture): depends on both the preceding and the next frame. Decoding a B-frame requires both the cached prior frame and the decoded next frame. B-frames achieve high compression but require stronger decoding performance.
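
Because a B-frame needs the next reference frame before it can be decoded, frames are decoded in a different order than they are displayed. The sketch below uses a deliberately simplified model, in which every B-frame depends only on the nearest I- or P-frame on each side, to show the reordering. Real encoders allow more complex reference structures.

```python
def decode_order(display_types):
    """Reorder frames from display order to a valid decode order.

    display_types is a sequence such as "IBBPBBP". Returns a list of
    (display_index, frame_type) tuples in decode order.
    """
    ordered = []
    waiting_b = []  # B-frames waiting for their next reference frame
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            waiting_b.append((index, frame_type))
        else:
            # An I- or P-frame must be decoded before the B-frames shown before it.
            ordered.append((index, frame_type))
            ordered.extend(waiting_b)
            waiting_b = []
    ordered.extend(waiting_b)  # trailing B-frames with no later reference
    return ordered


print(decode_order("IBBPBBP"))
# [(0, 'I'), (3, 'P'), (1, 'B'), (2, 'B'), (6, 'P'), (4, 'B'), (5, 'B')]
```

This reordering is one reason B-frames require more decoder buffering than I- and P-frames.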

The GOP value is the number of frames between two Instantaneous Decoding Refresh (IDR) frames—in other words, the keyframe interval. The time interval equals the GOP value divided by the frame rate. For example, ApsaraVideo VOD defaults to a GOP of 250 frames at 25 FPS, yielding a 10-second interval.
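
The relationship between GOP value, frame rate, and keyframe interval is simple arithmetic. A minimal sketch with illustrative function names:

```python
def keyframe_interval_seconds(gop_frames, fps):
    """Time between two keyframes: GOP value divided by frame rate."""
    return gop_frames / fps


def gop_for_interval(interval_seconds, fps):
    """GOP value (in frames) that yields the requested keyframe interval."""
    return round(interval_seconds * fps)


print(keyframe_interval_seconds(250, 25))  # 10.0, the default quoted in the text
print(gop_for_interval(2, 30))             # 60 frames for a 2-second interval
```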

Balance GOP size against four competing factors:

File size: larger GOPs reduce file size, but frames near the end of an oversized GOP may distort, lowering quality.
Seek speed: when seeking, players jump to the nearest keyframe before the target position. Larger GOPs mean more predictive frames to decode before reaching that position, increasing buffering time.
Encoding efficiency: P- and B-frames are more complex to encode than I-frames. Too many of them degrade overall encoding efficiency.
Bandwidth: GOPs that are too small force a higher bitrate to maintain quality, increasing bandwidth usage.

Use at least one keyframe per second as a baseline. More keyframes improve quality but increase bandwidth consumption.

Scan mode

Progressive scan: each frame is rendered in a single pass by scanning all lines sequentially from top to bottom.
Interlaced scan: each frame is divided into two fields. The first field scans odd-numbered lines; the second field scans even-numbered lines. The two fields are combined to form a complete image.
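
As a small illustration, a frame can be modeled as a list of scan lines. The sketch below splits it into two fields and weaves them back together. The function names are illustrative, and real deinterlacers also compensate for motion between the two fields.

```python
def split_fields(frame_lines):
    """Split a frame into its two interlaced fields."""
    first_field = frame_lines[0::2]   # lines 1, 3, 5, ... (odd-numbered, counting from 1)
    second_field = frame_lines[1::2]  # lines 2, 4, 6, ... (even-numbered)
    return first_field, second_field


def weave(first_field, second_field):
    """Recombine two fields into a full frame by alternating their lines."""
    frame = []
    for i, line in enumerate(first_field):
        frame.append(line)
        if i < len(second_field):
            frame.append(second_field[i])
    return frame


frame = ["line1", "line2", "line3", "line4", "line5"]
first, second = split_fields(frame)
print(first)   # ['line1', 'line3', 'line5']
print(second)  # ['line2', 'line4']
```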

IDR frame alignment

An IDR (Instantaneous Decoding Refresh) frame is a special type of I-frame with one key distinction: after a normal I-frame, later P- and B-frames may still reference earlier frames. After an IDR frame, no frame can reference anything that came before it.

IDR frames force an immediate refresh of the decoder's reference buffer, preventing errors from propagating beyond the IDR boundary. They also enable true random access—when a player seeks to a position in the video, jumping to the nearest IDR frame is fastest because no backward parsing is required.

When transcoding a video into multiple bitrates, enable IDR frame alignment to synchronize IDR frames across all output renditions at the same timestamps. Players can then switch between bitrates without visible stuttering or frame artifacts.
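
One way to reason about alignment is to compare the IDR timestamps of each rendition. The sketch below assumes you already have those timestamps (in milliseconds) from some inspection tool. The rendition names and values are hypothetical.

```python
def is_idr_aligned(idr_times_by_rendition):
    """True if every rendition has IDR frames at exactly the same timestamps."""
    time_lists = [sorted(times) for times in idr_times_by_rendition.values()]
    return all(times == time_lists[0] for times in time_lists)


def common_switch_points(idr_times_by_rendition):
    """Timestamps at which a player could switch between any two renditions."""
    time_sets = [set(times) for times in idr_times_by_rendition.values()]
    if not time_sets:
        return []
    return sorted(set.intersection(*time_sets))


aligned = {"360p": [0, 2000, 4000], "720p": [0, 2000, 4000]}
misaligned = {"360p": [0, 2000, 4000], "720p": [0, 2500, 5000]}

print(is_idr_aligned(aligned))           # True
print(is_idr_aligned(misaligned))        # False
print(common_switch_points(misaligned))  # [0]
```

With misaligned renditions, the only clean switch point in this example is the very first frame, which is why alignment matters for smooth bitrate switching.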

Encoding profile

A profile is a named set of encoding features designed for specific use cases. H.264 defines three main profiles:

Baseline: supports I- and P-frames with progressive scan and context-adaptive variable-length coding (CAVLC). Designed for low-power or fault-tolerant applications such as video calls on mobile devices.
Main: adds B-frame support, interlaced scanning, and context-adaptive binary arithmetic coding (CABAC). Used in mainstream consumer electronics such as MP4 players, portable video players, PSPs, and iPods.
High: extends Main with 8 × 8 intra prediction, custom quantization, lossless video coding, and extended YUV formats (e.g., 4:4:4). Used in broadcast, Blu-ray Disc, and high-definition television.

Bit rate

Bit rate is the number of bits transmitted per second, measured in bits per second (bps). In the context of encoded video and audio, bit rate is equivalent to bitrate—it indicates how many bits represent each second of compressed content. Higher bit rate means better quality but larger files; lower bit rate means smaller files. See Bitrate for details.

Bitrate control method

Bitrate control determines how the encoder allocates bits during encoding. Three methods are commonly used:

Variable bitrate (VBR): the encoder adjusts bitrate dynamically based on scene complexity—using more bits for complex scenes and fewer for simple ones. VBR prioritizes quality while managing file size.
Constant bitrate (CBR): the bitrate stays fixed throughout the entire file, regardless of content complexity. CBR files are larger than VBR or ABR files for the same quality level.
Average bitrate (ABR): a middle ground between VBR and CBR, not to be confused with adaptive bitrate streaming. ABR divides the stream into segments of approximately 50 frames (about 1.7 seconds at 30 FPS) and allocates lower bitrates to simpler segments and higher bitrates to complex ones. The average bitrate across the file stays at or near the target value, while local peaks may exceed it.

ABR delivers predictable file sizes while adapting to content complexity. ApsaraVideo VOD uses ABR as the default bitrate control method.
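
The following toy model contrasts a fixed allocation with an average-bitrate allocation. It is not how a real encoder's rate control works: the per-segment complexity scores are hypothetical inputs, and the proportional split is an assumption made only to show that the average can stay on target while individual segments vary.

```python
def constant_allocation(complexities, target_bps):
    """CBR-style: every segment gets the same bitrate, whatever its complexity."""
    return [target_bps] * len(complexities)


def average_allocation(complexities, target_bps):
    """ABR-style: split the bit budget in proportion to segment complexity,
    so the mean across all segments equals the target."""
    total = sum(complexities)
    count = len(complexities)
    return [target_bps * count * c / total for c in complexities]


complexities = [1, 2, 3]  # hypothetical scores: simple, medium, complex
print(constant_allocation(complexities, 2000))  # [2000, 2000, 2000]
print(average_allocation(complexities, 2000))   # [1000.0, 2000.0, 3000.0]
```

The complex segment peaks above the target while the simple one drops below it, yet the average remains 2000.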

Encoding format

Audio encoding formats fall into two categories: lossless and lossy. Strictly speaking, every digital encoding only approximates the original analog signal, because sampling and quantization discard some information. PCM encoding, which reaches the highest practical fidelity, is conventionally treated as lossless. Most common internet audio formats, including MP3 and AAC, use lossy encoding.

Sample rate

Sample rate—also called sampling frequency—is the number of audio samples taken per second from a continuous analog signal to produce a digital signal, measured in hertz (Hz). Higher sample rates capture more detail and reproduce sound more accurately.

Bitrate

See the Bitrate section above.

Sound channel

A sound channel is an independent audio signal captured or reproduced from a distinct spatial position. The number of channels corresponds to the number of audio sources during recording or speakers during playback.
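
Sample rate and channel count together determine the bitrate of uncompressed PCM audio. The sketch below also needs a bit depth (bits per sample), which this page does not define, so treat that parameter as an assumption. The CD-audio values are used only as a familiar example.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# CD-quality stereo: 44.1 kHz, 16 bits per sample, 2 channels.
print(pcm_bitrate_bps(44_100, 16, 2))  # 1411200
```

Lossy codecs such as MP3 and AAC bring this figure down to a fraction of the uncompressed rate.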

UTC (ISO 8601 standard time format)

Coordinated Universal Time (UTC) is the primary international time standard, based on atomic seconds and closely aligned with universal time. The acronym UTC is a compromise between the English abbreviation CUT and the French abbreviation TUC.

Unless otherwise specified, ApsaraVideo VOD returns all time fields and expects all API time parameters in UTC, using the ISO 8601 format: YYYY-MM-DDThh:mm:ssZ. For example, 2017-01-11T12:00:00Z equals 20:00:00 on January 11, 2017 in China Standard Time (UTC+8).
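
A minimal sketch of producing and parsing this time format with the Python standard library. The helper names are illustrative and are not part of any ApsaraVideo VOD SDK.

```python
from datetime import datetime, timedelta, timezone

ISO_UTC = "%Y-%m-%dT%H:%M:%SZ"  # YYYY-MM-DDThh:mm:ssZ


def to_api_time(dt):
    """Format a timezone-aware datetime as a UTC string in the format above."""
    return dt.astimezone(timezone.utc).strftime(ISO_UTC)


def from_api_time(text):
    """Parse a UTC string in the format above into a timezone-aware datetime."""
    return datetime.strptime(text, ISO_UTC).replace(tzinfo=timezone.utc)


china = timezone(timedelta(hours=8))  # China Standard Time, UTC+8
local = datetime(2017, 1, 11, 20, 0, 0, tzinfo=china)

print(to_api_time(local))  # 2017-01-11T12:00:00Z
print(from_api_time("2017-01-11T12:00:00Z").astimezone(china).strftime("%Y-%m-%d %H:%M:%S"))
# 2017-01-11 20:00:00
```

Pass only timezone-aware datetimes to `to_api_time`; a naive datetime would be interpreted in the machine's local time zone.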

Short Video SDK terms

Multi-source recording

Multi-source recording lets you combine multiple video sources—such as a camera feed and a screen recording—into a single video. Sources are arranged using layouts such as side-by-side, top-and-bottom, or Picture-in-Picture (PiP). Every frame in the output contains data from all selected sources simultaneously.

Track

In duet recording, two video streams are abstracted as Track A and Track B. Track A holds camera-captured video. Track B holds the sample video. This abstraction helps developers understand track layout concepts.
In multi-source recording, multiple video sources are abstracted as separate tracks. For example, Track A holds camera video and Track B holds screen-recorded video. This abstraction helps developers understand track layout concepts.

Layout

Layout is a track property that defines where the track's video appears in the final output. It uses a normalized coordinate system with two dimensions: the center point coordinates and the track size (width and height).

Duet recording layout:

In this layout, Track A and Track B each occupy half the screen. Both tracks have width 0.5 and height 1.0. Track A's center point is (0.25, 0.5). Track B's center point is (0.75, 0.5).
Two core classes handle layout in duet recording APIs:
- AliyunMixTrackLayoutParam: describes the center point and size.
- AliyunMixRecorderDisplayParam: includes AliyunMixTrackLayoutParam, plus displayMode and layoutLevel. displayMode controls whether to fill or clip when aspect ratios differ. layoutLevel sets the display order: higher values appear on top. When tracks overlap, the track with the higher layoutLevel covers the other.
Multi-source recording layout:

In this layout, Track A and Track B each occupy half the screen. Both tracks have width 0.5 and height 1.0. Track A's center point is (0.25, 0.5). Track B's center point is (0.75, 0.5).
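
To show how normalized layout values translate to pixels, the sketch below converts a center point and size into a pixel rectangle on an output canvas. The `TrackLayout` class and `to_pixel_rect` function are hypothetical and are not the SDK's classes. The sketch also assumes the origin (0, 0) is the top-left corner and (1, 1) the bottom-right corner of the output.

```python
from dataclasses import dataclass


@dataclass
class TrackLayout:
    """Hypothetical stand-in for a normalized track layout (not an SDK class)."""
    center_x: float
    center_y: float
    width: float
    height: float


def to_pixel_rect(layout, canvas_width, canvas_height):
    """Return (left, top, width, height) in pixels, assuming a top-left origin."""
    left = (layout.center_x - layout.width / 2) * canvas_width
    top = (layout.center_y - layout.height / 2) * canvas_height
    return (
        round(left),
        round(top),
        round(layout.width * canvas_width),
        round(layout.height * canvas_height),
    )


# The side-by-side layout described above, on a 1280 x 720 output.
track_a = TrackLayout(center_x=0.25, center_y=0.5, width=0.5, height=1.0)
track_b = TrackLayout(center_x=0.75, center_y=0.5, width=0.5, height=1.0)

print(to_pixel_rect(track_a, 1280, 720))  # (0, 0, 640, 720)
print(to_pixel_rect(track_b, 1280, 720))  # (640, 0, 640, 720)
```

Because the values are normalized, the same layout produces a correct split on any output resolution.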