Use cases and workflow for video translation - Intelligent Media Services (IMS) - Alibaba Cloud Help Center

Video translation uses AI and machine learning algorithms to convert video content from a source language into one or more target languages. It supports subtitle, speech, and lip-sync translation so that the output looks and sounds natural. This helps overcome language barriers, enrich educational content, enhance entertainment experiences, and promote cross-cultural communication.

Note

The service is available in the following regions:

Subtitle translation: China (Shanghai), China (Beijing), China (Shenzhen), China (Hangzhou), Asia Pacific SE 1 (Singapore), and US West 1 (Silicon Valley).
Speech translation: China (Shanghai), China (Beijing), China (Shenzhen), China (Hangzhou), Asia Pacific SE 1 (Singapore), and US West 1 (Silicon Valley).
Lip-sync translation: China (Shanghai) and Asia Pacific SE 1 (Singapore). This feature is not available in other regions.

Try the demo

You can try video translation at the AI Playground.

Advantages

Multiple languages and dialects:

Supports translation for over 40 languages.
Supports over 10 Chinese dialects to meet diverse speech needs.
A single translation task can generate outputs in more than 40 target languages.

Wide video format compatibility:

Supports various mainstream video formats, including MP4, WebM, MOV, and M3U8.

Rich audio formats and customization options:

Supports multiple audio formats such as MP3 and WAV.
Supports custom configurations for specific use cases.

Features

The video translation service provided by Intelligent Media Services (IMS) supports subtitle translation, speech translation, and lip-sync translation. The key features include:

Feature	Description	Highlights
Subtitle translation	Extracts existing subtitles from a video. Erases original subtitles. Translates subtitles into multiple languages. Outputs videos in multiple target languages in a single task. Adds the new subtitles to the video.	Efficient and accurate text translation, ideal for scenarios that require quick multi-language support.
Speech translation	In addition to the features of subtitle translation, this method also supports voice cloning, which replicates the original speaker's voice for the translated audio. Outputs videos in multiple target languages in a single task. Replaces the original audio track with the translated audio.	Adds an audio dimension to text translation, preserving the authenticity and emotional delivery of the original voice to enhance the viewing experience.
Lip-sync translation	In addition to the features of subtitle and speech translation, this method also supports lip movement synchronization with the translated audio.	The highest tier of translation, ensuring audio-visual consistency. It is ideal for highly realistic interactive or promotional content.
Post-editing	You can perform post-editing on speech and subtitle translation results by using Cloud Editing (a visual interface) or OpenAPI. Supports re-editing of the translated video.	Allows for flexible adjustments to fine-tune the translation output.

Billing

For billing details, see Video translation billing.

Get started

To help you create and manage video translation tasks, Intelligent Media Services provides three methods: the console, Cloud Editing, and OpenAPI.

Intelligent Media Services console: Ideal for users who prefer to use an intuitive graphical interface.
Cloud Editing: Ideal for users who are familiar with video editing and want more control. You can add materials in an editing project, use AI translation tools, and perform post-editing on the results.
OpenAPI: Ideal for developers and technical staff who want to integrate video translation into third-party systems and automate translation for large volumes of videos.

Supported methods vary by translation type:

Subtitle translation: Intelligent Media Services console, Cloud Editing (WebSDK), and OpenAPI.
Speech translation: Intelligent Media Services console, Cloud Editing (WebSDK), and OpenAPI.
Lip-sync translation: Intelligent Media Services console and OpenAPI.
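
The availability rules on this page (regions per translation type, and entry methods per translation type) can be checked before a task is created. The following is a minimal sketch, not part of the service: the function name, the short type and method labels, and the region IDs are assumptions chosen for illustration. The region IDs are assumed to be the standard Alibaba Cloud IDs for the regions named in the note at the top of this page.

```python
# Availability data transcribed from this page.
# Region IDs are assumed to match the region names listed in the note above.
ALL_REGIONS = {
    "cn-shanghai", "cn-beijing", "cn-shenzhen",
    "cn-hangzhou", "ap-southeast-1", "us-west-1",
}

REGIONS_BY_TYPE = {
    "subtitle": ALL_REGIONS,
    "speech": ALL_REGIONS,
    "lip-sync": {"cn-shanghai", "ap-southeast-1"},
}

# Hypothetical short labels for the three entry methods.
METHODS_BY_TYPE = {
    "subtitle": {"console", "cloud-editing", "openapi"},
    "speech": {"console", "cloud-editing", "openapi"},
    "lip-sync": {"console", "openapi"},
}


def is_supported(translation_type: str, region_id: str, method: str) -> bool:
    """Return True if the type is offered in the region through the method."""
    if translation_type not in REGIONS_BY_TYPE:
        raise ValueError(f"unknown translation type: {translation_type!r}")
    return (
        region_id in REGIONS_BY_TYPE[translation_type]
        and method in METHODS_BY_TYPE[translation_type]
    )
```

For example, `is_supported("lip-sync", "cn-beijing", "openapi")` is false because lip-sync translation is limited to China (Shanghai) and Singapore.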

Create a translation task

Method 1: Use the console

Go to the Video Translation page in the Intelligent Media Services console.
In the upper-left corner, select a region based on your requirements.
Click Create Translation Task to go to the Create Translation Task page.
Configure the following parameters:
- Translation Method: Select Subtitle Translation, Speech Translation, or Lip-sync Translation.
- Select Source File: Upload the video file that you want to translate. MP4, WebM, and MOV formats are supported.
- Subtitle Source: Choose where the source subtitles come from and whether to erase the original subtitles. Supported sources are OCR, ASR, OCR/ASR, and Specified SRT File.
  - OCR: If your video contains hardcoded subtitles but you do not have a subtitle file, use OCR to extract subtitle text from the video frames. To improve efficiency and accuracy, you can also specify an OCR Range.
  - ASR: If your video does not have subtitles, use ASR to generate subtitles by transcribing speech from the audio track.
  - OCR/ASR: This is a hybrid method that combines OCR and ASR. It prioritizes OCR for subtitle extraction and falls back to ASR if OCR fails.
  - Specified SRT File: If you have an existing subtitle file (such as in .srt format), you can upload it to be used as the source.
- Target Language: You can select multiple target languages at a time. After you submit the translation task, the system generates a video file for each selected language.
- Storage Directory and File Name: Specify the storage location and name for the translated files.
After you confirm that all parameters are correct, click Submit Translation Task to create the task.
You can monitor the task status, parameters, and results in the task list. When the task status changes to Processed, click View Details to view the task details.
The details page has three sections: Basic Settings, Advanced Settings, and Output. The Basic Settings section displays information such as the source file type, translation method, target languages (for example, Chinese > English, Japanese), creation time, and task status. The Advanced Settings section displays configurations such as whether to erase source subtitles, the target subtitle type, the subtitle source, and whether post-editing is enabled. The Output section lists the output path for each target language and provides Edit and Download options.
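
The guidance for the Subtitle Source parameter amounts to a small decision rule. The sketch below is one way to express it, under an assumption that goes beyond the text: when it is unknown whether the video has hardcoded subtitles, the hybrid OCR/ASR option is chosen. The function name and its parameters are hypothetical, and the returned strings are the option labels shown in the console.

```python
from typing import Optional


def choose_subtitle_source(
    has_srt_file: bool,
    has_hardcoded_subtitles: Optional[bool],
) -> str:
    """Pick a Subtitle Source option.

    has_hardcoded_subtitles is None when it is unknown whether subtitles
    are burned into the video frames.
    """
    if has_srt_file:
        # An existing subtitle file is used directly as the source.
        return "Specified SRT File"
    if has_hardcoded_subtitles is None:
        # Assumption: when unsure, use the hybrid mode, which tries OCR
        # first and falls back to ASR if OCR fails.
        return "OCR/ASR"
    if has_hardcoded_subtitles:
        # Subtitles exist only in the video frames, so extract them.
        return "OCR"
    # No subtitles at all: transcribe speech from the audio track.
    return "ASR"
```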

Method 2: Use Cloud Editing

Before you begin

If you are unfamiliar with Cloud Editing, we recommend that you first review the Cloud Editing operation guides.

Procedure

Go to the Cloud Editing page.
In the upper-left corner, select a region based on your requirements.
On the Video Editing Project tab, click Create Editing Project. Follow the on-screen instructions to create the project, and then click Edit in the project list to open the Cloud Editing workspace.
In the upper-left Materials panel, click Import. In the Add Material dialog box on the right, select the files you want to translate and add them to the library. Then, add a file to a track in the timeline below by clicking the icon or by dragging the file directly. The Cloud Editing workspace provides several function menus on the left, such as Materials, Stock Media, Digital Avatars, Smart Dubbing, Subtitles, and Stickers. You can add video, audio, or image files by clicking the + Import button in the upper-right corner.
On the timeline, select the audio or video track that you want to translate. In the properties panel on the right, click AI translation to open the video translation panel. Follow the on-screen instructions to configure settings such as translation type, subtitle extraction, and target languages, and then click Submit. This example uses speech translation.
Wait a few minutes for the task to complete. Three tracks then appear in the timeline area: a subtitle track with the target-language subtitles, a video track with the original subtitles removed, and an audio track with the translated audio.
When the task is complete, click Export > Export Video in the upper-right corner. The system displays a dialog box for composing the video. Configure the parameters as prompted and click OK to compose and export the translated video. You can manage the translated clips in batches in the Clip Settings panel on the right.

Method 3: Use OpenAPI

Create a translation task
Set the API parameters based on your business requirements and call the SubmitVideoTranslationJob operation. Before you submit the request, make sure you have read Parameters and examples for video translation to correctly configure the parameters.
Query the result of a single translation task
Call the GetSmartHandleJob operation to get the status and result of a specific video translation task. This API allows you to retrieve detailed information about a specified task, including its processing progress, completion time, and final output URL.
Query the list of translation tasks
If you need to view all ongoing or completed video translation tasks, you can call the ListSmartJobs operation to list these tasks.
Delete a translation task
Call the DeleteSmartJob operation to delete completed tasks and release system resources.
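
Because translation tasks run asynchronously, a typical integration submits the task and then polls its status until it finishes. The sketch below shows only the polling loop and does not call any SDK. `fetch_state` is a hypothetical callable that you would implement by calling GetSmartHandleJob and returning the task state as a string. The state names `Finished` and `Failed` are assumptions, so check the GetSmartHandleJob reference for the actual values.

```python
import time
from typing import Callable

# Assumed terminal state names; verify against the GetSmartHandleJob reference.
TERMINAL_STATES = {"Finished", "Failed"}


def wait_for_job(
    fetch_state: Callable[[str], str],
    job_id: str,
    interval_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state(job_id) until a terminal state or the timeout.

    fetch_state is a hypothetical wrapper around GetSmartHandleJob that
    returns the current task state. Returns the terminal state reached.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = fetch_state(job_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still in state {state!r}")
        # Wait before the next query to avoid hammering the API.
        sleep(interval_seconds)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. After the function returns, the caller can read the output URL from the same query result and, if the task is no longer needed, delete it with DeleteSmartJob.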

Post-editing for speech translation (optional)

Important

To perform post-editing, you must enable the "Enable post-editing" option before submitting the task. This section describes two methods for post-editing speech translation results.
Note: For lip-sync translation, post-editing is supported for audio only, not for the synchronized lip movements.

Method 1: Use OpenAPI

You can correct the results of speech translation by calling the relevant API operation. For more information, see Speech translation - Manual correction.

Method 2: Use Cloud Editing (Web SDK)

Before you begin

If you are unfamiliar with Cloud Editing, we recommend that you first review the Cloud Editing operation guides.

Procedure

Go to the Video Translation page, and select the task that you want to correct.
In the Actions column, click Edit to open the corresponding Cloud Editing project. For subsequent operations, refer to the video tutorial.

FAQ

Align subtitle timing with the audio waveform

For example, if you want to split the translated subtitle "Great where are you" into two segments, "Great" and "where are you", adjust the start and end times to align the subtitle start and end points with the troughs of the audio waveform. This optimizes the post-editing result for speech translation.
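
If the loudness of the audio is available as data, the nearest trough to a rough split point can be located programmatically. The following is an illustrative sketch only, not a feature of Cloud Editing. It assumes a hypothetical `envelope` list that holds one loudness value per fixed-length frame, and it returns the time of the quietest frame within a small window around the intended split time.

```python
from typing import Sequence


def nearest_trough(
    envelope: Sequence[float],
    frame_seconds: float,
    target_seconds: float,
    window_seconds: float = 0.5,
) -> float:
    """Return the time of the quietest frame near target_seconds.

    envelope[i] is the loudness of the frame that starts at
    i * frame_seconds (a hypothetical, precomputed input).
    """
    if not envelope:
        raise ValueError("envelope is empty")
    center = round(target_seconds / frame_seconds)
    radius = max(1, round(window_seconds / frame_seconds))
    start = max(0, center - radius)
    stop = min(len(envelope), center + radius + 1)
    if start >= stop:
        raise ValueError("target time is outside the envelope")
    # Prefer the lowest loudness; break ties by distance to the target.
    best = min(range(start, stop), key=lambda i: (envelope[i], abs(i - center)))
    return best * frame_seconds
```

The returned time can then be used as the boundary between the two subtitle segments, for example between "Great" and "where are you".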

Control subtitle word count during post-editing

When post-editing translated subtitles, ensure the word count of the edited subtitle does not exceed 1.5 times the word count of the original translation. Otherwise, the speech in the post-edited audio may be too fast.

For example, a line from the initial translation is: Let's talk about this later. We need to go home now.

Incorrect example: Let's discuss this matter in more detail at a later time. Right now, we should focus on heading back home as it's important to ensure we get there safely and in good time. We can revisit this conversation when we are both more relaxed and have ample opportunity to explore all the aspects thoroughly.

Correct example: Let's pick this up another time. We should be going home now.
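
The 1.5x rule can be checked automatically before an edited subtitle is saved. The sketch below is a minimal illustration with a hypothetical function name. It counts words by splitting on whitespace, which is an assumption that suits languages such as English. Languages written without spaces need a different length measure.

```python
def within_length_limit(original: str, edited: str, max_ratio: float = 1.5) -> bool:
    """Return True if the edited subtitle has at most max_ratio times
    as many whitespace-separated words as the original translation."""
    original_words = len(original.split())
    if original_words == 0:
        raise ValueError("original translation is empty")
    return len(edited.split()) <= max_ratio * original_words
```

With the examples above, the original line has 11 words, so an edited line may contain at most 16 words. The correct example has 12 words and passes, while the incorrect example is far longer and fails.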

References

To submit translation tasks by using OpenAPI, see Parameters and examples for video translation.