DataWorks official images

DataWorks official images provide common runtime environments for different node types in Data Studio to meet the execution environment requirements of various tasks. You can use these official images directly in Data Studio or build custom images based on them. This topic describes the official images.

Overview

In Data Studio, if you do not configure a runtime environment image for a node, the Default standard image is used. The Default standard image provides only a basic runtime environment, which may not meet the requirements of specific tasks. Official images are preconfigured base images that provide standardized environments for different task types in Data Studio. You can also create custom images based on them and add further configuration to support more scenarios.

Image list

Important

The supported image versions and regions are those shown in the DataWorks console. An image may have multiple versions. The following table describes only the capabilities of the latest version of each image.

DataWorks provides the following images:

Image name

Image description

Applicable tasks

dataworks_pyodps_py311_task_pod

The official image for DataWorks PyODPS nodes. This image uses Python 3.11.

PyODPS 3

dataworks_pairec_task_pod

The official DataWorks PAI-Rec image. It is used to run algorithms generated by PAI-Rec. For the versions of the feature_store SDK and pyfg, see the console.

dataworks_pyodps_task_pod

The official image for DataWorks PyODPS nodes. This image uses Python 3.7.

PyODPS 2

PyODPS 3

dataworks_emr_base_task_pod

The base image for EMR clusters. It supports the EMR Serverless Spark, EMR on ECS DataLake, and EMR on ECS Custom cluster types.

  • This image includes only the basic components for submitting EMR tasks from DataWorks. It does not include the execution environment for EMR base components. For semi-managed clusters such as DataLake and Custom, you must build a custom image that installs the components matching the EMR cluster version.

  • When using the CUSTOM and DATALAKE cluster types, you must first initialize the EMR Gateway environment by running the following script with the cluster type and version number as arguments.

    sh /home/admin/init_emr_component.sh DATALAKE EMR-VERSION_NUMBER
    Note

    If the EMR Gateway environment fails to initialize, the usual cause is that the cluster version is not in the image repository. Submit a ticket for assistance.

dataworks_shell_jdk17_task_pod

The official image for DataWorks Shell nodes. This image uses JDK 17.

Shell

dataworks_shell_task_pod

The official image for DataWorks Shell nodes. This image uses JDK 7. If you need to customize the runtime environment and require Subprocess parameter passing, build a custom image based on this image.

dataworks_python_task_pod

The official image for DataWorks Python nodes. System info: py3.11-ubuntu22.04.

Python

dataworks_cdh_custom_task_pod

The base image for DataWorks CDH clusters. It cannot be used directly. You must build a custom image that installs the CDH parcel before using it in Data Studio.

CDH

dataworks_controller_task_pod

The official image for DataWorks assignment nodes. If you need to customize the runtime environment and use the assignment node or assignment parameters to pass parameters to downstream nodes, build a custom image based on this image.

Assignment node

dataworks-mcp

Used for task development with DataWorks Agent for third-party clients. System info: py3.11-ubuntu22.04.

Personal development environment

dataworks-notebook

Used for basic notebook development. System info: py3.11-ubuntu22.04.

dataworks_notebook_task_pod

The official image for DataWorks Notebook nodes. System info: py3.11-ubuntu22.04. Its Python environment is identical to that of the dataworks-notebook and dataworks-mcp images used in the personal development environment.

dataworks-maxcompute

Used to build a MaxCompute image in a personal development environment. System info: py3.11-ubuntu20.04.
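The EMR Gateway initialization described for dataworks_emr_base_task_pod can be wrapped in a small script that checks its arguments and reports a failure clearly. The following is a minimal sketch, not an official utility: the script path comes from the command in the table, while the function names check_emr_args and init_emr_gateway and the INIT_SCRIPT variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Path of the initialization script as given in the table above.
# INIT_SCRIPT is a hypothetical override variable, useful for dry runs.
INIT_SCRIPT="${INIT_SCRIPT:-/home/admin/init_emr_component.sh}"

# Validate the arguments before calling the script.
# Only the DATALAKE and CUSTOM cluster types require initialization.
check_emr_args() {
    case "$1" in
        DATALAKE|CUSTOM) ;;
        *) echo "unsupported cluster type: $1" >&2; return 1 ;;
    esac
    if [ -z "$2" ]; then
        echo "missing EMR version number" >&2
        return 1
    fi
}

# Run the initialization and stop with a clear message if it fails.
init_emr_gateway() {
    check_emr_args "$1" "$2" || return 1
    if ! sh "$INIT_SCRIPT" "$1" "$2"; then
        echo "EMR Gateway initialization failed for $1 $2;" \
             "the cluster version may not be in the image repository" >&2
        return 1
    fi
}
```

A call would look like `init_emr_gateway DATALAKE EMR-VERSION_NUMBER`, with EMR-VERSION_NUMBER replaced by the version of your cluster. Checking the exit status lets a custom image build stop at the failing step instead of continuing with an uninitialized gateway.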

Use images

In Data Studio, you can use official images as well as custom images that are associated with the workspace.

  • Use the image in new DataStudio: In the Run Configuration and Scheduling Settings sections on the right side of the node development page, configure the resource group and image for test runs and post-deployment runs.

  • Use the image in legacy DataStudio: In the dialog that appears after you click Run with Parameters on the node development page, or in the Scheduling Settings panel on the right side of the node development page, configure the resource group and image for the node's test run and post-deployment run.

  • Use an image in the personal development environment: When you create a personal development environment instance, you can select a different official image from the Image Configuration drop-down list.

Note

When configuring the resource group and image, note the following:

  • Resource Group for Scheduling: Select a serverless resource group.

  • Image: Select an official image or a deployed custom image.
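After you select an image, one way to confirm which runtime it provides is to print tool versions from a test run of a Shell node. The following is a minimal sketch under the assumption that python3 and java, where present, are on the PATH of the image. The helper name tool_version is hypothetical.

```shell
#!/bin/sh
# Print the first line of a tool's version output if the tool is on PATH,
# otherwise report that it is missing. tool_version is an illustrative helper.
tool_version() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        # java writes its version to stderr, so merge both streams
        "$tool" "$@" 2>&1 | head -n 1
    else
        echo "$tool: not found"
    fi
}

tool_version python3 --version
tool_version java -version
```

Comparing the output with the image description in the table (for example, Python 3.11 or JDK 17) shows whether the node is running on the image you intended.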