Custom image


When the default DataWorks execution environment does not include the dependencies that your PyODPS or Shell tasks need, for example Python libraries such as pandas and jieba, you can create a custom image. A custom image packages all dependencies into a reusable, standardized execution environment, which keeps task runs consistent and shortens development and deployment.

Limits

  • Edition limits:

    • All editions support creating and using custom images.

    • Only the Professional Edition or higher supports image building.

  • Resource group limits: The custom image feature supports only serverless resource groups.

    For legacy resource groups, use Cloud Assistant to install external dependencies.
  • Permission limits: You need the AliyunDataWorksFullAccess or ModifyResourceGroup permission.

    For authorization details, see RAM policy for service and console permissions.

Quotas and limits

  • Image quantity: The number of custom images you can create depends on your DataWorks edition.

    • Basic Edition and Standard Edition: 10

    • Professional Edition: 50

    • Enterprise Edition: 100

  • Build concurrency: You can build up to two images concurrently in each region.

  • ACR image requirements:

    • Instance edition: Only Enterprise Edition instances of Alibaba Cloud Container Registry (ACR) are supported.

    • Instance architecture: Only the AMD64 architecture is supported.

    • Image size: A single image cannot exceed 5 GB.

    • Time zone configuration: You must install the tzdata time zone package to prevent container failures due to a time zone mismatch with DataWorks.

  • Image building: Persistent builds are available only for custom images created based on DataWorks official images. Custom images that reference an ACR image do not support persistent builds and must be re-pulled and deployed each time a task runs.

  • Supported node types and build methods:

    Node type       | Build from official images | Build from ACR images
    ----------------|----------------------------|----------------------
    PyODPS2         | Supported                  | Unsupported
    PyODPS3         | Supported                  | Unsupported
    EMR Spark       | Supported                  | Unsupported
    EMR Spark SQL   | Supported                  | Unsupported
    EMR SHELL       | Supported                  | Unsupported
    Shell           | Supported                  | Supported
    Python          | Supported                  | Supported
    Notebook        | Unsupported                | Supported
    CDH             | Supported                  | Unsupported
    Assignment Node | Supported                  | Unsupported
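
The support table and the ACR image requirements above can be encoded as a small lookup for pre-flight checks in your own tooling. The following is a minimal sketch, not a DataWorks API: the names `SUPPORT`, `supports`, and `acr_image_problems` are hypothetical, and the 5 GB limit is assumed to mean 5 × 1024³ bytes.

```python
# Hypothetical helper that encodes the support table from this section.
# Tuple order: (build from official image, build from ACR image).
SUPPORT = {
    "PyODPS2": (True, False),
    "PyODPS3": (True, False),
    "EMR Spark": (True, False),
    "EMR Spark SQL": (True, False),
    "EMR SHELL": (True, False),
    "Shell": (True, True),
    "Python": (True, True),
    "Notebook": (False, True),
    "CDH": (True, False),
    "Assignment Node": (True, False),
}

MAX_ACR_IMAGE_BYTES = 5 * 1024**3  # assumption: "5 GB" is read as 5 GiB


def supports(node_type: str, base: str) -> bool:
    """Return True if node_type can use an image built from base ('official' or 'acr')."""
    official, acr = SUPPORT[node_type]
    if base == "official":
        return official
    if base == "acr":
        return acr
    raise ValueError(f"unknown base: {base!r}")


def acr_image_problems(edition: str, arch: str, size_bytes: int, has_tzdata: bool) -> list:
    """List the ACR requirements from this section that an image violates."""
    problems = []
    if edition != "Enterprise":
        problems.append("ACR instance must be Enterprise Edition")
    if arch.lower() != "amd64":
        problems.append("architecture must be AMD64")
    if size_bytes > MAX_ACR_IMAGE_BYTES:
        problems.append("image exceeds 5 GB")
    if not has_tzdata:
        problems.append("tzdata is not installed")
    return problems
```

For example, `supports("Notebook", "official")` is false, which matches the table: Notebook tasks require an image built from an ACR image.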

Procedure

1. Create a custom image

You can create a custom image in DataWorks based on a DataWorks official image or an Alibaba Cloud Container Registry (ACR) image. The configuration parameters vary based on the reference type you select.

From a DataWorks official image

  1. Log on to the DataWorks console. In the left-side navigation pane, click Image Management.

  2. On the DataWorks Official Images tab, select a target image to use as a base and click Create Custom Image in the Operation column. In the dialog box that appears, the system automatically populates information about the target image. The following table describes the remaining parameters.

    Reference type: DataWorks official image is selected by default. Image namespace: DataWorks Default is selected by default. Image repository: DataWorks Default is selected by default.

    Parameter

    Description

    Image Name/ID

    The target official image is selected by default. You can switch to another image as needed.

    Visible Scope

    Sets the visibility of the custom image. The options are Visible Only to Creator and Visible to all.

    Module

    Currently, custom images can be used only in DataStudio.

    Supported Task Type

    Select the task types that this image will support. When a task of a supported type runs in DataStudio, this image can be selected as its runtime image.

    Installation Package

    Add third-party packages as needed. You can use multiple methods and install multiple packages in a single configuration. The following installation methods are supported:

    • Quick Install: From the Installation Package drop-down list, select Python2, Python3, or Yum to directly select the environment or resource that you want to install.

      If the required third-party package is not in the drop-down list, switch to Script mode to manually install it.
    • Manual Input: From the Installation Package drop-down list, select Script. Enter installation commands in the script box. You can use the following example commands to download third-party packages.

      • pip example command: pip install xx. This command is for Python 2.

      • pip3 example command: /home/tops/bin/pip3 install 'urllib3<2.0' . This command is for Python 3.

      • yum example command: yum install -y git.

      • wget example command: wget <URL of the file to download>.

        For more information about installation commands, see Appendix: Installation command reference.
    Important

    If you need to install third-party packages or their dependencies from the internet, the VPC bound to the serverless resource group must have public internet access.

  3. Click Determine to create the image.

From an Alibaba Cloud Container Registry image

To create a custom image based on an ACR image, you must activate Container Registry. You can create DataWorks images only from Enterprise Edition ACR instances that use the AMD64 architecture.

  1. Log on to the DataWorks console. In the left-side navigation pane, click Image Management.

  2. On the Custom Images tab, click Create Image. In the dialog box that appears, configure the following key parameters:

    Parameter

    Description

    Reference Type

    Select Alibaba Cloud Container Registry Image.

    Image Instance ID

    Select an Enterprise Edition instance that you created in Container Registry.

    Image Namespace

    Select a namespace under the image instance.

    Image Repository

    Select an image repository under the image instance.

    Image Version

    Select an image version from the selected image repository.

    VPC to Associate

    Select the VPC that is bound to the image instance. For more information about how to configure a VPC, see Configure access control over a VPC.

    Important

    DataWorks allows you to select only one VPC to access an ACR image instance.

    Synchronize to MaxCompute

    Defaults to No. You can set this to Yes only if both of the following prerequisites are met. Otherwise, the option is disabled.

    • The Instance Specification of the selected image instance is Standard, Advanced, or Enterprise Edition.

    • You have active MaxCompute compute resources.

    After you meet the prerequisites, the effects of different values are as follows:

    • Select Yes: A DataWorks custom image is generated. When this image is published, it is also synchronously built into a MaxCompute image.

      For more information, see Create a MaxCompute image in a personal development environment.
    • Select No: Only a DataWorks custom image is generated. It is not built into a MaxCompute image.

    Visible Scope

    Sets the visibility of the custom image. The options are Visible Only to Creator and Visible to all.

    Module

    Currently, custom images can be used only in DataStudio.

    Supported Task Type

    ACR images are started by using the format Startup command + User task code file path. The different task types and their default startup commands are as follows:

    • Shell

    • Python: To use a custom image created from an ACR image for Python tasks, ensure that your ACR image instance contains a Python environment. Otherwise, Python tasks are not supported.

    • Notebook

      • To use a custom image created from an ACR image for Notebook tasks, you must use the Notebook base image provided by DataWorks as the base for your ACR image. This provides the required runtime environment. The DataWorks-provided Notebook base image is dataworks-public-registry.cn-shanghai.cr.aliyuncs.com/public/dataworks-notebook:py3.11-ubuntu22.04-20241202.

      • Ensure your build environment has public internet access to pull the DataWorks-provided Notebook base image.

  3. Click Determine to create the image.

From a personal development environment instance

In the new version of DataStudio, you can create a new image from a personal development environment. For more information, see Create a DataWorks image from a personal development environment.

2. Test and publish a custom image

On the Image Management > Custom Images tab of the DataWorks console, publish the target image. You can publish only images that have passed the test. If the test fails, click the icon in the Operation column of the target custom image and then click Modify to change its configuration.

Perform the following steps:

  1. On the Image Management > Custom Images tab, click Publish in the Operation column of the target image to open the Publish Image dialog box.

  2. Configure the test parameters and click Test.

    Parameter

    Description

    Test Resource Group

    Select the serverless resource group to be used for the test.

    Test CU

    The compute resources to allocate for the test. Default: 0.5 CU. Minimum: 0.25 CU. If the image is large or the test takes a long time to pass, you can increase the CU value and retry.

  3. View the Test Result and Test Log.

    • After the test starts, the Test Result shows Testing. You can click Refresh to view the latest status, or click Cancel Test to end the current test. After canceling the test, you can select a different resource group or CU allocation and test again.

    • The Test Log section streams the command-line logs of the image build process in real time and provides the following operations: Maximize (view long logs in full screen), Copy (copy the full log to the clipboard with one click), Download (download as image-test-log-<imageID>.log), and Collapse/Expand.

    • If the test fails, the Test Result shows Test Failed, and an AI diagnosis panel automatically appears. The system provides a failure analysis and recommended solutions based on test logs and image layer information. Adjust the image configuration or installation commands based on the diagnosis and click Test Again.

    • After the test is successful, the Test Result shows Test Successful. If you need to change the test conditions and verify again, click Test Again.

  4. After the test passes, click Publish at the bottom of the dialog box. Published images can be used by DataWorks task nodes.

    Note

    The Publish button is enabled only when the Test result is Test Successful or Publish Failed. The button is disabled in other states, such as Testing, Test Failed, and Published.
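
The rule in the note above can be expressed as a simple membership check, which may help if you track image states in your own scripts. This is an illustrative sketch only: `PUBLISHABLE_STATES` and `can_publish` are hypothetical names, and the state strings are the ones listed in the note.

```python
# States in which the Publish button is enabled, as described in the note above.
PUBLISHABLE_STATES = {"Test Successful", "Publish Failed"}


def can_publish(test_result: str) -> bool:
    """Return True when the Publish button would be enabled for this state."""
    return test_result in PUBLISHABLE_STATES


for state in ["Testing", "Test Failed", "Test Successful", "Publish Failed", "Published"]:
    print(f"{state}: {'enabled' if can_publish(state) else 'disabled'}")
```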

Note the following when you test and publish an image:

  • When you test a custom image, select a serverless resource group.

  • If you create an image based on an Alibaba Cloud Container Registry image or create an image from a personal development environment, ensure that the VPC bound to the serverless resource group for testing is the same as the VPC bound to the ACR image instance.

  • If the custom image you configured retrieves third-party packages from the internet and the test does not pass for a long time, check whether the VPC bound to the Test Resource Group has public internet access.

  • If a build failure occurs during image testing or publishing (for example, the publish status shows Published (build failed)), the console may only display Build Failed without providing detailed reasons, which makes self-service troubleshooting difficult. A common scenario is insufficient disk space in the build environment. The disk space required during the build phase may be slightly higher than during the test phase. As a result, the test may pass, but the process fails when the image artifacts are published or generated. Try increasing the Calculate CU (for example, by 0.5 CU) and retrying. If the issue persists, submit a ticket to contact Alibaba Cloud Technical Support for assistance.

3. Bind the image to a workspace

After an image is published, you can assign it to a different workspace to make it available there.

  1. On the Image Management > Custom Images tab of the DataWorks console, find the Published custom image.

  2. In the Operation column of the target image, click the icon and then click Change Workspace to bind the custom image to a workspace.

4. Use an image in a task

Use an image in the new DataStudio

  1. Go to DataStudio: Go to the DataWorks Workspaces page, switch to the target region at the top, find the target workspace, and then click Shortcuts > DataStudio in the Operation column.

  2. Configure the image: In DataStudio, find the task node for which you want to test the custom image, click Scheduling Settings on the right side, and then configure resource properties.

    • Resource Group: Select a serverless resource group.

      If the target resource group is not displayed, check whether it is bound to the current workspace. You can go to the Resource Groups page, find the target resource group, and then click Associate Workspace in the Operation column to complete the binding.
      Important

      To ensure that task nodes run as expected, make sure that the Resource Group is the same as the Test Resource Group that you selected when you published the image.

    • Image: Select a published custom image.

      If you switch images, you must republish the node for the change to take effect.

      In the Scheduling configurations > Scheduling properties panel, configure Resource group and compute CU (for example, 0.5), and select an image from the Image drop-down list.

  3. Debug the node: In the Run Configuration panel on the right side of the node, configure Computing Resources, Resource Group, Calculate CU, Image, and Script Parameters, and then click Run in the top toolbar.

  4. Publish the node: In the top toolbar, click Publish to publish the node to the production environment.

Use an image in legacy DataStudio

  1. Go to DataStudio: Log on to the DataWorks console, switch to the target region, and click Data Development and O&M > DataStudio in the left-side navigation pane. Select the desired workspace from the drop-down list and click Data Analytics.

  2. Configure the image: In DataStudio, find the task node for which you want to test the custom image, click Scheduling Settings on the right side, and then configure resource properties in the Scheduling Settings section.

    • Resource Group for Scheduling: Select a serverless resource group.

      If the target resource group is not displayed, check whether it is bound to the current workspace. You can go to the Resource Groups page, find the target resource group, and then click Associate Workspace in the Operation column to complete the binding.
      Important

      To ensure that task nodes run as expected, make sure that the Resource Group for Scheduling is the same as the Test Resource Group that you selected when you published the image.

    • Image: Select a published custom image.

      If you switch images, you must republish the node for the change to take effect.
  3. Debug the node: In the top toolbar, click the Run with Parameters icon. In the dialog box that appears, configure Resource Group Name, CUs for Node Running, and Image, and then click Run.

  4. Publish the node: In the top toolbar, click Save and Commit to publish the node to the production environment.

5. Build a persistent image

Important

We strongly recommend that you make an image persistent after it is published and verified to work as expected. This practice prevents runtime failures that can occur if a task downloads an unexpected package version during initialization. Such issues can result from tampered source libraries or unspecified version dependencies.

A standard custom image is redeployed each time it runs. This increases the node run time and may incur higher compute costs. The persistent image feature of DataWorks requires only one build for unlimited reuse. This improves task running efficiency, reduces compute and traffic costs, and ensures a consistent environment. You can build persistent images only for custom images created from DataWorks official images.

  1. On the Image Management > Custom Images tab of the DataWorks console, find the published custom image.

  2. In the Operation column of the target image, click the icon and then click Create to build the custom image into a persistent image.

  3. In the Resource Group for Which You Want to Create Image dialog box, configure the following parameters and click Continue.

    • Build resource group: Select the serverless resource group to be used for this build.

    • Build CU: The compute resources to allocate for the build. The default value is 0.5 CU and the minimum value is 0.25 CU. The value must be in increments of 0.25. If the image is large or the build takes a long time, you can increase this value.

    Important

    To avoid build failures due to network issues, ensure the build resource group is the same as the Test Resource Group you used when you published the custom image.

  4. It takes about 5 to 10 minutes to build an image, depending on its size. After a successful build, the image's status changes to Published (Build Succeeded).
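
The Build CU constraints described in step 3 (default 0.5, minimum 0.25, increments of 0.25) can be checked before you submit a build. The following is a minimal sketch under those stated constraints; `validate_build_cu` is a hypothetical helper, not part of DataWorks, and it uses `Decimal` to avoid floating-point rounding when testing the 0.25 step.

```python
from decimal import Decimal

MIN_CU = Decimal("0.25")      # minimum value stated in step 3
DEFAULT_CU = Decimal("0.5")   # default value stated in step 3
CU_STEP = Decimal("0.25")     # values must be multiples of this step


def validate_build_cu(value=DEFAULT_CU) -> Decimal:
    """Return value as a Decimal if it is a valid Build CU, else raise ValueError."""
    cu = Decimal(str(value))
    if cu < MIN_CU:
        raise ValueError(f"Build CU must be at least {MIN_CU}, got {cu}")
    if cu % CU_STEP != 0:
        raise ValueError(f"Build CU must be a multiple of {CU_STEP}, got {cu}")
    return cu
```

With this helper, `validate_build_cu(0.75)` is accepted, while `validate_build_cu(0.3)` raises `ValueError` because 0.3 is not a multiple of 0.25.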

6. Other operations

On the Image Management > Custom Images tab, you can also perform the following routine O&M operations on an image:

Actions

Description

Disable / Enable

In the Operation column, click Disable. After the image is disabled, it is no longer displayed or referenced in modules. Running tasks that use this image are not affected. The option then changes to Enable, which you can click to re-enable the image. The Disable operation is unavailable if the image status is Expired.

Modify

In the Operation column, click Modify to modify properties such as the image description, visible scope, supported modules, node task types, and installation packages. You cannot modify an image that is in the Publishing or Building state.

View Version

In the Operation column, click View Version to view all historical versions of the image. This facilitates tracking and rollbacks.

Delete Image

In the Operation column, click Delete Image.

Warning

Deletion does not affect running tasks, but a deleted image cannot be restored. It will no longer be available for new tasks or visible on the Image Management page.

Tags

In the Tags column of the custom image list, you can add and manage tags to group and search for images by criteria such as business line or environment. This column is not displayed in the DataWorks official images list.

Billing

Image building incurs a compute cost calculated as Number of CUs × Build duration. The system allocates 0.5 CU by default. For more information about billing, see Billing standards for Serverless resource groups.
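
As a worked example of the formula above, the sketch below computes usage for one build. It assumes that build duration is metered in hours, so the result is expressed in CU-hours; the unit price is not stated in this document and is therefore left out. `build_cu_hours` is a hypothetical helper name.

```python
def build_cu_hours(cu: float, build_minutes: float) -> float:
    """Compute usage as Number of CUs x Build duration, expressed in CU-hours."""
    if cu <= 0 or build_minutes < 0:
        raise ValueError("cu must be positive and build_minutes must be non-negative")
    return cu * build_minutes / 60


# A build with the default 0.5 CU that runs for 10 minutes.
usage = build_cu_hours(0.5, 10)
print(f"{usage:.4f} CU-hours")  # prints "0.0833 CU-hours"
```

Multiply the result by the unit price from the billing documentation for serverless resource groups to estimate the cost of a build.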

Production usage

To ensure your custom images are stable, efficient, and cost-effective in the production environment, follow these recommendations.

  • Persistent images: Build your stable configurations into persistent images. This avoids reinstalling dependencies for each task run, which reduces startup time, lowers compute costs, and improves stability.

  • Environment consistency: Ensure that the VPC and network configurations are consistent across the serverless resource groups used for testing, building, and production scheduling, especially when accessing a private ACR repository or the public network.

  • Version pinning: When installing dependencies using the Script method, we strongly recommend specifying exact version numbers (for example, pip install pandas==1.5.3). This practice prevents unexpected behavior caused by upstream library updates.

  • Rollback plan: If a production task fails after an image update, roll back to a previous version using the task deployment history, or set the image to an older, stable version in the schedule settings.
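
To support the version pinning recommendation above, you can lint installation commands before pasting them into the Script box. The following is a rough sketch, not a complete requirement parser: `unpinned_packages` is a hypothetical helper, it only recognizes commands whose first token is a pip executable (for example `pip` or `/home/tops/bin/pip3`), and it treats only `name==version` as pinned.

```python
import re
import shlex

# Matches requirement specifiers that pin an exact version, such as pandas==1.5.3.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=*\s]+$")
# pip options that consume the following token as their value.
OPTIONS_WITH_VALUE = {"-i", "--index-url", "--extra-index-url", "--trusted-host"}


def unpinned_packages(command: str) -> list:
    """Return the package arguments of a 'pip install' command that lack an exact pin."""
    tokens = shlex.split(command)
    # Only handle commands that start with a pip executable and contain "install".
    if not tokens or not tokens[0].rsplit("/", 1)[-1].startswith("pip"):
        return []
    if "install" not in tokens:
        return []
    unpinned = []
    skip_next = False
    for arg in tokens[tokens.index("install") + 1:]:
        if skip_next:
            skip_next = False  # this token is the value of the previous option
            continue
        if arg in OPTIONS_WITH_VALUE:
            skip_next = True
            continue
        if arg.startswith("-"):
            continue  # a flag without a value, such as --upgrade
        if not PINNED.match(arg):
            unpinned.append(arg)
    return unpinned


print(unpinned_packages("pip install pandas==1.5.3"))
print(unpinned_packages("/home/tops/bin/pip3 install 'urllib3<2.0' jieba"))
```

The first call reports no unpinned packages. The second reports both `urllib3<2.0` and `jieba`, because a range specifier still allows the installed version to change between builds.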

Use case

This tutorial demonstrates how to use a custom image with a PyODPS node to perform Chinese word segmentation. Assume you need to segment Chinese text from a column in a MaxCompute table and store the results in another table for downstream tasks. You can pre-install the jieba word segmentation toolkit in a custom image, then use a PyODPS task with this image to process the text. The results are stored in a new table and seamlessly integrated into your downstream scheduling workflow.

  1. Create test data.

    1. Create a DataWorks workspace and associate it with a MaxCompute compute resource. For details, see Create a workspace and Manage compute resources.

    2. In Data Studio, create an ODPS node (in legacy Data Studio) or a MaxCompute SQL node (in the new Data Studio) to create and populate a test table.

      Note

      The following example uses a scheduling parameter. In the Scheduling Settings on the right, set the parameter name to bday and the value to $[yyyymmdd].

      Create a test table.

      -- Create a test table.
      CREATE TABLE IF NOT EXISTS custom_img_test_tb
      (
          c_customer_id BIGINT NOT NULL,
          c_customer_text STRING NOT NULL,
          PRIMARY KEY (c_customer_id)
      )
      COMMENT 'TABLE COMMENT'
      PARTITIONED BY (ds STRING COMMENT 'partition')
      LIFECYCLE 90;
      -- Insert test data into the table.
      INSERT INTO custom_img_test_tb PARTITION (ds='${bday}') (c_customer_id, c_customer_text) VALUES
      (1, 'As evening snow is due, won''t you stay for a drink?'),
      (2, 'Moon sets, crows caw, frost fills the sky; river maples and fishing fires sleeplessly sigh.'),
      (3, 'Where mountains and rivers seemed to end, a village appears amidst the willows and flowers.'),
      (4, 'In spring slumber, unaware of dawn, bird songs fill the air all around.'),
      (5, 'Quiet night thoughts: bright moonlight before my bed, like frost upon the ground.'),
      (6, 'The bright moon rises over the sea; from the farthest shores, we share this moment.'),
      (7, 'The swallows of the noble Wang and Xie clans of old now fly into the homes of common folk.'),
      (8, 'A line of white egrets ascends the blue sky; the window frames the ancient snows of the Western Range.'),
      (9, 'Life is for joy, so live it to the fullest; let not your golden cup be empty under the moon.'),
      (10, 'Heaven gave me talent, so it must be put to use; a thousand pieces of gold, once spent, will come back again.');
    3. Save and deploy the node.

  2. Create a custom image.

    For details, see Create a custom image. Configure the key parameters as follows:

    • Image Name/ID: Select dataworks_pyodps_task_pod, the official DataWorks image for PyODPS nodes.

    • Supported Task Types: Select PyODPS 2 and PyODPS 3.

    • Installation Package: Select Python3 and jieba.

  3. Publish the custom image and associate it with your workspace. For details, see Publish a custom image and Modify the workspace association of an image.

  4. Use the custom image in a scheduled task.

    1. In Data Studio, create a PyODPS 3 node and add the following code:

      Use the custom image.

      import jieba
      from odps.models import TableSchema as Schema, Column

      # `o` (the MaxCompute entry object) and `args` (scheduling parameters)
      # are provided by the PyODPS node, so no explicit ODPS() setup is needed.
      partition_spec = f"ds={args['bday']}"

      # Read the source rows from the input partition.
      table = o.get_table('custom_img_test_tb')
      with table.open_reader(partition=partition_spec) as reader:
          records = [record for record in reader]

      # Segment each text value and join the tokens with " | ".
      participles = [' | '.join(jieba.cut(record['c_customer_text'])) for record in records]

      # Create the destination table if it does not exist.
      if not o.exist_table("participle_tb"):
          schema = Schema(
              columns=[Column(name='word_segment', type='string', comment='Word segmentation result')],
              partitions=[Column(name='ds', type='string', comment='Partition field')],
          )
          o.create_table("participle_tb", schema)

      # Write one record per segmentation result. create_partition=True
      # creates the output partition if it does not exist yet.
      output_table = o.get_table("participle_tb")
      with output_table.open_writer(partition=partition_spec, create_partition=True) as writer:
          for participle in participles:
              record = output_table.new_record()
              record['word_segment'] = participle
              writer.write(record)
    2. In the scheduling settings on the right, configure the following key parameters:

      • Scheduling Parameter: Set the parameter name to bday and the value to $[yyyymmdd].

      • Resource Group for Scheduling: Select the same serverless resource group that you specified as the Test Resource Group when you published the image.

      • Image: Select the published custom image associated with the current workspace.

    3. Debug the node.

      • If you use legacy Data Studio, click the Run with Parameters icon in the node toolbar. Configure Resource Group Name, CUs for Node Running, Image, and Custom Parameters, and then click Run.

      • If you use the new Data Studio, in the Run Configuration panel on the right, configure Computing Resources, Resource Group, Calculate CU, Image, and Script Parameters. Then, click Run in the node toolbar.

    4. (Optional) Create an ad hoc query (in legacy Data Studio) or a SQL file in your personal directory (in the new Data Studio), and then run the following SQL statement to check the output table for data.

      -- Replace <partition_date> with the actual partition date.
      SELECT * FROM participle_tb WHERE ds='<partition_date>';

      If the query returns data, the result includes the word_segment column (segmentation results, with words separated by vertical bars |) and the ds column (partition date).

    5. Deploy the PyODPS node to the production environment.

      Note

      Image changes made in Data Studio are not automatically synchronized to the production environment. You must deploy the task for the changes to take effect. For details, see Deploy tasks (legacy Data Studio) or Deploy nodes/workflows (new Data Studio).

  5. Build a persistent image from the custom image. For details, see Build a persistent image.
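
The PyODPS code in this use case builds its partition spec directly from the `bday` scheduling parameter. If you want the task to fail early when the parameter is missing or malformed, a small validation step can help. The following sketch is an optional addition, not part of the original tutorial; `partition_spec` is a hypothetical helper that you could call with `args['bday']` inside the node.

```python
from datetime import datetime


def partition_spec(bday: str) -> str:
    """Validate a yyyymmdd string and return the matching 'ds=...' partition spec."""
    # strptime alone accepts some shorter inputs, so check the shape first.
    if len(bday) != 8 or not bday.isdigit():
        raise ValueError(f"expected 8 digits (yyyymmdd), got {bday!r}")
    datetime.strptime(bday, "%Y%m%d")  # raises ValueError for impossible dates
    return f"ds={bday}"


print(partition_spec("20240115"))  # prints "ds=20240115"
```

A value such as `20240230` is rejected because February 30 is not a valid date, which surfaces a misconfigured scheduling parameter before any table is read or written.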

FAQ

Q: A Python task reports the error "urllib3 v2.0 only supports OpenSSL 1.1.1+".

A: The urllib3 v2.0 package requires OpenSSL 1.1.1 or later. To resolve this, downgrade urllib3 to a compatible version. For example: /home/tops/bin/pip3 install urllib3==1.26.16.
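
To confirm whether an environment is affected, you can inspect the OpenSSL version that the Python interpreter is linked against by using the standard `ssl` module. This is a minimal diagnostic sketch: `urllib3_v2_supported` is a hypothetical helper, and it does not handle interpreters built against LibreSSL, which report a different version scheme.

```python
import ssl


def urllib3_v2_supported(openssl_version_info=ssl.OPENSSL_VERSION_INFO) -> bool:
    """Return True if the linked OpenSSL is 1.1.1 or later, as urllib3 v2 requires."""
    return tuple(openssl_version_info[:3]) >= (1, 1, 1)


print(ssl.OPENSSL_VERSION)
if urllib3_v2_supported():
    print("urllib3 v2 can be used with this interpreter")
else:
    print("pin urllib3 below 2.0, for example urllib3==1.26.16")
```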


Appendix: Installation commands

To install packages in a custom image using the script method, use the following commands.

  • For PyODPS 2 nodes, run the following commands.

    pip install <package-name> -i https://pypi.tuna.tsinghua.edu.cn/simple
    pip install <package-name>
    Note

    If you are prompted to upgrade pip, run the following command: pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple.

  • For PyODPS 3 nodes, run the following commands.

    /home/tops/bin/pip3 install <package-name> -i https://pypi.tuna.tsinghua.edu.cn/simple
    /home/tops/bin/pip3 install <package-name>
    Note
    • If you are prompted to upgrade pip, run the following command: /home/tops/bin/pip3 install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple.

    • If the error /home/admin/usertools/tools/cmd-0.sh: line 3: /home/tops/bin/python3: No such file or directory occurs, submit a ticket to request the required permissions.

    The following table lists public Python mirror sources.

    Organization

    Mirror URL

    Alibaba Cloud

    https://mirrors.aliyun.com/pypi/simple/

    Important

    You can get Python packages from Alibaba Cloud without enabling internet access.

    Tsinghua University

    https://pypi.tuna.tsinghua.edu.cn/simple

    University of Science and Technology of China (USTC)

    https://pypi.mirrors.ustc.edu.cn/simple/
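
If you generate installation scripts programmatically, the commands in this appendix can be assembled from the pip executable for each node type and a mirror from the table above. The following is a minimal sketch: `MIRRORS`, `PIP`, and `install_command` are hypothetical names, and the short mirror keys are chosen for illustration.

```python
import shlex

# Mirror URLs copied from the table above; the keys are illustrative.
MIRRORS = {
    "aliyun": "https://mirrors.aliyun.com/pypi/simple/",
    "tsinghua": "https://pypi.tuna.tsinghua.edu.cn/simple",
    "ustc": "https://pypi.mirrors.ustc.edu.cn/simple/",
}
# pip executables named in this appendix for each node type.
PIP = {"PyODPS 2": "pip", "PyODPS 3": "/home/tops/bin/pip3"}


def install_command(node_type: str, requirement: str, mirror: str = "") -> str:
    """Build a shell-safe 'pip install' command line for the Script box."""
    cmd = [PIP[node_type], "install", requirement]
    if mirror:
        cmd += ["-i", MIRRORS[mirror]]
    # shlex.quote protects specifiers such as urllib3<2.0 from shell redirection.
    return " ".join(shlex.quote(part) for part in cmd)


print(install_command("PyODPS 3", "urllib3<2.0", "aliyun"))
print(install_command("PyODPS 2", "jieba==0.42.1"))
```

Quoting matters for range specifiers: without quotes, the shell would interpret `<` in `urllib3<2.0` as input redirection.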