Create a MaxCompute custom image


DataWorks allows you to create a MaxCompute custom image when creating a DataWorks custom image from a personal development environment. This streamlines development by allowing you to install third-party dependencies once and publish the environment as a reusable image. You can reference this image directly in DataWorks nodes, such as PyODPS 3 nodes and Notebook nodes, without needing to package and upload resources for each job.

Background information

The MaxCompute image management feature allows you to create custom images. You can directly reference existing images in scenarios such as SQL UDF, PyODPS, and MaxFrame development, eliminating the need to package and upload resources. In DataWorks, you can build a DataWorks image and a MaxCompute custom image simultaneously from a personal development environment.

Prerequisites

Create a MaxCompute custom image

Preparations

Limitations

When creating a MaxCompute custom image:

  • Image size: The maximum size for a single MaxCompute image is 10 GB.

  • Image quantity: Each MaxCompute tenant can upload a maximum of 10 images.

When using a MaxCompute image: DataWorks builds MaxCompute images based on a Python 3.11 environment. To run a MaxCompute custom image built in DataWorks, ensure your Python environment is version 3.11.
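
The limits above can be checked with a short script before you build an image. The following is a minimal sketch, not part of DataWorks or MaxCompute: the function name `check_image_limits` and its parameters are hypothetical, and it assumes that 10 GB means 10 × 1024³ bytes, which this topic does not specify.

```python
import sys

# Limits stated in this topic. Reading "10 GB" as 10 * 1024**3 bytes is an assumption.
MAX_IMAGE_BYTES = 10 * 1024**3
MAX_IMAGES_PER_TENANT = 10
REQUIRED_PYTHON = (3, 11)


def check_image_limits(image_bytes, existing_image_count, python_version=None):
    """Return a list of problems; an empty list means no stated limit is violated.

    This helper is hypothetical and only encodes the limits listed in this topic.
    """
    if python_version is None:
        python_version = sys.version_info[:2]
    problems = []
    if image_bytes > MAX_IMAGE_BYTES:
        problems.append("image exceeds the 10 GB size limit")
    if existing_image_count >= MAX_IMAGES_PER_TENANT:
        problems.append("tenant already has the maximum of 10 images")
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        problems.append("Python environment is not version 3.11")
    return problems


if __name__ == "__main__":
    # A 2 GB image, 3 existing images, Python 3.11.
    print(check_image_limits(2 * 1024**3, 3, (3, 11)))
```

The function returns messages instead of raising an exception so that all violated limits are reported in one pass.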

Create a personal development environment instance

Go to Data Studio and create a personal development environment instance using the dataworks-maxcompute:py3.11-ubuntu20.04 image. This is required to create a MaxCompute custom image when you save the environment.

  1. Go to Data Studio.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the left navigation pane of the Data Studio page, click the image icon to go to the Data Studio page.

  2. Go to the page for creating a personal development environment. At the top of the page, click Personal development environment to create an instance as needed.

    • If you do not have a personal development environment instance, click Go to New.

    • If you already have an instance, click Manage instances. In the list of personal development environment instances, click Create Instance.

  3. Configure the personal development environment. To create a MaxCompute custom image in DataWorks, you must configure the following settings. For other parameters, see Create a personal development environment instance.

    • Image configuration: Select dataworks-maxcompute:py3.11-ubuntu20.04.

      Note
      • You can create a MaxCompute custom image only if you select the dataworks-maxcompute:py3.11-ubuntu20.04 image.

      • A DataWorks custom image built from dataworks-maxcompute:py3.11-ubuntu20.04 as the base image can be used for MaxFrame job development in DataWorks Notebook, Python, and Shell nodes.

    • Network settings: Select the VPC that you configured for ACR access. This ensures that the personal development environment instance can push the image to the ACR instance during the build process.

Install dependencies

Follow these steps to install the third-party dependencies required for MaxCompute development in the terminal of your personal development environment instance. This topic uses jieba as an example.

  1. At the top of the Data Studio page, click Personal development environment and select the personal development environment instance that you created in Create a personal development environment instance.

  2. In the toolbar at the bottom of the Data Studio page, click the image icon to open the terminal.

  3. In the terminal of the personal development environment, run the following commands to install the third-party dependency jieba and verify the installation.

    # Install the third-party dependency.
    pip install jieba
    # View the details of the installed dependency.
    pip show jieba
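
The installation can also be verified from Python rather than from the shell. The following is a minimal sketch that uses the standard library module `importlib.metadata`. The helper name `installed_version` is hypothetical, and the script assumes that it runs in the same Python environment in which `pip install` was executed.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # jieba is the example dependency used in this topic.
    version = installed_version("jieba")
    print("jieba version:", version if version else "not installed")
```

Checking from Python confirms that the package is visible to the interpreter that will be saved into the image, which `pip show` alone does not guarantee if several Python environments exist.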

Save the custom image

Follow these steps to save the personal development environment as a DataWorks image and simultaneously create a MaxCompute custom image. DataWorks automatically uploads the generated image to the ACR instance under the same account.

  1. Go to the personal development environment instance management page.

    1. Click Personal Development Environment · Please Select at the top of the page, and then click the name of the personal development environment instance that you created.

    2. In the dialog box that appears, select Management Environment to go to the Personal Development Environment Instances page.

  2. Go to the Create Image page.

    1. On the page of personal development environment instances, find the instance that you created.

    2. In the Operation column of the instance, click Create Image.

  3. Configure the following image parameters, and then click Confirm.

    • Image Name: The name of the custom DataWorks image. If the image is synced to MaxCompute, this name is also used as the MaxCompute custom image name. Example: image_jieba.

    • Image instance: Select an ACR instance of Standard Edition or higher. For more information, see Create an Enterprise instance.

      Note
      Only ACR instances of Standard Edition or higher can be used to build MaxCompute custom images.

    • Namespace: Select a namespace for the ACR instance. For more information, see Create a namespace.

    • Image Repository: Select an image repository for the ACR instance. For more information, see Create an image repository.

    • Image Version: The custom image version.

    • Synchronize to MaxCompute: In this example, select Yes. When this option is enabled, publishing the image creates a MaxCompute image in addition to the DataWorks image.

      Note
      This option is available only if the selected Image instance is Standard Edition or higher. For other instance types, this option is unavailable by default.

    • Task Type: Select the task types that can use the DataWorks image: Notebook, Python, or Shell. In this example, select Notebook.

  4. Confirm the image save status.

    1. On the instance list page, check the image save status in the Image column of the personal development environment instance. If the Image column is not visible, click the image icon on the right side of the instance and select the Image checkbox to display it.

    2. Wait for the image to be created. When the status changes to Save successfully, hover over the image icon next to the status. In the pop-up that appears, click here to go to the Image Management page.
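
The parameter rules in this procedure can be written down as a small validation routine. The following is a minimal sketch, not a DataWorks API: the `ImageSettings` class, the `validate` function, and all field names are hypothetical, and the edition ordering (Basic, Standard, Advanced) is an assumption about ACR Enterprise instances that this topic does not state.

```python
from dataclasses import dataclass

# Assumed ordering of ACR instance editions, lowest to highest (not stated in this topic).
ACR_EDITIONS = ("Basic", "Standard", "Advanced")
# Task types listed in this topic.
TASK_TYPES = {"Notebook", "Python", "Shell"}


@dataclass
class ImageSettings:
    """Hypothetical container for the values entered on the Create Image page."""
    image_name: str
    acr_edition: str
    namespace: str
    repository: str
    version: str
    sync_to_maxcompute: bool
    task_types: tuple


def validate(settings):
    """Return a list of problems; an empty list means the settings follow the stated rules."""
    problems = []
    if settings.acr_edition not in ACR_EDITIONS:
        problems.append("unknown ACR edition: " + settings.acr_edition)
    elif settings.sync_to_maxcompute and (
        ACR_EDITIONS.index(settings.acr_edition) < ACR_EDITIONS.index("Standard")
    ):
        problems.append("Synchronize to MaxCompute requires Standard Edition or higher")
    unknown = set(settings.task_types) - TASK_TYPES
    if unknown:
        problems.append("unsupported task types: " + ", ".join(sorted(unknown)))
    return problems


if __name__ == "__main__":
    example = ImageSettings(
        image_name="image_jieba",   # Example name used in this topic.
        acr_edition="Standard",
        namespace="my_namespace",   # Placeholder value.
        repository="my_repo",       # Placeholder value.
        version="v1",               # Placeholder value.
        sync_to_maxcompute=True,
        task_types=("Notebook",),
    )
    print(validate(example))
```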

Publish the custom image

After saving the image from your personal development environment, follow these steps to publish it. This operation syncs the image from the ACR instance to DataWorks and MaxCompute, generating both a DataWorks custom image and a MaxCompute custom image.

  1. Go to the DataWorks Workspaces page and select the target region in the top navigation bar.

  2. In the left navigation pane, go to the Image Management > Custom Image tab. For your image, click Test. After the test is successful, click Publish.

    Note
    • When you test the custom image, select a Serverless resource group for Test Resource Group.

    • The VPC of the Serverless resource group used for the test must match the VPC of your ACR instance.

    • If your custom image test times out while fetching third-party packages, check whether the VPC attached to the Test Resource Group has internet access. To configure public access for the VPC, see Use the SNAT feature of an Internet NAT gateway to access the internet.

  3. Refresh the page and confirm that the Publish Status of the image in the list changes to Published.

  4. In the Operation column for the target image, click the image icon and choose Change Workspace to bind the custom image to a workspace.

Confirm the MaxCompute image status

After you publish the DataWorks image, a corresponding MaxCompute image is also created. When the image status on the Image Management > Custom Image tab in the DataWorks console changes to Published, go to the MaxCompute console and follow the steps in Add a custom image to MaxCompute to view the created MaxCompute custom image.

Use a MaxCompute custom image

Usage notes

  • To develop with MaxFrame, the image must contain the MaxFrame service. To run a MaxCompute custom image in DataWorks, the image must be built in a Python 3.11 environment.

  • To use a MaxCompute custom image for MaxFrame job development in DataWorks, ensure that the task runs in a DataWorks image that includes a MaxFrame runtime environment. The specific requirements are as follows:

Go to Data Studio

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left navigation pane of the Data Studio page, click the image icon to go to the Data Studio page.

Notebook node

This section uses a Notebook node as an example to show how to use a MaxCompute custom image in MaxFrame. The example uses the jieba package from the MaxCompute custom image.

  1. Create a Notebook node.

    1. At the top of the page, click Personal development environment and select the created personal development environment instance.

    2. To the right of Project Directory, click the image icon and choose New Node > Notebook to open the New Node dialog box.

    3. In the New Node dialog box, enter a Name for the node and click Confirm to open the node editor.

  2. Edit the Notebook node code.

    # -*- coding: utf-8 -*-
    from odps import ODPS
    from maxframe.session import new_session
    import maxframe.dataframe as md  # Make sure that the maxframe.dataframe module is correctly imported.
    from maxframe import config
    # Prepare the dataset.
    test_data = [
        "Grass growing on the old plain"
    ]
    # Define a function to process data by using the jieba package from the MaxCompute custom image.
    # Use the MaxCompute custom image.
    def image_test():
        config.options.sql.settings = {
            "odps.session.image": "image_jieba"  # In this example, the MaxCompute image name is image_jieba. You can view the image name in the MaxCompute console.
        }
        def process(row):
            import jieba
            result = jieba.cut(row, cut_all=False)
            return "/".join(result)
        # Establish a MaxFrame connection.
        odps = %odps
        session = new_session(odps) 
        # Print the logview URL to view execution details.
        logview = session.get_logview_address()
        print("logview:", logview)
        # Create a MaxFrame DataFrame.
        # Encapsulate local test data, such as ["Grass growing on the old plain"], into a MaxFrame DataFrame object.
        df = md.DataFrame(test_data, columns=["raw_text"])
        # Apply the tokenization function to process the data in the DataFrame object.
        df["processed_text"] = df["raw_text"].map(process, dtype='object')
        print("Output:",df.execute().fetch())
    image_test()
    print("Data processing completed!")

  3. On the left side of the node editor, click the image icon. In the dialog box that appears, select a Python 3.11 kernel, run the node, and view the log information.
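
Before you run the node, the joining logic of the `process` function can be checked locally. The following is a minimal sketch under stated assumptions: unlike the node code, `process` here accepts an optional `tokenizer` argument so that it can be tested without jieba, and `load_tokenizer` is a hypothetical helper. If jieba is not installed, the sketch falls back to whitespace splitting, which is only a stand-in and does not reproduce jieba's segmentation.

```python
def load_tokenizer():
    """Return jieba.cut if jieba is installed, otherwise a whitespace-splitting stand-in."""
    try:
        import jieba
    except ImportError:
        # Stand-in only: accepts the same cut_all keyword but ignores it.
        return lambda text, cut_all=False: text.split()
    return jieba.cut


def process(row, tokenizer=None):
    """Tokenize one string and join the tokens with '/', as in the node code."""
    if tokenizer is None:
        tokenizer = load_tokenizer()
    return "/".join(tokenizer(row, cut_all=False))


if __name__ == "__main__":
    # Same sample sentence as the node code in this topic.
    print(process("Grass growing on the old plain"))
```

Passing the tokenizer as an argument keeps the function independent of the image contents, so the same logic can be exercised both locally and inside the MaxCompute custom image.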

PyODPS 3 node

This section uses a PyODPS 3 node as an example to show how to use a MaxCompute custom image in MaxFrame. The example uses the jieba package from the MaxCompute custom image.

  1. Create a PyODPS 3 node.

    1. To the right of Project Directory, click the image icon and choose New Node > MaxCompute > PyODPS 3 to open the New Node dialog box.

    2. In the New Node dialog box, enter a Name for the node and click Confirm to open the node editor.

  2. Edit the PyODPS 3 node code.

    # -*- coding: utf-8 -*-
    from odps import ODPS, options
    from odps.df import DataFrame
    import pandas as pd
    # Set the odps.isolation.session.enable flag for SQL execution.
    options.sql.settings = {"odps.isolation.session.enable": True}
    # Prepare the table data: create a test table.
    table = o.create_table('jieba_work_tb', 'col string', if_not_exists=True)
    # Add sample data.
    instance = o.run_sql("insert into table jieba_work_tb values ('Grass growing on the old plain')")
    instance.wait_for_success()
    # Define a function to process data by using the jieba package from the MaxCompute custom image.
    def image_test():
        def process(row):
            import jieba
            result = jieba.cut(row, cut_all=False)
            return "/".join(result)
        # Encapsulate the table as a DataFrame object.
        df = o.get_table("jieba_work_tb").to_df()
        # Apply the tokenization function to process the data in the DataFrame object.
        df = df.col.map(process).execute(image='image_jieba') # In this example, the MaxCompute image name is image_jieba. You can view the image name in the MaxCompute console.
        print("Output:",df)
    image_test()
    print("Data processing completed!")

  3. Configure the PyODPS 3 node.

    On the right side of the node editing page, click Run Configuration and configure the following parameters.

    • Compute Resource: Select the MaxCompute computing resources that you attached.

    • Resource Group: Select the Serverless resource group that you attached.

    • Image: Select dataworks_pyodps_py311_task_pod:prod_20241210.

  4. In the toolbar at the top of the node editing page, click the image icon to run the node.