DataWorks allows you to create a MaxCompute custom image when creating a DataWorks custom image from a personal development environment. This streamlines development by allowing you to install third-party dependencies once and publish the environment as a reusable image. You can reference this image directly in DataWorks nodes, such as PyODPS 3 nodes and Notebook nodes, without needing to package and upload resources for each job.
Background information
The MaxCompute image management feature allows you to create custom images. You can directly reference existing images in scenarios such as SQL UDF, PyODPS, and MaxFrame development, eliminating the need to package and upload resources. In DataWorks, you can build a DataWorks image and a MaxCompute custom image simultaneously from a personal development environment.
Prerequisites
-
You have created a workspace that uses the new version of Data Studio and attached MaxCompute computing resources.
-
You have created a Serverless resource group and attached it to the workspace.
Create a MaxCompute custom image
Preparations
-
You have activated Alibaba Cloud Container Registry (ACR) and created an ACR instance of Standard Edition or higher. For more information, see Create an Enterprise instance, Create a namespace, and Create an image repository.
-
VPC access control has been configured for the ACR instance. For more information, see Configure VPC access control.
-
You have the necessary permissions to manage ACR and MaxCompute custom images. For more information, see Custom images.
Limitations
When creating a MaxCompute custom image:
-
Image size: The maximum size for a single MaxCompute image is 10 GB.
-
Image quantity: Each MaxCompute tenant can upload a maximum of 10 images.
When using a MaxCompute image: DataWorks builds MaxCompute images based on a Python 3.11 environment. To run a MaxCompute custom image built in DataWorks, ensure your Python environment is version 3.11.
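As a rough pre-check before you save an image, the size and quantity limits above can be expressed in a few lines of Python. This is only a sketch: the helper name `check_image_limits` and its parameters are hypothetical, the caller has to supply the image size and the current image count, and the code does not call any DataWorks, ACR, or MaxCompute API.

```python
MAX_IMAGE_SIZE_GB = 10       # limit stated above: 10 GB per MaxCompute image
MAX_IMAGES_PER_TENANT = 10   # limit stated above: 10 images per MaxCompute tenant


def check_image_limits(image_size_gb, existing_image_count):
    """Return a list of problems; an empty list means neither limit is hit.

    Hypothetical helper: both arguments are values that you look up yourself.
    """
    problems = []
    if image_size_gb > MAX_IMAGE_SIZE_GB:
        problems.append(
            f"image is {image_size_gb} GB, limit is {MAX_IMAGE_SIZE_GB} GB"
        )
    # A tenant that already holds the maximum cannot upload one more image.
    if existing_image_count >= MAX_IMAGES_PER_TENANT:
        problems.append(
            f"tenant already has {existing_image_count} images, "
            f"limit is {MAX_IMAGES_PER_TENANT}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_image_limits(image_size_gb=4.2, existing_image_count=3):
        print("WARNING:", problem)
```

An empty result only means that the two documented limits are not exceeded; the build can still fail for other reasons.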
Create a personal development environment instance
Go to Data Studio and create a personal development environment instance using the dataworks-maxcompute:py3.11-ubuntu20.04 image. This is required to create a MaxCompute custom image when you save the environment.
-
Go to Data Studio.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
-
In the left navigation pane of the Data Studio page, click the
icon to go to the DataStudio page.
-
Go to the page for creating a personal development environment. At the top of the page, click Personal development environment to create an instance as needed.
-
If you do not have a personal development environment instance, click Go to New.
-
If you already have an instance, click Manage instances. In the list of personal development environment instances, click Create Instance.
-
-
Configure the personal development environment. To create a MaxCompute custom image in DataWorks, you must configure the following settings. For other parameters, see Create a personal development environment instance.
-
Image configuration: Select dataworks-maxcompute:py3.11-ubuntu20.04.
Note
-
You can create a MaxCompute custom image only if you select the dataworks-maxcompute:py3.11-ubuntu20.04 image.
-
A DataWorks custom image built from dataworks-maxcompute:py3.11-ubuntu20.04 as the base image can be used for MaxFrame job development in DataWorks Notebook, Python, and Shell nodes.
-
-
Network settings: Select the VPC that you configured for ACR access. This ensures that the personal development environment instance can push the image to the ACR instance during the build process.
-
Install dependencies
Follow these steps to install the third-party dependencies required for MaxCompute development in the terminal of your personal development environment instance. This topic uses jieba as an example.
-
At the top of the Data Studio page, click Personal development environment and select the personal development environment instance that you created in Create a personal development environment instance.
-
In the toolbar at the bottom of the Data Studio page, click the icon to open the terminal.
-
In the terminal of the personal development environment, run the following commands to install the third-party dependency jieba and verify the installation.
## Install the third-party dependency.
pip install jieba
## View the third-party dependency.
pip show jieba
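If you prefer to confirm the installation from Python rather than with pip show, a minimal sketch using only the standard library might look like the following. The helper name `installed_version` is hypothetical; `importlib.metadata` is part of the Python standard library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("jieba")
    if version is None:
        print("jieba is not installed in this environment")
    else:
        print(f"jieba {version} is installed")
```

Run the check with the same Python interpreter that pip installed into; otherwise the result can differ from the output of pip show.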
Save the custom image
Follow these steps to save the personal development environment as a DataWorks image and simultaneously create a MaxCompute custom image. DataWorks automatically uploads the generated image to the ACR instance under the same account.
-
Go to the personal development environment instance management page.
-
Click Personal Development Environment · Please Select at the top of the page, and then click the name of the personal development environment instance that you created.
-
In the dialog box that appears, select Management Environment to go to the Personal Development Environment Instances page.
-
-
Go to the Create Image page.
-
On the page of personal development environment instances, find the instance that you created.
-
In the Operation column of the instance, click Create Image.
-
-
Configure the image parameters as described in the following table, and then click Confirm.
Parameter: Description
-
Image Name: The name of the custom DataWorks image. If the image is synced to MaxCompute, this name is also used as the MaxCompute custom image name. Example: image_jieba.
-
Image instance: Select an ACR instance of Standard Edition or higher. For more information, see Create an Enterprise instance.
Note: Only ACR instances of Standard Edition or higher can be used to build MaxCompute custom images.
-
Namespace: Select a namespace for the ACR instance. For more information, see Create a namespace.
-
Image Repository: Select an image repository for the ACR instance. For more information, see Create an image repository.
-
Image Version: The custom image version.
-
Synchronize to MaxCompute: In this example, select Yes. When this option is enabled, publishing the image creates a MaxCompute image in addition to the DataWorks image.
Note: This option is available only if the selected Image instance is Standard Edition or higher. For other instance types, this option is unavailable by default.
-
Task Type: Select the task types that can use the DataWorks image. Options: Notebook, Python, Shell. In this example, you can select Notebook.
-
-
Confirm the image save status.
On the instance list page, you can check the image save status in the Image column of the personal development environment.
-
Click Confirm to create the image.
-
If the Image column is not visible, click the icon on the right side of the instance and select the Image checkbox to display it.
-
Wait for the image to be created. When the status changes to Save successfully, hover over the icon next to the status. In the pop-up that appears, click here to go to the Image Management page.
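The parameter rules described in this section can be summarized in a small validation sketch. Everything here is illustrative: the `ImageSettings` class, its field names, and the `validate` function are hypothetical and are not part of any DataWorks SDK, and the edition ordering (Basic, Standard, Advanced) is an assumption about ACR instance editions that this topic does not spell out.

```python
from dataclasses import dataclass, field

# Task types listed for the Task Type parameter in this section.
ALLOWED_TASK_TYPES = {"Notebook", "Python", "Shell"}

# ASSUMPTION: ordering of ACR instance editions, lowest to highest.
EDITION_RANK = {"Basic": 0, "Standard": 1, "Advanced": 2}


@dataclass
class ImageSettings:
    """Hypothetical container mirroring the parameters in this section."""
    image_name: str
    acr_edition: str
    namespace: str
    repository: str
    version: str
    sync_to_maxcompute: bool
    task_types: list = field(default_factory=list)


def validate(settings):
    """Return a list of rule violations; an empty list means the settings pass."""
    problems = []
    rank = EDITION_RANK.get(settings.acr_edition)
    if rank is None:
        problems.append(f"unknown ACR edition: {settings.acr_edition}")
    elif settings.sync_to_maxcompute and rank < EDITION_RANK["Standard"]:
        problems.append("Synchronize to MaxCompute needs Standard Edition or higher")
    unknown = sorted(set(settings.task_types) - ALLOWED_TASK_TYPES)
    if unknown:
        problems.append(f"unsupported task types: {', '.join(unknown)}")
    return problems


if __name__ == "__main__":
    example = ImageSettings(
        "image_jieba", "Standard", "my_namespace", "my_repo", "v1", True, ["Notebook"]
    )
    print(validate(example))
```

The console enforces these rules itself; a check like this is only useful if you keep image settings in a script or a review checklist.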
Publish the custom image
After saving the image from your personal development environment, follow these steps to publish it. This operation syncs the image from the ACR instance to DataWorks and MaxCompute, generating both a DataWorks custom image and a MaxCompute custom image.
-
Go to the DataWorks Workspaces page and select the target region in the top navigation bar.
-
In the left navigation pane, go to the tab. For your image, click Test. After the test is successful, click Publish.
Note
-
When you test the custom image, select a Serverless resource group for Test Resource Group.
-
The VPC of the Serverless resource group used for the test must match the VPC of your ACR instance.
-
If your custom image test times out while fetching third-party packages, check whether the VPC attached to the Test Resource Group has internet access. To configure public access for the VPC, see Use the SNAT feature of an Internet NAT gateway to access the internet.
-
-
Refresh the page and confirm that the Publish Status of the image in the list changes to Published.
-
In the Operation column for the target image, click to bind the custom image to a workspace.
Confirm the MaxCompute image status
After you publish the DataWorks image, a corresponding MaxCompute image is also created. When the image status on the tab in the DataWorks console changes to Published, go to the MaxCompute console and follow the steps in Add a custom image to MaxCompute to view the created MaxCompute custom image.
Use a MaxCompute custom image
Usage notes
-
To develop with MaxFrame, the image must contain the MaxFrame service. To run a MaxCompute custom image in DataWorks, the image must be built in a Python 3.11 environment.
-
To use a MaxCompute custom image for MaxFrame job development in DataWorks, ensure that the task runs in a DataWorks image that includes a MaxFrame runtime environment. The specific requirements are as follows:
-
Notebook node: Select the official image dataworks-notebook:py3.11-ubuntu22.04 or a DataWorks custom image built from this official image or the dataworks-maxcompute:py3.11-ubuntu20.04 image.
-
PyODPS 3 node: Select the official image dataworks_pyodps_py311_task_pod or a DataWorks custom image built from it.
-
Python node: Create a personal development environment instance with the MaxFrame service based on the dataworks-maxcompute:py3.11-ubuntu20.04 image, and then save it as a DataWorks custom image that supports Python tasks.
-
Other nodes: Ensure that the DataWorks custom image contains a MaxFrame runtime environment and is built in a Python 3.11 environment.
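To check from inside a node whether the runtime meets the requirements above, a short sketch such as the following may help. It relies only on the standard library. The helper names are hypothetical, and the import name `maxframe` for the MaxFrame client package is an assumption; adjust it if your environment exposes MaxFrame under a different module name.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 11)  # version required by the usage notes above


def python_matches(version_info=None, required=REQUIRED_PYTHON):
    """Return True if the major.minor version equals the required one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


def module_available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("Python 3.11 runtime:", python_matches())
    # ASSUMPTION: the MaxFrame package is importable as "maxframe".
    print("MaxFrame importable:", module_available("maxframe"))
```

If either check prints False, switch the node to one of the images listed above before running MaxFrame jobs.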
-
Go to Data Studio
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
-
In the left navigation pane of the Data Studio page, click the
icon to go to the DataStudio page.
> Change Workspace
icon and choose
icon. In the dialog box that appears, select a