In Data Studio, you can create an image from a personal development environment to use for data development or in other personal development environments. This topic describes how to create an image from one such instance.
Background
During development and testing in a personal development environment, you may need to use different third-party dependencies. You can install the required dependencies in your environment and then create a custom image from it. This image provides the dependencies to other personal development environments and workspaces.
Images created from a personal development environment support Notebook, Python, and Shell task types, but you cannot modify the task type or other configurations of an image after it is created.
Prerequisites
- VPC: Create a VPC.
- DataWorks: Create a personal development environment instance and attach it to a VPC.
- Alibaba Cloud Container Registry (ACR):
  - Set up ACR by creating an Enterprise Edition instance, a namespace, and an image repository, and by configuring VPC access control.
  - Activate Cloud DNS PrivateZone. For billing information, see Product Billing.
- The VPC attached to the personal development environment instance, the VPC attached to ACR, and the VPC attached to the test resource group for image publishing must be the same VPC.
- If your personal development environment must fetch third-party dependencies from the internet, you must configure internet access for the VPC. For more information, see Use the SNAT feature of an Internet NAT gateway to access the internet.
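The same-VPC requirement can be checked before you start creating an image. The following is a minimal sketch and not a DataWorks API: the function name `find_vpc_mismatches`, the role labels, and the VPC IDs are hypothetical placeholders for values that you would copy from the console.

```python
from collections import Counter


def find_vpc_mismatches(vpc_by_role: dict[str, str]) -> list[str]:
    """Return the roles whose VPC ID differs from the most common one.

    An empty list means that every role is attached to the same VPC.
    """
    if not vpc_by_role:
        return []
    # Treat the most frequently used VPC ID as the expected one.
    expected, _ = Counter(vpc_by_role.values()).most_common(1)[0]
    return [role for role, vpc in vpc_by_role.items() if vpc != expected]


if __name__ == "__main__":
    # Hypothetical IDs for illustration only.
    vpcs = {
        "personal_dev_environment": "vpc-aaa",
        "acr_instance": "vpc-aaa",
        "test_resource_group": "vpc-bbb",
    }
    print(find_vpc_mismatches(vpcs))
```

If the returned list is not empty, attach the listed resources to the shared VPC before you continue.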
Step 1: Access the personal development environment
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
- At the top of the page, click Personal development environment · Select and choose an existing personal development environment instance.
Step 2: Create an image from the instance
- Enhance the personal development environment before you create an image from the instance.
  Important: When you enhance the personal development environment, you can install open-source dependencies or third-party dependencies. For details, see Appendix: Enhance a personal development environment.
- After you configure the personal development environment, click the Personal development environment · Select drop-down list at the top of the page and select Instance Management to open the instance list panel.
- Create a custom image.
  In the instance list, find the target instance and click Operation in the Actions column. Configure the ACR Instance, Namespace, Image Repository, Image Version, and Task Type using the resources from the Prerequisites section.
  Parameters and descriptions:
  - Image Name: Enter a custom name for the DataWorks image.
  - ACR Instance: Select an ACR instance. For information about how to create an ACR instance, see Create an Enterprise Edition instance.
  - Namespace: Select a namespace of the ACR instance. For more information, see Create a namespace.
  - Image Repository: Select an image repository of the ACR instance. For more information, see Create an image repository.
  - Image Version: Enter a custom image version.
  - Synchronize to MaxCompute: The default value is No.
    Note: This option is available only when the selected ACR Instance is Standard Edition or higher.
    - If you select Yes, a DataWorks custom image is generated. When the image is published in DataWorks, it is also built as a MaxCompute image. For more information, see Create a MaxCompute image from a personal development environment.
    - If you select No, only a DataWorks custom image is generated. The image is not built as a MaxCompute image.
  - Task Type: Select the task types that can use the image: Notebook, Python, and Shell.
- After you complete the configuration, click Confirm to start creating the image.
  Important:
  - When you create the image, ensure that the VPC attached to the personal development environment instance and the VPC attached to Alibaba Cloud Container Registry (ACR) are the same.
  - The image creation process may take 1 to 5 minutes, depending on the image size and network conditions.
  - After an image is created, you cannot modify it in Image Management.
- Wait for the image creation to complete.
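Because the settings cannot be changed after the image is created, it can help to write them down and validate them before you click Confirm. The following is a minimal sketch under stated assumptions: the class `ImageSettings` and its field names are hypothetical and only mirror the parameters described above. It is not a DataWorks SDK object.

```python
from dataclasses import dataclass

# Task types that the documentation lists for images created
# from a personal development environment.
ALLOWED_TASK_TYPES = frozenset({"Notebook", "Python", "Shell"})


@dataclass(frozen=True)  # frozen: mirrors "cannot be modified after creation"
class ImageSettings:
    """Hypothetical record of the values entered when creating the image."""

    image_name: str
    acr_instance: str
    namespace: str
    image_repository: str
    image_version: str
    task_types: tuple[str, ...]
    sync_to_maxcompute: bool = False  # the documented default is No

    def __post_init__(self) -> None:
        unknown = set(self.task_types) - ALLOWED_TASK_TYPES
        if unknown:
            raise ValueError(f"unsupported task types: {sorted(unknown)}")


if __name__ == "__main__":
    # All values below are placeholders.
    settings = ImageSettings(
        image_name="my-image",
        acr_instance="my-acr-instance",
        namespace="my-namespace",
        image_repository="my-repository",
        image_version="v1",
        task_types=("Python", "Shell"),
    )
    print(settings)
```

The frozen dataclass raises an error on any later assignment, which makes an accidental change to a recorded setting visible immediately.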
Step 3: Publish the custom image
After the custom image is created, go to the DataWorks console and navigate to the tab. Find the target image, then Test and Publish it. Note the following during testing and publishing:
- When you test the custom image, select a Serverless resource group.
- The VPC attached to the Serverless resource group that you use for testing and publishing must be the same as the VPC configured in Alibaba Cloud Container Registry (ACR).
- Only images that have passed the test can be published.
- If your custom image needs to fetch third-party packages from the internet and the test times out or fails, check whether the VPC attached to the test resource group has internet access. To configure internet access for the VPC, see Use the SNAT feature of an Internet NAT gateway to access the internet.
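When a test times out, a quick way to narrow down the cause is to check outbound connectivity from code that runs in the same VPC. The following is a minimal sketch that uses only the Python standard library. The function name `can_reach` is hypothetical, and the host you pass in (for example, the package index your image depends on) is your own choice.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `can_reach("pypi.org")` returning `False` suggests that the VPC has no route to the internet, or that DNS resolution fails. It checks TCP reachability only and does not verify that a package download would succeed.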
Step 4: Assign the image to a workspace
- On the tab of the DataWorks console, find the Published custom image.
- In the Operation column of the target image, click to bind the custom image to a workspace.
Step 5: Use the custom image
After you assign the image to a workspace, you can use it to develop Notebook, Python, or Shell nodes. The following steps use a Python node as an example:
- In the Project Directory pane on the left side of the data development page, click the icon and choose to create a Python node.
- After you finish developing the node, click Run Configuration on the right. Configure the Resource Group and select the Image that your Python code requires. Also, set Compute CU (for example, 0.5). If no image is available, you can click New Image to create one.
- Click the icon to debug the Python code.
- After the debug run succeeds, click Scheduling Settings. On the Scheduling Strategy tab, configure the Image for the node's periodic schedule.
  Note:
  - The image configured in Scheduling Settings must be the same as the image configured in Run Configuration.
  - For Notebook nodes, you can configure an image only in Scheduling Settings.
- After you configure the scheduling settings, you can Save and Publish the Python node.
Next steps
Persistent image: DataWorks allows you to build a custom image into a persistent image. This eliminates the need to redeploy the image environment for each run. Using the same image environment for every task run ensures consistency and reduces runtime, computing, and traffic costs. For more information, see Build a persistent image.
Appendix: Enhance a personal development environment
The default dependencies in a personal development environment created by DataWorks may be insufficient for your development needs. You can install additional dependencies to enhance your personal development environment.
Install open-source dependencies
You can install required open-source dependencies in your personal development environment instance. The following example shows how to install the jieba dependency.
- Click the icon in the lower-left corner of the data development page to open the terminal.
- In the terminal, run the following command to install the jieba library.
    pip install jieba
  Sample output:
    /mnt/workspace> pip install jieba
    Looking in indexes: https://mirrors.cloud.aliyuncs.com/pypi/simple
    Collecting jieba
      Downloading https://mirrors.cloud.aliyuncs.com/pypi/packages/c6/cb/18eeb235f833b726522d7ebed54f2278ce28ba9438e3135ab0278d9792a2/jieba-0.42.1.tar.gz (19.2 MB)
         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.2/19.2 MB 101.0 MB/s eta 0:00:00
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Building wheels for collected packages: jieba
      Building wheel for jieba (pyproject.toml) ... done
      Created wheel for jieba: filename=jieba-0.42.1-py3-none-any.whl size=19314508 sha256=5d2a212c7aaa36057e04dde6c183c1d23da40d2452831dd8b2d00482d1a963cd
      Stored in directory: /root/.cache/pip/wheels/70/b3/5f/b9d1821b0e0b90640bf536c9f79f039cafbec7de737a9b8c49
    Successfully built jieba
    Installing collected packages: jieba
    Successfully installed jieba-0.42.1
    WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
- After the jieba library is installed, create a .py file in the directory. Edit and save the following code in the Python file.
    import sys

    import jieba

    # Print every command-line argument passed to the script.
    for arg in sys.argv:
        print(f"argv: {arg}")

    # Tokenize the first argument with jieba and print the result.
    seg_list = jieba.cut(sys.argv[1], cut_all=False)
    print("Default Mode: " + "/ ".join(seg_list))
    print('finish')
  After you finish editing, click the icon to save the Python code.
- In the terminal, run the following command to execute the Python file.
    python file_name.py "我是大数据治理开发平台文档"
  Sample output:
    /mnt/workspace> python xxx.py "我是大数据治理开发平台文档"
    argv: xxx.py
    argv: 我是大数据治理开发平台文档
    Building prefix dict from the default dictionary ...
    Loading model from cache /tmp/jieba.cache
    Loading model cost 0.914 seconds.
    Prefix dict has been built successfully.
    Default Mode: 我/ 是/ 大/ 数据/ 治理/ 开发/ 平台/ 文档
    finish
A successful run indicates that the jieba library is installed in the personal development environment.
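If you install several libraries, you can check them all at once before you create the image. The following is a minimal sketch that uses only the standard library. The function name `missing_modules` is hypothetical, and it expects top-level module names such as `jieba`.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "jieba" matches the example above; extend the list with your own dependencies.
    missing = missing_modules(["jieba"])
    print("missing:", missing if missing else "none")
```

The check only locates each module and does not import it, so it does not detect a package that is installed but fails at import time.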
Install third-party dependencies
Clone with Git
To clone a Python project using git clone, you must configure public network access for the VPC. For more information, see Configure an Internet NAT gateway.
- Click the icon in the lower-left corner of the data development page to open the terminal.
- In the terminal, run the following command to go to the workspace folder:
    cd /mnt/workspace
- Clone the Python project from Git to the workspace folder.
    # Replace the URL with your own repository URL.
    git clone https://github.com/example/Example-Python.git
- Install the cloned Python project.
  - Go to the cloned Python project directory.
      cd Example-Python
  - Install the Python project.
      pip install .
    Sample output:
      /mnt/workspace/xxx>> pip install .
      Looking in indexes: https://mirrors.cloud.aliyuncs.com/pypi/simple
      Processing /mnt/workspace/xxx
        Installing build dependencies ... done
        Getting requirements to build wheel ... done
        Preparing metadata (pyproject.toml) ... done
      Requirement already satisfied: requests>=2.24.0 in /usr/local/lib/python3.11/site-packages (from my-python-lib==0.1) (2.32.3)
      Requirement already satisfied: numpy in /usr/local/lib/python3.11/site-packages (from my-python-lib==0.1) (1.26.4)
      Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (3.4.0)
      Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (3.10)
      Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (2.2.3)
      Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (2024.8.30)
Upload a project
- Upload the local Python project to the directory and use the terminal to go to the project folder.
    cd /mnt/workspace/"python_project_folder"
- Run the following command to install the Python project.
    pip install .
  Sample output:
    /mnt/workspace/xxx>> pip install .
    Looking in indexes: https://mirrors.cloud.aliyuncs.com/pypi/simple
    Processing /mnt/workspace/xxx
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Requirement already satisfied: requests>=2.24.0 in /usr/local/lib/python3.11/site-packages (from my-python-lib==0.1) (2.32.3)
    Requirement already satisfied: numpy in /usr/local/lib/python3.11/site-packages (from my-python-lib==0.1) (1.26.4)
    Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (3.4.0)
    Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (3.10)
    Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (2.2.3)
    Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/site-packages (from requests>=2.24.0->my-python-lib==0.1) (2024.8.30)
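The `pip install .` step above only works when the uploaded folder is a Python project, that is, when it contains a `pyproject.toml` or `setup.py` file. The following is a minimal sketch of a pre-check. The function name `looks_pip_installable` is hypothetical, and the path in the usage example is a placeholder.

```python
from pathlib import Path


def looks_pip_installable(project_dir: str) -> bool:
    """Return True if the folder contains pyproject.toml or setup.py."""
    root = Path(project_dir)
    return any((root / name).is_file() for name in ("pyproject.toml", "setup.py"))


if __name__ == "__main__":
    # Placeholder path; replace it with your uploaded project folder.
    print(looks_pip_installable("/mnt/workspace/python_project_folder"))
```

A `False` result usually means that the wrong folder level was uploaded, for example the parent of the actual project directory.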
Upload a program
- Upload the compressed package of the local Python program to the directory. Then, use the terminal to decompress the package and view the Python compile path. The first line of the bin/pip script typically shows the interpreter path that was set at build time.
    cat 'decompressed_python_project_name'/bin/pip
- Create the Python compile path.
    # Create the Python compile path that you found.
    mkdir -p 'python_compile_path_found'
- Move the decompressed folder to the compile path that Python used at build time.
    mv 'decompressed_python_project_name' /'python_compile_path_found'
- Replace the default Python commands with those of your Python program.
    # For each tool, back up the existing entry (if any), then link the new one.
    for src in idle3 pydoc3 python3 python3-config pip3; do \
        dst="$(echo "$src" | tr -d 3)"; \
        [ -e "/usr/local/bin/$dst" ] && mv "/usr/local/bin/$dst" "/usr/local/bin/${dst}_bak"; \
        ln -svT "python_compile_path_found/bin/$src" "/usr/local/bin/$dst"; \
    done
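Before you run the commands above, you can preview what they will do. The following is a minimal sketch and not part of DataWorks: the function names are hypothetical, and it assumes that the first line of `bin/pip` has the form `#!<prefix>/bin/python3.x`. Some pip launchers use a different first line, in which case you need to read the path manually.

```python
from pathlib import PurePosixPath

# Same tool names as in the shell loop above.
TOOLS = ("idle3", "pydoc3", "python3", "python3-config", "pip3")


def prefix_from_pip_script(script_text: str) -> str:
    """Derive the install prefix from the shebang line of a bin/pip script."""
    lines = script_text.splitlines()
    if not lines or not lines[0].startswith("#!"):
        raise ValueError("no shebang line found")
    parts = lines[0][2:].split()
    if not parts:
        raise ValueError("shebang line has no interpreter path")
    # <prefix>/bin/python3.x -> <prefix>
    return str(PurePosixPath(parts[0]).parent.parent)


def plan_links(prefix: str, bin_dir: str = "/usr/local/bin") -> list[tuple[str, str]]:
    """Return (link_path, target_path) pairs without touching the file system."""
    plan = []
    for src in TOOLS:
        dst = src.replace("3", "")  # same effect as `tr -d 3` in the shell loop
        plan.append((f"{bin_dir}/{dst}", f"{prefix}/bin/{src}"))
    return plan


if __name__ == "__main__":
    # The shebang below is an invented example.
    prefix = prefix_from_pip_script("#!/opt/py311/bin/python3.11\n")
    for link, target in plan_links(prefix):
        print(f"{link} -> {target}")
```

The preview only builds strings, so it is safe to run anywhere. The actual backup and linking still happen in the shell loop.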
After installation, verify that the third-party dependencies work as expected by running and debugging your code in the personal development environment.