Create a DSW instance


Data Science Workshop (DSW) provides a cloud-based integrated development environment (IDE) for AI development. Developers who are familiar with Notebook or VS Code can quickly begin developing models. This topic describes how to create a DSW instance and provides solutions for common issues that may occur during instance startup and deletion.

Create a basic DSW instance quickly

  1. Log on to the PAI console and select the target Region. In the navigation pane on the left, click Workspaces and select the target workspace.

  2. In the navigation pane on the left, click Interactive Modeling (DSW) > Create Instance. Configure the following key parameters and use the default settings for the remaining parameters. For more information about the console parameters, see List of all console parameters.

    Parameter

    Description

    Instance Name

    Example: dsw_test.

    Resource Type

    Select Public Resources (pay-as-you-go billing).

    Instance Type

    Example: ecs.gn7i-c8g1.2xlarge (1 × A10 GPU, 8 vCPUs, 30 GiB memory).

    If this instance type is out of stock, try to select another instance type from the list or switch to another region.
    Warning
    • If you use a DSW free trial resource plan, make sure that the selected instance type is within the deductible range (ecs.g6.xlarge, ecs.gn7i-c8g1.2xlarge, or ecs.gn6v-c8g1.2xlarge). If you select an unsupported instance type, the fees cannot be deducted. For more information, see Claim, use, and release free trial resources.

    • After the free quota is used up or expires, if the DSW instance is still in the Running state, the system automatically switches to the pay-as-you-go billing method and deducts fees from your account balance. Make sure to release the instance in a timely manner.

    Image Configuration

    Select Alibaba Cloud Image. Search for and select modelscope:1.31.0-pytorch2.8.0-gpu-py311-cu124-ubuntu22.04 (Python 3.11, CUDA 12.4).

    ModelScope images are recommended because they offer good compatibility and a comprehensive set of third-party libraries.

    Click OK to create the instance. The instance is successfully created when the instance status is Running.

    If the instance fails to start, see DSW instance startup.
  3. On the DSW instance list page, click Open in the Actions column to open the DSW instance and start model development.

    For more information about the DSW instance UI and how to stop, delete, or change the configuration of a DSW instance, see Access and manage DSW instances in the console.

    Important
    • You are billed by the hour for DSW instances that use public resources once the instance status changes to Running, even if you do not open the WebIDE or run code.

    • Closing the browser or logging out does not stop the instance or stop billing.

    • If you use a free trial resource plan, the system automatically switches to the pay-as-you-go billing method after the quota of the resource plan is used up. The instance is not automatically stopped.

  4. Stop the instance. When you finish your development task and no longer need the DSW instance, return to the DSW instance list page.

    • To pause the instance, click Stop to the right of the instance. Note: If you expanded the system disk, you continue to be billed for the expanded capacity after the instance is stopped.

      Important

      By default, data is stored on a free cloud disk. If the instance is stopped for more than 15 days, the content on the cloud disk is deleted and cannot be recovered. This restriction does not apply if the system disk has been expanded.

    • If you no longer need the instance, click More > Delete on the right side of the instance to stop all billing. Before you delete the instance, make sure to back up important data. Data cannot be recovered after the instance is deleted.

Configurations for typical application scenarios

A basic DSW instance may not meet all your AI development needs. The following table summarizes configurations for typical application scenarios.

Scenario: Persistently storing code and data

  • Need/Pain point: The system disk of a DSW instance provides only temporary storage. Data is cleared if the instance is deleted or stopped for an extended period. You may need to save important files for long-term use or share data among multiple instances.

  • Configuration points: Mount cloud storage, such as Object Storage Service (OSS), to a specified folder on the instance by using Dataset Mounting or Mount storage.

  • References: Mount a dataset, OSS bucket, NAS file system, or CPFS file system

Scenario: Increasing the Internet download speed

  • Need/Pain point: By default, DSW instances use a shared gateway with limited bandwidth. The network speed may be insufficient for downloading large files.

  • Configuration points: In the Network Information section, configure the VPC to use a Private Gateway. You must also create a NAT Gateway and an EIP for the VPC.

  • References: Use a dedicated gateway to increase the public network access speed

Scenario: Developing remotely over SSH

  • Need/Pain point: You are accustomed to using local tools, such as VS Code and PyCharm, for development and debugging and do not want to be limited to the web IDE.

  • Configuration points: In the Access Configuration section, select Enable SSH, enter your SSH Public Key, select Access over Internet, and associate an existing NAT Gateway and EIP.

  • References: Remote connection: Direct SSH

Scenario: Accessing web services within the instance

  • Need/Pain point: You want to publish a web application that runs in the instance to the Internet so that it can be accessed or shared through a URL.

  • Configuration points: In the Access Configuration section, add and configure a custom service by specifying a service port and enabling public network access. You must also add a corresponding inbound rule to the security group to allow traffic on that port.

  • References: Access services in an instance over the Internet

List of all console parameters

Basic information

Parameter

Description

Instance Name

Configure the instance name based on the on-screen instructions.

Tag

Add tags to the instance as needed to facilitate multi-dimensional search, positioning, batch operations, and cost allocation for resources.

Resource information

Parameter

Description

Resource Type

  • Public Resources: Pay-as-you-go billing. Cannot be converted to subscription.

    Note

    GPU card limit: When you use public resources, each Alibaba Cloud account is limited to two GPUs in each region. An error may be reported if the resource usage exceeds the quota. To increase the quota, submit a ticket.

    • Instance Type: Choose GPU, CPU, or free trial resources. See Instance families for specifications.

      Warning
      • If you use a DSW free trial resource plan, make sure that the selected instance type is within the deductible range (ecs.g6.xlarge, ecs.gn7i-c8g1.2xlarge, or ecs.gn6v-c8g1.2xlarge). If you select an unsupported instance type, the fees cannot be deducted. For more information, see Claim, use, and release free trial resources.

      • After the free quota is used up or expires, if the DSW instance is still in the Running state, the system automatically switches to the pay-as-you-go billing method and deducts fees from your account balance. Make sure to release the instance in a timely manner.

    • Bidding Purchase: You can use spot instances to reduce running costs. If the message No preemptible instances in stock appears, try a different instance type.

      This parameter is supported only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), China (Guangzhou), Japan (Tokyo), and Singapore regions.

    • Driver Settings: You can set the driver version for GPU-accelerated instances that use public resources. The drop-down list displays the major driver versions supported by the GPU card model.

  • Resource Quota: The billing method is subscription.

    • Resource Quota: Select general computing resources or Lingjun resources. If no resources are available, click Associate Resource Quota to configure.

    • Instance Type: Configure GPU, CPU, and memory as needed.

    • Priority: The priority level ranges from 1 to 9. A larger value indicates a higher priority.

    • CPU Affinity: Allows you to bind processes in a container or pod to execute on specific CPU cores. This reduces CPU cache misses and context switches, which improves CPU utilization and application performance. This feature is suitable for performance-sensitive workloads and those with high real-time requirements. Currently, this parameter can be configured only in the China (Beijing) and China (Shenzhen) regions.

    • Driver Settings: Sets the driver version for instances within a Lingjun resources quota. The drop-down list displays the major driver versions supported by the GPU type.

Environment information

Parameter

Description

Image Configuration

The following image types are supported:

  • Alibaba Cloud Image: PAI provides official images for popular open-source frameworks and Python versions. For example, pytorch:2.4.1-gpu-py312-cu124-ubuntu22.04 is an image for GPU-accelerated instances that contains PyTorch 2.4.1, Python 3.12, and CUDA 12.4.

    If you need a dependency with a specific version, you can search for keywords in the search box. For example, search for cu124 to find runtime images with CUDA version 12.4.

  • Custom Image: You can use custom images that are added to PAI. The image repository must be set to public, or the image must be stored in Container Registry (ACR). For more information, see custom images.

  • Image Address: Configure the URL of a custom or official image accessible on the public network.

    • To accelerate image pulling, see Image acceleration.

    • If the image address is private, click Enter Username and Password and configure the username and password of the image repository, or try to temporarily set the image repository to allow public pulls.
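The naming pattern of Alibaba Cloud images described above (for example, py312 for Python 3.12 and cu124 for CUDA 12.4) can be read programmatically. The following is a minimal sketch, not an official tool: the function name parse_image_tag is hypothetical, and the sketch assumes that other image names follow the same pyXYZ and cuXYZ pattern as the two examples in this topic.

```python
import re


def parse_image_tag(image: str) -> dict:
    """Extract the device type, Python version, and CUDA version from an image name.

    Assumes the 'pyXYZ' / 'cuXYZ' naming seen in the examples in this topic.
    """
    info = {"device": None, "python": None, "cuda": None}

    if "-gpu-" in image:
        info["device"] = "gpu"
    elif "-cpu-" in image:
        info["device"] = "cpu"

    # '-py312' -> major '3', minor '12'. '-pytorch2.8.0' does not match
    # because a digit must follow '-py' directly.
    py = re.search(r"-py(\d)(\d+)", image)
    if py:
        info["python"] = f"{py.group(1)}.{py.group(2)}"

    # '-cu124' -> '12.4': the last digit is treated as the minor version.
    cu = re.search(r"-cu(\d+)(\d)(?=-|$)", image)
    if cu:
        info["cuda"] = f"{cu.group(1)}.{cu.group(2)}"

    return info


if __name__ == "__main__":
    print(parse_image_tag("pytorch:2.4.1-gpu-py312-cu124-ubuntu22.04"))
```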

System Disk

Used to store files during the development process. When Resource Type is set to Public Resources, or when Resource Quota is set to subscription general computing resources (CPU cores ≥ 2 and memory ≥ 4 GB, or with a GPU), each instance is provided with a free 100 GiB cloud disk as a system disk. You can expand the cloud disk at the price displayed in the console.

Warning
  • If you use only the free cloud disk, the content on the cloud disk is cleared if the instance is stopped for more than 15 days.

  • After expansion, the entire cloud disk (free + paid) is no longer subject to the 15-day release restriction. However, the expanded part continues to incur fees. Billing stops only after the instance is deleted.

  • You cannot scale in a disk after it is expanded. Expand the disk as needed.

  • When an instance is deleted, the cloud disk is released at the same time. Before you delete the instance, make sure that you have backed up necessary data.

If you need permanent storage, you can configure Dataset Mounting or Mount storage.
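Because a full system disk can prevent an instance from starting, it can help to check disk usage from a Notebook or terminal before the disk fills up. The following is a minimal sketch that uses only the Python standard library. It assumes that the path "/" resides on the system disk, which may not hold for every image or mount layout. The function name disk_usage_report is hypothetical.

```python
import shutil


def disk_usage_report(path: str = "/") -> dict:
    """Return total, used, and free space (GiB) for the file system that holds `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": round(usage.total / gib, 1),
        "used_gib": round(usage.used / gib, 1),
        "free_gib": round(usage.free / gib, 1),
        "used_percent": round(100 * usage.used / usage.total, 1),
    }


if __name__ == "__main__":
    # Assumption: "/" is backed by the system disk of the instance.
    print(disk_usage_report("/"))
```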

Dataset Mounting

Can be used to store datasets that need to be read, or to persistently store files during development. The following two dataset types are supported:

  • Custom Dataset: You can create a custom dataset to store training data files. You can set the dataset to be read-only and select a specific version.

  • Public Dataset: PAI provides pre-built public datasets, which support only read-only mounting mode.

Mount Path: Path where the dataset is mounted in the DSW instance, for example, /mnt/data. Access the dataset from code using this path.

Note
  • The mount paths of multiple datasets cannot be the same.

  • If you configure a CPFS dataset, you must configure the network and make sure that the selected VPC is the same as the VPC of the CPFS file system. Otherwise, the DSW instance may fail to be created.

  • When the resource group is a dedicated resource group, the first dataset must be a NAS type dataset, and it will be mounted to both your specified path and the DSW default working directory /mnt/workspace/.

For more information about mounting, see Mount a dataset, OSS bucket, NAS file system, or CPFS file system.
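As a rough illustration of accessing a mounted dataset from code through its mount path, the sketch below lists a few files under the path. The path /mnt/data is the example value from this topic, and the function name list_dataset_files is hypothetical. The listing is capped so that large datasets are not fully scanned.

```python
from itertools import islice
from pathlib import Path


def list_dataset_files(mount_path: str = "/mnt/data", limit: int = 10) -> list[str]:
    """Return up to `limit` file paths (relative to the mount path)."""
    root = Path(mount_path)
    if not root.is_dir():
        # The dataset is not mounted at this path, or the path is wrong.
        raise FileNotFoundError(f"Mount path not found: {mount_path}")
    files = (str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())
    return list(islice(files, limit))


if __name__ == "__main__":
    try:
        print(list_dataset_files("/mnt/data"))
    except FileNotFoundError as err:
        print(err)
```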

Mount storage

You can also use storage mounting to store datasets that need to be read or to persistently store files during development.

For more information about mounting, see Mount a dataset, OSS bucket, NAS file system, or CPFS file system.

Working Directory

The working directory is the startup path for Notebook and WebIDE. The default is /mnt/workspace.

Show More

Parameter

Description

Custom Startup Script

Used to customize the environment or perform initialization tasks during instance startup. The custom script is executed after the image and resources are ready, but before development applications such as JupyterLab and Code Server start.

Note
  • Timeout is 3 minutes: The custom script increases the instance startup time. The script has a timeout of 3 minutes. Do not perform long-running tasks such as image downloads in the custom script.

  • View script run logs: After the instance starts, you can find the logs generated by the custom script in the /var/log/user-command/ directory.
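Given the 3-minute timeout, one way to keep initialization predictable is to run tasks against a time budget and skip whatever does not fit. The following is a minimal sketch under the assumption that the custom startup script invokes a Python file. The names run_with_budget and SAFETY_MARGIN are hypothetical and are not part of DSW. The budget is checked only between tasks, so a single long-running task is not interrupted.

```python
import time

TIMEOUT_SECONDS = 180  # The 3-minute limit stated in this topic.
SAFETY_MARGIN = 30     # Hypothetical margin to leave before the limit.


def run_with_budget(tasks, budget=TIMEOUT_SECONDS - SAFETY_MARGIN, clock=time.monotonic):
    """Run (name, callable) pairs in order and skip tasks once the budget is spent."""
    start = clock()
    done, skipped = [], []
    for name, task in tasks:
        if clock() - start >= budget:
            skipped.append(name)
            continue
        task()
        done.append(name)
    return done, skipped


if __name__ == "__main__":
    # Placeholder tasks; replace them with short initialization steps.
    done, skipped = run_with_budget([
        ("create_dirs", lambda: None),
        ("write_config", lambda: None),
    ])
    print("done:", done, "skipped:", skipped)
```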

Environment Variable

Used for main container startup, system processes, and user processes. You can add custom environment variables or overwrite default system environment variables as needed.

Note: We recommend that you do not modify the following environment variables:

# Modifications do not take effect.
USER_NAME: Overwritten by the service logic.

# System variables that are not recommended for modification. Modification may affect normal use.
JUPYTER_NAME: Constructed from instance information by default. Can be used to modify the JupyterLab URL access path.
JUPYTER_COMMAND: The Jupyter startup command. Default is set to lab to start JupyterLab.
JUPYTER_SERVER_ADDR: The JupyterLab service listening address. Default is 0.0.0.0.
JUPYTER_SERVER_PORT: The JupyterLab service listening port. Default is 8088.
JUPYTER_SERVER_AUTH: The JupyterLab access password. Default is empty.
JUPYTER_SERVER_ROOT: The Jupyter working directory. Priority is lower than WORKSPACE_DIR.
CODE_SERVER_ADDR: The code-server service listening address. Default is 0.0.0.0.
CODE_SERVER_PORT: The code-server service listening port. Default is 8082.
CODE_SERVER_AUTH: The code-server access password. Default is empty.
WORKSPACE_DIR: The system sets this environment variable based on the working directory parameter set when the instance is created. It can change the startup directory of Jupyter and code-server. An error may occur if the path does not exist.
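To inspect these variables from inside an instance without modifying them, a read-only lookup is enough. The following is a minimal sketch. The fallback values are the defaults documented in this topic, and whether each variable is actually set in a given image is an assumption. The function name effective_settings is hypothetical.

```python
import os

# Defaults as documented in this topic (used only when a variable is unset).
DOCUMENTED_DEFAULTS = {
    "JUPYTER_COMMAND": "lab",
    "JUPYTER_SERVER_ADDR": "0.0.0.0",
    "JUPYTER_SERVER_PORT": "8088",
    "CODE_SERVER_ADDR": "0.0.0.0",
    "CODE_SERVER_PORT": "8082",
    "WORKSPACE_DIR": "/mnt/workspace",
}


def effective_settings(environ=os.environ) -> dict:
    """Return each variable's current value, or the documented default if unset."""
    return {name: environ.get(name, default) for name, default in DOCUMENTED_DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in effective_settings().items():
        print(f"{name}={value}")
```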

Advanced Configurations

Allows users to adjust some secure kernel parameters required by services through advanced configuration. This is currently supported only for Lingjun resource group instances. For parameter details, see the table below.

Each advanced configuration parameter is listed below with its description and an example value.

VmMaxMapCount

  • Description: Sets the maximum number of memory mapping areas that a process can have. The default value is 65530. Values less than 65530 do not take effect. An excessively high value may waste memory resources.

  • Example value: 1024000

EnableNvidiaIBGDA

  • Description: Enables the IBGDA feature when the GPU driver is loaded. This must be enabled to use the DeepEP feature.

  • Example value: true

EnableNvidiaGDRCopy

  • Description: Installs the GDRCopy kernel module. The current version is 2.4.4. This must be enabled to use the DeepEP feature.

  • Example value: true

VpmuFeature

  • Description: After this is enabled, you can use asys to troubleshoot CPU hotspots. Valid values: 0, 1, and 2. Default value: 0. 0 indicates that this feature is disabled. 1 indicates that this feature is partially enabled. 2 indicates that this feature is fully enabled. Enabling vpmu may cause performance loss. We recommend that you enable it as needed.

  • Example value: 2

EnableVcpuTier

  • Description: Used to enable the RunD core binding optimization. Default value: false. This option allows RunD vCPUs to be promptly scheduled by the kernel to a runnable CPU. However, this may cause scheduling contention and interference if two vCPUs are scheduled on the same hyper-threading pair. The following conditions must also be met:

    1. The instance must be a full-card instance (no other instances or tasks on the node).

    2. Number of CPUs >= 4 × Number of GPUs.

    3. The number of CPUs must be an even number.

  • Example value: true

VcpuTierHighCPU

  • Description: Used to manually specify HighCPU after EnableVcpuTier is enabled. If not specified, the default value (the number of GPUs for a single worker, such as 8 or 16) is used. The value must be in the range of [Number of GPUs, Number of CPUs/2).

  • Example value: 4
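The numeric conditions for EnableVcpuTier and VcpuTierHighCPU can be checked before you submit the configuration. The following is a minimal sketch with a hypothetical function name, check_vcpu_tier. It covers only the arithmetic rules stated above and cannot verify the full-card condition.

```python
def check_vcpu_tier(num_cpus: int, num_gpus: int, high_cpu=None) -> list:
    """Return a list of violated conditions (an empty list means all checks pass)."""
    problems = []
    if num_cpus < 4 * num_gpus:
        problems.append("Number of CPUs must be >= 4 x number of GPUs.")
    if num_cpus % 2 != 0:
        problems.append("Number of CPUs must be even.")

    # If VcpuTierHighCPU is not specified, the number of GPUs is used.
    value = num_gpus if high_cpu is None else high_cpu
    # Half-open range: [number of GPUs, number of CPUs / 2)
    if not (num_gpus <= value < num_cpus / 2):
        problems.append("VcpuTierHighCPU must be in [GPUs, CPUs/2).")
    return problems


if __name__ == "__main__":
    print(check_vcpu_tier(num_cpus=64, num_gpus=8))               # passes all checks
    print(check_vcpu_tier(num_cpus=64, num_gpus=8, high_cpu=32))  # upper bound is exclusive
```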

Network information

Parameter

Description

VPC Settings

This parameter is available only when Resource Type is set to Public Resources.

To use a DSW instance in a Virtual Private Cloud (VPC), create a VPC in the same region as the DSW instance and configure this parameter. You must also configure a vSwitch and a Security Group. For details about configuration policies for different scenarios, see Network configuration.

vSwitch

This parameter can be configured when a VPC is configured. A vSwitch is a subnet within a VPC. Your DSW instance and other cloud resources are connected to the vSwitch.

Security Group

This parameter is required when a VPC is configured. A security group is a virtual firewall for a DSW instance. It controls all inbound and outbound network traffic.

Internet Access Gateway

The following configuration methods are supported:

  • Public Gateway: Network bandwidth is limited. During periods of high user concurrency or when downloading large files, the network speed might be insufficient.

  • Private Gateway: To solve the bandwidth limitations of the public gateway, you can create an Internet NAT gateway, attach an EIP, and configure SNAT entries in the DSW instance's VPC. For more information, see Improve public network access speed with a dedicated gateway.

The following parameters are available only when you select a CPFS dataset for Mount Configuration:

  • Enable All Options: This option is disabled by default. The system disables VPCs that are not connected to the CPFS dataset.

Note

If a CPFS dataset is mounted, you must configure a VPC, and the selected VPC must be the same as the one used by CPFS.

Extended CIDR Block

You can configure this parameter after you configure the vSwitch. When the number of available IP addresses in a virtual private cloud (VPC) is insufficient to meet your expanding business needs, or when poor initial network planning results in an address shortage, you can use a secondary CIDR block to expand the VPC address space. For more information, see Use a secondary CIDR block to expand the address space of a CIDR block.

Access configuration

Parameter

Description

Enable SSH

Enables remote connections to the instance. This parameter is configurable only after you select a virtual private cloud (VPC). When this option is enabled, a custom service named SSH is created. If you use a custom image, ensure that sshd is installed.

SSH Public Key

You can configure this parameter after you turn on Enable SSH.

Note

To log on over both the VPC and the public network, or from multiple clients, add the public key of each client. Enter each public key on a new line. You can add up to 10 public keys.
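Before pasting several keys into the SSH Public Key field, you can check the one-key-per-line layout and the 10-key limit locally. The following is a minimal sketch with a hypothetical function name, split_public_keys. It assumes keys in the usual OpenSSH public key layout (key type, then Base64 data, then an optional comment) and performs only a shallow format check.

```python
MAX_KEYS = 10  # Limit stated in this topic.


def split_public_keys(text: str) -> list:
    """Split pasted text into one key per line and apply basic sanity checks."""
    keys = [line.strip() for line in text.splitlines() if line.strip()]
    if len(keys) > MAX_KEYS:
        raise ValueError(f"Too many public keys: {len(keys)} (maximum is {MAX_KEYS}).")
    for key in keys:
        # Shallow check only: expect at least "<key-type> <base64-data>".
        if len(key.split()) < 2:
            raise ValueError(f"Line does not look like a public key: {key[:30]!r}")
    return keys


if __name__ == "__main__":
    sample = "ssh-ed25519 AAAAexample user@laptop\nssh-rsa AAAAexample user@desktop\n"
    print(len(split_public_keys(sample)), "keys")
```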

Service Access and Port Configuration

Used to configure SSH remote access or access services in an instance over the Internet.

  • Listener Port: The port that the service running in the DSW instance listens on.

  • Service Access Method:

    • Access over VPC: This access method is supported by default. You can access services in DSW from other terminals within the VPC, such as an ECS instance.

    • Access over Internet: Select this option to enable public network access. You must also configure a NAT Gateway and an EIP.

  • Internet Access Port: The port that allows access from the public network.

Create Private Zone in VPC

Creates an internal authoritative domain name (PrivateZone). You can use this domain name within the VPC to access the instance's SSH service or other custom services, avoiding the inconvenience of a changing instance IP address. Creating a PrivateZone domain name incurs charges. For more information, see Alibaba Cloud DNS Product Billing.

Public Network Access

NLB:

  • NLB Instance Resource Group: Select the resource group of the NLB instance.

  • NLB Instance: Select an NLB instance that is in the same VPC as the DSW instance.

DNAT + EIP:

  • NAT Gateway: Maps public network requests (EIP:port) to a private DSW instance (private IP:port) to allow public network access to services in the instance.

  • EIP: Provides a public IP address to access services in an instance over the public network.
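After public network access is configured, a plain TCP connection attempt from a client is a quick way to see whether the mapped port answers. The following is a minimal sketch that uses only the Python standard library. The function name port_is_reachable is hypothetical, and the host and port are placeholders for your EIP and Internet access port.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_is_reachable("203.0.113.10", 22) returns False if the security group blocks the port or the service is not listening. The address here is a documentation placeholder, not a real EIP.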

Roles and permissions

Parameter

Description

Visibility

Choose Visible to the Instance Owner or Visible to Current Workspace.

Instance Owner

Only the workspace administrator can change the instance owner.

Show More

Parameter

Description

Instance RAM Role

When you access other cloud resources within a DSW instance, you can associate a RAM role with the instance. This method uses temporary credentials from STS to access other cloud resources, which avoids using long-term AccessKeys and reduces the risk of key exposure.

The instance RAM role can be configured as follows:

  • Default Roles of PAI: Has permissions to access internal PAI products, MaxCompute, and OSS. A temporary access credential issued based on the PAI default role has permissions equivalent to the DSW instance owner when accessing internal PAI products and MaxCompute tables. When accessing OSS, it can only access the default storage path bucket configured for the current workspace.

  • Custom Roles: Configure a custom role for customized or more fine-grained permission management.

  • Does Not Associate Role: Select this option to access other cloud products directly using an AccessKey.

For more information about how to configure an instance RAM role, see Configure an instance RAM role for a DSW instance.

FAQ

DSW instance startup


Q: DSW instance fails to start

Troubleshooting: Click the DSW instance name and check the error message on the Events tab.

  • Your requested resource type [ecs.******] is not enough currently, please try other regions or other resource types

    • Cause: The selected instance type has insufficient inventory in the current region.

    • Solution: Try creating the instance again later, or select a different instance type or region.

  • Your resource usage has exceeded the default limitation. Please contact us via ticket system to raise the limitation.

    • Cause: Each Alibaba Cloud account is limited to 2 GPUs per region by default. This error occurs when the selected instance type exceeds this quota.

    • Solution: To request a quota increase, submit a ticket.

  • the available zone with vSwitch is out of stock or InternalError-ResourceAllocateFailed

    • Cause: When a VPC is configured, the associated vSwitch restricts resource allocation to its availability zone. This causes failures if that zone has insufficient capacity.

    • Solution:

      1. Try creating a vSwitch and a DSW instance in a different availability zone.

      2. Try selecting a different instance type for the DSW instance.

  • Sales of this resource are temporarily suspended in the specified zone. We recommend that you use the multi-zone creation function to avoid the risk of insufficient resource.

    Try one of the following:

    • Switch to a different region.

    • Change the instance type.

    • Start the instance during off-peak hours.

  • CommodityInstanceNotAvailableError: The commodity instance has been released due to an overdue balance. Please create a new instance.

    • Cause: The system automatically released the instance due to prolonged overdue payments.

    • Solution: Create a new instance.

  • The charge of current ECI instance has been stopped, but the related resources are still being cleaned.

    • Cause: Trial resources are shared and may experience high demand during peak hours, causing startup to exceed 30 minutes. If resources are not acquired within one hour, startup fails.

    • Solution: Try the following options:

      • Switch to a different region.

      • Change the instance type. You cannot change the instance type of a pending instance. Stop the instance first.

      • Try again during off-peak hours.

      • If these methods do not resolve the issue, contact your account manager for assistance.

  • The cluster resources are fully utilized. Please try later or other regions.

    • Cause: The compute resources are fully occupied.

    • Solution: Try the following options:

      • Switch to a different region.

      • Change the instance type. Note that you cannot change the instance type of a pending instance. You must stop the instance before changing its type.

      • Try again during off-peak hours.

      • If these methods do not resolve the issue, contact your account manager for assistance.

  • Create ECI failed because the specified instance is out of stock. It is recommended to use the multi-zone creation function to avoid the risk of stockout.

    • Cause: The specified compute resource is out of stock.

    • Solution: Try the following options:

      • Switch to a different region.

      • Change the instance type. You cannot change the instance type of a pending instance. Stop the instance first.

      • Try again during off-peak hours.

      • If these methods do not resolve the issue, contact your account manager for assistance.

  • back-off 10s restarting failed container=dsw-notebook pod

    • Cause: The system disk is full.

      To confirm, check the system disk usage of the instance.

    • Solution: Expand the system disk using Change Settings.

      Important

      An expanded system disk incurs charges continuously, regardless of whether the instance is running. To stop all billing, delete the instance. Back up all necessary data before deletion.

  • Startup failed, with the message "Workspace member not found"

    This error indicates that the account you are logged in with is not a member of the target workspace. Contact your workspace administrator to add your account to the workspace.

  • failed to create containerd container: failed to prepare layer from archive: failed to validate archive quota ...

    • Cause: The image used to create the instance is too large for the system disk, resulting in insufficient space.

    • Solution: Open the instance details page and expand the system disk. Note that expanding the system disk incurs additional charges based on its capacity.

  • InternalError-Failed to perform action, error: OperationDenied.NoStock: The resource is out of stock in the specified zone. Please try other types, or choose other regions and zones.

    The resource is temporarily out of stock in the specified availability zone. Try selecting a different instance type, or switch to another region and availability zone.

  • RISK.RISK_CONTROL_REJECTION

    Your account has been restricted by risk control. You must resolve this issue before you can create an instance.

  • Creation failure due to overdue payments

    If your account has an overdue balance, you cannot create a DSW instance. Vouchers cannot be used to offset overdue payments. To check your account balance, log in to the Billing Management console.

Q: Instance type is out of stock or quota is insufficient

Common error messages:

  • "Your requested resource type [ecs.******] is not enough currently" (insufficient resource inventory).

  • "Your resource usage has exceeded the default limitation" (exceeds the 2-GPU quota per region).

  • "The cluster resources are fully utilized" (compute resources are fully occupied).

Cause Analysis:

  • Insufficient public resource inventory

    • Public resources are shared among multiple users, so inventory may be tight during peak hours.

    • Specific GPU instance types, such as high-end GPUs, are more likely to be sold out.

    • Each account has a quota of 2 GPUs per region.

  • Insufficient dedicated resource quota

    • The purchased dedicated resource quota has been exhausted.

    • Ineffective quota allocation has resulted in an insufficient quota for a specific workspace.

Solutions:

  • Change the instance type: If the selected GPU type is out of stock, try a different GPU type.

  • Switch the region: In the upper-left corner of the PAI console, switch to another region and try to create an instance.

  • Increase the GPU quota: If you need to use public resources with more than 2 GPUs, submit a ticket.

  • Purchase a dedicated resource quota: For guaranteed resource availability, purchase a dedicated resource quota. For more information, see Purchase general-purpose compute resources and Manage resource quotas.
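The error messages and solutions in this section can be condensed into a small lookup for triaging messages copied from the Events tab. The following is a minimal sketch, not a DSW feature: the function name suggest_action is hypothetical, the message fragments are taken from this topic, and matching on substrings is an assumption that may miss reworded messages.

```python
# (message fragment, suggested action), both condensed from this topic.
KNOWN_ERRORS = [
    ("is not enough currently",
     "Insufficient inventory: retry later, or select another instance type or region."),
    ("exceeded the default limitation",
     "GPU quota exceeded (2 GPUs per region by default): submit a ticket to raise the quota."),
    ("cluster resources are fully utilized",
     "Resources fully occupied: switch region, change instance type, or retry off-peak."),
    ("out of stock",
     "Zone out of stock: try another zone, instance type, or region."),
    ("Workspace member not found",
     "Ask the workspace administrator to add your account to the workspace."),
]


def suggest_action(event_message: str):
    """Return the first matching suggestion, or None if the message is not recognized."""
    lowered = event_message.lower()
    for fragment, action in KNOWN_ERRORS:
        if fragment.lower() in lowered:
            return action
    return None


if __name__ == "__main__":
    print(suggest_action("The cluster resources are fully utilized. Please try later or other regions."))
```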

Q: Can I execute a Python file when a DSW instance starts?

Yes. You can configure a Custom Startup Script when you create a DSW instance or when you change the configuration of an existing instance.

The custom script runs after the image and resources are prepared but before development applications (JupyterLab, Code Server) launch.

Note
  • 3-minute timeout: A custom startup script increases instance startup time and will time out after 3 minutes. Avoid including long-running tasks, such as downloading large images, in the script.

  • Log access: After the instance starts, you can find logs generated by the script in the /var/log/user-command/ directory.
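To review what the startup script did, you can read the files in the log directory after the instance starts. The following is a minimal sketch with a hypothetical function name, tail_logs. The directory comes from this topic, but the names and number of log files in it are not documented here, so the sketch simply reads every regular file it finds.

```python
from pathlib import Path


def tail_logs(log_dir: str = "/var/log/user-command", lines: int = 20) -> dict:
    """Return the last `lines` lines of each file in the log directory."""
    root = Path(log_dir)
    if not root.is_dir():
        return {}  # Directory is absent, for example outside a DSW instance.
    result = {}
    for path in sorted(root.iterdir()):
        if path.is_file():
            text = path.read_text(errors="replace")
            result[path.name] = text.splitlines()[-lines:]
    return result


if __name__ == "__main__":
    for name, tail in tail_logs().items():
        print(f"== {name} ==")
        print("\n".join(tail))
```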

Q: Cannot find my DSW instance

The overview page shows instances you created. Switch between instance types and regions to locate your instance.

Q: DSW page is blank or unresponsive

Blank pages, continuously loading Notebooks, or unresponsive Terminals are typically caused by your local environment. Try the following:

  1. Clear your browser's cache and try again.

  2. Access the page using your browser's incognito or private mode.

  3. Switch your network environment, for example, from a corporate network to a mobile hotspot, to rule out firewall restrictions.

  4. Try using a different browser, such as Chrome or Firefox.

Q: Data persistence on cloud disk instances

DSW instances that use a cloud disk as the system disk include instances created in public resource groups and general-purpose resource instances. Data persistence on the system disk for these instances is as follows:

  • Stopping the instance: Data might be lost. If the cloud disk has not been expanded and the instance remains stopped for more than 15 days, the data is erased and cannot be recovered. If the cloud disk has been expanded or the instance has been stopped for 15 days or less, the data is preserved.

  • Restarting the instance: Data is not lost. When an instance is stopped or restarted, all packages installed via pip, code files, and other data stored on the system disk are retained.

  • Changing the instance type: Data is not lost. Adjusting the instance type (such as CPU, memory, or GPU configuration) does not affect data on the system disk.

  • Changing the instance image: Some data may be lost. Mounted datasets and OSS data are unaffected, but system disk content may be reset. Back up your data before changing the image. For more information, see Mount a dataset, OSS bucket, NAS file system, or CPFS file system.

For general-purpose resource instances that use Temporary Storage as the system disk, all data is lost when the instance is stopped, restarted, or reconfigured (by changing its instance type or image). This occurs even if the resource group is configured with a prepaid cloud disk.
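The 15-day rule for cloud disk system disks can be expressed as a simple check. The following is a minimal sketch with a hypothetical function name, system_disk_data_erased. It encodes only the rule stated in this topic (a non-expanded disk is erased after the instance has been stopped for more than 15 days) and does not query any DSW API.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=15)  # Retention period stated in this topic.


def system_disk_data_erased(stopped_at: datetime, now: datetime, disk_expanded: bool) -> bool:
    """Return True if, per the documented rule, the system disk data is erased."""
    if disk_expanded:
        # An expanded cloud disk is not subject to the 15-day restriction.
        return False
    return now - stopped_at > RETENTION


if __name__ == "__main__":
    stopped = datetime(2024, 1, 1)
    print(system_disk_data_erased(stopped, datetime(2024, 1, 20), disk_expanded=False))
```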

Q: Can I recover an instance released due to inactivity?

No, the data cannot be recovered. If a DSW instance created from public resources has a standard (non-expanded) system disk and remains stopped for more than 15 consecutive days, the system disk is automatically erased.

Q: DSW instance startup becomes slow

A large saved image can cause a gradual increase in startup time.

Q: Why does the "out of stock" or "InternalError-ResourceAllocateFailed" error occur when creating a DSW instance, or why does the instance fail to start after configuring SSH/public IP but succeeds after removing the configuration?

These errors are usually caused by insufficient resource inventory in the zone where the VPC vSwitch is located, not by a network configuration conflict. When you specify a VPC and vSwitch to create a DSW instance, the system limits resource scheduling to the zone where the vSwitch is located. If the resources in that zone are sold out, the instance creation fails.

Common error messages and symptoms:

  • out of stock

  • InternalError-ResourceAllocateFailed

  • The instance fails to start after configuring SSH access or a public IP, but starts successfully after these configurations are removed.

Solution: Switch to a vSwitch in another zone that has available resources. We recommend that you create vSwitches in multiple zones within the VPC to increase the success rate of resource scheduling.
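
The multi-zone recommendation can be automated by trying vSwitches one after another. The following is a minimal sketch under stated assumptions: `create` is a hypothetical callable that you supply (for example, a wrapper around your own instance creation call) which takes a vSwitch ID and returns an instance ID, and detecting inventory errors by matching the message text quoted above is an assumption of this sketch, not documented SDK behavior.

```python
from typing import Callable, Iterable, Tuple

# Error strings quoted in this FAQ. Matching on message text is an assumption.
STOCK_ERRORS = ("out of stock", "InternalError-ResourceAllocateFailed")


def create_with_vswitch_fallback(create: Callable[[str], str],
                                 vswitch_ids: Iterable[str]) -> Tuple[str, str]:
    """Try each vSwitch in order and return (vswitch_id, instance_id)
    for the first zone that has inventory."""
    last_error = None
    for vswitch_id in vswitch_ids:
        try:
            return vswitch_id, create(vswitch_id)
        except Exception as exc:
            if not any(marker in str(exc) for marker in STOCK_ERRORS):
                raise  # Not an inventory problem, so do not hide it.
            last_error = exc
    raise RuntimeError("No vSwitch zone had available inventory.") from last_error
```

Errors that are unrelated to inventory are re-raised immediately, so a permission or parameter problem is not mistaken for a sold-out zone.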

Q: What should I do if a stopped DSW instance cannot be restarted? How can I check the arrival time of public resource inventory?

Public resource inventory is updated dynamically, and you cannot query the specific arrival time. After a DSW instance is manually stopped, its underlying computing resources are released. Restarting the instance is equivalent to reapplying for those resources. If the original instance type has been occupied by other users when you try to restart, the startup may fail.

Solution:

  • Switch to another region.

  • Switch to another available instance type.

  • Retry during off-peak hours.

Q: What should I do if the "Private zone service status is not OPENED" error is reported when creating a DSW instance?

This error indicates that the Alibaba Cloud DNS PrivateZone service is not enabled. DSW relies on PrivateZone for VPC network resolution initialization.

Solution:

  1. Log on to the Alibaba Cloud Management Console, and then search for and enable the Alibaba Cloud DNS PrivateZone service.

  2. Confirm that the VPC where the DSW instance is located is associated with this service.

  3. Recreate or restart the DSW instance to trigger network initialization.

Q: What should I do if the "Workspace member not found" or "Workspace not exists" error is reported when creating a DSW instance?

This error usually occurs for the following reasons:

  • The account that you are using is not a member of the target workspace.

  • You selected the wrong workspace.

Solution:

  • Switch to the correct workspace.

  • Contact the workspace administrator to add the current account as a member of the workspace.

DSW instance stop or release

Q: How do I release a DSW instance?

On the DSW instance page, click Stop or Delete.

Important: If you expanded the system disk when you created the DSW instance, the system disk continues to incur charges regardless of whether the instance is running. To completely stop billing for the DSW instance, you must delete it.

Q: Why can't I find my DSW instance?

If you cannot find the instance, try switching to a different region or workspace.

Q: How do I release a free trial resource package?

You do not need to stop or release free trial resource packages.

Q: How do I stop all charges for a DSW instance? What is the difference between Stop and Delete?

  • Stopping an instance: Releases the instance's compute resources (CPU/GPU) and pauses compute charges. Note: An expanded system disk continues to incur charges.

  • Deleting an instance: Permanently deletes the instance and all its resources, including the system disk. All associated billing stops completely.

How to choose:

  • Stop: If you don't need the instance for a while but want to retain its data and environment to restart it later.

  • Delete: If you no longer need the instance and want to stop all billing. Be sure to back up your data first.
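
The billing difference between the two actions can be written as a small lookup. This is a minimal sketch for illustration only: the function name and the charge labels are hypothetical, and it covers only the compute and system disk charges described in this answer.

```python
def charges_after(action: str, disk_expanded: bool) -> set:
    """Return the charge types that continue after the given action."""
    if action == "delete":
        # Deleting removes the instance and its system disk, so nothing is billed.
        return set()
    if action == "stop":
        # Compute charges pause, but an expanded system disk is still billed.
        return {"system_disk"} if disk_expanded else set()
    raise ValueError("Unknown action: " + action)
```

For example, `charges_after("stop", disk_expanded=True)` returns `{"system_disk"}`, which is why deleting the instance is the only way to stop all billing in that case.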

Q: Why is my DSW instance stuck in the Stopping or Deleting state for a long time?

The system needs to safely terminate tasks, save state, and reclaim resources. Common causes of extended wait times:

  • Some processes in the instance did not exit cleanly.

  • High memory usage prevents the instance from responding to the shutdown command.

If this occurs, wait a few minutes and then refresh the page. The instance should stop normally.

Q: Will I lose my data and code after I stop or delete a DSW instance?

Data retention depends on the operation you perform and your instance's system disk type.

  • Stopping an instance:

    When an instance is stopped, data retention depends on its system disk type.

    • Instances with a cloud disk system disk (most pay-as-you-go specifications and general-purpose resource instances that use a Disk as the system disk): If the cloud disk was not expanded and the instance is stopped for more than 15 days, the data is cleared and cannot be recovered. If the cloud disk was expanded or the instance has been stopped for 15 days or less, the data is not lost.

    • Instances that use Temporary Storage as the system disk: When you stop the instance, its data is deleted and cannot be recovered.

  • Deleting an instance:

    All data on its system disk is permanently erased and cannot be recovered. Therefore, make sure to back up all important data before you delete the instance.

Q: Why does my running DSW instance stop automatically?

The instance is configured with an idle auto-shutdown policy, enabled by default for free trial instances.

  • Trigger condition: The CPU and GPU utilization stays below a set threshold for 3 consecutive hours.

  • Recommended actions:

    • Manual stop: To ensure resource savings, manually stop the instance when it is not in use. The auto-shutdown policy is not guaranteed to trigger every time.

    • Modify the policy: If you need to run long-running tasks, modify or disable this policy.

      Modifying the DSW auto-shutdown policy

      1. Go to the workspace details page and click Configure Workspace > Auto-stop Settings.

      2. In the DSW configuration section, you can modify the shutdown and exclusion policies. For example, if you do not want an instance to shut down automatically, you can add its name to the exclusion policy.

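The trigger condition described above (CPU and GPU utilization below a threshold for 3 consecutive hours) can be illustrated with a short sketch. It is not the platform's actual implementation: the sample format, the function names, and the threshold values you pass in are assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Hypothetical sample format: (timestamp, CPU utilization %, GPU utilization %).
Sample = Tuple[datetime, float, float]


def idle_duration(samples: Iterable[Sample],
                  cpu_threshold: float,
                  gpu_threshold: float) -> timedelta:
    """Length of the most recent unbroken run of samples in which both
    CPU and GPU utilization are below their thresholds."""
    ordered = sorted(samples, key=lambda s: s[0])
    run_start = None
    for timestamp, cpu, gpu in reversed(ordered):
        if cpu < cpu_threshold and gpu < gpu_threshold:
            run_start = timestamp
        else:
            break  # A busy sample ends the idle run.
    if run_start is None:
        return timedelta(0)
    return ordered[-1][0] - run_start


def should_auto_stop(samples: Iterable[Sample],
                     cpu_threshold: float,
                     gpu_threshold: float,
                     window: timedelta = timedelta(hours=3)) -> bool:
    """True when the instance has been idle for at least `window`."""
    return idle_duration(samples, cpu_threshold, gpu_threshold) >= window
```

A single busy sample resets the idle run, which is why a long-running task with periodic activity does not trigger the policy in this model.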

Q: Why do I still see a "Running" status or receive bills after stopping or deleting all my DSW instances?

Check the following common causes:

  • You may be confusing a resource package with an instance. The "Running" status can refer to a resource package (such as "250 compute hours per month"), not an instance. A resource package remains active until it expires, and its status is independent of any instance.

  • Stopping an instance only pauses compute charges. An expanded system disk continues to incur storage charges.

  • Billing delay. Billing is not real-time, and bills may be generated several hours after you use a resource. For example, charges incurred in the morning might not appear on your bill until the afternoon.

DSW instance free trial

For information about how to claim, use, and release DSW free trial resources, see Claim, use, and release free trial resources.

Appendix: Create an instance using a Python SDK

  1. The Alibaba Cloud SDK uses the Credentials tool to obtain credential information. Before you call an API, you must install and configure the tool. The requirements are as follows:

    • Python 3.7 or later.

    • Alibaba Cloud SDK V2.0.

    The installation command is as follows:

    pip install alibabacloud_credentials

  2. Obtain an AccessKey. This example uses AccessKey information to configure access credentials. To prevent your credentials from being exposed, we recommend that you configure the AccessKey as environment variables. The required variable names are ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET.

  3. Install the PAI-related Python SDKs.

    # Install the workspace SDK.
    pip install alibabacloud-aiworkspace20210204 -U -q
    # Install the DSW SDK.
    pip install alibabacloud_pai_dsw20220101 -U -q
    # OpenAPI dependency.
    pip install alibabacloud_tea_openapi -U -q
    # Install the SDK for querying subscription resource groups.
    pip install https://sdk-portal-us-prod.oss-accelerate.aliyuncs.com/downloads/u-b8602de7-c468-436c-8a02-2eca4a30d376-python-paistudio.zip -U -q

  4. Create a DSW instance.

    Code example for creating an instance

    import os
    
    from alibabacloud_aiworkspace20210204.client import Client as AIWorkspaceClient
    from alibabacloud_aiworkspace20210204.models import (ListWorkspacesRequest,
                                                         ListImagesRequest)
    from alibabacloud_credentials.client import Client
    from alibabacloud_credentials.models import Config
    from alibabacloud_pai_dsw20220101.client import Client as DSWClient
    from alibabacloud_pai_dsw20220101.models import (GetInstanceRequest,
                                                     ListEcsSpecsRequest,
                                                     CreateInstanceRequest)
    from alibabacloud_tea_openapi.client import TeaException
    from alibabacloud_tea_openapi.models import Config as AliyunConfig
    
    # Configure access credentials.
    # An Alibaba Cloud account AccessKey has permissions on all APIs. We recommend that you use a RAM user for API access or routine O&M.
    # We strongly recommend that you do not save your AccessKey ID and AccessKey secret in your project code. Otherwise, your AccessKey may be leaked, which threatens the security of all resources under your account.
    # This example reads the AccessKey from environment variables by default for identity verification using the Credentials SDK.
    region_id = 'cn-beijing'  # The region. You can set it to cn-hangzhou, cn-shanghai, cn-shenzhen, or other regions.
    config = Config(
        type='access_key',
        access_key_id=os.environ.get('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        access_key_secret=os.environ.get('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    )
    cred = Client(config)
    # client config.
    workspace_client = AIWorkspaceClient(
        config=AliyunConfig(
            credential=cred,
            region_id=region_id,
            endpoint="aiworkspace.{}.aliyuncs.com".format(region_id),
        )
    )
    dsw_client = DSWClient(
        config=AliyunConfig(
            credential=cred,
            region_id=region_id,
            endpoint='pai-dsw.{}.aliyuncs.com'.format(region_id),
        )
    )
    
    
    # Define a helper function to display DSW instance information.
    def show_instance(instance_id):
        instance = dsw_client.get_instance(instance_id=instance_id, request=GetInstanceRequest()).body
        print(instance.status, instance.instance_name, instance.ecs_spec, instance.accumulated_running_time_in_ms)
    
    
    # Query the properties and ID of an existing workspace.
    workspace_name = 'The name of an existing AI workspace'
    # Get the list of workspaces.
    workspaces = workspace_client.list_workspaces(ListWorkspacesRequest(
        page_number=1,
        page_size=10,
        workspace_name=workspace_name,  # Fuzzy match. If you do not specify a name, all workspaces are returned.
    ))
    if len(workspaces.body.workspaces) == 0:
        raise RuntimeError('Specify a correct workspace_name.')
    for workspace in workspaces.body.workspaces:
        print(workspace.workspace_name,
              workspace.workspace_id,
              workspace.status, workspace.creator)
    # Use the first result as the workspace for subsequent operations. You can switch to another workspace or specify a string ID as needed.
    workspace_id = workspaces.body.workspaces[0].workspace_id
    # Get the list of images. You can filter the images by label.
    images = workspace_client.list_images(ListImagesRequest(
        page_size=100,
        # workspace_id=workspace_id, # If you do not specify workspace_id, all built-in images of the PAI platform are queried.
        labels=','.join(['system.supported.dsw=true',
                         # 'system.framework=tensorflow', # Specify PyTorch or TensorFlow.
                         'system.pythonVersion=3.6',
                         ]),
        verbose=True  # verbose=True lists more detailed information, including labels.
    ))
    # You can view all available images.
    for image in images.body.images:
        print(image.image_id, image.image_uri)
    # Select the image for the new instance. This example uses the first image.
    image_uri = images.body.images[0].image_uri
    print('image_uri', image_uri)
    # Get the list of DSW node specifications.
    try:
        resp = dsw_client.list_ecs_specs(ListEcsSpecsRequest(accelerator_type='CPU',  # CPU or GPU
                                                             )).body
    except TeaException as t:
        print("List ECS Specs failed:", t.message)
        raise  # ecs_spec is required below, so stop instead of continuing without it.
    else:
        if not resp.ecs_specs:
            raise RuntimeError('No instance types were returned for the selected accelerator_type.')
        for spec in resp.ecs_specs:
            print(spec.instance_type + ", CPU: " + str(spec.cpu) + ", Memory: " + str(spec.memory))
        # Select the instance type for the new instance. This example uses the first one.
        ecs_spec = resp.ecs_specs[0].instance_type
        print('Selected ecs_spec:', ecs_spec)
    
    # Create a DSW instance.
    request = CreateInstanceRequest(instance_name="Test_From_SDK_1",
                                    ecs_spec=ecs_spec,
                                    workspace_id=workspace_id,
                                    # image_id='', # You can specify the ID of an image in the workspace. You can specify either image_id or image_url.
                                    image_url=image_uri)
    try:
        ins_resp = dsw_client.create_instance(request)
    except TeaException as t:
        print('Failed to create the instance. Error message: ' + t.message)
    else:
        instance_id = ins_resp.body.instance_id
        print("Created Instance ID:", instance_id)
        show_instance(instance_id)
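
Instance creation is asynchronous, so a script usually needs to wait before using the instance. The following is a minimal polling sketch, not part of the SDK: `wait_for_status` is a hypothetical helper, `get_status` is a callable that you supply, and the status strings "Running" and "Failed" are assumptions that you should check against the values your instances actually report.

```python
import time
from typing import Callable, Iterable


def wait_for_status(get_status: Callable[[], str],
                    target: str = "Running",
                    failed: Iterable[str] = ("Failed",),
                    timeout_s: float = 600.0,
                    interval_s: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it returns `target`.

    Raises RuntimeError if a failure status is seen and TimeoutError
    if `timeout_s` elapses first."""
    failed = tuple(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError("Instance entered status: " + status)
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for status " + target)
        sleep(interval_s)
```

With the names from the example above, the status callable could be `lambda: dsw_client.get_instance(instance_id=instance_id, request=GetInstanceRequest()).body.status`.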

For more information about the APIs, see API overview.