ALIYUN::PAIDLC::Job

ALIYUN::PAIDLC::Job is used to create a job and submit it to a cluster to run.

Syntax

{
  "Type": "ALIYUN::PAIDLC::Job",
  "Properties": {
    "ThirdpartyLibs": List,
    "Options": String,
    "Priority": Integer,
    "Envs": String,
    "JobMaxRunningTimeMinutes": Integer,
    "WorkspaceId": String,
    "CodeSource": Map,
    "UserVpc": Map,
    "JobSpecs": List,
    "UserCommand": String,
    "DataSources": List,
    "JobType": String,
    "ResourceId": String,
    "ThirdpartyLibDir": String,
    "DisplayName": String,
    "SuccessPolicy": String,
    "Settings": Map
  }
}

Properties

Property name

Type

Required

Update allowed

Description

Constraints

ThirdpartyLibs

List

A list of third-party Python libraries, each with its version requirement.

Example: numpy==1.16.1

Options

String

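Additional configuration for this job.

Use this parameter to adjust the behavior of mounted data sources. For example, if the job mounts an OSS data source, set this parameter to fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16 to override the default JindoFS parameters.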

Priority

Integer

The priority of the job.

Default value: 1.

Valid values: 1 to 9, where:

  • 1 is the lowest priority.

  • 9 is the highest priority.

Envs

String

Environment variable configuration.

JobMaxRunningTimeMinutes

Integer

The maximum running time of the job.

Unit: minutes.

WorkspaceId

String

The workspace ID.

For how to obtain the workspace ID, see ListWorkspaces (lists workspaces).

CodeSource

Map

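The code source used by this job.

Before the job's nodes start, DLC automatically downloads the code configured in the code source and mounts it to a local directory in the container. For more information, see CodeSource properties.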

UserVpc

Map

User VPC configuration.

For more information, see UserVpc syntax.

JobSpecs

List

The runtime configuration of the job.

For more information, see JobSpecs properties.

UserCommand

String

The startup command for all nodes of the job.

DataSources

List

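The list of all data sources used by this job.

Each data source is mounted, according to its configuration, to a local directory in the container of every node (the local directory is specified by MountPath in the data source configuration).

Processes launched by the job's startup command access the distributed file system behind each data source directly through its MountPath.

For more information, see DataSources properties.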

JobType

String

The job type.

Case-sensitive. Supported job types:

  • TFJob

  • PyTorchJob

  • XGBoostJob

  • OneFlowJob

  • ElasticBatch

ResourceId

String

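The resource group ID.

Optional.

  • If left empty, the job is submitted to the public resource group.

  • If the current workspace is bound to a dedicated resource group, you can specify the ID of that resource group here. For how to create a dedicated resource group and query its ID, see Create and manage general training resources.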

ThirdpartyLibDir

String

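The name of the folder that contains the Python third-party library file (requirements.txt).

Before each node runs the specified UserCommand, PAI-DLC fetches requirements.txt from the specified folder and installs it with pip install -r.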

DisplayName

String

The name of the job.

Naming rules:

  • The name can be up to 256 characters long.

  • It can contain digits, letters, underscores (_), periods (.), and hyphens (-).

SuccessPolicy

String

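The success policy for distributed multi-node jobs.

Currently supported only for multi-node TensorFlow jobs.

Valid values:

  • ChiefWorker: the job is considered successful as soon as the Chief pod finishes successfully.

  • AllWorkers (default): the job is considered successful only when all Workers finish successfully.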

Settings

Map

Additional parameter settings for the job.
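As an illustration of how several of the scalar properties above fit together, the following fragment is a minimal sketch, not a complete template. The logical resource name, job name, workspace ID placeholder, and the chosen numeric values are hypothetical; only the property names and value formats come from the table above.

```yaml
Resources:
  ExampleJob:                                # hypothetical logical resource name
    Type: ALIYUN::PAIDLC::Job
    Properties:
      WorkspaceId: '<your-workspace-id>'     # placeholder; see ListWorkspaces
      DisplayName: tf-mnist_demo.v1          # hypothetical; digits, letters, _ . - only
      JobType: TFJob                         # case-sensitive
      Priority: 5                            # hypothetical; 1 (lowest) to 9 (highest), default 1
      JobMaxRunningTimeMinutes: 120          # hypothetical limit, in minutes
      SuccessPolicy: ChiefWorker             # applies to multi-node TensorFlow jobs only
      ThirdpartyLibs:
        - numpy==1.16.1
      # Overrides the JindoFS defaults for a mounted OSS data source
      Options: 'fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16'
```

This fragment omits JobSpecs, UserCommand, and the remaining properties, so it is not deployable on its own.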

CodeSource syntax

"CodeSource": {
  "MountPath": String,
  "Commit": String,
  "Branch": String,
  "CodeSourceId": String
}

CodeSource properties

Property name

Type

Required

Update allowed

Description

Constraints

MountPath

String

The path to mount for this job.

Defaults to the mount path configured in the code source.

Commit

String

The commit ID of the code to download for this job.

Defaults to the commit ID configured in the code source.

Branch

String

The branch of the code repository used when this job runs.

Defaults to the branch configured in the code source.

CodeSourceId

String

The code source ID.

For how to obtain the code source ID, see ListCodeSources (lists code source configurations).
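A CodeSource value might look like the following sketch. The ID is a placeholder, and the branch name and mount path are hypothetical values chosen for illustration. Commit is left out, in which case the commit ID configured in the code source applies.

```yaml
CodeSource:
  CodeSourceId: '<your-code-source-id>'   # placeholder; see ListCodeSources
  Branch: main                            # hypothetical branch name
  MountPath: /root/code                   # hypothetical path inside the container
```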

UserVpc syntax

"UserVpc": {
  "VpcId": String,
  "SecurityGroupId": String,
  "SwitchId": String,
  "ExtendedCIDRs": List
}

UserVpc properties

Property name

Type

Required

Update allowed

Description

Constraints

VpcId

String

The ID of the user's VPC.

SecurityGroupId

String

The ID of the user's security group.

SwitchId

String

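The ID of the user's vSwitch.

Optional.

  • If left empty, the system automatically selects a suitable vSwitch based on inventory.

  • You can also specify a vSwitch ID yourself.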

ExtendedCIDRs

List

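Extended CIDR blocks.

Valid values:

  • If the vSwitch ID is empty, omit this parameter; the system automatically obtains all CIDR blocks in the VPC.

  • If the vSwitch ID is not empty, this parameter is required. We recommend specifying all CIDR blocks in the VPC.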
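The interaction between SwitchId and ExtendedCIDRs described above can be sketched as follows. All IDs are placeholders and the CIDR block is a hypothetical value; substitute the CIDR blocks of your own VPC.

```yaml
UserVpc:
  VpcId: '<your-vpc-id>'                       # placeholder
  SecurityGroupId: '<your-security-group-id>'  # placeholder
  SwitchId: '<your-vswitch-id>'                # placeholder; omit to let the system choose
  ExtendedCIDRs:                               # required here because SwitchId is set
    - 192.168.0.0/16                           # hypothetical; ideally list every CIDR block in the VPC
```

If SwitchId is omitted, ExtendedCIDRs should be omitted as well, and the system obtains the CIDR blocks of the VPC automatically.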

JobSpecs syntax

"JobSpecs": [
  {
    "PodCount": Integer,
    "ImageConfig": Map,
    "UseSpotInstance": Boolean,
    "Type": String,
    "EcsSpec": String,
    "ResourceConfig": Map,
    "Image": String,
    "ExtraPodSpec": Map
  }
]

JobSpecs properties

Property name

Type

Required

Update allowed

Description

Constraints

PodCount

Integer

The number of replicas.

ImageConfig

Map

Configuration for private image information.

UseSpotInstance

Boolean

Whether to use spot instances.

Valid values:

  • true: use spot instances.

  • false: do not use spot instances.

Type

String

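The node type.

Type is closely tied to JobType: each JobType supports different Worker types.

  • TFJob: supports Chief, PS, Worker, Evaluator, and GraphLearn.

  • PyTorchJob: supports Worker and Master.

  • XGBoostJob: supports Worker and Master.

For PyTorchJob and XGBoostJob, Master is optional. If no Master is specified, the system automatically treats the first Worker node as the Master.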

EcsSpec

String

The hardware specification of the Worker.

Prices vary by specification. For more information, see PAI-DLC billing.

ResourceConfig

Map

Resource configuration.

Image

String

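The address of the image that this type of Worker runs.

You can call ListImages to obtain the community images and PAI-optimized images provided by the PAI platform. You can also specify a publicly available third-party image.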

ExtraPodSpec

Map

Additional pod configuration.
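To show how Type relates to JobType, here is a sketch of a JobSpecs list for a TFJob with one Chief, one PS, and two Workers. The replica counts are hypothetical, and the hardware specification and image address are placeholders to be replaced with real values (for example, an image returned by ListImages).

```yaml
# Fragment of a Properties block; all values are hypothetical placeholders.
JobType: TFJob
JobSpecs:
  - Type: Chief                  # one of the node types TFJob supports
    PodCount: 1
    EcsSpec: '<ecs-spec>'        # placeholder hardware specification
    Image: '<image-address>'     # placeholder image address
  - Type: PS
    PodCount: 1
    EcsSpec: '<ecs-spec>'
    Image: '<image-address>'
  - Type: Worker
    PodCount: 2
    UseSpotInstance: false       # do not use spot instances for these Workers
    EcsSpec: '<ecs-spec>'
    Image: '<image-address>'
```

For a PyTorchJob or XGBoostJob, a single entry with Type: Worker would be enough, since the first Worker is treated as the Master when none is specified.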

DataSources syntax

"DataSources": [
  {
    "MountPath": String,
    "DataSourceId": String
  }
]

DataSources properties

Property name

Type

Required

Update allowed

Description

Constraints

MountPath

String

The path to mount for this job.

Defaults to the mount path configured in the data source.

DataSourceId

String

The ID of the data source.

For how to view the data source ID, see ListDatasets (lists datasets).
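As a sketch of how a mounted data source is then reached from the startup command, the fragment below mounts one data source and lists its contents. The data source ID is a placeholder, and the mount path and command are hypothetical.

```yaml
# Fragment of a Properties block; values are hypothetical placeholders.
DataSources:
  - DataSourceId: '<your-data-source-id>'   # placeholder; see ListDatasets
    MountPath: /mnt/data                    # hypothetical; overrides the data source's own mount path
UserCommand: ls -l /mnt/data                # processes access the data through MountPath
```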

Return values

Fn::GetAtt

JobId: the ID of the job created by this call.

Examples

  • YAML format

    ROSTemplateFormatVersion: '2015-09-01'
    Parameters:
      CodeSource:
        Description: The code source used by this job. Before the job's nodes start,
          DLC automatically downloads the code configured in the code source and
          mounts it to a local directory in the container.
        Type: Json
      DataSources:
        Description: The list of data sources used by the job.
        Type: Json
      DisplayName:
        Description: 'The name of the job. Naming rules:
    
          The name can be up to 256 characters long.
    
          It can contain digits, letters, underscores (_), periods (.), and hyphens
          (-).'
        Type: String
      Envs:
        Description: Environment variable configuration.
        Type: String
      JobMaxRunningTimeMinutes:
        Description: The maximum running time of the job, in minutes.
        Type: Number
      JobSpecs:
        Description: 'JobSpecs describes the runtime configuration of the job, such as
          the image address, startup command, node resource declaration, and number of
          replicas.
    
          A DLC job consists of different types of nodes. Nodes of the same type share
          exactly the same configuration, which is called a JobSpec. JobSpecs describes
          the configuration of all node types and is an array of JobSpec.'
        Type: Json
      JobType:
        AllowedValues:
        - TFJob
        - PyTorchJob
        - XGBoostJob
        - OneFlowJob
        - ElasticBatch
        Description: 'The type of job. Values: TFJob, PyTorchJob, XGBoostJob, OneFlowJob,
          ElasticBatch'
        Type: String
      Options:
        Description: The additional configuration of this job. Use this parameter to
          adjust the behavior of mounted data sources. For example, if the job mounts
          an OSS data source, you can override the default JindoFS parameters by setting
          this parameter to fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16.
        Type: String
      Priority:
        Description: 'The priority of the job. Optional. Default value: 1. Valid values:
          1 to 9, where:
    
          1 is the lowest priority.
    
          9 is the highest priority.'
        Type: Number
      ResourceId:
        Description: 'The resource group ID. Optional.
    
          If left empty, the job is submitted to the public resource group.
    
          If the current workspace is bound to a dedicated resource group, you can specify
          the ID of that resource group here. For how to create a dedicated resource
          group and query its ID, see the documentation on preparing and managing
          DLC resource group clusters.'
        Type: String
      Settings:
        Description: Job settings.
        Type: Json
      SuccessPolicy:
        Description: 'The success policy for distributed multi-node jobs. Currently supported
          only for multi-node TensorFlow jobs.
    
          ChiefWorker: the job is considered successful as soon as the Chief pod finishes
          successfully.
    
          AllWorkers: the job is considered successful only when all Workers finish
          successfully.'
        Type: String
      ThirdpartyLibDir:
        Description: The name of the folder that contains the requirements.txt file.
          Before each node runs the specified UserCommand, PAI-DLC fetches requirements.txt
          from the specified folder and installs it with pip install -r.
        Type: String
      ThirdpartyLibs:
        Description: Python third-party library list to be installed.
        Type: Json
      UserCommand:
        Description: Start commands of all nodes in the task.
        Type: String
      UserVpc:
        Description: User VPC configuration.
        Type: Json
      WorkspaceId:
        Description: The workspace ID. For how to obtain the workspace ID, see ListWorkspaces.
        Type: String
    Resources:
      Job:
        Properties:
          CodeSource:
            Ref: CodeSource
          DataSources:
            Ref: DataSources
          DisplayName:
            Ref: DisplayName
          Envs:
            Ref: Envs
          JobMaxRunningTimeMinutes:
            Ref: JobMaxRunningTimeMinutes
          JobSpecs:
            Ref: JobSpecs
          JobType:
            Ref: JobType
          Options:
            Ref: Options
          Priority:
            Ref: Priority
          ResourceId:
            Ref: ResourceId
          Settings:
            Ref: Settings
          SuccessPolicy:
            Ref: SuccessPolicy
          ThirdpartyLibDir:
            Ref: ThirdpartyLibDir
          ThirdpartyLibs:
            Ref: ThirdpartyLibs
          UserCommand:
            Ref: UserCommand
          UserVpc:
            Ref: UserVpc
          WorkspaceId:
            Ref: WorkspaceId
        Type: ALIYUN::PAIDLC::Job
    Outputs:
      JobId:
        Description: The ID of the job created by this call.
        Value:
          Fn::GetAtt:
          - Job
          - JobId
  • JSON format

    {
      "ROSTemplateFormatVersion": "2015-09-01",
      "Parameters": {
        "ThirdpartyLibs": {
          "Type": "Json",
          "Description": "Python third-party library list to be installed."
        },
        "Options": {
          "Type": "String",
          "Description": "The additional configuration of this job. Use this parameter to adjust the behavior of mounted data sources. For example, if the job mounts an OSS data source, you can override the default JindoFS parameters by setting this parameter to fs.oss.download.thread.concurrency=4,fs.oss.download.queue.size=16."
        },
        "Priority": {
          "Type": "Number",
          "Description": "The priority of the job. Optional. Default value: 1. Valid values: 1 to 9, where:\n1 is the lowest priority.\n9 is the highest priority."
        },
        "Envs": {
          "Type": "String",
          "Description": "Environment variable configuration."
        },
        "JobMaxRunningTimeMinutes": {
          "Type": "Number",
          "Description": "The maximum running time of the job, in minutes."
        },
        "WorkspaceId": {
          "Type": "String",
          "Description": "The workspace ID. For how to obtain the workspace ID, see ListWorkspaces."
        },
        "CodeSource": {
          "Type": "Json",
          "Description": "The code source used by this job. Before the job's nodes start, DLC automatically downloads the code configured in the code source and mounts it to a local directory in the container."
        },
        "UserVpc": {
          "Type": "Json",
          "Description": "User VPC configuration."
        },
        "JobSpecs": {
          "Type": "Json",
          "Description": "JobSpecs describes the runtime configuration of the job, such as the image address, startup command, node resource declaration, and number of replicas.\nA DLC job consists of different types of nodes. Nodes of the same type share exactly the same configuration, which is called a JobSpec. JobSpecs describes the configuration of all node types and is an array of JobSpec."
        },
        "UserCommand": {
          "Type": "String",
          "Description": "Start commands of all nodes in the task."
        },
        "DataSources": {
          "Type": "Json",
          "Description": "The list of data sources used by the job."
        },
        "JobType": {
          "Type": "String",
          "Description": "The type of job. Values: TFJob, PyTorchJob, XGBoostJob, OneFlowJob, ElasticBatch",
          "AllowedValues": [
            "TFJob",
            "PyTorchJob",
            "XGBoostJob",
            "OneFlowJob",
            "ElasticBatch"
          ]
        },
        "ResourceId": {
          "Type": "String",
          "Description": "The resource group ID. Optional.\nIf left empty, the job is submitted to the public resource group.\nIf the current workspace is bound to a dedicated resource group, you can specify the ID of that resource group here. For how to create a dedicated resource group and query its ID, see the documentation on preparing and managing DLC resource group clusters."
        },
        "ThirdpartyLibDir": {
          "Type": "String",
          "Description": "The name of the folder that contains the requirements.txt file. Before each node runs the specified UserCommand, PAI-DLC fetches requirements.txt from the specified folder and installs it with pip install -r."
        },
        "DisplayName": {
          "Type": "String",
          "Description": "The name of the job. Naming rules:\nThe name can be up to 256 characters long.\nIt can contain digits, letters, underscores (_), periods (.), and hyphens (-)."
        },
        "SuccessPolicy": {
          "Type": "String",
          "Description": "The success policy for distributed multi-node jobs. Currently supported only for multi-node TensorFlow jobs.\nChiefWorker: the job is considered successful as soon as the Chief pod finishes successfully.\nAllWorkers: the job is considered successful only when all Workers finish successfully."
        },
        "Settings": {
          "Type": "Json",
          "Description": "Job settings."
        }
      },
      "Resources": {
        "Job": {
          "Type": "ALIYUN::PAIDLC::Job",
          "Properties": {
            "ThirdpartyLibs": {
              "Ref": "ThirdpartyLibs"
            },
            "Options": {
              "Ref": "Options"
            },
            "Priority": {
              "Ref": "Priority"
            },
            "Envs": {
              "Ref": "Envs"
            },
            "JobMaxRunningTimeMinutes": {
              "Ref": "JobMaxRunningTimeMinutes"
            },
            "WorkspaceId": {
              "Ref": "WorkspaceId"
            },
            "CodeSource": {
              "Ref": "CodeSource"
            },
            "UserVpc": {
              "Ref": "UserVpc"
            },
            "JobSpecs": {
              "Ref": "JobSpecs"
            },
            "UserCommand": {
              "Ref": "UserCommand"
            },
            "DataSources": {
              "Ref": "DataSources"
            },
            "JobType": {
              "Ref": "JobType"
            },
            "ResourceId": {
              "Ref": "ResourceId"
            },
            "ThirdpartyLibDir": {
              "Ref": "ThirdpartyLibDir"
            },
            "DisplayName": {
              "Ref": "DisplayName"
            },
            "SuccessPolicy": {
              "Ref": "SuccessPolicy"
            },
            "Settings": {
              "Ref": "Settings"
            }
          }
        }
      },
      "Outputs": {
        "JobId": {
          "Description": "The ID of the job created by this call.",
          "Value": {
            "Fn::GetAtt": [
              "Job",
              "JobId"
            ]
          }
        }
      }
    }