Query commands

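You can use the client tool to view job logs, the job list, and job details. This topic describes the query commands, including their syntax, parameters, and usage examples.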

View job logs (logs)

  • Function

    Displays the detailed logs of a training job.

  • Syntax

    ./dlc logs <yourJobId> <yourPodId> [--max_events_num <yourMaxNum>] [--start_time <yourStartTime>] [--end_time <yourEndTime>]
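  • Parameters

    | Parameter                     | Required | Description | Type |
    |-------------------------------|----------|-------------|------|
    | <yourJobId>                   | Yes      | ID of the training job to view. | STRING |
    | <yourPodId>                   | Yes      | ID of the instance (Pod) whose logs you want to view. A distributed job has multiple instances (Pods). | STRING |
    | --max_events_num <yourMaxNum> | No       | Maximum number of log lines to return. Default: 2000. | INT |
    | --start_time <yourStartTime>  | No       | Start time of the log query. Default: 7 days ago. Example: --start_time 2020-11-08T16:00:00Z | STRING |
    | --end_time <yourEndTime>      | No       | End time of the log query. Default: the current time. Example: --end_time 2020-11-08T17:00:00Z | STRING |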

  • Example

    Get ten lines of logs from the worker-0 node of a distributed training job.

    ./dlc logs dlcdys3r9jlu**** dlcdys3r********-worker-0 --max_events_num 10

    The system returns output similar to the following.

    WARN: ./requirements.txt not found, skip installing requirements.
    ================================================
    |  PAI Tensorflow powered by Aliyun PAI Team.  |
    ================================================
    Network is under initialization...
    Network successfully initialized.
    [2021-04-16 12:27:56.368026] [INFO] [7#7] [tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    [2021-04-16 12:27:56.375586] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:80] ====================CPU Architecture=====================
    [2021-04-16 12:27:56.375600] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:84] Disable AVX512.
    [2021-04-16 12:27:56.375605] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:87] CPU Vendor ID: GenuineIntel
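
The --start_time and --end_time flags take timestamps in the form shown in the parameter examples (for instance 2020-11-08T16:00:00Z, where the trailing Z denotes UTC). The following is a minimal sketch of querying only the last hour of logs. It assumes GNU date (the -d option); the helper name utc_stamp is hypothetical, and the job and Pod IDs are the masked placeholders from the example above. The script prints the command for review rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: format a time expression as a UTC timestamp
# such as 2020-11-08T16:00:00Z. Assumes GNU date (-d option).
utc_stamp() {
    date -u -d "$1" +%Y-%m-%dT%H:%M:%SZ
}

JOB_ID='dlcdys3r9jlu****'            # masked placeholder job ID from the example
POD_ID='dlcdys3r********-worker-0'   # masked placeholder Pod ID from the example

START_TIME=$(utc_stamp '1 hour ago')
END_TIME=$(utc_stamp 'now')

# Print the command for review instead of executing it.
echo ./dlc logs "$JOB_ID" "$POD_ID" --max_events_num 10 \
    --start_time "$START_TIME" --end_time "$END_TIME"
```

Replace the placeholder IDs with real values and drop the leading echo to run the query. On systems without GNU date, the timestamps need to be produced another way.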

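View the job list and status

  • Function

    Retrieves information about training jobs. If no job ID is specified, all jobs are listed; if a job ID is specified, only that job's information is shown.

  • Syntax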

    ./dlc get job [JOB_ID] [--workspace_id <yourWorkspaceId>] [--display_name <yourJobName>] [--job_type <yourJobType>] [--status <yourJobStatus>] [--start_time <yourStartTime>] [--end_time <yourEndTime>] [--page_num <yourPageNum>] [--page_size <yourPageSize>] [--max_events_num <yourMaxNum>] [--events] [--events_only]
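  • Parameters

    | Parameter                        | Required | Description | Type |
    |----------------------------------|----------|-------------|------|
    | JOB_ID                           | No       | ID of the training job to view. | STRING |
    | --workspace_id <yourWorkspaceId> | No       | Workspace ID. | STRING |
    | --display_name <yourJobName>     | No       | Job name. Fuzzy matching is supported; wildcards are not. Matching is case-insensitive. | STRING |
    | --job_type <yourJobType>         | No       | Job type. All job types can be queried. Default: empty, which means all types. | STRING |
    | --status <yourJobStatus>         | No       | Job status. Default: empty, which means all statuses. | STRING |
    | --start_time <yourStartTime>     | No       | Start of the query range, filtered by job creation time. Example: --start_time 2022-08-04T02:09:32Z | STRING |
    | --end_time <yourEndTime>         | No       | End of the query range, filtered by job creation time. Example: --end_time 2022-08-04T02:09:32Z | STRING |
    | --page_num <yourPageNum>         | No       | Page number to return in a paged query. Numbering starts at 1. Default: 1. | INT |
    | --page_size <yourPageSize>       | No       | Number of entries returned per page in a paged query. Default: 10. | INT |
    | --max_events_num <yourMaxNum>    | No       | Maximum number of system event lines to return. Default: 2000. | INT |
    | --events                         | No       | Whether to also query the job's system events. Takes effect only when a single job is queried. Default: false. | BOOL |
    | --events_only                    | No       | Whether to query only the job's system events. Takes effect only when a single job is queried. Default: false. | BOOL |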

  • Examples

    • Query all training jobs whose names fuzzily match a given string.

      ./dlc get job --display_name epl

      The system returns output similar to the following.

      +--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
      |        Name        |      JobId       | WorkspaceId |  WorkspaceName   | ResourceId |  ResourceName  | JobType | Priority | JobStatus |      UserId      |      CreateTime      |    SubmittedTime     |     RunningTime      |    SuccessedTime     | StoppedTime | FailedTime |      FinishTime      | Duration(seconds) |
      +--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
      | test_epl_test-**** | dlc02xipvt5z**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T06:41:05Z | 2022-08-01T06:45:08Z | 2022-08-01T06:48:57Z | 2022-08-01T06:53:21Z |             |            | 2022-08-01T06:53:21Z | 736               |
      | test_epl_****      | dlc1iyv3szl2**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T03:23:51Z | 2022-08-01T03:27:22Z | 2022-08-01T03:27:50Z | 2022-08-01T03:33:48Z |             |            | 2022-08-01T03:33:48Z | 597               |
      +--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
    • Query a specific training job.

      ./dlc get job dlc02xipvt5z****

      The system returns output similar to the following.

      {
         "ClusterId": "",
         "CodeSource": {
            "Branch": "main",
            "CodeSourceId": "code-29****c****c4****ae0c9ec75a5****",
            "MountPath": ""
         },
         "DataSources": [
            {
               "DataSourceId": "d-ya7gc2p2iqq240****",
               "MountPath": ""
            }
         ],
         "DisplayName": "test_epl_test-****",
         "Duration": 736,
         "ElasticSpec": {
            "AIMasterType": "",
            "EnableElasticTraining": false,
            "MaxParallelism": 0,
            "MinParallelism": 0
         },
         "EnabledDebugger": false,
         "GmtCreateTime": "2022-08-01T06:41:05Z",
         "GmtFinishTime": "2022-08-01T06:53:21Z",
         "GmtRunningTime": "2022-08-01T06:48:57Z",
         "GmtSubmittedTime": "2022-08-01T06:45:08Z",
         "GmtSuccessedTime": "2022-08-01T06:53:21Z",
         "JobId": "dlc02xipvt5z****",
         "JobSpecs": [
            {
               "AssignNodeSpec": {
                  "EnableAssignNode": false,
                  "NodeNames": ""
               },
               "EcsSpec": "ecs.gn6v-c8g1.2xlarge",
               "Image": "registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:1.15-gpu-py36-cu100-ubuntu1****",
               "PodCount": 2,
               "ResourceConfig": {
                  "CPU": "",
                  "GPU": "",
                  "GPUType": "",
                  "Memory": "",
                  "SharedMemory": ""
               },
               "Type": "Worker",
               "UseSpotInstance": false
            }
         ],
         "JobType": "TFJob",
         "Pods": [
            {
               "GmtCreateTime": "2022-08-01T06:45:08Z",
               "GmtFinishTime": "2022-08-01T06:53:20Z",
               "GmtStartTime": "2022-08-01T06:52:06Z",
               "Ip": "10.224.xx.xx",
               "PodId": "dlc02xipvt5z****-worker-0",
               "PodUid": "",
               "Status": "Succeeded",
               "Type": "worker"
            },
            {
               "GmtCreateTime": "2022-08-01T06:45:08Z",
               "GmtFinishTime": "2022-08-01T06:53:20Z",
               "GmtStartTime": "2022-08-01T06:48:57Z",
               "Ip": "10.224.xx.xx",
               "PodId": "dlc02xipvt5z****-worker-1",
               "PodUid": "",
               "Status": "Succeeded",
               "Type": "worker"
            }
         ],
         "ReasonCode": "JobSucceeded",
         "ReasonMessage": "TFJob dlc02xipvt5z**** successfully completed.",
         "RequestId": "76FC3500-xxxx-533F-B24A-AC9B2A72****",
         "ResourceId": "",
         "Priority": 1,
         "ResourceLevel": "",
         "Settings": {
            "BusinessUserId": "",
            "Caller": "",
            "EnableErrorMonitoringInAIMaster": false,
            "EnableTideResource": false,
            "ErrorMonitoringArgs": "",
            "PipelineId": ""
         },
         "Status": "Succeeded",
         "ThirdpartyLibDir": "",
         "UserCommand": "cd /root/xxxx/xxxx/\npip install .\ncd examples/resnet\nbash scripts/xxxx_dp.sh",
         "UserId": "144963168668****",
         "WorkspaceId": "23****",
         "WorkspaceName": "doc_test_****"
      }
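
The Pods array in the job detail lists a PodId for each instance, in the same form as the <yourPodId> argument used in the logs example. The following is a minimal sketch of pulling those IDs out of the detail output and composing one logs command per Pod. It assumes the "PodId": "<value>" field layout shown above; the helper names list_pod_ids and sample_job_detail are hypothetical, and the sample input is trimmed from the output above with its masked IDs. It relies only on grep and cut, so a JSON-aware tool would be more robust if one is available.

```shell
#!/bin/sh
# Hypothetical helper: read a job detail on stdin and print each PodId value.
# Relies on the field layout shown above: "PodId": "<value>"
list_pod_ids() {
    grep -o '"PodId": *"[^"]*"' | cut -d'"' -f4
}

# Sample input trimmed from the job detail above (IDs are masked placeholders).
sample_job_detail() {
    cat <<'EOF'
{
   "JobId": "dlc02xipvt5z****",
   "Pods": [
      { "PodId": "dlc02xipvt5z****-worker-0", "Status": "Succeeded" },
      { "PodId": "dlc02xipvt5z****-worker-1", "Status": "Succeeded" }
   ]
}
EOF
}

JOB_ID='dlc02xipvt5z****'

# Print one logs command per Pod for review instead of executing it.
sample_job_detail | list_pod_ids | while IFS= read -r pod_id; do
    echo ./dlc logs "$JOB_ID" "$pod_id" --max_events_num 10
done
```

To use it against a real job, feed list_pod_ids the saved output of ./dlc get job for that job instead of the sample, and drop the leading echo.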

Related documentation

You can also view job details in the console. For more information, see View training details.