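You can view job logs by using the client tool. This topic describes the related query commands, including their syntax, parameters, and usage examples.
View job logs (logs)
- Description
Views the log details of a training job.
- Syntax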
dlc logs <yourJobId> <yourPodId> [--max_events_num <yourMaxNum>] [--start_time <yourStartTime>] [--end_time <yourEndTime>]
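- Parameters
| Parameter | Required | Description | Type |
| --- | --- | --- | --- |
| <yourJobId> | Yes | The ID of the training job to query. | STRING |
| <yourPodId> | Yes | The ID of the Pod whose logs you want to view. A distributed job has multiple Pods. | STRING |
| max_events_num <yourMaxNum> | No | The maximum number of log lines to return. Default: 2000. | INT |
| start_time <yourStartTime> | No | The start time of the log query. Default: 7 days ago. Example: start_time 2020-11-08T16:00:00Z. | STRING |
| end_time <yourEndTime> | No | The end time of the log query. Default: the current time. Example: end_time 2020-11-08T17:00:00Z. | STRING |
- Example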
Get ten lines of logs from Worker 0 of a distributed training job.
dlc logs dlc-20210411xxxxxx-xxxxxxxxxxxxx dlc-20210411xxxxxx-xxxxxxxxxxxxx-worker-0 --max_events_num 10
The system returns output similar to the following.
WARN: ./requirements.txt not found, skip installing requirements.
================================================
| PAI Tensorflow powered by Aliyun PAI Team. |
================================================
Network is under initialization...
Network successfully initialized.
[2021-04-16 12:27:56.368026] [INFO] [7#7] [tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[2021-04-16 12:27:56.375586] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:80] ====================CPU Architecture=====================
[2021-04-16 12:27:56.375600] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:84] Disable AVX512.
[2021-04-16 12:27:56.375605] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:87] CPU Vendor ID: GenuineIntel
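When the log query is scripted, the optional flags and the UTC timestamp format are easy to get wrong. The following is a minimal Python sketch that only assembles the argument list for `dlc logs` from the flags documented above; it does not run the client. The helper names `to_utc_timestamp` and `build_logs_command` are hypothetical and are not part of the DLC client tool.

```python
from datetime import datetime, timezone
from typing import List, Optional


def to_utc_timestamp(moment: datetime) -> str:
    """Format an aware datetime like the documented examples, e.g. 2020-11-08T16:00:00Z."""
    if moment.tzinfo is None:
        # A naive datetime is ambiguous, so this sketch refuses it.
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_logs_command(
    job_id: str,
    pod_id: str,
    max_events_num: Optional[int] = None,
    start_time: Optional[datetime] = None,
    end_time: Optional[datetime] = None,
) -> List[str]:
    """Assemble the `dlc logs` argument list; omitted options keep the client defaults."""
    command = ["dlc", "logs", job_id, pod_id]
    if max_events_num is not None:
        command += ["--max_events_num", str(max_events_num)]
    if start_time is not None:
        command += ["--start_time", to_utc_timestamp(start_time)]
    if end_time is not None:
        command += ["--end_time", to_utc_timestamp(end_time)]
    return command


if __name__ == "__main__":
    # Placeholder IDs copied from the example above.
    command = build_logs_command(
        "dlc-20210411xxxxxx-xxxxxxxxxxxxx",
        "dlc-20210411xxxxxx-xxxxxxxxxxxxx-worker-0",
        max_events_num=10,
        start_time=datetime(2020, 11, 8, 16, 0, 0, tzinfo=timezone.utc),
        end_time=datetime(2020, 11, 8, 17, 0, 0, tzinfo=timezone.utc),
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on a machine where the client tool is installed and configured.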
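View the job list and job status
- Description
Gets information about training jobs. If no job ID is specified, all jobs are listed. If a job ID is specified, only that job is shown.
- Syntax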
dlc get job [JOB_ID] [--workspace_id <yourWorkspaceId>] [--display_name <yourJobName>] [--job_type <yourJobType>] [--status <yourJobStatus>] [--start_time <yourStartTime>] [--end_time <yourEndTime>] [--page_num <yourPageNum>] [--page_size <yourPageSize>] [--max_events_num <yourMaxNum>] [--events] [--events_only]
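- Parameters
| Parameter | Required | Description | Type |
| --- | --- | --- | --- |
| JOB_ID | No | The ID of the training job to query. | STRING |
| workspace_id <yourWorkspaceId> | No | The workspace ID. | STRING |
| display_name <yourJobName> | No | The job name. Fuzzy matching is supported, wildcards are not, and matching is case-insensitive. | STRING |
| job_type <yourJobType> | No | The job type. All job types can be queried. Empty by default, which matches all types. | STRING |
| status <yourJobStatus> | No | The job status. Empty by default, which matches all statuses. | STRING |
| start_time <yourStartTime> | No | The start of the query time range. Jobs are filtered by their creation time. Example: start_time 2022-08-04T02:09:32Z. | STRING |
| end_time <yourEndTime> | No | The end of the query time range. Jobs are filtered by their creation time. Example: end_time 2022-08-04T02:09:32Z. | STRING |
| page_num <yourPageNum> | No | For paged queries, the page number to return. Numbering starts from 1. Default: 1. | INT |
| page_size <yourPageSize> | No | For paged queries, the number of entries returned per page. Default: 10. | INT |
| max_events_num <yourMaxNum> | No | The maximum number of system event lines to return. Default: 2000. | INT |
| events | No | Whether to query the system events of the job. Takes effect only when a single job is queried. Default: false. | BOOL |
| events_only | No | Whether to query only the system events of the job. Takes effect only when a single job is queried. Default: false. | BOOL |
- Examples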
- Query all training jobs whose names fuzzy-match a given string.
dlc get job --display_name epl
The system returns output similar to the following.
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| Name               | JobId            | WorkspaceId | WorkspaceName    | ResourceId | ResourceName   | JobType | Priority | JobStatus | UserId           | CreateTime           | SubmittedTime        | RunningTime          | SuccessedTime        | StoppedTime | FailedTime | FinishTime           | Duration(seconds) |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| test_epl_test-**** | dlc02xipvt5z**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T06:41:05Z | 2022-08-01T06:45:08Z | 2022-08-01T06:48:57Z | 2022-08-01T06:53:21Z |             |            | 2022-08-01T06:53:21Z | 736               |
| test_epl_****      | dlc1iyv3szl2**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T03:23:51Z | 2022-08-01T03:27:22Z | 2022-08-01T03:27:50Z | 2022-08-01T03:33:48Z |             |            | 2022-08-01T03:33:48Z | 597               |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
- Query a specific training job.
dlc get job dlc02xipvt5z****
The system returns output similar to the following.
{
  "ClusterId": "",
  "CodeSource": {
    "Branch": "main",
    "CodeSourceId": "code-29****c****c4****ae0c9ec75a5****",
    "MountPath": ""
  },
  "DataSources": [
    {
      "DataSourceId": "d-ya7gc2p2iqq240****",
      "MountPath": ""
    }
  ],
  "DisplayName": "test_epl_test-****",
  "Duration": 736,
  "ElasticSpec": {
    "AIMasterType": "",
    "EnableElasticTraining": false,
    "MaxParallelism": 0,
    "MinParallelism": 0
  },
  "EnabledDebugger": false,
  "GmtCreateTime": "2022-08-01T06:41:05Z",
  "GmtFinishTime": "2022-08-01T06:53:21Z",
  "GmtRunningTime": "2022-08-01T06:48:57Z",
  "GmtSubmittedTime": "2022-08-01T06:45:08Z",
  "GmtSuccessedTime": "2022-08-01T06:53:21Z",
  "JobId": "dlc02xipvt5z****",
  "JobSpecs": [
    {
      "AssignNodeSpec": {
        "EnableAssignNode": false,
        "NodeNames": ""
      },
      "EcsSpec": "ecs.gn6v-c8g1.2xlarge",
      "Image": "registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:1.15-gpu-py36-cu100-ubuntu1****",
      "PodCount": 2,
      "ResourceConfig": {
        "CPU": "",
        "GPU": "",
        "GPUType": "",
        "Memory": "",
        "SharedMemory": ""
      },
      "Type": "Worker",
      "UseSpotInstance": false
    }
  ],
  "JobType": "TFJob",
  "Pods": [
    {
      "GmtCreateTime": "2022-08-01T06:45:08Z",
      "GmtFinishTime": "2022-08-01T06:53:20Z",
      "GmtStartTime": "2022-08-01T06:52:06Z",
      "Ip": "10.224.xx.xx",
      "PodId": "dlc02xipvt5z****-worker-0",
      "PodUid": "",
      "Status": "Succeeded",
      "Type": "worker"
    },
    {
      "GmtCreateTime": "2022-08-01T06:45:08Z",
      "GmtFinishTime": "2022-08-01T06:53:20Z",
      "GmtStartTime": "2022-08-01T06:48:57Z",
      "Ip": "10.224.xx.xx",
      "PodId": "dlc02xipvt5z****-worker-1",
      "PodUid": "",
      "Status": "Succeeded",
      "Type": "worker"
    }
  ],
  "ReasonCode": "JobSucceeded",
  "ReasonMessage": "TFJob dlc02xipvt5z**** successfully completed.",
  "RequestId": "76FC3500-xxxx-533F-B24A-AC9B2A72****",
  "ResourceId": "",
  "Priority": 1,
  "ResourceLevel": "",
  "Settings": {
    "BusinessUserId": "",
    "Caller": "",
    "EnableErrorMonitoringInAIMaster": false,
    "EnableTideResource": false,
    "ErrorMonitoringArgs": "",
    "PipelineId": ""
  },
  "Status": "Succeeded",
  "ThirdpartyLibDir": "",
  "UserCommand": "cd /root/xxxx/xxxx/\npip install .\ncd examples/resnet\nbash scripts/xxxx_dp.sh",
  "UserId": "144963168668****",
  "WorkspaceId": "23****",
  "WorkspaceName": "doc_test_****"
}
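The single-job output lists every Pod of the job, and each `PodId` is what `dlc logs` expects as its second argument. The following is a minimal Python sketch that reads such output and prepares one `dlc logs` argument list per Pod. It assumes the output has been captured as text and is valid JSON shaped like the sample above (fields `JobId`, `Pods`, `PodId`, `Status`). The function name `pod_log_commands` and the trimmed `SAMPLE_JOB_JSON` are illustrative only, and nothing here runs the client.

```python
import json
from typing import Dict, List

# Trimmed copy of the sample output above; real output contains more fields.
SAMPLE_JOB_JSON = """
{
  "JobId": "dlc02xipvt5z****",
  "Pods": [
    {"PodId": "dlc02xipvt5z****-worker-0", "Status": "Succeeded", "Type": "worker"},
    {"PodId": "dlc02xipvt5z****-worker-1", "Status": "Succeeded", "Type": "worker"}
  ],
  "Status": "Succeeded"
}
"""


def pod_log_commands(job_json: str, max_events_num: int = 10) -> Dict[str, List[str]]:
    """Map each Pod ID in the job description to a `dlc logs` argument list."""
    job = json.loads(job_json)
    job_id = job["JobId"]
    commands: Dict[str, List[str]] = {}
    # A job description without a "Pods" entry yields an empty mapping.
    for pod in job.get("Pods", []):
        pod_id = pod["PodId"]
        commands[pod_id] = [
            "dlc", "logs", job_id, pod_id,
            "--max_events_num", str(max_events_num),
        ]
    return commands


if __name__ == "__main__":
    for pod_id, command in pod_log_commands(SAMPLE_JOB_JSON).items():
        print(pod_id, "->", " ".join(command))
```

Building the commands as lists rather than as one string avoids shell quoting issues if they are later handed to `subprocess.run`.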
- Query all training jobs whose names fuzzy-match a given string.