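You can use the client tool to view job logs, the job list, and job details. This topic describes the query commands, including their syntax, parameters, and usage examples.
View job logs (logs)
Description
Views the logs of a training job.
Syntax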
./dlc logs <yourJobId> <yourPodId> [--max_events_num <yourMaxNum>] [--start_time <yourStartTime>] [--end_time <yourEndTime>]
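Parameters
| Parameter | Required | Description | Type |
|---|---|---|---|
| <yourJobId> | Yes | The ID of the training job whose logs you want to view. | STRING |
| <yourPodId> | Yes | The ID of the instance (pod) whose logs you want to view. A distributed job has multiple instances (pods). | STRING |
| --max_events_num <yourMaxNum> | No | The maximum number of log lines to return. Default: 2000. | INT |
| --start_time <yourStartTime> | No | The start of the log query time range. Default: 7 days ago. Example: --start_time 2020-11-08T16:00:00Z. | STRING |
| --end_time <yourEndTime> | No | The end of the log query time range. Default: the current time. Example: --end_time 2020-11-08T17:00:00Z. | STRING |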
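The following is a minimal sketch of how a script might assemble the `logs` command from the parameters above. The helper names `to_utc_timestamp` and `build_logs_command` are hypothetical and not part of the client tool. The sketch assumes that the client binary is at `./dlc` and that timestamps are expected in UTC in the same form as the documented examples.

```python
from datetime import datetime, timezone
from typing import List, Optional


def to_utc_timestamp(moment: datetime) -> str:
    """Format an aware datetime as UTC, e.g. 2020-11-08T16:00:00Z."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_logs_command(
    job_id: str,
    pod_id: str,
    max_events_num: Optional[int] = None,
    start_time: Optional[datetime] = None,
    end_time: Optional[datetime] = None,
    dlc_path: str = "./dlc",  # assumption: client binary in the current directory
) -> List[str]:
    """Return the argv list for `dlc logs`; optional flags are added only when set."""
    argv = [dlc_path, "logs", job_id, pod_id]
    if max_events_num is not None:
        argv += ["--max_events_num", str(max_events_num)]
    if start_time is not None:
        argv += ["--start_time", to_utc_timestamp(start_time)]
    if end_time is not None:
        argv += ["--end_time", to_utc_timestamp(end_time)]
    return argv
```

The returned list can be passed to `subprocess.run(argv, capture_output=True, text=True)`. Omitting an optional argument leaves the client to apply its documented default.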
Examples
Get 10 lines of logs from worker 0 of a distributed training job.
./dlc logs dlcdys3r9jlu**** dlcdys3r********-worker-0 --max_events_num 10
The system returns output similar to the following.
WARN: ./requirements.txt not found, skip installing requirements.
================================================
| PAI Tensorflow powered by Aliyun PAI Team. |
================================================
Network is under initialization...
Network successfully initialized.
[2021-04-16 12:27:56.368026] [INFO] [7#7] [tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[2021-04-16 12:27:56.375586] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:80] ====================CPU Architecture=====================
[2021-04-16 12:27:56.375600] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:84] Disable AVX512.
[2021-04-16 12:27:56.375605] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:87] CPU Vendor ID: GenuineIntel
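To filter returned logs by level or source file, the bracketed lines can be split into fields. The sketch below infers the line layout only from the sample output above, so treat the pattern as an assumption. The meaning of the third field (`7#7` in the sample) is not documented here, so it is kept as an opaque `tag`. The name `parse_log_line` is hypothetical.

```python
import re
from typing import Dict, Optional

# Layout inferred from the sample: [time] [LEVEL] [tag] [file:line] message
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<tag>[^\]]+)\] "
    r"\[(?P<source>[^\]]+)\] (?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a bracketed log line, or None for free-form lines."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None
```

Lines that do not follow the bracketed layout, such as the startup banner, return `None` and can be handled separately.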
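View the job list and job status
Description
Retrieves information about training jobs. If you do not specify a job ID, all jobs are listed. If you specify a job ID, only the information about that job is returned.
Syntax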
./dlc get job [JOB_ID] [--workspace_id <yourWorkspaceId>] [--display_name <yourJobName>] [--job_type <yourJobType>] [--status <yourJobStatus>] [--start_time <yourStartTime>] [--end_time <yourEndTime>] [--page_num <yourPageNum>] [--page_size <yourPageSize>] [--max_events_num <yourMaxNum>] [--events] [--events_only]
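Parameters
| Parameter | Required | Description | Type |
|---|---|---|---|
| JOB_ID | No | The ID of the training job to view. | STRING |
| --workspace_id <yourWorkspaceId> | No | The workspace ID. | STRING |
| --display_name <yourJobName> | No | The job name. Fuzzy matching is supported, wildcards are not, and matching is case-insensitive. | STRING |
| --job_type <yourJobType> | No | The job type. All job types can be queried. Empty by default, which matches all types. | STRING |
| --status <yourJobStatus> | No | The job status. Empty by default, which matches all statuses. | STRING |
| --start_time <yourStartTime> | No | The start of the query time range. Jobs are filtered by creation time. Example: --start_time 2022-08-04T02:09:32Z. | STRING |
| --end_time <yourEndTime> | No | The end of the query time range. Jobs are filtered by creation time. Example: --end_time 2022-08-04T02:09:32Z. | STRING |
| --page_num <yourPageNum> | No | The page number to return in a paged query. Numbering starts at 1. Default: 1. | INT |
| --page_size <yourPageSize> | No | The number of entries per page in a paged query. Default: 10. | INT |
| --max_events_num <yourMaxNum> | No | The maximum number of system event lines to return. Default: 2000. | INT |
| --events | No | Whether to also query the system events of the job. Takes effect only when a single job is queried. Default: false. | BOOL |
| --events_only | No | Whether to query only the system events of the job. Takes effect only when a single job is queried. Default: false. | BOOL |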
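As an illustration of how the filters and paging parameters combine, the sketch below assembles a `get job` command line. The function name `build_get_job_command` is hypothetical, and the `./dlc` path is an assumption. Raising an error when `events` or `events_only` is set without a job ID is a design choice of this sketch, made because the table states that those switches take effect only for a single job.

```python
from typing import List, Optional


def build_get_job_command(
    job_id: Optional[str] = None,
    *,
    workspace_id: Optional[str] = None,
    display_name: Optional[str] = None,
    job_type: Optional[str] = None,
    status: Optional[str] = None,
    start_time: Optional[str] = None,
    end_time: Optional[str] = None,
    page_num: Optional[int] = None,
    page_size: Optional[int] = None,
    max_events_num: Optional[int] = None,
    events: bool = False,
    events_only: bool = False,
    dlc_path: str = "./dlc",  # assumption: client binary in the current directory
) -> List[str]:
    """Return the argv list for `dlc get job` with only the flags that were set."""
    if (events or events_only) and not job_id:
        # Sketch-level guard: the switches only apply to a single job.
        raise ValueError("events and events_only require a job ID")
    argv = [dlc_path, "get", "job"]
    if job_id:
        argv.append(job_id)
    valued = {
        "workspace_id": workspace_id,
        "display_name": display_name,
        "job_type": job_type,
        "status": status,
        "start_time": start_time,
        "end_time": end_time,
        "page_num": page_num,
        "page_size": page_size,
        "max_events_num": max_events_num,
    }
    for flag, value in valued.items():
        if value is not None:
            argv += [f"--{flag}", str(value)]
    if events:
        argv.append("--events")
    if events_only:
        argv.append("--events_only")
    return argv
```

To walk through a long job list, a caller could increase `page_num` from 1 until a page returns fewer than `page_size` entries.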
Examples
Query all training jobs whose names fuzzy-match the specified string.
./dlc get job --display_name epl
The system returns output similar to the following.
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| Name               | JobId            | WorkspaceId | WorkspaceName    | ResourceId | ResourceName   | JobType | Priority | JobStatus | UserId           | CreateTime           | SubmittedTime        | RunningTime          | SuccessedTime        | StoppedTime | FailedTime | FinishTime           | Duration(seconds) |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| test_epl_test-**** | dlc02xipvt5z**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T06:41:05Z | 2022-08-01T06:45:08Z | 2022-08-01T06:48:57Z | 2022-08-01T06:53:21Z |             |            | 2022-08-01T06:53:21Z | 736               |
| test_epl_****      | dlc1iyv3szl2**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T03:23:51Z | 2022-08-01T03:27:22Z | 2022-08-01T03:27:50Z | 2022-08-01T03:33:48Z |             |            | 2022-08-01T03:33:48Z | 597               |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
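If a script needs to consume the list output, the bordered table can be converted into one dictionary per job. The sketch below assumes the layout shown in the sample output: `+---+` border lines, and rows that start and end with `|`. The name `parse_table` is hypothetical, and a cell value that itself contains `|` would be split incorrectly.

```python
from typing import Dict, List


def parse_table(text: str) -> List[Dict[str, str]]:
    """Turn a bordered text table into a list of {header: cell} dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip +----+ borders and blank lines
        # Drop the outer pipes, then split the remaining cells.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```

Empty cells, such as `StoppedTime` for a job that was never stopped, come back as empty strings.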
Query a specific training job.
./dlc get job dlc02xipvt5z****
The system returns output similar to the following.
{
  "ClusterId": "",
  "CodeSource": {
    "Branch": "main",
    "CodeSourceId": "code-29****c****c4****ae0c9ec75a5****",
    "MountPath": ""
  },
  "DataSources": [
    {
      "DataSourceId": "d-ya7gc2p2iqq240****",
      "MountPath": ""
    }
  ],
  "DisplayName": "test_epl_test-****",
  "Duration": 736,
  "ElasticSpec": {
    "AIMasterType": "",
    "EnableElasticTraining": false,
    "MaxParallelism": 0,
    "MinParallelism": 0
  },
  "EnabledDebugger": false,
  "GmtCreateTime": "2022-08-01T06:41:05Z",
  "GmtFinishTime": "2022-08-01T06:53:21Z",
  "GmtRunningTime": "2022-08-01T06:48:57Z",
  "GmtSubmittedTime": "2022-08-01T06:45:08Z",
  "GmtSuccessedTime": "2022-08-01T06:53:21Z",
  "JobId": "dlc02xipvt5z****",
  "JobSpecs": [
    {
      "AssignNodeSpec": {
        "EnableAssignNode": false,
        "NodeNames": ""
      },
      "EcsSpec": "ecs.gn6v-c8g1.2xlarge",
      "Image": "registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:1.15-gpu-py36-cu100-ubuntu1****",
      "PodCount": 2,
      "ResourceConfig": {
        "CPU": "",
        "GPU": "",
        "GPUType": "",
        "Memory": "",
        "SharedMemory": ""
      },
      "Type": "Worker",
      "UseSpotInstance": false
    }
  ],
  "JobType": "TFJob",
  "Pods": [
    {
      "GmtCreateTime": "2022-08-01T06:45:08Z",
      "GmtFinishTime": "2022-08-01T06:53:20Z",
      "GmtStartTime": "2022-08-01T06:52:06Z",
      "Ip": "10.224.xx.xx",
      "PodId": "dlc02xipvt5z****-worker-0",
      "PodUid": "",
      "Status": "Succeeded",
      "Type": "worker"
    },
    {
      "GmtCreateTime": "2022-08-01T06:45:08Z",
      "GmtFinishTime": "2022-08-01T06:53:20Z",
      "GmtStartTime": "2022-08-01T06:48:57Z",
      "Ip": "10.224.xx.xx",
      "PodId": "dlc02xipvt5z****-worker-1",
      "PodUid": "",
      "Status": "Succeeded",
      "Type": "worker"
    }
  ],
  "ReasonCode": "JobSucceeded",
  "ReasonMessage": "TFJob dlc02xipvt5z**** successfully completed.",
  "RequestId": "76FC3500-xxxx-533F-B24A-AC9B2A72****",
  "ResourceId": "",
  "Priority": 1,
  "ResourceLevel": "",
  "Settings": {
    "BusinessUserId": "",
    "Caller": "",
    "EnableErrorMonitoringInAIMaster": false,
    "EnableTideResource": false,
    "ErrorMonitoringArgs": "",
    "PipelineId": ""
  },
  "Status": "Succeeded",
  "ThirdpartyLibDir": "",
  "UserCommand": "cd /root/xxxx/xxxx/\npip install .\ncd examples/resnet\nbash scripts/xxxx_dp.sh",
  "UserId": "144963168668****",
  "WorkspaceId": "23****",
  "WorkspaceName": "doc_test_****"
}
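The `PodId` values in the `Pods` array of the response are the pod IDs that the `logs` command takes as its second argument. The sketch below reads a job detail response and reports each pod's ID, status, and run time in seconds. It relies only on the field names visible in the sample response above, and it assumes that `GmtStartTime` and `GmtFinishTime` are both populated, which may not hold for jobs that are still running. The name `pod_runtimes` is hypothetical.

```python
import json
from datetime import datetime
from typing import List, Tuple

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp layout seen in the sample response


def pod_runtimes(job_json: str) -> List[Tuple[str, str, float]]:
    """Return (PodId, Status, seconds between start and finish) for every pod."""
    job = json.loads(job_json)
    result = []
    for pod in job.get("Pods", []):
        started = datetime.strptime(pod["GmtStartTime"], TIME_FORMAT)
        finished = datetime.strptime(pod["GmtFinishTime"], TIME_FORMAT)
        seconds = (finished - started).total_seconds()
        result.append((pod["PodId"], pod["Status"], seconds))
    return result
```

With the timestamps from the sample response, worker 0 ran for 74 seconds and worker 1 for 263 seconds.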
Related documents
You can also view job details in the console. For more information, see View training details.