管理监控工具DCGM（v1.5）-PG1阿里云产品-阿里云-真武 PPU 云服务(ppu)-阿里云帮助中心

1. 概述

PPU DCGM是一套用于在集群和数据中心环境中管理和监控PPU的工具，基于开源的DCGM。包括如下组件：

dcgmi命令行工具
nv-hostengine命令行工具
dcgm共享库
dcgm-exporter工具

2. 获取与安装

目前PPU DCGM工具由单独的工具包发布，请点击下载链接前往PPU artifactory页面下载。

说明

下载软件包需要账号和密码，请联系您的客户经理（PDSA）获取。

安装包包含如下内容：

PPU DCGM的rpm和deb安装包
hgdcgm-exporter (文件夹)

安装PPU DCGM所需依赖条件如下：

已安装PPU驱动
已安装PPU SDK
root权限（无root权限将导致部分功能不可用）

2.1 停止和删除旧版本PPU DCGM

请注意区分是否root权限运行：

# root权限运行，终止nv-hostengine
sudo nv-hostengine -t

# 非root权限运行，终止nv-hostengine，确保nvhostengine.pid的路径与启动nv-hostengine时的一样
nv-hostengine -t --pid <your path>/nvhostengine.pid 

# 若使用deb安装包 (Ubuntu)
sudo dpkg -r datacenter-gpu-manager

# 若使用rpm安装包 （CentOS）
sudo rpm -e datacenter-gpu-manager

2.2 安装PPU DCGM软件包

# 从3.0.8版本起，dcgm统一为一个安装包，不再区分PCIe和OAM安装包
# 以3.0.9版本为例：

# 若使用deb安装包 (Ubuntu)
sudo dpkg -i --force-overwrite datacenter-gpu-manager_3.0.9_amd64.deb

# 若使用rpm安装包（CentOS）
sudo rpm --force -ivh --nodeps datacenter-gpu-manager-3.0.9-1-x86_64.rpm

2.3 安装依赖的软件包

PPU DCGM运行环境需要安装如下软件：

在Ubuntu或Debian等基于APT包管理器的Linux发行版上，可以使用以下命令安装lspci：
```
sudo apt-get update
sudo apt-get install pciutils
```
在CentOS或Red Hat等基于Yum包管理器的Linux发行版上，可以使用以下命令安装lspci：
```
sudo yum makecache
sudo yum install pciutils
```

2.4 启动PPU DCGM后台服务

请注意区分是否root权限运行：

# root权限运行，启动HGDCGM后台服务
nv-hostengine 

# 非root权限运行，请指定nvhostengine.pid的路径  
# 命令：nv-hostengine --pid <your path>/nvhostengine.pid
例如： nv-hostengine --pid /work/test/nvhostengine.pid

可执行dcgmi discovery -l验证PPU DCGM是否可用：

root@be1816e3958c:~# dcgmi discovery -l
2 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: PPU                                                            |
|        | PCI Bus ID: 00000000:10:00.0                                         |
|        | Device UUID: GPU-019ea108-c110-0828-0000-00002062b161                |
+--------+----------------------------------------------------------------------+
| 1      | Name: PPU                                                            |
|        | PCI Bus ID: 00000000:11:00.0                                         |
|        | Device UUID: GPU-019ea108-c120-040c-0000-0000c007e820                |
+--------+----------------------------------------------------------------------+
...

3. dcgmi命令行工具

dcgmi是一个提供交互性查询的命令行工具，它和nv-hostengine进行通信，收集相关数据并显示。

使用dcgmi工具前请先启动nv-hostengine服务。请注意区分是否root权限运行，非root权限运行将导致部分功能不可用。

# root权限运行，启动DCGM后台服务
nv-hostengine 

# 非root权限运行，请指定nvhostengine.pid的路径
nv-hostengine --pid <your path>/nvhostengine.pid

dcgmi支持多个子命令，可通过执行dcgmi -h查看帮助信息：

root@be1816e3958c:~# dcgmi -h
Usage: dcgmi
   dcgmi subsystem
   dcgmi -v

Flags:
  -v    vv          Get DCGMI version information
        subsystem   The desired subsystem to be accessed.
 Subsystems Available:
        topo        GPU Topology [dcgmi topo -h for more info]
        stats       Process Statistics [dcgmi stats -h for more info]
        diag        System Validation/Diagnostic [dcgmi diag –h for more info]
        policy      Policy Management [dcgmi policy –h for more info]
        health      Health Monitoring [dcgmi health –h for more info]
        config      Configuration Management [dcgmi config –h for more info]
        group       GPU Group Management [dcgmi group –h for more info]
        fieldgroup  Field Group Management [dcgmi fieldgroup –h for more info]
        discovery   Discover GPUs on the system [dcgmi discovery –h for more info]
        nvlink      Displays NvLink link statuses and error counts [dcgmi nvlink –h for more info]
        dmon        Stats Monitoring of GPUs [dcgmi dmon –h for more info]
        modules     Control and list DCGM modules
        profile     Control and list DCGM profiling metrics
  --    ignore_rest Ignores the rest of the labeled arguments following this
                     flag.
      --version     Displays version information and exits.
  -h  --help        Displays usage information and exits.
...

3.1 查看设备列表 (discovery)

可通过discovery子命令查看PPU设备列表，并查看设备的状态信息。

通过执行dcgmi discovery -h可查看相关帮助信息:

root@be1816e3958c:~# dcgmi discovery -h

 discovery -- Used to discover and identify GPUs and their attributes.

Usage: dcgmi discovery
   dcgmi discovery --host <IP/FQDN> -l
   dcgmi discovery --host <IP/FQDN> -i <flags> -g <groupId> -v
   dcgmi discovery -c

Flags:
  -g  --group      groupId    The group ID to query.
      --host       IP/FQDN    Connects to specified IP or fully-qualified domain
                               name. To connect to a host engine that was
                               started with -d (unix socket), prefix the unix
                               socket filename with 'unix://'. [default =
                               localhost]
  -l  --list                  List all GPUs discovered on the host.
  -i  --info       flags      Specify which information to return. [default =
                               atp]
                               a - device info
                               p - power limits
                               t - thermal limits
                               c - clocks
  -c  --compute-hierarchy           List all of the gpu instances and compute
                               instances
...

3.1.1 查看设备列表

通过执行dcgmi discovery -l可查看PPU设备列表，将会显示设备的名称、PCI Bus ID和UUID等信息：

root@be1816e3958c:~# dcgmi discovery -l
2 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: PPU                                                            |
|        | PCI Bus ID: 00000000:10:00.0                                         |
|        | Device UUID: GPU-019ea108-c110-0828-0000-00002062b161                |
+--------+----------------------------------------------------------------------+
| 1      | Name: PPU                                                            |
|        | PCI Bus ID: 00000000:11:00.0                                         |
|        | Device UUID: GPU-019ea108-c120-040c-0000-0000c007e820                |
+--------+----------------------------------------------------------------------+
...

3.1.2 查看状态信息

通过-i选项可查看设备的状态信息，例如PCI Bus ID等，显示的状态信息可通过-i参数指定：

`-i`取值	对应状态信息
a	设备信息
p	功率限制
t	温度限制
c	时钟信息

例如执行dcgmi discovery -i p查看设备功率限制，默认不通过-g选项指定GPU group时，将汇总显示所有PPU的状态信息：

root@be1816e3958c:~# dcgmi discovery -i p
+--------------------------+-------------------------------------------------+
| Group of 2 GPUs          | Device Information                              |
+==========================+=================================================+
| Current Power Limit (W)  | 300                                             |
| Default Power Limit (W)  | 300                                             |
| Max Power Limit (W)      | 300                                             |
| Min Power Limit (W)      | 200                                             |
| Enforced Power Limit (W) | 300                                             |
+--------------------------+-------------------------------------------------+

通过添加-v选项可以GPU为粒度显示状态信息，例如执行dcgmi discovery -i a -g 0 -v，逐个显示GPU group 0的设备基础信息：

root@be1816e3958c:~# dcgmi discovery -i a -g 0 -v
Device info:
+--------------------------+-------------------------------------------------+
| GPU ID: 0                | Device Information                              |
+==========================+=================================================+
| Device Name              | PPU                                             |
| PCI Bus ID               | 00000000:10:00.0                                |
| UUID                     | GPU-019ea108-c110-0828-0000-00002062b161        |
| Serial Number            | TH7510H07                                       |
| InfoROM Version          | Not Supported                                   |
| VBIOS                    | 1.4.44                                          |
+--------------------------+-------------------------------------------------+
+--------------------------+-------------------------------------------------+
| GPU ID: 1                | Device Information                              |
+==========================+=================================================+
| Device Name              | PPU                                             |
| PCI Bus ID               | 00000000:11:00.0                                |
| UUID                     | GPU-019ea108-c120-040c-0000-0000c007e820        |
| Serial Number            | TH7510H07                                       |
| InfoROM Version          | Not Supported                                   |
| VBIOS                    | 1.4.44                                          |
+--------------------------+-------------------------------------------------+

3.1.3 显示MIG实例信息

通过-c选项可列出存在的MIG GPU instace和Compute instance实例信息，例如dcgmi discovery -c查看所有PPU的MIG实例信息，将会显示PPU设备中MIG实例的层级关系：

root@be1816e3958c:/work/setup# dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                     |
+===================+====================================================================+
| GPU 1             | GPU GPU-019ea108-c120-040c-0000-0000c007e820 (EntityID: 1)         |
| -> I 1/0          | GPU Instance (EntityID: 68)                                        |
|    -> CI 1/0/0    | Compute Instance (EntityID: 68)                                    |
| -> I 1/1          | GPU Instance (EntityID: 69)                                        |
|    -> CI 1/1/0    | Compute Instance (EntityID: 69)                                    |
+-------------------+--------------------------------------------------------------------+

3.2 监控设备参数 (dmon)

可通过dmon子命令查看和监控相关设备参数，可查询指定GPU group内设备的指定field group的参数数据，通过执行dcgmi dmon -h可查看相关帮助信息:

root@be1816e3958c:~# dcgmi dmon -h

 dmon -- Used to monitor GPUs and their stats.

Usage: dcgmi dmon
   dcgmi dmon -i <gpuId> -g <groupId> -f <fieldGroupId> -e <fieldId> -d
        <delay> -c <count> -l

Flags:
      --host       IP/FQDN    Connects to specified IP or fully-qualified domain
                               name. To connect to a host engine that was
                               started with -d (unix socket), prefix the unix
                               socket filename with 'unix://'. [default =
                               localhost]
  -f  --field-group-id fieldGroupId  The field group to query on the specified
                               host.
  -e  --field-id   fieldId     Field identifier to view/inject.
  -l  --list                  List to look up the long names, short names and
                               field ids.
  -h  --help                  Displays usage information and exits.
  -i  --gpu-id     gpuId       The comma separated list of GPU/GPU-I/GPU-CI IDs
                               to run the dmon on. Default is -1 which runs for
                               all supported GPU. Run dcgmi discovery -c to
                               check list of available GPU entities
  -g  --group-id   groupId     The group to query on the specified host.
  -d  --delay      delay       In milliseconds. Integer representing how often
                               to query results from DCGM and print them for all
                               of the entities. [default = 1000 msec,  Minimum
                               value = 1 msec.]
  -c  --count      count       Integer representing How many times to loop
                               before exiting. [default- runs forever.]
...

3.2.1 查看支持的参数列表

通过-l选项可查看支持的参数列表，包含对应名称和field id信息，例如执行dcgmi dmon -l，输出实例如下，包含参数名称、缩写和对应field ID：

root@be1816e3958c:~# dcgmi dmon -l
________________________________________________________________________________________________________________________
Long Name                                                                                 Short Name          Field ID
________________________________________________________________________________________________________________________
driver_version                                                                             DRVER               1
nvml_version                                                                               NVVER               2
process_name                                                                               PRNAM               3
device_count                                                                               DVCNT               4
cuda_driver_version                                                                        CDVER               5
name                                                                                       DVNAM               50
brand                                                                                      DVBRN               51
nvml_index                                                                                 NVIDX               52
serial_number                                                                              SRNUM               53
uuid                                                                                       UUID#               54
minor_number                                                                               MNNUM               55
oem_inforom_version                                                                        OEMVR               56
pci_busid                                                                                  PCBID               57
...

3.2.2 查询设备参数

通过-e选项可指定field id列表并查询对应取值，例如执行dcgmi dmon -e 140,150查询所有PPU的内存和核心温度：

root@be1816e3958c:~# dcgmi dmon -e 140,150
#Entity   MMTMP             TMPTR
ID        C                  C
GPU 1     29                29
GPU 0     30                28
GPU 1     30                29
GPU 0     30                27
...

通过-f选项可指定field group，批量查询本field group内所有field id的对应取值，例如指定dcgmi dmon -f 5查询field group为5的所有参数。使用dcgmi子命令fieldgroup可管理查询项编组：

# 创建field group
root@be1816e3958c:~# dcgmi fieldgroup -c test -f 140,150
Successfully created field group "test" with a field group ID of 5

# 查询field group
root@be1816e3958c:~# dcgmi dmon -f 5
#Entity   MMTMP             TMPTR
ID        C                  C
GPU 1     30                29
GPU 0     30                28
GPU 1     30                29
GPU 0     30                28
...

注意：field id 支持情况请参考Field Id支持状态。

3.2.3 其他控制参数

默认通过dmon监控设备参数将周期的打印，可通过Ctrl + c打断查询过程。查询周期可通过-d选项指定，查询次数可通过-c选项指定，例如dcgmi dmon -e 140,150 -c 1将只查询一次数据。

root@be1816e3958c:~# dcgmi dmon -e 140,150 -c 1
#Entity   MMTMP             TMPTR
ID        C                  C
GPU 1     29                29
GPU 0     30                28

3.3 修改设备配置 (config)

可通过config子命令查看和修改设备配置，支持设置频率，ECC模式等参数。可执行dcgmi config -h查看帮助信息：

root@be1816e3958c:~# dcgmi config -h

 config -- Used to configure settings for groups of GPUs.

Usage: dcgmi config
   dcgmi config --host <IP/FQDN> -g <groupId> --enforce
   dcgmi config --host <IP/FQDN> -g <groupId> --get -v -j
   dcgmi config --host <IP/FQDN> -g <groupId> --set -e <0/1> -s <0/1> -a
        <mem,proc> -P <limit> -c <mode>

Flags:
  -g  --group      groupId    The GPU group to query on the specified host.
      --host       IP/FQDN    Connects to specified IP or fully-qualified domain
                               name. To connect to a host engine that was
                               started with -d (unix socket), prefix the unix
                               socket filename with 'unix://'. [default = localhost]
      --set                   Set configuration.
      --get                   Get configuration. Displays the Target and the
                               Current Configuration.
                               ------
                               1.Sync Boost - Current and Target Sync Boost State
                               2.SM Application Clock - Current and Target SM application clock values
                               3.Memory Application Clock - Current and Target Memory application clock values
                               4.ECC Mode - Current and Target ECC Mode
                               5 Power Limit - Current and Target power limits
                               6.Compute Mode - Current and Target compute mode
      --enforce               Enforce configuration.
  -h  --help                  Displays usage information and exits.
  -v  --verbose               Display policy information per GPU.
  -j  --json                  Print the output in a json format
  -e  --eccmode    0/1        Configure Ecc mode. (1 to Enable, 0 to Disable)
  -s  --syncboost  0/1        Configure Syncboost. (1 to Enable, 0 to Disable)
  -a  --appclocks  mem,proc   Configure Application Clocks. Must use memory,proc clocks (csv) format(MHz).
  -P  --powerlimit limit      Configure Power Limit (Watts).
  -c  --compmode   mode       Configure Compute Mode. Can be any of the
                               following:
                               0 - Unrestricted
                               1 - Prohibited
                               2 - Exclusive Process
...

3.3.1 查询设备配置

通过--get选项可查询当前的和希望修改的配置的值，例如执行dcgmi config --get查看所有PPU的配置汇总，其中：

Target列表示已设置（set）但未生效（enforce）的配置值
Current列表示已生效（enforce）的配置值

root@be1816e3958c:~# dcgmi config --get
+------------------------------+------------------------------+------------------------------+
| DCGM_ALL_SUPPORTED_GPUS                                                                    |
| Group of 2 GPUs                                                                            |
+==============================+==============================+==============================+
| Field                        | Target                       | Current                      |
+------------------------------+------------------------------+------------------------------+
| Compute Mode                 | Not Specified                | Unrestricted                 |
| ECC Mode                     | Enabled                      | Disabled                     |
| Sync Boost                   | Not Specified                | Not Supported                |
| Memory Application Clock     | Not Specified                | 1800                         |
| SM Application Clock         | Not Specified                | 1200                         |
| Power Limit                  | Not Specified                | 300                          |
+------------------------------+------------------------------+------------------------------+
...

通过增加-v选项可查看每个PPU的配置状态：

root@be1816e3958c:~# dcgmi config --get -v
+------------------------------+------------------------------+------------------------------+
| GPU ID: 0                                                                                  |
| PPU                                                                                        |
+==============================+==============================+==============================+
| Field                        | Target                       | Current                      |
+------------------------------+------------------------------+------------------------------+
| Compute Mode                 | Not Specified                | Unrestricted                 |
| ECC Mode                     | Enabled                      | Disabled                     |
| Sync Boost                   | Not Specified                | Not Supported                |
| Memory Application Clock     | Not Specified                | 1800                         |
| SM Application Clock         | Not Specified                | 1200                         |
| Power Limit                  | Not Specified                | 300                          |
+------------------------------+------------------------------+------------------------------+
+------------------------------+------------------------------+------------------------------+
| GPU ID: 1                                                                                  |
| PPU                                                                                        |
+==============================+==============================+==============================+
| Field                        | Target                       | Current                      |
+------------------------------+------------------------------+------------------------------+
| Compute Mode                 | Not Specified                | Unrestricted                 |
| ECC Mode                     | Enabled                      | Disabled                     |
| Sync Boost                   | Not Specified                | Not Supported                |
| Memory Application Clock     | Not Specified                | 1800                         |
| SM Application Clock         | Not Specified                | 1200                         |
| Power Limit                  | Not Specified                | 300                          |
+------------------------------+------------------------------+------------------------------+

3.3.2 修改设备配置

可通过--set选项配合其他修改选项（例如-e修改ECC模式），修改相关配置。例如修改ECC模式，执行dcgmi config --set -e 1设置ECC模式为使能：

root@be1816e3958c:~# dcgmi config --set -e 1
Configuration successfully set.

DCGM支持修改的设备配置如下：

配置选项	说明
-e --eccmode	修改ECC模式
-a --appclocks	修改application clock频率
-P --powerlimit	修改功率限制
-c --compmode	修改compute mode
-e --syncboost	不支持，此选项已废弃

修改后的配置状态将会被DCGM缓存，可通过--enforce选项，指定DCGM将各个设备缓存的配置设置到对应设备，例如设备复位后，通过--enforce选项生效之前的配置，执行dcgmi config --enforce：

root@be1816e3958c:~# dcgmi config --enforce
Configuration successfully enforced.

3.4 诊断设备状态 (diag)

可通过diag子命令运行诊断程序，以检查和诊断PPU设备状态，DCGM支持多种级别的诊断处理。可执行dcgmi diag -h查看帮助信息：

root@be1816e3958c:~# dcgmi diag -h

 diag -- Used to run diagnostics on the system.

Usage: dcgmi diag
   dcgmi diag --host <IP/FQDN> -g <groupId> -r <diag> -p
        <test_name.variable_name=variable_value> -c
        </full/path/to/config/file> -f <fakeGpuList> -i <gpuList> -v
        --statsonfail --debugLogFile <debug file> --statspath <plugin
        statistics path> -j --throttle-mask <> --fail-early
        --check-interval <failure check interval> --iterations <iterations>

Flags:
...
  -r  --run        diag       Run a diagnostic. (Note: higher numbered tests
                               include all beneath.)
                               1 - Quick (System Validation ~ seconds)
                               2 - Medium (Extended System Validation ~ 2 minutes)
                               3 - Long (System HW Diagnostics ~ 15 minutes)
                               4 - Extended (Longer-running System HW Diagnostics)
                               Specific tests to run may be specified by name,
                               and multiple tests may be specified as a comma
                               separated list. For example, the command:
                               dcgmi diag -r "sm stress,diagnostic"
                               would run the SM Stress and Diagnostic tests together.
  -p  --parameters test_name.variable_name=variable_value Test parameters to set for this run.
  -c  --configfile /full/path/to/config/file Path to the configuration file.
  -i  --gpuList    gpuList    A comma-separated list of the gpus on which the
                               diagnostic should run. Cannot be used with -g.
  -v  --verbose               Show information and warnings for each test.
...
      --throttle-mask           Specify which throttling reasons should be
                               ignored. You can provide a comma separated list
                               of reasons. For example, specifying 'HW_SLOWDOWN
                               ,SW_THERMAL' would ignore the HW_SLOWDOWN and
                               SW_THERMAL throttling reasons. Alternatively, you
                               can specify the integer value of the ignore
                               bitmask. For the bitmask, multiple reasons may be
                               specified by the sum of their bit masks.
...
      --fail-early            Enable early failure checks for the Targeted Power
                               , Targeted Stress, SM Stress, and Diagnostic
                               tests. When enabled, these tests check for a
                               failure once every 5 seconds (can be modified by
                               the --check-interval parameter) while the test is
                               running instead of a single check performed after
                               the test is complete. Disabled by default.
      --check-intervalfailure check interval Specify the interval (in seconds)
                               at which the early failure checks should occur
                               for the Targeted Power, Targeted Stress, SM
                               Stress, and Diagnostic tests when early failure
                               checks are enabled. Default is once every 5
                               seconds. Interval must be between 1 and 300
      --iterations iterations Specify a number of iterations of the diagnostic
                               to run consecutively. (Must be greater than 0.)
...

说明

diag子命令依赖PPU SDK相关组件，在使用diag子命令前，请确保PPU SDK已正确安装，GCC等工具链版本满足PPU SDK运行条件。

3.4.1 运行诊断程序

通过-r选项触发诊断测试程序运行，可通过-r选项指定运行的诊断测试程序集合，支持的程序集合如下：

诊断测试程序名称	测试内容	集合1：秒级	集合2：< 2分钟	集合3：< 30分钟	集合4：1-2小时
software	基础环境配置特性支持情况	是	是	是	是
pcie	PCIE测试 ICN链路测试		是	是	是
memory	内存分配测试内存读写测试		是	是	是
memory_bandwidth	内存吞吐能力测试		是	是	是
diagnostic	计算能力测试			是	是
sm_stress	计算能力压力测试			是	是
targeted_stress	控制计算能力测试			是	是
targeted_power	控制功耗测试			是	是
memtest	内存压力测试				是

说明

memory、memory_bandwidth和memtest诊断测试需要开启ECC功能，可通过如下命令开启所有设备的ECC功能，并复位PPU设备生效此配置：

dcgmi config --set -e 1
dcgmi config --enforce
复位PPU设备

例如执行dcgmi diag -r 1运行诊断测试程序集合1：

root@be1816e3958c:/work/deploy# dcgmi diag -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Skip                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | N/A                                            |
| Inforom                   | Skip                                           |
+---------------------------+------------------------------------------------+

可通过-r选项传入诊断测试程序名称程序，来运行指定的测试程序。可通过-p选项传入测试程序参数，格式为test_name.variable_name=variable_value，例如控制sm_stress运行时长，执行dcgmi diag -r "sm_stress" -p sm_stress.test_duration=5：

root@be1816e3958c:/work/deploy# dcgmi diag -r "sm_stress" -p sm_stress.test_duration=5
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Skip                                           |
+-----  Integration  -------+------------------------------------------------+
+-----  Hardware  ----------+------------------------------------------------+
+-----  Stress  ------------+------------------------------------------------+
| SM Stress                 | Pass - All                                     |
+---------------------------+------------------------------------------------+

说明

diag子命令诊断测试不适用于退换货流程（RMA）判断标准，请通过Field Diag、Bug Report等工具检查产品质量问题。

3.4.2 其他控制选项

可通过选项-g控制运行诊断的GPU group，或者通过-i选项指定运行的GPU索引。
可通过--throttle-mask屏蔽一些可能导致诊断失败的原因，比如禁止因为温度保护导致测试失败。
可通过--fail-early和--check-intervalfailure配置周期检查失败条件，而不是诊断结束时检查是否失败，这样可在测试过程中遇到失败提前退出诊断。

说明

diag子命令并不支持不同型号的PPU设备混合诊断的场景，若系统内包含多种型号的PPU设备，请通过-i或者-g选项选择相同型号的PPU设备进行诊断测试。

3.5 查看拓扑信息 (topo)

可通过topo子命令查看GPU之间的拓扑结构信息，可执行dcgmi topo -h查看帮助信息：

root@be1816e3958c:~# dcgmi topo -h

 topo -- Used to find the topology of GPUs on the system.

Usage: dcgmi topo
   dcgmi topo --host <IP/FQDN> -g <groupId> -j
   dcgmi topo --host <IP/FQDN> --gpuid <gpuId> -j

Flags:
  -g  --group      groupId    The group ID to query.
      --host       IP/FQDN    Connects to specified IP or fully-qualified domain
                               name. To connect to a host engine that was
                               started with -d (unix socket), prefix the unix
                               socket filename with 'unix://'. [default = localhost]
  -h  --help                  Displays usage information and exits.
      --gpuid      gpuId      The GPU ID to query.
  -j  --json                  Print the output in a json format
...

3.5.1 查看设备拓扑信息

通过--gpuid可指定查看某个PPU和其他设备的拓扑链接，通过-g选项可查看整个GPU group和外部拓扑链接的汇总。例如查看PPU 0的拓扑链接，执行dcgmi topo --gpuid 0，将会显示PPU 0和其他PPU设备连接的情况：

root@be1816e3958c:~# dcgmi topo --gpuid 0
+-------------------+------------------------------------------------------------------------------+
| Topology Information                                                                             |
| GPU ID: 0                                                                                        |
+===================+==============================================================================+
| CPU Core Affinity | 0 - 47, 96 - 143                                                             |
| To GPU 1          | Connected via a single PCIe switch                                           |
|                   | Connected via one NVLINK (Link: 5)                                           |
| To GPU 2          | Connected via a CPU-level link                                               |
|                   | Connected via one NVLINK (Link: 3)                                           |
| To GPU 3          | Connected via a CPU-level link                                               |
|                   | Connected via one NVLINK (Link: 2)                                           |
| To GPU 4          | Connected via a CPU-level link                                               |
| To GPU 5          | Connected via a CPU-level link                                               |
| To GPU 6          | Connected via a CPU-level link                                               |
| To GPU 7          | Connected via a CPU-level link                                               |
|                   | Connected via one NVLINK (Link: 0)                                           |
+-------------------+------------------------------------------------------------------------------+

3.6 查看ICN链路状态 (nvlink)

可通过nvlink子命令查看ICN链路状态，可执行dcgmi nvlink -h查看帮助信息：

root@be1816e3958c:/work/deploy# dcgmi nvlink -h

 nvlink -- Used to get NvLink link status or error counts for GPUs and
 NvSwitches in the system

 NVLINK Error description
 =========================
 CRC FLIT Error => Data link receive flow control digit CRC error.
 CRC Data Error => Data link receive data CRC error.
 Replay Error   => Data link transmit replay error.
 Recovery Error => Data link transmit recovery error.

Usage: dcgmi nvlink
   dcgmi nvlink --host <IP/FQDN> -g <gpuId> -e -j
   dcgmi nvlink --host <IP/FQDN> -s

Flags:
      --host       IP/FQDN    Connects to specified IP or fully-qualified domain
                               name. To connect to a host engine that was
                               started with -d (unix socket), prefix the unix
                               socket filename with 'unix://'. [default = localhost]
  -e  --errors                Print NvLink errors for a given gpuId (-g).
  -s  --link-status           Print NvLink link status for all GPUs and
                               NvSwitches in the system.
  -h  --help                  Displays usage information and exits.
  -g  --gpuid      gpuId      The GPU ID to query. Required for -e
  -j  --json                  Print the output in a json format
...

说明

nvlink子命令暂不支持icn error (-e --errors) 相关查询功能。

3.6.1 查看链路状态

通过-s选项可查询ICN链路状态，执行dcgmi nvlink -s实例如下，其中U表示链路正常，D表示链路断开：

root@b5ab3167ed51:~# dcgmi nvlink -s
+----------------------+
|  NvLink Link Status  |
+----------------------+
GPUs:
    gpuId 0:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
    gpuId 1:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
    gpuId 2:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
    gpuId 3:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
    gpuId 4:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
    gpuId 5:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
    gpuId 6:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
    gpuId 7:
        U D U U U U D _ _ _ _ _ _ _ _ _ _ _
NvSwitches:
    No NvSwitches found.

Key: Up=U, Down=D, Disabled=X, Not Supported=_

3.7 管理设备策略 (policy)

可通过policy子命令设置和查看PPU管理策略，如出现ECC出错 / 过热等问题后，自动重启设备。

可执行dcgmi policy -h查看帮助信息：

root@be1816e3958c:~# dcgmi policy -h

 policy -- Used to control policies for groups of GPUs. Policies control actions
 which are triggered by specific events.

Usage: dcgmi policy
   dcgmi policy --host <IP/FQDN> -g <groupId> --get -j
   dcgmi policy --host <IP/FQDN> -g <groupId> --reg
   dcgmi policy --host <IP/FQDN> -g <groupId> --set <actn,val> -M <max> -T
        <max> -P <max> -e -p -n -x

Flags:
  -g  --group      groupId    The GPU group to query on the specified host.
...
      --get                   Get the current violation policy.
      --reg                   Register this process for policy updates.  This
                               process will sit in an infinite loop waiting for
                               updates from the policy manager.
      --set        actn,val   Set the current violation policy. Use csv action ,validation (ie. 1,2)
                               -----
                               Action to take when any of the violations
                               specified occur.
                               0 - None
                               1 - GPU Reset
                               -----
                               Validation to take after the violation action has been performed.
                               0 - None
                               1 - System Validation (short)
                               2 - System Validation (medium)
                               3 - System Validation (long)
      --clear                 Clear the current violation policy.
  -h  --help                  Displays usage information and exits.
  -v  --verbose               Display policy information per GPU.
  -M  --maxpages   max        Specify the maximum number of retired pages that
                               will trigger a violation.
  -T  --maxtemp    max        Specify the maximum temperature a group's GPUs can
                               reach before triggering a violation.
  -P  --maxpower   max        Specify the maximum power a group's GPUs can reach
                               before triggering a violation.
  -e  --eccerrors             Add ECC double bit errors to the policy
                               conditions.
  -p  --pcierrors             Add PCIe replay errors to the policy conditions.
  -n  --nvlinkerrors           Add NVLink errors to the policy conditions.
  -x  --xiderrors             Add XID errors to the policy conditions.
...

3.7.1 查看当前策略

通过--get选项可查看当前策略的配置情况，执行dcgmi policy --get示例如下：

root@be1816e3958c:~# dcgmi policy --get
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information                                                           |
| DCGM_ALL_SUPPORTED_GPUS                                                      |
+=============================+================================================+
| Violation conditions        | Double-bit ECC errors                          |
|                             | PCI errors and replays                         |
|                             | Max temperature threshold - 90                 |
| Isolation mode              | Manual                                         |
| Action on violation         | None                                           |
| Validation after action     | System Validation (Long)                       |
| Validation failure action   | None                                           |
+-----------------------------+------------------------------------------------+

Violation conditions：触发策略的条件，如ECC错误，PCIE错误，温度超限等。
Action on violation：策略被触发的行为，可触发重启PPU设备等。
Validation after action：策略被触发后的验证行为。

3.7.2 设置和生效策略

设置并生效策略需要两步操作：

通过--set选项配合-M-T等选项设置策略，如温度门限等
通过--reg让本dcgmi进程开始轮询检查策略是否被违反

通过--set选项可指定违反策略时的处理（如重启PPU），以及违反策略的处理执行后，到下次检查的时间。--set选项需要配合其他检查条件选项使用，支持的检查条件如下：

检查条件选项	说明
-M --maxpages	retired page个数达到上限
-T --maxtemp	设备温度超过设置上限
-P --maxpower	设备功率超过设置上限
-e --eccerrors	出现2bit ECC错误
-p --pcierrors	出现ECC重播错误（暂不支持）
-x --xiderrors	驱动发生错误

例如执行dcgmi policy --set 0,3 -T 90 -p -e，设置温度策略 / PCIE策略 / ECC策略，然后执行dcgmi policy --reg生效策略并开始检查，当策略违反时，本进程会执行对应行为（如复位对应设备）：

root@be1816e3958c:~# dcgmi policy --set 0,3 -T 90 -p -e
Policy successfully set.
root@be1816e3958c:~# dcgmi policy --reg
Listening for violations.
...

3.8 检查设备健康状态 (health)

可通过health子命令监控和查看设备健康状态，如检查是否出现ECC错误，温度和功耗是否曾超过设定门限等。

可执行dcgmi health -h查看帮助信息：

root@be1816e3958c:~# dcgmi health -h

 health --  Used to manage the health watches of a group. The health of the GPUs
 in a group can then be monitored during a process.

Usage: dcgmi health
   dcgmi health --host <IP/FQDN> -g <groupId> -c -j
   dcgmi health --host <IP/FQDN> -g <groupId> -f -j
   dcgmi health --host <IP/FQDN> -g <groupId> -s <flags> -j -m <seconds> -u <seconds>

Flags:
  -g  --group      groupId    The GPU group to query on the specified host.
...
  -f  --fetch                 Fetch the current watch status.
  -s  --set        flags      Set the watches to be monitored. [default = pm]
                               a - all watches
                               p - PCIe watches (*)
                               m - memory watches (*)
                               i - infoROM watches
                               t - thermal and power watches (*)
                               n - NVLink watches (*)
                               (*) watch requires 60 sec before first query
      --clear                 Disable all watches being monitored.
  -c  --check                 Check to see if any errors or warnings have
                               occurred in the currently monitored watches.
...
  -m  --max-keep-age seconds    How long DCGM should cache the samples in seconds.
  -u  --update-interval seconds    How often DCGM should retrieve health from the driver in seconds.
...

3.8.1 查看使能的监控项

通过-f选项可查看当前生效的监控项，例如执行dcgmi health -f，可看到当前生效了内存 / PCIe等健康检查的监控项。

root@be1816e3958c:~# dcgmi health -f
Health monitor systems report
+-----------------+--------------------------------------------------------------------+
| PCIe            | On                                                                 |
| NVLINK          | On                                                                 |
| Memory          | On                                                                 |
| SM              | Off                                                                |
| InfoROM         | On                                                                 |
| Thermal         | On                                                                 |
| Power           | On                                                                 |
| Driver          | Off                                                                |
| NvSwitch NF     | Off                                                                |
| NvSwitch F      | Off                                                                |
+-----------------+--------------------------------------------------------------------+

3.8.2 设置监控项

通过-s选项可设置生效的监控项，通过--clear选项可清除生效的监控项。

对于订阅的监控项，可通过-m选项配置缓存的监控数据时长，以及通过-u选项配置多久查询一次设备状态，例如执行dcgmi health -s a -m 30 -u 1，订阅所有监控项，缓存30秒数据，每秒查询一次设备状态：

root@be1816e3958c:~# dcgmi health -s a -m 30 -u 1
Health monitor systems set successfully.

3.8.3 查看监控错误

通过-c选项可查看订阅的监控项中，哪些出现了错误，如执行dcgmi health -c，示例如下：

root@be1816e3958c:~# dcgmi health -c
+---------------------------+----------------------------------------------------------+
| Health Monitor Report                                                                |
+===========================+==========================================================+
| Overall Health            | Failure                                                  |
| GPU                       |                                                          |
| -> 0                      | Failure                                                  |
|    -> Errors              |                                                          |
|       -> NVLINK system    | Failure                                                  |
|                           | GPU 0's NvLink link 0 is currently down Run a field      |
|                           | diagnostic on the GPU.                                   |
...

3.9 管理设备编组 (group)

可通过group子命令管理设备 / 模块编组，例如可将多个PPU设备编入一个编组（group），并在其他dcgmi子命令中通过-g选项指定此编组。PPU DCGM支持多种编组级别，如PPU级别编组，ICN链路级别编组，Compute instance级别编组等。

PPU DCGM定义了默认的PPU设备编组，group id为0，包含所有支持的PPU设备。dcgmi其他子命令若不显式通过-g指定编组，默认操作group 0（即全部PPU设备）。

可执行dcgmi group -h查看帮助信息：

root@be1816e3958c:~# dcgmi group -h

 group -- Used to create and maintain groups of GPUs. Groups of GPUs can then be
 uniformly controlled through other DCGMI subsystems.

Usage: dcgmi group
   dcgmi group --host <IP/FQDN> -l -j
   dcgmi group --host <IP/FQDN> -c <groupName> --default --defaultnvswitches
   dcgmi group --host <IP/FQDN> -c <groupName> -a <entityId>
   dcgmi group --host <IP/FQDN> -d <groupId>
   dcgmi group --host <IP/FQDN> -g <groupId> -i -j
   dcgmi group --host <IP/FQDN> -g <groupId> -a <entityId>
   dcgmi group --host <IP/FQDN> -g <groupId> -r <entityId>

Flags:
...
  -l  --list                  List the groups that currently exist for a host.
  -d  --delete     groupId    Delete a group on the remote host.
  -c  --create     groupName  Create a group on the remote host.
  -h  --help                  Displays usage information and exits.
  -i  --info                  Display the information for the specified group ID.
  -r  --remove     entityId   Remove device(s) from group. (csv gpuIds, or
                               entityIds like gpu:0,nvswitch:994)
  -a  --add        entityId   Add device(s) to group. (csv gpuIds or entityIds
                               simlar to gpu:0, instance:1, compute_instance:2,
                               nvswitch:994)
      --default               Adds all available GPUs to the group being created.
      --defaultnvswitches           Adds all available NvSwitches to the group being created.

3.9.1 查看设备编组

通过-l选项可查看已存在的编组，例如执行dcgmi group -l，示例如下：

root@be1816e3958c:~# dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 2 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1                                             |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
+-------------------+----------------------------------------------------------+

通过选项-i可查看编组的详细信息，通过-g指定Group ID，例如执行dcgmi group -g 0 -i，示例如下：

root@be1816e3958c:~# dcgmi group -g 0 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO                                                                   |
+===================+==========================================================+
| 0                 |                                                          |
| -> Group ID       | 0                                                        |
| -> Group Name     | DCGM_ALL_SUPPORTED_GPUS                                  |
| -> Entities       | GPU 0, GPU 1                                             |
+-------------------+----------------------------------------------------------+

3.9.2 管理设备编组

可通过-c选项创建编组，配合-a选项指定本编组内包含的设备，例如执行dcgmi group -c test -a gpu:0,1，创建一个名称为test的组，并添加PPU 0和PPU 1到本组。可通过--default或者--defaultnvswitches更方便的一次性添加所有设备到一个组。

root@be1816e3958c:~# dcgmi group -c test -a gpu:0,1
Successfully created group "test" with a group ID of 2
Add to group operation successful.

可通过-r选项从某个编组中删除某成员，例如执行dcgmi group -g 2 -r 1，从group 2中删除PPU 1。

root@be1816e3958c:~# dcgmi group -g 2 -r 1
Remove from group operation successful.

可通过-d选项删除编组，例如执行dcgmi group -d 2删除group 2：

root@be1816e3958c:~# dcgmi group -d 2
Successfully removed group 2

3.10 管理查询项编组 (fieldgroup)

可通过fieldgroup子命令管理查询项field ID的编组（field group），后续可通过指定查询项编组，一次性查询组内所有参数，例如在dmon子命令中通过-f指定查询项编组。

可执行dcgmi fieldgroup -h查看帮助信息：

root@be1816e3958c:~# dcgmi fieldgroup -h

 fieldgroup -- Used to create and maintain groups of field IDs. Groups of field
 IDs can then be uniformly controlled through other DCGMI subsystems.

Usage: dcgmi fieldgroup
   dcgmi fieldgroup --host <IP/FQDN> -l -j
   dcgmi fieldgroup --host <IP/FQDN> -c <fieldGroupName> -f <fieldIds>
   dcgmi fieldgroup --host <IP/FQDN> -i -g <fieldGroupId> -j
   dcgmi fieldgroup --host <IP/FQDN> -d -g <fieldGroupId>

Flags:
...
  -l  --list                  List the field groups that currently exist for a host.
  -i  --info                  Display the information for the specified field group ID.
  -d  --delete                Delete a field group on the remote host.
  -c  --create     fieldGroupName Create a field group on the remote host.
  -h  --help                  Displays usage information and exits.
  -g  --fieldgroup fieldGroupId The field group to query on the specified host.
  -f  --fieldids   fieldIds   Comma-separated list of the field ids to add to a
                               field group when creating a new one.
...

3.10.1 查看查询项编组

通过-l选项可查看已存在的编组，例如执行dcgmi fieldgroup -l，示例如下：

root@be1816e3958c:~# dcgmi fieldgroup -l
3 field groups found.
+-------------------+----------------------------------------------------------+
| FIELD GROUPS                                                                 |
+===================+==========================================================+
| ID                | 1                                                        |
| Name              | DCGM_INTERNAL_30SEC                                      |
| Field IDs         | 300                                                      |
+-------------------+----------------------------------------------------------+
+-------------------+----------------------------------------------------------+
| FIELD GROUPS                                                                 |
+===================+==========================================================+
| ID                | 2                                                        |
| Name              | DCGM_INTERNAL_HOURLY                                     |
| Field IDs         | 501, 509, 510, 511, 512, 513                             |
+-------------------+----------------------------------------------------------+
...

通过-i选项可查询某个编组的详细信息，通过-g指定编组，例如执行dcgmi fieldgroup -g 2 -i，示例如下：

root@be1816e3958c:~# dcgmi fieldgroup -g 2 -i
+-------------------+----------------------------------------------------------+
| FIELD GROUPS                                                                 |
+===================+==========================================================+
| ID                | 2                                                        |
| Name              | DCGM_INTERNAL_HOURLY                                     |
| Field IDs         | 501, 509, 510, 511, 512, 513                             |
+-------------------+----------------------------------------------------------+

3.10.2 管理查询项编组

通过-c选项可创建查询项编组，并通过-f选项指定本组内包含的查询项（field id），例如执行dcgmi fieldgroup -c temperature -f 140,150，创建一个名称为temperature的编组，包含memory和PPU核心温度查询项：

root@be1816e3958c:~# dcgmi fieldgroup -c temperature -f 140,150
Successfully created field group "temperature" with a field group ID of 4

通过-d选项可删除查询项编组，通过-g指定编组，例如执行dcgmi fieldgroup -d -g 4，示例如下：

root@be1816e3958c:~# dcgmi fieldgroup -d -g 4
Successfully removed field group 4

3.11 管理性能分析指标（profile）

可通过profile子命令查看和管理性能分析指标，可执行dcgmi profile -h查看帮助信息：

root@457471bb3c5e:~# dcgmi profile -h

 profile -- View available profiling metrics for GPUs

Usage: dcgmi profile

Flags:
      --host       IP/FQDN    Connects to specified IP or fully-qualified domain
                               name. To connect to a host engine that was
                               started with -d (unix socket), prefix the unix
                               socket filename with 'unix://'. [default =
                               localhost]
  -l  --list                  List available profiling metrics for a GPU or
                               group of GPUs
      --pause                  Pause DCGM profiling in order to run NVIDIA
                               developer tools like nvprof, nsight compute, or
                               nsight systems.
      --resume                 Resume DCGM profiling that was paused previously
                               with --pause.
  -h  --help                  Displays usage information and exits.
  -j  --json                  Print the output in a json format
  -i  --gpu-id     gpuId       The comma seperated list of GPU IDs to query.
                               Default is supported GPUs on the system. Run
                               dcgmi discovery -l to check list of GPUs
                               available
  -g  --group-id   groupId     The group of GPUs to query on the specified host.
...

3.11.1 列出支持的性能分析指标

通过dcgmi profile -l选项可查看可用的性能分析指标，可通过-i或者-g选项约束查询的PPU设备范围，示例如下：

root@457471bb3c5e:~# dcgmi profile -l
+----------------+----------+------------------------------------------------------+
| Group.Subgroup | Field ID | Field Tag                                            |
+----------------+----------+------------------------------------------------------+
| A.0            | 1002     | sm_active                                            |
| A.0            | 1003     | sm_occupancy                                         |
...

通过dcgmi dmon -e选项指定Field ID列即可查询相关性能分析指标，例如：

dcgmi dmon -e 1002,1003

3.11.2 暂停和恢复指标采集

订阅性能分析指标（Field ID > 1000）将会占用PPU性能数据采集资源，导致其他内置DCGM server的工具（如dcgm-exporter）和Asight工具无法运行，可通过dcgmi profile --pause暂停性能分析指标的采集，以运行其他工具，之后通过dcgmi profile --resume恢复跟踪采集，暂停阶段性能分析指标结果将上报为N/A。例如：

# 暂停性能分析指标采集
dcgmi profile --pause

# acu 跟踪数据采集
acu -o test_report -f python test_linear.py

# 恢复性能分析指标采集
dcgmi profile --resume

4. DCGM API支持状态

PPU DCGM提供dcgm共享库libdcgm.so，命令行工具dcgmi、后台服务nv-hostengine和hgdcgm-exporter均基于此动态库开发。libdcgm.so对外提供的接口说明参见dcgm_agent.h中的API说明。

libdcgm.so中DCGM支持的API支持情况如下：

API名称	说明	是否支持
dcgmInit	This method is used to initialize DCGM within this process.	是
dcgmStartEmbedded	Start an embedded host engine agent within this process.	是
dcgmProfGetSupportedMetricGroups	Get all of the profiling metric groups for a given GPU group	是
dcgmGroupCreate	Used to create a entity group handle which can store one or more entity Ids as an opaque handle returned in pDcgmGrpId.	是
dcgmFieldGroupCreate	Used to create a group of fields and return the handle in dcgmFieldGroupId	是
dcgmWatchFields	Request that DCGM start recording updates for a given field collection	是
dcgmUpdateAllFields	Tell the DCGM module to update all the fields being watched	是
dcgmEntityGetLatestValues	Request latest cached field value for a group of fields for a specific entity	是
dcgmFieldGroupDestroy	Used to remove a field group that was created with dcgmFieldGroupCreate	是
DcgmFieldsTerm	Terminates the DcgmFields module. Call this once from inside your program	是
dcgmGetGpuInstanceHierarchy	Gets the hierarchy of GPUs, GPU Instances, and Compute Instances by populating a list of each entity with a reference to their parent	是
dcgmGroupAddDevice	Used to add specified GPU Id to the group represented by groupId.	是
dcgmGetAllSupportedDevices	Get identifiers corresponding to all the DCGM-supported devices on the system.	是
dcgmGetAllDevices	Get identifiers corresponding to all the devices on the system	是
dcgmGetDeviceTopology	Gets device topology corresponding to the gpuId	是
dcgmGetDeviceAttributes	Gets device attributes corresponding to the gpuId.	是
dcgmHealthSet	Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t	是
dcgmHealthCheck	Check the configured watches for any errors/failures/warnings that have occurred since the last time this check was invoked.	是

5. Field Id支持状态

PPU DCGM查询项（field id）支持情况如下：

Field ID名称	Field ID	说明	是否支持
DCGM_FI_DRIVER_VERSION	1	Driver Version	是
DCGM_FI_NVML_VERSION	2	Underlying NVML version	是
DCGM_FI_PROCESS_NAME	3	Process Name	是
DCGM_FI_DEV_COUNT	4	Number of Devices on the node	是
DCGM_FI_CUDA_DRIVER_VERSION	5	Cuda Driver Version	是
DCGM_FI_DEV_NAME	50	Name of the GPU device	是
DCGM_FI_DEV_NVML_INDEX	52	NVML index of this GPU	是
DCGM_FI_DEV_SERIAL	53	Device Serial Number	是
DCGM_FI_DEV_UUID	54	UUID corresponding to the device	是
DCGM_FI_GPU_TOPOLOGY_PCI	60	Topology of all GPUs on the system via PCI (static)	是
DCGM_FI_DEV_MIG_MODE	67	MIG mode for the device	是
DCGM_FI_DEV_VBIOS_VERSION	85	VBIOS version of the device	是
DCGM_FI_DEV_SM_CLOCK	100	SM clock frequency (in MHz)	是
DCGM_FI_DEV_MEM_CLOCK	101	Memory clock frequency (in MHz)	是
DCGM_FI_DEV_VIDEO_CLOCK	102	Video encoder/decoder clock for the device	是
DCGM_FI_DEV_APP_SM_CLOCK	110	SM Application clocks	是
DCGM_FI_DEV_APP_MEM_CLOCK	111	Memory Application clocks	是
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS	112	Current clock throttle reasons	是
DCGM_FI_DEV_MAX_SM_CLOCK	113	Maximum supported SM clock for the device	是
DCGM_FI_DEV_MAX_MEM_CLOCK	114	Maximum supported Memory clock for the device	是
DCGM_FI_DEV_MAX_VIDEO_CLOCK	115	Maximum supported Video encoder/decoder clock for the device	是
DCGM_FI_DEV_MEMORY_TEMP	140	Memory temperature (in C)	是
DCGM_FI_DEV_GPU_TEMP	150	GPU temperature (in C)	是
DCGM_FI_DEV_MEM_MAX_OP_TEMP	151	Maximum operating temperature for the memory of this GPU	是
DCGM_FI_DEV_GPU_MAX_OP_TEMP	152	Maximum operating temperature for this GPU	是
DCGM_FI_DEV_POWER_USAGE	155	Power draw (in W)	是
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	156	Total energy consumption since boot (in mJ)	否
DCGM_FI_DEV_SLOWDOWN_TEMP	158	Slowdown temperature for the device	是
DCGM_FI_DEV_SHUTDOWN_TEMP	159	Shutdown temperature for the device	是
DCGM_FI_DEV_POWER_MGMT_LIMIT	160	Current Power limit for the device	是
DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN	161	Minimum power management limit for the device	是
DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX	162	Maximum power management limit for the device	是
DCGM_FI_DEV_POWER_MGMT_LIMIT_DEF	163	Default power management limit for the device	是
DCGM_FI_DEV_ENFORCED_POWER_LIMIT	164	Effective power limit that the driver enforces after taking into account all limiters	是
DCGM_FI_DEV_PCIE_TX_THROUGHPUT	200	Total number of bytes transmitted through PCIe TX (in KB) via NVML	是
DCGM_FI_DEV_PCIE_RX_THROUGHPUT	201	Total number of bytes received through PCIe RX (in KB) via NVML	是
DCGM_FI_DEV_PCIE_REPLAY_COUNTER	202	PCIe replay counter	否
DCGM_FI_DEV_GPU_UTIL	203	GPU utilization (in %)	是
DCGM_FI_DEV_MEM_COPY_UTIL	204	Memory utilization (in %)	是
DCGM_FI_DEV_ENC_UTIL	206	Encoder utilization (in %)	是
DCGM_FI_DEV_DEC_UTIL	207	Decoder utilization (in %)	是
DCGM_FI_DEV_PCIE_MAX_LINK_GEN	235	PCIe Max Link Generation	是
DCGM_FI_DEV_PCIE_MAX_LINK_WIDTH	236	PCIe Max Link Width	是
DCGM_FI_DEV_PCIE_LINK_GEN	237	PCIe Current Link Generation	是
DCGM_FI_DEV_PCIE_LINK_WIDTH	238	PCIe Current Link Width	是
DCGM_FI_DEV_FB_TOTAL	250	Total Frame Buffer of the GPU in MB	是
DCGM_FI_DEV_FB_FREE	251	Free Frame Buffer in MB	是
DCGM_FI_DEV_FB_USED	252	Used Frame Buffer in MB	是
DCGM_FI_DEV_ECC_CURRENT	300	Current ECC mode for the device	是
DCGM_FI_DEV_ECC_PENDING	301	Pending ECC mode for the device	是
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL	310	Total number of single-bit volatile ECC errors	是
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL	311	Total number of double-bit volatile ECC errors	是
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL	312	Total number of single-bit persistent ECC errors	是
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL	313	Total number of double-bit persistent ECC errors	是
DCGM_FI_DEV_ECC_SBE_VOL_DEV	318	Device memory single bit volatile ECC errors	是
DCGM_FI_DEV_ECC_DBE_VOL_DEV	319	Device memory double bit volatile ECC errors	是
DCGM_FI_DEV_ECC_SBE_AGG_DEV	328	Device memory single bit aggregate (persistent) ECC errors	是
DCGM_FI_DEV_ECC_DBE_AGG_DEV	329	Device memory double bit aggregate (persistent) ECC errors	是
DCGM_FI_DEV_RETIRED_SBE	390	Total number of retired pages due to single-bit errors	是
DCGM_FI_DEV_RETIRED_DBE	391	Total number of retired pages due to double-bit errors	是
DCGM_FI_DEV_RETIRED_PENDING	392	Total number of pages pending retirement	是
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS	393	Number of remapped rows for uncorrectable errors	是
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS	394	Number of remapped rows for correctable errors	是
DCGM_FI_DEV_ROW_REMAP_FAILURE	395	Whether remapping of rows has failed	是
DCGM_FI_DEV_ROW_REMAP_PENDING	396	Whether remapping of rows is pending	是
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL	409	Total number of NVLink flow-control CRC errors	否
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL	419	Total number of NVLink data CRC errors	否
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL	429	Total number of NVLink retries	否
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0	440	ICN Link Lane Bandwidth Counter	是
DCGM_FI_DEV_NVLINK_BANDWIDTH_L1	441	ICN Link Lane Bandwidth Counter	是
DCGM_FI_DEV_NVLINK_BANDWIDTH_L2	442	ICN Link Lane Bandwidth Counter	是
DCGM_FI_DEV_NVLINK_BANDWIDTH_L3	443	ICN Link Lane Bandwidth Counter	是
DCGM_FI_DEV_NVLINK_BANDWIDTH_L4	444	ICN Link Lane Bandwidth Counter	是
DCGM_FI_DEV_NVLINK_BANDWIDTH_L5	445	ICN Link Lane Bandwidth Counter	是
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL	449	ICN Link Bandwidth Counter total for all Lanes	是
DCGM_FI_DEV_NVLINK_BANDWIDTH_L6	475	ICN Link Lane Bandwidth Counter	是
DCGM_FI_PROF_GR_ENGINE_ACTIVE	1001	Ratio of time the graphics engine is active	否
DCGM_FI_PROF_SM_ACTIVE	1002	The ratio of cycles an SM has at least 1 warp assigned (in %)	是
DCGM_FI_PROF_SM_OCCUPANCY	1003	The ratio of number of warps resident on an SM (in %)	是
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	1004	Ratio of cycles the tensor (HMMA) pipe is active (in %)	是
DCGM_FI_PROF_DRAM_ACTIVE	1005	Ratio of cycles the device memory interface is active sending or receiving data (in %)	是
DCGM_FI_PROF_PIPE_FP64_ACTIVE	1006	Ratio of cycles the fp64 pipes are active (in %)	否
DCGM_FI_PROF_PIPE_FP32_ACTIVE	1007	Ratio of cycles the fp32 pipes are active (in %)	否
DCGM_FI_PROF_PIPE_FP16_ACTIVE	1008	Ratio of cycles the fp16 pipes are active (in %)	否
DCGM_FI_PROF_PCIE_TX_BYTES	1009	The number of bytes of active PCIe tx (transmit) data including both header and payload.	是
DCGM_FI_PROF_PCIE_RX_BYTES	1010	The number of bytes of active PCIe rx (read) data including both header and payload.	是
DCGM_FI_PROF_NVLINK_TX_BYTES	1011	The total number of bytes of active NvLink tx (transmit) data including both header and payload.	是
DCGM_FI_PROF_NVLINK_RX_BYTES	1012	The total number of bytes of active NvLink rx (read) data including both header and payload.	是
DCGM_FI_PROF_NVLINK_L0_TX_BYTES	1040	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L0_RX_BYTES	1041	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L1_TX_BYTES	1042	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L1_RX_BYTES	1043	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L2_TX_BYTES	1044	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L2_RX_BYTES	1045	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L3_TX_BYTES	1046	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L3_RX_BYTES	1047	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L4_TX_BYTES	1048	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L4_RX_BYTES	1049	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L5_TX_BYTES	1050	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L5_RX_BYTES	1051	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L6_TX_BYTES	1052	ICN per link bandwidth	是
DCGM_FI_PROF_NVLINK_L6_RX_BYTES	1053	ICN per link bandwidth	是
DCGM_FI_PROF_KSD_HIT_RATE	6001	Ratio of KSD hit rate	是
DCGM_FI_PROF_KVD_HIT_RATE	6002	Ratio of KVD hit rate	是
DCGM_FI_PROF_L2_HIT_RATE	6003	Ratio of L2 cache hit rate	是
DCGM_FI_PROF_LLC_HIT_RATE	6004	Ratio of LLC cache hit rate	否

6. dcgm-exporter工具

PPU DCGM提供dcgm-exporter工具，方便集成到k8s环境中，以供外部查询和管理PPU设备。

6.1 运行步骤

根据获取与安装，在解压完PPU DCGM安装包后，会得到hgdcgm-exporter文件夹，里面包含dcgm-exporter可执行文件和default-counters.csv，可通过修改default-counters.csv控制对外提供的信息，default-counters.csv内容示例如下：

# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).

# PCIE
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.

# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_TOTAL, gauge, Framebuffer memory total (in MiB)
...

通过执行./dcgm-exporter启动dcgm-exporter，执行结果示例如下，其中9400是exporter的http接口：

# 启动dcgm exporter服务
./dcgm-exporter &

# 获取metrics结果
curl localhost:9400/metrics

执行结果示例如下：

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 1500
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 1800
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 36
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 38
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 130.730000
# HELP DCGM_FI_DEV_PCIE_TX_THROUGHPUT Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# TYPE DCGM_FI_DEV_PCIE_TX_THROUGHPUT counter
DCGM_FI_DEV_PCIE_TX_THROUGHPUT{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 0.000000
# HELP DCGM_FI_DEV_PCIE_RX_THROUGHPUT Total number of bytes received through PCIe RX (in KB) via NVML.
# TYPE DCGM_FI_DEV_PCIE_RX_THROUGHPUT counter
DCGM_FI_DEV_PCIE_RX_THROUGHPUT{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 0.000000
...

6.2 和dcgmi命令行工具配合使用

当dcgmi命令行工具和dcgm-exporter在相同系统内运行时，由于dcgm-exporter默认使用内嵌的DCGM server，将导致有多个订阅者订阅PPU监控数据，导致监控性能开销增加，订阅性能分析指标（Field ID > 1000）失败等问题。

可通过dcgm-exporter参数-r指定dcgmi后台服务nv-hostengine的端口地址，使得dcgm-exporter通过nv-hostengine获取监控数据。例如：

./dcgm-exporter -r localhost:5555

7. FAQ

7.1 在Docker内使用DCGM工具

通过docker run启动docker时，需要删除--gpus选项，可以通过添加--privileged参数创建privileged container或者--device参数直接映射PPU设备给unprivileged container访问host机器上各种PPU设备的能力。

如果使用所有卡来启动镜像，可以参考下述命令（--privileged）：

docker run --privileged --ipc=host --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 --init -it -v $HOME:/mnt -w /workspace --name test_$USER <DOCKER_NAME> /bin/bash

如果使用特定某PPU设备来启动镜像，可以参考下述命令，例如只使用PPU 0设备（--device=/dev/alixpu_ppu0 --device=/dev/alixpu --device=/dev/alixpu_ctl）：

docker run --device=/dev/alixpu_ppu0 --device=/dev/alixpu --device=/dev/alixpu_ctl --ipc=host --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 --init -it -v $HOME:/mnt -w /workspace --name tf_$USER <DOCKER_NAME> /bin/bash

8. 已知问题

SDK v1.0.0：dcgm不支持在开启vGPU的系统中使用。
SDK v1.0.0：查询的application clock频率可能低于实际值。
SDK v1.0.0：查询大于1000的field ID结果可能不准确。
SDK v1.2.0：dcgmi和dcgm API暂不支持TensorCore/Pipe等利用率指标查询项。
非root权限运行时，修改设备配置等功能不可用。
dcgmi暂不支持stats子命令。
dcgmi nvlink子命令暂不支持icn error相关查询功能。
dcgmi policy子命令不支持设置pcie/icn错误策略门限。
dcgmi health子命令不支持订阅infoROM/thermal and power/nvlink。
ICN链路吞吐速率为Raw数据的吞吐量，包含消息头等开销。
不支持policy violation相关设置和查询功能。
不支持sample相关查询，例如温度/功率周期采样信息。
不支持vGPU利用率和进程信息相关查询。
不支持persistence mode相关功能。
不支持performance state相关功能。
不支持调整memory clock频率。
memory_bandwidth诊断测试在PPU频率未达到最大频率时，测试将会失败。
极少情况下，启动nv-hostengine时，会报"DCGM Failed to find any GPUs on the node"，可能原因是服务器环境中残留老版本的KMD或PPU SDK文件。