DCGM Management and Monitoring Tool (v1.5)
1. Overview
PPU DCGM is a suite of tools for managing and monitoring PPUs in cluster and data-center environments, based on the open-source DCGM. It includes the following components:
the dcgmi command-line tool
the nv-hostengine command-line tool
the dcgm shared library
the dcgm-exporter tool
2. Obtaining and Installing
The PPU DCGM tools are currently released as a separate package; please follow the download link to the PPU artifactory page to download it.
Downloading the package requires an account and password; please contact your account manager (PDSA) to obtain them.
The package contains:
rpm and deb installation packages for PPU DCGM
hgdcgm-exporter (directory)
Installing PPU DCGM has the following prerequisites:
The PPU driver is installed
The PPU SDK is installed
root privileges (without root privileges, some features are unavailable)
2.1 Stopping and Removing an Older PPU DCGM
Note whether you are running with root privileges:
# Running as root: stop nv-hostengine
sudo nv-hostengine -t
# Running without root: stop nv-hostengine; make sure the nvhostengine.pid path matches the one used when nv-hostengine was started
nv-hostengine -t --pid <your path>/nvhostengine.pid
# If the deb package was used (Ubuntu)
sudo dpkg -r datacenter-gpu-manager
# If the rpm package was used (CentOS)
sudo rpm -e datacenter-gpu-manager
2.2 Installing the PPU DCGM Package
# Since version 3.0.8, dcgm ships as a single package; there are no longer separate PCIe and OAM packages
# Using version 3.0.9 as an example:
# If using the deb package (Ubuntu)
sudo dpkg -i --force-overwrite datacenter-gpu-manager_3.0.9_amd64.deb
# If using the rpm package (CentOS)
sudo rpm --force -ivh --nodeps datacenter-gpu-manager-3.0.9-1-x86_64.rpm
2.3 Installing Dependent Packages
The PPU DCGM runtime environment requires the following software:
On APT-based Linux distributions such as Ubuntu or Debian, lspci can be installed with:
sudo apt-get update
sudo apt-get install pciutils
On Yum-based Linux distributions such as CentOS or Red Hat, lspci can be installed with:
sudo yum makecache
sudo yum install pciutils
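Before starting the DCGM services, a quick scripted check can confirm that lspci is available; a minimal sketch (the message strings are illustrative):

```shell
# Check that the lspci binary from pciutils is on PATH.
if command -v lspci >/dev/null 2>&1; then
    echo "pciutils present"
else
    echo "pciutils missing: install it with apt-get or yum as shown above" >&2
fi
```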
2.4 Starting the PPU DCGM Background Service
Note whether you are running with root privileges:
# Running as root: start the HGDCGM background service
nv-hostengine
# Running without root: specify the path of nvhostengine.pid
# Command: nv-hostengine --pid <your path>/nvhostengine.pid
For example: nv-hostengine --pid /work/test/nvhostengine.pid
Run dcgmi discovery -l to verify that PPU DCGM is working:
root@be1816e3958c:~# dcgmi discovery -l
2 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: PPU |
| | PCI Bus ID: 00000000:10:00.0 |
| | Device UUID: GPU-019ea108-c110-0828-0000-00002062b161 |
+--------+----------------------------------------------------------------------+
| 1 | Name: PPU |
| | PCI Bus ID: 00000000:11:00.0 |
| | Device UUID: GPU-019ea108-c120-040c-0000-0000c007e820 |
+--------+----------------------------------------------------------------------+
...
3. The dcgmi Command-Line Tool
dcgmi is a command-line tool for interactive queries; it communicates with nv-hostengine to collect and display data.
Start the nv-hostengine service before using dcgmi. Note whether you are running with root privileges; without root privileges, some features are unavailable.
# Running as root: start the DCGM background service
nv-hostengine
# Running without root: specify the path of nvhostengine.pid
nv-hostengine --pid <your path>/nvhostengine.pid
dcgmi supports multiple subcommands; run dcgmi -h for help:
root@be1816e3958c:~# dcgmi -h
Usage: dcgmi
dcgmi subsystem
dcgmi -v
Flags:
-v vv Get DCGMI version information
subsystem The desired subsystem to be accessed.
Subsystems Available:
topo GPU Topology [dcgmi topo -h for more info]
stats Process Statistics [dcgmi stats -h for more info]
diag System Validation/Diagnostic [dcgmi diag -h for more info]
policy Policy Management [dcgmi policy -h for more info]
health Health Monitoring [dcgmi health -h for more info]
config Configuration Management [dcgmi config -h for more info]
group GPU Group Management [dcgmi group -h for more info]
fieldgroup Field Group Management [dcgmi fieldgroup -h for more info]
discovery Discover GPUs on the system [dcgmi discovery -h for more info]
nvlink Displays NvLink link statuses and error counts [dcgmi nvlink -h for more info]
dmon Stats Monitoring of GPUs [dcgmi dmon -h for more info]
modules Control and list DCGM modules
profile Control and list DCGM profiling metrics
-- ignore_rest Ignores the rest of the labeled arguments following this
flag.
--version Displays version information and exits.
-h --help Displays usage information and exits.
...
3.1 Viewing the Device List (discovery)
The discovery subcommand lists PPU devices and shows their status information.
Run dcgmi discovery -h for help:
root@be1816e3958c:~# dcgmi discovery -h
discovery -- Used to discover and identify GPUs and their attributes.
Usage: dcgmi discovery
dcgmi discovery --host <IP/FQDN> -l
dcgmi discovery --host <IP/FQDN> -i <flags> -g <groupId> -v
dcgmi discovery -c
Flags:
-g --group groupId The group ID to query.
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-l --list List all GPUs discovered on the host.
-i --info flags Specify which information to return. [default =
atp]
a - device info
p - power limits
t - thermal limits
c - clocks
-c --compute-hierarchy List all of the gpu instances and compute
instances
...
3.1.1 Viewing the Device List
Running dcgmi discovery -l lists the PPU devices, showing each device's name, PCI bus ID, UUID, and other information:
root@be1816e3958c:~# dcgmi discovery -l
2 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: PPU |
| | PCI Bus ID: 00000000:10:00.0 |
| | Device UUID: GPU-019ea108-c110-0828-0000-00002062b161 |
+--------+----------------------------------------------------------------------+
| 1 | Name: PPU |
| | PCI Bus ID: 00000000:11:00.0 |
| | Device UUID: GPU-019ea108-c120-040c-0000-0000c007e820 |
+--------+----------------------------------------------------------------------+
...
3.1.2 Viewing Status Information
The -i option shows device status information; the information displayed is selected by the flags passed to -i:
Flag | Status information |
a | Device information |
p | Power limits |
t | Thermal limits |
c | Clock information |
For example, dcgmi discovery -i p shows the device power limits. If no GPU group is specified with -g, a summary for all PPUs is shown:
root@be1816e3958c:~# dcgmi discovery -i p
+--------------------------+-------------------------------------------------+
| Group of 2 GPUs | Device Information |
+==========================+=================================================+
| Current Power Limit (W) | 300 |
| Default Power Limit (W) | 300 |
| Max Power Limit (W) | 300 |
| Min Power Limit (W) | 200 |
| Enforced Power Limit (W) | 300 |
+--------------------------+-------------------------------------------------+
Adding the -v option shows status information per GPU. For example, dcgmi discovery -i a -g 0 -v prints the basic device information of each GPU in GPU group 0:
root@be1816e3958c:~# dcgmi discovery -i a -g 0 -v
Device info:
+--------------------------+-------------------------------------------------+
| GPU ID: 0 | Device Information |
+==========================+=================================================+
| Device Name | PPU |
| PCI Bus ID | 00000000:10:00.0 |
| UUID | GPU-019ea108-c110-0828-0000-00002062b161 |
| Serial Number | TH7510H07 |
| InfoROM Version | Not Supported |
| VBIOS | 1.4.44 |
+--------------------------+-------------------------------------------------+
+--------------------------+-------------------------------------------------+
| GPU ID: 1 | Device Information |
+==========================+=================================================+
| Device Name | PPU |
| PCI Bus ID | 00000000:11:00.0 |
| UUID | GPU-019ea108-c120-040c-0000-0000c007e820 |
| Serial Number | TH7510H07 |
| InfoROM Version | Not Supported |
| VBIOS | 1.4.44 |
+--------------------------+-------------------------------------------------+
3.1.3 Displaying MIG Instance Information
The -c option lists the existing MIG GPU instances and compute instances. For example, dcgmi discovery -c shows the MIG instance hierarchy within the PPU devices:
root@be1816e3958c:/work/setup# dcgmi discovery -c
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy |
+===================+====================================================================+
| GPU 1 | GPU GPU-019ea108-c120-040c-0000-0000c007e820 (EntityID: 1) |
| -> I 1/0 | GPU Instance (EntityID: 68) |
| -> CI 1/0/0 | Compute Instance (EntityID: 68) |
| -> I 1/1 | GPU Instance (EntityID: 69) |
| -> CI 1/1/0 | Compute Instance (EntityID: 69) |
+-------------------+--------------------------------------------------------------------+
3.2 Monitoring Device Parameters (dmon)
The dmon subcommand views and monitors device parameters; it queries the fields of a specified field group for the devices in a specified GPU group. Run dcgmi dmon -h for help:
root@be1816e3958c:~# dcgmi dmon -h
dmon -- Used to monitor GPUs and their stats.
Usage: dcgmi dmon
dcgmi dmon -i <gpuId> -g <groupId> -f <fieldGroupId> -e <fieldId> -d
<delay> -c <count> -l
Flags:
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-f --field-group-id fieldGroupId The field group to query on the specified
host.
-e --field-id fieldId Field identifier to view/inject.
-l --list List to look up the long names, short names and
field ids.
-h --help Displays usage information and exits.
-i --gpu-id gpuId The comma separated list of GPU/GPU-I/GPU-CI IDs
to run the dmon on. Default is -1 which runs for
all supported GPU. Run dcgmi discovery -c to
check list of available GPU entities
-g --group-id groupId The group to query on the specified host.
-d --delay delay In milliseconds. Integer representing how often
to query results from DCGM and print them for all
of the entities. [default = 1000 msec, Minimum
value = 1 msec.]
-c --count count Integer representing How many times to loop
before exiting. [default- runs forever.]
...
3.2.1 Viewing the Supported Field List
The -l option lists the supported fields with their names and field IDs. For example, dcgmi dmon -l produces output like the following, with each field's long name, short name, and field ID:
root@be1816e3958c:~# dcgmi dmon -l
________________________________________________________________________________________________________________________
Long Name Short Name Field ID
________________________________________________________________________________________________________________________
driver_version DRVER 1
nvml_version NVVER 2
process_name PRNAM 3
device_count DVCNT 4
cuda_driver_version CDVER 5
name DVNAM 50
brand DVBRN 51
nvml_index NVIDX 52
serial_number SRNUM 53
uuid UUID# 54
minor_number MNNUM 55
oem_inforom_version OEMVR 56
pci_busid PCBID 57
...
3.2.2 Querying Device Parameters
The -e option specifies a list of field IDs to query. For example, dcgmi dmon -e 140,150 queries the memory and core temperatures of all PPUs:
root@be1816e3958c:~# dcgmi dmon -e 140,150
#Entity MMTMP TMPTR
ID C C
GPU 1 29 29
GPU 0 30 28
GPU 1 30 29
GPU 0 30 27
...
The -f option specifies a field group and queries all of its field IDs at once; for example, dcgmi dmon -f 5 queries every field in field group 5. Field groups are managed with the dcgmi fieldgroup subcommand:
# Create a field group
root@be1816e3958c:~# dcgmi fieldgroup -c test -f 140,150
Successfully created field group "test" with a field group ID of 5
# Query the field group
root@be1816e3958c:~# dcgmi dmon -f 5
#Entity MMTMP TMPTR
ID C C
GPU 1 30 29
GPU 0 30 28
GPU 1 30 29
GPU 0 30 28
...
Note: for field ID support, see the Field ID support status table.
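Because the dmon output is plain columnar text, it can be post-processed with standard tools. A minimal sketch (the sample data is the transcript output above, captured into a variable) that reports the highest memory temperature (MMTMP) seen per GPU:

```shell
# Sample `dcgmi dmon -f 5` output captured into a variable.
sample='#Entity MMTMP TMPTR
ID      C     C
GPU 1   30    29
GPU 0   30    28
GPU 1   30    29
GPU 0   30    28'

# For each GPU ID, keep the maximum value of the MMTMP column (field 3).
printf '%s\n' "$sample" | awk '
    $1 == "GPU" { if ($3 > max[$2]) max[$2] = $3 }
    END { for (id in max) printf "GPU %s max MMTMP: %s C\n", id, max[id] }' | sort
```

With the sample above this prints `GPU 0 max MMTMP: 30 C` and `GPU 1 max MMTMP: 30 C`; in a real deployment the same pipeline would consume live dmon output.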
3.2.3 Other Control Options
By default dmon prints device parameters periodically; the query can be interrupted with Ctrl + C. The query interval is set with the -d option and the query count with the -c option; for example, dcgmi dmon -e 140,150 -c 1 queries the data only once.
root@be1816e3958c:~# dcgmi dmon -e 140,150 -c 1
#Entity MMTMP TMPTR
ID C C
GPU 1 29 29
GPU 0 30 28
3.3 Modifying the Device Configuration (config)
The config subcommand views and modifies device configuration, with support for setting clock frequencies, ECC mode, and other parameters. Run dcgmi config -h for help:
root@be1816e3958c:~# dcgmi config -h
config -- Used to configure settings for groups of GPUs.
Usage: dcgmi config
dcgmi config --host <IP/FQDN> -g <groupId> --enforce
dcgmi config --host <IP/FQDN> -g <groupId> --get -v -j
dcgmi config --host <IP/FQDN> -g <groupId> --set -e <0/1> -s <0/1> -a
<mem,proc> -P <limit> -c <mode>
Flags:
-g --group groupId The GPU group to query on the specified host.
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default = localhost]
--set Set configuration.
--get Get configuration. Displays the Target and the
Current Configuration.
------
1.Sync Boost - Current and Target Sync Boost State
2.SM Application Clock - Current and Target SM application clock values
3.Memory Application Clock - Current and Target Memory application clock values
4.ECC Mode - Current and Target ECC Mode
5.Power Limit - Current and Target power limits
6.Compute Mode - Current and Target compute mode
--enforce Enforce configuration.
-h --help Displays usage information and exits.
-v --verbose Display policy information per GPU.
-j --json Print the output in a json format
-e --eccmode 0/1 Configure Ecc mode. (1 to Enable, 0 to Disable)
-s --syncboost 0/1 Configure Syncboost. (1 to Enable, 0 to Disable)
-a --appclocks mem,proc Configure Application Clocks. Must use memory,proc clocks (csv) format(MHz).
-P --powerlimit limit Configure Power Limit (Watts).
-c --compmode mode Configure Compute Mode. Can be any of the
following:
0 - Unrestricted
1 - Prohibited
2 - Exclusive Process
...
3.3.1 Querying the Device Configuration
The --get option queries both the current values and the values requested for change. For example, dcgmi config --get shows a configuration summary of all PPUs, where:
The Target column shows values that have been set but not yet enforced.
The Current column shows values that have been enforced (are in effect).
root@be1816e3958c:~# dcgmi config --get
+------------------------------+------------------------------+------------------------------+
| DCGM_ALL_SUPPORTED_GPUS |
| Group of 2 GPUs |
+==============================+==============================+==============================+
| Field | Target | Current |
+------------------------------+------------------------------+------------------------------+
| Compute Mode | Not Specified | Unrestricted |
| ECC Mode | Enabled | Disabled |
| Sync Boost | Not Specified | Not Supported |
| Memory Application Clock | Not Specified | 1800 |
| SM Application Clock | Not Specified | 1200 |
| Power Limit | Not Specified | 300 |
+------------------------------+------------------------------+------------------------------+
...
Adding the -v option shows the configuration state of each individual PPU:
root@be1816e3958c:~# dcgmi config --get -v
+------------------------------+------------------------------+------------------------------+
| GPU ID: 0 |
| PPU |
+==============================+==============================+==============================+
| Field | Target | Current |
+------------------------------+------------------------------+------------------------------+
| Compute Mode | Not Specified | Unrestricted |
| ECC Mode | Enabled | Disabled |
| Sync Boost | Not Specified | Not Supported |
| Memory Application Clock | Not Specified | 1800 |
| SM Application Clock | Not Specified | 1200 |
| Power Limit | Not Specified | 300 |
+------------------------------+------------------------------+------------------------------+
+------------------------------+------------------------------+------------------------------+
| GPU ID: 1 |
| PPU |
+==============================+==============================+==============================+
| Field | Target | Current |
+------------------------------+------------------------------+------------------------------+
| Compute Mode | Not Specified | Unrestricted |
| ECC Mode | Enabled | Disabled |
| Sync Boost | Not Specified | Not Supported |
| Memory Application Clock | Not Specified | 1800 |
| SM Application Clock | Not Specified | 1200 |
| Power Limit | Not Specified | 300 |
+------------------------------+------------------------------+------------------------------+
3.3.2 Modifying the Device Configuration
The --set option, combined with a modifier option (for example -e for ECC mode), changes a setting. For example, dcgmi config --set -e 1 enables ECC mode:
root@be1816e3958c:~# dcgmi config --set -e 1
Configuration successfully set.
The device settings that DCGM can modify are:
Option | Description |
-e --eccmode | Modify the ECC mode |
-a --appclocks | Modify the application clock frequencies |
-P --powerlimit | Modify the power limit |
-c --compmode | Modify the compute mode |
-s --syncboost | Not supported; this option is deprecated |
Modified settings are cached by DCGM. The --enforce option tells DCGM to apply each device's cached configuration to that device; for example, after a device reset, run dcgmi config --enforce to re-apply the previous configuration:
root@be1816e3958c:~# dcgmi config --enforce
Configuration successfully enforced.
3.4 Diagnosing Device State (diag)
The diag subcommand runs diagnostics that check and diagnose the state of PPU devices; DCGM supports several levels of diagnostics. Run dcgmi diag -h for help:
root@be1816e3958c:~# dcgmi diag -h
diag -- Used to run diagnostics on the system.
Usage: dcgmi diag
dcgmi diag --host <IP/FQDN> -g <groupId> -r <diag> -p
<test_name.variable_name=variable_value> -c
</full/path/to/config/file> -f <fakeGpuList> -i <gpuList> -v
--statsonfail --debugLogFile <debug file> --statspath <plugin
statistics path> -j --throttle-mask <> --fail-early
--check-interval <failure check interval> --iterations <iterations>
Flags:
...
-r --run diag Run a diagnostic. (Note: higher numbered tests
include all beneath.)
1 - Quick (System Validation ~ seconds)
2 - Medium (Extended System Validation ~ 2 minutes)
3 - Long (System HW Diagnostics ~ 15 minutes)
4 - Extended (Longer-running System HW Diagnostics)
Specific tests to run may be specified by name,
and multiple tests may be specified as a comma
separated list. For example, the command:
dcgmi diag -r "sm stress,diagnostic"
would run the SM Stress and Diagnostic tests together.
-p --parameters test_name.variable_name=variable_value Test parameters to set for this run.
-c --configfile /full/path/to/config/file Path to the configuration file.
-i --gpuList gpuList A comma-separated list of the gpus on which the
diagnostic should run. Cannot be used with -g.
-v --verbose Show information and warnings for each test.
...
--throttle-mask Specify which throttling reasons should be
ignored. You can provide a comma separated list
of reasons. For example, specifying 'HW_SLOWDOWN
,SW_THERMAL' would ignore the HW_SLOWDOWN and
SW_THERMAL throttling reasons. Alternatively, you
can specify the integer value of the ignore
bitmask. For the bitmask, multiple reasons may be
specified by the sum of their bit masks.
...
--fail-early Enable early failure checks for the Targeted Power
, Targeted Stress, SM Stress, and Diagnostic
tests. When enabled, these tests check for a
failure once every 5 seconds (can be modified by
the --check-interval parameter) while the test is
running instead of a single check performed after
the test is complete. Disabled by default.
--check-interval failure check interval Specify the interval (in seconds)
at which the early failure checks should occur
for the Targeted Power, Targeted Stress, SM
Stress, and Diagnostic tests when early failure
checks are enabled. Default is once every 5
seconds. Interval must be between 1 and 300
--iterations iterations Specify a number of iterations of the diagnostic
to run consecutively. (Must be greater than 0.)
...
The diag subcommand depends on PPU SDK components. Before using it, make sure the PPU SDK is installed correctly and that GCC and other toolchain versions satisfy the PPU SDK's runtime requirements.
3.4.1 Running Diagnostics
The -r option runs a set of diagnostic tests. The supported test sets are:
Test name | Coverage | Set 1: seconds | Set 2: < 2 min | Set 3: < 30 min | Set 4: 1-2 h |
software | Basic environment configuration, feature support | Yes | Yes | Yes | Yes |
pcie | PCIe test, ICN link test | | Yes | Yes | Yes |
memory | Memory allocation test, memory read/write test | | Yes | Yes | Yes |
memory_bandwidth | Memory bandwidth test | | Yes | Yes | Yes |
diagnostic | Compute test | | | Yes | Yes |
sm_stress | Compute stress test | | | Yes | Yes |
targeted_stress | Targeted compute test | | | Yes | Yes |
targeted_power | Targeted power test | | | Yes | Yes |
memtest | Memory stress test | | | | Yes |
The memory, memory_bandwidth, and memtest diagnostics require ECC to be enabled. It can be enabled on all devices with the following commands; the PPU devices must then be reset for the configuration to take effect:
dcgmi config --set -e 1
dcgmi config --enforce
# then reset the PPU devices
For example, dcgmi diag -r 1 runs diagnostic test set 1:
root@be1816e3958c:/work/deploy# dcgmi diag -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Skip |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Pass |
| Graphics Processes | N/A |
| Inforom | Skip |
+---------------------------+------------------------------------------------+
The -r option also accepts test names, to run only the named tests. Parameters can be passed to a test with the -p option in the form test_name.variable_name=variable_value; for example, to control the sm_stress run time, run dcgmi diag -r "sm_stress" -p sm_stress.test_duration=5:
root@be1816e3958c:/work/deploy# dcgmi diag -r "sm_stress" -p sm_stress.test_duration=5
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Pass |
| Graphics Processes | Pass |
| Inforom | Skip |
+----- Integration -------+------------------------------------------------+
+----- Hardware ----------+------------------------------------------------+
+----- Stress ------------+------------------------------------------------+
| SM Stress | Pass - All |
+---------------------------+------------------------------------------------+
The diag subcommand's tests are not a criterion for the return-and-replacement (RMA) process; use tools such as Field Diag and Bug Report to check for product quality issues.
3.4.2 Other Control Options
The -g option selects the GPU group on which to run the diagnostics, or the -i option specifies the GPU indices directly. The --throttle-mask option ignores selected conditions that could otherwise fail the diagnostics, for example failures caused by thermal protection. The --fail-early and --check-interval options enable periodic failure checks during the run, rather than a single check when the run ends, so a failing diagnostic can exit early.
The diag subcommand does not support diagnosing a mix of different PPU models. If the system contains multiple PPU models, use the -i or -g option to select devices of the same model for the diagnostic run.
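These options compose; a sketch of a targeted run (illustrative values, assuming GPUs 0 and 1 are the same model):

```shell
# Run diagnostic set 1 on GPUs 0 and 1 only, ignoring the HW_SLOWDOWN throttling
# reason, and abort early if a failure is detected (checked every 10 seconds).
dcgmi diag -r 1 -i 0,1 --throttle-mask HW_SLOWDOWN --fail-early --check-interval 10
```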
3.5 Viewing Topology Information (topo)
The topo subcommand shows the topology between GPUs. Run dcgmi topo -h for help:
root@be1816e3958c:~# dcgmi topo -h
topo -- Used to find the topology of GPUs on the system.
Usage: dcgmi topo
dcgmi topo --host <IP/FQDN> -g <groupId> -j
dcgmi topo --host <IP/FQDN> --gpuid <gpuId> -j
Flags:
-g --group groupId The group ID to query.
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default = localhost]
-h --help Displays usage information and exits.
--gpuid gpuId The GPU ID to query.
-j --json Print the output in a json format
...
3.5.1 Viewing Device Topology
The --gpuid option shows the topology links between a given PPU and the other devices; the -g option shows a summary of a whole GPU group's external links. For example, dcgmi topo --gpuid 0 shows how PPU 0 is connected to the other PPU devices:
root@be1816e3958c:~# dcgmi topo --gpuid 0
+-------------------+------------------------------------------------------------------------------+
| Topology Information |
| GPU ID: 0 |
+===================+==============================================================================+
| CPU Core Affinity | 0 - 47, 96 - 143 |
| To GPU 1 | Connected via a single PCIe switch |
| | Connected via one NVLINK (Link: 5) |
| To GPU 2 | Connected via a CPU-level link |
| | Connected via one NVLINK (Link: 3) |
| To GPU 3 | Connected via a CPU-level link |
| | Connected via one NVLINK (Link: 2) |
| To GPU 4 | Connected via a CPU-level link |
| To GPU 5 | Connected via a CPU-level link |
| To GPU 6 | Connected via a CPU-level link |
| To GPU 7 | Connected via a CPU-level link |
| | Connected via one NVLINK (Link: 0) |
+-------------------+------------------------------------------------------------------------------+
3.6 Viewing ICN Link Status (nvlink)
The nvlink subcommand shows ICN link status. Run dcgmi nvlink -h for help:
root@be1816e3958c:/work/deploy# dcgmi nvlink -h
nvlink -- Used to get NvLink link status or error counts for GPUs and
NvSwitches in the system
NVLINK Error description
=========================
CRC FLIT Error => Data link receive flow control digit CRC error.
CRC Data Error => Data link receive data CRC error.
Replay Error => Data link transmit replay error.
Recovery Error => Data link transmit recovery error.
Usage: dcgmi nvlink
dcgmi nvlink --host <IP/FQDN> -g <gpuId> -e -j
dcgmi nvlink --host <IP/FQDN> -s
Flags:
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default = localhost]
-e --errors Print NvLink errors for a given gpuId (-g).
-s --link-status Print NvLink link status for all GPUs and
NvSwitches in the system.
-h --help Displays usage information and exits.
-g --gpuid gpuId The GPU ID to query. Required for -e
-j --json Print the output in a json format
...
The nvlink subcommand does not yet support the ICN error queries (-e --errors).
3.6.1 Viewing Link Status
The -s option queries the ICN link status. An example run of dcgmi nvlink -s follows, where U means the link is up and D means it is down:
root@b5ab3167ed51:~# dcgmi nvlink -s
+----------------------+
| NvLink Link Status |
+----------------------+
GPUs:
gpuId 0:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
gpuId 1:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
gpuId 2:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
gpuId 3:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
gpuId 4:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
gpuId 5:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
gpuId 6:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
gpuId 7:
U D U U U U D _ _ _ _ _ _ _ _ _ _ _
NvSwitches:
No NvSwitches found.
Key: Up=U, Down=D, Disabled=X, Not Supported=_
3.7 Managing Device Policies (policy)
The policy subcommand sets and views PPU management policies, for example automatically resetting a device after problems such as ECC errors or overheating.
Run dcgmi policy -h for help:
root@be1816e3958c:~# dcgmi policy -h
policy -- Used to control policies for groups of GPUs. Policies control actions
which are triggered by specific events.
Usage: dcgmi policy
dcgmi policy --host <IP/FQDN> -g <groupId> --get -j
dcgmi policy --host <IP/FQDN> -g <groupId> --reg
dcgmi policy --host <IP/FQDN> -g <groupId> --set <actn,val> -M <max> -T
<max> -P <max> -e -p -n -x
Flags:
-g --group groupId The GPU group to query on the specified host.
...
--get Get the current violation policy.
--reg Register this process for policy updates. This
process will sit in an infinite loop waiting for
updates from the policy manager.
--set actn,val Set the current violation policy. Use csv action ,validation (ie. 1,2)
-----
Action to take when any of the violations
specified occur.
0 - None
1 - GPU Reset
-----
Validation to take after the violation action has been performed.
0 - None
1 - System Validation (short)
2 - System Validation (medium)
3 - System Validation (long)
--clear Clear the current violation policy.
-h --help Displays usage information and exits.
-v --verbose Display policy information per GPU.
-M --maxpages max Specify the maximum number of retired pages that
will trigger a violation.
-T --maxtemp max Specify the maximum temperature a group's GPUs can
reach before triggering a violation.
-P --maxpower max Specify the maximum power a group's GPUs can reach
before triggering a violation.
-e --eccerrors Add ECC double bit errors to the policy
conditions.
-p --pcierrors Add PCIe replay errors to the policy conditions.
-n --nvlinkerrors Add NVLink errors to the policy conditions.
-x --xiderrors Add XID errors to the policy conditions.
...
3.7.1 Viewing the Current Policy
The --get option shows the current policy configuration. An example of dcgmi policy --get:
root@be1816e3958c:~# dcgmi policy --get
Policy information
+-----------------------------+------------------------------------------------+
| Policy Information |
| DCGM_ALL_SUPPORTED_GPUS |
+=============================+================================================+
| Violation conditions | Double-bit ECC errors |
| | PCI errors and replays |
| | Max temperature threshold - 90 |
| Isolation mode | Manual |
| Action on violation | None |
| Validation after action | System Validation (Long) |
| Validation failure action | None |
+-----------------------------+------------------------------------------------+
Violation conditions: the conditions that trigger the policy, such as ECC errors, PCIe errors, or exceeding the temperature limit.
Action on violation: the action taken when the policy is triggered, such as resetting the PPU device.
Validation after action: the validation performed after the policy action has been taken.
3.7.2 Setting and Activating a Policy
Setting and activating a policy takes two steps:
1. Use the --set option together with options such as -M and -T to define the policy, for example the temperature threshold.
2. Use --reg to make this dcgmi process start polling for policy violations.
The --set option specifies the action to take on a violation (such as resetting the PPU) and the validation to run after that action has been performed. It must be combined with condition options; the supported conditions are:
Condition option | Description |
-M --maxpages | Retired page count reaches the configured limit |
-T --maxtemp | Device temperature exceeds the configured limit |
-P --maxpower | Device power exceeds the configured limit |
-e --eccerrors | Double-bit ECC errors occur |
-p --pcierrors | PCIe replay errors occur (not yet supported) |
-x --xiderrors | Driver (XID) errors occur |
For example, dcgmi policy --set 0,3 -T 90 -p -e sets the temperature, PCIe, and ECC policies; then dcgmi policy --reg activates the policy and starts checking. When the policy is violated, this process performs the corresponding action (such as resetting the affected device):
root@be1816e3958c:~# dcgmi policy --set 0,3 -T 90 -p -e
Policy successfully set.
root@be1816e3958c:~# dcgmi policy --reg
Listening for violations.
...
3.8 Checking Device Health (health)
The health subcommand monitors and reports device health, for example whether ECC errors have occurred, or whether temperature or power has ever exceeded the configured thresholds.
Run dcgmi health -h for help:
root@be1816e3958c:~# dcgmi health -h
health -- Used to manage the health watches of a group. The health of the GPUs
in a group can then be monitored during a process.
Usage: dcgmi health
dcgmi health --host <IP/FQDN> -g <groupId> -c -j
dcgmi health --host <IP/FQDN> -g <groupId> -f -j
dcgmi health --host <IP/FQDN> -g <groupId> -s <flags> -j -m <seconds> -u <seconds>
Flags:
-g --group groupId The GPU group to query on the specified host.
...
-f --fetch Fetch the current watch status.
-s --set flags Set the watches to be monitored. [default = pm]
a - all watches
p - PCIe watches (*)
m - memory watches (*)
i - infoROM watches
t - thermal and power watches (*)
n - NVLink watches (*)
(*) watch requires 60 sec before first query
--clear Disable all watches being monitored.
-c --check Check to see if any errors or warnings have
occurred in the currently monitored watches.
...
-m --max-keep-age seconds How long DCGM should cache the samples in seconds.
-u --update-interval seconds How often DCGM should retrieve health from the driver in seconds.
...
3.8.1 Viewing Enabled Watches
The -f option shows the currently enabled watches. For example, dcgmi health -f shows that the memory, PCIe, and other health watches are enabled:
root@be1816e3958c:~# dcgmi health -f
Health monitor systems report
+-----------------+--------------------------------------------------------------------+
| PCIe | On |
| NVLINK | On |
| Memory | On |
| SM | Off |
| InfoROM | On |
| Thermal | On |
| Power | On |
| Driver | Off |
| NvSwitch NF | Off |
| NvSwitch F | Off |
+-----------------+--------------------------------------------------------------------+
3.8.2 Setting Watches
The -s option enables the watches to monitor, and --clear disables all enabled watches.
For enabled watches, the -m option sets how long monitoring data is cached and the -u option sets how often device state is queried. For example, dcgmi health -s a -m 30 -u 1 enables all watches, caches 30 seconds of data, and queries the device state once per second:
root@be1816e3958c:~# dcgmi health -s a -m 30 -u 1
Health monitor systems set successfully.
3.8.3 Viewing Watch Errors
The -c option reports which of the enabled watches have detected errors. For example, dcgmi health -c:
root@be1816e3958c:~# dcgmi health -c
+---------------------------+----------------------------------------------------------+
| Health Monitor Report |
+===========================+==========================================================+
| Overall Health | Failure |
| GPU | |
| -> 0 | Failure |
| -> Errors | |
| -> NVLINK system | Failure |
| | GPU 0's NvLink link 0 is currently down Run a field |
| | diagnostic on the GPU. |
...
3.9 Managing Device Groups (group)
The group subcommand manages groups of devices and modules. For example, several PPU devices can be placed into one group, which other dcgmi subcommands can then reference with the -g option. PPU DCGM supports several grouping levels, such as PPU-level groups, ICN-link-level groups, and compute-instance-level groups.
PPU DCGM defines a default PPU device group with group ID 0 that contains all supported PPU devices. When another dcgmi subcommand is not given an explicit -g, it operates on group 0 (all PPU devices) by default.
Run dcgmi group -h for help:
root@be1816e3958c:~# dcgmi group -h
group -- Used to create and maintain groups of GPUs. Groups of GPUs can then be
uniformly controlled through other DCGMI subsystems.
Usage: dcgmi group
dcgmi group --host <IP/FQDN> -l -j
dcgmi group --host <IP/FQDN> -c <groupName> --default --defaultnvswitches
dcgmi group --host <IP/FQDN> -c <groupName> -a <entityId>
dcgmi group --host <IP/FQDN> -d <groupId>
dcgmi group --host <IP/FQDN> -g <groupId> -i -j
dcgmi group --host <IP/FQDN> -g <groupId> -a <entityId>
dcgmi group --host <IP/FQDN> -g <groupId> -r <entityId>
Flags:
...
-l --list List the groups that currently exist for a host.
-d --delete groupId Delete a group on the remote host.
-c --create groupName Create a group on the remote host.
-h --help Displays usage information and exits.
-i --info Display the information for the specified group ID.
-r --remove entityId Remove device(s) from group. (csv gpuIds, or
entityIds like gpu:0,nvswitch:994)
-a --add entityId Add device(s) to group. (csv gpuIds or entityIds
simlar to gpu:0, instance:1, compute_instance:2,
nvswitch:994)
--default Adds all available GPUs to the group being created.
--defaultnvswitches Adds all available NvSwitches to the group being created.
3.9.1 Viewing Device Groups
The -l option lists the existing groups. For example, dcgmi group -l:
root@be1816e3958c:~# dcgmi group -l
+-------------------+----------------------------------------------------------+
| GROUPS |
| 2 groups found. |
+===================+==========================================================+
| Groups | |
| -> 0 | |
| -> Group ID | 0 |
| -> Group Name | DCGM_ALL_SUPPORTED_GPUS |
| -> Entities | GPU 0, GPU 1 |
| -> 1 | |
| -> Group ID | 1 |
| -> Group Name | DCGM_ALL_SUPPORTED_NVSWITCHES |
| -> Entities | None |
+-------------------+----------------------------------------------------------+
The -i option shows the details of a group, with the group ID given by -g. For example, dcgmi group -g 0 -i:
root@be1816e3958c:~# dcgmi group -g 0 -i
+-------------------+----------------------------------------------------------+
| GROUP INFO |
+===================+==========================================================+
| 0 | |
| -> Group ID | 0 |
| -> Group Name | DCGM_ALL_SUPPORTED_GPUS |
| -> Entities | GPU 0, GPU 1 |
+-------------------+----------------------------------------------------------+3.9.2 管理设备编组
Use the -c option to create a group, together with -a to specify the devices it contains. For example, dcgmi group -c test -a gpu:0,1 creates a group named test and adds PPU 0 and PPU 1 to it. To add all available devices to a group in one step, use --default or --defaultnvswitches.
root@be1816e3958c:~# dcgmi group -c test -a gpu:0,1
Successfully created group "test" with a group ID of 2
Add to group operation successful.
Use the -r option to remove a member from a group. For example, dcgmi group -g 2 -r 1 removes PPU 1 from group 2:
root@be1816e3958c:~# dcgmi group -g 2 -r 1
Remove from group operation successful.
Use the -d option to delete a group. For example, dcgmi group -d 2 deletes group 2:
root@be1816e3958c:~# dcgmi group -d 2
Successfully removed group 2
3.10 Managing Field Groups (fieldgroup)
The fieldgroup subcommand manages groups of field IDs (field groups). A field group can later be referenced to query all fields in the group at once, for example via the -f option of the dmon subcommand.
Run dcgmi fieldgroup -h for help:
root@be1816e3958c:~# dcgmi fieldgroup -h
fieldgroup -- Used to create and maintain groups of field IDs. Groups of field
IDs can then be uniformly controlled through other DCGMI subsystems.
Usage: dcgmi fieldgroup
dcgmi fieldgroup --host <IP/FQDN> -l -j
dcgmi fieldgroup --host <IP/FQDN> -c <fieldGroupName> -f <fieldIds>
dcgmi fieldgroup --host <IP/FQDN> -i -g <fieldGroupId> -j
dcgmi fieldgroup --host <IP/FQDN> -d -g <fieldGroupId>
Flags:
...
-l --list List the field groups that currently exist for a host.
-i --info Display the information for the specified field group ID.
-d --delete Delete a field group on the remote host.
-c --create fieldGroupName Create a field group on the remote host.
-h --help Displays usage information and exits.
-g --fieldgroup fieldGroupId The field group to query on the specified host.
-f --fieldids fieldIds Comma-separated list of the field ids to add to a
field group when creating a new one.
...
3.10.1 Viewing Field Groups
Use the -l option to list existing field groups. For example, run dcgmi fieldgroup -l:
root@be1816e3958c:~# dcgmi fieldgroup -l
3 field groups found.
+-------------------+----------------------------------------------------------+
| FIELD GROUPS |
+===================+==========================================================+
| ID | 1 |
| Name | DCGM_INTERNAL_30SEC |
| Field IDs | 300 |
+-------------------+----------------------------------------------------------+
+-------------------+----------------------------------------------------------+
| FIELD GROUPS |
+===================+==========================================================+
| ID | 2 |
| Name | DCGM_INTERNAL_HOURLY |
| Field IDs | 501, 509, 510, 511, 512, 513 |
+-------------------+----------------------------------------------------------+
...
Use the -i option to show detailed information for a field group, with -g specifying the group. For example, run dcgmi fieldgroup -g 2 -i:
root@be1816e3958c:~# dcgmi fieldgroup -g 2 -i
+-------------------+----------------------------------------------------------+
| FIELD GROUPS |
+===================+==========================================================+
| ID | 2 |
| Name | DCGM_INTERNAL_HOURLY |
| Field IDs | 501, 509, 510, 511, 512, 513 |
+-------------------+----------------------------------------------------------+
3.10.2 Managing Field Groups
Use the -c option to create a field group, with -f specifying the field IDs it contains. For example, dcgmi fieldgroup -c temperature -f 140,150 creates a group named temperature containing the memory and PPU core temperature fields:
root@be1816e3958c:~# dcgmi fieldgroup -c temperature -f 140,150
Successfully created field group "temperature" with a field group ID of 4
Use the -d option to delete a field group, with -g specifying the group. For example, run dcgmi fieldgroup -d -g 4:
root@be1816e3958c:~# dcgmi fieldgroup -d -g 4
Successfully removed field group 4
3.11 Managing Profiling Metrics (profile)
Use the profile subcommand to view and manage profiling metrics. Run dcgmi profile -h for help:
root@457471bb3c5e:~# dcgmi profile -h
profile -- View available profiling metrics for GPUs
Usage: dcgmi profile
Flags:
--host IP/FQDN Connects to specified IP or fully-qualified domain
name. To connect to a host engine that was
started with -d (unix socket), prefix the unix
socket filename with 'unix://'. [default =
localhost]
-l --list List available profiling metrics for a GPU or
group of GPUs
--pause Pause DCGM profiling in order to run NVIDIA
developer tools like nvprof, nsight compute, or
nsight systems.
--resume Resume DCGM profiling that was paused previously
with --pause.
-h --help Displays usage information and exits.
-j --json Print the output in a json format
-i --gpu-id gpuId The comma seperated list of GPU IDs to query.
Default is supported GPUs on the system. Run
dcgmi discovery -l to check list of GPUs
available
-g --group-id groupId The group of GPUs to query on the specified host.
...
3.11.1 Listing Supported Profiling Metrics
Run dcgmi profile -l to list the available profiling metrics; use -i or -g to restrict the query to specific PPU devices:
root@457471bb3c5e:~# dcgmi profile -l
+----------------+----------+------------------------------------------------------+
| Group.Subgroup | Field ID | Field Tag |
+----------------+----------+------------------------------------------------------+
| A.0 | 1002 | sm_active |
| A.0 | 1003 | sm_occupancy |
...
Query profiling metrics by passing their Field IDs to dcgmi dmon -e, for example:
dcgmi dmon -e 1002,1003
3.11.2 Pausing and Resuming Metric Collection
Subscribing to profiling metrics (Field ID > 1000) occupies the PPU performance-data collection resources, which prevents other tools with a built-in DCGM server (such as dcgm-exporter) and the Asight tool from running. Run dcgmi profile --pause to pause profiling-metric collection so that such tools can run, then dcgmi profile --resume to resume collection. While paused, profiling metrics are reported as N/A. For example:
# Pause profiling-metric collection
dcgmi profile --pause
# Collect acu trace data
acu -o test_report -f python test_linear.py
# Resume profiling-metric collection
dcgmi profile --resume
4. DCGM API Support Status
PPU DCGM provides the dcgm shared library libdcgm.so; the dcgmi command-line tool, the nv-hostengine daemon, and hgdcgm-exporter are all built on this library. The interfaces exposed by libdcgm.so are documented in dcgm_agent.h.
The DCGM APIs supported by libdcgm.so are listed below:
API Name | Description | Supported |
dcgmInit | This method is used to initialize DCGM within this process. | Yes |
dcgmStartEmbedded | Start an embedded host engine agent within this process. | Yes |
dcgmProfGetSupportedMetricGroups | Get all of the profiling metric groups for a given GPU group | Yes |
dcgmGroupCreate | Used to create an entity group handle which can store one or more entity Ids as an opaque handle returned in pDcgmGrpId. | Yes |
dcgmFieldGroupCreate | Used to create a group of fields and return the handle in dcgmFieldGroupId | Yes |
dcgmWatchFields | Request that DCGM start recording updates for a given field collection | Yes |
dcgmUpdateAllFields | Tell the DCGM module to update all the fields being watched | Yes |
dcgmEntityGetLatestValues | Request latest cached field value for a group of fields for a specific entity | Yes |
dcgmFieldGroupDestroy | Used to remove a field group that was created with dcgmFieldGroupCreate | Yes |
DcgmFieldsTerm | Terminates the DcgmFields module. Call this once from inside your program | Yes |
dcgmGetGpuInstanceHierarchy | Gets the hierarchy of GPUs, GPU Instances, and Compute Instances by populating a list of each entity with a reference to their parent | Yes |
dcgmGroupAddDevice | Used to add specified GPU Id to the group represented by groupId. | Yes |
dcgmGetAllSupportedDevices | Get identifiers corresponding to all the DCGM-supported devices on the system. | Yes |
dcgmGetAllDevices | Get identifiers corresponding to all the devices on the system | Yes |
dcgmGetDeviceTopology | Gets device topology corresponding to the gpuId | Yes |
dcgmGetDeviceAttributes | Gets device attributes corresponding to the gpuId. | Yes |
dcgmHealthSet | Enable the DCGM health check system for the given systems defined in dcgmHealthSystems_t | Yes |
dcgmHealthCheck | Check the configured watches for any errors/failures/warnings that have occurred since the last time this check was invoked. | Yes |
5. Field ID Support Status
The field IDs supported by PPU DCGM are listed below:
Field ID Name | Field ID | Description | Supported |
DCGM_FI_DRIVER_VERSION | 1 | Driver Version | Yes |
DCGM_FI_NVML_VERSION | 2 | Underlying NVML version | Yes |
DCGM_FI_PROCESS_NAME | 3 | Process Name | Yes |
DCGM_FI_DEV_COUNT | 4 | Number of Devices on the node | Yes |
DCGM_FI_CUDA_DRIVER_VERSION | 5 | Cuda Driver Version | Yes |
DCGM_FI_DEV_NAME | 50 | Name of the GPU device | Yes |
DCGM_FI_DEV_NVML_INDEX | 52 | NVML index of this GPU | Yes |
DCGM_FI_DEV_SERIAL | 53 | Device Serial Number | Yes |
DCGM_FI_DEV_UUID | 54 | UUID corresponding to the device | Yes |
DCGM_FI_GPU_TOPOLOGY_PCI | 60 | Topology of all GPUs on the system via PCI (static) | Yes |
DCGM_FI_DEV_MIG_MODE | 67 | MIG mode for the device | Yes |
DCGM_FI_DEV_VBIOS_VERSION | 85 | VBIOS version of the device | Yes |
DCGM_FI_DEV_SM_CLOCK | 100 | SM clock frequency (in MHz) | Yes |
DCGM_FI_DEV_MEM_CLOCK | 101 | Memory clock frequency (in MHz) | Yes |
DCGM_FI_DEV_VIDEO_CLOCK | 102 | Video encoder/decoder clock for the device | Yes |
DCGM_FI_DEV_APP_SM_CLOCK | 110 | SM Application clocks | Yes |
DCGM_FI_DEV_APP_MEM_CLOCK | 111 | Memory Application clocks | Yes |
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | 112 | Current clock throttle reasons | Yes |
DCGM_FI_DEV_MAX_SM_CLOCK | 113 | Maximum supported SM clock for the device | Yes |
DCGM_FI_DEV_MAX_MEM_CLOCK | 114 | Maximum supported Memory clock for the device | Yes |
DCGM_FI_DEV_MAX_VIDEO_CLOCK | 115 | Maximum supported Video encoder/decoder clock for the device | Yes |
DCGM_FI_DEV_MEMORY_TEMP | 140 | Memory temperature (in C) | Yes |
DCGM_FI_DEV_GPU_TEMP | 150 | GPU temperature (in C) | Yes |
DCGM_FI_DEV_MEM_MAX_OP_TEMP | 151 | Maximum operating temperature for the memory of this GPU | Yes |
DCGM_FI_DEV_GPU_MAX_OP_TEMP | 152 | Maximum operating temperature for this GPU | Yes |
DCGM_FI_DEV_POWER_USAGE | 155 | Power draw (in W) | Yes |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | 156 | Total energy consumption since boot (in mJ) | No |
DCGM_FI_DEV_SLOWDOWN_TEMP | 158 | Slowdown temperature for the device | Yes |
DCGM_FI_DEV_SHUTDOWN_TEMP | 159 | Shutdown temperature for the device | Yes |
DCGM_FI_DEV_POWER_MGMT_LIMIT | 160 | Current Power limit for the device | Yes |
DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN | 161 | Minimum power management limit for the device | Yes |
DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX | 162 | Maximum power management limit for the device | Yes |
DCGM_FI_DEV_POWER_MGMT_LIMIT_DEF | 163 | Default power management limit for the device | Yes |
DCGM_FI_DEV_ENFORCED_POWER_LIMIT | 164 | Effective power limit that the driver enforces after taking into account all limiters | Yes |
DCGM_FI_DEV_PCIE_TX_THROUGHPUT | 200 | Total number of bytes transmitted through PCIe TX (in KB) via NVML | Yes |
DCGM_FI_DEV_PCIE_RX_THROUGHPUT | 201 | Total number of bytes received through PCIe RX (in KB) via NVML | Yes |
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | 202 | PCIe replay counter | No |
DCGM_FI_DEV_GPU_UTIL | 203 | GPU utilization (in %) | Yes |
DCGM_FI_DEV_MEM_COPY_UTIL | 204 | Memory utilization (in %) | Yes |
DCGM_FI_DEV_ENC_UTIL | 206 | Encoder utilization (in %) | Yes |
DCGM_FI_DEV_DEC_UTIL | 207 | Decoder utilization (in %) | Yes |
DCGM_FI_DEV_PCIE_MAX_LINK_GEN | 235 | PCIe Max Link Generation | Yes |
DCGM_FI_DEV_PCIE_MAX_LINK_WIDTH | 236 | PCIe Max Link Width | Yes |
DCGM_FI_DEV_PCIE_LINK_GEN | 237 | PCIe Current Link Generation | Yes |
DCGM_FI_DEV_PCIE_LINK_WIDTH | 238 | PCIe Current Link Width | Yes |
DCGM_FI_DEV_FB_TOTAL | 250 | Total Frame Buffer of the GPU in MB | Yes |
DCGM_FI_DEV_FB_FREE | 251 | Free Frame Buffer in MB | Yes |
DCGM_FI_DEV_FB_USED | 252 | Used Frame Buffer in MB | Yes |
DCGM_FI_DEV_ECC_CURRENT | 300 | Current ECC mode for the device | Yes |
DCGM_FI_DEV_ECC_PENDING | 301 | Pending ECC mode for the device | Yes |
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | 310 | Total number of single-bit volatile ECC errors | Yes |
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | 311 | Total number of double-bit volatile ECC errors | Yes |
DCGM_FI_DEV_ECC_SBE_AGG_TOTAL | 312 | Total number of single-bit persistent ECC errors | Yes |
DCGM_FI_DEV_ECC_DBE_AGG_TOTAL | 313 | Total number of double-bit persistent ECC errors | Yes |
DCGM_FI_DEV_ECC_SBE_VOL_DEV | 318 | Device memory single bit volatile ECC errors | Yes |
DCGM_FI_DEV_ECC_DBE_VOL_DEV | 319 | Device memory double bit volatile ECC errors | Yes |
DCGM_FI_DEV_ECC_SBE_AGG_DEV | 328 | Device memory single bit aggregate (persistent) ECC errors | Yes |
DCGM_FI_DEV_ECC_DBE_AGG_DEV | 329 | Device memory double bit aggregate (persistent) ECC errors | Yes |
DCGM_FI_DEV_RETIRED_SBE | 390 | Total number of retired pages due to single-bit errors | Yes |
DCGM_FI_DEV_RETIRED_DBE | 391 | Total number of retired pages due to double-bit errors | Yes |
DCGM_FI_DEV_RETIRED_PENDING | 392 | Total number of pages pending retirement | Yes |
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | 393 | Number of remapped rows for uncorrectable errors | Yes |
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | 394 | Number of remapped rows for correctable errors | Yes |
DCGM_FI_DEV_ROW_REMAP_FAILURE | 395 | Whether remapping of rows has failed | Yes |
DCGM_FI_DEV_ROW_REMAP_PENDING | 396 | Whether remapping of rows is pending | Yes |
DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL | 409 | Total number of NVLink flow-control CRC errors | No |
DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL | 419 | Total number of NVLink data CRC errors | No |
DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL | 429 | Total number of NVLink retries | No |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 | 440 | ICN Link Lane Bandwidth Counter | Yes |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L1 | 441 | ICN Link Lane Bandwidth Counter | Yes |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L2 | 442 | ICN Link Lane Bandwidth Counter | Yes |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L3 | 443 | ICN Link Lane Bandwidth Counter | Yes |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L4 | 444 | ICN Link Lane Bandwidth Counter | Yes |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L5 | 445 | ICN Link Lane Bandwidth Counter | Yes |
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | 449 | ICN Link Bandwidth Counter total for all Lanes | Yes |
DCGM_FI_DEV_NVLINK_BANDWIDTH_L6 | 475 | ICN Link Lane Bandwidth Counter | Yes |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | 1001 | Ratio of time the graphics engine is active | No |
DCGM_FI_PROF_SM_ACTIVE | 1002 | The ratio of cycles an SM has at least 1 warp assigned (in %) | Yes |
DCGM_FI_PROF_SM_OCCUPANCY | 1003 | The ratio of number of warps resident on an SM (in %) | Yes |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | 1004 | Ratio of cycles the tensor (HMMA) pipe is active (in %) | Yes |
DCGM_FI_PROF_DRAM_ACTIVE | 1005 | Ratio of cycles the device memory interface is active sending or receiving data (in %) | Yes |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | 1006 | Ratio of cycles the fp64 pipes are active (in %) | No |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | 1007 | Ratio of cycles the fp32 pipes are active (in %) | No |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | 1008 | Ratio of cycles the fp16 pipes are active (in %) | No |
DCGM_FI_PROF_PCIE_TX_BYTES | 1009 | The number of bytes of active PCIe tx (transmit) data including both header and payload. | Yes |
DCGM_FI_PROF_PCIE_RX_BYTES | 1010 | The number of bytes of active PCIe rx (read) data including both header and payload. | Yes |
DCGM_FI_PROF_NVLINK_TX_BYTES | 1011 | The total number of bytes of active NvLink tx (transmit) data including both header and payload. | Yes |
DCGM_FI_PROF_NVLINK_RX_BYTES | 1012 | The total number of bytes of active NvLink rx (read) data including both header and payload. | Yes |
DCGM_FI_PROF_NVLINK_L0_TX_BYTES | 1040 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L0_RX_BYTES | 1041 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L1_TX_BYTES | 1042 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L1_RX_BYTES | 1043 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L2_TX_BYTES | 1044 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L2_RX_BYTES | 1045 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L3_TX_BYTES | 1046 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L3_RX_BYTES | 1047 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L4_TX_BYTES | 1048 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L4_RX_BYTES | 1049 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L5_TX_BYTES | 1050 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L5_RX_BYTES | 1051 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L6_TX_BYTES | 1052 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_NVLINK_L6_RX_BYTES | 1053 | ICN per link bandwidth | Yes |
DCGM_FI_PROF_KSD_HIT_RATE | 6001 | Ratio of KSD hit rate | Yes |
DCGM_FI_PROF_KVD_HIT_RATE | 6002 | Ratio of KVD hit rate | Yes |
DCGM_FI_PROF_L2_HIT_RATE | 6003 | Ratio of L2 cache hit rate | Yes |
DCGM_FI_PROF_LLC_HIT_RATE | 6004 | Ratio of LLC cache hit rate | No |
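As an illustration of putting the table above to use, the sketch below maps a few human-readable aliases to field IDs taken from this document and builds the corresponding dcgmi dmon command line. The alias names are invented here for convenience, and the use of -i to restrict GPU IDs mirrors the dcgmi profile usage shown earlier and is an assumption for dmon; only the field IDs themselves come from the table.

```python
# Aliases (invented here) for DCGM field IDs listed in the table above.
FIELD_IDS = {
    "gpu_temp": 150,   # DCGM_FI_DEV_GPU_TEMP
    "mem_temp": 140,   # DCGM_FI_DEV_MEMORY_TEMP
    "power": 155,      # DCGM_FI_DEV_POWER_USAGE
    "gpu_util": 203,   # DCGM_FI_DEV_GPU_UTIL
    "sm_active": 1002, # DCGM_FI_PROF_SM_ACTIVE
}

def dmon_command(metrics, gpu_ids=None):
    """Build a 'dcgmi dmon' command line that watches the given metrics."""
    ids = ",".join(str(FIELD_IDS[m]) for m in metrics)
    cmd = ["dcgmi", "dmon", "-e", ids]
    if gpu_ids:
        cmd += ["-i", ",".join(str(g) for g in gpu_ids)]
    return " ".join(cmd)

print(dmon_command(["gpu_temp", "power"], gpu_ids=[0, 1]))
# dcgmi dmon -e 150,155 -i 0,1
```

Keeping the ID-to-name mapping in one place makes it easy to restrict monitoring scripts to fields marked "Yes" above.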
6. dcgm-exporter Tool
PPU DCGM provides the dcgm-exporter tool for easy integration into Kubernetes environments, allowing external systems to query and manage PPU devices.
6.1 Running Steps
After extracting the PPU DCGM package (see "Obtaining and Installing"), you will find an hgdcgm-exporter folder containing the dcgm-exporter executable and default-counters.csv. Edit default-counters.csv to control which metrics are exposed. Example contents of default-counters.csv:
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
# PCIE
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
# Utilization (the sample period varies depending on the product)
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_TOTAL, gauge, Framebuffer memory total (in MiB)
...
Start dcgm-exporter by running ./dcgm-exporter; port 9400 is the exporter's HTTP endpoint:
# Start the dcgm-exporter service
./dcgm-exporter &
# Fetch the metrics
curl localhost:9400/metrics
Example output:
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 1500
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 1800
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 36
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 38
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 130.730000
# HELP DCGM_FI_DEV_PCIE_TX_THROUGHPUT Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# TYPE DCGM_FI_DEV_PCIE_TX_THROUGHPUT counter
DCGM_FI_DEV_PCIE_TX_THROUGHPUT{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 0.000000
# HELP DCGM_FI_DEV_PCIE_RX_THROUGHPUT Total number of bytes received through PCIe RX (in KB) via NVML.
# TYPE DCGM_FI_DEV_PCIE_RX_THROUGHPUT counter
DCGM_FI_DEV_PCIE_RX_THROUGHPUT{gpu="0",UUID="GPU-019ea108-c170-021e-0000-00002069567c",device="ppu0",modelName="PPU",Hostname="ubuntu",DCGM_FI_DRIVER_VERSION="0.8.0"} 0.000000
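The /metrics output shown here is plain Prometheus exposition text and can be consumed without a special client library. A minimal parsing sketch, using only the standard library (the sample line is copied from the output above):

```python
import re

def parse_metric(line):
    """Parse one Prometheus exposition line into (name, labels, value).

    Returns None for comment lines such as '# HELP ...' and '# TYPE ...'.
    """
    m = re.match(r'(\w+)\{(.*)\}\s+(\S+)$', line)
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, float(value)

sample = ('DCGM_FI_DEV_GPU_TEMP{gpu="0",'
          'UUID="GPU-019ea108-c170-021e-0000-00002069567c",'
          'device="ppu0",modelName="PPU",Hostname="ubuntu",'
          'DCGM_FI_DRIVER_VERSION="0.8.0"} 38')

name, labels, value = parse_metric(sample)
print(name, labels["gpu"], value)
# DCGM_FI_DEV_GPU_TEMP 0 38.0
```

In production, a Prometheus server scraping the 9400 endpoint handles this automatically; the sketch is only meant to show the structure of each metric line (name, label set, value).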
...
6.2 Using Together with the dcgmi Command-Line Tool
When the dcgmi command-line tool and dcgm-exporter run on the same system, dcgm-exporter uses an embedded DCGM server by default. This results in multiple subscribers to PPU monitoring data, which increases monitoring overhead and causes subscriptions to profiling metrics (Field ID > 1000) to fail.
Use the dcgm-exporter -r option to specify the address and port of the nv-hostengine daemon, so that dcgm-exporter obtains monitoring data through nv-hostengine. For example:
./dcgm-exporter -r localhost:5555
7. FAQ
7.1 Using DCGM Tools Inside Docker
When starting a container with docker run, remove the --gpus option. Instead, either add --privileged to create a privileged container, or use --device to map PPU devices directly, giving an unprivileged container access to the PPU devices on the host.
To start a container with all cards, use a command like the following (--privileged):
docker run --privileged --ipc=host --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 --init -it -v $HOME:/mnt -w /workspace --name test_$USER <DOCKER_NAME> /bin/bash
To start a container with specific PPU devices only, use a command like the following. For example, to use only PPU 0 (--device=/dev/alixpu_ppu0 --device=/dev/alixpu --device=/dev/alixpu_ctl):
docker run --device=/dev/alixpu_ppu0 --device=/dev/alixpu --device=/dev/alixpu_ctl --ipc=host --shm-size=4g --ulimit memlock=-1 --ulimit stack=67108864 --init -it -v $HOME:/mnt -w /workspace --name tf_$USER <DOCKER_NAME> /bin/bash
8. Known Issues
SDK v1.0.0: dcgm cannot be used on systems with vGPU enabled.
SDK v1.0.0: the queried application clock frequency may be lower than the actual value.
SDK v1.0.0: query results for field IDs greater than 1000 may be inaccurate.
SDK v1.2.0: dcgmi and the dcgm API do not yet support utilization fields such as TensorCore/Pipe.
When running without root privileges, functions such as modifying device configuration are unavailable.
dcgmi does not yet support the stats subcommand.
The dcgmi nvlink subcommand does not yet support icn error queries.
The dcgmi policy subcommand does not support setting pcie/icn error policy thresholds.
The dcgmi health subcommand does not support watching infoROM/thermal and power/nvlink.
ICN link throughput is reported as raw throughput, including overhead such as message headers.
Policy-violation settings and queries are not supported.
Sample queries, such as periodic temperature/power samples, are not supported.
vGPU utilization and process-information queries are not supported.
Persistence mode is not supported.
Performance state features are not supported.
Adjusting the memory clock frequency is not supported.
The memory_bandwidth diagnostic test fails if the PPU has not reached its maximum frequency.
In rare cases, starting nv-hostengine reports "DCGM Failed to find any GPUs on the node"; a possible cause is leftover files from an older KMD or PPU SDK in the server environment.