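Compiler User Guide
1. Introduction
ppu-clang is a compiler built on LLVM 13.0.1 that is compatible with the CUDA programming model and language. It takes CUDA C/C++ source code through a series of transformations and produces an executable that runs on the PPU.
1.1 Introduction to the CUDA Programming Model
The ppu-clang toolkit compiles CUDA C/C++ source code into a program with two parts. One part is the main control program, which runs as a process on the CPU. The other part is the device code, which runs on the PPU acting as a coprocessor (accelerator) and handles single-batch or multi-batch parallel jobs without intervention from the host process, so that parallel execution achieves the best performance. The overall set of guidelines for CUDA source code is called the CUDA Programming Model; for more information, see the CUDA C++ Programming Guide. The highest version currently supported by the ppu-clang compiler is CUDA-12.9.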
1.2 CUDA C/C++
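CUDA C/C++ is an extension of the C language. It includes a collection of functions implemented in C++, attributes that distinguish host code from device code, and specifiers for different kinds of data storage. Many of these extension functions take numerous parameters and are called in much the same way as in a standard C program; they are extended at execution time to make better use of the parallel threads on a GPGPU. Host compiler support is currently as follows: gcc host compiler versions in the range [5.5 - 14.2], and clang host compiler versions in the range [clang 9 - clang 18].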
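1.3 Role of the ppu-clang compiler
The ppu-clang compilation process consists of several steps, including splitting the CUDA source, preprocessing, compilation, and merging of binaries. ppu-clang drives all of the sub-programs and hides the details of CUDA compilation from developers. It accepts a range of conventional compiler options, such as custom macros, custom paths, and custom library paths, as well as some options that are accepted only for the PPU. After receiving the options, it separates them and forwards each one to the appropriate host or device compiler tool; not every option is passed on to the device compiler.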
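1.4 Compilation compatibility with NVCC
To reduce the cost of maintaining and porting projects, the SDK package provides CUDA compatibility. To compile CUDA C/C++ source code, developers can either use the standard ppu-clang toolchain or keep the nvcc-based build they use with the NV CUDA SDK.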

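As shown in the figure above, the first step of compilation passes the developer's original nvcc command line through the nvcc wrapper conversion tool, which converts it into a command line that ppu-clang supports. Developers can refer directly to NVIDIA's official CUDA nvcc compiler manual.
With the cuda wrapper tool in place, the compiler supports the vast majority of options. The remaining 30 unsupported options relate to dynamic parallelism, the host splitter, Windows-only usage, error numbers, and similar features.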
1.4.1 Host/device separation
Currently gcc is the only supported host compiler. The following host-side option is not supported:
-nohdinitlist, --no-host-device-initializer-list
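1.4.2 Special options tied to the NVCC platform
The options in this part are largely specific to the NVCC toolchain and are currently not supported by the compiler. The affected options are:
-target-dir, --target-directory
-no-compress, --no-compress
--no-align-double
-nodlink, --no-device-link
-preserve-relocs, --preserve-relocs
-dump-callgraph, --dump-callgraph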
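Before migrating an existing build, it can help to scan its nvcc command lines for the options listed in sections 1.4.1 and 1.4.2. The following is a minimal sketch, not part of the SDK: the helper name `find_unsupported` and the sample command line are hypothetical, and the option set is copied from this guide rather than queried from the nvcc wrapper.

```python
# Options this guide lists as unsupported (sections 1.4.1 and 1.4.2),
# in both their short and long spellings.
UNSUPPORTED_NVCC_OPTIONS = {
    "-nohdinitlist", "--no-host-device-initializer-list",
    "-target-dir", "--target-directory",
    "-no-compress", "--no-compress",
    "--no-align-double",
    "-nodlink", "--no-device-link",
    "-preserve-relocs", "--preserve-relocs",
    "-dump-callgraph", "--dump-callgraph",
}


def find_unsupported(nvcc_args):
    """Return the arguments whose option name is in the unsupported list."""
    hits = []
    for arg in nvcc_args:
        name = arg.split("=", 1)[0]  # drop an attached "=value", if any
        if name in UNSUPPORTED_NVCC_OPTIONS:
            hits.append(arg)
    return hits


if __name__ == "__main__":
    # Hypothetical command line, used only to exercise the check.
    cmd = ["nvcc", "-arch=sm_80", "--no-device-link", "-c", "kernel.cu"]
    print(find_unsupported(cmd))
```

The check only covers the options named in this guide; the other unsupported options mentioned in section 1.4 are not enumerated here, so an empty result does not guarantee that a command line is fully supported.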
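1.4.3 Macro compatibility
NVCC uses many predefined macros. The nvcc wrapper supports most of them; only a few macros that are strongly tied to the machine platform are not yet supported. The macros that are currently unsupported are:
__CUDACC_EWP__
__NVCC_DIAG_PRAGMA_SUPPORT__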
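1.4.4 Mixed compilation
When the NVCC command line is used, the default (-arch=sm_80) build product is a mixed fatbin that supports the full product series. A specific extension option selects a particular build product:
-arch=sm_80a compiles a fatbin for the 真武810E series only.
As an exception, -cubin cannot produce a mixed cubin; with the default (-arch=sm_80) it compiles a product for the 真武810E series only.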
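The rules of this section can be summarized as a small decision function. This is a sketch of the documented behavior only: the function name and the returned labels are hypothetical, and treating -arch=sm_80a combined with -cubin as a single-target build is an assumption, since the guide does not state that case explicitly.

```python
SINGLE_TARGET = "810E-only"   # hypothetical label for a 810E-only product
MIXED = "mixed-full-series"   # hypothetical label for the mixed fatbin


def nvcc_build_product(arch="sm_80", cubin=False):
    """Return which product an nvcc-style -arch value yields, per section 1.4.4."""
    if arch == "sm_80a":
        return SINGLE_TARGET
    if arch == "sm_80":
        # -cubin cannot hold a mixed product, so it falls back to one target.
        return SINGLE_TARGET if cubin else MIXED
    raise ValueError(f"-arch={arch} is not covered by this section")


print(nvcc_build_product("sm_80"))
print(nvcc_build_product("sm_80", cubin=True))
```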
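1.5 CUDA Inline PTX compatibility
Besides the CUDA C/C++ device programming interface, NVIDIA also lets developers use inline PTX for lower-level control. For compatibility, ppu-clang supports most inline PTX instructions. For usage details, developers can refer directly to NVIDIA's official PTX reference manual and the inline PTX usage manual.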
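1.6 Programming restrictions
For CUDA C/C++ and inline PTX, support currently extends up to PTX 8.8, but some differences from NV CUDA-12.9 remain:
The CUDA C++ extension APIs and inline PTX instructions related to Texture and Surface are partially supported;
The CUDA C++ extension APIs and inline PTX instructions related to Dynamic Parallelism are not supported, and using them causes a runtime error;
The {.level::eviction_priority} and {.level::prefetch_size} qualifiers on ld/st instructions in inline PTX are not supported; they are ignored and do not affect compilation or execution;
Instructions and operands related to the cache eviction policy in inline PTX are not supported and cause a compilation error;
Some instructions introduced after PTX 7.6 are not supported by the hardware architecture; they compile successfully but cause a runtime error;
The device compilation flow is CUDA device C++ file -> LLVM (Hgvm) IR -> device binary, and does not include a step that outputs PTX. Compilation (or codegen) flows written for other platforms that include a PTX stage need to be adjusted;
CUDA mma PTX and the related data movement instructions are partially supported, covering dense mma instructions for specific data types (.u8/.s8/.tf32/.bf16/.f16).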
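2. Compilation phases
The PPU software stack uses the HGGC identifier to distinguish itself from CUDA, which makes it convenient either to compile independently or to stay compatible with the CUDA ecosystem.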
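2.1 ppu-clang compiler identification macros
ppu-clang predefines the following macros:
HGGCCC
Defined when compiling cu/hggc source files.
HGGC_ARCH
Defined when compiling device code in cu/hggc source files.
HGGCCC_VER_MAJOR
Defined as the major version number of the ppu compiler.
HGGCCC_VER_MINOR
Defined as the minor version number of the ppu compiler.
HGGCCC_VER_BUILD
Defined as the build version number of the ppu compiler.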
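One way to inspect these macros from a build script is to dump the predefined macros (the -dM option together with -E, described in section 4.2.2) and parse the result. The sketch below assumes the dump consists of plain `#define NAME VALUE` lines and uses the macro names exactly as they are spelled in the list above; the sample text and its version numbers are placeholders, so check the names and values against your own compiler's output.

```python
def parse_macro_dump(text):
    """Parse '#define NAME VALUE' lines into a {name: value} dict."""
    macros = {}
    for line in text.splitlines():
        parts = line.split(None, 2)  # '#define', name, optional value
        if len(parts) >= 2 and parts[0] == "#define":
            macros[parts[1]] = parts[2] if len(parts) == 3 else ""
    return macros


def compiler_version(macros):
    """Return (major, minor, build), or None if a version macro is missing."""
    keys = ("HGGCCC_VER_MAJOR", "HGGCCC_VER_MINOR", "HGGCCC_VER_BUILD")
    if not all(key in macros for key in keys):
        return None
    return tuple(int(macros[key]) for key in keys)


# Hypothetical dump; the version numbers are placeholders, not real values.
SAMPLE = """\
#define HGGCCC 1
#define HGGCCC_VER_MAJOR 1
#define HGGCCC_VER_MINOR 2
#define HGGCCC_VER_BUILD 3
"""

print(compiler_version(parse_macro_dump(SAMPLE)))
```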
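2.2 ppu-clang compilation phases
The whole compilation is a transformation process driven by the compiler driver, and each compilation phase is divided into smaller compilation steps. Options such as -### can display these steps, but they are intended for debugging only: the smaller steps may change from one version to the next, so build scripts should not depend on them. The recognized file name suffixes and the supported compilation phases are listed below.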
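2.3 Supported input file suffixes
The following table defines how ppu-clang interprets its input files:
Input file suffix | Description |
.cu, .hggc | CUDA/HGGC source file containing host code and device code |
.hggci | Preprocessed file |
.bc | Intermediate code file produced by compilation |
.opt.bc | Optimized intermediate file |
.o | Object file before linking |
.out | Output file after linking |
.hgfb | PPU fatbin file |
.c | C source file |
.cc .cxx .cpp | C++ source file |
.lib | Library file |
.so | Shared object file |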
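The table above can be mirrored in a build script to decide how a file will be treated. This is a minimal sketch rather than the driver's actual logic: the function name is hypothetical, and it assumes the optimized intermediate file is matched by the suffix `.opt.bc`, which is why longer suffixes are tried first.

```python
SUFFIX_KINDS = {
    ".cu": "CUDA/HGGC source (host and device code)",
    ".hggc": "CUDA/HGGC source (host and device code)",
    ".hggci": "preprocessed file",
    ".opt.bc": "optimized intermediate file",
    ".bc": "intermediate code file",
    ".o": "object file before linking",
    ".out": "output file after linking",
    ".hgfb": "PPU fatbin file",
    ".c": "C source file",
    ".cc": "C++ source file",
    ".cxx": "C++ source file",
    ".cpp": "C++ source file",
    ".lib": "library file",
    ".so": "shared object file",
}


def classify_input(filename):
    """Return the kind for the longest matching suffix, or None if unknown."""
    for suffix in sorted(SUFFIX_KINDS, key=len, reverse=True):
        if filename.endswith(suffix):
            return SUFFIX_KINDS[suffix]
    return None


print(classify_input("kernel.cu"))
print(classify_input("kernel.ppu.opt.bc"))
```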
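2.4 Inputs and outputs of the supported phases
The following table lists the supported compilation phases and the driver tool and options that ppu-clang uses to execute each phase. It also lists the default name of the output file generated by each phase, which takes effect when no option explicitly specifies the output file name.
Side | Compilation phase | Driver tool | Options | Default output file name |
device | Preprocessing | clang | -E -triple alippu | .ppu.hggci |
device | IR generation | clang | -x hggc | .ppu.bc |
device | Optimization | opt | | .ppu.opt.bc |
device | Compilation | llc | | .ppu.o |
device | Linking | lld | | .ppu.out |
device | fatbin generation | clang-offload-bundler | | .hgfb |
host | Preprocessing | clang | -E -triple x86_64_unknown-linux-gnu | host.hggci |
host | IR generation | clang | -cc1 | host.bc |
host | Compilation and optimization | clang | -cc1as | .s |
host | Linking | ld | | .o, .obj |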
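3. ppu-clang compilation workflow
The overall workflow for compiling CUDA source with ppu-clang is as follows. The input program is preprocessed and compiled for the device into binary code, which is placed in a single binary file. The input program is then preprocessed again for host compilation: the CUDA-specific C++ extensions are transformed into standard C++ constructs, and the C++ host compiler then compiles the host code together with the embedded device binary into an executable that the host can invoke. For the PPU, compilation is therefore divided into two parts, device and host. The complete compilation flow is shown in the figure: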

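4. ppu-clang options
4.1 Option types and notation
ppu-clang recognizes three types of options: boolean options, single-value options, and list options. Boolean options take no argument; they are either specified on the command line or not. Single-value options may be specified at most once, and list options may be repeated. Single-value and list options must have an argument, which must follow the option name and be separated from it by one or more spaces or by an equals sign. Some options have a short name in addition to the long name; the two are interchangeable, and the long name is used by default in this guide.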
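The three option types can be illustrated with a toy parser. It is a sketch of the rules stated above, not the driver's real implementation: the classification of `-c`, `-o` and `--ppu-arch` as boolean, single-value and list options is an assumption made for the example, and anything that is not in the table is simply treated as an input file.

```python
# Illustrative classification only; the option names come from section 4.2.
OPTION_KINDS = {"-c": "boolean", "-o": "single", "--ppu-arch": "list"}


def parse_options(argv):
    """Group arguments by option, enforcing the boolean/single/list rules."""
    parsed, inputs = {}, []
    args = iter(argv)
    for arg in args:
        name, sep, value = arg.partition("=")
        kind = OPTION_KINDS.get(name)
        if kind is None:
            inputs.append(arg)  # not a known option: treat as an input file
        elif kind == "boolean":
            parsed[name] = True
        else:
            if not sep:
                value = next(args, None)  # argument given after a space
            if value is None:
                raise ValueError(f"{name} requires an argument")
            if kind == "single":
                if name in parsed:
                    raise ValueError(f"{name} may be specified at most once")
                parsed[name] = value
            else:
                parsed.setdefault(name, []).append(value)
    return parsed, inputs


print(parse_options(["-c", "k.cu", "-o", "k.o", "--ppu-arch=ppu001"]))
```

Both argument spellings are accepted: `--ppu-arch=ppu001` uses the equals sign, while `-o k.o` uses a space.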
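4.2 Option descriptions
This section describes the ppu-clang options. The long option name is given in the first column and the short option name in the second. Repeating a list option is reported as an error.
4.2.1 Options for specifying files and paths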
-cxx-isystem
Add directory to the C++ SYSTEM include search path
-D <macro>=<value>
Define <macro> to <value> (or 1 if <value> omitted)
-fbuiltin-module-map
Load the clang builtins module map file.
-o <file>
Write output to <file>.
-fdebug-macro
Emit macro debug information
-fdiagnostics-absolute-paths
Print absolute paths in diagnostics
-fexceptions
Enable support for exception handling
--fhggc-libdevice-path=
Path to libdevice library files
-fprebuilt-module-path=
Specify the prebuilt module path
-F
Add directory to framework include search path
--hggc-libdevice-path=
Specify the directory that contains the libdevice library files.
-I-
Restrict all prior -I flags to double-quoted inclusion and remove current directory from include path
-ibuiltininc
Enable builtin #include directories even when -nostdinc is used before or after -ibuiltininc. Using -nobuiltininc after the option disables it
-idirafter
Add directory to AFTER include search path
-iframeworkwithsysroot
Add directory to SYSTEM framework search path, absolute paths are relative to -isysroot
-iframework
Add directory to SYSTEM framework search path
-include-pch
Include precompiled header file
-include
Include file before parsing
-isystem
Add directory to SYSTEM include search path
-isysroot
Set the system root directory (usually /)
-I
Add directory to the end of the list of include search paths
-l
Add the library at the link stage.
-L
Add directory to library search path
-module-dependency-dir
Directory to dump module dependencies to
-MF <file>
Write depfile output from -MMD, -MD, -MM, or -M to <file>
-MP
Create phony target for each dependency (other than main file)
-nobuiltininc
Disable builtin #include directories
-nostdinc++
Disable standard #include directories for the C++ standard library
-output-path=
Path for output target in dependency file.
-print-resource-dir
Print the resource directory pathname
-print-runtime-dir
Print the directory pathname containing clangs runtime libraries
-print-search-dirs
Print the paths used for finding libraries and programs
-stdlib++-isystem
Use directory as the C++ standard library include path
-U
Undefine macro
-working-directory
Resolve file paths relative to the specified directory
--hgas-path=
Path to hgas (used to assemble HGGC device code)
--archiver-binary=
Specify the archiver tool path
4.2.2 Options for specifying the compilation phase
-c
Only run preprocess, compile, and assemble steps
--hggc-device-only -E
Run only the preprocessor. Preprocessing is supported for device code only.
-emit-ast
Emit Clang AST files for source inputs
-emit-interface-stubs
Generate Interface Stub Files.
-emit-llvm
Use the LLVM representation for assembler and object files
-emit-hggcbin
Emit hggc bin files
-emit-merged-ifs
Generate Interface Stub Files, emit merged text not binary.
-fas-with-hgas
Assemble hggc device code with hgas
-dD
Print macro definitions in -E mode in addition to normal output
-dI
Print include directives in -E mode in addition to normal output
-dM
Print macro definitions in -E mode instead of normal output
-D <macro>=<value>
Define <macro> to <value> (or 1 if <value> omitted)
-fc++-abi=
C++ ABI to use. This will override the target C++ ABI.
-mstackrealign
Force realign the stack at entry to every function.
-mix-E
Compile device code and preprocess host code. This option cannot be combined with -fgpu-rdc.
-fchar8_t
Enable C++ builtin type char8_t
-fcolor-diagnostics
Enable colors in diagnostics
-fcxx-exceptions
Enable C++ exceptions
-fdebug-macro
Emit macro debug information
-fexceptions
Enable support for exception handling
-fgnu89-inline
Use the gnu89 inline semantics
-fkeep-static-consts
Keep static const variables if unused
-fno-c++-static-destructors
Disable C++ static destructor registration
-fno-char8_t
Disable C++ builtin type char8_t
-fno-color-diagnostics
Disable colors in diagnostics
-fno-debug-macro
Do not emit macro debug information
-M
Write a depfile containing user and system headers. Like -MD, but also implies -E and writes to stdout by default.
-MM
Like -MMD, but also implies -E and writes to stdout by default.
-MMD
Write a depfile containing user headers
-MD
Write a depfile containing user and system headers.
-pedantic
Warn on language extensions
4.2.3 Options for specifying compiler/linker behavior
-x <language>
Treat subsequent input files as having type <language>
-std=
Determine the language standard. This option is currently only supported when compiling C or C++. The compiler can accept several base standards, such as c90 or c++98, and GNU dialects of those standards, such as gnu90 or gnu++98. When a base standard is specified, the compiler accepts all programs following that standard plus those using GNU extensions that do not contradict it. For example, -std=c90 turns off certain features of GCC that are incompatible with ISO C90, such as the "asm" and "typeof" keywords, but not other GNU extensions that do not have a meaning in ISO C90, such as omitting the middle term of a "?:" expression. On the other hand, when a GNU dialect of a standard is specified, all features supported by the compiler are enabled, even when those features change the meaning of the base standard. As a result, some strict-conforming programs may be rejected. The particular standard is used by -Wpedantic to identify which features are GNU extensions given that version of the standard. For example, -std=gnu90 -Wpedantic warns about C++ style // comments, while -std=gnu99 -Wpedantic does not.
Allowed values: {c11, c1x, gnu90, gnu89, gnu99, gnu9x, gnu11, gnu1x, c++11, c++0x, c++14, c++1y, c++1z, c++20}
c11 c1x
ISO C11, the 2011 revision of the ISO C standard. This standard is substantially completely supported, modulo bugs, floating-point issues (mainly but not entirely relating to optional C11 features from Annexes F and G) and the optional Annexes K (Bounds-checking interfaces) and L (Analyzability). The name c1x is deprecated.
c++11 c++0x
The 2011 ISO C++ standard plus amendments. The name c++0x is deprecated.
c++14 c++1y
The 2014 ISO C++ standard plus amendments. The name c++1y is deprecated.
c++1z
The next revision of the ISO C++ standard, tentatively planned for 2017. Support is highly experimental, and will almost certainly change in incompatible ways in future releases.
c++20
The next revision of the ISO C++ standard, tentatively planned for 2020. Support is highly experimental, and will almost certainly change in incompatible ways in future releases.
-Xarch_host -g
Generate source-level debug information for host code.
-Xarch_device -g
Generate source-level debug information for Device code.
-gline-tables-only
Emit debug line number tables only
-O
Specify optimization level for host code.
-ftemplate-backtrace-limit
Set the maximum number of entries to print in a template instantiation backtrace (0 = no limit).
-fno-exceptions
Disable exception handling
--no-host-device-initializer-list
Do not implicitly treat member functions of std::initializer_list as __host__ __device__ functions.
--expt-relaxed-constexpr
Allow host code to invoke device constexpr functions, and device code to invoke host constexpr functions.
--expt-extended-lambda
Allow host, device annotations in lambda declarations.
-fdlto/-fno-dlto
Perform or Disable link-time optimization of device code. Link-time optimization must be specified at both compile and link time; at compile time it stores high-level intermediate code, then at link time it links together and optimizes the intermediate code.
--host-linker-script=<use-lcs / gen-lcs>
Generate a host linker script.
--augment-host-linker-script
Generate a host linker script that augments an existing host linker script. Must be used in combination with --host-linker-script.
--host-relocatable-link
Generate a host linker script that can be used in a host relocatable link. Must be used in combination with --host-linker-script.
-ccc-gcc-name=
Specify the name of the native GCC compiler.
4.2.4 Options passed to specific compilation phases
-Xassembler <arg>
Pass <arg> to the assembler
-Xclang <arg>
Pass <arg> to the clang compiler
-Xhggclink <arg>
Pass <arg> to the HGGC linker
-Xhggcllc <arg>
Pass <arg> to llc
-Xlinker <arg>
Pass <arg> to the linker
-Xarchive <arg>
Pass <arg> to the archiver.
-z <arg>
Pass -z <arg> to the linker
--options-file <file>,...
Include command line options from specified file.
-optf <file>,...
Include command line options from specified file.
4.2.5 Options that steer compiler driver behavior
--compatible-mode
Use SDK-compatible mode for HGGC. Disabled by default.
--default-stream-per-thread
Use a normal HGGC stream per thread, which does not implicitly synchronize with other streams.
-fgpu-rdc
Generate relocatable device code, also known as separate compilation mode.
-fno-gpu-rdc
Disable relocatable device code compilation.
--hggc-compile-host-device
Compile HGGC code for both host and device (default). Has no effect on non-HGGC compilations.
--hggc-default-device
Make device the default space specifier.
--hggc-device-only
Compile HGGC code for device only.
--hggc-enable-host-splitter
Compile HGGC host code by using the native host compiler. Disabled by default.
--hggc-host-only
Compile HGGC code for host only. Has no effect on non-HGGC compilations.
--hggc-link
Link clang-offload-bundler bundles for HGGC.
-Xarch_host
Apply the option to host code only.
-Xarch_device
Apply the option to device code only.
-###
Print (but do not run) the commands to run for this compilation
-v
Show commands to run and use verbose output
-MT
Specify name of main file output in depfile
-save-temps=
Save intermediate compilation results.
-save-temps
Save intermediate compilation results.
--save-temps-path=
Path to save intermediate compilation results.
--backend-option-file=
Path to option files
4.2.6 General options
--entries=
Specify the global entry functions for which code must be generated. May be specified more than once.
-ftime-report
Print the time taken by each Pass.
--help
Print help information.
-w
Disable all warning messages.
-Wreorder
Warn when member initializers are reordered.
-Wdefault-stream-launch
Warn when kernel launch default stream argument is used.
-Wmissing-launch-bounds
Warn when missing launch bounds.
-Werror=
Treat the specified kinds of warnings as errors.
-Werror=all-warnings: treat all warnings as errors.
-Werror=cross-execution-space-call: report an error on a cross-execution-space call.
-Werror=reorder: report an error when member initializers are reordered.
-Werror=missing-launch-bounds: report an error when launch bounds are missing.
-Werror=deprecated-declarations: report an error when a deprecated entity is used.
-Werror=default-stream-launch: report an error when the default stream argument is used in a kernel launch.
--expt-extended-lambda
Allow host, device annotations in lambda declarations.
4.2.7 Math library options
--use_fast_math
Make use of the fast math library. --use_fast_math implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true.
--ftz {true | false}
Control single-precision denormals support. --ftz=true flushes denormal values to zero and --ftz=false preserves denormal values. --use_fast_math implies --ftz=true.
Allowed values: {true | false}
Default: false; ppu-clang preserves denormal values.
--prec-div {true | false}
Control single-precision floating-point division and reciprocals. --prec-div=true enables the IEEE round-to-nearest mode and --prec-div=false enables the fast approximation mode. --use_fast_math implies --prec-div=false.
Allowed values: {true | false}
Default: true; ppu-clang enables the IEEE round-to-nearest mode.
--prec-sqrt {true | false}
Control single-precision floating-point square root. --prec-sqrt=true enables the IEEE round-to-nearest mode and --prec-sqrt=false enables the fast approximation mode. --use_fast_math implies --prec-sqrt=false.
Allowed values: {true | false}
Default: true; ppu-clang enables the IEEE round-to-nearest mode.
--fmad {true | false}
Enable or disable the contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA). --use_fast_math implies --fmad=true.
Allowed values: {true | false}
Default: true; ppu-clang contracts floating-point multiplies and adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA).
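The defaults and the settings implied by --use_fast_math can be written down as two small tables. The sketch below only restates the values documented in this section; the function names are hypothetical, and it deliberately does not model what happens when --use_fast_math is combined with an explicit, conflicting option, because the guide does not specify that.

```python
# Documented defaults and the values implied by --use_fast_math.
DEFAULTS = {"ftz": False, "prec-div": True, "prec-sqrt": True, "fmad": True}
FAST_MATH = {"ftz": True, "prec-div": False, "prec-sqrt": False, "fmad": True}


def effective_math_flags(use_fast_math=False):
    """Return the math settings in effect with or without --use_fast_math."""
    return dict(FAST_MATH if use_fast_math else DEFAULTS)


def as_command_line(flags):
    """Render settings as explicit --name=true|false options."""
    return [f"--{name}={'true' if on else 'false'}" for name, on in flags.items()]


print(as_command_line(effective_math_flags(use_fast_math=True)))
```

Spelling the four options out explicitly, as `as_command_line` does, keeps a build log readable when the shorthand is in use.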
4.2.8 Compiler backend options
-ppu-max-vreg-count=
Set the maximum register number that device function can use.
-ppu-restrict
Assert that all kernel pointer parameters are restrict pointers.
--resource-usage
Print resource usage for kernel function.
-ppu-dlcm
Set the default load cache policy.
-ppu-dscm
Set the default store cache policy.
-ppu-flcm
Force set the load cache policy.
-ppu-fscm
Force set the store cache policy.
-ppu-warn-double-usage
Warn if double(s) are used in an instruction.
-ppu-warn-lmem-usage
Warn if local memory is used.
-ppu-warn-spills
Warn if registers are spilled to local memory.
-ppu-suppress-stack-size-warning
Suppress the warning that stack size cannot be determined.
-ppu-maxntid
Specify the maximum number of threads that a thread block can have.
-ppu-minblockscu
Specify the minimum number of CTAs to be mapped to an SM.
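4.2.9 Mixed compilation options
--ppu-arch=<arch>
PPU architecture. Passing several --ppu-arch options produces a mixed build product that supports the full series.
--ppu-arch=ppu001: compile a build product for the 真武810E series only;
--ppu-arch=ppu001 --ppu-arch=ppu0015: compile a mixed build product for the full series.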
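A build script that switches between a single-target and a mixed product might assemble the command line as follows. This is a sketch under stated assumptions: the `--ppu-arch=<arch>` spelling is assumed from the examples above, the helper names and file names are hypothetical, and the command is only constructed, not executed.

```python
def ppu_arch_flags(archs):
    """Return one --ppu-arch=<arch> flag per requested architecture, without duplicates."""
    seen, flags = set(), []
    for arch in archs:
        if arch not in seen:
            seen.add(arch)
            flags.append(f"--ppu-arch={arch}")
    return flags


def build_command(source, output, archs):
    """Assemble a ppu-clang argument list that compiles `source` for `archs`."""
    return ["ppu-clang", *ppu_arch_flags(archs), "-c", source, "-o", output]


# Mixed product for the full series, using the architecture names from this section.
print(" ".join(build_command("kernel.cu", "kernel.o", ["ppu001", "ppu0015"])))
```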