Compiler User Guide

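1. Introduction

ppu-clang is a compiler developed on top of LLVM 13.0.1 that is compatible with the CUDA programming model and language. It takes CUDA C/C++ source code through a series of transformations and produces an executable that runs on the PPU.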

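1.1 CUDA Programming Model

The ppu-clang toolchain compiles CUDA C/C++ source into two parts. One part is the controlling program, which runs as a process on the CPU. The other is the device code, which runs on the PPU as a coprocessor accelerator and handles single-batch and multi-batch parallel jobs without intervention from the host process, which is what allows parallel execution to reach its best performance. The overall guidelines for CUDA source code are called the CUDA Programming Model; for more information, see the CUDA C++ Programming Guide. The highest version currently supported by ppu-clang is CUDA 12.9.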

1.2 CUDA C/C++

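CUDA C/C++ is an extension of the C language. It comprises a set of functions implemented in C++, attributes that distinguish host code from device code, and qualifiers for different kinds of data storage. The extended functions take a number of parameters and are called much like functions in a standard C program; the extensions exist so that execution can make better use of the parallel threads on a GPGPU. Host compiler support is currently as follows: gcc versions 5.5 through 14.2, and clang versions 9 through 18.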

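1.3 Role of the ppu-clang Compiler

Compiling with ppu-clang involves several steps, including splitting the CUDA source, preprocessing, compilation, and merging binaries. ppu-clang drives all of the sub-programs and hides the details of CUDA compilation from the developer. It supports the usual range of compiler options, such as custom macros, custom paths, and custom library paths, as well as some options that are accepted only for the PPU. After receiving the options, it separates them and forwards each to the appropriate host or device compiler tool; not every option is passed on to the device compiler.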

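1.4 Compilation Compatibility with NVCC

To reduce the cost of maintaining and porting projects, the SDK package provides CUDA compatibility. To compile CUDA C/C++ source, developers can either use the standard ppu-clang toolchain or keep the nvcc-based build they used with the NV CUDA SDK.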

[Figure: conversion of an nvcc command line by the nvcc wrapper]

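As the figure above shows, the first step of compilation passes the developer's existing nvcc command line through the nvcc wrapper, a conversion tool that turns it into a command line ppu-clang supports. Developers can refer directly to NVIDIA's official CUDA nvcc compiler manual.

With the CUDA wrapper in place, the compiler supports the vast majority of options. The remaining 30 unsupported options relate to dynamic parallelism, the host splitter, Windows-only usage, error numbers, and similar areas.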

1.4.1 Host/Device Separation

Currently gcc is the only supported host compiler. The following host-side option is not supported:

  • -nohdinitlist, --no-host-device-initializer-list

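1.4.2 NVCC Platform-Specific Options

The options in this group are largely tied to the NVCC toolchain and are not currently supported by the compiler. The affected options are:

  • -target-dir, --target-directory

  • -no-compress, --no-compress

  • --no-align-double

  • -nodlink, --no-device-link

  • -preserve-relocs, --preserve-relocs

  • -dump-callgraph, --dump-callgraph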

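1.4.3 Macro Compatibility

NVCC uses many predefined macros. The nvcc wrapper supports most of them; only a few macros that are strongly tied to the machine platform are not yet supported. They are:

  • __CUDACC_EWP__

  • __NVCC_DIAG_PRAGMA_SUPPORT__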

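1.4.4 Mixed Compilation

When the NVCC command line is used, the default (-arch=sm_80) produces a mixed fatbin that supports the whole product series. A specific extension option selects a particular build product:

  • -arch=sm_80a compiles a fatbin for the 真武810E series only.

  • Exception: -cubin cannot produce a mixed cubin, so the default (-arch=sm_80) compiles for the 真武810E series only.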

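1.5 CUDA Inline PTX Compatibility

Besides the CUDA C/C++ device programming interface, NVIDIA lets developers reach lower-level capabilities through inline PTX. For compatibility, ppu-clang also supports most inline PTX instructions. For usage details, developers can refer directly to NVIDIA's official PTX reference manual and inline PTX usage manual.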

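1.6 Programming Restrictions

For CUDA C/C++ and inline PTX, support currently extends to PTX 8.8 overall, but some differences from NV CUDA 12.9 remain:

  • Texture and Surface CUDA C++ extension APIs and inline PTX instructions are partially supported;

  • Dynamic Parallelism CUDA C++ extension APIs and inline PTX instructions are not supported and fail at run time;

  • The {.level::eviction_priority} and {.level::prefetch_size} qualifiers on inline PTX ld/st instructions are not supported; they are ignored and do not affect compilation or execution;

  • Inline PTX instructions and operands related to cache eviction policy are not supported and cause a compile error;

  • Some instructions added after PTX 7.6 are not supported by the hardware architecture; they compile without error but fail at run time;

  • Device compilation goes from CUDA device C++ source to LLVM (Hgvm) IR to device binary and does not emit PTX files; build or codegen flows written for other platforms that include a PTX stage need to be adjusted;

  • CUDA mma PTX and the related data movement instructions are partially supported, covering dense mma instructions for specific data types (.u8/.s8/.tf32/.bf16/.f16).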

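2. Compilation Phases

The PPU software stack uses the HGGC identifier to distinguish itself from CUDA, which makes it convenient either to compile independently or to stay compatible with the CUDA ecosystem.

2.1 Macros Predefined by ppu-clang

ppu-clang predefines the following macros: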

HGGCCC

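  • Defined when compiling a cu/hggc source file

HGGC_ARCH

  • Defined when compiling the device code in a cu/hggc source file

HGGCCC_VER_MAJOR

  • Defined as the major version number of the PPU compiler

HGGCCC_VER_MINOR

  • Defined as the minor version number of the PPU compiler

HGGCCC_VER_BUILD

  • Defined as the build version number of the PPU compiler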
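As a rough illustration of how these macros could be used, the sketch below reports which compiler is building a translation unit. It assumes the macro names are spelled exactly as listed above; the real spellings may carry leading and trailing underscores that were lost from this page, so it is worth checking them with -dM -E before relying on them. The function name compiler_banner is hypothetical.

```cpp
#include <cassert>
#include <string>

// Returns a short description of the compiler that built this file.
// Assumption: the macro names match the list above character for character.
std::string compiler_banner() {
#if defined(HGGCCC)
    std::string banner = "ppu-clang";
  #if defined(HGGCCC_VER_MAJOR) && defined(HGGCCC_VER_MINOR)
    // Append "major.minor" when both version macros are available.
    banner += " " + std::to_string(HGGCCC_VER_MAJOR) + "." +
              std::to_string(HGGCCC_VER_MINOR);
  #endif
  #if defined(HGGC_ARCH)
    banner += " (device pass)";
  #else
    banner += " (host pass)";
  #endif
    return banner;
#else
    // Any compiler that does not define HGGCCC ends up here.
    return "other compiler";
#endif
}
```

Built with an ordinary host compiler, compiler_banner() returns "other compiler"; the other branches only become active if the compiler defines the macros under these exact names.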

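2.2 ppu-clang Compilation Phases

Compilation as a whole is a transformation process driven by the compiler driver, and each compilation phase is divided into smaller compilation steps. Options such as -### can display these phases, but they are intended for debugging only: the smaller steps may change between versions, so build scripts should not depend on them. The recognized file name suffixes and the supported compilation phases are listed below.

2.3 Supported Input File Suffixes

The following table defines how ppu-clang interprets its input files: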

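Input file suffix     Description
.cu, .hggc            CUDA/HGGC source file containing host code and device code
.hggci                Preprocessed file
.bc                   Intermediate code file produced by compilation
.opt.bc               Optimized intermediate file
.o                    Object file, before linking
.out                  Linked output file
.hgfb                 PPU fatbin file
.c                    C source file
.cc, .cxx, .cpp       C++ source file
.lib                  Library file
.so                   Shared object file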

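2.4 Inputs and Outputs of Supported Phases

The following table lists the supported compilation phases and how ppu-clang executes each one. It also gives the default name of the output file each phase generates, which applies when no option specifies the output file name explicitly.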

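Side     Compilation phase        Driver tool              Options                                  Default output file name
device   Preprocessing            clang                    -E -triple alippu                        .ppu.hggci
device   IR generation            clang                    -x hggc                                  .ppu.bc
device   Optimization             opt                      (none)                                   .ppu.opt.bc
device   Compilation              llc                      (none)                                   .ppu.o
device   Linking                  lld                      (none)                                   .ppu.out
device   fatbin generation        clang-offload-bundler    (none)                                   .hgfb
host     Preprocessing            clang                    -E -triple x86_64-unknown-linux-gnu      host.hggci
host     IR generation            clang                    -cc1                                     host.bc
host     Compile and optimize     clang                    -cc1as                                   .s
host     Linking                  ld                       (none)                                   .o, .obj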

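3. ppu-clang Compilation Flow

The overall workflow for compiling CUDA source with ppu-clang is as follows. The input program is preprocessed and compiled for the device into binary code, which is placed in a single binary file. The input program is then preprocessed again for host compilation: the device binary is embedded, the CUDA-specific C++ extensions are transformed into standard C++ constructs, and the C++ host compiler compiles the host code together with the embedded device code into an executable that the host can invoke. For the PPU, compilation as a whole is divided into a device stage and a host stage. The complete flow is shown in the figure: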

[Figure: overall ppu-clang compilation flow]

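4. ppu-clang Options

4.1 Option Types and Notation

ppu-clang recognizes three types of options: boolean options, single-value options, and list options. Boolean options take no argument; they are either specified on the command line or not. Single-value options may be specified at most once, and list options may be repeated. Single-value and list options must have an argument, which follows the option name and is separated from it by one or more spaces or by an equals character. Some options have a short name in addition to the long name; the two are interchangeable, and the long name is used by default in this guide.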
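To make the argument notation concrete, here is a small sketch of how an option written with an equals character could be separated into its name and argument. It models only the notation described above, not ppu-clang's actual parser, and it is meant for long-form options such as --ftz=true; the names ParsedOption and split_option are hypothetical.

```cpp
#include <cassert>
#include <string>

// One option as written on the command line: its name and, if present, its argument.
struct ParsedOption {
    std::string name;
    std::string value;   // empty when no argument was attached with '='
    bool has_value;
};

// Splits "--name=value" at the first '='. A token without '=' is returned with
// no value; for single-value and list options the caller would then take the
// next command-line token as the argument (the "--name value" form).
ParsedOption split_option(const std::string& token) {
    const std::string::size_type eq = token.find('=');
    if (eq == std::string::npos) {
        return {token, "", false};
    }
    return {token.substr(0, eq), token.substr(eq + 1), true};
}
```

For example, split_option("--ftz=true") yields the name --ftz with the value true, while a boolean option such as -c comes back with has_value set to false.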

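4.2 Option Descriptions

This section describes the ppu-clang options. The long form of each option is given in the first column and the short form in the second. Repeating a single-value option is reported as an error.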

4.2.1 Options for Specifying Files and Paths

-cxx-isystem

  • Add directory to the C++ SYSTEM include search path

-D <macro>=<value>

  • Define <macro> to <value> (or 1 if <value> is omitted)

-fbuiltin-module-map

  • Load the clang builtins module map file.

-o <file>

  • Write output to <file>

-fdebug-macro

  • Emit macro debug information

-fdiagnostics-absolute-paths

  • Print absolute paths in diagnostics

-fexceptions

  • Enable support for exception handling

–fhggc-libdevice-path=

  • Path to libdevice library files

-fprebuilt-module-path=

  • Specify the prebuilt module path

-F

  • Add directory to framework include search path

--hggc-libdevice-path=

  • Specify the directory that contains the libdevice library files.

-I-

  • Restrict all prior -I flags to double-quoted inclusion and remove current directory from include path

-ibuiltininc

  • Enable builtin #include directories even when -nostdinc is used before or after -ibuiltininc. Using -nobuiltininc after the option disables it

-idirafter

  • Add directory to AFTER include search path

-iframeworkwithsysroot

  • Add directory to SYSTEM framework search path, absolute paths are relative to -isysroot

-iframework

  • Add directory to SYSTEM framework search path

-include-pch

  • Include precompiled header file

-include

  • Include file before parsing

-isystem

  • Add directory to SYSTEM include search path

-isysroot

  • Set the system root directory (usually /)

-I

  • Add directory to the end of the list of include search paths

-l

  • Add the library at the link stage.

-L

  • Add directory to library search path

-module-dependency-dir

  • Directory to dump module dependencies to

-MF <file>

  • Write depfile output from -MMD, -MD, -MM, or -M to <file>

-MP

  • Create phony target for each dependency (other than main file)

-nobuiltininc

  • Disable builtin #include directories

-nostdinc++

  • Disable standard #include directories for the C++ standard library

-output-path=

  • Path for output target in dependency file.

-print-resource-dir

  • Print the resource directory pathname

-print-runtime-dir

  • Print the directory pathname containing clang's runtime libraries

-print-search-dirs

  • Print the paths used for finding libraries and programs

-stdlib++-isystem

  • Use directory as the C++ standard library include path

-U

  • Undefine macro

-working-directory

  • Resolve file paths relative to the specified directory

--hgas-path=

  • Path to hgas (used for assembling hggc device code)

--archiver-binary=

  • Specify the archiver tool path

4.2.2 Options for Specifying the Compilation Phase

-c

  • Only run preprocess, compile, and assemble steps

--hggc-device-only -E

  • Preprocess only; only device-side preprocessing is supported.

-emit-ast

  • Emit Clang AST files for source inputs

-emit-interface-stubs

  • Generate Interface Stub Files.

-emit-llvm

  • Use the LLVM representation for assembler and object files

-emit-hggcbin

  • Emit hggc bin files

-emit-merged-ifs

  • Generate Interface Stub Files, emit merged text not binary.

-fas-with-hgas

  • Assemble hggc device code with hgas

-dD

  • Print macro definitions in -E mode in addition to normal output

-dI

  • Print include directives in -E mode in addition to normal output

-dM

  • Print macro definitions in -E mode instead of normal output

-D <macro>=<value>

  • Define <macro> to <value> (or 1 if <value> is omitted)

-fc++-abi=

  • C++ ABI to use. This will override the target C++ ABI.

-mstackrealign

  • Force realign the stack at entry to every function.

-mix-E

  • Compile device code and preprocess host code. This option cannot be combined with -fgpu-rdc.

-fchar8_t

  • Enable C++ builtin type char8_t

-fcolor-diagnostics

  • Enable colors in diagnostics

-fcxx-exceptions

  • Enable C++ exceptions

-fdebug-macro

  • Emit macro debug information

-fexceptions

  • Enable support for exception handling

-fgnu89-inline

  • Use the gnu89 inline semantics

-fkeep-static-consts

  • Keep static const variables if unused

-fno-c++-static-destructors

  • Disable C++ static destructor registration

-fno-char8_t

  • Disable C++ builtin type char8_t

-fno-color-diagnostics

  • Disable colors in diagnostics

-fno-debug-macro

  • Do not emit macro debug information

-M

  • Write a depfile containing user and system headers. Like -MD, but also implies -E and writes to stdout by default.

-MM

  • Like -MMD, but also implies -E and writes to stdout by default.

-MMD

  • Write a depfile containing user headers

-MD

  • Write a depfile containing user and system headers.

-pedantic

  • Warn on language extensions

4.2.3 Options for Specifying Compiler/Linker Behavior

-x <language>

  • Treat subsequent input files as having type <language>

-std=

  • Determine the language standard. This option is currently only supported when compiling C or C++. The compiler can accept several base standards, such as c90 or c++98, and GNU dialects of those standards, such as gnu90 or gnu++98. When a base standard is specified, the compiler accepts all programs following that standard plus those using GNU extensions that do not contradict it. For example, -std=c90 turns off certain features of GCC that are incompatible with ISO C90, such as the "asm" and "typeof" keywords, but not other GNU extensions that do not have a meaning in ISO C90, such as omitting the middle term of a "?:" expression. On the other hand, when a GNU dialect of a standard is specified, all features supported by the compiler are enabled, even when those features change the meaning of the base standard. As a result, some strict-conforming programs may be rejected. The particular standard is used by -Wpedantic to identify which features are GNU extensions given that version of the standard. For example, -std=gnu90 -Wpedantic warns about C++ style // comments, while -std=gnu99 -Wpedantic does not. Accepted values: {c11, c1x, gnu90, gnu89, gnu99, gnu9x, gnu11, gnu1x, c++11, c++0x, c++14, c++1y, c++1z, c++20}

  • c11 c1x

ISO C11, the 2011 revision of the ISO C standard. This standard is substantially completely supported, modulo bugs, floating-point issues (mainly but not entirely relating to optional C11 features from Annexes F and G) and the optional Annexes K (Bounds-checking interfaces) and L (Analyzability). The name c1x is deprecated.

  • c++11 c++0x

The 2011 ISO C++ standard plus amendments. The name c++0x is deprecated.

  • c++14 c++1y

The 2014 ISO C++ standard plus amendments. The name c++1y is deprecated.

  • c++1z

The 2017 revision of the ISO C++ standard, under its pre-publication name. Support is described as highly experimental and may change in incompatible ways in future releases.

  • c++20

The 2020 revision of the ISO C++ standard. Support is described as highly experimental and may change in incompatible ways in future releases.

-Xarch_host -g

  • Generate source-level debug information for host code.

-Xarch_device -g

  • Generate source-level debug information for Device code.

-gline-tables-only

  • Emit debug line number tables only

-O

  • Specify optimization level for host code.

-ftemplate-backtrace-limit

  • Set the maximum number of entries to print in a template instantiation backtrace (0 = no limit).

-fno-exceptions

  • Disable exception handling

--no-host-device-initializer-list

  • Do not implicitly treat member functions of std::initializer_list as __host__ __device__ functions.

--expt-relaxed-constexpr

  • Allow host code to invoke __device__ constexpr functions, and device code to invoke __host__ constexpr functions.

--expt-extended-lambda

  • Allow __host__, __device__ annotations in lambda declarations.

-fdlto/-fno-dlto

  • Perform or Disable link-time optimization of device code. Link-time optimization must be specified at both compile and link time; at compile time it stores high-level intermediate code, then at link time it links together and optimizes the intermediate code.

--host-linker-script=<use-lcs / gen-lcs>

  • Generate a host linker script.

--augment-host-linker-script

  • Generate a host linker script that augments an existing host linker script. Must be used in combination with --host-linker-script.

--host-relocatable-link

  • Generate a host linker script that can be used in a host relocatable link. Must be used in combination with --host-linker-script.

-ccc-gcc-name=

  • Specify the name of the native GCC compiler.
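One way to confirm which C++ standard a given -std= value selected is to inspect the standard __cplusplus macro from inside the compiled code. The sketch below maps its value to a name. The threshold values are the ones fixed by the published ISO C++ standards; the helper names standard_name and current_standard are hypothetical, and whether ppu-clang reports exactly these values for every dialect has not been verified here.

```cpp
#include <cassert>
#include <string>

// Maps a __cplusplus value to the standard it denotes.
std::string standard_name(long value) {
    if (value >= 202002L) return "C++20 or later";
    if (value >= 201703L) return "C++17";
    if (value >= 201402L) return "C++14";
    if (value >= 201103L) return "C++11";
    return "pre-C++11";
}

// Name of the standard this translation unit was compiled under.
std::string current_standard() {
    return standard_name(__cplusplus);
}
```

Printing current_standard() from a small test file built with, for example, -std=c++14 is a quick sanity check that the option reached the compiler stage that handles that file.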

4.2.4 Options Passed to Specific Compilation Phases

-Xassembler <arg>

  • Pass <arg> to the assembler

-Xclang <arg>

  • Pass <arg> to the clang compiler

-Xhggclink <arg>

  • Pass <arg> to the hggclinker

-Xhggcllc <arg>

  • Pass <arg> to the llc

-Xlinker <arg>

  • Pass <arg> to the linker

-Xarchive <arg>

  • Pass <arg> to the archiver.

-z <arg>

  • Pass -z <arg> to the linker

--options-file <file>,...

  • Include command line options from specified file.

-optf <file>,...

  • Include command line options from specified file.

4.2.5 Options for Steering Compiler Driver Behavior

--compatible-mode

  • Use SDK compatible mode for hggc. Disabled by default.

--default-stream-per-thread

  • Use a normal HGGC stream per thread that does not implicitly synchronize with other streams

-fgpu-rdc

  • Generate relocatable device code, also known as separate compilation mode.

-fno-gpu-rdc

  • Disable relocatable device code compilation.

--hggc-compile-host-device

  • Compile HGGC code for both host and device (default). Has no effect on non-HGGC compilations.

--hggc-default-device

  • Make device the default space specifier.

--hggc-device-only

  • Compile HGGC code for device only

--hggc-enable-host-splitter

  • Compile HGGC host code by using the native host compiler. Disabled by default.

--hggc-host-only

  • Compile HGGC code for host only. Has no effect on non-HGGC compilations.

--hggc-link

  • Link clang-offload-bundler bundles for HGGC

-Xarch_host

  • set option for host code only.

-Xarch_device

  • set option for device code only.

-###

  • Print (but do not run) the commands to run for this compilation

-v

  • Show commands to run and use verbose output

-MT

  • Specify name of main file output in depfile

-save-temps=

  • Save intermediate compilation results.

-save-temps

  • Save intermediate compilation results.

--save-temps-path=

  • Path to save intermediate compilation results.

--backend-option-file=

  • Path to option files
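The effect of -Xarch_host and -Xarch_device can be pictured as a routing step in the driver: the argument that follows the flag goes to one side only. The sketch below models that idea under two assumptions: each flag applies to exactly one following argument (as in the -Xarch_host -g and -Xarch_device -g forms listed in 4.2.3), and every other argument is sent to both sides, which is a simplification because the real driver does not forward every option to the device compiler. It is not ppu-clang's implementation, and RoutedArgs and route_args are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Arguments as seen by each side after routing.
struct RoutedArgs {
    std::vector<std::string> host;
    std::vector<std::string> device;
};

RoutedArgs route_args(const std::vector<std::string>& args) {
    RoutedArgs out;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const bool host_only = (args[i] == "-Xarch_host");
        const bool device_only = (args[i] == "-Xarch_device");
        if ((host_only || device_only) && i + 1 < args.size()) {
            // The flag itself is consumed; only the next argument is forwarded.
            ++i;
            (host_only ? out.host : out.device).push_back(args[i]);
        } else if (!host_only && !device_only) {
            // Simplification: unprefixed arguments go to both sides.
            out.host.push_back(args[i]);
            out.device.push_back(args[i]);
        }
        // A trailing -Xarch_* flag with no argument is dropped in this sketch.
    }
    return out;
}
```

With the input -c -Xarch_host -g, the host side receives -c and -g while the device side receives only -c.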

4.2.6 General Options

--entries=

  • Specify the global entry functions for which code must be generated. May be specified more than once.

-ftime-report

  • Print the time taken by each Pass.

--help

  • Print help information.

-w

  • Disable all warning messages.

-Wreorder

  • Warn when member initializers are reordered.

-Wdefault-stream-launch

  • Warn when kernel launch default stream argument is used.

-Wmissing-launch-bounds

  • Warn when missing launch bounds.

-Werror=

  • Treat the specified kinds of warnings as errors.

  • -Werror=all-warnings: treat all warnings as errors.

  • -Werror=cross-execution-space-call: report an error on a cross-execution-space call.

  • -Werror=reorder: report an error when member initializers are reordered.

  • -Werror=missing-launch-bounds: report an error when launch bounds are missing.

  • -Werror=deprecated-declarations: report an error when a deprecated entity is used.

  • -Werror=default-stream-launch: report an error when the default stream argument is used in a kernel launch.

--expt-extended-lambda

  • Allow __host__, __device__ annotations in lambda declarations.

4.2.7 Math Library Options

--use_fast_math

  • Make use of the fast math library. --use_fast_math implies --ftz=true --prec-div=false --prec-sqrt=false --fmad=true.

--ftz {true | false}

  • Control single-precision denormals support. --ftz=true flushes denormal values to zero and --ftz=false preserves denormal values. --use_fast_math implies --ftz=true. Allowed values: {true | false}. Default: false (ppu-clang preserves denormal values).

--prec-div {true | false}

  • Control single-precision floating-point division and reciprocals. --prec-div=true enables the IEEE round-to-nearest mode and --prec-div=false enables the fast approximation mode. --use_fast_math implies --prec-div=false. Allowed values: {true | false}. Default: true (ppu-clang enables the IEEE round-to-nearest mode).

--prec-sqrt {true | false}

  • Control single-precision floating-point square root. --prec-sqrt=true enables the IEEE round-to-nearest mode and --prec-sqrt=false enables the fast approximation mode. --use_fast_math implies --prec-sqrt=false. Allowed values: {true | false}. Default: true (ppu-clang enables the IEEE round-to-nearest mode).

--fmad {true | false}

  • Enable or disable the contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA). --use_fast_math implies --fmad=true. Allowed values: {true | false}. Default: true (ppu-clang contracts floating-point multiplies and adds/subtracts into multiply-add operations).
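To see why --fmad can change numerical results, compare a fused multiply-add, which rounds once, with a separate multiply and add, which round twice. The host-side sketch below uses std::fma from <cmath> to stand in for the contracted form. It illustrates the arithmetic only and says nothing about which instructions ppu-clang emits; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// a*b + c with the product rounded to float before the add (no contraction).
float mul_then_add(float a, float b, float c) {
    // volatile keeps the host compiler from fusing the two steps.
    volatile float product = a * b;
    return product + c;
}

// a*b + c computed as one fused operation with a single rounding.
float fused_mul_add(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

Take a = 1 + 2^-12 and c = -(1 + 2^-11). The exact value of a*a + c is 2^-24. The fused form returns that value, while the unfused form returns 0, because the product 1 + 2^-11 + 2^-24 is first rounded to 1 + 2^-11 in single precision.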

4.2.8 Compiler Backend Options

-ppu-max-vreg-count=

  • Set the maximum register number that device function can use.

-ppu-restrict

  • Assert that all kernel pointer parameters are restrict pointers.

--resource-usage

  • Print resource usage for kernel function.

-ppu-dlcm

  • Set the default load cache policy.

-ppu-dscm

  • Set the default store cache policy.

-ppu-flcm

  • Force set the load cache policy.

-ppu-fscm

  • Force set the store cache policy.

-ppu-warn-double-usage

  • Warn if double(s) are used in an instruction.

-ppu-warn-lmem-usage

  • Warn if local memory is used.

-ppu-warn-spills

  • Warn if registers are spilled to local memory.

-ppu-suppress-stack-size-warning

  • Suppress the warning that stack size cannot be determined.

-ppu-maxntid

  • Specify the maximum number of threads that a thread block can have.

-ppu-minblockscu

  • Specify the minimum number of CTAs to be mapped to an SM.

4.2.9 Mixed Compilation Options

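--ppu-arch=

  • The PPU architecture. Passing several --ppu-arch options produces a mixed build product that supports the whole product series.

  • --ppu-arch=ppu001: compile for the 真武810E series only;

  • --ppu-arch=ppu001 --ppu-arch=ppu0015: compile a mixed build product for the whole series.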