CUDA Compatibility (v1.4.1)


This topic describes the PPU's CUDA compatibility.

PPU CUDA compatibility: approach, principles, and advantages

CUDA (Compute Unified Device Architecture) is a parallel computing architecture from NVIDIA that lets developers use the computing power of NVIDIA GPUs to accelerate compute-intensive tasks. It provides a programming model and APIs through which programmers write GPU-parallel code in C, C++, Fortran, and other languages. With CUDA, developers can run massive numbers of parallel computations on the GPU to improve program performance; it is widely used in scientific computing, image processing, machine learning, and other fields. The PPU provides CUDA compatibility; for the detailed approach, principles, and advantages, see CUDA兼容讨论.pptx.
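As a minimal illustration of the programming model described above, the following vectorAdd-style sketch shows the standard CUDA pattern of one thread per element (this is plain CUDA C++, not PPU-specific code):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Each thread adds one element: the classic CUDA data-parallel pattern.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; cudaMalloc + cudaMemcpy also works.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all elements
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```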

CUDA Sample Compatibility

CUDA Sample compatibility list

CUDA Sample

Status

Comments

simpleVoteIntrinsics

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

vectorAdd_nvrtc

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

deviceQuery

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

reduction

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

tf32TensorCoreGemm

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

shfl_scan

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

warpAggregatedAtomicsCG

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

concurrentKernels

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

bf16TensorCoreGemm

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

bandwidthTest

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

UnifiedMemoryPerf

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

binaryPartitionCG

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

conjugateGradientMultiBlockCG

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

cudaCompressibleMemory

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

cudaTensorCoreGemm

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

globalToShmemAsyncCopy

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

matrixMul

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

matrixMulDrv

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

nvJPEG

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

nvJPEG_encoder

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

p2pBandwidthLatencyTest

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleAWBarrier

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleCudaGraphs

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleZeroCopy

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleDrvRuntime

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

vectorAddMMAP

❌ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

The 11.1 sample depends on PTX in its build process, and the PPU does not support PTX.
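In general, a PTX dependency in a build can be avoided by asking nvcc to embed only architecture-specific binary code rather than PTX for JIT compilation. The flags below are the standard nvcc form; the exact flags and target architecture for the PPU toolchain may differ, and `sm_80` here is only a placeholder:

```shell
# Embed only the SASS binary for one architecture (code=sm_80),
# with no PTX fallback, so no PTX JIT path is exercised at load time.
nvcc -gencode arch=compute_80,code=sm_80 vectorAdd.cu -o vectorAdd
```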

simpleIPC

✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

streamOrderedAllocation

✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

streamOrderedAllocationIPC

✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simplePrintf

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleTemplates

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleOccupancy

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

topologyQuery

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

clock

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

cppIntegration

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

dwtHaar1D

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

vectorAdd

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

vectorAddDrv

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

scalarProd

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleVoteIntrinsics_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

SobolQRNG

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleCooperativeGroups

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleAtomicIntrinsics

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

cudaOpenMP

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

fp16ScalarProduct

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

inlinePTX

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleMPI

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

template

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleHyperQ

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

reductionMultiBlockCG

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

threadFenceReduction

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

mergeSort

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

convolutionSeparable

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

FDTD3d

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

matrixMulCUBLAS

❌ 11.5 ❌ 11.6 ❌ 11.7 ❌ 11.8 ❌ 12.0 ❌ 12.1 ❌ 12.2 ❌ 12.3 ❌ 12.4 ❌ 12.5 ❌ 12.6

The results differ in precision from NVIDIA's because the matrixMul computation method differs; the PPU uses an implementation with better performance.

sortingNetworks

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

fastWalshTransform

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

alignedTypes

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

deviceQueryDrv

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

scan

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

BlackScholes

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

transpose

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

histogram

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

MC_SingleAsianOptionP

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

MC_EstimatePiInlineP

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

quasirandomGenerator

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

binomialOptions

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

MonteCarloMultiGPU

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

UnifiedMemoryStreams

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

asyncAPI

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

c++11_cuda

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

cppOverload

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

cuHook

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

eigenvalues

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

interval

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

newdelete

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

radixSortThrust

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

segmentationTreeThrust

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleAssert

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleAttributes

❌ 11.5 ❌ 11.6 ❌ 11.7 ❌ 11.8 ❌ 12.0 ❌ 12.1 ❌ 12.2 ❌ 12.3 ❌ 12.4 ❌ 12.5 ❌ 12.6

The sample passes, but its execution time is extremely long. The reason: the "Maximum y- or z-dimension of a grid of thread blocks" is 65535 on NVIDIA GPUs but 2^31-1 on the PPU, and the sample code depends on this value, which inflates the run time enormously.
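Portable code should query such limits at run time instead of assuming the NVIDIA value. A minimal sketch using the standard runtime API:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // maxGridSize[1] / [2] are the y/z grid limits: 65535 on NVIDIA GPUs,
    // 2^31 - 1 on the PPU. Sizing loops from the queried value avoids the
    // pathological iteration counts this sample runs into.
    printf("max grid y = %d, z = %d\n", prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```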

simpleMultiCopy

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleMultiGPU

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleP2P

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleSeparateCompilation

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleStreams

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

threadMigration

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

vectorAddDrv

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

binomialOptions_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

clock_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

inlinePTX_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

matrixMul_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

quasirandomGenerator_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleAssert_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleAtomicIntrinsics_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleTemplates_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

simpleVoteIntrinsics_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

BlackScholes_nvrtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

libNVVM

❌ 12.3 ❌ 12.4 ❌ 12.5 ❌ 12.6

The PPU is currently compatible with the LLVM IR definition for NVGPU, but not fully compatible with NVIDIA's official NVVM IR definition.

StreamPriorities

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

MC_EstimatePiInlineQ

cuRAND-related APIs are not yet fully supported.

MC_EstimatePiP

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

MC_EstimatePiQ

cuRAND-related APIs (curandCreateGenerator) are not yet fully supported.

MersenneTwisterGP11213

cuRAND-related APIs are not yet fully supported.

batchCUBLAS

cuBLAS-related APIs (cublasSetMatrix) are not yet fully supported.

batchedLabelMarkersAndLabelCompressionNPP

nppiLabelMarkersUFGetBufferSize_32u_C1R

nppiCompressMarkerLabelsGetBufferSize_32u_C1R

nppiLabelMarkersUFBatch_8u32u_C1R_Advanced_Ctx

nppiCompressMarkerLabelsUFBatch_32u_C1IR_Advanced_Ctx

The APIs above are not yet supported.

boxFilterNPP

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

conjugateGradientCudaGraphs

The cuSPARSE library is not yet fully supported.

conjugateGradient

conjugateGradientMultiDeviceCG

conjugateGradientPrecond

conjugateGradientUM

cuSolverDn_LinearSolver

The cuSOLVER/cuSPARSE libraries (e.g., cusolverSpCreate) are not yet fully supported.

cuSolverRf

cuSolverSp_LinearSolver

cuSolverSp_LowlevelCholesky

cuSolverSp_LowlevelQR

graphMemoryFootprint

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

graphMemoryNodes

The graph memory-allocate/memory-free node APIs are not yet defined because the header files have not reached that version.

immaTensorCoreGemm

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

jacobiCudaGraphs

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

matrixMulDynlinkJIT

Dynlink JIT is not supported yet.

memMapIPCDrv

The granularity returned by cuMemGetAllocationGranularity is 2 MB on NVIDIA, while the PPU is designed for 8 MB; the check in the sample code needs to be modified.
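Rather than hardcoding a 2 MB granularity, code can query it and round allocation sizes up to the reported value. A minimal driver-API sketch:

```cuda
#include <cuda.h>
#include <stdio.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t granularity = 0;
    // Query instead of asserting a fixed 2 MB value: the PPU reports 8 MB.
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    printf("allocation granularity = %zu bytes\n", granularity);

    // Round any requested size up to a multiple of the queried granularity.
    size_t request = 1 << 20;
    size_t padded = (request + granularity - 1) / granularity * granularity;
    printf("padded size = %zu bytes\n", padded);
    return 0;
}
```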

nbody

Graphics and OpenGL-related APIs are not supported.

Mandelbrot

particles

oceanFFT

simpleCUDA2GL

simpleGL

recursiveGaussian

ptxjit

The PPU does not support PTX.

randomFog

Graphical display capability is not available.

simpleCUBLAS

The cuBLAS API implementation is incomplete.

simpleCUBLASXT

simpleCUBLAS_LU

simpleCUFFT

The cuFFT library is not supported.

simpleCUFFT_2d_MGPU

simpleCUFFT_MGPU

simpleCUFFT_callback

systemWideAtomics

FilterBorderControlNPP

NPP-related APIs are not supported yet.

watershedSegmentationNPP

streamOrderedAllocationP2P

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

EGLStream_CUDA_CrossGPU

EGL-related APIs are not supported.

EGLStream_CUDA_Interop

EGLStreams_CUDA_Interop

EGLSync_CUDAEvent_Interop

GLES is not supported.

cuDLALayerwiseStatsStandalone

cuDLALayerwiseStatsHybrid

simpleGLES_EGLOutput

fluidsGLES

nbody_opengles

simpleGLES

simpleGLES_screen

nbody_screen

cuDLAHybridMode

cuDLAStandaloneMode

cuDLAErrorReporting

cudaNvSciNvMedia

cdpAdvancedQuicksort

The CUDA CDP (dynamic parallelism) feature is not supported.

cdpBezierTessellation

cdpQuadtree

cdpSimplePrint

cdpSimpleQuicksort

cudaNvSci

libnvscibuf.so not found

NvSci is not supported yet.

dmmaTensorCoreGemm

The PPU tensor cores do not support double-precision MMA instructions.

dxtc

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

freeImageInteropNPP

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

histEqualizationNPP

✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

cannyEdgeDetectorNPP

✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6

convolutionTexture

The PPU does not support texture or GL functionality.

bindlessTexture

bicubicTexture

HSOpticalFlow

simpleLayeredTexture

simplePitchLinearTexture

simpleSurfaceWrite

simpleTexture

simpleTexture3D

simpleTextureDrv

simpleCubemapTexture

volumeFiltering

volumeRender

vulkanImageCUDA

stereoDisparity

boxFilter

bilateralFilter

postProcessGL

imageDenoising

fluidsGL

smokeParticles

lineOfSight

marchingCubes

convolutionFFT2D

dct8x8

NV12toBGRandResize

SobelFilter

FunctionPointers

simpleVulkan

Vulkan-CUDA interop features are not supported yet.

simpleVulkanMMAP

simpleD3D10

The PPU does not support D3D graphics-related APIs.

fluidsD3D9

simpleD3D11

simpleD3D11Texture

simpleD3D10RenderTarget

simpleD3D10Texture

SLID3D10Texture

VFlockingD3D10

simpleD3D12

simpleD3D9Texture

simpleD3D9

cudaGraphsPerfScaling

❌ 12.5 ❌ 12.6

The cudaGraphUpload API is not supported yet.

Unsupported CUDA Runtime APIs

Aligned with CUDA Runtime 12.3 (modules marked [DEPRECATED] in the CUDA Runtime 12.3 spec are not supported at all and are not listed here). All other APIs are supported, except for the following:

List of unsupported APIs

Module

CUDA Runtime API name

Remarks

Device Management

cudaDeviceFlushGPUDirectRDMAWrites

RDMA flush; does not affect RDMA usage.

cudaDeviceGetNvSciSyncAttributes

NvSci library; no practical demand at present.

cudaDeviceGetTexture1DLinearMaxWidth

Graphics-related; not needed in AI scenarios.

External Resource Interoperability

cudaDestroyExternalMemory

Related to Graphics API interop; currently only the CUDA API is supported.

cudaExternalMemoryGetMappedBuffer

cudaExternalMemoryGetMappedMipmappedArray

cudaImportExternalMemory

cudaImportExternalSemaphore

cudaSignalExternalSemaphoresAsync

cudaWaitExternalSemaphoresAsync

Execution Control

cudaSetDoubleForDevice

deprecated as of CUDA 7.5

Will never be supported.

cudaSetDoubleForHost

deprecated as of CUDA 7.5

Will never be supported.

Occupancy

cudaOccupancyMaxActiveClusters

Cluster-related.

cudaOccupancyMaxPotentialClusterSize

Memory Management

cudaArrayGetInfo

Except for cudaMemAdvise_v2, these are all array-related; not needed in AI scenarios.

cudaArrayGetMemoryRequirements

cudaArrayGetPlane

cudaArrayGetSparseProperties

cudaFreeArray

cudaFreeMipmappedArray

cudaGetMipmappedArrayLevel

cudaMalloc3DArray

cudaMallocArray

cudaMallocMipmappedArray

cudaMemAdvise_v2

cudaMemcpy2DArrayToArray

cudaMemcpy2DFromArray

cudaMemcpy2DFromArrayAsync

cudaMemcpy2DToArray

cudaMemcpy2DToArrayAsync

cudaMipmappedArrayGetMemoryRequirements

cudaMipmappedArrayGetSparseProperties

OpenGL Interoperability

cudaGLGetDevices

Graphics-related; not needed in AI scenarios.

cudaGraphicsGLRegisterBuffer

cudaGraphicsGLRegisterImage

cudaWGLGetDevice

Direct3D 9 Interoperability

cudaD3D9GetDevice

cudaD3D9GetDevices

cudaD3D9GetDirect3DDevice

cudaD3D9SetDirect3DDevice

cudaGraphicsD3D9RegisterResource

Direct3D 10 Interoperability

cudaD3D10GetDevice

cudaD3D10GetDevices

cudaGraphicsD3D10RegisterResource

Direct3D 11 Interoperability

cudaD3D11GetDevice

cudaD3D11GetDevices

cudaGraphicsD3D11RegisterResource

VDPAU Interoperability

cudaGraphicsVDPAURegisterOutputSurface

cudaGraphicsVDPAURegisterVideoSurface

cudaVDPAUGetDevice

cudaVDPAUSetVDPAUDevice

EGL Interoperability

cudaEGLStreamConsumerAcquireFrame

cudaEGLStreamConsumerConnect

cudaEGLStreamConsumerConnectWithFlags

cudaEGLStreamConsumerDisconnect

cudaEGLStreamConsumerReleaseFrame

cudaEGLStreamProducerConnect

cudaEGLStreamProducerDisconnect

cudaEGLStreamProducerPresentFrame

cudaEGLStreamProducerReturnFrame

cudaEventCreateFromEGLSync

cudaGraphicsEGLRegisterImage

cudaGraphicsResourceGetMappedEglFrame

Graphics Interoperability

cudaGraphicsMapResources

cudaGraphicsResourceGetMappedMipmappedArray

cudaGraphicsResourceGetMappedPointer

cudaGraphicsResourceSetMapFlags

cudaGraphicsSubResourceGetMappedArray

cudaGraphicsUnmapResources

cudaGraphicsUnregisterResource

Surface Object Management

cudaCreateSurfaceObject

cudaDestroySurfaceObject

cudaGetSurfaceObjectResourceDesc

Graph Management

cudaGetCurrentGraphExec

Device function

Requires CDP support.

cudaGraphAddExternalSemaphoresSignalNode

Mainly graphics-related; not needed in AI scenarios.

cudaGraphAddExternalSemaphoresWaitNode

cudaGraphExecExternalSemaphoresSignalNodeSetParams

cudaGraphExecExternalSemaphoresWaitNodeSetParams

cudaGraphExecGetFlags

cudaGraphExternalSemaphoresSignalNodeGetParams

cudaGraphExternalSemaphoresSignalNodeSetParams

cudaGraphExternalSemaphoresWaitNodeGetParams

cudaGraphExternalSemaphoresWaitNodeSetParams

cudaGraphInstantiateWithParams

cudaGraphNodeGetEnabled

cudaGraphNodeSetEnabled

cudaGraphUpload

cudaGraphConditionalHandleCreate

cudaGraphSetConditional

In addition, a small number of APIs (not listed here) have individual enum values that are unsupported; the actual runtime result (cudaErrorNotSupported or cudaErrorInvalidValue) is authoritative.
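A defensive calling pattern for such cases might look like the following sketch (cudaDevAttrMemoryPoolsSupported is only an illustrative attribute, not one the text above singles out):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Treat cudaErrorNotSupported / cudaErrorInvalidValue as the authoritative
// signal that a particular API or enum value is unsupported on this device.
static void check(cudaError_t err, const char *what) {
    if (err == cudaErrorNotSupported || err == cudaErrorInvalidValue) {
        fprintf(stderr, "%s unsupported on this device: %s\n",
                what, cudaGetErrorString(err));
    } else if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    }
}

int main() {
    int value = 0;
    check(cudaDeviceGetAttribute(&value, cudaDevAttrMemoryPoolsSupported, 0),
          "cudaDeviceGetAttribute");
    return 0;
}
```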

cuDNN Support Status

Compared against cuDNN 8.5.0, cuDNN API support is shown in the table below:

  • Most APIs used in common NN scenarios are already supported and tuned.

  • Benchmarked against Ampere, all APIs are supportable in software; there are currently no PPU hardware limitations. Future software releases will fill the gaps by priority.

  • Current API support rate: 196/263 = 74.5%.

  • Excluding the 26 deprecated APIs, the support rate is 196/237 = 82.7%.

cuDNN API support

api

cudnn 8.5.0

ppu 1.4

Status

Description

cudnnCreateRNNDescriptor

Yes

Yes

cudnnDestroyRNNDescriptor

Yes

Yes

cudnnSetRNNDescriptor_v8

Yes

Yes

cudnnGetRNNDescriptor_v8

Yes

Yes

cudnnSetRNNDescriptor_v6

Yes

Yes

cudnnGetRNNDescriptor_v6

Yes

Yes

cudnnSetRNNMatrixMathType

Yes

Yes

cudnnGetRNNMatrixMathType

Yes

Yes

cudnnSetRNNBiasMode

Yes

Yes

cudnnGetRNNBiasMode

Yes

Yes

cudnnRNNSetClip_v8

Yes

Yes

cudnnRNNGetClip_v8

Yes

Yes

cudnnRNNSetClip

Yes

Yes

cudnnRNNGetClip

Yes

Yes

cudnnSetRNNProjectionLayers

Yes

Yes

cudnnGetRNNProjectionLayers

Yes

Yes

cudnnGetRNNWorkspaceSize

Yes

Yes

cudnnGetRNNTrainingReserveSize

Yes

Yes

cudnnGetRNNTempSpaceSizes

Yes

Yes

cudnnGetRNNParamsSize

Yes

Yes

cudnnGetRNNWeightSpaceSize

Yes

Yes

cudnnGetRNNLinLayerMatrixParams

Yes

Yes

cudnnGetRNNLinLayerBiasParams

Yes

Yes

cudnnGetRNNWeightParams

Yes

Yes

cudnnRNNForwardInference

Yes

Yes

cudnnCreateRNNDataDescriptor

Yes

Yes

cudnnDestroyRNNDataDescriptor

Yes

Yes

cudnnSetRNNDataDescriptor

Yes

Yes

cudnnGetRNNDataDescriptor

Yes

Yes

cudnnRNNForward

Yes

Yes

cudnnCreateSeqDataDescriptor

Yes

Yes

cudnnDestroySeqDataDescriptor

Yes

Yes

cudnnSetSeqDataDescriptor

Yes

Yes

cudnnGetSeqDataDescriptor

Yes

Yes

cudnnCreateAttnDescriptor

Yes

Yes

cudnnDestroyAttnDescriptor

Yes

Yes

cudnnSetAttnDescriptor

Yes

Yes

cudnnGetAttnDescriptor

Yes

Yes

cudnnGetMultiHeadAttnBuffers

Yes

Yes

cudnnGetMultiHeadAttnWeights

Yes

Yes

cudnnMultiHeadAttnForward

Yes

Yes

cudnnAdvInferVersionCheck

Yes

Yes

cudnnRNNForwardTraining

Yes

Yes

cudnnRNNBackwardData

Yes

Yes

cudnnRNNBackwardData_v8

Yes

Yes

cudnnRNNBackwardWeights

Yes

Yes

cudnnRNNBackwardWeights_v8

Yes

Yes

cudnnDestroyCTCLossDescriptor

Yes

Yes

cudnnCreateCTCLossDescriptor

Yes

Yes

cudnnSetCTCLossDescriptor

Yes

Yes

cudnnGetCTCLossDescriptor

Yes

Yes

cudnnSetCTCLossDescriptorEx

Yes

Yes

cudnnGetCTCLossDescriptorEx

Yes

Yes

cudnnSetCTCLossDescriptor_v8

Yes

Yes

cudnnGetCTCLossDescriptor_v8

Yes

Yes

cudnnGetCTCLossWorkspaceSize_v8

Yes

Yes

cudnnCTCLoss

Yes

Yes

cudnnGetCTCLossWorkspaceSize

Yes

Yes

cudnnCTCLoss_v8

Yes

Yes

cudnnAdvTrainVersionCheck

Yes

Yes

cudnnBackendCreateDescriptor

Yes

Yes

cudnnBackendDestroyDescriptor

Yes

Yes

cudnnBackendInitialize

Yes

Yes

cudnnBackendFinalize

Yes

Yes

cudnnBackendSetAttribute

Yes

Yes

cudnnBackendGetAttribute

Yes

Yes

cudnnBackendExecute

Yes

Yes

cudnnCreateConvolutionDescriptor

Yes

Yes

cudnnDestroyConvolutionDescriptor

Yes

Yes

cudnnSetConvolution2dDescriptor

Yes

Yes

cudnnGetConvolution2dDescriptor

Yes

Yes

cudnnSetConvolutionNdDescriptor

Yes

Yes

cudnnGetConvolutionNdDescriptor

Yes

Yes

cudnnSetConvolutionMathType

Yes

Yes

cudnnGetConvolutionMathType

Yes

Yes

cudnnSetConvolutionGroupCount

Yes

Yes

cudnnGetConvolutionGroupCount

Yes

Yes

cudnnGetConvolution2dForwardOutputDim

Yes

Yes

cudnnGetConvolutionNdForwardOutputDim

Yes

Yes

cudnnGetConvolutionForwardAlgorithmMaxCount

Yes

Yes

cudnnGetConvolutionBackwardDataAlgorithmMaxCount

Yes

Yes

cudnnGetConvolutionForwardWorkspaceSize

Yes

Yes

cudnnGetConvolutionForwardAlgorithm_v7

Yes

Yes

cudnnFindConvolutionForwardAlgorithm

Yes

Yes

cudnnFindConvolutionForwardAlgorithmEx

Yes

Yes

cudnnConvolutionForward

Yes

Yes

cudnnConvolutionBiasActivationForward

Yes

Yes

cudnnGetConvolutionBackwardDataWorkspaceSize

Yes

Yes

cudnnFindConvolutionBackwardDataAlgorithm

Yes

Yes

cudnnFindConvolutionBackwardDataAlgorithmEx

Yes

Yes

cudnnGetConvolutionBackwardDataAlgorithm_v7

Yes

Yes

cudnnConvolutionBackwardData

Yes

Yes

cudnnGetFoldedConvBackwardDataDescriptors

Yes

Yes

cudnnCnnInferVersionCheck

Yes

Yes

cudnnGetConvolutionBackwardFilterWorkspaceSize

Yes

Yes

cudnnGetConvolutionBackwardFilterAlgorithmMaxCount

Yes

Yes

cudnnFindConvolutionBackwardFilterAlgorithm

Yes

Yes

cudnnFindConvolutionBackwardFilterAlgorithmEx

Yes

Yes

cudnnGetConvolutionBackwardFilterAlgorithm_v7

Yes

Yes

cudnnConvolutionBackwardFilter

Yes

Yes

cudnnCnnTrainVersionCheck

Yes

Yes

cudnnGetVersion

Yes

Yes

cudnnGetProperty

Yes

Yes

cudnnGetErrorString

Yes

Yes

cudnnCreate

Yes

Yes

cudnnDestroy

Yes

Yes

cudnnSetStream

Yes

Yes

cudnnGetStream

Yes

Yes

cudnnGetCudartVersion

Yes

Yes

cudnnCreateTensorDescriptor

Yes

Yes

cudnnDestroyTensorDescriptor

Yes

Yes

cudnnSetTensor4dDescriptor

Yes

Yes

cudnnSetTensor4dDescriptorEx

Yes

Yes

cudnnGetTensor4dDescriptor

Yes

Yes

cudnnSetTensorNdDescriptor

Yes

Yes

cudnnSetTensorNdDescriptorEx

Yes

Yes

cudnnGetTensorNdDescriptor

Yes

Yes

cudnnGetTensorSizeInBytes

Yes

Yes

cudnnCreateFilterDescriptor

Yes

Yes

cudnnDestroyFilterDescriptor

Yes

Yes

cudnnSetFilter4dDescriptor

Yes

Yes

cudnnGetFilter4dDescriptor

Yes

Yes

cudnnSetFilterNdDescriptor

Yes

Yes

cudnnGetFilterNdDescriptor

Yes

Yes

cudnnGetFilterSizeInBytes

Yes

Yes

cudnnDeriveBNTensorDescriptor

Yes

Yes

cudnnBatchNormalizationForwardInference

Yes

Yes

cudnnCreateOpTensorDescriptor

Yes

Yes

cudnnDestroyOpTensorDescriptor

Yes

Yes

cudnnSetOpTensorDescriptor

Yes

Yes

cudnnGetOpTensorDescriptor

Yes

Yes

cudnnCreatePoolingDescriptor

Yes

Yes

cudnnSetPooling2dDescriptor

Yes

Yes

cudnnSetPoolingNdDescriptor

Yes

Yes

cudnnGetPoolingNdForwardOutputDim

Yes

Yes

cudnnGetPooling2dForwardOutputDim

Yes

Yes

cudnnDestroyPoolingDescriptor

Yes

Yes

cudnnPoolingForward

Yes

Yes

cudnnCreateActivationDescriptor

Yes

Yes

cudnnSetActivationDescriptor

Yes

Yes

cudnnGetActivationDescriptor

Yes

Yes

cudnnDestroyActivationDescriptor

Yes

Yes

cudnnActivationForward

Yes

Yes

cudnnCreateDropoutDescriptor

Yes

Yes

cudnnDestroyDropoutDescriptor

Yes

Yes

cudnnDropoutGetStatesSize

Yes

Yes

cudnnDropoutGetReserveSpaceSize

Yes

Yes

cudnnSetDropoutDescriptor

Yes

Yes

cudnnRestoreDropoutDescriptor

Yes

Yes

cudnnGetDropoutDescriptor

Yes

Yes

cudnnDropoutForward

Yes

Yes

cudnnSoftmaxForward

Yes

Yes

cudnnAddTensor

Yes

Yes

cudnnScaleTensor

Yes

Yes

cudnnOpTensor

Yes

Yes

cudnnTransformTensor

Yes

Yes

cudnnCreateTensorTransformDescriptor

Yes

Yes

cudnnDestroyTensorTransformDescriptor

Yes

Yes

cudnnSetTensorTransformDescriptor

Yes

Yes

cudnnGetTensorTransformDescriptor

Yes

Yes

cudnnTransformTensorEx

Yes

Yes

cudnnInitTransformDest

Yes

Yes

cudnnTransformFilter

Yes

Yes

cudnnCreateReduceTensorDescriptor

Yes

Yes

cudnnDestroyReduceTensorDescriptor

Yes

Yes

cudnnSetReduceTensorDescriptor

Yes

Yes

cudnnGetReduceTensorDescriptor

Yes

Yes

cudnnReduceTensor

Yes

Yes

cudnnGetReductionWorkspaceSize

Yes

Yes

cudnnGetReductionIndicesSize

Yes

Yes

cudnnCreateLRNDescriptor

Yes

Yes

cudnnDestroyLRNDescriptor

Yes

Yes

cudnnGetLRNDescriptor

Yes

Yes

cudnnSetLRNDescriptor

Yes

Yes

cudnnLRNCrossChannelForward

Yes

Yes

cudnnCreateSpatialTransformerDescriptor

Yes

Yes

cudnnDestroySpatialTransformerDescriptor

Yes

Yes

cudnnSetSpatialTransformerNdDescriptor

Yes

Yes

cudnnSpatialTfGridGeneratorForward

Yes

Yes

cudnnSpatialTfSamplerForward

Yes

Yes

cudnnOpsInferVersionCheck

Yes

Yes

cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize

Yes

Yes

cudnnGetBatchNormalizationBackwardExWorkspaceSize

Yes

Yes

cudnnGetBatchNormalizationTrainingExReserveSpaceSize

Yes

Yes

cudnnBatchNormalizationForwardTraining

Yes

Yes

cudnnBatchNormalizationForwardTrainingEx

Yes

Yes

cudnnBatchNormalizationBackward

Yes

Yes

cudnnBatchNormalizationBackwardEx

Yes

Yes

cudnnPoolingBackward

Yes

Yes

cudnnActivationBackward

Yes

Yes

cudnnDropoutBackward

Yes

Yes

cudnnSoftmaxBackward

Yes

Yes

cudnnLRNCrossChannelBackward

Yes

Yes

cudnnSpatialTfSamplerBackward

Yes

Yes

cudnnSpatialTfGridGeneratorBackward

Yes

Yes

cudnnOpsTrainVersionCheck

Yes

Yes

cudnnQueryRuntimeError

Yes

No

Helper function; checks whether BN has numerical overflows.

cudnnSetTensor

Yes

No

Helper function; sets a tensor to a constant value.

cudnnGetPooling2dDescriptor

Yes

No

Helper functions; get the pooling 2d/Nd descriptor.

cudnnGetPoolingNdDescriptor

Yes

No

cudnnSetActivationDescriptorSwishBeta

Yes

No

Set/get of swish_beta in the activation descriptor.

cudnnGetActivationDescriptorSwishBeta

Yes

No

cudnnCreateAlgorithmDescriptor

Yes

No

deprecated in cuDNN 8.0

AlgorithmDescriptor parameter operations.

cudnnSetAlgorithmDescriptor

Yes

No

cudnnGetAlgorithmDescriptor

Yes

No

cudnnCopyAlgorithmDescriptor

Yes

No

cudnnDestroyAlgorithmDescriptor

Yes

No

cudnnCreateAlgorithmPerformance

Yes

No

AlgorithmPerformance parameter operations.

cudnnSetAlgorithmPerformance

Yes

No

cudnnGetAlgorithmPerformance

Yes

No

cudnnDestroyAlgorithmPerformance

Yes

No

cudnnGetAlgorithmSpaceSize

Yes

No

deprecated in cuDNN 8.0

Storage of algorithm metadata.

cudnnSaveAlgorithm

Yes

No

cudnnRestoreAlgorithm

Yes

No

cudnnSetCallback

Yes

No

Callback-related.

cudnnGetCallback

Yes

No

cudnnSetConvolutionReorderType

Yes

No

Set/get of the convolution reorder type.

cudnnGetConvolutionReorderType

Yes

No

cudnnIm2Col

Yes

No

im2col; constructs the matrix used in the forward pass.

cudnnReorderFilterAndBias

Yes

No

Reorders the filter and bias.

cudnnConvolutionBackwardBias

Yes

No

Computes the convolution gradient with respect to the bias.

cudnnCreateFusedOpsConstParamPack

Yes

No

cudnnFusedOps-related computation; can be replaced by the backend API.

cudnnDestroyFusedOpsConstParamPack

Yes

No

cudnnSetFusedOpsConstParamPackAttribute

Yes

No

cudnnGetFusedOpsConstParamPackAttribute

Yes

No

cudnnCreateFusedOpsVariantParamPack

Yes

No

cudnnDestroyFusedOpsVariantParamPack

Yes

No

cudnnSetFusedOpsVariantParamPackAttribute

Yes

No

cudnnGetFusedOpsVariantParamPackAttribute

Yes

No

cudnnCreateFusedOpsPlan

Yes

No

cudnnDestroyFusedOpsPlan

Yes

No

cudnnMakeFusedOpsPlan

Yes

No

cudnnFusedOpsExecute

Yes

No

cudnnCreatePersistentRNNPlan

Yes

No

deprecated in cuDNN 8.0

RNN persistent-plan algorithm operations.

cudnnDestroyPersistentRNNPlan

Yes

No

cudnnSetPersistentRNNPlan

Yes

No

cudnnBuildRNNDynamic

Yes

No

cudnnSetRNNPaddingMode

Yes

No

deprecated in cuDNN 8.0

RNN padding operations; use acdnnSetRNNDescriptor_v8() instead.

cudnnGetRNNPaddingMode

Yes

No

cudnnRNNForwardInferenceEx

Yes

No

Use acdnnRNNForward() instead.

cudnnSetRNNAlgorithmDescriptor

Yes

No

RNN algorithm search.

cudnnGetRNNForwardInferenceAlgorithmMaxCount

Yes

No

cudnnFindRNNForwardInferenceAlgorithmEx

Yes

No

cudnnRNNForwardTrainingEx

Yes

No

deprecated in cuDNN 8.0

Use acdnnRNNForward() instead.

cudnnRNNBackwardDataEx

Yes

No

Use acdnnRNNBackwardData_v8() instead.

cudnnRNNBackwardWeightsEx

Yes

No

Use acdnnRNNBackwardWeights_v8() instead.

cudnnGetRNNForwardTrainingAlgorithmMaxCount

Yes

No

RNN algorithm search.

cudnnFindRNNForwardTrainingAlgorithmEx

Yes

No

cudnnGetRNNBackwardDataAlgorithmMaxCount

Yes

No

cudnnFindRNNBackwardDataAlgorithmEx

Yes

No

cudnnGetRNNBackwardWeightsAlgorithmMaxCount

Yes

No

cudnnFindRNNBackwardWeightsAlgorithmEx

Yes

No

cudnnMultiHeadAttnBackwardData

Yes

No

MultiHeadAttn backward pass.

cudnnMultiHeadAttnBackwardWeights

Yes

No

cudnnDivisiveNormalizationForward

Yes

No

Forward DivisiveNormalization layer computation.

cudnnDeriveNormTensorDescriptor

Yes

No

Derives the tensor descriptor for a normalization layer.

cudnnNormalizationForwardInference

Yes

No

Forward Normalization layer computation.

cudnnDivisiveNormalizationBackward

Yes

No

Backward DivisiveNormalization layer computation.

cudnnGetNormalizationForwardTrainingWorkspaceSize

Yes

No

Normalization layer helper APIs; workspace size queries.

cudnnGetNormalizationBackwardWorkspaceSize

Yes

No

cudnnGetNormalizationTrainingReserveSpaceSize

Yes

No

cudnnNormalizationForwardTraining

Yes

No

Forward Normalization layer computation.

cudnnNormalizationBackward

Yes

No

Backward Normalization layer computation.

cuBLAS Support Status

Compared against cuBLAS 11.9.2, cuBLAS API support is shown in the table below:

  • Most APIs used in common NN scenarios are already supported and tuned.

  • Benchmarked against Ampere, all APIs are supportable in software; there are currently no PPU hardware limitations. Future software releases will fill the gaps by priority.

  • Current API support rate: 89/290 = 30.7%. The unsupported APIs fall mainly into the following categories:

    • Complex data types: 131

    • Special matrix types (symmetric, packed, triangular, etc.): 36

    • Utility helpers (set/get, memcpy, etc.): 18

    • Batched GEMV: 12

    • Uncommon algorithms (matrix addition geam, matrix inversion matinv): 4

  • Since AI scenarios do not involve complex or special matrix types, the API support rate for AI scenarios is 89/123 = 72.4%.
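For the supported GEMM family, a typical call looks like the following minimal sketch; checking the returned status surfaces any unsupported path at run time (the matrix sizes and all-ones inputs are only illustrative):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    const int n = 4;
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Single-precision GEMM; cuBLAS assumes column-major storage.
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, n, n, &alpha, A, n, B, n, &beta, C, n);
    if (st != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "sgemm failed with status %d\n", st);
    cudaDeviceSynchronize();
    printf("C[0] = %f\n", C[0]);  // 4.0 for all-ones 4x4 inputs
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```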

cuBLAS API support

| API | cuBLAS 11.9.2 | PPU 1.4 |
| --- | --- | --- |
| cublasCreate_v2 | Yes | Yes |
| cublasDestroy_v2 | Yes | Yes |
| cublasGetProperty | Yes | Yes |
| cublasSetStream_v2 | Yes | Yes |
| cublasGetStream_v2 | Yes | Yes |
| cublasGetMathMode | Yes | Yes |
| cublasSetMathMode | Yes | Yes |
| cublasGetPointerMode_v2 | Yes | Yes |
| cublasSetPointerMode_v2 | Yes | Yes |
| cublasSetWorkspace_v2 | Yes | Yes |
| cublasGetStatusString | Yes | Yes |
| cublasIamaxEx | Yes | Yes |
| cublasIsamax_v2 | Yes | Yes |
| cublasIdamax_v2 | Yes | Yes |
| cublasIaminEx | Yes | Yes |
| cublasIsamin_v2 | Yes | Yes |
| cublasIdamin_v2 | Yes | Yes |
| cublasAsumEx | Yes | Yes |
| cublasSasum_v2 | Yes | Yes |
| cublasDasum_v2 | Yes | Yes |
| cublasAxpyEx | Yes | Yes |
| cublasSaxpy_v2 | Yes | Yes |
| cublasDaxpy_v2 | Yes | Yes |
| cublasCopyEx | Yes | Yes |
| cublasScopy_v2 | Yes | Yes |
| cublasDcopy_v2 | Yes | Yes |
| cublasDotEx | Yes | Yes |
| cublasSdot_v2 | Yes | Yes |
| cublasDdot_v2 | Yes | Yes |
| cublasNrm2Ex | Yes | Yes |
| cublasSnrm2_v2 | Yes | Yes |
| cublasDnrm2_v2 | Yes | Yes |
| cublasRotEx | Yes | Yes |
| cublasSrot_v2 | Yes | Yes |
| cublasDrot_v2 | Yes | Yes |
| cublasRotgEx | Yes | Yes |
| cublasSrotg_v2 | Yes | Yes |
| cublasDrotg_v2 | Yes | Yes |
| cublasRotmEx | Yes | Yes |
| cublasSrotm_v2 | Yes | Yes |
| cublasDrotm_v2 | Yes | Yes |
| cublasRotmgEx | Yes | Yes |
| cublasSrotmg_v2 | Yes | Yes |
| cublasDrotmg_v2 | Yes | Yes |
| cublasScalEx | Yes | Yes |
| cublasSscal_v2 | Yes | Yes |
| cublasDscal_v2 | Yes | Yes |
| cublasSwapEx | Yes | Yes |
| cublasSswap_v2 | Yes | Yes |
| cublasDswap_v2 | Yes | Yes |
| cublasSgemv_v2 | Yes | Yes |
| cublasDgemv_v2 | Yes | Yes |
| cublasSgemm_v2 | Yes | Yes |
| cublasDgemm_v2 | Yes | Yes |
| cublasHgemm | Yes | Yes |
| cublasSgemmEx | Yes | Yes |
| cublasGemmEx | Yes | Yes |
| cublasHgemmBatched | Yes | Yes |
| cublasSgemmBatched | Yes | Yes |
| cublasGemmBatchedEx | Yes | Yes |
| cublasGemmStridedBatchedEx | Yes | Yes |
| cublasSgemmStridedBatched | Yes | Yes |
| cublasDgemmBatched | Yes | Yes |
| cublasDgemmStridedBatched | Yes | Yes |
| cublasHgemmStridedBatched | Yes | Yes |
| cublasSgetrfBatched | Yes | Yes |
| cublasDgetrfBatched | Yes | Yes |
| cublasSgetrsBatched | Yes | Yes |
| cublasDgetrsBatched | Yes | Yes |
| cublasSger_v2 | Yes | Yes |
| cublasDger_v2 | Yes | Yes |
| cublasSsyr_v2 | Yes | Yes |
| cublasDsyr_v2 | Yes | Yes |
| cublasSspr_v2 | Yes | Yes |
| cublasDspr_v2 | Yes | Yes |
| cublasSsyr2_v2 | Yes | Yes |
| cublasDsyr2_v2 | Yes | Yes |
| cublasSspr2_v2 | Yes | Yes |
| cublasDspr2_v2 | Yes | Yes |
| cublasStrsm_v2 | Yes | Yes |
| cublasDtrsm_v2 | Yes | Yes |
| cublasStrsmBatched | Yes | Yes |
| cublasDtrsmBatched | Yes | Yes |
| cublasSgetriBatched | Yes | Yes |
| cublasDgetriBatched | Yes | Yes |
| cublasSgeqrfBatched | Yes | Yes |
| cublasDgeqrfBatched | Yes | Yes |
| cublasSgelsBatched | Yes | Yes |
| cublasDgelsBatched | Yes | Yes |
| cublasGetVersion_v2 | Yes | No |
| cublasGetCudartVersion | Yes | No |
| cublasGetAtomicsMode | Yes | No |
| cublasSetAtomicsMode | Yes | No |
| cublasGetSmCountTarget | Yes | No |
| cublasSetSmCountTarget | Yes | No |
| cublasGetStatusName | Yes | No |
| cublasLoggerConfigure | Yes | No |
| cublasSetLoggerCallback | Yes | No |
| cublasGetLoggerCallback | Yes | No |
| cublasSetVector | Yes | No |
| cublasGetVector | Yes | No |
| cublasSetMatrix | Yes | No |
| cublasGetMatrix | Yes | No |
| cublasSetVectorAsync | Yes | No |
| cublasGetVectorAsync | Yes | No |
| cublasSetMatrixAsync | Yes | No |
| cublasGetMatrixAsync | Yes | No |
| cublasSgemvBatched | Yes | No |
| cublasDgemvBatched | Yes | No |
| cublasCgemvBatched | Yes | No |
| cublasZgemvBatched | Yes | No |
| cublasHSHgemvBatched | Yes | No |
| cublasHSSgemvBatched | Yes | No |
| cublasTSTgemvBatched | Yes | No |
| cublasTSSgemvBatched | Yes | No |
| cublasSgemvStridedBatched | Yes | No |
| cublasDgemvStridedBatched | Yes | No |
| cublasCgemvStridedBatched | Yes | No |
| cublasZgemvStridedBatched | Yes | No |
| cublasHSHgemvStridedBatched | Yes | No |
| cublasHSSgemvStridedBatched | Yes | No |
| cublasTSTgemvStridedBatched | Yes | No |
| cublasTSSgemvStridedBatched | Yes | No |
| cublasScnrm2_v2 | Yes | No |
| cublasDznrm2_v2 | Yes | No |
| cublasDotcEx | Yes | No |
| cublasCdotu_v2 | Yes | No |
| cublasCdotc_v2 | Yes | No |
| cublasZdotu_v2 | Yes | No |
| cublasZdotc_v2 | Yes | No |
| cublasCscal_v2 | Yes | No |
| cublasCsscal_v2 | Yes | No |
| cublasZscal_v2 | Yes | No |
| cublasZdscal_v2 | Yes | No |
| cublasCaxpy_v2 | Yes | No |
| cublasZaxpy_v2 | Yes | No |
| cublasCcopy_v2 | Yes | No |
| cublasZcopy_v2 | Yes | No |
| cublasCswap_v2 | Yes | No |
| cublasZswap_v2 | Yes | No |
| cublasIcamax_v2 | Yes | No |
| cublasIzamax_v2 | Yes | No |
| cublasIcamin_v2 | Yes | No |
| cublasIzamin_v2 | Yes | No |
| cublasScasum_v2 | Yes | No |
| cublasDzasum_v2 | Yes | No |
| cublasCrot_v2 | Yes | No |
| cublasCsrot_v2 | Yes | No |
| cublasZrot_v2 | Yes | No |
| cublasZdrot_v2 | Yes | No |
| cublasCrotg_v2 | Yes | No |
| cublasZrotg_v2 | Yes | No |
| cublasCgemv_v2 | Yes | No |
| cublasZgemv_v2 | Yes | No |
| cublasCgemm_v2 | Yes | No |
| cublasCgemm3m | Yes | No |
| cublasCgemm3mEx | Yes | No |
| cublasZgemm_v2 | Yes | No |
| cublasZgemm3m | Yes | No |
| cublasCgemmEx | Yes | No |
| cublasCgemmBatched | Yes | No |
| cublasCgemm3mBatched | Yes | No |
| cublasZgemmBatched | Yes | No |
| cublasCgemmStridedBatched | Yes | No |
| cublasCgemm3mStridedBatched | Yes | No |
| cublasZgemmStridedBatched | Yes | No |
| cublasCgetrfBatched | Yes | No |
| cublasZgetrfBatched | Yes | No |
| cublasCgetrsBatched | Yes | No |
| cublasZgetrsBatched | Yes | No |
| cublasSgbmv_v2 | Yes | No |
| cublasDgbmv_v2 | Yes | No |
| cublasCgbmv_v2 | Yes | No |
| cublasZgbmv_v2 | Yes | No |
| cublasStrmv_v2 | Yes | No |
| cublasDtrmv_v2 | Yes | No |
| cublasCtrmv_v2 | Yes | No |
| cublasZtrmv_v2 | Yes | No |
| cublasStbmv_v2 | Yes | No |
| cublasDtbmv_v2 | Yes | No |
| cublasCtbmv_v2 | Yes | No |
| cublasZtbmv_v2 | Yes | No |
| cublasStpmv_v2 | Yes | No |
| cublasDtpmv_v2 | Yes | No |
| cublasCtpmv_v2 | Yes | No |
| cublasZtpmv_v2 | Yes | No |
| cublasStrsv_v2 | Yes | No |
| cublasDtrsv_v2 | Yes | No |
| cublasCtrsv_v2 | Yes | No |
| cublasZtrsv_v2 | Yes | No |
| cublasStpsv_v2 | Yes | No |
| cublasDtpsv_v2 | Yes | No |
| cublasCtpsv_v2 | Yes | No |
| cublasZtpsv_v2 | Yes | No |
| cublasStbsv_v2 | Yes | No |
| cublasDtbsv_v2 | Yes | No |
| cublasCtbsv_v2 | Yes | No |
| cublasZtbsv_v2 | Yes | No |
| cublasSsymv_v2 | Yes | No |
| cublasDsymv_v2 | Yes | No |
| cublasCsymv_v2 | Yes | No |
| cublasZsymv_v2 | Yes | No |
| cublasChemv_v2 | Yes | No |
| cublasZhemv_v2 | Yes | No |
| cublasSsbmv_v2 | Yes | No |
| cublasDsbmv_v2 | Yes | No |
| cublasChbmv_v2 | Yes | No |
| cublasZhbmv_v2 | Yes | No |
| cublasSspmv_v2 | Yes | No |
| cublasDspmv_v2 | Yes | No |
| cublasChpmv_v2 | Yes | No |
| cublasZhpmv_v2 | Yes | No |
| cublasCgeru_v2 | Yes | No |
| cublasCgerc_v2 | Yes | No |
| cublasZgeru_v2 | Yes | No |
| cublasZgerc_v2 | Yes | No |
| cublasCsyr_v2 | Yes | No |
| cublasZsyr_v2 | Yes | No |
| cublasCher_v2 | Yes | No |
| cublasZher_v2 | Yes | No |
| cublasChpr_v2 | Yes | No |
| cublasZhpr_v2 | Yes | No |
| cublasCsyr2_v2 | Yes | No |
| cublasZsyr2_v2 | Yes | No |
| cublasCher2_v2 | Yes | No |
| cublasZher2_v2 | Yes | No |
| cublasChpr2_v2 | Yes | No |
| cublasZhpr2_v2 | Yes | No |
| cublasSsyrk_v2 | Yes | No |
| cublasDsyrk_v2 | Yes | No |
| cublasCsyrk_v2 | Yes | No |
| cublasZsyrk_v2 | Yes | No |
| cublasCsyrkEx | Yes | No |
| cublasCsyrk3mEx | Yes | No |
| cublasCherk_v2 | Yes | No |
| cublasZherk_v2 | Yes | No |
| cublasCherkEx | Yes | No |
| cublasCherk3mEx | Yes | No |
| cublasSsyr2k_v2 | Yes | No |
| cublasDsyr2k_v2 | Yes | No |
| cublasCsyr2k_v2 | Yes | No |
| cublasZsyr2k_v2 | Yes | No |
| cublasCher2k_v2 | Yes | No |
| cublasZher2k_v2 | Yes | No |
| cublasSsyrkx | Yes | No |
| cublasDsyrkx | Yes | No |
| cublasCsyrkx | Yes | No |
| cublasZsyrkx | Yes | No |
| cublasCherkx | Yes | No |
| cublasZherkx | Yes | No |
| cublasSsymm_v2 | Yes | No |
| cublasDsymm_v2 | Yes | No |
| cublasCsymm_v2 | Yes | No |
| cublasZsymm_v2 | Yes | No |
| cublasChemm_v2 | Yes | No |
| cublasZhemm_v2 | Yes | No |
| cublasCtrsm_v2 | Yes | No |
| cublasZtrsm_v2 | Yes | No |
| cublasStrmm_v2 | Yes | No |
| cublasDtrmm_v2 | Yes | No |
| cublasCtrmm_v2 | Yes | No |
| cublasZtrmm_v2 | Yes | No |
| cublasSgeam | Yes | No |
| cublasDgeam | Yes | No |
| cublasCgeam | Yes | No |
| cublasZgeam | Yes | No |
| cublasCgetriBatched | Yes | No |
| cublasZgetriBatched | Yes | No |
| cublasCtrsmBatched | Yes | No |
| cublasZtrsmBatched | Yes | No |
| cublasSmatinvBatched | Yes | No |
| cublasDmatinvBatched | Yes | No |
| cublasCmatinvBatched | Yes | No |
| cublasZmatinvBatched | Yes | No |
| cublasCgeqrfBatched | Yes | No |
| cublasZgeqrfBatched | Yes | No |
| cublasCgelsBatched | Yes | No |
| cublasZgelsBatched | Yes | No |
| cublasSdgmm | Yes | No |
| cublasDdgmm | Yes | No |
| cublasCdgmm | Yes | No |
| cublasZdgmm | Yes | No |
| cublasStpttr | Yes | No |
| cublasDtpttr | Yes | No |
| cublasCtpttr | Yes | No |
| cublasZtpttr | Yes | No |
| cublasStrttp | Yes | No |
| cublasDtrttp | Yes | No |
| cublasCtrttp | Yes | No |
| cublasZtrttp | Yes | No |
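When code needs to run unmodified on both stock CUDA and PPU 1.4, it can help to consult this matrix before dispatching, rather than assuming every cuBLAS entry point is available. The sketch below is illustrative only: `PPU_14_SUPPORT` is a hand-copied excerpt of the table above, and `can_call` is a hypothetical helper, not part of any SDK.

```python
# Excerpt of the support matrix above: True = supported on PPU 1.4.
PPU_14_SUPPORT = {
    "cublasSgemm_v2": True,
    "cublasGemmEx": True,
    "cublasSgemmStridedBatched": True,
    "cublasSetVector": False,  # with unit strides, cudaMemcpy is an equivalent
    "cublasCgemm_v2": False,   # complex data types are not covered
    "cublasSgeam": False,
}

def can_call(api_name: str, on_ppu: bool) -> bool:
    """Return True if api_name may be called on the current backend.

    On NVIDIA CUDA every API in the table is available (the cuBLAS
    11.9.2 column is all Yes); on PPU 1.4 we consult the excerpt above
    and conservatively treat unknown names as unsupported.
    """
    if not on_ppu:
        return True
    return PPU_14_SUPPORT.get(api_name, False)

print(can_call("cublasSgemm_v2", on_ppu=True))   # True
print(can_call("cublasSetVector", on_ppu=True))  # False
```

In a real application the lookup would live behind the dispatch layer, falling back to a supported composition (for example, plain `cudaMemcpy` instead of `cublasSetVector`) when the direct call is unavailable.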