CUDA Compatibility
This topic describes PPU compatibility with CUDA.
PPU CUDA compatibility approach, principles, and advantages
CUDA (Compute Unified Device Architecture) is a parallel computing architecture from NVIDIA that lets developers use the computing power of NVIDIA GPUs to accelerate compute-intensive tasks. It provides a programming model and APIs through which programmers can write GPU-parallel code in languages such as C, C++, and Fortran. With CUDA, developers can run massively parallel computations on the GPU to improve program performance; it is widely used in scientific computing, image processing, machine learning, and other fields. PPU is compatible with CUDA; for the specific approach, principles, and advantages, see CUDA兼容讨论.pptx.
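To make the programming model concrete, here is a minimal vector-add sketch in the same pattern as the vectorAdd sample listed below. This is standard CUDA (unified memory is used only to keep the sketch short), not PPU-specific code:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // cover all n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                     // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```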
CUDA Sample Compatibility
CUDA Sample compatibility list
CUDA Sample | Status | Comments |
simpleVoteIntrinsics | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
vectorAdd_nvrtc | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
deviceQuery | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
reduction | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
tf32TensorCoreGemm | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
shfl_scan | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
warpAggregatedAtomicsCG | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
concurrentKernels | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
bf16TensorCoreGemm | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
bandwidthTest | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
UnifiedMemoryPerf | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
binaryPartitionCG | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
conjugateGradientMultiBlockCG | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
cudaCompressibleMemory | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
cudaTensorCoreGemm | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
globalToShmemAsyncCopy | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
matrixMul | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
matrixMulDrv | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
nvJPEG | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
nvJPEG_encoder | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
p2pBandwidthLatencyTest | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleAWBarrier | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleCudaGraphs | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleZeroCopy | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleDrvRuntime | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
vectorAddMMAP | ❌ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | The 11.1 sample's build flow depends on PTX, which PPU does not support. |
simpleIPC | ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
streamOrderedAllocation | ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
streamOrderedAllocationIPC | ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simplePrintf | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleTemplates | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleOccupancy | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
topologyQuery | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
clock | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
cppIntegration | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
dwtHaar1D | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
vectorAdd | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
vectorAddDrv | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
scalarProd | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleVoteIntrinsics_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
SobolQRNG | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleCooperativeGroups | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleAtomicIntrinsics | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
cudaOpenMP | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
fp16ScalarProduct | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
inlinePTX | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleMPI | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
template | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleHyperQ | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
reductionMultiBlockCG | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
threadFenceReduction | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
mergeSort | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
convolutionSeparable | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
FDTD3d | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
matrixMulCUBLAS | ❌ 11.5 ❌ 11.6 ❌ 11.7 ❌ 11.8 ❌ 12.0 ❌ 12.1 ❌ 12.2 ❌ 12.3 ❌ 12.4 ❌ 12.5 ❌ 12.6 | Results differ from NVIDIA in precision because the matrixMul computation methods differ; PPU chose the higher-performance implementation. |
sortingNetworks | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
fastWalshTransform | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
alignedTypes | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
deviceQueryDrv | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
scan | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
BlackScholes | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
transpose | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
histogram | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
MC_SingleAsianOptionP | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
MC_EstimatePiInlineP | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
quasirandomGenerator | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
binomialOptions | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
MonteCarloMultiGPU | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
UnifiedMemoryStreams | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
asyncAPI | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
c++11_cuda | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
cppOverload | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
cuHook | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
eigenvalues | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
interval | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
newdelete | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
radixSortThrust | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
segmentationTreeThrust | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleAssert | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleAttributes | ❌ 11.5 ❌ 11.6 ❌ 11.7 ❌ 11.8 ❌ 12.0 ❌ 12.1 ❌ 12.2 ❌ 12.3 ❌ 12.4 ❌ 12.5 ❌ 12.6 | The sample passes, but its runtime is extremely long: the "Maximum y- or z-dimension of a grid of thread blocks" is 65535 on NVIDIA but 2^31-1 on PPU, and the sample code depends on this value. |
simpleMultiCopy | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleMultiGPU | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleP2P | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleSeparateCompilation | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleStreams | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
threadMigration | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
binomialOptions_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
clock_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
inlinePTX_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
matrixMul_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
quasirandomGenerator_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleAssert_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleAtomicIntrinsics_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
simpleTemplates_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
BlackScholes_nvrtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
libNVVM | ❌ 12.3 ❌ 12.4 ❌ 12.5 ❌ 12.6 | PPU is currently compatible with the LLVM IR for NVGPU definition, but not fully compatible with NVIDIA's official NVVM IR definition. |
StreamPriorities | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
MC_EstimatePiInlineQ | ❌ | curand-related APIs are not yet fully supported. |
MC_EstimatePiP | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
MC_EstimatePiQ | ❌ | curand-related APIs (curandCreateGenerator) are not yet fully supported. |
MersenneTwisterGP11213 | ❌ | curand-related APIs are not yet fully supported. |
batchCUBLAS | ❌ | cuBLAS-related APIs (cublasSetMatrix) are not yet fully supported. |
batchedLabelMarkersAndLabelCompressionNPP | ❌ | The following APIs are not yet supported: nppiLabelMarkersUFGetBufferSize_32u_C1R, nppiCompressMarkerLabelsGetBufferSize_32u_C1R, nppiLabelMarkersUFBatch_8u32u_C1R_Advanced_Ctx, nppiCompressMarkerLabelsUFBatch_32u_C1IR_Advanced_Ctx |
boxFilterNPP | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
conjugateGradientCudaGraphs | ❌ | The cuSPARSE library is not yet fully supported |
conjugateGradient | ❌ | |
conjugateGradientMultiDeviceCG | ❌ | |
conjugateGradientPrecond | ❌ | |
conjugateGradientUM | ❌ | |
cuSolverDn_LinearSolver | ❌ | The cuSOLVER library (cusolverSpCreate, etc.) is not yet fully supported |
cuSolverRf | ❌ | |
cuSolverSp_LinearSolver | ❌ | |
cuSolverSp_LowlevelCholesky | ❌ | |
cuSolverSp_LowlevelQR | ❌ | |
graphMemoryFootprint | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
graphMemoryNodes | ❌ | The graph memory-allocate and memory-free node APIs are not yet defined; the bundled headers predate the version that introduced them |
immaTensorCoreGemm | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
jacobiCudaGraphs | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
matrixMulDynlinkJIT | ❌ | Dynamic-link JIT is not currently supported |
memMapIPCDrv | ❌ | The granularity obtained from cuMemGetAllocationGranularity is 2 MB, while PPU's design uses 8 MB; the checks in the sample code need to be modified. |
nbody | ❌ | Graphics/OpenGL-related APIs are not supported |
Mandelbrot | ❌ | |
particles | ❌ | |
oceanFFT | ❌ | |
simpleCUDA2GL | ❌ | |
simpleGL | ❌ | |
recursiveGaussian | ❌ | |
ptxjit | ❌ | PPU does not support PTX |
randomFog | ❌ | Graphical display capability is not available. |
simpleCUBLAS | ❌ | The cuBLAS API implementation is incomplete |
simpleCUBLASXT | ❌ | |
simpleCUBLAS_LU | ❌ | |
simpleCUFFT | ❌ | The cuFFT library is not supported |
simpleCUFFT_2d_MGPU | ❌ | |
simpleCUFFT_MGPU | ❌ | |
simpleCUFFT_callback | ❌ | |
systemWideAtomics | ❌ | |
FilterBorderControlNPP | ❌ | NPP-related APIs are not yet supported. |
watershedSegmentationNPP | ❌ | |
streamOrderedAllocationP2P | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
EGLStream_CUDA_CrossGPU | ❌ | EGL-related APIs are not supported |
EGLStream_CUDA_Interop | ❌ | |
EGLStreams_CUDA_Interop | ❌ | |
EGLSync_CUDAEvent_Interop | ❌ | GLES is not supported |
cuDLALayerwiseStatsStandalone | ❌ | |
cuDLALayerwiseStatsHybrid | ❌ | |
simpleGLES_EGLOutput | ❌ | |
fluidsGLES | ❌ | |
nbody_opengles | ❌ | |
simpleGLES | ❌ | |
simpleGLES_screen | ❌ | |
nbody_screen | ❌ | |
cuDLAHybridMode | ❌ | |
cuDLAStandaloneMode | ❌ | |
cuDLAErrorReporting | ❌ | |
cudaNvSciNvMedia | ❌ | |
cdpAdvancedQuicksort | ❌ | The CUDA CDP (dynamic parallelism) feature is not supported |
cdpBezierTessellation | ❌ | |
cdpQuadtree | ❌ | |
cdpSimplePrint | ❌ | |
cdpSimpleQuicksort | ❌ | |
cudaNvSci | ❌ | libnvscibuf.so not found; NvSci is not yet supported |
dmmaTensorCoreGemm | ❌ | PPU tensor cores do not support double-precision MMA instructions. |
dxtc | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
freeImageInteropNPP | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
histEqualizationNPP | ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
cannyEdgeDetectorNPP | ✅ 11.1 ✅ 11.2 ✅ 11.3 ✅ 11.4 ✅ 11.5 ✅ 11.6 ✅ 11.7 ✅ 11.8 ✅ 12.0 ✅ 12.1 ✅ 12.2 ✅ 12.3 ✅ 12.4 ✅ 12.5 ✅ 12.6 | |
convolutionTexture | ❌ | PPU does not support texture or GL functionality |
bindlessTexture | ❌ | |
bicubicTexture | ❌ | |
HSOpticalFlow | ❌ | |
simpleLayeredTexture | ❌ | |
simplePitchLinearTexture | ❌ | |
simpleSurfaceWrite | ❌ | |
simpleTexture | ❌ | |
simpleTexture3D | ❌ | |
simpleTextureDrv | ❌ | |
simpleCubemapTexture | ❌ | |
volumeFiltering | ❌ | |
volumeRender | ❌ | |
vulkanImageCUDA | ❌ | |
stereoDisparity | ❌ | |
boxFilter | ❌ | |
bilateralFilter | ❌ | |
postProcessGL | ❌ | |
imageDenoising | ❌ | |
fluidsGL | ❌ | |
smokeParticles | ❌ | |
lineOfSight | ❌ | |
marchingCubes | ❌ | |
convolutionFFT2D | ❌ | |
dct8x8 | ❌ | |
NV12toBGRandResize | ❌ | |
SobelFilter | ❌ | |
FunctionPointers | ❌ | |
simpleVulkan | ❌ | Vulkan-CUDA interop features are not yet supported |
simpleVulkanMMAP | ❌ | |
simpleD3D10 | ❌ | PPU does not support D3D graphics APIs |
fluidsD3D9 | ❌ | |
simpleD3D11 | ❌ | |
simpleD3D11Texture | ❌ | |
simpleD3D10RenderTarget | ❌ | |
simpleD3D10Texture | ❌ | |
SLID3D10Texture | ❌ | |
VFlockingD3D10 | ❌ | |
simpleD3D12 | ❌ | |
simpleD3D9Texture | ❌ | |
simpleD3D9 | ❌ | |
cudaGraphsPerfScaling | ❌ 12.5 ❌ 12.6 | The cudaGraphUpload API is not yet supported |
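Several of the failures in the table above come from code that hardcodes device limits (for example, simpleAttributes assumes a 65535 grid-dimension cap and memMapIPCDrv assumes a 2 MB allocation granularity). A portability sketch, using standard CUDA runtime and driver API calls and assuming device 0, is to query these limits at runtime instead of assuming NVIDIA's defaults:

```cuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Runtime API: grid-dimension limits differ between NVIDIA
    // (65535 for y/z) and PPU (2^31 - 1), so read them from the device.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max grid dims: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // Driver API: allocation granularity is 2 MB on NVIDIA but 8 MB by
    // design on PPU, so sizes passed to cuMemCreate should be rounded
    // up to the queried value rather than a hardcoded one.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUmemAllocationProp ap = {};
    ap.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    ap.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    ap.location.id = dev;
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &ap,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    printf("allocation granularity: %zu bytes\n", gran);
    return 0;
}
```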
Unsupported CUDA Runtime APIs
Aligned with CUDA Runtime 12.3 (modules marked [DEPRECATED] in the CUDA Runtime 12.3 spec are not supported at all and are not listed here).
List of unsupported APIs
Module | CUDA Runtime API | Remarks |
Device Management | RDMA flush; does not affect RDMA usage | |
NvSci library; currently no practical demand | ||
Graphics-related; not needed for AI scenarios | ||
External Resource Interoperability | Graphics API interop; currently only CUDA APIs are supported | |
Execution Control | Deprecated as of CUDA 7.5; will never be supported | |
Deprecated as of CUDA 7.5; will never be supported | ||
Occupancy | Cluster-related | |
Memory Management | All Array-related except cudaMemAdvise_v2; not needed for AI scenarios | |
OpenGL Interoperability | Graphics相关,AI场景无需求 | |
Direct3D 9 Interoperability | ||
Direct3D 10 Interoperability | ||
Direct3D 11 Interoperability | ||
VDPAU Interoperability | ||
EGL Interoperability | ||
Graphics Interoperability | ||
Texture Object Management | ||
Surface Object Management | ||
Graph Management | Device functions require CDP support | |
Mostly Graphics-related; not needed for AI scenarios | ||
In addition, a small number of APIs have individual enum values that are unsupported; these are not listed here — the actual runtime return of cudaErrorNotSupported or cudaErrorInvalidValue is authoritative.
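Because some unsupported features surface only at runtime, callers should check return codes rather than assume success. A minimal sketch (the CHECK macro is a hypothetical helper, not part of the CUDA API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Wraps a runtime call and reports cudaErrorNotSupported explicitly,
// since unsupported enum values surface only through return codes.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err == cudaErrorNotSupported)                              \
            printf("%s: not supported on this device\n", #call);      \
        else if (err != cudaSuccess)                                   \
            printf("%s: %s\n", #call, cudaGetErrorString(err));       \
    } while (0)

int main() {
    int* p = nullptr;
    CHECK(cudaMalloc(&p, 1024));   // supported path: returns cudaSuccess
    CHECK(cudaFree(p));
    return 0;
}
```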
CUDA cuDNN Support
Compared against cuDNN 8.5.0, cuDNN API support is shown in the table below:
Most of the APIs used in common NN scenarios are already supported and tuned.
Benchmarked against Ampere, every API is supportable in software; there are currently no PPU hardware limitations. Later software releases will fill the remaining gaps by priority.
Current API support rate: 196/263 = 74.5%.
Excluding the 26 deprecated APIs, the support rate is 196/237 = 82.7%.
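As with the runtime APIs, whether a given cuDNN entry point is usable can be detected from its return status. A minimal sketch, assuming the cuDNN headers and library are installed (handle creation and version query are in the supported set per the table below):

```cuda
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t h;
    cudnnStatus_t st = cudnnCreate(&h);
    if (st != CUDNN_STATUS_SUCCESS) {
        printf("cudnnCreate failed: %s\n", cudnnGetErrorString(st));
        return 1;
    }
    printf("cuDNN version: %zu\n", cudnnGetVersion());
    // Unsupported entry points return CUDNN_STATUS_NOT_SUPPORTED (or are
    // absent from older headers), so check every status rather than assume.
    cudnnDestroy(h);
    return 0;
}
```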
cuDNN API support
API | cuDNN 8.5.0 | PPU 1.4 | Status | Description |
cudnnCreateRNNDescriptor | Yes | Yes | ||
cudnnDestroyRNNDescriptor | Yes | Yes | ||
cudnnSetRNNDescriptor_v8 | Yes | Yes | ||
cudnnGetRNNDescriptor_v8 | Yes | Yes | ||
cudnnSetRNNDescriptor_v6 | Yes | Yes | ||
cudnnGetRNNDescriptor_v6 | Yes | Yes | ||
cudnnSetRNNMatrixMathType | Yes | Yes | ||
cudnnGetRNNMatrixMathType | Yes | Yes | ||
cudnnSetRNNBiasMode | Yes | Yes | ||
cudnnGetRNNBiasMode | Yes | Yes | ||
cudnnRNNSetClip_v8 | Yes | Yes | ||
cudnnRNNGetClip_v8 | Yes | Yes | ||
cudnnRNNSetClip | Yes | Yes | ||
cudnnRNNGetClip | Yes | Yes | ||
cudnnSetRNNProjectionLayers | Yes | Yes | ||
cudnnGetRNNProjectionLayers | Yes | Yes | ||
cudnnGetRNNWorkspaceSize | Yes | Yes | ||
cudnnGetRNNTrainingReserveSize | Yes | Yes | ||
cudnnGetRNNTempSpaceSizes | Yes | Yes | ||
cudnnGetRNNParamsSize | Yes | Yes | ||
cudnnGetRNNWeightSpaceSize | Yes | Yes | ||
cudnnGetRNNLinLayerMatrixParams | Yes | Yes | ||
cudnnGetRNNLinLayerBiasParams | Yes | Yes | ||
cudnnGetRNNWeightParams | Yes | Yes | ||
cudnnRNNForwardInference | Yes | Yes | ||
cudnnCreateRNNDataDescriptor | Yes | Yes | ||
cudnnDestroyRNNDataDescriptor | Yes | Yes | ||
cudnnSetRNNDataDescriptor | Yes | Yes | ||
cudnnGetRNNDataDescriptor | Yes | Yes | ||
cudnnRNNForward | Yes | Yes | ||
cudnnCreateSeqDataDescriptor | Yes | Yes | ||
cudnnDestroySeqDataDescriptor | Yes | Yes | ||
cudnnSetSeqDataDescriptor | Yes | Yes | ||
cudnnGetSeqDataDescriptor | Yes | Yes | ||
cudnnCreateAttnDescriptor | Yes | Yes | ||
cudnnDestroyAttnDescriptor | Yes | Yes | ||
cudnnSetAttnDescriptor | Yes | Yes | ||
cudnnGetAttnDescriptor | Yes | Yes | ||
cudnnGetMultiHeadAttnBuffers | Yes | Yes | ||
cudnnGetMultiHeadAttnWeights | Yes | Yes | ||
cudnnMultiHeadAttnForward | Yes | Yes | ||
cudnnAdvInferVersionCheck | Yes | Yes | ||
cudnnRNNForwardTraining | Yes | Yes | ||
cudnnRNNBackwardData | Yes | Yes | ||
cudnnRNNBackwardData_v8 | Yes | Yes | ||
cudnnRNNBackwardWeights | Yes | Yes | ||
cudnnRNNBackwardWeights_v8 | Yes | Yes | ||
cudnnDestroyCTCLossDescriptor | Yes | Yes | ||
cudnnCreateCTCLossDescriptor | Yes | Yes | ||
cudnnSetCTCLossDescriptor | Yes | Yes | ||
cudnnGetCTCLossDescriptor | Yes | Yes | ||
cudnnSetCTCLossDescriptorEx | Yes | Yes | ||
cudnnGetCTCLossDescriptorEx | Yes | Yes | ||
cudnnSetCTCLossDescriptor_v8 | Yes | Yes | ||
cudnnGetCTCLossDescriptor_v8 | Yes | Yes | ||
cudnnGetCTCLossWorkspaceSize_v8 | Yes | Yes | ||
cudnnCTCLoss | Yes | Yes | ||
cudnnGetCTCLossWorkspaceSize | Yes | Yes | ||
cudnnCTCLoss_v8 | Yes | Yes | ||
cudnnAdvTrainVersionCheck | Yes | Yes | ||
cudnnBackendCreateDescriptor | Yes | Yes | ||
cudnnBackendDestroyDescriptor | Yes | Yes | ||
cudnnBackendInitialize | Yes | Yes | ||
cudnnBackendFinalize | Yes | Yes | ||
cudnnBackendSetAttribute | Yes | Yes | ||
cudnnBackendGetAttribute | Yes | Yes | ||
cudnnBackendExecute | Yes | Yes | ||
cudnnCreateConvolutionDescriptor | Yes | Yes | ||
cudnnDestroyConvolutionDescriptor | Yes | Yes | ||
cudnnSetConvolution2dDescriptor | Yes | Yes | ||
cudnnGetConvolution2dDescriptor | Yes | Yes | ||
cudnnSetConvolutionNdDescriptor | Yes | Yes | ||
cudnnGetConvolutionNdDescriptor | Yes | Yes | ||
cudnnSetConvolutionMathType | Yes | Yes | ||
cudnnGetConvolutionMathType | Yes | Yes | ||
cudnnSetConvolutionGroupCount | Yes | Yes | ||
cudnnGetConvolutionGroupCount | Yes | Yes | ||
cudnnGetConvolution2dForwardOutputDim | Yes | Yes | ||
cudnnGetConvolutionNdForwardOutputDim | Yes | Yes | ||
cudnnGetConvolutionForwardAlgorithmMaxCount | Yes | Yes | ||
cudnnGetConvolutionBackwardDataAlgorithmMaxCount | Yes | Yes | ||
cudnnGetConvolutionForwardWorkspaceSize | Yes | Yes | ||
cudnnGetConvolutionForwardAlgorithm_v7 | Yes | Yes | ||
cudnnFindConvolutionForwardAlgorithm | Yes | Yes | ||
cudnnFindConvolutionForwardAlgorithmEx | Yes | Yes | ||
cudnnConvolutionForward | Yes | Yes | ||
cudnnConvolutionBiasActivationForward | Yes | Yes | ||
cudnnGetConvolutionBackwardDataWorkspaceSize | Yes | Yes | ||
cudnnFindConvolutionBackwardDataAlgorithm | Yes | Yes | ||
cudnnFindConvolutionBackwardDataAlgorithmEx | Yes | Yes | ||
cudnnGetConvolutionBackwardDataAlgorithm_v7 | Yes | Yes | ||
cudnnConvolutionBackwardData | Yes | Yes | ||
cudnnGetFoldedConvBackwardDataDescriptors | Yes | Yes | ||
cudnnCnnInferVersionCheck | Yes | Yes | ||
cudnnGetConvolutionBackwardFilterWorkspaceSize | Yes | Yes | ||
cudnnGetConvolutionBackwardFilterAlgorithmMaxCount | Yes | Yes | ||
cudnnFindConvolutionBackwardFilterAlgorithm | Yes | Yes | ||
cudnnFindConvolutionBackwardFilterAlgorithmEx | Yes | Yes | ||
cudnnGetConvolutionBackwardFilterAlgorithm_v7 | Yes | Yes | ||
cudnnConvolutionBackwardFilter | Yes | Yes | ||
cudnnCnnTrainVersionCheck | Yes | Yes | ||
cudnnGetVersion | Yes | Yes | ||
cudnnGetProperty | Yes | Yes | ||
cudnnGetErrorString | Yes | Yes | ||
cudnnCreate | Yes | Yes | ||
cudnnDestroy | Yes | Yes | ||
cudnnSetStream | Yes | Yes | ||
cudnnGetStream | Yes | Yes | ||
cudnnGetCudartVersion | Yes | Yes | ||
cudnnCreateTensorDescriptor | Yes | Yes | ||
cudnnDestroyTensorDescriptor | Yes | Yes | ||
cudnnSetTensor4dDescriptor | Yes | Yes | ||
cudnnSetTensor4dDescriptorEx | Yes | Yes | ||
cudnnGetTensor4dDescriptor | Yes | Yes | ||
cudnnSetTensorNdDescriptor | Yes | Yes | ||
cudnnSetTensorNdDescriptorEx | Yes | Yes | ||
cudnnGetTensorNdDescriptor | Yes | Yes | ||
cudnnGetTensorSizeInBytes | Yes | Yes | ||
cudnnCreateFilterDescriptor | Yes | Yes | ||
cudnnDestroyFilterDescriptor | Yes | Yes | ||
cudnnSetFilter4dDescriptor | Yes | Yes | ||
cudnnGetFilter4dDescriptor | Yes | Yes | ||
cudnnSetFilterNdDescriptor | Yes | Yes | ||
cudnnGetFilterNdDescriptor | Yes | Yes | ||
cudnnGetFilterSizeInBytes | Yes | Yes | ||
cudnnDeriveBNTensorDescriptor | Yes | Yes | ||
cudnnBatchNormalizationForwardInference | Yes | Yes | ||
cudnnCreateOpTensorDescriptor | Yes | Yes | ||
cudnnDestroyOpTensorDescriptor | Yes | Yes | ||
cudnnSetOpTensorDescriptor | Yes | Yes | ||
cudnnGetOpTensorDescriptor | Yes | Yes | ||
cudnnCreatePoolingDescriptor | Yes | Yes | ||
cudnnSetPooling2dDescriptor | Yes | Yes | ||
cudnnSetPoolingNdDescriptor | Yes | Yes | ||
cudnnGetPoolingNdForwardOutputDim | Yes | Yes | ||
cudnnGetPooling2dForwardOutputDim | Yes | Yes | ||
cudnnDestroyPoolingDescriptor | Yes | Yes | ||
cudnnPoolingForward | Yes | Yes | ||
cudnnCreateActivationDescriptor | Yes | Yes | ||
cudnnSetActivationDescriptor | Yes | Yes | ||
cudnnGetActivationDescriptor | Yes | Yes | ||
cudnnDestroyActivationDescriptor | Yes | Yes | ||
cudnnActivationForward | Yes | Yes | ||
cudnnCreateDropoutDescriptor | Yes | Yes | ||
cudnnDestroyDropoutDescriptor | Yes | Yes | ||
cudnnDropoutGetStatesSize | Yes | Yes | ||
cudnnDropoutGetReserveSpaceSize | Yes | Yes | ||
cudnnSetDropoutDescriptor | Yes | Yes | ||
cudnnRestoreDropoutDescriptor | Yes | Yes | ||
cudnnGetDropoutDescriptor | Yes | Yes | ||
cudnnDropoutForward | Yes | Yes | ||
cudnnSoftmaxForward | Yes | Yes | ||
cudnnAddTensor | Yes | Yes | ||
cudnnScaleTensor | Yes | Yes | ||
cudnnOpTensor | Yes | Yes | ||
cudnnTransformTensor | Yes | Yes | ||
cudnnCreateTensorTransformDescriptor | Yes | Yes | ||
cudnnDestroyTensorTransformDescriptor | Yes | Yes | ||
cudnnSetTensorTransformDescriptor | Yes | Yes | ||
cudnnGetTensorTransformDescriptor | Yes | Yes | ||
cudnnTransformTensorEx | Yes | Yes | ||
cudnnInitTransformDest | Yes | Yes | ||
cudnnTransformFilter | Yes | Yes | ||
cudnnCreateReduceTensorDescriptor | Yes | Yes | ||
cudnnDestroyReduceTensorDescriptor | Yes | Yes | ||
cudnnSetReduceTensorDescriptor | Yes | Yes | ||
cudnnGetReduceTensorDescriptor | Yes | Yes | ||
cudnnReduceTensor | Yes | Yes | ||
cudnnGetReductionWorkspaceSize | Yes | Yes | ||
cudnnGetReductionIndicesSize | Yes | Yes | ||
cudnnCreateLRNDescriptor | Yes | Yes | ||
cudnnDestroyLRNDescriptor | Yes | Yes | ||
cudnnGetLRNDescriptor | Yes | Yes | ||
cudnnSetLRNDescriptor | Yes | Yes | ||
cudnnLRNCrossChannelForward | Yes | Yes | ||
cudnnCreateSpatialTransformerDescriptor | Yes | Yes | ||
cudnnDestroySpatialTransformerDescriptor | Yes | Yes | ||
cudnnSetSpatialTransformerNdDescriptor | Yes | Yes | ||
cudnnSpatialTfGridGeneratorForward | Yes | Yes | ||
cudnnSpatialTfSamplerForward | Yes | Yes | ||
cudnnOpsInferVersionCheck | Yes | Yes | ||
cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize | Yes | Yes | ||
cudnnGetBatchNormalizationBackwardExWorkspaceSize | Yes | Yes | ||
cudnnGetBatchNormalizationTrainingExReserveSpaceSize | Yes | Yes | ||
cudnnBatchNormalizationForwardTraining | Yes | Yes | ||
cudnnBatchNormalizationForwardTrainingEx | Yes | Yes | ||
cudnnBatchNormalizationBackward | Yes | Yes | ||
cudnnBatchNormalizationBackwardEx | Yes | Yes | ||
cudnnPoolingBackward | Yes | Yes | ||
cudnnActivationBackward | Yes | Yes | ||
cudnnDropoutBackward | Yes | Yes | ||
cudnnSoftmaxBackward | Yes | Yes | ||
cudnnLRNCrossChannelBackward | Yes | Yes | ||
cudnnSpatialTfSamplerBackward | Yes | Yes | ||
cudnnSpatialTfGridGeneratorBackward | Yes | Yes | ||
cudnnOpsTrainVersionCheck | Yes | Yes | ||
cudnnQueryRuntimeError | Yes | No | | Helper function; checks whether BN encountered numerical overflows |
cudnnSetTensor | Yes | No | | Helper function; sets a tensor to a constant value |
cudnnGetPooling2dDescriptor | Yes | No | | Helper functions; get the 2d/Nd pooling descriptor |
cudnnGetPoolingNdDescriptor | Yes | No | ||
cudnnSetActivationDescriptorSwishBeta | Yes | No | | Set/get of swish_beta in the activation function |
cudnnGetActivationDescriptorSwishBeta | Yes | No | ||
cudnnCreateAlgorithmDescriptor | Yes | No | deprecated in cuDNN 8.0 | AlgorithmDescriptor parameter operations |
cudnnSetAlgorithmDescriptor | Yes | No | ||
cudnnGetAlgorithmDescriptor | Yes | No | ||
cudnnCopyAlgorithmDescriptor | Yes | No | ||
cudnnDestroyAlgorithmDescriptor | Yes | No | ||
cudnnCreateAlgorithmPerformance | Yes | No | | AlgorithmPerformance parameter operations |
cudnnSetAlgorithmPerformance | Yes | No | ||
cudnnGetAlgorithmPerformance | Yes | No | ||
cudnnDestroyAlgorithmPerformance | Yes | No | ||
cudnnGetAlgorithmSpaceSize | Yes | No | deprecated in cuDNN 8.0 | Storage of algorithm metadata |
cudnnSaveAlgorithm | Yes | No | ||
cudnnRestoreAlgorithm | Yes | No | ||
cudnnSetCallback | Yes | No | | Callback-related |
cudnnGetCallback | Yes | No | ||
cudnnSetConvolutionReorderType | Yes | No | | Set/get of the convolution reorder type |
cudnnGetConvolutionReorderType | Yes | No | ||
cudnnIm2Col | Yes | No | | im2col; builds the matrix for the forward pass |
cudnnReorderFilterAndBias | Yes | No | | Reorders the filter and bias |
cudnnConvolutionBackwardBias | Yes | No | | Computes the convolution bias gradient |
cudnnCreateFusedOpsConstParamPack | Yes | No | | cudnnFusedOps-related computation; can be replaced with the backend API |
cudnnDestroyFusedOpsConstParamPack | Yes | No | ||
cudnnSetFusedOpsConstParamPackAttribute | Yes | No | ||
cudnnGetFusedOpsConstParamPackAttribute | Yes | No | ||
cudnnCreateFusedOpsVariantParamPack | Yes | No | ||
cudnnDestroyFusedOpsVariantParamPack | Yes | No | ||
cudnnSetFusedOpsVariantParamPackAttribute | Yes | No | ||
cudnnGetFusedOpsVariantParamPackAttribute | Yes | No | ||
cudnnCreateFusedOpsPlan | Yes | No | ||
cudnnDestroyFusedOpsPlan | Yes | No | ||
cudnnMakeFusedOpsPlan | Yes | No | ||
cudnnFusedOpsExecute | Yes | No | ||
cudnnCreatePersistentRNNPlan | Yes | No | deprecated in cuDNN 8.0 | Persistent RNN plan (new algorithm) operations |
cudnnDestroyPersistentRNNPlan | Yes | No | ||
cudnnSetPersistentRNNPlan | Yes | No | ||
cudnnBuildRNNDynamic | Yes | No | ||
cudnnSetRNNPaddingMode | Yes | No | deprecated in cuDNN 8.0 | RNN padding operations; acdnnSetRNNDescriptor_v8() can be used instead |
cudnnGetRNNPaddingMode | Yes | No | ||
cudnnRNNForwardInferenceEx | Yes | No | | acdnnRNNForward() can be used instead |
cudnnSetRNNAlgorithmDescriptor | Yes | No | | RNN algorithm search |
cudnnGetRNNForwardInferenceAlgorithmMaxCount | Yes | No | ||
cudnnFindRNNForwardInferenceAlgorithmEx | Yes | No | ||
cudnnRNNForwardTrainingEx | Yes | No | deprecated in cuDNN 8.0 | acdnnRNNForward() can be used instead |
cudnnRNNBackwardDataEx | Yes | No | | acdnnRNNBackwardData_v8() can be used instead |
cudnnRNNBackwardWeightsEx | Yes | No | | acdnnRNNBackwardWeights_v8() can be used instead |
cudnnGetRNNForwardTrainingAlgorithmMaxCount | Yes | No | | RNN algorithm search |
cudnnFindRNNForwardTrainingAlgorithmEx | Yes | No | ||
cudnnGetRNNBackwardDataAlgorithmMaxCount | Yes | No | ||
cudnnFindRNNBackwardDataAlgorithmEx | Yes | No | ||
cudnnGetRNNBackwardWeightsAlgorithmMaxCount | Yes | No | ||
cudnnFindRNNBackwardWeightsAlgorithmEx | Yes | No | ||
cudnnMultiHeadAttnBackwardData | Yes | No | | MultiHeadAttn backward pass |
cudnnMultiHeadAttnBackwardWeights | Yes | No | ||
cudnnDivisiveNormalizationForward | Yes | No | | Forward DivisiveNormalization layer computation |
cudnnDeriveNormTensorDescriptor | Yes | No | | Derives the normalization-layer tensor descriptor |
cudnnNormalizationForwardInference | Yes | No | | Forward Normalization layer computation |
cudnnDivisiveNormalizationBackward | Yes | No | | Backward DivisiveNormalization layer computation |
cudnnGetNormalizationForwardTrainingWorkspaceSize | Yes | No | | Normalization-layer helper APIs; get workspace sizes |
cudnnGetNormalizationBackwardWorkspaceSize | Yes | No | ||
cudnnGetNormalizationTrainingReserveSpaceSize | Yes | No | ||
cudnnNormalizationForwardTraining | Yes | No | | Forward Normalization layer computation |
cudnnNormalizationBackward | Yes | No | | Backward Normalization layer computation |
CUDA cuBLAS Support
Compared against cuBLAS 11.9.2, cuBLAS API support is shown in the table below:
Most of the APIs used in common NN scenarios are already supported and tuned.
Benchmarked against Ampere, every API is supportable in software; there are currently no PPU hardware limitations. Later software releases will fill the remaining gaps by priority.
Current API support rate: 89/290 = 30.7%. The unsupported APIs fall mainly into the following categories:
Complex data types: 131
Special matrix types (symmetric, packed, triangular, etc.): 36
Utility functions (set/get, memcpy, etc.): 18
Batched gemv: 12
Uncommon algorithms (matrix addition geam, matrix inversion matinv): 4
Since AI scenarios involve neither complex types nor special matrix types, the AI-scenario API support rate is 89/123 = 72.4%.
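cublasSgemm (cublasSgemm_v2), the workhorse for NN workloads, is in the supported set. A minimal sketch of calling it through the standard handle flow; since cublasSetMatrix is listed below as unsupported, plain cudaMemcpy is used for the transfers:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 2;                       // tiny 2x2 example
    float hA[] = {1, 2, 3, 4};             // column-major, as cuBLAS expects
    float hB[] = {5, 6, 7, 8};
    float hC[4] = {0};
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);                      // maps to cublasCreate_v2
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, all matrices n x n, column-major.
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("C[0,0] = %f\n", hC[0]);

    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```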
cuBLAS API support
API | cuBLAS 11.9.2 | PPU 1.4 |
cublasCreate_v2 | Yes | Yes |
cublasDestroy_v2 | Yes | Yes |
cublasGetProperty | Yes | Yes |
cublasSetStream_v2 | Yes | Yes |
cublasGetStream_v2 | Yes | Yes |
cublasGetMathMode | Yes | Yes |
cublasSetMathMode | Yes | Yes |
cublasGetPointerMode_v2 | Yes | Yes |
cublasSetPointerMode_v2 | Yes | Yes |
cublasSetWorkspace_v2 | Yes | Yes |
cublasGetStatusString | Yes | Yes |
cublasIamaxEx | Yes | Yes |
cublasIsamax_v2 | Yes | Yes |
cublasIdamax_v2 | Yes | Yes |
cublasIaminEx | Yes | Yes |
cublasIsamin_v2 | Yes | Yes |
cublasIdamin_v2 | Yes | Yes |
cublasAsumEx | Yes | Yes |
cublasSasum_v2 | Yes | Yes |
cublasDasum_v2 | Yes | Yes |
cublasAxpyEx | Yes | Yes |
cublasSaxpy_v2 | Yes | Yes |
cublasDaxpy_v2 | Yes | Yes |
cublasCopyEx | Yes | Yes |
cublasScopy_v2 | Yes | Yes |
cublasDcopy_v2 | Yes | Yes |
cublasDotEx | Yes | Yes |
cublasSdot_v2 | Yes | Yes |
cublasDdot_v2 | Yes | Yes |
cublasNrm2Ex | Yes | Yes |
cublasSnrm2_v2 | Yes | Yes |
cublasDnrm2_v2 | Yes | Yes |
cublasRotEx | Yes | Yes |
cublasSrot_v2 | Yes | Yes |
cublasDrot_v2 | Yes | Yes |
cublasRotgEx | Yes | Yes |
cublasSrotg_v2 | Yes | Yes |
cublasDrotg_v2 | Yes | Yes |
cublasRotmEx | Yes | Yes |
cublasSrotm_v2 | Yes | Yes |
cublasDrotm_v2 | Yes | Yes |
cublasRotmgEx | Yes | Yes |
cublasSrotmg_v2 | Yes | Yes |
cublasDrotmg_v2 | Yes | Yes |
cublasScalEx | Yes | Yes |
cublasSscal_v2 | Yes | Yes |
cublasDscal_v2 | Yes | Yes |
cublasSwapEx | Yes | Yes |
cublasSswap_v2 | Yes | Yes |
cublasDswap_v2 | Yes | Yes |
cublasSgemv_v2 | Yes | Yes |
cublasDgemv_v2 | Yes | Yes |
cublasSgemm_v2 | Yes | Yes |
cublasDgemm_v2 | Yes | Yes |
cublasHgemm | Yes | Yes |
cublasSgemmEx | Yes | Yes |
cublasGemmEx | Yes | Yes |
cublasHgemmBatched | Yes | Yes |
cublasSgemmBatched | Yes | Yes |
cublasGemmBatchedEx | Yes | Yes |
cublasGemmStridedBatchedEx | Yes | Yes |
cublasSgemmStridedBatched | Yes | Yes |
cublasDgemmBatched | Yes | Yes |
cublasDgemmStridedBatched | Yes | Yes |
cublasHgemmStridedBatched | Yes | Yes |
cublasSgetrfBatched | Yes | Yes |
cublasDgetrfBatched | Yes | Yes |
cublasSgetrsBatched | Yes | Yes |
cublasDgetrsBatched | Yes | Yes |
cublasSger_v2 | Yes | Yes |
cublasDger_v2 | Yes | Yes |
cublasSsyr_v2 | Yes | Yes |
cublasDsyr_v2 | Yes | Yes |
cublasSspr_v2 | Yes | Yes |
cublasDspr_v2 | Yes | Yes |
cublasSsyr2_v2 | Yes | Yes |
cublasDsyr2_v2 | Yes | Yes |
cublasSspr2_v2 | Yes | Yes |
cublasDspr2_v2 | Yes | Yes |
cublasStrsm_v2 | Yes | Yes |
cublasDtrsm_v2 | Yes | Yes |
cublasStrsmBatched | Yes | Yes |
cublasDtrsmBatched | Yes | Yes |
cublasSgetriBatched | Yes | Yes |
cublasDgetriBatched | Yes | Yes |
cublasSgeqrfBatched | Yes | Yes |
cublasDgeqrfBatched | Yes | Yes |
cublasSgelsBatched | Yes | Yes |
cublasDgelsBatched | Yes | Yes |
cublasGetVersion_v2 | Yes | No |
cublasGetCudartVersion | Yes | No |
cublasGetAtomicsMode | Yes | No |
cublasSetAtomicsMode | Yes | No |
cublasGetSmCountTarget | Yes | No |
cublasSetSmCountTarget | Yes | No |
cublasGetStatusName | Yes | No |
cublasLoggerConfigure | Yes | No |
cublasSetLoggerCallback | Yes | No |
cublasGetLoggerCallback | Yes | No |
cublasSetVector | Yes | No |
cublasGetVector | Yes | No |
cublasSetMatrix | Yes | No |
cublasGetMatrix | Yes | No |
cublasSetVectorAsync | Yes | No |
cublasGetVectorAsync | Yes | No |
cublasSetMatrixAsync | Yes | No |
cublasGetMatrixAsync | Yes | No |
cublasSgemvBatched | Yes | No |
cublasDgemvBatched | Yes | No |
cublasCgemvBatched | Yes | No |
cublasZgemvBatched | Yes | No |
cublasHSHgemvBatched | Yes | No |
cublasHSSgemvBatched | Yes | No |
cublasTSTgemvBatched | Yes | No |
cublasTSSgemvBatched | Yes | No |
cublasSgemvStridedBatched | Yes | No |
cublasDgemvStridedBatched | Yes | No |
cublasCgemvStridedBatched | Yes | No |
cublasZgemvStridedBatched | Yes | No |
cublasHSHgemvStridedBatched | Yes | No |
cublasHSSgemvStridedBatched | Yes | No |
cublasTSTgemvStridedBatched | Yes | No |
cublasTSSgemvStridedBatched | Yes | No |
cublasScnrm2_v2 | Yes | No |
cublasDznrm2_v2 | Yes | No |
cublasDotcEx | Yes | No |
cublasCdotu_v2 | Yes | No |
cublasCdotc_v2 | Yes | No |
cublasZdotu_v2 | Yes | No |
cublasZdotc_v2 | Yes | No |
cublasCscal_v2 | Yes | No |
cublasCsscal_v2 | Yes | No |
cublasZscal_v2 | Yes | No |
cublasZdscal_v2 | Yes | No |
cublasCaxpy_v2 | Yes | No |
cublasZaxpy_v2 | Yes | No |
cublasCcopy_v2 | Yes | No |
cublasZcopy_v2 | Yes | No |
cublasCswap_v2 | Yes | No |
cublasZswap_v2 | Yes | No |
cublasIcamax_v2 | Yes | No |
cublasIzamax_v2 | Yes | No |
cublasIcamin_v2 | Yes | No |
cublasIzamin_v2 | Yes | No |
cublasScasum_v2 | Yes | No |
cublasDzasum_v2 | Yes | No |
cublasCrot_v2 | Yes | No |
cublasCsrot_v2 | Yes | No |
cublasZrot_v2 | Yes | No |
cublasZdrot_v2 | Yes | No |
cublasCrotg_v2 | Yes | No |
cublasZrotg_v2 | Yes | No |
cublasCgemv_v2 | Yes | No |
cublasZgemv_v2 | Yes | No |
cublasCgemm_v2 | Yes | No |
cublasCgemm3m | Yes | No |
cublasCgemm3mEx | Yes | No |
cublasZgemm_v2 | Yes | No |
cublasZgemm3m | Yes | No |
cublasCgemmEx | Yes | No |
cublasCgemmBatched | Yes | No |
cublasCgemm3mBatched | Yes | No |
cublasZgemmBatched | Yes | No |
cublasCgemmStridedBatched | Yes | No |
cublasCgemm3mStridedBatched | Yes | No |
cublasZgemmStridedBatched | Yes | No |
cublasCgetrfBatched | Yes | No |
cublasZgetrfBatched | Yes | No |
cublasCgetrsBatched | Yes | No |
cublasZgetrsBatched | Yes | No |
cublasSgbmv_v2 | Yes | No |
cublasDgbmv_v2 | Yes | No |
cublasCgbmv_v2 | Yes | No |
cublasZgbmv_v2 | Yes | No |
cublasStrmv_v2 | Yes | No |
cublasDtrmv_v2 | Yes | No |
cublasCtrmv_v2 | Yes | No |
cublasZtrmv_v2 | Yes | No |
cublasStbmv_v2 | Yes | No |
cublasDtbmv_v2 | Yes | No |
cublasCtbmv_v2 | Yes | No |
cublasZtbmv_v2 | Yes | No |
cublasStpmv_v2 | Yes | No |
cublasDtpmv_v2 | Yes | No |
cublasCtpmv_v2 | Yes | No |
cublasZtpmv_v2 | Yes | No |
cublasStrsv_v2 | Yes | No |
cublasDtrsv_v2 | Yes | No |
cublasCtrsv_v2 | Yes | No |
cublasZtrsv_v2 | Yes | No |
cublasStpsv_v2 | Yes | No |
cublasDtpsv_v2 | Yes | No |
cublasCtpsv_v2 | Yes | No |
cublasZtpsv_v2 | Yes | No |
cublasStbsv_v2 | Yes | No |
cublasDtbsv_v2 | Yes | No |
cublasCtbsv_v2 | Yes | No |
cublasZtbsv_v2 | Yes | No |
cublasSsymv_v2 | Yes | No |
cublasDsymv_v2 | Yes | No |
cublasCsymv_v2 | Yes | No |
cublasZsymv_v2 | Yes | No |
cublasChemv_v2 | Yes | No |
cublasZhemv_v2 | Yes | No |
cublasSsbmv_v2 | Yes | No |
cublasDsbmv_v2 | Yes | No |
cublasChbmv_v2 | Yes | No |
cublasZhbmv_v2 | Yes | No |
cublasSspmv_v2 | Yes | No |
cublasDspmv_v2 | Yes | No |
cublasChpmv_v2 | Yes | No |
cublasZhpmv_v2 | Yes | No |
cublasCgeru_v2 | Yes | No |
cublasCgerc_v2 | Yes | No |
cublasZgeru_v2 | Yes | No |
cublasZgerc_v2 | Yes | No |
cublasCsyr_v2 | Yes | No |
cublasZsyr_v2 | Yes | No |
cublasCher_v2 | Yes | No |
cublasZher_v2 | Yes | No |
cublasChpr_v2 | Yes | No |
cublasZhpr_v2 | Yes | No |
cublasCsyr2_v2 | Yes | No |
cublasZsyr2_v2 | Yes | No |
cublasCher2_v2 | Yes | No |
cublasZher2_v2 | Yes | No |
cublasChpr2_v2 | Yes | No |
cublasZhpr2_v2 | Yes | No |
cublasSsyrk_v2 | Yes | No |
cublasDsyrk_v2 | Yes | No |
cublasCsyrk_v2 | Yes | No |
cublasZsyrk_v2 | Yes | No |
cublasCsyrkEx | Yes | No |
cublasCsyrk3mEx | Yes | No |
cublasCherk_v2 | Yes | No |
cublasZherk_v2 | Yes | No |
cublasCherkEx | Yes | No |
cublasCherk3mEx | Yes | No |
cublasSsyr2k_v2 | Yes | No |
cublasDsyr2k_v2 | Yes | No |
cublasCsyr2k_v2 | Yes | No |
cublasZsyr2k_v2 | Yes | No |
cublasCher2k_v2 | Yes | No |
cublasZher2k_v2 | Yes | No |
cublasSsyrkx | Yes | No |
cublasDsyrkx | Yes | No |
cublasCsyrkx | Yes | No |
cublasZsyrkx | Yes | No |
cublasCherkx | Yes | No |
cublasZherkx | Yes | No |
cublasSsymm_v2 | Yes | No |
cublasDsymm_v2 | Yes | No |
cublasCsymm_v2 | Yes | No |
cublasZsymm_v2 | Yes | No |
cublasChemm_v2 | Yes | No |
cublasZhemm_v2 | Yes | No |
cublasCtrsm_v2 | Yes | No |
cublasZtrsm_v2 | Yes | No |
cublasStrmm_v2 | Yes | No |
cublasDtrmm_v2 | Yes | No |
cublasCtrmm_v2 | Yes | No |
cublasZtrmm_v2 | Yes | No |
cublasSgeam | Yes | No |
cublasDgeam | Yes | No |
cublasCgeam | Yes | No |
cublasZgeam | Yes | No |
cublasCgetriBatched | Yes | No |
cublasZgetriBatched | Yes | No |
cublasCtrsmBatched | Yes | No |
cublasZtrsmBatched | Yes | No |
cublasSmatinvBatched | Yes | No |
cublasDmatinvBatched | Yes | No |
cublasCmatinvBatched | Yes | No |
cublasZmatinvBatched | Yes | No |
cublasCgeqrfBatched | Yes | No |
cublasZgeqrfBatched | Yes | No |
cublasCgelsBatched | Yes | No |
cublasZgelsBatched | Yes | No |
cublasSdgmm | Yes | No |
cublasDdgmm | Yes | No |
cublasCdgmm | Yes | No |
cublasZdgmm | Yes | No |
cublasStpttr | Yes | No |
cublasDtpttr | Yes | No |
cublasCtpttr | Yes | No |
cublasZtpttr | Yes | No |
cublasStrttp | Yes | No |
cublasDtrttp | Yes | No |
cublasCtrttp | Yes | No |
cublasZtrttp | Yes | No |
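Before porting a cuBLAS-based project to PPU, it may help to screen the API names the codebase actually calls against the "No" column of the table above. A minimal host-side sketch, assuming you maintain the unsupported set yourself (only a small excerpt of the table is shown here; the helper name is illustrative):

```python
# Excerpt of cuBLAS APIs marked "No" for PPU 1.4 in the table above;
# extend this set with the full table for real use.
PPU_UNSUPPORTED = {
    "cublasGetVersion_v2",
    "cublasSetVector",
    "cublasSgemvBatched",
    "cublasCgemm_v2",
    "cublasSgeam",
}

def check_port_readiness(api_calls):
    """Return the cuBLAS APIs in api_calls that PPU 1.4 does not support."""
    return sorted(set(api_calls) & PPU_UNSUPPORTED)

# All three calls below are in the table; only cublasSgeam is unsupported.
blockers = check_port_readiness(
    ["cublasCreate_v2", "cublasSgemm_v2", "cublasSgeam"]
)
print(blockers)  # ['cublasSgeam']
```

A quick pass like this over `grep -o 'cublas[A-Za-z0-9_]*'` output surfaces porting blockers before any code changes are made.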