PPU Programming Guide (v1.4)
To develop applications on the PPU platform, you can write programs in the CUDA language, compile them with the PPU SDK, and run them on the PPU.
Preface
On the Alibaba Cloud PAI platform, 真武810E compute power can be used directly. The 真武810E software stack is fully compatible with the CUDA software stack: you can write programs in the CUDA language, compile them on PAI-DSW, and run them on 真武810E in modules such as DLC and EAS. This document describes the programming model on 真武810E, best practices for the CUDA Samples, cuBLAS, and CUTLASS, and the compatibility notes for NVCC and the CUDA Samples.
Note that binaries compiled with the GPU CUDA SDK cannot run on the PPU platform. Do not install packages that contain precompiled GPU binaries into a PPU runtime environment, to avoid unexpected runtime errors.
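One way to check a package before installing it is to look for embedded GPU machine code. The sketch below is an assumption-laden example rather than an official procedure: it assumes a cuobjdump-compatible inspection tool is available in the environment, and it uses a hypothetical package name somepkg. Any shared library for which cuobjdump can list ELF images carries precompiled GPU kernels:

# Locate the installed package (hypothetical name "somepkg"; adjust to your package)
PKG_DIR=$(python -c "import somepkg, os; print(os.path.dirname(somepkg.__file__))")
# List embedded GPU ELF images; any hit means the library ships precompiled GPU code
find "$PKG_DIR" -name "*.so" -exec sh -c \
  'cuobjdump --list-elf "$1" 2>/dev/null | grep -q . && echo "GPU binary: $1"' _ {} \;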

PPU Programming Model
On PAI Serverless on 真武810E, you can write programs with the same programming model as CUDA; see the CUDA C++ Programming Model (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#programming-model). For complete code, see the CUDA Samples (https://github.com/NVIDIA/cuda-samples).
Kernel Functions
On 真武810E, kernel functions are written in the same form as on NVIDIA CUDA:
A kernel function is declared with the __global__ specifier, and the number of CUDA threads that execute it is given using the <<<...>>> execution-configuration C++ syntax. Each thread that executes the kernel is assigned a unique thread ID, which is accessible within the kernel through built-in variables.
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Thread Hierarchy
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

As shown in the figure below, 真武810E uses the same three-level Thread-Block-Grid hierarchy as NVIDIA GPUs. The number of threads per block is limited, because all threads of a block are expected to reside on the same processor core and must share that core's limited memory resources.
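Blocks themselves are organized into a one-, two-, or three-dimensional grid. The following variant of MatAdd, taken from the CUDA C++ Programming Guide referenced above (and, per the compatibility statements in this guide, expected to run unchanged on 真武810E), uses blockIdx and blockDim to index across multiple blocks:

// Kernel definition: one thread per matrix element, spread across a 2D grid
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation: a 16 x 16 block is a common choice; the grid is sized
    // so that the blocks cover the whole N x N matrix (N is assumed here to be
    // a multiple of 16)
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}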

Memory Hierarchy

As shown in the figure above, a 真武810E thread may access data from multiple memory spaces during execution. Each thread has its own local memory. Each block has shared memory that is visible to all threads of the block and has the same lifetime as the block. All threads can access global memory. A sketch illustrating these spaces follows.
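The sketch below is standard CUDA shared-memory code, not anything PPU-specific, and assumes a launch with 256 threads per block (a power of two). Each block stages a tile of the input from global memory into shared memory, reduces it cooperatively, and writes one partial sum per block back to global memory; automatic variables such as gid live in per-thread local storage (typically registers):

__global__ void BlockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];                      // shared memory: one copy per block
    int gid = blockIdx.x * blockDim.x + threadIdx.x; // per-thread local variable
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;  // global -> shared
    __syncthreads();                                 // the whole block sees the tile

    // Cooperative tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];             // shared -> global, once per block
}

// Example launch: BlockSum<<<(n + 255) / 256, 256>>>(d_in, d_blockSums, n);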
Launching the Runtime Environment
In PAI, 真武810E programs can be compiled directly in the PAI-DSW runtime environment. When creating the DSW instance, select 真武810E resources; if the instance is used only for build-and-test, a small number of cards, CPUs, and memory is sufficient.

Use the official PAI image (see the PAI-PPU-V1.4.X official image release notes). The image integrates the latest PPU SDK by default, together with the PPU builds of PyTorch, megatron-patch, and other training frameworks and their PPU-version dependencies, so that upper-layer applications can be built quickly. The PPU SDK is the software development kit for the 真武810E chip, used to create high-performance PPU-accelerated applications. With this toolkit you can develop, optimize, and deploy applications on the PPU software stack. It contains the compiler toolchain, header files, several acceleration libraries, various runtime configurations, and a runtime library for building and deploying applications.

Configure datasets and SSH as needed, then start the DSW instance. Once it is running, the toolchain can be sanity-checked as sketched below:
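This is a quick verification sketch, assuming the PPU SDK puts an nvcc-compatible compiler on PATH and that the CUDA Samples (cloned as described in the next chapter) build as they do on GPU; check the image release notes for the authoritative paths:

# Verify the CUDA-compatible compiler shipped in the PPU SDK
nvcc --version
# Build and run deviceQuery from the CUDA Samples to confirm the card is visible
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery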

CUDA Sample Best Practices
Using the CUDA Samples as an example, this chapter walks through the best practice of starting a 真武810E runtime environment, downloading the code, building it, and running it.
vectorAdd: Vector Addition
Sample Code
/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void vectorAdd(const float *A, const float *B, float *C,
                          int numElements) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  if (i < numElements) {
    C[i] = A[i] + B[i] + 0.0f;
  }
}

/**
 * Host main routine
 */
int main(void) {
  // Error code to check return values for CUDA calls
  cudaError_t err = cudaSuccess;

  // Print the vector length to be used, and compute its size
  int numElements = 50000;
  size_t size = numElements * sizeof(float);
  printf("[Vector addition of %d elements]\n", numElements);

  // Allocate the host input vector A
  float *h_A = (float *)malloc(size);

  // Allocate the host input vector B
  float *h_B = (float *)malloc(size);

  // Allocate the host output vector C
  float *h_C = (float *)malloc(size);

  // Verify that allocations succeeded
  if (h_A == NULL || h_B == NULL || h_C == NULL) {
    fprintf(stderr, "Failed to allocate host vectors!\n");
    exit(EXIT_FAILURE);
  }

  // Initialize the host input vectors
  for (int i = 0; i < numElements; ++i) {
    h_A[i] = rand() / (float)RAND_MAX;
    h_B[i] = rand() / (float)RAND_MAX;
  }

  // Allocate the device input vector A
  float *d_A = NULL;
  err = cudaMalloc((void **)&d_A, size);
  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Allocate the device input vector B
  float *d_B = NULL;
  err = cudaMalloc((void **)&d_B, size);
  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Allocate the device output vector C
  float *d_C = NULL;
  err = cudaMalloc((void **)&d_C, size);
  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Copy the host input vectors A and B in host memory to the device input
  // vectors in device memory
  printf("Copy input data from the host memory to the CUDA device\n");
  err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  if (err != cudaSuccess) {
    fprintf(stderr,
            "Failed to copy vector A from host to device (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
  if (err != cudaSuccess) {
    fprintf(stderr,
            "Failed to copy vector B from host to device (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Launch the Vector Add CUDA Kernel
  int threadsPerBlock = 256;
  int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
  printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
         threadsPerBlock);
  vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
  err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Copy the device result vector in device memory to the host result vector
  // in host memory.
  printf("Copy output data from the CUDA device to the host memory\n");
  err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    fprintf(stderr,
            "Failed to copy vector C from device to host (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Verify that the result vector is correct
  for (int i = 0; i < numElements; ++i) {
    if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
      fprintf(stderr, "Result verification failed at element %d!\n", i);
      exit(EXIT_FAILURE);
    }
  }

  printf("Test PASSED\n");

  // Free device global memory
  err = cudaFree(d_A);
  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to free device vector A (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  err = cudaFree(d_B);
  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to free device vector B (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  err = cudaFree(d_C);
  if (err != cudaSuccess) {
    fprintf(stderr, "Failed to free device vector C (error code %s)!\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // Free host memory
  free(h_A);
  free(h_B);
  free(h_C);

  printf("Done\n");
  return 0;
}

The code on 真武810E is identical to the CUDA code; no modification is needed.
Downloading the Code
For the CUDA Samples, the code at https://github.com/NVIDIA/cuda-samples can be used directly. Run the following commands in a PAI-DSW terminal.
# Download the CUDA Samples into the DSW runtime environment
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
# List the available releases and switch to a specific one, e.g. v12.4
git tag
git checkout v12.4
Build and Test
Build directly with the Makefile shipped in the CUDA Samples by running the following commands in a PAI-DSW terminal.
# Enter the sample directory, e.g.:
cd Samples/0_Introduction/vectorAdd
# Build and run the test
make
./vectorAdd
# Output
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
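A single sample can also be compiled by hand instead of via the Makefile. This is a sketch that assumes the nvcc-compatible driver from the PPU SDK and the cuda-samples v12.x directory layout, where the helper headers live in Common/ at the repository root:

# From Samples/0_Introduction/vectorAdd; -I points at the samples' helper headers
nvcc -I../../../Common vectorAdd.cu -o vectorAdd
./vectorAdd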
wmma: Matrix Multiplication
Sample Code
// CUDA sample demonstrating a integer GEMM computation using the Warp Matrix
// Multiply and Accumulate API.
// In this program, the compute_gemm kernel computes the result of a matrix
// multiplication and addition: D = alpha * A * B + beta * C. The dimensions of
// both C and D matrices are M_GLOBAL x N_GLOBAL. The A matrix is M_GLOBAL x
// K_GLOBAL (row-major), the B matrix is K_GLOBAL x N_GLOBAL (column-major). In
// that kernel, each CTA computes one 128 x 128 tile of the resulting matrix per
// iteration. When the tile is computed, the CTA stores it to the global memory
// and begins a new iteration, selecting a new 128 x 128 tile to compute.
// Each CTA consists of eight warps. For the 128 x 128 tile, each warp computes
// eight 16 x 16 subtiles, organized in a 2 x 4 two-dimensional array. Warps
// compute the 16 x 16 subtiles using nvcuda::wmma::mma_sync operations by
// moving through the K_GLOBAL dimension of the A and B matrices and
// accumulating the intermediate result in the local thread state.
// There are a number of simple optimizations used in the algorithm:
// - The CTA copies the 128 x 128 tile of the C matrix from the global memory
//   to shared memory. After that is done, each warp loads the C matrix
//   fragments from shared memory, thus avoiding a random global memory access.
// - On each internal iteration, the CTA copies a portion of the A and B
//   matrices from global memory to shared memory. After that, all warps in the
//   CTA reuse the A and B data from shared memory, thus reducing the number of
//   data copies from global memory.
// - The portions of the A and B matrices are stored in shared memory with an
//   additional padding (skew) to reduce the number of shared memory access
//   bank conflicts. (See a detailed explanation near the SKEW_HALF macro
//   definition.)
// - When the CTA finishes computing the tiles of the resulting matrix, each
//   warp stores its subtiles to shared memory. The CTA then copies the shared
//   memory contents to global memory, again avoiding redundant random global
//   memory accesses.
// - Note that the CTA tile size is chosen to maximize the GPU register
//   utilization, but carefully enough to avoid local memory use.

#include <assert.h>
#include <cuda.h>
#include <mma.h>
#include <stdio.h>

// helper functions and utilities to work with CUDA
#include <helper_cuda.h>
#include <helper_functions.h>

// Externally configurable parameters.

#ifndef CPU_DEBUG
// Set this to 1 to verify the correctness of the GPU-computed matrix.
#define CPU_DEBUG 0
#endif

#ifndef SHARED_MEMORY_LIMIT_64K
// Set this to 0 to use more than 64 Kb of shared memory to cache data, to
// improve the performance of the computations on GPU.
// Note that you need a GPU that can have more than 64 Kb of shared memory
// per multiprocessor.
#define SHARED_MEMORY_LIMIT_64K 1
#endif

// GPU configuration.

#define WARP_SIZE 32

// MMA matrix tile dimensions.
#define M 16
#define N 16
#define K 16

#define WMMA_M 16
#define WMMA_N 16
#define WMMA_K 16

// GEMM configuration.
#define M_TILES 256
#define N_TILES 256
#define K_TILES 256

#define M_GLOBAL (M * M_TILES)
#define N_GLOBAL (N * N_TILES)
#define K_GLOBAL (K * K_TILES)

#define C_LAYOUT wmma::mem_row_major

// Implementation constants.
#define WARPS_PER_BLOCK 8
#define THREADS_PER_BLOCK (WARP_SIZE * WARPS_PER_BLOCK)

#if SHARED_MEMORY_LIMIT_64K
// With only 64 Kb shared memory available, we can fit two 8-tile chunks of
// the A and B matrix data, that are 16 * 16 * 8 * 8 * 2 = 32 Kb each
// (i.e. two 8x8 arrays of tiles of 16x16 uint8_t-typed elements per CTA).
// But we cannot account the 8 Kb total skew overhead, without which the
// performance would be severely impacted. So we choose to reduce the chunk
// size in half, i.e. the amount of A and B matrix data we cache in shared
// memory. Accordingly, this doubles the number of outer iterations across the
// global K dimension, which only slightly impacts the performance.
#define CHUNK_K 8
#else
#define CHUNK_K 16
#endif

#define CHUNK_LINE_BYTES (CHUNK_K * K * sizeof(uint8_t))
#define WARP_COPY_BYTES (WARP_SIZE * sizeof(int4))
#define CHUNK_COPY_LINES_PER_WARP (WARP_COPY_BYTES / CHUNK_LINE_BYTES)
#define CHUNK_COPY_LINE_LANES (WARP_SIZE / CHUNK_COPY_LINES_PER_WARP)

#define BLOCK_ROW_WARPS 2
#define BLOCK_COL_WARPS 4

#define WARP_ROW_TILES 4
#define WARP_COL_TILES 2

#define BLOCK_ROW_TILES (WARP_ROW_TILES * BLOCK_ROW_WARPS)
#define BLOCK_COL_TILES (WARP_COL_TILES * BLOCK_COL_WARPS)

#define GLOBAL_MEM_STRIDE N_GLOBAL

#define SHMEM_STRIDE (N * BLOCK_ROW_TILES)
#define SHMEM_OFFSET (N * WARP_ROW_TILES)

// The macro below is used to shift rows of the A matrix and columns of the B
// matrix in shared memory to minimize possible bank conflicts. Before
// performing the nvcuda::wmma::mma_sync operation, the warp must load the
// matrix data using the nvcuda::wmma::load_matrix_sync operation. Although the
// memory access pattern is not specified for that function, each lane in the
// warp can read one or multiple matrix elements from different matrix rows or
// columns. For shared memory, such access can result in bank conflicts if
// different rows / columns of the matrix map to the same bank. By shifting each
// row and column by a few bytes, we make sure that they map to different banks,
// thus reducing the number of possible bank conflicts. The number of 32
// one-byte "uint8_t" elements is chosen as the minimum possible shift because
// we must keep each row and column 256-bit aligned, as required by
// nvcuda::wmma::load_matrix_sync.
#define SKEW_UINT8 32

#define checkKernelErrors(expr)                             \
  do {                                                      \
    expr;                                                   \
                                                            \
    cudaError_t __err = cudaGetLastError();                 \
    if (__err != cudaSuccess) {                             \
      printf("Line %d: '%s' failed: %s\n", __LINE__, #expr, \
             cudaGetErrorString(__err));                    \
      abort();                                              \
    }                                                       \
  } while (0)

using namespace nvcuda;

__host__ void init_host_matrices(uint8_t *a, uint8_t *b, int *c) {
  for (int i = 0; i < M_GLOBAL; i++) {
    for (int j = 0; j < K_GLOBAL; j++) {
      a[i * K_GLOBAL + j] = (uint8_t)(rand() % 3);
    }
  }

  for (int i = 0; i < N_GLOBAL; i++) {
    for (int j = 0; j < K_GLOBAL; j++) {
      b[i * K_GLOBAL + j] = (uint8_t)(rand() % 3);
    }
  }

  for (int t = 0; t < M_GLOBAL * N_GLOBAL; t++) {
    c[t] = (rand() % 3);
  }
}

__global__ void compute_gemm_imma(const uint8_t *A, const uint8_t *B,
                                  const int *C, int *D, int alpha, int beta) {
  extern __shared__ uint8_t shmem[][CHUNK_K * K + SKEW_UINT8];

  // Warp and lane identification.
  const unsigned int warpId = threadIdx.x / WARP_SIZE;
  const unsigned int laneId = threadIdx.x % WARP_SIZE;

  // Offset in shared memory from which the B matrix is stored.
  const size_t shmem_idx_b_off = BLOCK_COL_TILES * M;

  // This pointer is used to access the C and D matrix tiles this warp computes.
  int *shmem_warp_tile_ptr = (int *)&shmem[0][0] +
                             (warpId / 2) * SHMEM_STRIDE * K * 2 +
                             (warpId % 2) * SHMEM_OFFSET;

  // This pointer is used to stream the C and D matrices block-wide tile to and
  // from shared memory.
  int *shmem_warp_stream_ptr = (int *)&shmem[0][0] + warpId * SHMEM_STRIDE * K;

  // Adjust the beta scaler, as it'll be multiplied by alpha at the end of
  // each tile computation. Technically this is not generally correct (may
  // result in a loss of precision). Zero still needs to be specially handled
  // though.
  beta /= alpha;

  // Each CTA slides along the 128 x 128 tiles from the top left corner of the
  // matrix to the right and down, and selects the next tile to compute. Once
  // there's no such tile, all warps in this CTA exit.
  for (unsigned int block_pos = blockIdx.x;; block_pos += gridDim.x) {
    const unsigned int block_tile_i =
        ((block_pos * BLOCK_ROW_TILES) / N_TILES) * (BLOCK_COL_TILES);
    const unsigned int block_tile_j = (block_pos * BLOCK_COL_TILES) % N_TILES;

    // Stop when there are no more D matrix tiles to compute in this CTA.
    if (block_tile_i >= M_TILES) {
      break;
    }

    // This warp's pointer to the C matrix data to copy memory from to shared
    // memory.
    const size_t gmem_idx =
        (block_tile_i + warpId) * M * GLOBAL_MEM_STRIDE + block_tile_j * N;
    const int *src_gmem_warp_stream_ptr = &C[gmem_idx];

    // Stream multiple C tiles to shared memory.
#pragma unroll
    for (int i = 0; i < K; i++) {
      typedef int4 copy_t;

      *((copy_t *)(shmem_warp_stream_ptr + SHMEM_STRIDE * i) + laneId) =
          *((copy_t *)(src_gmem_warp_stream_ptr + GLOBAL_MEM_STRIDE * i) +
            laneId);
    }

    __syncthreads();

    // These fragments will accumulate the result of A and B matrix fragment
    // multiplications along the K_GLOBAL dimension.
    wmma::fragment<wmma::accumulator, M, N, K, int> c[WARP_COL_TILES]
                                                     [WARP_ROW_TILES];

    // Load the C matrix tiles into fragments from shared memory.
#pragma unroll
    for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
      for (int j = 0; j < WARP_ROW_TILES; j++) {
        const int *tile_ptr =
            shmem_warp_tile_ptr + i * SHMEM_STRIDE * K + j * N;

        wmma::load_matrix_sync(c[i][j], tile_ptr, SHMEM_STRIDE, C_LAYOUT);
      }
    }

    __syncthreads();

    // Scale the C matrix.
#pragma unroll
    for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
      for (int j = 0; j < WARP_ROW_TILES; j++) {
#pragma unroll
        for (int t = 0; t < c[i][j].num_elements; t++) {
          c[i][j].x[t] *= beta;
        }
      }
    }

    // Select what warp copies what matrix to shared memory.
    // Warps 0-3 copy the A matrix, warps 4-7 copy the B matrix.
    const uint8_t *warp_ptr = (warpId < 4)
                                  ? (&A[block_tile_i * M * K_GLOBAL] +
                                     M * K_GLOBAL * (warpId % 4) * 2)
                                  : (&B[block_tile_j * N * K_GLOBAL] +
                                     N * K_GLOBAL * (warpId % 4) * 2);

    // Go through the global K dimension by a fixed step at a time.
#pragma unroll
    for (int tile_k = 0; tile_k < K_TILES; tile_k += CHUNK_K) {
      // Copy slices of the A and B matrices to shared memory.
      // The first half of the warps in the CTA copy the A matrix, the rest
      // copy the B matrix.
      size_t shmem_idx =
          warpId < (WARPS_PER_BLOCK / 2)
              ? (M * (warpId % (WARPS_PER_BLOCK / 2)) * 2)
              : (N * (warpId % (WARPS_PER_BLOCK / 2)) * 2 + shmem_idx_b_off);

      // First half of the warp copies the first row / column of the matrix,
      // the second half of the warp copies the next.
      int4 *lane_ptr = (int4 *)(warp_ptr + tile_k * K +
                                (laneId / CHUNK_COPY_LINE_LANES) * K_GLOBAL) +
                       (laneId % CHUNK_COPY_LINE_LANES);

      // Shift the second half of the warp to the next row / column in the
      // shared memory.
      shmem_idx += laneId / CHUNK_COPY_LINE_LANES;

#pragma unroll
      for (int i = 0; i < ((WARP_SIZE / 2) / CHUNK_COPY_LINES_PER_WARP) * 2;
           i++) {
        // Copy 16 bytes at once in each lane.
        *((int4 *)&shmem[shmem_idx][0] + (laneId % CHUNK_COPY_LINE_LANES)) =
            *lane_ptr;

        // Advance the global memory pointer and the shared memory index.
        lane_ptr = (int4 *)((uint8_t *)lane_ptr +
                            K_GLOBAL * CHUNK_COPY_LINES_PER_WARP);
        shmem_idx += CHUNK_COPY_LINES_PER_WARP;
      }

      __syncthreads();

      // Compute a grid of C matrix tiles in each warp.
#pragma unroll
      for (int k_step = 0; k_step < CHUNK_K; k_step++) {
        wmma::fragment<wmma::matrix_a, M, N, K, uint8_t, wmma::row_major>
            a[WARP_COL_TILES];
        wmma::fragment<wmma::matrix_b, M, N, K, uint8_t, wmma::col_major>
            b[WARP_ROW_TILES];

#pragma unroll
        for (int i = 0; i < WARP_COL_TILES; i++) {
          size_t shmem_idx_a = (warpId / 2) * M * 2 + (i * M);
          const uint8_t *tile_ptr = &shmem[shmem_idx_a][k_step * K];

          wmma::load_matrix_sync(a[i], tile_ptr, K * CHUNK_K + SKEW_UINT8);

#pragma unroll
          for (int j = 0; j < WARP_ROW_TILES; j++) {
            if (i == 0) {
              // Load the B matrix fragment once, because it is going to be
              // reused against the other A matrix fragments.
              size_t shmem_idx_b = shmem_idx_b_off +
                                   (WARP_ROW_TILES * N) * (warpId % 2) +
                                   (j * N);
              const uint8_t *tile_ptr = &shmem[shmem_idx_b][k_step * K];

              wmma::load_matrix_sync(b[j], tile_ptr, K * CHUNK_K + SKEW_UINT8);
            }

            wmma::mma_sync(c[i][j], a[i], b[j], c[i][j]);
          }
        }
      }

      __syncthreads();
    }

    // Store the D fragments to shared memory.
#pragma unroll
    for (int i = 0; i < WARP_COL_TILES; i++) {
#pragma unroll
      for (int j = 0; j < WARP_ROW_TILES; j++) {
#pragma unroll
        // Uniform, point-wise transformations of ALL fragment elements by ALL
        // threads in the warp are well-defined even though element indices
        // within fragment storage are not defined.
        for (int t = 0; t < c[i][j].num_elements; t++) c[i][j].x[t] *= alpha;

        int *tile_ptr = shmem_warp_tile_ptr + i * SHMEM_STRIDE * K + j * N;

        wmma::store_matrix_sync(tile_ptr, c[i][j], SHMEM_STRIDE, C_LAYOUT);
      }
    }

    __syncthreads();

    // Now that shared memory contains all the D tiles, stream them to global
    // memory.
    int *dst_gmem_warp_stream_ptr = &D[gmem_idx];

#pragma unroll
    for (int i = 0; i < K; i++) {
      *((int4 *)(dst_gmem_warp_stream_ptr + GLOBAL_MEM_STRIDE * i) + laneId) =
          *((int4 *)(shmem_warp_stream_ptr + SHMEM_STRIDE * i) + laneId);
    }

    __syncthreads();
  }
}

// Performs an MxNxK GEMM (C=alpha*A*B + beta*C) assuming:
//  1) Matrices are packed in memory.
//  2) M, N and K are multiples of 16.
//  3) Neither A nor B are transposed.
// Note: This is a less performant version of the compute_gemm_imma kernel. It
// is designed for demonstration purposes only to show the CUDA WMMA API use
// without relying on availability of the shared memory.
__global__ void simple_wmma_gemm_imma(const uint8_t *a, const uint8_t *b,
                                      const int *c, int *d, int m_ld, int n_ld,
                                      int k_ld, int alpha, int beta) {
  // Leading dimensions. Packed with no transpositions.
  int lda = m_ld;
  int ldb = k_ld;
  int ldc = n_ld;

  // Tile using a 2D grid
  int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
  int warpN = (blockIdx.y * blockDim.y + threadIdx.y);

  // Declare the fragments
  wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, uint8_t,
                 wmma::row_major>
      a_frag;
  wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, uint8_t,
                 wmma::col_major>
      b_frag;
  wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, int> acc_frag;
  wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, int> c_frag;

  wmma::fill_fragment(acc_frag, 0.0f);

  // Loop over k
  for (int i = 0; i < k_ld; i += WMMA_K) {
    int aCol = i;
    int aRow = warpM * WMMA_M;

    int bCol = i;
    int bRow = warpN * WMMA_N;

    // Bounds checking
    if (aRow < m_ld && aCol < k_ld && bRow < k_ld && bCol < n_ld) {
      // Load the inputs
      wmma::load_matrix_sync(a_frag, a + aCol + aRow * lda, lda);
      wmma::load_matrix_sync(b_frag, b + bCol + bRow * ldb, ldb);

      // Perform the matrix multiplication
      wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }
  }

  // Load in the current value of c, scale it by beta, and add this our result
  // scaled by alpha
  int cCol = warpN * WMMA_N;
  int cRow = warpM * WMMA_M;

  if (cRow < m_ld && cCol < n_ld) {
    wmma::load_matrix_sync(c_frag, c + cCol + cRow * ldc, ldc,
                           wmma::mem_row_major);

    for (int i = 0; i < c_frag.num_elements; i++) {
      c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
    }

    // Store the output
    wmma::store_matrix_sync(d + cCol + cRow * ldc, c_frag, ldc,
                            wmma::mem_row_major);
  }
}

__host__ void matMultiplyOnHost(uint8_t *A, uint8_t *B, int *C, int alpha,
                                int beta, int numARows, int numAColumns,
                                int numBRows, int numBColumns, int numCRows,
                                int numCColumns) {
  for (int i = 0; i < numCRows; i++) {
    for (int j = 0; j < numCColumns; j++) {
      int temp = 0;

      for (int k = 0; k < numAColumns; k++) {
        temp += A[i * numAColumns + k] * B[j * numBRows + k];
      }

      C[i * numCColumns + j] = temp * alpha + beta * C[i * numCColumns + j];
    }
  }
}

int main(int argc, char **argv) {
  printf("Initializing...\n");

  int dev = findCudaDevice(argc, (const char **)argv);

  cudaDeviceProp deviceProp;
  checkCudaErrors(cudaGetDeviceProperties(&deviceProp, dev));

  // Tensor cores require a GPU of Volta (SM72) architecture or higher.
  if (deviceProp.major < 7 || (deviceProp.major <= 7 && deviceProp.minor < 2)) {
    printf(
        "immaTensorCoreGemm requires SM 7.2 or higher to use Tensor Cores. "
        "Exiting...\n");
    exit(EXIT_WAIVED);
  }

  printf("M: %d (%d x %d)\n", M_GLOBAL, M, M_TILES);
  printf("N: %d (%d x %d)\n", N_GLOBAL, N, N_TILES);
  printf("K: %d (%d x %d)\n", K_GLOBAL, K, K_TILES);

  uint8_t *A_h = NULL;
  uint8_t *B_h = NULL;
  int *C_h = NULL;
#if CPU_DEBUG
  int *result_hD = NULL;
  int *result_host = NULL;
#endif

  A_h = (uint8_t *)malloc(sizeof(uint8_t) * M_GLOBAL * K_GLOBAL);
  B_h = (uint8_t *)malloc(sizeof(uint8_t) * K_GLOBAL * N_GLOBAL);
  C_h = (int *)malloc(sizeof(int) * M_GLOBAL * N_GLOBAL);
#if CPU_DEBUG
  result_hD = (int *)malloc(sizeof(int) * M_GLOBAL * N_GLOBAL);
  result_host = (int *)malloc(sizeof(int) * M_GLOBAL * N_GLOBAL);
#endif

  uint8_t *A = NULL;
  uint8_t *B = NULL;
  int *C = NULL;
  int *D = NULL;

  checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&A),
                             sizeof(uint8_t) * M_GLOBAL * K_GLOBAL));
  checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&B),
                             sizeof(uint8_t) * N_GLOBAL * K_GLOBAL));
  checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&C),
                             sizeof(int) * M_GLOBAL * N_GLOBAL));
  checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&D),
                             sizeof(int) * M_GLOBAL * N_GLOBAL));

  assert(((unsigned long long)A) % 128 == 0);
  assert(((unsigned long long)B) % 128 == 0);
  assert(((unsigned long long)C) % 128 == 0);
  assert(((unsigned long long)D) % 128 == 0);

  init_host_matrices(A_h, B_h, C_h);

  checkCudaErrors(cudaMemcpy(A, A_h, sizeof(uint8_t) * M_GLOBAL * K_GLOBAL,
                             cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(B, B_h, sizeof(uint8_t) * N_GLOBAL * K_GLOBAL,
                             cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemcpy(C, C_h, sizeof(int) * M_GLOBAL * N_GLOBAL,
                             cudaMemcpyHostToDevice));
  checkCudaErrors(cudaMemset(D, 0, sizeof(int) * M_GLOBAL * N_GLOBAL));

  printf("Preparing data for GPU...\n");

  assert(((unsigned long long)A) % 128 == 0);
  assert(((unsigned long long)B) % 128 == 0);
  assert(((unsigned long long)C) % 128 == 0);
  assert(((unsigned long long)D) % 128 == 0);

  enum {
    // Compute the right amount of shared memory to request.
    // We need shared memory to hold per-CTA C and D matrix tiles, and to cache
    // per-CTA chunks of the A and B matrices. Therefore, the right amount to
    // request is the maximum of those two numbers.
    SHMEM_SZ = MAX(sizeof(uint8_t) * (BLOCK_COL_TILES * M) *
                       (CHUNK_K * K + SKEW_UINT8) * 2,
                   M * (BLOCK_ROW_WARPS * WARP_ROW_TILES) * N *
                       (BLOCK_COL_WARPS * WARP_COL_TILES) * sizeof(int))
  };

  printf("Required shared memory size: %lu Kb\n", SHMEM_SZ / 1024UL);

  int alpha = 1;
  int beta = 1;

  cudaEvent_t start, stop;

  checkCudaErrors(cudaEventCreate(&start));
  checkCudaErrors(cudaEventCreate(&stop));
  checkCudaErrors(cudaEventRecord(start));

  // If enough shared memory available on the GPU use high performant kernel
  if (deviceProp.sharedMemPerMultiprocessor >= SHMEM_SZ) {
    printf("Computing... using high performance kernel compute_gemm_imma \n");

    checkCudaErrors(cudaFuncSetAttribute(
        compute_gemm_imma, cudaFuncAttributeMaxDynamicSharedMemorySize,
        SHMEM_SZ));
    checkKernelErrors(
        (compute_gemm_imma<<<deviceProp.multiProcessorCount, THREADS_PER_BLOCK,
                             SHMEM_SZ>>>(A, B, C, D, alpha, beta)));
#if CPU_DEBUG
    checkCudaErrors(cudaMemcpy(result_hD, D, sizeof(int) * M_GLOBAL * N_GLOBAL,
                               cudaMemcpyDeviceToHost));
#endif
  } else {
    dim3 gridDim;
    dim3 blockDim;

    // blockDim.x must be a multiple of warpSize
    // 128x4 means we have 16 warps and a block computes a 64x64 output tile
    blockDim.x = 128;
    blockDim.y = 4;

    gridDim.x = (M_GLOBAL + (WMMA_M * blockDim.x / 32 - 1)) /
                (WMMA_M * blockDim.x / 32);
    gridDim.y = (N_GLOBAL + WMMA_N * blockDim.y - 1) / (WMMA_N * blockDim.y);

    printf("Computing... using simple_wmma_gemm_imma kernel\n");
    simple_wmma_gemm_imma<<<gridDim, blockDim>>>(A, B, C, D, M_GLOBAL,
                                                 N_GLOBAL, K_GLOBAL, alpha,
                                                 beta);
#if CPU_DEBUG
    checkCudaErrors(cudaMemcpy(result_hD, D, sizeof(int) * M_GLOBAL * N_GLOBAL,
                               cudaMemcpyDeviceToHost));
#endif
  }

  checkCudaErrors(cudaEventRecord(stop));
  checkCudaErrors(cudaEventSynchronize(stop));

#if CPU_DEBUG
  printf("Verifying correctness of the computations...\n");

  memcpy(result_host, C_h, sizeof(int) * M_GLOBAL * N_GLOBAL);

  matMultiplyOnHost(A_h, B_h, result_host, alpha, beta, M_GLOBAL, K_GLOBAL,
                    K_GLOBAL, N_GLOBAL, M_GLOBAL, N_GLOBAL);

  for (int i = 0; i < N_GLOBAL * M_GLOBAL; i++) {
    if (abs(result_hD[i] - result_host[i]) > 0) {
      printf("mismatch i=%d result_hD=%d result_host=%d\n", i, result_hD[i],
             result_host[i]);
    }
  }
  free(result_host);
  free(result_hD);
#endif

  float milliseconds = 0;

  checkCudaErrors(cudaEventElapsedTime(&milliseconds, start, stop));

  printf("Time: %f ms\n", milliseconds);
  printf("TOPS: %.2f\n", (((double)M_GLOBAL * N_GLOBAL * K_GLOBAL * 2) /
                          (milliseconds / 1000.)) /
                             1e12);

  free(A_h);
  free(B_h);
  free(C_h);
  checkCudaErrors(cudaFree(reinterpret_cast<void *>(A)));
  checkCudaErrors(cudaFree(reinterpret_cast<void *>(B)));
  checkCudaErrors(cudaFree(reinterpret_cast<void *>(C)));
  checkCudaErrors(cudaFree(reinterpret_cast<void *>(D)));

  return EXIT_SUCCESS;
}

The PPU code is identical to the CUDA code and uses the WMMA API in the same way; no modification is needed.
Downloading the Code
Same as the code-download step in the vectorAdd section above.
Build and Test
Build directly with the Makefile shipped in the CUDA Samples.
# Enter the sample directory, e.g.:
cd Samples/3_CUDA_Features/immaTensorCoreGemm
# Build and run the test
make
./immaTensorCoreGemm
# Output
Initializing...
GPU Device 0: "Ampere" with compute capability 8.0
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm_imma
Time: 3.457640 ms
TOPS: 39.75
cuBLAS Best Practices
cuBLAS Matrix Multiplication
Sample Code
#include <cstdio>
#include <cstdlib>
#include <vector>

#include <cublas_v2.h>
#include <cuda_runtime.h>

#include "cublas_utils.h"

using data_type = double;

int main(int argc, char *argv[]) {
  cublasHandle_t cublasH = NULL;
  cudaStream_t stream = NULL;

  const int m = 2;
  const int n = 2;
  const int k = 2;
  const int lda = 2;
  const int ldb = 2;
  const int ldc = 2;
  /*
   *   A = | 1.0 | 2.0 |
   *       | 3.0 | 4.0 |
   *
   *   B = | 5.0 | 6.0 |
   *       | 7.0 | 8.0 |
   */

  const std::vector<data_type> A = {1.0, 2.0, 3.0, 4.0};
  const std::vector<data_type> B = {5.0, 6.0, 7.0, 8.0};
  std::vector<data_type> C(m * n);
  const data_type alpha = 1.0;
  const data_type beta = 0.0;

  data_type *d_A = nullptr;
  data_type *d_B = nullptr;
  data_type *d_C = nullptr;

  cublasOperation_t transa = CUBLAS_OP_N;
  cublasOperation_t transb = CUBLAS_OP_N;

  printf("A\n");
  print_matrix(m, k, A.data(), lda);
  printf("=====\n");

  printf("B\n");
  print_matrix(k, n, B.data(), ldb);
  printf("=====\n");

  /* step 1: create cublas handle, bind a stream */
  CUBLAS_CHECK(cublasCreate(&cublasH));

  CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
  CUBLAS_CHECK(cublasSetStream(cublasH, stream));

  /* step 2: copy data to device */
  CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_A),
                        sizeof(data_type) * A.size()));
  CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_B),
                        sizeof(data_type) * B.size()));
  CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_C),
                        sizeof(data_type) * C.size()));

  CUDA_CHECK(cudaMemcpyAsync(d_A, A.data(), sizeof(data_type) * A.size(),
                             cudaMemcpyHostToDevice, stream));
  CUDA_CHECK(cudaMemcpyAsync(d_B, B.data(), sizeof(data_type) * B.size(),
                             cudaMemcpyHostToDevice, stream));

  /* step 3: compute */
  CUBLAS_CHECK(cublasDgemm(cublasH, transa, transb, m, n, k, &alpha, d_A, lda,
                           d_B, ldb, &beta, d_C, ldc));

  /* step 4: copy data to host */
  CUDA_CHECK(cudaMemcpyAsync(C.data(), d_C, sizeof(data_type) * C.size(),
                             cudaMemcpyDeviceToHost, stream));

  CUDA_CHECK(cudaStreamSynchronize(stream));

  /*
   *   C = | 23.0 | 31.0 |
   *       | 34.0 | 46.0 |
   */

  printf("C\n");
  print_matrix(m, n, C.data(), ldc);
  printf("=====\n");

  /* free resources */
  CUDA_CHECK(cudaFree(d_A));
  CUDA_CHECK(cudaFree(d_B));
  CUDA_CHECK(cudaFree(d_C));

  CUBLAS_CHECK(cublasDestroy(cublasH));

  CUDA_CHECK(cudaStreamDestroy(stream));

  CUDA_CHECK(cudaDeviceReset());

  return EXIT_SUCCESS;
}

The PPU code is identical to the CUDA code and uses the cuBLAS API in the same way; no modification is needed.
Downloading the Code
For this scenario, the code at https://github.com/NVIDIA/CUDALibrarySamples can be used directly:
# Download the CUDA Library Samples into the PPU Docker runtime environment
git clone https://github.com/NVIDIA/CUDALibrarySamples.git
Build and Test
Build directly with the CMake-based build files shipped in CUDALibrarySamples:
# Enter the sample directory, e.g.:
cd CUDALibrarySamples/cuBLAS/Level-3/gemm/
# Build and run the test
mkdir build
cd build
cmake ..
make
./cublas_gemm_example
# Output
A
1.00 3.00
2.00 4.00
=====
B
5.00 7.00
6.00 8.00
=====
C
23.00 31.00
34.00 46.00
=====
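Note that cuBLAS interprets all matrices as column-major, which is why A initialized as {1.0, 2.0, 3.0, 4.0} prints with 1.00 and 3.00 on the first row above. A common trick for multiplying row-major data without transposing it is to ask the column-major GEMM for B·A with the m/n dimensions swapped; the result buffer then holds C in row-major order. A minimal sketch (not part of the sample) reusing the handle and device pointers from the code above:

/* Row-major A (m x k) times row-major B (k x n) into row-major C (m x n):
 * in cuBLAS's column-major view this is computed as B * A with swapped
 * dimensions, so no data transposition is needed. */
CUBLAS_CHECK(cublasDgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N,
                         n, m, k,   /* note: n and m swapped */
                         &alpha,
                         d_B, n,    /* B, leading dimension n */
                         d_A, k,    /* A, leading dimension k */
                         &beta,
                         d_C, n));  /* C, leading dimension n */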
CUTLASS Best Practices
turing_tensorop_gemm
Sample Code
...
// This code section describes the tile size a thread block will compute
using ShapeMMAThreadBlock =
#ifdef __HGGCCC__
    cutlass::gemm::GemmShape<64, 64, 64>;  // <- threadblock tile M = 64, N = 64, K = 64
#else
    cutlass::gemm::GemmShape<128, 256, 64>;  // <- threadblock tile M = 128, N = 256, K = 64
#endif

// This code section describes tile size a warp will compute
#ifdef __HGGCCC__
using ShapeMMAWarp =
    cutlass::gemm::GemmShape<32, 32, 64>;  // <- warp tile M = 32, N = 32, K = 64
#else
using ShapeMMAWarp =
    cutlass::gemm::GemmShape<64, 64, 64>;  // <- warp tile M = 64, N = 64, K = 64
#endif

// This code section describes the size of MMA op
#ifdef __HGGCCC__
using ShapeMMAOp =
    cutlass::gemm::GemmShape<16, 16, 32>;  // <- MMA Op tile M = 16, N = 16, K = 32
#else
using ShapeMMAOp =
    cutlass::gemm::GemmShape<8, 8, 16>;  // <- MMA Op tile M = 8, N = 8, K = 16
#endif

// This code section describes how threadblocks are scheduled on GPU
using SwizzleThreadBlock =
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>;  // <- ??

// This code section describes the epilogue part of the kernel
using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,  // <- data type of output matrix
    128 / cutlass::sizeof_bits<ElementOutput>::value,  // <- the number of
                                                       // elements per vectorized
                                                       // memory access. For a byte,
                                                       // it's 16 elements. This
                                                       // becomes the vector width of
                                                       // math instructions in the
                                                       // epilogue too
    ElementAccumulator,       // <- data type of accumulator
    ElementComputeEpilogue>;  // <- data type for alpha/beta in linear
                              // combination function

// Number of pipelines you want to use
constexpr int NumStages = 2;

using Gemm = cutlass::gemm::device::Gemm<ElementInputA, LayoutInputA,
                                         ElementInputB, LayoutInputB,
                                         ElementOutput, LayoutOutput,
                                         ElementAccumulator, MMAOp, SmArch,
                                         ShapeMMAThreadBlock, ShapeMMAWarp,
                                         ShapeMMAOp, EpilogueOp,
                                         SwizzleThreadBlock, NumStages>;
...

For the complete PPU CUTLASS source code, contact PAI Serverless PDSA. 真武810E is fully compatible with NVIDIA's official open-source CUTLASS code; however, the PPU-optimized CUTLASS uses the __HGGCCC__ macro to enable the PPU MMA instructions and the corresponding memory layouts, accelerating performance with the Tensor Cores on the PPU, so the PPU CUTLASS is recommended.
Downloading the Code
For the complete PPU CUTLASS source code, contact PAI Serverless PDSA.
Build and Test
# Switch to the PPU CUTLASS V1.3 release branch
git checkout remotes/origin/gpgpu_1v3_release
# Enter the sample directory, e.g.:
cd examples/08_turing_tensorop_gemm/
# Build with the PPU build script
../common/build_and_run.sh turing_tensorop_gemm ppu
# Run the program, which performs a matrix multiplication of problem size
# M = 5120, N = 4096, K = 4096
./turing_tensorop_gemm.ppu
# Output
Passed
Related Documentation
To view and run more CUDA Samples, see CUDA Compatibility (v1.4.1).
For the list of supported NVCC compile options, see NVCC Options Compatibility (v1.4).
