GPU Model | Release Year | NetBW (GB/s) | Compute (FP16 GFLOP/s) | MemBW (GB/s) | Ratio (FLOP/B) |
---|---|---|---|---|---|
V100 | 2017 | 300 | 125,000 | 900 | 139 |
Computer architecture:
SIMD can be realized in time (one functional unit processes the elements over successive cycles, as in vector pipelines) or in space (many parallel lanes process elements simultaneously): time-space duality.
CUDA thread hierarchy:
Grid → Block → Warp → Thread
An SM (streaming multiprocessor) contains many CUDA cores. Instructions execute in a SIMD (Single Instruction, Multiple Data) fashion: the same instruction is applied to multiple data elements at the same time.
Each SM contains 32, 64, or 128 CUDA cores, depending on the architecture.
Each SM can have multiple warps in flight at the same time, but all threads within a warp must execute the same instruction.
A kernel is defined with the __global__ declaration specifier, and the <<<...>>> execution configuration syntax specifies the number of CUDA threads that execute it.
Each block can hold an n-dimensional arrangement of threads. All threads in the same block reside on the same streaming multiprocessor (SM) and share the limited resources of that SM.
Blocks are organized into an n-dimensional grid of thread blocks. The number of thread blocks in a grid is usually dictated by the size of the data being processed.
Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses.
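A minimal sketch of how one thread locates its element within this hierarchy (the kernel and variable names are illustrative, not from the notes):

```cuda
// Each thread handles one element; its global index combines the
// block's position in the grid with the thread's position in the block.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may overshoot n
        data[i] *= factor;
}

// Launch: enough 256-thread blocks to cover n elements.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```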
Each SM possesses its own dedicated shared memory, caches, constant cache, and register file. However, all SMs share the same global memory.
The memory is divided into banks that can be accessed simultaneously. Banks share address and data buses (to minimize pin cost).
Each bank can start and complete one access per cycle, so N concurrent accesses can be sustained if all N go to different banks.
Function prototypes:
float serialFunction(...);
__global__ void kernel(...);
main():
cudaMalloc(&d_in, bytes)
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice)
kernel<<<execution configuration>>>(args...)
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost)
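Filling in the skeleton above as a complete example: a vector add, with error checking omitted for brevity (all names are illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host allocations and input data
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    // Device allocations and host-to-device copies
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch with enough blocks to cover n elements
    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);

    // Copy the result back and clean up
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f\n", h_c[42]); // expect 42 + 84 = 126
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```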
Inside a kernel, __shared__ declares memory shared by all threads of a block, and __syncthreads() is a barrier that synchronizes all threads of the block.
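A sketch of the usual pattern: each block stages its data in __shared__ memory, and __syncthreads() guarantees every load has completed before any thread reads a neighbor's value (kernel and names are illustrative):

```cuda
#define BLOCK 256

// Sum-reduce each block's BLOCK elements into out[blockIdx.x].
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[BLOCK];          // one tile per block, on the SM
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    buf[tid] = (i < n) ? in[i] : 0.0f;    // stage global -> shared
    __syncthreads();                      // all loads done before any read

    // Tree reduction within the block; barrier between levels
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}
```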
Constant memory:
Constant memory is cached on each SM, and it is particularly fast when all threads of a warp access the same value (the read is broadcast).
// Declare the mask as a global variable
#define MASK_WIDTH 5
__constant__ float M[MASK_WIDTH];
// Initialize the mask from the host
cudaMemcpyToSymbol(M, M_h, MASK_WIDTH * sizeof(float));
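Continuing the snippet above, a hedged sketch of a 1D convolution kernel that reads the mask: in each loop iteration every thread of a warp reads the same M[j], which is exactly the broadcast case the constant cache is fast at (the kernel itself is illustrative):

```cuda
__global__ void conv1d(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int j = 0; j < MASK_WIDTH; ++j) {  // all threads read the same M[j]:
        int k = i + j - MASK_WIDTH / 2;     // a constant-cache broadcast
        if (k >= 0 && k < n) acc += in[k] * M[j];
    }
    out[i] = acc;
}
```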
Variable declaration | Memory | Scope | Lifetime |
---|---|---|---|
int LocalVar | register | thread | thread |
int localArr[N] | global | thread | thread |
__device__ __shared__ int SharedVar | shared | block | block |
__device__ int GlobalVar | global | grid | application |
__device__ __constant__ int ConstantVar | constant | grid | application |
__device__ is optional when used with __shared__ or __constant__.
NVIDIA GPUs typically have 32 shared-memory banks; successive 32-bit words map to successive banks, so bank = word address % 32 (i.e., (byte address / 4) % 32).