Compute Architecture
This chapter elaborates on aspects of the GPU compute architecture that are important for CUDA programmers to reason about.
Architecture
A GPU is organized into an array of Streaming Multiprocessors (SMs).
- e.g. the A100 has 108 SMs
Each SM has several processing units called streaming processors or just CUDA cores. These cores share control logic and memory resources.
- e.g. the A100 has 64 cores per SM, so 6,912 cores in the GPU
In terms of memory:
- Each SM has a set of on-chip memory structures (such as registers and shared memory)
- All SMs also have access to gigabytes of off-chip device memory, called Global Memory
- For recent GPUs, this is HBM or HBM2 (High Bandwidth Memory)
- We will refer to this as DRAM (dynamic random access memory)
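As a rough illustration of where these two levels live, here is a minimal sketch (the kernel and buffer names are hypothetical, and it assumes a one-dimensional launch with 256 threads per block) that stages data from global memory through an on-chip __shared__ buffer:
__global__ void scale(const float* in, float* out, int n) {
    __shared__ float tile[256];            // on-chip memory, private to the SM running this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];         // load from off-chip global memory (DRAM)
        out[i] = 2.0f * tile[threadIdx.x]; // compute and write back to global memory
    }
}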
Block Scheduling
When a kernel is called, a grid of threads is launched. These threads are assigned to SMs on a block-by-block basis, i.e. all threads in a block are simultaneously assigned to the same SM.
Usually, multiple blocks are simultaneously assigned to one SM. The number of blocks that can be processed at the same time by an SM depends on hardware constraints.
This means that there is a limit on the number of blocks that can be simultaneously executing on a CUDA GPU. The runtime system maintains a list of blocks waiting to execute and assigns new blocks to SMs as previously assigned blocks complete execution.
Since all threads in a block are guaranteed to be assigned to the same SM, threads in the same block can interact with each other. One such interaction is barrier synchronization.
Synchronization
There is a __syncthreads() function to coordinate activities between threads in the same block. When a thread calls it, it will be held at the program location of the call until every thread in the block reaches that location.
Each __syncthreads call must be executed by all threads in a block. If we use control flow such as if-else, each __syncthreads call is considered a different barrier. Hence, if some threads go through the if branch and others the else branch, we can end up in a deadlock, since we never reach a state where all threads hit the same __syncthreads call.
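A sketch of this pitfall (illustrative kernels, not from the text): placing __syncthreads() inside both branches of a divergent if-else creates two different barriers, while hoisting the call out of the branch gives a single barrier that every thread reaches.
__global__ void bad_barrier(float* data) {
    if (threadIdx.x % 2 == 0) {
        // ... work for even threads ...
        __syncthreads();   // barrier A: reached only by even threads
    } else {
        // ... work for odd threads ...
        __syncthreads();   // barrier B: reached only by odd threads -> deadlock / undefined behaviour
    }
}

__global__ void good_barrier(float* data) {
    if (threadIdx.x % 2 == 0) {
        // ... work for even threads ...
    } else {
        // ... work for odd threads ...
    }
    __syncthreads();       // single barrier reached by all threads in the block
}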
Note that by design, threads in different blocks cannot perform barrier synchronization with each other. This means that the CUDA runtime can execute blocks in any order. This flexibility allows the runtime to efficiently allocate blocks to SMs according to the GPU's capacity.
Warps and SIMD
Here we go into further details on thread scheduling within a block. When a block is assigned to an SM, it gets further divided into 32-thread units called warps. Each warp comprises 32 threads of consecutive threadIdx values.
- threads 0-31 form the first warp
- threads 32-63 form the next warp, and so on
- If the number of threads in a block is not a multiple of 32, the last warp is padded with extra dormant threads to complete the warp
Blocks are partitioned into warps based on thread indices. If the blockDim is one-dimensional, the partitioning and ordering is trivial. If there are two or three dimensions, the thread indices will be linearized using a row-major layout before partitioning and ordering into warps.
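A small sketch of that linearization (the helper names are illustrative; 32 is the warp size):
__device__ unsigned int linear_thread_id() {
    // Row-major linearization: z is the slowest-varying dimension, x the fastest
    return threadIdx.z * blockDim.y * blockDim.x
         + threadIdx.y * blockDim.x
         + threadIdx.x;
}

__device__ unsigned int warp_id_in_block() {
    return linear_thread_id() / 32;   // consecutive linear IDs are grouped into warps of 32
}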
An SM is designed to execute all threads in a warp following the Single Instruction, Multiple Data (SIMD) model. This means that at any point in time, a single instruction is fetched and executed for all threads in the warp (for the most part; the common exception is conditional branches, discussed under Control Divergence below).
- e.g. the A100 SM has 64 cores, split into 4 processing blocks with 16 cores each. Threads in the same warp get assigned to the same processing block, which fetches one instruction and executes it for all 32 threads simultaneously.
The advantage of SIMD is that the cost of the control hardware (the instruction fetch/dispatch unit) is shared across many execution units. This allows for a smaller percentage of the hardware to be dedicated to control flow.
Control Divergence
Control divergence occurs when threads in the same warp follow different execution paths, e.g. when the kernel contains if-else statements that depend on threadIdx. In such cases, the pure SIMD model breaks down, and the hardware accommodates them by making multiple passes through the divergent branches.
For example, with if (threadIdx.x < 24), one pass is made for the threads below the threshold and another pass for those at or above it. Threads that are not participating in the current pass are simply held inactive. The threads then reconverge after the conditional statement and continue executing together.
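A hypothetical kernel matching this example; within warp 0 (threads 0-31), threads 0-23 take the if branch and threads 24-31 take the else branch, so the hardware makes two passes through this section:
__global__ void divergent_branch(float* a) {
    if (threadIdx.x < 24) {
        a[threadIdx.x] += 1.0f;   // pass 1: threads 0-23 active, threads 24-31 held inactive
    } else {
        a[threadIdx.x] -= 1.0f;   // pass 2: threads 24-31 active, threads 0-23 held inactive
    }
    // threads reconverge here and continue executing together
}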
Another example is a loop whose trip count depends on data:
int N = a[threadIdx.x];
for (int i = 0; i < N; ++i) {
    // do something
}
The cost we pay for this flexibility is the extra number of passes we need to make.
The most common source of control divergence is handling boundary conditions when the number of data points is not a multiple of the number of threads. In this case, only the warp straddling the boundary has control divergence. Hence, when the data size is large, the effect of control divergence is minimal.
An important implication of control divergence is that one cannot assume that all threads in a warp have the same execution timing. Hence we must use __syncwarp() if our code depends on threads within a warp being synchronized.
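A minimal sketch of such a dependency (illustrative names; it assumes a block of exactly 32 threads, i.e. one warp): the second half of the warp reads values written by the first half, so we synchronize the warp explicitly instead of assuming lockstep execution.
__global__ void warp_sync_example(int* out) {
    __shared__ int scratch[32];
    if (threadIdx.x < 16) {
        scratch[threadIdx.x] = threadIdx.x;       // only half the warp writes
    }
    __syncwarp();                                 // make the writes visible to the whole warp
    if (threadIdx.x >= 16) {
        out[threadIdx.x] = scratch[threadIdx.x - 16];
    }
}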
Warp Scheduling and Latency Tolerance
When assigning threads to SMs, we usually assign far more threads than there are cores on the SM. A natural question is whether we should instead just assign a number of threads equal to the number of cores.
It turns out that having many threads (or warps) assigned to an SM, beyond what can execute at once, is an important feature that allows GPUs to tolerate long-latency operations like reading from global memory.
When a warp assigned to an SM is not ready for execution (because it is waiting for the result of a previously initiated memory access), it is not selected. Instead, the SM selects another warp that is ready to execute. This mechanism allows the SM to "hide" the latency of long-latency operations and fill the time with meaningful work.
This dynamic scheduling of warps is made even more efficient by the zero-overhead scheduling design of GPUs. On CPUs, context switching is more expensive: when switching from a thread that is waiting to one that is ready to make progress, register contents need to be saved and restored, adding significant overhead.
On GPUs, because SMs hold the execution state of all assigned warps in hardware registers, there is very little overhead in switching between warps. Since GPUs can effectively "hide" the latency of long-latency operations this way, they dedicate less chip area to latency-reduction mechanisms like caches and branch prediction, and more to floating-point execution units and high-bandwidth memory access.
Occupancy
As we saw, over-subscription of warps to SMs is key for tolerating long latency operations. Hence, the goal is to maximize the number of warps assigned to an SM.
Occupancy. The ratio of the number of warps assigned to an SM to the maximum number of warps the SM supports is referred to as occupancy.
Due to different limits on resource allocation, an inefficient configuration can lead to under-utilization and sub-optimal occupancy. For example, the A100 has the following limits:
- Up to 32 blocks per SM
- Up to 64 warps per SM, i.e. up to 2,048 thread slots per SM
- Up to 1,024 threads per block
If we set block_size=512 and 4 blocks are assigned per SM, we max out the 2,048 thread slots.
But if we set block_size=32 and max out the number of blocks per SM at num_blocks=32, we only utilize 1,024 thread slots. Because we used too few threads per block, we end up with only 50% occupancy.
Another scenario for under-utilization is when the maximum number of threads per SM is not divisible by the block size. For example, if we choose block_size=768, then at most 2 blocks fit per SM. This results in 1,536 / 2,048 = 75% occupancy.
Another factor that may lower occupancy is register usage. The A100 has a maximum of 65,536 registers per SM, so to run at full occupancy each thread can use at most 65,536 / 2,048 = 32 registers. If our kernel uses 64 registers per thread, then at most 1,024 threads can be resident, and the maximum occupancy we can achieve is 1,024 / 2,048 = 50%, regardless of block size.
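The CUDA runtime can report the combined effect of these limits for a compiled kernel. Below is a sketch (my_kernel and the block size are placeholders) using cudaOccupancyMaxActiveBlocksPerMultiprocessor, which accounts for register and shared-memory usage:
int blockSize = 256;                      // placeholder block size
int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel, blockSize, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;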
Querying Device Properties
Get number of CUDA devices in the system:
int devCount;
cudaGetDeviceCount(&devCount);
Get device properties:
cudaDeviceProp devProp;
for (int i = 0; i < devCount; i++) {
cudaGetDeviceProperties(&devProp, i);
// Do something with devProp
}
Some attributes of devProp:
- devProp.maxThreadsPerBlock gives the maximum number of threads allowed in a block on the device.
- devProp.multiProcessorCount gives the number of SMs in the device.
- devProp.clockRate gives the clock frequency of the device.
- devProp.maxThreadsDim[0] gives the maximum number of threads in the x dimension of a block. Use index 1 for the y dimension and 2 for the z dimension.
- devProp.maxGridSize[0] similarly gives the maximum number of blocks in the x dimension of the grid.
- devProp.regsPerBlock gives the number of registers available to each block. The name is a misnomer - it is usually equivalent to the number of registers per SM as well.
- devProp.warpSize gives the size of a warp.
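Putting it together, a small sketch that prints a few of these attributes for every device in the system:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    for (int i = 0; i < devCount; ++i) {
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printf("Device %d: %s\n", i, devProp.name);
        printf("  SMs:                %d\n", devProp.multiProcessorCount);
        printf("  Max threads/block:  %d\n", devProp.maxThreadsPerBlock);
        printf("  Registers (per SM): %d\n", devProp.regsPerBlock);
        printf("  Warp size:          %d\n", devProp.warpSize);
    }
    return 0;
}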
Exercises
- Consider the following CUDA kernel and the corresponding host function that calls it:
__global__ void foo_kernel(int* a, int* b) {
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if (threadIdx.x < 40 || threadIdx.x >= 104) {
b[i] = a[i] + 1;
}
if (i % 2 == 0) {
a[i] = b[i] * 2;
}
for (unsigned int j = 0; j < 5 - (i % 3); ++j) {
b[i] += j;
}
}
void foo(int* a_d, int* b_d) {
unsigned int N = 1024;
foo_kernel<<< (N + 128 - 1) / 128, 128 >>>(a_d, b_d);
}
a. What is the number of warps per block?
128 / 32 = 4
b. What is the number of warps in the grid?
Number of blocks is 1,024 / 128 = 8.
So number of warps is 8 * 4 = 32.
c. For the statement b[i] = a[i] + 1 (inside the first if):
- How many warps in the grid are active?
- How many warps in the grid are divergent?
- What is the SIMD efficiency (in %) of warp 0 of block 0?
- What is the SIMD efficiency (in %) of warp 1 of block 0?
- What is the SIMD efficiency (in %) of warp 3 of block 0?
- Each block has 4 warps. Among them, only warp 2 is inactive, so the total number of active warps is 3 * 8 = 24.
- Warps 1 and 3 of each block are divergent, so 2 * 8 = 16 warps in the grid are divergent.
- The SIMD efficiency of warp 0 of block 0 is 100%.
- It is not entirely clear how SIMD efficiency is defined for a divergent warp. For warp 1 of block 0, 8 / 32 = 25% of the threads are active on the True pass and 75% on the False pass.
- For warp 3 of block 0, 24 / 32 = 75% of the threads are active on the True pass and 25% on the False pass.
d. For the statement a[i] = b[i] * 2 (inside the second if):
- How many warps in the grid are active?
- How many warps in the grid are divergent?
- What is the SIMD efficiency (in %) of warp 0 of block 0?
- All warps are active
- All warps are divergent
- The SIMD efficiency of warp 0 of block 0 is 50%, since only the even-indexed threads execute the statement.
e. For the for loop over j:
- How many iterations have no divergence?
- How many iterations have divergence?
The number of iterations of the j loop depends on the result of j < 5 - (i % 3):
- If i % 3 == 0, j runs from 0 to 4 (5 iterations)
- If i % 3 == 1, j runs from 0 to 3 (4 iterations)
- If i % 3 == 2, j runs from 0 to 2 (3 iterations)
Every warp contains threads with all three values of i % 3, so the first 3 iterations have no divergence, while the last 2 iterations (j = 3 and j = 4) are executed by only some threads and therefore have divergence.
- For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
2,048
- For the previous question, how many warps do you expect to have divergence due to the boundary check on vector length?
1 warp (covering threads 1,984-2,015) is divergent. The final warp (threads 2,016-2,047) fails the boundary check uniformly, so it does no useful work but is not divergent.
- Consider a hypothetical block with 8 threads executing a section of code before reaching a barrier. The threads require the following amount of time (in microseconds) to execute the sections: 2.0, 2.3, 3.0, 2.8, 2.4, 1.9, 2.6, and 2.9; they spend the rest of their time waiting for the barrier. What percentage of the threads’ total execution time is spent waiting for the barrier?
The slowest thread takes 3.0 µs, so the total time across all 8 threads is 3.0 * 8 = 24 µs. The sum of the execution times is 19.9 µs, so 24 - 19.9 = 4.1 µs is spent waiting, i.e. 4.1 / 24 ≈ 17.1% of the total time.
- A CUDA programmer says that if they launch a kernel with only 32 threads in each block, they can leave out the __syncthreads() instruction wherever barrier synchronization is needed. Do you think this is a good idea? Explain.
No. With 32 threads per block there is only one warp per block, but threads within a warp are not guaranteed to execute in lockstep (for example under control divergence, or with independent thread scheduling on newer architectures), so leaving out __syncthreads() and relying on implicit warp-level synchronization is not safe.
- If a CUDA device’s SM can take up to 1536 threads and up to 4 thread blocks, which of the following block configurations would result in the most number of threads in the SM?
- 128 threads per block
- 256 threads per block
- 512 threads per block
- 1024 threads per block
512 threads per block: 3 blocks fit per SM, giving 3 * 512 = 1,536 threads, which uses the SM's full thread capacity.
- Assume a device that allows up to 64 blocks per SM and 2048 threads per SM. Indicate which of the following assignments per SM are possible. In the cases in which it is possible, indicate the occupancy level.
- 8 blocks with 128 threads each
  - Possible, occupancy = 50%
- 16 blocks with 64 threads each
  - Possible, occupancy = 50%
- 32 blocks with 32 threads each
  - Possible, occupancy = 50%
- 64 blocks with 32 threads each
  - Possible, occupancy = 100%
- 32 blocks with 64 threads each
  - Possible, occupancy = 100%
- Consider a GPU with the following hardware limits: 2048 threads per SM, 32 blocks per SM, and 64K (65,536) registers per SM. For each of the following kernel characteristics, specify whether the kernel can achieve full occupancy. If not, specify the limiting factor.
- The kernel uses 128 threads per block and 30 registers per thread.
  - Yes, we can hit full occupancy.
- The kernel uses 32 threads per block and 29 registers per thread.
  - No, not enough threads per block: with at most 32 blocks per SM we only get 1,024 threads, i.e. 50% occupancy.
- The kernel uses 256 threads per block and 34 registers per thread.
  - No, we have too many registers per thread (2,048 * 34 = 69,632 > 65,536), so we cannot have 2,048 threads resident.
- A student mentions that they were able to multiply two 1024×1024 matrices using a matrix multiplication kernel with 32×32 thread blocks. The student is using a CUDA device that allows up to 512 threads per block and up to 8 blocks per SM. The student further mentions that each thread in a thread block calculates one element of the result matrix. What would be your reaction and why?
This does not seem possible: a 32×32 thread block has 1,024 threads, which exceeds the device's limit of 512 threads per block.