Performance Considerations
This chapter goes deeper into off-chip memory (DRAM) and discusses memory coalescing and memory latency hiding. It also discusses thread granularity coarsening.
Memory Coalescing
One of the most important factors in CUDA kernel performance is accessing data in the global memory. This chapter further discusses memory coalescing techniques to move data between global memory and shared memory or registers efficiently.
Global memory is implemented with DRAM:
- The access latency to a bit in DRAM is relatively high (tens of nanoseconds)
- Because of this slowness, DRAM is designed to use parallelism to increase the memory access throughput
Each time a DRAM location is accessed, a range of consecutive locations nearby is accessed as well. Once sensed by the sense amplifiers, the data from all these locations can be transferred to the processor.
- These consecutive locations accessed together are known as a DRAM burst
- Making use of these consecutive and simultaneous accesses can allow us to improve memory access throughput
When a warp of threads is launched, all threads in the warp execute the same instruction at the same point in time. The hardware detects whether the threads are accessing consecutive memory locations.
- Thus, the best access pattern is for threads in a warp to access consecutive data locations in global memory
- We say that the hardware coalesces the accesses into a consolidated access to consecutive DRAM locations
Recall that multidimensional data are linearized according to the row-major convention and stored into memory space. The read accesses are coalesced if consecutive threads in a warp are assigned to consecutive memory elements in this row major storage format.
Row Major
Consider the following matrix multiplication code in which each thread computes one element of the output matrix P without tiling:
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
if (row < Width && col < Width) {
    float Pvalue = 0.0f;
    for (unsigned int k = 0; k < Width; ++k) {
        Pvalue += N[row*Width + k]*M[k*Width + col];
    }
    P[row*Width + col] = Pvalue;
}
Notice that memory accesses to M are coalesced because:
- The index of M is k * Width + col
- k and Width have the same values across all threads in a warp
- col = blockIdx.x * blockDim.x + threadIdx.x, which means that consecutive threads (with incrementing threadIdx.x) will access consecutive elements (a worked example follows)
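For example, in the first iteration (k = 0) of a warp whose threads cover col = 0 through 31, the addresses touched in M are those of M[0] through M[31]: 32 consecutive 4-byte words that the hardware can combine into a small number of DRAM bursts.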
Column Major
Now suppose the matrix M is stored in column-major format instead. The code adjusts to the following:
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
if (row < Width && col < Width) {
    float Pvalue = 0.0f;
    for (unsigned int k = 0; k < Width; ++k) {
        Pvalue += N[row*Width + k]*M[col*Width + k];
    }
    P[row*Width + col] = Pvalue;
}
Observe that the index into M changed from k*Width + col to col*Width + k. In column-major format, consecutive memory locations hold elements in adjacent rows of the same column rather than in adjacent columns of the same row, so the row and column indices swap roles when linearizing, and the indexing into M changes accordingly.
Now we can see that we have lost the coalesced access to M:
- The indexing into M is now col*Width + k
- Consecutive threads still have consecutive values of col
- But col is now multiplied by Width, meaning that consecutive threads access values that are Width apart, which can be very far (a worked example follows)
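For example, in the first iteration (k = 0), threads with col = 0 through 31 touch M[0], M[Width], M[2*Width], ..., M[31*Width]. Each access is Width floats away from its neighbour, so the accesses fall into different DRAM bursts and cannot be coalesced.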
Corner Turning
This is a method to create coalesced memory accesses even when the computation is not naturally amenable to them. The idea is to read from global memory into shared memory in a coalesced manner, and then perform the unfavourable access pattern on the much cheaper shared memory instead.
Consider the tiled matrix multiplication example with M stored in column-major format. First, we do a coalesced read of our tile of M into Mds by swapping the roles of threadIdx.x and threadIdx.y during the load. The elements can be arranged in Mds in either row- or column-major order, as long as the indexing in the compute loop matches. Then we can do fast reads from Mds to do the matrix multiplication for our tile. A minimal sketch of such a load follows.
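Below is a minimal sketch of such a corner-turned load, assuming M is stored in column-major order and reusing the tiled kernel's by, ph, tx, ty and TILE_WIDTH variables; the full corner-turning kernel (for a column-major N) appears in the exercises.

// Coalesced load of a tile from a column-major M: consecutive tx read consecutive
// addresses, and each thread writes Mds[tx][ty] instead of Mds[ty][tx], so the tile
// ends up in Mds in the usual orientation and the compute loop can keep using Mds[ty][k].
Mds[tx][ty] = M[(ph*TILE_WIDTH + ty)*Width + by*TILE_WIDTH + tx];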
Hiding Memory Latency
DRAM bursting is a form of parallel organization: multiple locations are accessed in the DRAM core array in parallel. There are two more ways of parallel organization: banks and channels. A single processor may have 8 channels. Each channel has a bus that accesses many banks.
The bus transfers data from the banks back to the processor. The data transfer bandwidth of the bus is defined by its width and clock frequency. Modern DDR buses perform two data transfers per clock cycle.
- e.g. a 64-bit DDR bus with a clock frequency of 1 GHz has a bandwidth of 8 B * 2 * 1 GHz = 16 GB/s
For each channel, we need many banks in order to fully utilize the data transfer bandwidth of the bus. Why?
- Recall earlier that the data access latency is slow, because it takes time for cells to share their stored charge with the sensing amplifier
- Once the sensing amplifier has completed its work, the burst data is delivered through the bus (this part is relatively much faster)
- If we only have a single bank, there is a long latency waiting for the sensing amplifier while the bus sits idle
- If we have multiple banks, we can trigger multiple read accesses to different banks in close succession in an asynchronous manner
- When each bank has been sensed, the delivery of burst data can be tightly packed and fill up the idle time
In general, if the ratio of the cell array access latency to the data transfer time is R, then we need at least R + 1 banks to fully utilize the data transfer bandwidth of the bus (a numeric illustration follows below).
- Note that we need more than R banks because of bank conflicts. Each bank can only serve one data access at a time (even though it stores many data elements), so if multiple threads need data from the same bank, there is a conflict. Having more banks helps reduce the probability of bank conflicts.
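For illustration (with made-up numbers), if the cell array access latency were 40 ns and the burst data transfer time 10 ns, then R = 40 / 10 = 4 and we would need at least 5 banks per channel, so that a new burst is always ready to use the bus as soon as the previous one finishes.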
There is an important connection between occupancy and parallel access to DRAM data. As we saw earlier, maximizing occupancy ensures that we have enough threads on the SMs to hide long memory access latencies. In this section, we see another distinct advantage of maximizing occupancy: having many threads access different memory locations allows us to fully utilize the data transfer bandwidth of the buses, provided the memory accesses are well distributed across channels and banks.
In practice, data is distributed across channels and banks in an interleaved manner. Suppose we have a toy example:
- 4 channels
- 2 banks per channel
- Each bank stores 2 consecutive floats
Now suppose we are doing a tiled matrix multiplication between two 4 x 4 matrices M @ N, where we set the tile width to 2.
The storage of M is like so (referring to the linearized index):
- Channel 0
  - Bank 0: M[0], M[1]
  - Bank 1: M[8], M[9]
- Channel 1
  - Bank 0: M[2], M[3]
  - Bank 1: M[10], M[11]
- Channel 2
  - Bank 0: M[4], M[5]
  - Bank 1: M[12], M[13]
- Channel 3
  - Bank 0: M[6], M[7]
  - Bank 1: M[14], M[15]
As we can see, consecutive elements of our data array are first placed into bank 0 of all the channels, then into bank 1 of all the channels. If there is more data, we wrap back around to bank 0 again. This breadth-first distribution across channels helps ensure that even when the amount of data is small, it is well spread across channels and banks.
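The mapping implied by this toy layout can be written down explicitly. The following sketch is only an illustration of the interleaving described above, not an actual hardware API:

// Toy interleaving: 4 channels, 2 banks per channel, 2 consecutive floats per bank.
// Returns the channel that element i of the linearized array lands in, and writes
// the bank via *bank, under the layout shown above.
int channelOf(int i, int* bank) {
    int group = i / 2;        // which 2-float group the element belongs to
    *bank = (group / 4) % 2;  // groups fill bank 0 of every channel first, then bank 1
    return group % 4;         // rotating across the 4 channels
}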
Now for the matrix multiplication, we can verify that in phase 0 of the tiled matrix multiplication, these are the elements of M requested by each block:
- Block 0,0: M[0], M[1], M[4], M[5]
- Block 0,1: M[0], M[1], M[4], M[5]
- Block 1,0: M[8], M[9], M[12], M[13]
- Block 1,1: M[8], M[9], M[12], M[13]
Deriving these is a bit tedious, so we will not go through every block (one example follows below). The key observation is that in phase 0 we access both banks of channels 0 and 2, so we are utilizing 50% of the channels. With larger matrices, we can fully utilize all channels and banks.
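For instance, block (0,0) computes the P tile covering rows 0-1 and columns 0-1, so in phase 0 it loads the M tile covering rows 0-1 and columns 0-1 of M, which are linearized indices 0, 1, 4 and 5 when Width = 4.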
Thread Coarsening
So far, we have focused on kernels where work is parallelized across threads at the finest granularity. For example, in the matrix multiplication example, each thread is assigned one element of the output matrix P. The advantage of parallelizing at the finest granularity is that it enhances transparent scalability: if the hardware has enough resources to run all the work in parallel, we have exposed enough parallelism to exploit it fully.
However, if the hardware does not have enough resources to run all the work at once, some of it is queued up and executed sequentially. In this case, there is a price to be paid for such aggressive parallelism, such as:
- redundant loading of the same data by different blocks
- redundant work
- synchronization overhead
When the threads are executed in parallel, this price is often well worth it. But if the hardware ends up serializing the work as a result of insufficient resources, this price may have been paid unnecessarily. It may thus be worthwhile to reduce the amount of parallelism by having each thread do more work, in exchange for reducing the price that we pay.
One such technique is called thread coarsening. We can see a good example in the tiled matrix multiplication kernel. For calculating a given tile of the output matrix P, each thread block must load its own copy of the input tiles of matrix M. This means that the same tile of input matrix M will be loaded redundantly by different thread blocks, each incurring its own global memory accesses.
More concretely, consider that we are computing rows of P from P[i, :] to P[j, :].
- One thread block may be computing P[i, k1] to P[j, k2]
- Another thread block may be computing P[i, k2] to P[j, k3]
- But notice that to do these computations, both blocks have to load M[i, k1] to M[j, k2] in one of their phases
- It may make more sense for a single thread block to compute a few adjacent P tiles at once, and thus load this slice of matrix M only once, reducing redundant memory accesses
Here we can dive into the thread coarsening matrix multiplication kernel:
#define TILE_WIDTH 32
#define COARSE_FACTOR 4
__global__ void matrixMulKernel(float* M, float* N, float* P, int width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int row = by * TILE_WIDTH + ty;
    // Each thread computes COARSE_FACTOR output elements, starting at colStart
    // and striding by TILE_WIDTH.
    int colStart = bx * TILE_WIDTH * COARSE_FACTOR + tx;

    float Pvalue[COARSE_FACTOR];
    for (int c = 0; c < COARSE_FACTOR; ++c) {
        Pvalue[c] = 0.0f;
    }

    for (int ph = 0; ph < width/TILE_WIDTH; ++ph) {
        // Load one M tile per phase; it is reused for all COARSE_FACTOR N tiles below.
        Mds[ty][tx] = M[row*width + ph*TILE_WIDTH + tx];
        for (int c = 0; c < COARSE_FACTOR; ++c) {
            int col = colStart + c * TILE_WIDTH;
            // Load the N tile for this coarsening step.
            Nds[ty][tx] = N[(ph * TILE_WIDTH + ty) * width + col];
            __syncthreads();
            for (int k = 0; k < TILE_WIDTH; ++k) {
                Pvalue[c] += Mds[ty][k] * Nds[k][tx];
            }
            __syncthreads();
        }
    }

    for (int c = 0; c < COARSE_FACTOR; ++c) {
        int col = colStart + c * TILE_WIDTH;
        P[row * width + col] = Pvalue[c];
    }
}
Let's unpack the code above. Instead of each thread block processing one tile of the output matrix P, one thread block now processes COARSE_FACTOR = 4 adjacent tiles of P.
- Therefore, each thread now computes 4 elements of P instead of 1
First we identify row and colStart:
- row is the row of P that we are computing, as per the normal tiled kernel
- colStart is the first column that this thread computes. Because each thread block now covers COARSE_FACTOR tiles horizontally, we multiply by COARSE_FACTOR to find where this block's columns start. It is a "start" because each thread computes 4 elements, and this variable indicates the first of them (a numeric example follows below)
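For example, with TILE_WIDTH = 32 and COARSE_FACTOR = 4, the block with bx = 1 has colStart = 1 * 32 * 4 + tx = 128 + tx, and the thread with tx = 0 in that block computes columns 128, 160, 192 and 224 of P.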
Initialize Pvalue: instead of one float, we have an array of 4 floats.
Now the main loop:
- The outer loop is over phases, as before. Each phase loads a new tile of M (moving rightwards) and a new tile of N (moving downwards) to compute the same P tile in a tiled manner.
- Notice that we load a tile of M for this block that will be reused across the iterations of the coarsening loop over c.
  - The way M is loaded is as per normal: we locate the correct row, and the column is determined by ph * TILE_WIDTH + tx
- The coarsening loop is where things get interesting:
  - For a given value of c, we are working on one tile of N
  - Each iteration of the loop moves us a stride of TILE_WIDTH to the right, as reflected by the way col is computed as colStart + c * TILE_WIDTH
  - e.g. a given thread may compute the columns col = 0, 32, 64, 96 rather than four adjacent columns
  - Once col has been computed to account for coarsening, the indexing into N is as per previous
So we see that for the same M tile that is loaded, we reuse it COARSE_FACTOR times for different N tiles. Thread coarsening is a common technique that can lead to substantial performance improvements.
Pitfalls
There are some potential pitfalls for thread coarsening:
- Applying the optimization when it is unnecessary. Thread coarsening is only beneficial when a price is being paid for parallelization, such as redundant data loading. It will not help when no such price is being paid.
- Applying too much coarsening, such that hardware resources become underutilized. Coarsening reduces the amount of parallelism exposed, which can lead to underutilisation.
- The coarsened kernel may also use more resources, which can hurt occupancy - for example, more registers per thread or more shared memory per thread block.
Checklist of optimizations
- Maximizing occupancy
  - Idea: more parallel memory accesses to hide DRAM latency
  - Strategy: tune parameters such as threads per block, shared memory per block and registers per thread to maximize occupancy
- Enabling coalesced global memory accesses
  - Idea: less global memory traffic and better utilization of DRAM bursts
  - Strategies:
    - Transfer from global to shared memory in a coalesced manner and perform uncoalesced accesses in shared memory
    - Rearrange the mapping of threads to data
    - Rearrange the layout of the data
- Minimizing control divergence
  - Idea: higher SIMD efficiency and fewer idle cores during SIMD execution
  - Strategies:
    - Rearrange the mapping of threads to work and/or data
    - Rearrange the layout of the data
- Tiling of reused data
  - Idea: less global memory traffic by reusing tiles
  - Strategy:
    - Place reused data in shared memory or registers so that it is read from global memory only once
- Privatisation
  - Idea: less contention and serialization of atomic updates to a shared copy of the data
  - Strategy:
    - Apply partial updates to a private copy of the data and then update the universal copy when done (a minimal sketch follows this checklist)
- Thread coarsening
  - Idea: less redundant work, divergence and global memory traffic
  - Strategy:
    - Assign multiple units of parallelism to each thread to reduce the price of parallelism when it would otherwise be paid unnecessarily
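Privatisation is not covered in detail in this chapter, so purely as an illustration of the checklist item above, here is a minimal sketch of a histogram kernel that accumulates counts in a per-block private copy in shared memory before merging into the global histogram. The kernel name, the 256-bin size and the parameter names are assumptions made for this sketch.

__global__ void histPrivatized(const unsigned char* data, unsigned int n,
                               unsigned int* histo) {
    __shared__ unsigned int histo_s[256];   // private, per-block copy of the histogram
    for (unsigned int b = threadIdx.x; b < 256; b += blockDim.x)
        histo_s[b] = 0u;
    __syncthreads();

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&histo_s[data[i]], 1u);   // contention only among threads of this block
    __syncthreads();

    for (unsigned int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);   // one atomic merge per bin per block
}

Atomic updates within the block go to fast shared memory, and the slow, contended atomic updates to the global copy are reduced to one per bin per block.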
Exercises
- Write a matrix multiplication kernel function that corresponds to the design illustrated in Fig. 6.4.
Fig 6.4 is the corner turning example. Suppose we have matrices M x N = P. For simplicity, we assume square matrices of Width.
The access to M remains the same as normal tiled matrix multiplication, but we want to change the way N is loaded into shared memory, since the assumption is that N is stored in column major format.
Doing a recap: each block computes a tile of P. Let T denote the TILE_WIDTH. The rows and columns are determined by the current values of by and bx respectively:
- The first row of the block is by * T
- The first column of the block is bx * T
- ty and tx give us the coordinates of the exact element within the block
Now let's focus on N. Here we need to distinguish between the logical indexing into N (call it N_logical) and the indexing into N based on how it is stored (column major, call it N_stored). We start with the logical view:
- Each phase of the tiled matmul moves us downwards on N
- So we start at row ph * T and work downwards
- The columns are fixed as bx * T to (bx + 1) * T - 1
Since N is stored in column major, we need to swap the indexing when referring to N_stored. So:
- We start at row bx * T (denoted N_rowstart in the code)
- We start at column ph * T (denoted N_colstart in the code)
Now we have found the correct coordinates of the tile within N. This is where it gets tricky: we want to read the elements of this tile in a coalesced way, such that incrementing tx by 1 reads the next element in the linearized array.
- So the first element is at linearized index N_rowstart * Width + N_colstart, corresponding to N_logical[ph*T, bx*T]
- The second element is at N_rowstart * Width + N_colstart + 1, corresponding to N_logical[ph*T + 1, bx*T], and so on
- This means that we will be reading in the transpose of this N tile, and we need to transpose it again when storing into Nds so that we get the correct tile when doing the final multiplication
Hopefully this analysis sheds some light on the code below.
#define TILE_WIDTH 16

__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify P[Row, Col] to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {
        // Load M, N into shared memory
        Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
        // Note: rowstart, colstart refer to N_stored (N is column major)
        int N_rowstart = bx * TILE_WIDTH;
        int N_colstart = ph * TILE_WIDTH;
        Nds[tx][ty] = N[(N_rowstart + ty)*Width + N_colstart + tx];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            // We index Nds as per usual since it is now row major
            Pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();
    }
    P[Row*Width + Col] = Pvalue;
}
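As a usage note, here is a minimal sketch of how such a kernel might be launched, assuming Width is a multiple of TILE_WIDTH and that d_M, d_N and d_P are hypothetical device pointers that have already been allocated and populated:

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per output tile of P
matrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);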
- For tiled matrix multiplication, of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid uncoalesced accesses to global memory? (You need to consider only square blocks.)
Not too sure about this one. Since a warp has 32 threads, and we read 32 elements simultaneously, it is good if block_size is a multiple of 32, so that we get coalesced reads of 32 elements at once.
- Consider the following CUDA kernel:
__global__ void foo_kernel(float* a, float* b, float* c, float* d, float* e) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float a_s[256];
    __shared__ float bc_s[4*256];
    a_s[threadIdx.x] = a[i];
    for(unsigned int j = 0; j < 4; ++j) {
        bc_s[j*256 + threadIdx.x] = b[j*blockDim.x*gridDim.x + i] + c[i*4 + j];
    }
    __syncthreads();
    d[i + 8] = a_s[threadIdx.x];
    e[i*8] = bc_s[threadIdx.x*4];
}
For each of the following memory accesses, specify whether they are coalesced, uncoalesced, or whether coalescing is not applicable (the line numbers count the kernel's lines from 01 at the signature, so line 05 is the a_s[threadIdx.x] = a[i] statement, line 07 is the bc_s store inside the loop, and lines 10 and 11 are the writes to d and e):
a. The access to array a of line 05
Coalesced because increasing tx by 1 takes the next element of a
b. The access to array a_s of line 05
Not applicable, a_s is shared memory.
c. The access to array b of line 07
Coalesced because increasing tx by 1 will take the next element of b
d. The access to array c of line 07
Not coalesced because increasing tx by 1 will increase the index into c by 4.
e. The access to array bc_s of line 07
Not applicable, bc_s is shared memory.
f. The access to array a_s of line 10
Not applicable, a_s is shared memory.
g. The access to array d of line 10
Coalesced.
h. The access to array bc_s of line 11
Not applicable.
i. The access to array e of line 11
Not coalesced, increasing tx will increment access to e by 8.
- What is the floating point to global memory access ratio (in OP/B) of each of the following matrix-matrix multiplication kernels?
a. The simple kernel described in Chapter 3, Multidimensional Grids and Data, without any optimizations applied.
In the simple matmul, each thread accesses Width elements of M and Width elements of N, and performs one multiplication and one addition per loop iteration.
- 2 x Width global memory accesses per thread
- 2 x Width OPs per thread
- OP/B = 2 x Width / (2 x Width x 4) = 1/4 OP/B (because each float is 4 bytes)
b. The kernel described in Chapter 5, Memory Architecture and Data Locality, with shared memory tiling applied using a tile size of 32 x 32.
For tiled matmul (see tiled matmul), each thread still does 2 x Width number of OPs. But the number of global memory accesses is now 2 x Width / TILE_WIDTH.
OP / B = 2 x Width / (2 x Width x 4 / 32) = 8 OP/B
c. The kernel described in this chapter with shared memory tiling applied using a tile size of 32 x 32 and thread coarsening applied using a coarsening factor of 4.
Thread coarsening means that we reuse the same M tile for computing 4 output tiles:
- Width / TILE_WIDTH x COARSE_FACTOR accesses to N per thread
- Width / TILE_WIDTH accesses to M per thread
- 2 x Width x COARSE_FACTOR FLOPs per thread
- Per output element, the numerator is therefore 2 x Width FLOPs and the denominator is 4 x (Width / (TILE_WIDTH x COARSE_FACTOR) + Width / TILE_WIDTH) bytes
- OP/B = (2/4) / (1/128 + 1/32) = 12.8 OP/B