Performance Considerations
This chapter goes deeper into off-chip memory (DRAM) and discusses memory coalescing and memory latency hiding. It also discusses thread granularity coarsening.
Memory Coalescing
One of the most important factors in CUDA kernel performance is accessing data in the global memory. This chapter further discusses memory coalescing techniques to move data between global memory and shared memory or registers efficiently.
Global memory is implemented with DRAM:
- The access latency to a bit in DRAM is relatively high (tens of nanoseconds)
- Because of this slowness, DRAM is designed to use parallelism to increase the memory access throughput
Each time a DRAM location is accessed, a range of consecutive locations nearby is accessed as well. Once sensed by the sense amplifiers, the data from all these locations can be transferred to the processor.
- These consecutive locations accessed together are known as a DRAM burst
- Making use of these consecutive and simultaneous accesses can allow us to improve memory access throughput
When a warp of threads is launched, all threads in the warp execute the same instruction at the same point in time. The hardware detects whether the threads are accessing consecutive memory locations.
- Thus, the best access pattern is for threads in a warp to access consecutive data locations in global memory
- We say that the hardware coalesces the accesses into a consolidated access to consecutive DRAM locations
Recall that multidimensional data are linearized according to the row-major convention and stored into memory space. The read accesses are coalesced if consecutive threads in a warp are assigned to consecutive memory elements in this row major storage format.
Row Major
Consider the following matrix multiplication code in which each thread computes one element of the output matrix P without tiling:
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
if (row < Width && col < Width) {
    float Pvalue = 0.0f;
    for (unsigned int k = 0; k < Width; ++k) {
        Pvalue += N[row*Width + k]*M[k*Width + col];
    }
    P[row*Width + col] = Pvalue;
}
Notice that memory accesses to M are coalesced because:
- The index of M is k * Width + col
- k and Width have the same values across all threads in a warp
- col = blockIdx.x * blockDim.x + threadIdx.x, which means that consecutive threads (with incrementing threadIdx.x) will access consecutive elements (a worked example follows)
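For example, in the first iteration (k = 0) of a warp whose threads cover col = 0 through 31, the addresses touched in M are those of M[0] through M[31]: 32 consecutive 4-byte words that the hardware can combine into a small number of DRAM bursts.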
Column Major
Now suppose the matrix M is stored in column-major format instead. The code adjusts to the following:
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
if (row < Width && col < Width) {
    float Pvalue = 0.0f;
    for (unsigned int k = 0; k < Width; ++k) {
        Pvalue += N[row*Width + k]*M[col*Width + k];
    }
    P[row*Width + col] = Pvalue;
}
Observe that the index into M changed from k*Width + col to col*Width + k. In column-major format, consecutive memory locations hold elements in adjacent rows of the same column rather than in adjacent columns of the same row, so the row and column indices swap roles when linearizing, and the indexing into M changes accordingly.
Now we can see that we have lost the coalesced access to M:
- The indexing into M is now col*Width + k
- Consecutive threads still have consecutive values of col
- But col is now multiplied by Width, meaning that consecutive threads access values that are Width apart, which can be very far (a worked example follows)
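For example, in the first iteration (k = 0), threads with col = 0 through 31 touch M[0], M[Width], M[2*Width], ..., M[31*Width]. Each access is Width floats away from its neighbour, so the accesses fall into different DRAM bursts and cannot be coalesced.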
Corner Turning
This is a method to create coalesced memory accesses even when the computation is not naturally amenable to them. The idea is to read from global memory into shared memory in a coalesced manner, and then perform the unfavourable access pattern on the much cheaper shared memory instead.
Consider the tiled matrix multiplication example with M stored in column-major format. First, we do a coalesced read of our tile of M into Mds by swapping the roles of threadIdx.x and threadIdx.y during the load. The elements can be arranged in Mds in either row- or column-major order, as long as the indexing in the compute loop matches. Then we can do fast reads from Mds to do the matrix multiplication for our tile. A minimal sketch of such a load follows.
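Below is a minimal sketch of such a corner-turned load, assuming M is stored in column-major order and reusing the tiled kernel's by, ph, tx, ty and TILE_WIDTH variables; the full corner-turning kernel (for a column-major N) appears in the exercises.

// Coalesced load of a tile from a column-major M: consecutive tx read consecutive
// addresses, and each thread writes Mds[tx][ty] instead of Mds[ty][tx], so the tile
// ends up in Mds in the usual orientation and the compute loop can keep using Mds[ty][k].
Mds[tx][ty] = M[(ph*TILE_WIDTH + ty)*Width + by*TILE_WIDTH + tx];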
Hiding Memory Latency
DRAM bursting is a form of parallel organization: multiple locations are accessed in the DRAM core array in parallel. There are two more ways of parallel organization: banks and channels. A single processor may have 8 channels. Each channel has a bus that accesses many banks.
The bus transfers data from the banks back to the processor. The data transfer bandwidth of the bus is defined by its width and clock frequency. Modern DDR buses perform two data transfers per clock cycle.
- e.g. a 64-bit DDR bus with a clock frequency of 1 GHz has a bandwidth of 8 B * 2 * 1 GHz = 16 GB/s
For each channel, we need many banks in order to fully utilize the data transfer bandwidth of the bus. Why?
- Recall earlier that the data access latency is slow, because it takes time for cells to share their stored charge with the sensing amplifier
- Once the sensing amplifier has completed its work, the burst data is delivered through the bus (this part is relatively much faster)
- If we only have a single bank, there is a long latency waiting for the sensing amplifier while the bus sits idle
- If we have multiple banks, we can trigger multiple read accesses to different banks in close succession in an asynchronous manner
- When each bank has been sensed, the delivery of burst data can be tightly packed and fill up the idle time
In general, if the ratio of the cell array access latency to the data transfer time is R, then we need at least R + 1 banks to fully utilize the data transfer bandwidth of the bus (a numeric illustration follows below).
- Note that we need more than R banks because of bank conflicts. Each bank can only serve one data access at a time (even though it stores many data elements), so if multiple threads need data from the same bank, there is a conflict. Having more banks helps reduce the probability of bank conflicts.
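For illustration (with made-up numbers), if the cell array access latency were 40 ns and the burst data transfer time 10 ns, then R = 40 / 10 = 4 and we would need at least 5 banks per channel, so that a new burst is always ready to use the bus as soon as the previous one finishes.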
There is an important connection between occupancy and parallel access to DRAM data. As we saw earlier, maximizing occupancy ensures that we have enough threads on the SMs to hide long memory access latencies. In this section, we see another distinct advantage of maximizing occupancy: having many threads access different memory locations allows us to fully utilize the data transfer bandwidth of the buses, provided the memory accesses are well distributed across channels and banks.
In practice, data is distributed across channels and banks in an interleaved manner. Suppose we have a toy example:
- 4 channels
- 2 banks per channel
- Each bank stores 2 consecutive floats
Now suppose we are doing a tiled matrix multiplication between two 4 x 4 matrices M @ N, where we set the tile width to 2.
The storage of M is like so (referring to the linearized index):
- Channel 0
  - Bank 0: M[0], M[1]
  - Bank 1: M[8], M[9]
- Channel 1
  - Bank 0: M[2], M[3]
  - Bank 1: M[10], M[11]
- Channel 2
  - Bank 0: M[4], M[5]
  - Bank 1: M[12], M[13]
- Channel 3
  - Bank 0: M[6], M[7]
  - Bank 1: M[14], M[15]
As we can see, consecutive elements of our data array are first placed into bank 0 of all the channels, then into bank 1 of all the channels. If there is more data, we wrap back around to bank 0 again. This breadth-first distribution across channels helps ensure that even when the amount of data is small, it is well spread across channels and banks.
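The mapping implied by this toy layout can be written down explicitly. The following sketch is only an illustration of the interleaving described above, not an actual hardware API:

// Toy interleaving: 4 channels, 2 banks per channel, 2 consecutive floats per bank.
// Returns the channel that element i of the linearized array lands in, and writes
// the bank via *bank, under the layout shown above.
int channelOf(int i, int* bank) {
    int group = i / 2;        // which 2-float group the element belongs to
    *bank = (group / 4) % 2;  // groups fill bank 0 of every channel first, then bank 1
    return group % 4;         // rotating across the 4 channels
}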
Now for the matrix multiplication, we can verify that in phase 0 of the tiled matrix multiplication, these are the elements of M requested by each block:
- Block 0,0: M[0], M[1], M[4], M[5]
- Block 0,1: M[0], M[1], M[4], M[5]
- Block 1,0: M[8], M[9], M[12], M[13]
- Block 1,1: M[8], M[9], M[12], M[13]
Deriving these is a bit tedious, so we will not go through every block (one example follows below). The key observation is that in phase 0 we access both banks of channels 0 and 2, so we are utilizing 50% of the channels. With larger matrices, we can fully utilize all channels and banks.
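For instance, block (0,0) computes the P tile covering rows 0-1 and columns 0-1, so in phase 0 it loads the M tile covering rows 0-1 and columns 0-1 of M, which are linearized indices 0, 1, 4 and 5 when Width = 4.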
Thread Coarsening
So far, we have focused on kernels where work is parallelized across threads at the finest granularity. For example, in the matrix multiplication example, each thread is assigned one element of the output matrix P. The advantage of parallelizing at the finest granularity is that it enhances transparent scalability: if the hardware has enough resources to run all the work in parallel, we have exposed enough parallelism to exploit it fully.
However, if the hardware does not have enough resources to run all the work at once, some of it is queued up and executed sequentially. In this case, there is a price to be paid for such aggressive parallelism, such as:
- redundant loading of the same data by different blocks
- redundant work
- synchronization overhead
When the threads are executed in parallel, this price is often well worth it. But if the hardware ends up serializing the work as a result of insufficient resources, this price may have been paid unnecessarily. It may thus be worthwhile to reduce the amount of parallelism by having each thread do more work, in exchange for reducing the price that we pay.
One such technique is called thread coarsening. We can see a good example in the tiled matrix multiplication kernel. For calculating a given tile of the output matrix P, each thread block must load its own copy of the input tiles of matrix M. This means that the same tile of input matrix M will be loaded redundantly by different thread blocks, each incurring its own global memory accesses.
More concretely, consider that we are computing rows of P from P[i, :] to P[j, :].
- One thread block may be computing P[i, k1] to P[j, k2]
- Another thread block may be computing P[i, k2] to P[j, k3]
- But notice that to do these computations, both blocks have to load M[i, k1] to M[j, k2] in one of their phases
- It may make more sense for a single thread block to compute a few adjacent P tiles at once, and thus load this slice of matrix M only once, reducing redundant memory accesses
Here we can dive into the thread coarsening matrix multiplication kernel:
#define TILE_WIDTH 32
#define COARSE_FACTOR 4
__global__ void matrixMulKernel(float* M, float* N, float* P, int width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int row = by * TILE_WIDTH + ty;
    // Each thread computes COARSE_FACTOR output elements, starting at colStart
    // and striding by TILE_WIDTH.
    int colStart = bx * TILE_WIDTH * COARSE_FACTOR + tx;

    float Pvalue[COARSE_FACTOR];
    for (int c = 0; c < COARSE_FACTOR; ++c) {
        Pvalue[c] = 0.0f;
    }

    for (int ph = 0; ph < width/TILE_WIDTH; ++ph) {
        // Load one M tile per phase; it is reused for all COARSE_FACTOR N tiles below.
        Mds[ty][tx] = M[row*width + ph*TILE_WIDTH + tx];
        for (int c = 0; c < COARSE_FACTOR; ++c) {
            int col = colStart + c * TILE_WIDTH;
            // Load the N tile for this coarsening step.
            Nds[ty][tx] = N[(ph * TILE_WIDTH + ty) * width + col];
            __syncthreads();
            for (int k = 0; k < TILE_WIDTH; ++k) {
                Pvalue[c] += Mds[ty][k] * Nds[k][tx];
            }
            __syncthreads();
        }
    }

    for (int c = 0; c < COARSE_FACTOR; ++c) {
        int col = colStart + c * TILE_WIDTH;
        P[row * width + col] = Pvalue[c];
    }
}
Let's unpack the code above. Instead of each thread block processing one tile of the output matrix P, one thread block now processes COARSE_FACTOR = 4 adjacent tiles of P.
- Therefore, each thread now computes 4 elements of P instead of 1
First we identify row and colStart:
- row is the row of P that we are computing, as per the normal tiled kernel
- colStart is the first column that this thread computes. Because each thread block now covers COARSE_FACTOR tiles horizontally, we multiply by COARSE_FACTOR to find where this block's columns start. It is a "start" because each thread computes 4 elements, and this variable indicates the first of them (a numeric example follows below)
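For example, with TILE_WIDTH = 32 and COARSE_FACTOR = 4, the block with bx = 1 has colStart = 1 * 32 * 4 + tx = 128 + tx, and the thread with tx = 0 in that block computes columns 128, 160, 192 and 224 of P.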
Initialize Pvalue: instead of one float, we have an array of 4 floats.
Now the main loop:
- The outer loop is over phases, as before. Each phase loads a new tile of M (moving rightwards) and a new tile of N (moving downwards) to compute the same P tile in a tiled manner.
- Notice that we load a tile of M for this block that will be reused across the iterations of the coarsening loop over c.
  - The way M is loaded is as per normal: we locate the correct row, and the column is determined by ph * TILE_WIDTH + tx
- The coarsening loop is where things get interesting:
  - For a given value of c, we are working on one tile of N
  - Each iteration of the loop moves us a stride of TILE_WIDTH to the right, as reflected by the way col is computed as colStart + c * TILE_WIDTH
  - e.g. a given thread may compute the columns col = 0, 32, 64, 96 rather than four adjacent columns
  - Once col has been computed to account for coarsening, the indexing into N is as per previous
So we see that for the same M tile that is loaded, we reuse it COARSE_FACTOR times for different N tiles. Thread coarsening is a common technique that can lead to substantial performance improvements.
Pitfalls
There are some potential pitfalls for thread coarsening:
- Applying the optimization when it is unnecessary. Thread coarsening is only beneficial when a price is being paid for parallelization, such as redundant data loading. It will not help when no such price is being paid.
- Applying too much coarsening, such that hardware resources become underutilized. Coarsening reduces the amount of parallelism exposed, which can lead to underutilisation.
- The coarsened kernel may also use more resources, which can hurt occupancy - for example, more registers per thread or more shared memory per thread block.
Checklist of optimizations
- Maximizing occupancy
  - Idea: more parallel memory accesses to hide DRAM latency
  - Strategy: tune parameters such as threads per block, shared memory per block and registers per thread to maximize occupancy
- Enabling coalesced global memory accesses
  - Idea: less global memory traffic and better utilization of DRAM bursts
  - Strategies:
    - Transfer from global to shared memory in a coalesced manner and perform uncoalesced accesses in shared memory
    - Rearrange the mapping of threads to data
    - Rearrange the layout of the data
- Minimizing control divergence
  - Idea: higher SIMD efficiency and fewer idle cores during SIMD execution
  - Strategies:
    - Rearrange the mapping of threads to work and/or data
    - Rearrange the layout of the data
- Tiling of reused data
  - Idea: less global memory traffic by reusing tiles
  - Strategy:
    - Place reused data in shared memory or registers so that it is read from global memory only once
- Privatisation
  - Idea: less contention and serialization of atomic updates to a shared copy of the data
  - Strategy:
    - Apply partial updates to a private copy of the data and then update the universal copy when done (a minimal sketch follows this checklist)
- Thread coarsening
  - Idea: less redundant work, divergence and global memory traffic
  - Strategy:
    - Assign multiple units of parallelism to each thread to reduce the price of parallelism when it would otherwise be paid unnecessarily
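Privatisation is not covered in detail in this chapter, so purely as an illustration of the checklist item above, here is a minimal sketch of a histogram kernel that accumulates counts in a per-block private copy in shared memory before merging into the global histogram. The kernel name, the 256-bin size and the parameter names are assumptions made for this sketch.

__global__ void histPrivatized(const unsigned char* data, unsigned int n,
                               unsigned int* histo) {
    __shared__ unsigned int histo_s[256];   // private, per-block copy of the histogram
    for (unsigned int b = threadIdx.x; b < 256; b += blockDim.x)
        histo_s[b] = 0u;
    __syncthreads();

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&histo_s[data[i]], 1u);   // contention only among threads of this block
    __syncthreads();

    for (unsigned int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&histo[b], histo_s[b]);   // one atomic merge per bin per block
}

Atomic updates within the block go to fast shared memory, and the slow, contended atomic updates to the global copy are reduced to one per bin per block.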
Exercises
- Write a matrix multiplication kernel function that corresponds to the design illustrated in Fig. 6.4.
Fig 6.4 is the corner turning example. Suppose we have matrices M x N = P. For simplicity, we assume square matrices of Width.
The access to M remains the same as normal tiled matrix multiplication, but we want to change the way N is loaded into shared memory, since the assumption is that N is stored in column major format.
Doing a recap: each block computes a tile of P. Let T denote the TILE_WIDTH. The rows and columns are determined by the current values of by and bx respectively:
- The first row of the block is by * T
- The first column of the block is bx * T
- ty and tx give us the coordinates of the exact element within the block
Now let's focus on N. Here we need to distinguish between the logical indexing into N (call it N_logical) and the indexing into N based on how it is stored (column major, call it N_stored). We start with the logical view:
- Each phase of the tiled matmul moves us downwards on N
- So we start at row ph * T and work downwards
- The columns are fixed as bx * T to (bx + 1) * T - 1
Since N is stored in column major, we need to swap the indexing when referring to N_stored. So:
- We start at row bx * T (denoted N_rowstart in the code)
- We start at column ph * T (denoted N_colstart in the code)
Now we have found the correct coordinates of the tile within N. This is where it gets tricky: we want to read the elements of this tile in a coalesced way, such that incrementing tx by 1 reads the next element in the linearized array.
- So the first element is at linearized index N_rowstart * Width + N_colstart, corresponding to N_logical[ph*T, bx*T]
- The second element is at N_rowstart * Width + N_colstart + 1, corresponding to N_logical[ph*T + 1, bx*T], and so on
- This means that we will be reading in the transpose of this N tile, and we need to transpose it again when storing into Nds so that we get the correct tile when doing the final multiplication
Hopefully this analysis sheds some light on the code below.
#define TILE_WIDTH 16

__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify P[Row, Col] to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {
        // Load M, N into shared memory
        Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
        // Note: rowstart, colstart refer to N_stored (N is column major)
        int N_rowstart = bx * TILE_WIDTH;
        int N_colstart = ph * TILE_WIDTH;
        Nds[tx][ty] = N[(N_rowstart + ty)*Width + N_colstart + tx];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            // We index Nds as per usual since it is now row major
            Pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();
    }
    P[Row*Width + Col] = Pvalue;
}
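As a usage note, here is a minimal sketch of how such a kernel might be launched, assuming Width is a multiple of TILE_WIDTH and that d_M, d_N and d_P are hypothetical device pointers that have already been allocated and populated:

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);  // one block per output tile of P
matrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);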
- For tiled matrix multiplication, of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid uncoalesced accesses to global memory? (You need to consider only square blocks.)
Not too sure about this one. Since a warp has 32 threads, and we read 32 elements simultaneously, it is good if block_size is a multiple of 32, so that we get coalesced reads of 32 elements at once.
- Consider the following CUDA kernel:
__global__ void foo_kernel(float* a, float* b, float* c, float* d, float* e) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ float a_s[256];
    __shared__ float bc_s[4*256];
    a_s[threadIdx.x] = a[i];
    for(unsigned int j = 0; j < 4; ++j) {
        bc_s[j*256 + threadIdx.x] = b[j*blockDim.x*gridDim.x + i] + c[i*4 + j];
    }
    __syncthreads();
    d[i + 8] = a_s[threadIdx.x];
    e[i*8] = bc_s[threadIdx.x*4];
}
For each of the following memory accesses, specify whether they are coalesced, uncoalesced, or whether coalescing is not applicable (the line numbers count the kernel's lines from 01 at the signature, so line 05 is the a_s[threadIdx.x] = a[i] statement, line 07 is the bc_s store inside the loop, and lines 10 and 11 are the writes to d and e):
a. The access to array a of line 05
Coalesced because increasing tx by 1 takes the next element of a
b. The access to array a_s of line 05
Not applicable, a_s is shared memory.
c. The access to array b of line 07
Coalesced because increasing tx by 1 will take the next element of b
d. The access to array c of line 07
Not coalesced because increasing tx by 1 will increase the index into c by 4.
e. The access to array bc_s of line 07
Not applicable, bc_s is shared memory.
f. The access to array a_s of line 10
Not applicable, a_s is shared memory.
g. The access to array d of line 10
Coalesced.
h. The access to array bc_s of line 11
Not applicable.
i. The access to array e of line 11
Not coalesced, increasing tx will increment access to e by 8.
- What is the floating point to global memory access ratio (in OP/B) of each of the following matrix-matrix multiplication kernels?
a. The simple kernel described in Chapter 3, Multidimensional Grids and Data, without any optimizations applied.
In the simple matmul, each thread accesses Width elements of M and Width elements of N, and performs one multiplication and one addition per loop iteration.
- 2 x Width global memory accesses per thread
- 2 x Width OPs per thread
- OP/B = 2 x Width / (2 x Width x 4) = 1/4 OP/B (because each float is 4 bytes)
b. The kernel described in Chapter 5, Memory Architecture and Data Locality, with shared memory tiling applied using a tile size of 32 x 32.
For tiled matmul (see tiled matmul), each thread still does 2 x Width number of OPs. But the number of global memory accesses is now 2 x Width / TILE_WIDTH.
OP / B = 2 x Width / (2 x Width x 4 / 32) = 8 OP/B
c. The kernel described in this chapter with shared memory tiling applied using a tile size of 32 x 32 and thread coarsening applied using a coarsening factor of 4.
Thread coarsening means that we reuse the same M tile for computing 4 output tiles:
- Width / TILE_WIDTH x COARSE_FACTOR accesses to N per thread
- Width / TILE_WIDTH accesses to M per thread
- 2 x Width x COARSE_FACTOR FLOPs per thread
- Per output element, the numerator is therefore 2 x Width FLOPs and the denominator is 4 x (Width / (TILE_WIDTH x COARSE_FACTOR) + Width / TILE_WIDTH) bytes
- OP/B = (2/4) / (1/128 + 1/32) = 12.8 OP/B