03 - Multidimensional grids
In the previous chapter we considered a simple 1D kernel. In general, a grid is a three-dimensional array of blocks, and each block is a 3D array of threads. We can specify them like so:
dim3 dimGrid(32, 1, 1);
dim3 dimBlock(128, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...)
This specifies a grid with 32 blocks and each block has 128 threads. So the total number of threads in the grid is 4,096.
Or we can let the variables be a function of something else, e.g. if n denotes the size of the vector:
dim3 dimGrid(ceil(n/256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...)
This fixes the number of threads per block at 256 and the number of blocks varies with the size of the vector.
For convenience, if we are specifying 1D grids and blocks, we can just use a single number for each.
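For instance, the same launch could be written without dim3 variables at all. A minimal sketch, assuming n and the device pointers A_d, B_d, C_d from the earlier vector-addition example:
// 1D launch using plain integers instead of dim3:
// ceil(n/256.0) blocks of 256 threads each
vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);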
In CUDA C, the allowable ranges of values are:
- gridDim.x varies from 1 to 2^31 - 1, while gridDim.y and gridDim.z vary from 1 to 2^16 - 1
- Consequently, blockIdx.x ranges from 0 to gridDim.x - 1, and so on for the y and z dimensions
The total size of a block in current CUDA systems is limited to 1024 threads. These threads can be distributed across the three dimensions in any way, as long as the total number of threads does not exceed 1024; for example, (512, 1, 1), (8, 16, 4) and (32, 16, 2) are all allowable block dimensions.
The dimensions of the grid (number of blocks in each direction) and the dimensions of each block (number of threads in each direction) need not match.
Sometimes, we need to refer to the index of a specific block in the grid or a specific thread in a block. The idiomatic way is:
- Block: (blockIdx.z, blockIdx.y, blockIdx.x)
- Thread: (threadIdx.z, threadIdx.y, threadIdx.x)
Notice that the index order is the reverse of the order used when specifying dimGrid or dimBlock. There is some reason for this, although it is not clear to me right now.
Mapping Threads to Multidimensional Data
The choice of 1D, 2D or 3D thread organization is usually based on the nature of the data. For example, pictures are 2D and so it is conventional to choose 2D arrangements.
Suppose we have a picture with 62 x 76 pixels, i.e. 62 in the y dimension and 76 in the x dimension. Assume we use a 16 x 16 block, with 16 threads in the x, y dimensions respectively. We will thus need 4 blocks in the y direction, and 5 blocks in the x direction. So for each thread:
- The vertical (row) coordinate is: row = blockIdx.y * blockDim.y + threadIdx.y
- The horizontal (column) coordinate is: col = blockIdx.x * blockDim.x + threadIdx.x
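For example, with 16 x 16 blocks, the thread with blockIdx.y = 1, threadIdx.y = 3, blockIdx.x = 2, threadIdx.x = 7 covers the pixel at row = 1 * 16 + 3 = 19 and col = 2 * 16 + 7 = 39.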
What does the kernel call look like? Suppose we have:
- n represents the number of pixels in the y direction
- m represents the number of pixels in the x direction
- The picture data has been copied to device memory, stored at pointer Pin_d
Then the host code looks like:
dim3 dimGrid(ceil(m/16.0), ceil(n/16.0), 1);
dim3 dimBlock(16, 16, 1);
colorToGrayScaleConversion<<<dimGrid, dimBlock>>>(Pin_d, Pout_d, m, n);
Memory Layout
Given that we have computed the row and col coordinates for each thread, ideally we would access the elements of Pin_d as Pin_d[row][col]. Unfortunately C does not allow this - we can only access it as a 2D array in this manner if the dimensions of Pin_d are known at compile time.
Memory Space. A memory space is a simplified view of how a processor accesses its memory, and one is usually associated with each running application. Each location holds one byte and has an address. Variables that need multiple bytes are stored in consecutive byte locations.
Since there is only one address for each location, we say that the memory space has a flat layout. Thus, all multidimensional arrays are ultimately flattened into equivalent one-dimensional arrays. When a C programmer uses multidimensional syntax to access an element, the compiler translates these accesses into offsets from a 1D base pointer.
Hence, the CUDA programmer needs to explicitly linearize the 2D accesses into a 1D access. There are two ways to do this:
- row-major layout: rows are placed one after another consecutively in memory space
- column-major layout: columns are placed one after another consecutively in memory space
CUDA C uses the row-major layout; FORTRAN is an example of a language that uses the column-major layout. We will thus focus on the row-major layout.
Let us denote the element of interest as M[j][i], i.e. the element with y coordinate j and x coordinate i. Suppose M is 4 x 4 and linearized into a 16-element 1D array. In the flattened row-major layout, we access this element at offset:
j * 4 + i
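As a minimal sketch (the helper name linearIndex is illustrative, not from the book), the same flattening in code:
// row-major flattening: element at row j, column i of a matrix with `width` columns
int linearIndex(int j, int i, int width) {
    return j * width + i;
}
// e.g. for a 4 x 4 matrix, the element at (j=2, i=3) lives at index 2 * 4 + 3 = 11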
Now we'll tackle the 2D version of colorToGrayscaleConversion.
#define CHANNELS 3  // 3 channels corresponding to RGB

__global__
void colorToGrayscaleConversion(unsigned char *Pout,
                                unsigned char *Pin, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int grayOffset = row*width + col;
        int rgbOffset = grayOffset * CHANNELS;
        unsigned char r = Pin[rgbOffset];      // red value
        unsigned char g = Pin[rgbOffset + 1];  // green value
        unsigned char b = Pin[rgbOffset + 2];  // blue value
        Pout[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
    }
}
Notes:
- The input image is encoded as unsigned chars, since each channel stores a value from 0-255
- Each pixel is represented as 3 consecutive chars for the 3 channels; thus pixel i starts at the char located at position i*3
- We first compute grayOffset, which corresponds to the location of the pixel we are processing
- rgbOffset is 3 x grayOffset because each pixel occupies 3 unsigned chars in the input image
- After extracting the r, g, b values from the input image, we compute and store the gray intensity into Pout using the linearized 1D access grayOffset
3D arrays
We can easily extend the discussion of 2D arrays to 3D. The idea is that the x and y coordinates form a plane, and we place each "plane" one after another into the address space.
- This means that we track a new index: int plane = blockIdx.z * blockDim.z + threadIdx.z
- And access the 1D index as P[plane*m*n + row*m + col], where m and n are the x and y extents (see the sketch below)
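As a sketch of what this looks like in a kernel (the kernel name and the scaling operation are made up for illustration; the convention that m is the x extent and n is the y extent follows the 2D example above):
__global__
void scaleVolumeKernel(float *P, int m, int n, int depth) {
    int col   = blockIdx.x * blockDim.x + threadIdx.x;  // x coordinate
    int row   = blockIdx.y * blockDim.y + threadIdx.y;  // y coordinate
    int plane = blockIdx.z * blockDim.z + threadIdx.z;  // z coordinate
    if (col < m && row < n && plane < depth) {
        P[plane*m*n + row*m + col] *= 2.0f;  // linearized 3D access
    }
}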
Image Blur
Image blurring is closer to a real-world example. The idea of blurring is to smooth out abrupt variations in pixel values. Mathematically, it just computes the value of an output pixel as a weighted sum of a patch of pixels surrounding that pixel in the input image. In our simple example, we will take the simple average of the N x N patch of pixels surrounding our target pixel.
Similar to before, each thread will handle one output pixel. Thus much of the code will resemble that in the grayscale example.
#define BLUR_SIZE 1  // patch "radius"; BLUR_SIZE=1 gives a 3 x 3 patch

__global__
void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;
        // walk the (2*BLUR_SIZE + 1) x (2*BLUR_SIZE + 1) patch around (row, col)
        for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                // only accumulate pixels that fall inside the image
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow*w + curCol];
                    ++pixels;
                }
            }
        }
        out[row*w + col] = (unsigned char) (pixVal / pixels);
    }
}
Code walkthrough
- in and out point to the input and output pixel representations of the same size
- The row and col indices are derived as per normal from blockIdx and threadIdx
- pixVal stores the accumulated intensity, pixels stores the number of valid pixels; near the edge, pixels might not cover the full square patch
- At each pixel location in[row, col], we use a nested for loop to walk a square patch around the pixel; e.g. BLUR_SIZE=1 will walk through a 3x3 patch around the centre pixel
- Notice that instead of a 2D index, we need to linearize into row-major form, i.e. in[curRow*w + curCol]
- We accumulate intensity values into pixVal and count the pixels walked in pixels
- Finally we assign the average intensity to the corresponding location in out
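For completeness, a minimal host-side launch sketch for blurKernel, assuming the input and output images have already been copied to device pointers in_d and out_d (these names are assumptions, not from the original code):
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(ceil(w/16.0), ceil(h/16.0), 1);
blurKernel<<<dimGrid, dimBlock>>>(in_d, out_d, w, h);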
- e.g.
Matrix Multiplication
Matrix multiplication is an important component of the Basic Linear Algebra Subprograms (BLAS). Matrix multiplication between an i x j matrix M and a j x k matrix N produces an i x k matrix P. For simplicity, let's consider the case where both M and N are square with dimension Width.
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((row < Width) && (col < Width)) {
        float Pvalue = 0;
        // dot product of row `row` of M with column `col` of N
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[row*Width + k] * N[k*Width + col];
        }
        P[row*Width + col] = Pvalue;
    }
}
Code walkthrough:
- row and col are derived as per usual
- For one position in the output matrix P[row, col], we take the dot product of M[row, :] with N[:, col]
- The inner index k loops through the columns of M and the rows of N to compute the dot product
- Note the row-major linearized accesses to M, N and P
Note that a given block of threads processes some contiguous rows of M and some contiguous columns of N. Visually, each block therefore produces a tile of elements in the output matrix P.
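For reference, a host-side launch for this kernel could look like the following sketch, assuming the Width x Width matrices have already been copied to device pointers M_d, N_d and P_d (names assumed):
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(ceil(Width/16.0), ceil(Width/16.0), 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(M_d, N_d, P_d, Width);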
Exercises
- In this chapter we implemented a matrix multiplication kernel that has each thread produce one output matrix element. In this question, you will implement different matrix-matrix multiplication kernels and compare them.
a. Write a kernel that has each thread produce one output matrix row. Fill in the execution configuration parameters for the design.
Suppose we have input matrices M and N which are both square with width Width. Since each thread does a whole row, we just need to use one row of M per thread but work through the entire N matrix.
The configuration parameters would be MatrixMulKernelRowWise<<<ceil(Width / (float) n_threads), n_threads>>>, since we just need one thread per row of M.
__global__ void MatrixMulKernelRowWise(float* M, float* N, float* P, int Width) {
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < Width) {
        for (int col = 0; col < Width; ++col) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[row*Width + k] * N[k*Width + col];
            }
            P[row*Width + col] = Pvalue;
        }
    }
}
b. Write a kernel that has each thread produce one output matrix column. Fill in the execution configuration parameters for the design.
This is very similar, except that we work through the whole M matrix and only use a single column from N.
The configuration parameters would be MatrixMulKernelColWise<<<ceil(Width / (float) n_threads), n_threads>>>, since we just need one thread per column of N.
__global__ void MatrixMulKernelColWise(float* M, float* N, float* P, int Width) {
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col < Width) {
        for (int row = 0; row < Width; ++row) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[row*Width + k] * N[k*Width + col];
            }
            P[row*Width + col] = Pvalue;
        }
    }
}
c. Analyze the pros and cons of each of the two kernel designs.
Not very sure about this, but my thinking is that the row-wise version is more efficient. Note that access to N is strided in both cases, so the difference is in how we access M.
Since each thread processes one row of M, we get the following:
- Thread 1 accesses M[0, 0]. Increment k and we access M[0, 1], which is the next element in the memory layout.
- Thread 2 accesses M[1, 0]. Increment k and we access M[1, 1], which is the next element in the memory layout.
So as we increment k, each thread can benefit from its local cache and read consecutive elements efficiently.
In contrast, for the col-wise version:
- We start at row 0, and all threads access M[0, 0]. The access is efficient as we increment k.
- But for each subsequent row, we need to read a row of M from global memory again.
- So we incur many more global memory accesses than in the row-wise approach.
- A matrix-vector multiplication takes an input matrix B and a vector C and produces one output vector A. Each element of the output vector A is the dot product of one row of the input matrix B and C, that is, A[i] = Σ_j B[i][j] * C[j]. For simplicity we will handle only square matrices whose elements are single-precision floating-point numbers. Write a matrix-vector multiplication kernel and the host stub function that can be called with four parameters: pointer to the output matrix, pointer to the input matrix, pointer to the input vector, and the number of elements in each dimension. Use one thread to calculate an output vector element.
Assuming that B has dimensions h x w and thus C has dimension w. Each thread processes one row of B and the entire C vector.
__global__ void MatrixVectorMulKernel(float* A, float* B, float* C, int w, int h) {
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < h) {
        // dot product of row `row` of B with the vector C
        float Pvalue = 0;
        for (int col = 0; col < w; ++col) {
            Pvalue += B[row*w + col] * C[col];
        }
        A[row] = Pvalue;
    }
}
void matrixVectorMul(float* A, float* B, float* C, int w, int h) {
    int sizeA = h * sizeof(float);
    int sizeB = h * w * sizeof(float);
    int sizeC = w * sizeof(float);
    float *A_d, *B_d, *C_d;

    cudaMalloc((void **) &A_d, sizeA);
    cudaMalloc((void **) &B_d, sizeB);
    cudaMalloc((void **) &C_d, sizeC);

    cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
    cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);

    int block_dim = 256;
    int grid_dim = (h + block_dim - 1) / block_dim;
    MatrixVectorMulKernel<<<grid_dim, block_dim>>>(A_d, B_d, C_d, w, h);

    cudaMemcpy(A, A_d, sizeA, cudaMemcpyDeviceToHost);

    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
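A minimal usage sketch of this host stub (the 4 x 4 identity matrix and the sample vector are made up for illustration):
int w = 4, h = 4;
float B[16] = {0};                      // 4 x 4 identity matrix in row-major layout
for (int i = 0; i < 4; ++i) B[i*4 + i] = 1.0f;
float C[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // input vector
float A[4];                             // output vector
matrixVectorMul(A, B, C, w, h);         // A should now equal C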
- Consider the following CUDA kernel and the corresponding host function that calls it:
__global__ void foo_kernel(float* a, float* b, unsigned int M, unsigned int N) {
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        b[row * N + col] = a[row * N + col] / 2.1f + 4.8f;
    }
}

void foo(float* a_d, float* b_d) {
    unsigned int M = 150;
    unsigned int N = 300;
    dim3 bd(16, 32);
    dim3 gd((N - 1) / 16 + 1, (M - 1) / 32 + 1);
    foo_kernel<<<gd, bd>>>(a_d, b_d, M, N);
}
a. What is the number of threads per block?
512
b. What is the number of threads in the grid?
We have gd(19, 5), so 19 x 5 = 95 blocks in the grid. 95 x 512 = 48,640 threads in the grid.
c. What is the number of blocks in the grid?
95
d. What is the number of threads that execute the code on line 05 (the assignment to b)?
150 x 300 = 45,000
- Consider a 2D matrix with a width of 400 and a height of 500. The matrix is stored as a one-dimensional array. Specify the array index of the matrix element at row 20 and column 10:
a. If the matrix is stored in row-major order.
Rows are laid out one after another in memory, so the index is 20 * 400 + 10 = 8010.
b. If the matrix is stored in column-major order.
Columns are laid out one after another in memory, so the index is 10 * 500 + 20 = 5020.
- Consider a 3D tensor with a width of 400, a height of 500, and a depth of 300. The tensor is stored as a one-dimensional array in row-major order. Specify the array index of the tensor element at x=10, y=20, and z=5.
In row-major format, for an array A[z, y, x], the rightmost index changes fastest. Suppose we have 10 elements in each dimension. Then we go:
- A[0, 0, 0] to A[0, 0, 1] etc.
- Then A[0, 0, 9] to A[0, 1, 0] when the y dimension changes
- Then A[0, 9, 9] to A[1, 0, 0] when the z dimension changes
So to locate A[5, 20, 10]:
- First we find the start of the relevant plane: 5 * 500 * 400
- Then within this plane, we find the start of the relevant row: 20 * 400
- Then within this row, we find the element: 10
So the answer is the sum of all these: 1,008,010.
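The same computation as a quick check in code (variable names are just for illustration):
int width = 400, height = 500;
int x = 10, y = 20, z = 5;
int idx = z * (height * width) + y * width + x;  // 1,000,000 + 8,000 + 10 = 1,008,010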