03 - Multidimensional grids

In the previous chapter we considered a simple 1D kernel. In general, a grid is a three-dimensional array of blocks, and each block is a 3D array of threads. We can specify them like so:

dim3 dimGrid(32, 1, 1);
dim3 dimBlock(128, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...)

This specifies a grid with 32 blocks, each containing 128 threads, so the total number of threads in the grid is 32 * 128 = 4,096.

Or we can let the variables be a function of something else, e.g. if n denotes the size of the vector:

dim3 dimGrid(ceil(n/256.0), 1, 1);
dim3 dimBlock(256, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>(...)

This fixes the number of threads per block at 256 and the number of blocks varies with the size of the vector.

For convenience, if we are specifying 1D grids and blocks, we can just use a single number to represent it.
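
For example, the vecAddKernel launch above can be written with scalar configuration parameters (CUDA treats a scalar value as a dim3 with the y and z fields set to 1):

vecAddKernel<<<ceil(n/256.0), 256>>>(...)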

In CUDA C, the allowable ranges of values are:

  • gridDim.x varies from 1 to 2^31 - 1
  • gridDim.y and gridDim.z vary from 1 to 2^16 - 1 (65,535)
  • Consequently, blockIdx.x ranges from 0 to gridDim.x - 1, blockIdx.y from 0 to gridDim.y - 1, and so on

The total size of a block in current CUDA systems is limited to 1024 threads. These threads can be distributed across three dimensions in any way as long as the total number of threads does not exceed 1024.
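
For example (the variable names are just illustrative):

dim3 blockA(32, 8, 4);   // 32 * 8 * 4 = 1024 threads: allowed
dim3 blockB(512, 2, 1);  // 512 * 2 * 1 = 1024 threads: allowed
dim3 blockC(64, 32, 2);  // 64 * 32 * 2 = 4096 threads: exceeds the limit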

The dimensionality of the grid (the arrangement of blocks) and the dimensionality of each block (the arrangement of threads) need not match; for example, a 2D grid can contain 1D blocks.

Sometimes, we need to refer to the index of a specific block in the grid or a specific thread in a block. The idiomatic way is:

  • Block: (blockIdx.z, blockIdx.y, blockIdx.x)
  • Thread: (threadIdx.z, threadIdx.y, threadIdx.x)

Notice that the order of the index is reverse to the order of specifying the dimGrid or dimBlock. There is some reason for this, although it is not clear to me right now.

Mapping Threads to Multidimensional Data

The choice of 1D, 2D or 3D thread organization is usually based on the nature of the data. For example, pictures are 2D and so it is conventional to choose 2D arrangements.

Suppose we have a picture with 62 x 76 pixels, i.e. 62 in the y dimension and 76 in the x dimension. Assume we use a 16 x 16 block, with 16 threads in the x, y dimensions respectively. We will thus need 4 blocks in the y direction, and 5 blocks in the x direction. So for each thread:

  • The vertical (row) coordinate is: row = blockIdx.y * blockDim.y + threadIdx.y
  • The horizontal (column) coordinate is: col = blockIdx.x * blockDim.x + threadIdx.x
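
For example, with 16 x 16 blocks, the thread with blockIdx.y = 1, threadIdx.y = 3, blockIdx.x = 2, threadIdx.x = 5 covers row = 1 * 16 + 3 = 19 and col = 2 * 16 + 5 = 37 of the 62 x 76 picture.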

What does the kernel call look like? Suppose we have:

  • n represents number of pixels in the y direction
  • m represents the number of pixels in the x direction
  • The input picture data has been copied to device memory at pointer Pin_d, and Pout_d points to device memory allocated for the output

Then the host code looks like:

dim3 dimGrid(ceil(m/16.0), ceil(n/16.0), 1);
dim3 dimBlock(16, 16, 1);
colorToGrayscaleConversion<<<dimGrid, dimBlock>>>(Pout_d, Pin_d, m, n);

Memory Layout

Given that we have computed the row and col coordinates for each thread, ideally we would access the elements of Pin_d as Pin_d[row][col]. Unfortunately, C does not allow this: we can only use 2D indexing syntax when the array dimensions are known at compile time.

Memory Space. A memory space is a simplified view of how a processor accesses its memory, and one is usually associated with each running application. Each location holds one byte and has an address. Variables that need multiple bytes are stored in consecutive byte locations.

Since there is only one address for each location, we say that the memory space has a flat layout. Thus, all multidimensional arrays are ultimately flattened into equivalent one-dimensional arrays. When a C programmer uses multidimensional syntax to access an element, the compiler translates these accesses into offsets from a 1D base pointer.

Hence, the CUDA programmer needs to explicitly linearize the 2D accesses into a 1D access. There are two ways to do this:

  • row-major layout: rows are placed one after another consecutively in memory space
  • column-major layout: columns are placed one after another consecutively in memory space

CUDA C uses the row-major layout; an example of the column-major layout is FORTRAN. We will thus focus on the row major layout.

Let us denote the element of interest as M[j][i], i.e. the element with y coordinate j and x coordinate i. Suppose M is 4 x 4 and linearized into a 16-element 1D array. In the flattened row-major layout, we access M[j][i] using the index:

j * 4 + i 
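
A quick sanity check of this in plain host-side C (the array M and pointer flat are hypothetical, just for illustration):

#include <stdio.h>

int main(void) {
    int M[4][4];
    int *flat = (int *) M;          // the same 16 ints, viewed as a flat 1D array
    M[2][3] = 42;                   // y coordinate j = 2, x coordinate i = 3
    printf("%d\n", flat[2*4 + 3]);  // prints 42: row-major index j*4 + i
    return 0;
}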

Now we'll tackle the 2D version of colorToGrayscaleConversion.

#define CHANNELS 3 // the input image has 3 consecutive unsigned chars (r, g, b) per pixel

__global__
void colorToGrayscaleConversion(unsigned char * Pout, 
            unsigned char * Pin, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int grayOffset = row*width + col;
        int rgbOffset = grayOffset * CHANNELS;
        unsigned char r = Pin[rgbOffset];
        unsigned char g = Pin[rgbOffset + 1];
        unsigned char b = Pin[rgbOffset + 2];
        Pout[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
    }
}

Notes:

  • Note that the input image is encoded as unsigned chars, since they encode a value from 0-255
    • Each pixel is represented as 3 consecutive chars representing the 3 channels
    • Thus pixel i starts at the char located at i*3 position
  • We first compute grayOffset, which corresponds to the location for the pixel we are processing
  • rgbOffset is 3x grayOffset because each pixel has 3 unsigned chars in the input image
  • After extracting the r, g, b values from the input image, we compute and store the gray intensity into Pout using the linearized 1D access grayOffset

3D arrays

We can easily extend the discussion of 2D arrays to 3D. The idea is that the x and y coordinates form a plane, and we place each "plane" one after another in the address space (see the sketch after this list).

  • This means that we track a new plane index:
    int plane = blockIdx.z * blockDim.z + threadIdx.z;

  • And access the element with the 1D index:
    P[plane*m*n + row*m + col]
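
A minimal sketch of a kernel computing these indices, assuming a tensor of width m, height n, and depth d stored at pointer P (the kernel name, element type, and depth parameter are illustrative, not from the text):

__global__
void scaleTensorKernel(float* P, int m, int n, int d) {
    int col   = blockIdx.x * blockDim.x + threadIdx.x;  // x coordinate
    int row   = blockIdx.y * blockDim.y + threadIdx.y;  // y coordinate
    int plane = blockIdx.z * blockDim.z + threadIdx.z;  // z coordinate
    if (col < m && row < n && plane < d) {
        // linearized row-major access: plane by plane, then row by row
        P[plane*m*n + row*m + col] *= 2.0f;
    }
}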
    

Image Blur

Image blurring is closer to a real-world example. The idea of blurring is to smooth out abrupt variations in pixel values. Mathematically, it is just computing the value of an output pixel as a weighted sum of a patch of pixels surrounding the pixel in the input image. In our simple example, we will take the simple average value of the N x N patch of pixels surrounding our target pixel.

Similar to before, each thread will handle one output pixel. Thus much of the code will resemble that in the grayscale example.

#define BLUR_SIZE 1 // patch "radius": e.g. 1 gives a 3x3 patch (value chosen for illustration)

__global__
void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;
        for (int blurRow=-BLUR_SIZE; blurRow<BLUR_SIZE+1; ++blurRow) {
            for (int blurCol=-BLUR_SIZE; blurCol<BLUR_SIZE+1; ++blurCol) {
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                if (curRow>=0 && curRow<h && curCol>=0 && curCol<w) {
                    pixVal += in[curRow*w + curCol];
                    ++pixels;
                }
            }
        }
        out[row*w + col] = (unsigned char) (pixVal/pixels);
    }
}

Code walkthrough

  • in and out point to the input and output images, which have the same size
  • The row and col indices are derived as per normal from blockIdx and threadIdx
  • pixVal stores the accumulated intensity, pixels stores the number of valid pixels
    • e.g. near the edge, pixels might not represent the full square patch
  • At each pixel location in[row, col], we create a nested for loop to look up a square patch around the pixel
    • e.g. BLUR_SIZE=1 will walk through a 3x3 patch around the centre pixel
    • Notice that instead of a 2D index, we need to linearize into row-major form, i.e. in[curRow*w + curCol]
    • We accumulate intensity values into pixVal and count pixels walked in pixels
    • Finally we assign the average intensity into the corresponding location in out
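
As a rough sketch, the host-side launch mirrors the grayscale example (the device pointer names in_d and out_d are assumptions):

dim3 dimGrid(ceil(w/16.0), ceil(h/16.0), 1);
dim3 dimBlock(16, 16, 1);
blurKernel<<<dimGrid, dimBlock>>>(in_d, out_d, w, h);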

Matrix Multiplication

Matrix multiplication is an important component of Basic Linear Algebra Subprograms (BLAS). Matrix multiplication between an i x j matrix M and a j x k matrix N produces an i x k matrix P. For simplicity, let's consider the case where both M and N are square with dimension Width.

__global__ void MatrixMulKernel (float* M, float* N, float* P, int Width) {
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((row < Width) && (col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[row*Width + k] * N[k*Width + col];
        }
        P[row*Width + col] = Pvalue;
    }
}

Code walkthrough:

  • row and col are derived as per usual
    • For one position in the output matrix P[row, col]
    • We dot product M[row, :] with N[:, col]
  • The inner index k loops through the columns of M and the rows of N to compute the dot product
  • Note the row-major linearized accesses to M, N and P

Note that a given block of threads will process some contiguous rows of M and some contiguous columns of N. This means that, visually, each block produces a tile of elements in the output matrix P.
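
For completeness, a plausible launch configuration for the square case (the device pointer names M_d, N_d and P_d are assumed):

dim3 dimGrid(ceil(Width/16.0), ceil(Width/16.0), 1);
dim3 dimBlock(16, 16, 1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(M_d, N_d, P_d, Width);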

Exercises

  1. In this chapter we implemented a matrix multiplication kernel that has each thread produce one output matrix element. In this question, you will implement different matrix-matrix multiplication kernels and compare them.

a. Write a kernel that has each thread produce one output matrix row. Fill in the execution configuration parameters for the design.

Suppose we have input matrices M and N which are both square with width Width. Since each thread does a whole row, we just need to use one row of M per thread but work through the entire N matrix.

The configuration parameters would be MatrixMulKernelRowWise<<<ceil(Width / (float) n_threads), n_threads>>>, since we just need one thread per row of M.

__global__ void MatrixMulKernelRowWise (float* M, float* N, float* P, int Width) {
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < Width) {
        for (int col = 0; col < Width; ++col) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[row*Width + k] * N[k*Width + col];
            }
            P[row*Width + col] = Pvalue;
        }
    }
}

b. Write a kernel that has each thread produce one output matrix column. Fill in the execution configuration parameters for the design.

This is very similar, except that we work through the whole M matrix and only use a single column from N.

The configuration parameters would be MatrixMulKernelColWise<<<ceil(Width / (float) n_threads), n_threads>>>, since we just need one thread per column of N.

__global__ void MatrixMulKernelColWise (float* M, float* N, float* P, int Width) {
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if (col < Width) {
        for (int row = 0; row < Width; ++row) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[row*Width + k] * N[k*Width + col];
            }
            P[row*Width + col] = Pvalue;
        }
    }
}

c. Analyze the pros and cons of each of the two kernel designs.

Not very sure about this, but my thinking is that the row-wise version is more efficient. Note that access to N is strided in both cases, so the difference is in how we access M.

Since each thread processes one row of M, we get the following:

  • thread 1 accesses M[0, 0]. Increment k and we access M[0, 1], which is the next element in memory layout.
  • thread 2 accesses M[1, 0]. Increment k and we access M[1, 1], which is the next element in memory layout

So as we increment k, each thread can benefit from its local cache and read consecutive elements efficiently.

In contrast, for col-wise:

  • We start at row 0, and all threads access M[0, 0]. The access is efficient as we increment k.
  • But for each subsequent row, we need to read a row of M from global memory again
  • So we incur many more global memory accesses than the row-wise approach

  2. A matrix-vector multiplication takes an input matrix B and a vector C and produces an output vector A. Each element of the output vector A is the dot product of one row of the input matrix B with C, that is, A[i] = Σ_j B[i][j] * C[j]. For simplicity we will handle only square matrices whose elements are single-precision floating-point numbers. Write a matrix-vector multiplication kernel and the host stub function that can be called with four parameters: pointer to the output matrix, pointer to the input matrix, pointer to the input vector, and the number of elements in each dimension. Use one thread to calculate an output vector element.

Assuming that B has dimensions h x w and thus C has dimension w. Each thread processes one row of B and the entire C vector.

__global__ void MatrixVectorMulKernel(float* A, float* B, float* C, int w, int h) {
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    if (row < h) {
        float Pvalue = 0;
        for (int col = 0; col < w; ++col) {
            Pvalue += B[row*w + col] * C[col];
        }
        A[row] = Pvalue;
    }
}
void matrixVectorMul(float* A, float* B, float* C, int w, int h) {
    int sizeA = h * sizeof(float);
    int sizeB = h * w * sizeof(float);
    int sizeC = w * sizeof(float);
    float *A_d, *B_d, *C_d;

    cudaMalloc((void **) &A_d, sizeA);
    cudaMalloc((void **) &B_d, sizeB);
    cudaMalloc((void **) &C_d, sizeC);

    cudaMemcpy(B_d, B, sizeB, cudaMemcpyHostToDevice);
    cudaMemcpy(C_d, C, sizeC, cudaMemcpyHostToDevice);
    
    int block_dim = 256;
    int grid_dim = (h + block_dim - 1) / block_dim;
    MatrixVectorMulKernel<<<grid_dim, block_dim>>>(A_d, B_d, C_d, w, h);

    cudaMemcpy(A, A_d, sizeA, cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}

  3. Consider the following CUDA kernel and the corresponding host function that calls it:
__global__ void foo_kernel(float* a, float* b, unsigned int M, unsigned int N) {
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        b[row * N + col] = a[row * N + col] / 2.1f + 4.8f;
    }
}

void foo(float* a_d, float* b_d) {
    unsigned int M = 150;
    unsigned int N = 300;
    dim3 bd(16, 32);
    dim3 gd((N - 1) / 16 + 1, (M - 1) / 32 + 1);
    foo_kernel<<<gd, bd>>>(a_d, b_d, M, N);
}

a. What is the number of threads per block?

512

b. What is the number of threads in the grid?

We have gd(19, 5), so 95 blocks per grid. 95 x 512 = 48,640 threads per grid.

c. What is the number of blocks in the grid?

95

d. What is the number of threads that execute the code on line 05 (the assignment to b[row * N + col])?

150 x 300 = 45,000

  4. Consider a 2D matrix with a width of 400 and a height of 500. The matrix is stored as a one-dimensional array. Specify the array index of the matrix element at row 20 and column 10:

a. If the matrix is stored in row-major order.

Rows are placed one after another in memory, so the index is 20 * 400 + 10 = 8,010.

b. If the matrix is stored in column-major order.

Columns are placed one after another in memory, so the index is 10 * 500 + 20 = 5,020.

  5. Consider a 3D tensor with a width of 400, a height of 500, and a depth of 300. The tensor is stored as a one-dimensional array in row-major order. Specify the array index of the tensor element at x=10, y=20, and z=5.

In row-major format, for an array A[z, y, x], the rightmost index changes fastest. Suppose we have 10 in each dimension. Then we go:

  • A[0, 0, 0] to A[0, 0, 1] etc.
  • Then A[0, 0, 9] to A[0, 1, 0] when the y dimension changes
  • Then A[0, 9, 9] to A[1, 0, 0] when the z dimension changes

So to locate A[5, 20, 10]:

  • First we find the start of the relevant plane 5 * 500 * 400
  • Then within this plane, we find the start of the relevant row 20 * 400
  • Then within this row, we find the element 10

So the answer is the sum of all these: 5 * 500 * 400 + 20 * 400 + 10 = 1,000,000 + 8,000 + 10 = 1,008,010.