Data Parallel Computing

When applications run slowly, the issue is usually that there is too much data to process. In many cases, this data can be split up and dealt with independently. Such independent evaluation of different pieces of data is called data parallelism.

  • This is different from task parallelism, which comes from decomposing an application into distinct tasks. For example, we may need to perform both a vector multiplication and a vector addition on the same set of data; the two operations are separate tasks. In general, data parallelism is the main source of scalability for parallel programs.

Let us use color-to-grayscale conversion as a running example. This is a simple data-parallel task, since each pixel can be processed independently. Suppose we have a color image comprising many pixels; each pixel has r, g, b values ranging from 0 to 1 that indicate the intensity of each color channel.

We use a simple weighted grayscale formula, commonly L = 0.21 r + 0.72 g + 0.07 b (the weighting is needed because our eyes perceive the luminosity of different colors differently).
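As a small sketch of why this is data parallel (the function name rgbToGray and the flat r, g, b, gray arrays are illustrative assumptions, not code from this example), the sequential conversion is just a per-pixel loop in which no iteration depends on any other, so each pixel could be handed to its own thread:

// Sequential color-to-grayscale conversion (illustrative sketch).
// r, g, b hold the input channels, gray holds the output, n is the pixel count.
// Each iteration touches only pixel i, so all iterations are independent.
void rgbToGray(const float* r, const float* g, const float* b, float* gray, int n) {
    for (int i = 0; i < n; ++i) {
        gray[i] = 0.21f * r[i] + 0.72f * g[i] + 0.07f * b[i];
    }
}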

CUDA C Program

CUDA C is an extension of the ANSI C programming language with minimal new syntax. The structure of a CUDA C program reflects the coexistence of a host (CPU) and one or more devices (GPUs) in the computer. Each source file can have a mixture of host code and device code. By definition, a traditional C program is a CUDA program that contains only host code. The device code is clearly marked with CUDA C keywords.

When a kernel function is called, a large number of threads are launched on a device to execute the kernel. All the threads that are launched by a kernel call are collectively called a grid. When all threads of a grid have completed, the grid terminates and the execution continues on the host until the next grid is launched.

Launching a grid generates many threads. In the grayscale example, each thread processes one pixel. One can assume that CUDA threads take very few clock cycles to generate and schedule. In contrast, traditional CPU threads can take thousands of clock cycles to generate and schedule.

Threads. A thread is a simplified view of how a processor executes a sequential program. A thread consists of the code of the program, the point in the code that is being executed, and the values of its variables and data structures. The execution of a thread may be viewed as sequential.

Vector Addition

Vector addition is the "Hello World" of parallel programming. Here is a simple sequential vector addition in C:

In CUDA programs, we suffix variables with _h to denote host variables and with _d to denote device variables.

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    for (int i = 0; i < n; ++i) {
        C_h[i] = A_h[i] + B_h[i];
    }
}
int main() {
    // memory allocation etc.
    vecAdd(A, B, C, N);
}

In the vector addition example, we use a simple for loop to do addition. After execution, we can access the modified contents of C to see the results.

A straightforward way to do vector addition in parallel is as follows:

  • Allocate device memory for A, B and C
  • Copy A and B to device memory
  • Call a kernel to launch threads to perform the actual addition
  • Copy C from the device memory back to host
  • Free the device vectors

Device Global Memory

Each device comes with its own dynamic random access memory called device global memory; for example, the NVIDIA Volta V100 comes with 16 GB or 32 GB of global memory. Before calling the addition kernel, we need to allocate space in the device global memory and transfer data from the host memory to the allocated space on the device.

  • cudaMalloc() is used to allocate an object in device global memory
    • Takes in the address of a pointer to the allocated object
    • Takes in the size of the allocated object in bytes
  • cudaFree() is used to free an object from device global memory
    • Takes in the pointer to the object to be freed
float *A_d; // Pointer to a block of floating point numbers
int size = n * sizeof(float); // n * 4 bytes, since a float occupies 4 bytes
cudaMalloc((void**) &A_d, size);
...
cudaFree(A_d);

Note:

  • (void**) casts the address of A_d to a generic pointer-to-pointer; cudaMalloc expects this generic type so that it can write the address of the allocated memory into a pointer of any type
  • &A_d means the address of the pointer A_d:
    • A_d itself is a pointer (on CPU memory) pointing to some spot on GPU memory
    • We need to pass the address of A_d to cudaMalloc so that cudaMalloc can modify the contents of A_d to point at the newly allocated GPU memory location
    • So cudaMalloc expects the address of a pointer
  • cudaFree in contrast doesn't need to modify A_d, it just needs to go to the GPU location and free up the GPU memory there
    • Thus the argument to cudaFree is A_d instead of &A_d
  • Note that A_d is a pointer to device memory, so it cannot be de-referenced in host code
    • Dereferencing using *A_d in host code will lead to errors

After space has been allocated on device, we can transfer data from host to device.

  • cudaMemcpy()
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type / direction of transfer (e.g. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost)

Now our vecAdd function looks like this:

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    cudaMalloc((void **) &A_d, size);
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);

    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // Invoke kernel

    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);

    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}

The above code is quite self-explanatory. This kind of host code that delegates the actual work to a device is called a stub. Note that cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are predefined constants of the CUDA API that specify the direction of the transfer.

Notice that C syntax is a bit confusing - we write float *A_d, *B_d, *C_d with the * attached to the variable, but in function parameters we write void vecAdd(float* A_h, ...). Both ways of writing compile to the same code, and in both cases A_h and A_d are float pointers. The reason the * is attached to the variable name in declarations is to avoid the subtle bug in float* A_d, B_d, which actually declares B_d as a float, not a float pointer!
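A tiny illustration of this pitfall (the variable names below are hypothetical, not part of the vecAdd example):

float* p1, p2;    // pitfall: p1 is a float pointer, but p2 is a plain float
float *q1, *q2;   // both q1 and q2 are float pointers
float* r1, *r2;   // equivalent to the line above; the whitespace around * does not matter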

Kernel Functions and Threading

Since all threads execute the same code, CUDA C programming is an instance of the Single Program, Multiple Data (SPMD) parallel programming style. CUDA launches a grid of threads that are organized into a two-level hierarchy:

  • Each grid is organized as an array of thread blocks, which are also referred to as blocks
  • All blocks are of the same size, and each block can contain up to 1,024 threads

The total number of threads per block is specified by the host code when a kernel is called. For a given grid of threads, the number of threads in a block is available in a variable called blockDim.

  • blockDim is a struct with three unsigned integer fields x, y, z, which may be accessed like blockDim.x
  • The choice of dimensionality for organizing threads usually reflects the dimensionality of the data
  • In general, the number of threads per block is recommended to be a multiple of 32 for hardware reasons (a launch-configuration sketch follows this list)
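As a sketch of how these values are set (the specific sizes here are illustrative assumptions), the host can describe all three dimensions explicitly with the built-in dim3 type; whatever is passed at kernel launch becomes the value of blockDim inside the kernel:

// Host-side launch configuration (sketch); unused dimensions are set to 1.
dim3 dimGrid(32, 1, 1);    // 32 blocks in the x dimension (illustrative)
dim3 dimBlock(256, 1, 1);  // 256 threads per block, a multiple of 32
// Inside a kernel launched with <<<dimGrid, dimBlock>>>,
// blockDim.x == 256, blockDim.y == 1, and blockDim.z == 1.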

CUDA kernels and thus CUDA kernel code have access to two more built-in variables called threadIdx and blockIdx:

  • threadIdx gives each thread a unique coordinate within a block
    • The first thread in a block has value 0 in threadIdx.x variable, second thread has 1 and so on
  • blockIdx gives all threads in a block a common block coordinate
    • All threads in the first block have value 0 in their blockIdx.x variable
    • Second block has value 1 and so on
  • Thus we can combine threadIdx and blockIdx to give a unique global index to each thread

Now we are ready to write kernel code for vector addition:

__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

Notes:

  • We do not need suffix _d since there is no confusion - kernel code only runs on device
  • i is a global index with a different value for each thread
  • blockDim.x is 256 for our example, provided by the CUDA runtime
    • This is specified in host code, when we launch the kernel
  • The keyword __global__ indicates that this function is a kernel
    • In general, there are three function qualifier keywords in CUDA C (illustrated in the sketch after this list)
    • __host__ means callable from host and executed on host (i.e. a normal C function)
    • __device__ means callable from device and executed on device
    • __global__ means callable from host (or from device when dynamic parallelism is used) and executed on device
  • Note that if (i < n) is a guarding condition
    • When we launch a grid, the number of threads is a multiple of the block size
    • The number of threads has to be greater than or equal to n
    • To avoid errors on the excess threads, the conditional statement is used
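As a small illustration of the __device__ and __global__ qualifiers (this is a hypothetical refactoring of the kernel above, not code from the original example), per-element work can live in a __device__ helper that the kernel calls:

// Hypothetical refactoring: the per-element work moves into a __device__ helper.
__device__ float addElements(float a, float b) {
    return a + b;   // executed on the device, callable only from device code
}

__global__
void vecAddKernel2(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        C[i] = addElements(A[i], B[i]);   // a device-to-device function call
    }
}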

Notice that the CUDA kernel function has no for-loop. The reason is that the loop has been replaced by the grid of threads. Each thread in the grid corresponds to one iteration of the for loop. This is sometimes referred to as loop parallelism.

Complete Version of vecAdd

Now we can write the complete version of vecAdd.

void vecAdd(float* A, float* B, float* C, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    cudaMalloc((void **) &A_d, size);
    cudaMalloc((void **) &B_d, size);
    cudaMalloc((void **) &C_d, size);

    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);

    vecAddKernel<<<ceil(n / 256.0), 256>>>(A_d, B_d, C_d, n);

    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);

    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}

Note that we call the kernel function with vecAddKernel and specify execution configuration parameters using the "<<< >>>" demarcators.

  • The first parameter ceil(n / 256.0) gives the number of blocks in the grid
    • Using ceil ensures that we have >= n threads
    • Note that 256.0 is a floating-point literal, so the division is floating-point division rather than integer division (an integer-only alternative is sketched below)
  • The second specifies the number of threads in a block
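A common alternative (typical CUDA practice, though not what this example uses) is to compute the ceiling with integer arithmetic, which avoids floating-point entirely:

// Integer ceiling division: for n > 0, (n + 255) / 256 equals ceil(n / 256.0)
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);

This is the same idiom used in the foo_kernel launch in the exercises below, where the configuration is (N + 128 - 1)/128.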

If we vary n, the number of thread blocks will vary. A small GPU may execute two thread blocks in parallel, while larger GPUs may execute 64 or 128 at a time.

In reality, the sequential way of doing vecAdd will probably be faster than the CUDA way, due to the overheads of allocating memory, data transfer, deallocating memory etc. Real CUDA programs do a lot more computation within the kernel to make the overheads worth it.

Compilation

A traditional C compiler will not work with CUDA code; we need to use NVCC (the NVIDIA CUDA Compiler). NVCC separates the two kinds of code: the host code portions are compiled with a standard C/C++ compiler and run as a traditional CPU process, while the device code is compiled by NVCC into PTX files (a virtual-machine assembly format), which are further compiled by a runtime component of NVCC into real object code that executes on the GPU. A single source file, say vecadd.cu, is typically compiled with a command such as nvcc vecadd.cu -o vecadd.

Exercises

  1. If we want to use each thread in a grid to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to the data index (i)?
i = blockIdx.x * blockDim.x + threadIdx.x;
  2. Assume that we want to use each thread to calculate two adjacent elements of a vector addition. What would be the expression for mapping the thread/block indices to the data index (i) of the first element to be processed by a thread?
i = (blockIdx.x * blockDim.x + threadIdx.x) * 2; 
  3. We want to use each thread to calculate two elements of a vector addition. Each thread block processes 2*blockDim.x consecutive elements that form two sections. All threads in each block will process a section first, each processing one element. They will then all move to the next section, each processing one element. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to data index of the first element?
i = blockIdx.x * blockDim.x * 2 + threadIdx.x;
  4. For a vector addition, assume that the vector length is 8000, each thread calculates one output element, and the thread block size is 1024 threads. The programmer configures the kernel call to have a minimum number of thread blocks to cover all output elements. How many threads will be in the grid?
8192
  5. If we want to allocate an array of v integer elements in the CUDA device global memory, what would be an appropriate expression for the second argument of the cudaMalloc call?
v * sizeof(int)
  6. If we want to allocate an array of n floating-point elements and have a floating-point pointer variable A_d to point to the allocated memory, what would be an appropriate expression for the first argument of the cudaMalloc() call?
(void **) &A_d
  7. If we want to copy 3000 bytes of data from host array A_h (A_h is a pointer to element 0 of the source array) to device array A_d (A_d is a pointer to element 0 of the destination array), what would be an appropriate API call for this data copy in CUDA?
cudaMemcpy(A_d, A_h, 3000, cudaMemcpyHostToDevice)
  8. How would one declare a variable err that can appropriately receive the returned value of a CUDA API call?
cudaError_t err;
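A short sketch of how err might then be used (illustrative; it assumes <stdio.h> and <stdlib.h> are included and reuses A_d and size from the vecAdd example):

cudaError_t err = cudaMalloc((void**) &A_d, size);
if (err != cudaSuccess) {
    // cudaGetErrorString converts the error code into a human-readable message
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}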
  9. Consider the following CUDA kernel and the corresponding host function that calls it:
01 __global__ void foo_kernel(float* a, float* b, unsigned int N){
02     unsigned int i=blockIdx.x*blockDim.x + threadIdx.x;
03     if(i < N) {
04         b[i]=2.7f*a[i] - 4.3f;
05     }
06 }
07 void foo(float* a_d, float* b_d) {
08     unsigned int N=200000;
09     foo_kernel <<< (N + 128 - 1)/128, 128 >>>(a_d, b_d, N);
10 }

a. What is the number of threads per block?

128

b. What is the number of threads in the grid?

1,563 * 128 = 200,064

c. What is the number of blocks in the grid?

1,563 (the result of (200,000 + 127) / 128 with integer division)

d. What is the number of threads that execute the code on line 02?

200,064

e. What is the number of threads that execute the code on line 04?

200,000
  10. A new summer intern was frustrated with CUDA. He has been complaining that CUDA is very tedious. He had to declare many functions that he plans to execute on both the host and the device twice, once as a host function and once as a device function. What is your response?
We can declare a function with both the __host__ and __device__ qualifiers; the compiler will then generate a host version and a device version of the function, so it can be called from either side.

e.g.
__host__ __device__
void myFunction(...) {
    // shared logic usable on both host and device
}