Introduction
Historically, the vast majority of software applications have been written as sequential programs. From 1980 to 2000, advances in CPUs drove performance increases in this paradigm. Since 2003, this has slowed down due to energy-consumption and heat-dissipation issues. Hence, the industry has moved toward placing multiple physical CPUs, referred to as processor cores, in each chip. To benefit from multiple processor cores, there must be multiple instruction sequences that can execute simultaneously on these processor cores.
Hence, application software that will continue to enjoy significant performance improvement will be parallel programs, in which multiple threads of execution cooperate to complete the work faster.
There are two main trajectories for designing microprocessors:
- The multicore trajectory seeks to maintain the execution speed of sequential programs.
- The many-thread trajectory seeks to maximize the execution throughput of parallel programs. The idea is to have a large number of threads (e.g., the NVIDIA Tesla A100 supports tens of thousands of threads).
The many-thread paradigm leads the race in floating-point calculation throughput. As of 2016, the ratio of peak throughput between many-thread GPUs and multicore CPUs was about 10. This performance gap has motivated developers to move the computationally intensive parts of their applications to the GPU for execution.
Fundamentally, the designs of CPUs and GPUs are quite different.
- CPUs are optimized for sequential processing, and much of the chip area is dedicated to the following functions:
  - Sophisticated control logic allows instructions from a single thread to execute in parallel while maintaining the appearance of sequential execution
  - Large cache memories reduce instruction and data access latencies
- GPUs are optimized for throughput. In general, reducing latency is more expensive than increasing throughput, so the design instead increases throughput by running a massive number of threads.
  - Pipelined memory channels and arithmetic units are allowed to have longer latency, which reduces the chip area needed for memory-access hardware and arithmetic units
  - This allows a much higher degree of parallelism than on CPUs
Given the relative strengths of each, one should expect many applications to use both CPUs and GPUs. This is why CUDA is designed to support joint CPU-GPU execution of an application.
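As a minimal sketch of what joint CPU-GPU execution looks like in CUDA (the `vecAdd` kernel, array names, and sizes below are illustrative, not taken from the text): the host (CPU) prepares the data, the device (GPU) runs the parallel portion, and the result is copied back for further sequential processing.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: each GPU thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) side: allocate and initialize input data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device (GPU) side: allocate global memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough threads to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back to the CPU and continue sequential work there.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```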
Architecture of a GPU
A GPU is organized as an array of highly threaded streaming multiprocessors (SMs).
- Each SM has a number of streaming processors (SPs) that share control logic and an instruction cache
- Each GPU comes with high-bandwidth (and also higher-latency) global memory, which we will refer to as DRAM. In NVIDIA's recent Pascal architecture, this is referred to as High Bandwidth Memory (HBM).
There is also a CPU-GPU and GPU-GPU interconnect for data transfers. In the Pascal family, this is called NVLINK, which allows transfers of up to 40 GB/s per channel. As the size of GPU memory grows, applications increasingly keep their data in global memory instead of relying on the CPU-GPU connection.
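As a minimal sketch of how this organization is visible from software (the fields below are standard `cudaDeviceProp` members; the actual values depend on the installed GPU), the CUDA runtime reports the number of SMs and the size of global memory:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Each SM is reported as a "multiprocessor"; the global memory is the
    // DRAM/HBM discussed above.
    printf("GPU name:           %s\n", prop.name);
    printf("Number of SMs:      %d\n", prop.multiProcessorCount);
    printf("Global memory (GB): %.1f\n", prop.totalGlobalMem / 1e9);
    printf("Memory bus width:   %d bits\n", prop.memoryBusWidth);
    return 0;
}
```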
Speeding up Applications
How much parallelism can speed up an application depends on the portion of the application that can be parallelized. If the time taken by the sequential program is $T$ and we can only parallelize 30% of it, then even a 10x speed-up of the parallel portion results in only a 27% reduction in execution time:

$$T_{\text{parallel}} = 0.7\,T + \frac{0.3\,T}{10} = 0.73\,T$$
The fact that the level of speedup one can achieve through parallel execution can be severely limited by the parallelizable portion is known as Amdahl's Law.
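More generally, if a fraction $p$ of the execution time can be parallelized and that portion is sped up by a factor $s$, the overall speed-up is

$$\text{speedup} = \frac{1}{(1 - p) + p/s}$$

For the example above, $p = 0.3$ and $s = 10$ give a speed-up of $1/(0.7 + 0.03) \approx 1.37$, which matches the 27% reduction in execution time.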
Goals of the book
Much of the book is dedicated to techniques for developing high-performance parallel code, with a focus on computational thinking techniques that make problems amenable to parallel computing.
The second goal is to write parallel programs with correct functionality and reliability, in a way that allows the code to be debugged and inspected. CUDA encourages the use of simple forms of barrier synchronization, memory consistency, and atomicity for managing parallelism (see the sketch below).
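A minimal sketch of these primitives in use (the `blockSum` kernel and its sizes are illustrative and assume 256 threads per block): `__syncthreads()` is a barrier across the threads of a block, and `atomicAdd()` updates shared data atomically.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: each block sums its portion of the input in shared
// memory, then one thread atomically adds the block's partial sum to a
// global total. Assumes blockDim.x == 256.
__global__ void blockSum(const float *in, float *total, int n) {
    __shared__ float partial[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // barrier: wait until every thread has stored its value

    // Tree reduction within the block; a barrier separates each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Atomicity: many blocks update the same global total without races.
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);
}

int main(void) {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *h_in = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;  // sum should be n

    float *d_in, *d_total, h_total = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_total, &h_total, sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(d_in, d_total, n);

    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("total = %.0f (expected %d)\n", h_total, n);
    cudaFree(d_in); cudaFree(d_total); free(h_in);
    return 0;
}
```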
The third goal is scalability across future hardware designs. The key to such scaling is to regularize and localize memory accesses so as to minimize the consumption of critical resources and conflicts when updating data, as sketched below.
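One common pattern in that spirit, shown here only as an illustrative sketch rather than a prescription from the text, is a grid-stride loop: consecutive threads access consecutive elements (regular, localized accesses), and the same kernel covers any problem size on GPUs with any number of SMs.

```cuda
// Illustrative grid-stride kernel: consecutive threads read consecutive
// elements (coalesced, regular accesses), and the loop lets any grid size
// cover any problem size, so the same code scales with the hardware.
__global__ void scale(float *data, float alpha, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= alpha;
}
```

Launched as, e.g., `scale<<<128, 256>>>(d_data, 2.0f, n)` on a device array `d_data`, the same code runs unchanged on small and large GPUs.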