Introduction to GPU Computing
From Jacket Wiki
Graphics Processors for General Purpose Computing
Over the past few years, specialized coprocessors, from floating-point hardware to field-programmable gate arrays, have enjoyed a widening performance gap over traditional x86-based processors. Of these, graphics processing units (GPUs) have advanced at an astonishing rate and are currently capable of delivering over 1 TFLOPS of single-precision performance and over 300 GFLOPS of double-precision performance (see Figure 1 below) while executing up to 240 simultaneous threads in one low-cost package. As such, GPUs have gained significant popularity as powerful tools for high-performance computing (HPC), achieving 20-100 times the speed of their x86 counterparts in applications such as physics simulation, computer vision, options pricing, sorting, and search.
A GPU is a highly parallel computing device designed for the task of graphics rendering. However, the GPU has evolved in recent years to become a more general processor, allowing users to flexibly program certain aspects of the GPU to facilitate sophisticated graphics effects and even scientific applications. In general, the GPU has become a powerful device for the execution of data-parallel, arithmetic-intensive (as opposed to memory-intensive) applications in which the same operations are carried out on many elements of data in parallel. Example applications include the iterative solution of PDEs, video processing, machine learning, and 3D medical imaging.
What is the reason for the large performance gap between many-core GPUs and general purpose multi-core CPUs? The answer lies in the fundamental architectural design of these two processors, as illustrated in Figure 2. The design of a CPU is optimized for sequential code performance. It makes use of sophisticated control logic to allow instructions from a single thread of execution to execute in parallel or even out of sequential order while maintaining the appearance of sequential execution. More importantly, large cache memories are provided to reduce the instruction and data access latencies of large complex applications. Neither control logic nor cache memories contribute to the peak calculation speed. As of 2008, the new general purpose multi-core microprocessors typically have four large processor cores designed to deliver strong sequential code performance.
The design philosophy of GPUs has historically been shaped by the fast-growing video game industry, which exerts tremendous economic pressure for the ability to perform massive numbers of floating-point calculations in advanced games. The design goal for GPU vendors is therefore to maximize the chip area and power budget dedicated to floating-point calculations. The general philosophy for GPU design is to optimize for the execution of massive numbers of threads. The hardware takes advantage of a large number of execution threads to find work to do when some of them are waiting for long-latency memory accesses, which minimizes the control logic required for each execution thread. Small cache memories are provided to help control the bandwidth requirements of these applications, so that multiple threads accessing the same memory data do not all need to go to DRAM. As a result, much more chip area is dedicated to floating-point calculations.
With the release of NVIDIA's Compute Unified Device Architecture (CUDA), the hardware architecture of modern GPUs can be viewed as a series of Single Instruction Multiple Data (SIMD) multiprocessors, each capable of executing the same instruction on different data elements in a clock cycle. Typically, the GPU is accessed for computation or graphics rendering through its device driver using a graphics API, such as OpenGL or DirectX, or through a specialized API such as CUDA (which shares its name with NVIDIA's hardware architecture). GPU computing with CUDA is a more modern and capable GPGPU approach than using the two graphics APIs. CUDA forms the underlying workhorse for Jacket.
CUDA is a software architecture and API geared towards the use of the GPU as a computing device rather than a graphics rendering device. The CUDA software includes a GPU device driver, a runtime system that serves as an abstraction over the driver, and runtime libraries that CUDA applications may link against for GPU-enabled FFT and BLAS support, among others. CUDA also includes a compiler toolchain that provides extensions to the C/C++ languages for the construction of GPU applications. When programming the GPU with the CUDA toolchain, the GPU is viewed as a coprocessor to the CPU, or host, which orchestrates the work carried out by the GPU as needed. To use the GPU to its fullest potential, the CPU must minimize data communication with the GPU, because bus bandwidth is limited, and maximize data parallelism in the tasks given to the GPU so that its processors stay busy. Though the GPU can be viewed as capable of executing a large number of general threads in parallel, GPU programming is still typically accomplished through the specification of kernels which operate across an array of data elements. These kernels are limited in their length and in the amount of local memory they use. The potential bottlenecks involved in computing with the GPU include memory allocation, memory transfer, and kernel execution. In the ideal case, each of these tasks is done sparingly to ensure that minimal overhead is accrued over the lifetime of an application. Jacket, which we describe in the following sections, minimizes these tasks transparently and yields high GPU/CPU performance for MATLAB® applications with minimal effort from the user.
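The kernel model described above can be illustrated with a minimal sketch in plain Python. This is not CUDA code and not part of Jacket: the names `saxpy_kernel` and `launch` are hypothetical, and the "launch" simply loops over the indices one at a time, whereas a GPU would execute the per-element bodies concurrently.

```python
# Serial emulation of the kernel model: one function body, applied once
# per data element. All names here are hypothetical illustrations.

def saxpy_kernel(i, a, x, y, out):
    """Body executed by the hypothetical 'thread' i: out[i] = a * x[i] + y[i]."""
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    """Stand-in for a kernel launch: run the kernel for every index 0..n-1.

    A GPU would run these invocations in parallel; here they run in order.
    """
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)
```

The point of the sketch is the shape of the program: the per-element body contains no loop over the data, and each invocation touches only its own index, which is what allows the invocations to run independently.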
Important Concepts
A few basic concepts should be understood in getting started with GPU computing. These are outlined in the following few paragraphs:
Data-parallel Computations
In order to understand what algorithms work well on the GPU, it is important to understand the difference between data parallelism and task parallelism. There are many ways to define this, but simply put and in our context:
- Task parallelism is the simultaneous execution on multiple cores of many different functions across the same or different datasets.
- Data parallelism (aka SIMD) is the simultaneous execution on multiple cores of the same function across the elements of a dataset.
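The two definitions above can be contrasted in a short Python sketch using the standard library's thread pool. The helper functions `square`, `total`, and `largest` are hypothetical, and Python threads are used only to show the structure of each style; they are not expected to give a real speedup for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor

def square(v):
    return v * v

def total(values):
    return sum(values)

def largest(values):
    return max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same function applied to every element of one dataset.
    squares = list(pool.map(square, data))

    # Task parallelism: different functions submitted to run concurrently
    # (here both happen to read the same dataset).
    sum_future = pool.submit(total, data)
    max_future = pool.submit(largest, data)
    data_sum = sum_future.result()
    data_max = max_future.result()

print(squares, data_sum, data_max)
```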
In particular, GPUs are especially well-suited to problems that can be expressed as data-parallel computations with high arithmetic intensity, that is, a high ratio of arithmetic operations to memory operations. Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up their computations. For example, image and media processing applications such as post-processing of images, video encoding and decoding, image scaling, stereo vision, and pattern recognition very naturally map image blocks and pixels to parallel processing threads. Moreover, many algorithms outside the field of image rendering and processing are also accelerated by data-parallel processing, ranging from general signal processing and physics simulation to computational finance and computational biology.
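As a rough worked example of arithmetic intensity, the sketch below compares two familiar operations. The operation counts are simplifying assumptions made for illustration: the matrix multiply is idealized so that each matrix element is moved to or from memory only once, and its arithmetic is counted as about 2n^3 operations.

```python
def arithmetic_intensity(arithmetic_ops, memory_ops):
    """Ratio of arithmetic operations to memory operations."""
    return arithmetic_ops / memory_ops

n = 1024

# Vector addition c = a + b on n elements:
# n additions, 2n loads plus n stores.
vector_add = arithmetic_intensity(n, 3 * n)

# Dense n x n matrix multiply (idealized, see the assumptions above):
# about 2 * n^3 arithmetic operations, 2 * n^2 loads plus n^2 stores.
matmul = arithmetic_intensity(2 * n**3, 3 * n**2)

print(f"vector add: {vector_add:.3f} ops per memory op")
print(f"matmul:     {matmul:.1f} ops per memory op")
```

Under these assumptions the intensity of vector addition stays fixed at one third regardless of size, while that of matrix multiplication grows with n, which is why the latter is the more natural fit for a device built around floating-point throughput.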
However, it should be understood that the GPU is designed as a numeric computing engine and will not perform well on some tasks. One should therefore expect that most applications will use both CPUs and GPUs, executing the sequential parts on the CPU and the numerically intensive parts on the GPU. This heterogeneous computing model is fully supported by Jacket for joint CPU-GPU execution of an application.
Data Sizes and Transfers between Host and GPU
Another important concept in understanding the limitations of the heterogeneous computing model described above is the significant overhead of memory transfers between the host CPU and the GPU. For smaller datasets, the time spent sending data to the GPU and bringing the results back generally neutralizes any performance benefit obtained by computing on the GPU. Moreover, GPUs offer the best performance gains when all of their computing resources, both processing cores and memory, are fully utilized. From the user's perspective, therefore, an analysis of data sizes is the best way to determine which jobs to offload to the GPU.
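The trade-off can be made concrete with a simple cost model. The sketch below is only an illustration: the function `offload_times` and every default value in it (per-element CPU cost, GPU speedup, bus bandwidth, fixed transfer latency) are hypothetical assumptions, not measurements of any real system, and should be replaced with figures measured on the hardware at hand.

```python
def offload_times(n_elements,
                  bytes_per_element=4,      # assumed: single-precision values
                  cpu_s_per_element=1e-8,   # assumed CPU cost per element
                  gpu_speedup=20.0,         # assumed compute-only speedup
                  bus_bytes_per_s=4e9,      # assumed host-GPU bandwidth
                  transfer_latency_s=1e-5): # assumed fixed cost per transfer
    """Return (cpu_seconds, gpu_seconds) under a simple linear cost model.

    The GPU time includes one upload and one download of the whole dataset.
    """
    cpu = n_elements * cpu_s_per_element
    payload = n_elements * bytes_per_element
    transfer = 2 * (transfer_latency_s + payload / bus_bytes_per_s)
    gpu = transfer + cpu / gpu_speedup
    return cpu, gpu

for n in (1_000, 1_000_000):
    cpu, gpu = offload_times(n)
    winner = "GPU" if gpu < cpu else "CPU"
    print(f"n={n}: cpu={cpu:.2e} s, gpu={gpu:.2e} s, faster on {winner}")
```

With these assumed figures the fixed transfer cost dominates for the small dataset, so the CPU is faster, while the large dataset amortizes the transfer and favors the GPU. The location of the break-even point depends entirely on the chosen parameters.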
Why Jacket?
Jacket exists to enable domain professionals, including scientists, engineers, and analysts, to get the benefits of GPU computing without the hassle of GPU-specific programming constructs. Jacket achieves this by providing a middleware approach to GPU programming, with MATLAB® as the frontend point of interaction for the user. MATLAB, with millions of users worldwide, is the platform of choice for engineers and scientists alike for rapid algorithm prototyping. MATLAB is an extensible, interactive programming environment for numerical analysis built on a vector language called M. The M-language, like other vector languages, provides users with a high-level interface in which operations may be specified over large sets of data at once, making the expression of data-parallel algorithms natural. M is also dynamically typed, adheres to pass-by-value semantics, and is integrated into a well-developed interpreted environment. With these characteristics, M has proven to be a powerful, user-friendly language. Using Jacket, the M-language and MATLAB transparently adapt to GPGPU computing. Unlike other GPU solutions, Jacket provides GPU computation and graphics ability from a language which is inherently parallel and interpreted, thereby providing a standard, extensible, and simple method of programming the GPU in an already proven rapid prototyping environment. Jacket adds a few GPU-specific datatypes with overloaded operators to MATLAB, and entire CPU-bound MATLAB programs can be converted into GPU-enabled programs with as little as adding a 'g' prefix to memory allocation commands. Otherwise, the user interacts with MATLAB as they normally would, either from the command line or when running scripts.