NVIDIA GPU Computing & CUDA FAQ
Articles - Featured Guides
Written by Olin Coles and NVIDIA
Monday, 16 June 2008
NVIDIA GPU Compute FAQ
GPU Computing Overview

You are going to see an increased interest in GPU computing very soon. Terms such as "heterogeneous computing" and "parallel computing" are going to be used as often as the term "video card" is used in a product review. You won't want to miss this evolution in graphics technology, because we are witnessing a pivotal moment in time when computers are going to stop being filled with familiar single-purpose hardware. Think of this as the moment when unibody construction evolved the automobile industry decades ago, and later shaped an entirely new dimension for manufacturers to approach building cars. We're experiencing the same moment, because the CPU is about to be joined by a GPU that does many of the same tasks, only better. For years, CPU manufacturers have enjoyed a position at the head of the table. But with heterogeneous computing now a present-day reality, many systems operate with smaller purpose-driven chips on a platform more representative of a round table. Benchmark Reviews offers this FAQ to help our readers understand what is happening within our world of technology, and help introduce them to what is coming as we launch the NVIDIA GeForce GTX 280 Compute Video Card. We don't want anyone to be left in the cold when the rest of the world learns that the GPU is this year's CPU.

What is heterogeneous computing?

Heterogeneous computing is the idea that, to attain the highest efficiency, applications should use both of the major processors in the PC: the CPU and the GPU. CPUs tend to be best at serial operations with lots of branches and random memory access. GPUs, on the other hand, excel at parallel operations with lots of floating-point calculations.
The best result is achieved by using a CPU for serial applications and a GPU for parallel applications. Heterogeneous computing is about using the right processor for the right operation.

What kinds of applications are serial, and what kinds are parallel?

Very few applications are purely serial or purely parallel. Most require both types of operations to varying degrees. Compilers, word processors, Web browsers, and e-mail clients are examples of applications that are primarily serial. Video playback, video encoding, photo processing, scientific computing, physics simulation, and 3D graphics (raytracing and rasterization) are examples of parallel applications.
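To illustrate the distinction, consider two hypothetical C loops (the function names are invented for this example). In the first, each iteration depends on the previous result, so the work is inherently serial; in the second, every iteration is independent, so a GPU could in principle process all the elements at once.

```c
// Serial: each iteration needs the previous iteration's result
// (a running dependency), so the steps cannot execute at the same time.
float running_total(const float *x, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum = sum + x[i];          // depends on the previous sum
    return sum;
}

// Parallel: no iteration depends on another, so all n elements
// could, in principle, be computed simultaneously.
void brighten(float *out, const float *in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 1.2f;     // each element stands alone
}
```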
What GPUs does CUDA operate with?

NVIDIA CUDA-enabled products can help accelerate the most demanding tasks, from video and audio encoding to oil and gas exploration, product design, medical imaging, and scientific research. Many CUDA programs require at least 256 MB of memory attached to the GPU. Please check your system's specifications to ensure the GPU has enough memory to run CUDA programs.
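If you are unsure how much memory your GPU has, the CUDA runtime API can report it. This is a minimal sketch (error handling omitted; the 256 MB threshold simply mirrors the recommendation above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Report total graphics memory and flag devices below the
        // 256 MB recommendation for CUDA programs.
        double mb = prop.totalGlobalMem / (1024.0 * 1024.0);
        printf("Device %d: %s, %.0f MB %s\n", i, prop.name, mb,
               mb >= 256.0 ? "(meets recommendation)"
                           : "(below recommended 256 MB)");
    }
    return 0;
}
```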
GPU Computing is a standard feature in NVIDIA's 8-Series and future GPUs. CUDA will be supported across a range of NVIDIA GPUs, although we recommend that the GPU have at least 256 MB of graphics memory. System configurations with less than the recommended memory size may not have enough memory to properly support CUDA programs.

What makes the GeForce GTX 280 a great parallel processor for the PC?

There are three key ingredients:
CUDA: Compute Unified Device Architecture

What is CUDA?

NVIDIA CUDA (Compute Unified Device Architecture) technology is the world's only C-language environment that enables programmers and developers to write software to solve complex computational problems in a fraction of the time by tapping into the many-core parallel processing power of GPUs. With millions of CUDA-capable GPUs already deployed, thousands of software programmers are already using the free CUDA software tools to accelerate applications, from video and audio encoding to oil and gas exploration, product design, medical imaging, and scientific research.

Providing orders of magnitude more performance than current CPUs and simplifying software development by extending the standard C language, CUDA technology enables developers to create innovative solutions for data-intensive problems. For advanced research and language development, CUDA includes a low-level assembly-language layer and driver interface.

CUDA is a software and GPU architecture that makes it possible to use the many processor cores (and eventually thousands of cores) in a GPU to perform general-purpose mathematical calculations. CUDA is accessible to all programmers through an extension to the C and C++ programming languages for parallel computing.

Technology Features:
Does CUDA also work on the CPU?

Yes! The upcoming version of CUDA will support multicore CPUs. See the CUDA FAQ for more details.

How is CUDA different from GPGPU?

CUDA is designed from the ground up for efficient general-purpose computation on GPUs. It uses a C-like programming language and does not require remapping algorithms to graphics concepts. CUDA is an extension to C for parallel computing: it allows the programmer to program in C, without the need to translate problems into graphics concepts. Anyone who can program in C can swiftly learn to program in CUDA.

GPGPU (General-Purpose computation on GPUs) uses graphics APIs like DirectX and OpenGL for computation. It requires detailed knowledge of graphics APIs and hardware, and its programming model is limited in terms of random reads and writes and thread cooperation.

CUDA exposes several hardware features that are not available via the graphics API. The most significant of these is shared memory: a small (currently 16 KB per multiprocessor) area of on-chip memory that can be accessed in parallel by blocks of threads. This allows caching of frequently used data and can provide large speedups over using textures to access data. Combined with a thread-synchronization primitive, this allows cooperative parallel processing of on-chip data, greatly reducing the expensive off-chip bandwidth requirements of many parallel algorithms. This benefits a number of common applications such as linear algebra, Fast Fourier Transforms, and image-processing filters.

Whereas fragment programs in the graphics API are limited to outputting 32 floats (RGBA × 8 render targets) at a pre-specified location, CUDA supports scattered writes: an unlimited number of stores to any address. This enables many new algorithms that were not possible to perform efficiently using graphics-based GPGPU. Finally, the graphics API forces the user to store data in textures, which requires packing long arrays into 2D textures.
This is cumbersome and imposes extra addressing math, whereas CUDA can perform loads from any address. CUDA also offers highly optimized data transfers to and from the GPU. CUDA is available free from NVIDIA at the CUDA Zone website.

Parallel Computing with GeForce

Isn't parallel computing difficult? How difficult is it to program in CUDA?

Parallel programming is difficult because it has typically meant making many CPUs work together (as in a cluster). Desktop applications have been slow to take advantage of multi-core CPUs due to the difficulty of splitting a single program into one that works across multiple threads. These difficulties arise from the fact that a CPU is inherently a serial processor, and making multiple CPUs work together requires complex software to manage them.
CUDA removes much of the burden of manually managing parallelism. A program written in CUDA is actually a serial program called a kernel. The GPU takes this kernel and makes it parallel by launching thousands of instances of the program. Since CUDA is an extension of C, it's often trivial to port programs to CUDA: it can be as simple as converting a loop into a CUDA call. There's no need to completely re-architect the program to be multi-threaded.

What are the key features of CUDA?
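The loop-to-kernel conversion mentioned above can be sketched hypothetically like this (the function and variable names are invented for illustration, not taken from NVIDIA's samples):

```cuda
// Serial C version: one loop iteration per element.
void scale_serial(float *out, const float *in, float k, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = k * in[i];
}

// CUDA version: the loop body becomes a kernel, and the GPU launches
// one thread per element instead of iterating.
__global__ void scale_kernel(float *out, const float *in, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard spare threads
        out[i] = k * in[i];
}

// Launch enough 256-thread blocks to cover all n elements:
// scale_kernel<<<(n + 255) / 256, 256>>>(d_out, d_in, k, n);
```

The launch configuration (256 threads per block here) is a tuning choice, not a fixed requirement.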
Describe the whole process of creating a CUDA program and executing it on the GPU.

The first step involves profiling the existing application or algorithm to determine which code segments are the bottleneck and which of these are good candidates for parallel execution. These functions are then redirected to the GPU using the C extensions in CUDA to define parallel data structures and operations. The program is compiled using NVIDIA's CUDA compiler, which produces code for both the CPU and GPU. When the program is run, the CPU executes the serial portions of the code and the GPU executes the CUDA code, where the heavy computation takes place.

The GPU portion of the code is called a kernel. The kernel defines the operations that are to be applied to a given dataset. The GPU takes the kernel and creates an instance of it for every element in the dataset. These kernel instances are called threads. A thread contains its own program counter, registers, and state. For large datasets, as in image and data processing, millions of threads may be launched.

Threads are executed in groups of 32 called "warps." Warps are assigned to and executed on streaming multiprocessors (SMs). An SM is an eight-core processor; each core, called a streaming processor (SP) or thread processor, is capable of executing one instruction per thread, per clock. Hence an SM takes four processor clocks to execute a warp (32 threads).

An SM is not a traditional multicore processor. An SM is highly multithreaded, supporting up to 32 warps at a time. At each clock, the hardware picks which warp to execute, and it switches from one warp to the next with no penalty. Using a CPU analogy, it is akin to supporting 32 programs at once and switching between them at every clock with no context-switch penalty. In practice, most CPU cores support one program at a time, and other programs are switched in and out with a delay of hundreds of clock cycles.
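Putting the steps above together, a complete host-side sketch for summing two arrays might look like the following. This is a hypothetical example: the names, the 256-thread block size, and the omission of error checking are all illustrative choices.

```cuda
#include <cuda_runtime.h>

// Kernel: one thread computes one element of the output array.
__global__ void add_arrays(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void sum_on_gpu(float *c, const float *a, const float *b, int n) {
    float *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(float);

    // Allocate GPU memory and copy the inputs across the bus.
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element, in blocks of 256 threads.
    add_arrays<<<(n + 255) / 256, 256>>>(d_c, d_a, d_b, n);

    // Copy the result back to the host and release GPU memory.
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```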
In summary, the high-level flow of execution is as follows: define the kernel, have the GPU instantiate and launch threads based on this kernel, group the threads into bundles of 32 called warps, and execute these warps on highly multithreaded processors called SMs.

GPU vs. CPU Architecture

The CPU's model for computing is well understood. How does the GPU compute, say, the sum of two arrays?

Suppose we have two arrays of 1,000 elements and want to find the sum of their elements. The CPU program would iteratively step through the two arrays, finding the sum at each point. For 1,000 elements, it takes 1,000 iterations to execute. On a GPU, the program is defined as a sum operation over the two arrays. When the GPU executes the program, it generates an instance of the sum program for every element in the array. For an array of 1,000 elements, it creates and launches 1,000 "sum threads." A GeForce GTX 280 has 240 cores, allowing 240 threads to be calculated per clock, so for 1,000 elements the GeForce GTX 280 finishes execution in about five cycles.

The key here is that a CUDA program defines parallelism, and the GPU can take this information and launch threads in hardware. The programmer is freed from the task of creating, managing, and retiring threads. This also allows a program to be compiled once and then run on different types of GPUs with different numbers of cores.

What are the essential differences between GPU and CPU architecture?

There are many ways of looking at this:
GPU Computing Performance

How do GPUs perform in real-world applications?

GPUs excel in highly parallel applications. Speedups between 10× and 100× have been observed in real-world applications.

Video Encoding

For video encoding, a 110-second clip encodes in 21 seconds on the GeForce GTX 280. The same clip takes 231 seconds on the fastest CPU.
Folding@Home

Folding@Home, the distributed-computing protein-folding application from Stanford University, runs more than 100× faster on the GPU than on the fastest CPU. Protein folding is measured in nanoseconds per day: how many nanoseconds of the protein's life can be simulated in a day's worth of computing time. A GeForce GTX 280 can fold at 590 ns/day (up from the 511 ns/day discussed at Editor's Day in May 2008), compared to 4 ns/day on a CPU or 100 ns/day on the PlayStation 3. These results are based on the same protein and equivalent work units. At the time of this writing, a beta release of the GPU F@H client is available for Windows Vista users exclusively from Benchmark Reviews.
GPU Physics

Physics simulations are inherently parallel and run very well on the GPU. The table below shows common problems like cloth, soft bodies, and fluid simulation. On a preliminary implementation of the PhysX engine on the GPU, the GPU is on average 11× faster than a quad-core CPU.
Questions? Comments? Benchmark Reviews really wants your feedback. We invite you to leave your remarks in our Discussion Forum.
Comments
can i program gtx 260 with visual studio ?