NVIDIA GPU Computing & CUDA FAQ
Written by Olin Coles and NVIDIA   
Monday, 16 June 2008

NVIDIA GPU Compute FAQ


GPU Computing Overview

You are going to see increased interest in GPU computing very soon. Terms such as "heterogeneous computing" and "parallel computing" are going to be used as often as the term "video card" is used in a product review. You won't want to miss this evolution in graphics technology, because we are witness to a pivotal moment: computers are about to stop being filled with familiar single-purpose hardware.

Think of this as the moment when unibody construction transformed the automobile industry decades ago, and later shaped an entirely new way for manufacturers to approach building cars. We're experiencing the same kind of moment, because the CPU is about to be joined by a GPU that performs many of the same tasks, only better. For years, CPU manufacturers have enjoyed a position at the head of the table. But with heterogeneous computing now a present-day reality, many systems operate with smaller purpose-driven chips on a platform more representative of a round table.

Benchmark Reviews offers this FAQ to help our readers understand what is happening within our world of technology, and to introduce them to what is coming as we launch the NVIDIA GeForce GTX 280 compute video card. We don't want anyone to be left in the cold when the rest of the world learns that the GPU is this year's CPU.

What is heterogeneous computing?

Heterogeneous computing is the idea that, to attain the highest efficiency, applications should use both of the major processors in the PC: the CPU and the GPU. CPUs tend to be best at serial operations with lots of branches and random memory accesses. GPUs, on the other hand, excel at parallel operations with lots of floating-point calculations. The best result is achieved by using the CPU for serial work and the GPU for parallel work. Heterogeneous computing is about using the right processor for the right operation.

What kinds of applications are serial, and what kinds are parallel?

Very few applications are purely serial or purely parallel. Most require both types of operations to varying degrees. Compilers, word processors, Web browsers, and e-mail clients are examples of applications that are primarily serial. Video playback, video encoding, photo processing, scientific computing, physics simulation, and 3D graphics (ray tracing and rasterization) are examples of parallel applications.


What GPUs does CUDA operate with?

NVIDIA CUDA-enabled products can help accelerate the most demanding tasks: from video and audio encoding to oil and gas exploration, product design, medical imaging, and scientific research. Many CUDA programs require at least 256 MB of memory attached to the GPU. Please check your system's specifications to ensure the GPU has enough memory to run CUDA programs.
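If you are not sure how much memory your GPU carries, the CUDA runtime can report it directly. The following is a minimal sketch using the standard cudaGetDeviceCount and cudaGetDeviceProperties calls; the output formatting is our own:

    // List every CUDA-capable GPU and its on-board memory, so you can
    // verify it meets the 256 MB recommendation for CUDA programs.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("GPU %d: %s, %lu MB of device memory\n", dev, prop.name,
                   (unsigned long)(prop.totalGlobalMem >> 20));
        }
        return 0;
    }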

CUDA-enabled products at the time of writing:

GeForce: GTX 280, GTX 260, 9800 GX2, 9800 GTX, 9600 GT, 8800 Ultra, 8800 GTX, 8800 GTS, 8800 GT, 8800 GS, 8600 GTS, 8600 GT, 8500 GT, 8400 GS, 8800M GTX, 8800M GTS, 8700M GT, 8600M GT, 8600M GS, 8400M GT, 8400M GS, 8400M G

Tesla: C870, D870, S870

Quadro: FX 5600, FX 4600, FX 3700, FX 1700, FX 570, FX 370, NVS 290, FX 3600M, FX 1600M, FX 570M, FX 360M, Quadro Plex 1000 Model IV, Quadro Plex 1000 Model S4, NVS 320M, NVS 140M, NVS 135M, NVS 130M

GPU computing is a standard feature in NVIDIA's 8-Series and future GPUs. CUDA will be supported across a range of NVIDIA GPUs, although we recommend that the GPU have at least 256 MB of graphics memory. System configurations with less than the recommended memory size may not have enough memory to properly support CUDA programs.

What makes the GeForce GTX 280 a great parallel processor for the PC?

There are four key ingredients:

  • CUDA: The greatest obstacle to parallel computing has always been the software. The GeForce GTX 280 supports CUDA, the industry's first parallel computing language to achieve deep market penetration (a 70 million user base) on the PC. CUDA is simple, powerful, and offers exceptional scaling on visual computing applications.
  • GPU Computing Architecture: The GeForce GTX 280 is designed specifically for parallel computing, incorporating unique features like shared memory, atomic operations and double precision support.
  • Many-core architecture: With 240 cores running at 1.3GHz, the GeForce GTX 280 is the most powerful floating point processor ever created for the PC.
  • Torrential bandwidth: Due to their high data content, visual computing applications become bandwidth-starved on the CPU. With eight on-die memory controllers, the GeForce GTX 280 can access 141 GB of data per second, greatly accelerating HD video transcoding, physics, and image-processing applications.

CUDA: Compute Unified Device Architecture


What is CUDA?

NVIDIA CUDA (Compute Unified Device Architecture) technology is the world's only C-language environment that enables programmers and developers to write software to solve complex computational problems in a fraction of the time by tapping into the many-core parallel processing power of GPUs. With millions of CUDA-capable GPUs already deployed, thousands of software programmers are already using the free CUDA software tools to accelerate applications, from video and audio encoding to oil and gas exploration, product design, medical imaging, and scientific research.

Providing orders of magnitude more performance than current CPUs and simplifying software development by extending the standard C language, CUDA technology enables developers to create innovative solutions for data-intensive problems. For advanced research and language development, CUDA includes a low-level assembly language layer and driver interface.

CUDA is a software and GPU architecture that makes it possible to use the many processor cores (and eventually thousands of cores) in a GPU to perform general-purpose mathematical calculations. CUDA is accessible to all programmers through an extension to the C and C++ programming languages for parallel computing.
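To give a flavor of those C extensions, here is a minimal sketch of a SAXPY kernel (y = a*x + y); the function and variable names are ours, not part of any NVIDIA sample:

    // __global__ marks a function that runs on the GPU; the built-in
    // blockIdx, blockDim and threadIdx variables give each thread its index.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch: 256 threads per block, enough blocks to cover n.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);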

Technology Features:

  • Standard C language for parallel application development on the GPU
  • Standard numerical libraries for FFT (Fast Fourier Transform) and BLAS (Basic Linear Algebra Subroutines)
  • Dedicated CUDA driver for computing with fast data transfer path between GPU and CPU
  • CUDA driver interoperates with OpenGL and DirectX graphics drivers
  • Support for Linux 32/64-bit and Windows XP 32/64-bit operating systems

Does CUDA also work on the CPU?

Yes! The upcoming version of CUDA will support multicore CPUs. See the CUDA FAQ for more details.

How is CUDA different from GPGPU?

CUDA is designed from the ground up for efficient general-purpose computation on GPUs. It is an extension to C for parallel computing: the programmer works in C, without the need to translate problems into graphics concepts or remap algorithms to a graphics API. Anyone who can program in C can swiftly learn to program in CUDA.

GPGPU (General-Purpose computation on GPUs) uses graphics APIs like DirectX and OpenGL for computation. It requires detailed knowledge of graphics APIs and hardware. The programming model is limited in terms of random read and write and thread cooperation.

CUDA exposes several hardware features that are not available via the graphics API. The most significant of these is shared memory, which is a small (currently 16KB per multiprocessor) area of on-chip memory which can be accessed in parallel by blocks of threads. This allows caching of frequently used data and can provide large speedups over using textures to access data. Combined with a thread synchronization primitive, this allows cooperative parallel processing of on-chip data, greatly reducing the expensive off-chip bandwidth requirements of many parallel algorithms. This benefits a number of common applications such as linear algebra, Fast Fourier Transforms, and image processing filters.
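As a hedged illustration of this technique, the sketch below sums 256 elements per thread block entirely in shared memory, using the __shared__ qualifier and the __syncthreads() synchronization primitive (the kernel name and block size are our own choices):

    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float tile[256];              // on-chip, shared by the block
        int tid = threadIdx.x;
        tile[tid] = in[blockIdx.x * 256 + tid];  // one global load per thread
        __syncthreads();                         // wait for the whole block

        // Tree reduction performed entirely in fast on-chip memory.
        for (int stride = 128; stride > 0; stride >>= 1) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = tile[0];           // one global store per block
    }

Each block touches global memory only once per element on the way in and once per block on the way out; every intermediate step uses the fast on-chip shared memory.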

Whereas fragment programs in the graphics API are limited to outputting 32 floats (RGBA * 8 render targets) at a pre-specified location, CUDA supports scattered writes, i.e., an unlimited number of stores to any address. This enables many new algorithms that could not be performed efficiently using graphics-based GPGPU.

The graphics API forces the user to store data in textures, which requires packing long arrays into 2D textures. This is cumbersome and imposes extra addressing math. CUDA can perform loads from any address. CUDA also offers highly optimized data transfers to and from the GPU.
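A small sketch of that flexibility (our own example, with index assumed to hold valid positions): each thread gathers from one computed address and scatters to another, something a fragment program cannot do:

    __global__ void permute(const float *in, float *out, const int *index, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[index[i]] = in[i];  // scattered write to a computed address
    }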

CUDA is available free from NVIDIA at the CUDA Zone website.

Parallel Computing with GeForce

Isn't parallel computing difficult? How difficult is it to program in CUDA?

Parallel programming is difficult because it has typically meant making many CPUs work together (as in a cluster). Desktop applications have been slow to take advantage of multi-core CPUs due to the difficulty of splitting a single program into one that works across multiple threads. These difficulties arise from the fact that a CPU is inherently a serial processor, and having multiple CPUs requires complex software to manage them.


CUDA removes much of the burden of manually managing parallelism. A program written in CUDA is actually a serial program called a kernel. The GPU takes this kernel and makes it parallel by launching thousands of instances of the program. Since CUDA is an extension of C, it's often trivial to port programs to CUDA. It can be as simple as converting a loop into a CUDA call. There's no need to completely re-architect the program to be multi-threaded.
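As a hedged illustration of how little restructuring this can take, here is an ordinary serial C loop next to its CUDA equivalent (the names are ours):

    // Serial CPU version: one loop stepping over every element.
    void add_arrays_cpu(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    // CUDA version: the loop body becomes the kernel, and the GPU launches
    // one thread per element in place of the loop counter.
    __global__ void add_arrays_gpu(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }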

What are the key features of CUDA?

  • Shared memory: Every multiprocessor in CUDA-capable GPUs contains 16 KB of shared memory. This allows different threads to communicate with each other and share data. Shared memory can be considered a software-managed cache, which provides great speedups by conserving bandwidth to main memory. This benefits a number of common applications such as linear algebra, fast Fourier transforms, and image-processing filters.
  • Random read and write (i.e., gather and scatter): Whereas fragment programs in the graphics API are limited to outputting 32 floats (RGBA * 8 render targets) at a pre-specified memory location, CUDA supports scattered writes, i.e., an unlimited number of stores to any memory address. This enables many new algorithms that are not feasible using a graphics API.
  • Arrays and integer addressing: Graphics APIs force the user to store data as textures, which requires packing long arrays into 2D textures. This is cumbersome and imposes extra addressing math. CUDA allows data to be stored in standard arrays and can perform loads from any address.
  • Texturing support: CUDA provides optimized texture access with automatic caching, free filtering, and integer addressing.
  • Coalesced memory loads and stores: CUDA groups multiple memory load requests or multiple store requests together, effectively reading or writing data from memory in chunks, allowing near-peak use of memory bandwidth (see the sketch after this list).
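The sketch below shows the friendly access pattern the coalescing hardware is built for (the kernel name is ours):

    // Consecutive threads touch consecutive addresses, so the hardware can
    // combine a warp's 32 loads (and stores) into a few wide transactions.
    __global__ void scale(float *data, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // thread k -> element k
        if (i < n)
            data[i] *= s;   // coalesced read-modify-write
        // By contrast, a strided pattern such as data[i * 16] would force
        // many separate, mostly empty memory transactions.
    }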

Describe the whole process of creating a CUDA program and executing it on the GPU.

The first step involves profiling the existing application or algorithm to determine which code segments are the bottleneck and which of these are good candidates for parallel execution. Next, these functions are redirected to the GPU using the C extensions in CUDA to define parallel data structures and operations. The program is compiled using NVIDIA's CUDA compiler, which produces code for both the CPU and GPU. When the program is run, the CPU executes the serial portions of the code and the GPU executes the CUDA code where the heavy computation takes place. The GPU portion of the code is called a kernel. The kernel defines the operations that are to be applied to a given dataset.

The GPU takes the kernel and creates an instance of it for every element in the dataset. These kernel instances are called threads. A thread contains its own program counter, registers, and state. For large datasets, as in image and data processing, millions of threads could be launched.

Threads are executed in groups of 32 called "warps." Warps are assigned and executed on streaming multiprocessors (SMs). An SM is an eight-core processor. Each core is called a streaming processor (SP), or thread processor, capable of executing one instruction per thread, per clock. Hence an SM takes four processor clocks to execute a warp (32 threads).

An SM is not a traditional multicore processor. An SM is highly multithreaded, supporting up to 32 warps at a time. At each clock, the hardware picks and chooses which warp to execute. It switches from one warp to the next with no penalty. Using a CPU analogy, it is akin to supporting 32 programs at once and switching between them at every clock with no context-switch penalty. In practice most CPU cores support one program at a time. Other programs are switched in and out with a delay of hundreds of clock cycles.

In summary, the high-level flow of the execution is as follows: define the kernel, have the GPU instantiate and launch threads based on this kernel, group the threads into bundles of 32 called warps, and execute these warps in highly multithreaded processors called SMs.

GPU vs. CPU Architecture

The CPU's model for computing is well understood. How does the GPU compute, say, the sum of two arrays?

Suppose we have two arrays of 1,000 elements each and want to add them element by element. The CPU program would iteratively step through the two arrays, computing a sum at each position. For 1,000 elements, it takes 1,000 iterations to execute.

On a GPU, the program is defined as a sum operation over the two arrays. When the GPU executes the program, it generates an instance of the sum program for every element in the array. For an array of 1,000 elements, it creates and launches 1,000 "sum threads." A GeForce GTX 280 has 240 cores, allowing 240 threads to be calculated per clock. For 1,000 elements, the GeForce GTX 280 finishes execution in five cycles (1,000 threads / 240 cores, rounded up).

The key here is that a CUDA program defines parallelism and the GPU can take this information and launch threads in hardware. The programmer is freed from the task of creating, managing and retiring threads. This also allows for a program to be compiled once and then run on different types of GPUs with different numbers of cores.
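A minimal sketch of that portability, reusing the add_arrays_gpu kernel from the earlier example (d_a, d_b, and d_c are assumed to be device pointers allocated with cudaMalloc):

    int n = 1000;                  // 1,000 "sum threads", as described above
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 4 blocks
    add_arrays_gpu<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

The same launch runs unmodified on a GPU with 16 cores or 240; only how many threads execute simultaneously changes.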

What are the essential differences between GPU and CPU architecture?

There are many ways of looking at this:

  • Design goal: A CPU core is designed to execute one stream of instructions as fast as possible. GPUs are designed to execute many parallel streams of instructions as fast as possible.
  • Transistor usage: The CPU spends transistors on hardware features like instruction reorder buffers, reservation stations, branch-prediction hardware, and large on-die caches. These features are designed to speed up the execution of a single thread. The GPU spends transistors on processor arrays, multithreading hardware, shared memory, and multiple memory controllers. These features are not fixated on speeding up any particular thread; rather, they allow the chip to support tens of thousands of threads concurrently on-chip, facilitate thread communication, and sustain high memory bandwidth.
  • Role of cache: The CPU uses cache to improve performance by reducing the latency of memory accesses. The GPU uses cache (or software-managed shared memory) to amplify bandwidth.
  • Managing latency: The CPU handles memory latency by using large caches and branch-prediction hardware. These take up a great deal of die space and are often power hungry. The GPU handles latency by supporting thousands of threads in flight at once. If a particular thread is waiting for a load from memory, the GPU can switch to another thread with no delay.
  • Multithreading: CPUs support one or two threads per core. CUDA capable GPUs support up to 1,024 threads per streaming multiprocessor. The cost of a CPU thread switch is hundreds of cycles. GPUs have no cost in switching threads. GPUs typically switch threads every clock.
  • SIMD vs. SIMT: CPUs use SIMD (single instruction, multiple data) units for vector processing. GPUs employ SIMT (single instruction multiple thread) for scalar thread processing. SIMT does not require the programmer to organize the data into vectors, and it permits arbitrary branching behavior for threads.
  • Memory Controller: Intel CPUs have no on-die memory controllers. CUDA capable GPUs employ up to eight on-die memory controllers. As a result, GPUs typically have 10× the memory bandwidth of CPUs.

Specifications:

    Processor                    Intel Core 2 Extreme QX9650   NVIDIA GeForce GTX 280
    Transistors                  820 million                   1.4 billion
    Processor clock              3 GHz                         1296 MHz
    Cores                        4                             240
    Cache / shared memory        6 MB x 2                      16 KB x 30
    Threads executed per clock   4                             240
    Hardware threads in flight   4                             30,720
    Peak gigaflops               96                            933
    Memory controllers           off-die                       8 x 64-bit
    Memory bandwidth             12.8 GB/s                     141.7 GB/s

GPU Computing Performance

How do GPUs perform in real world applications?

GPUs excel in highly parallel applications. Speedups between 10× and 100× have been observed in real-world applications.

Video Encoding

For video encoding, a 110-second clip encodes in 21 seconds on the GeForce GTX 280. The same clip takes 231 seconds on the fastest CPU.

[Chart: HD movie transcode performance]

Folding@Home

Folding@Home, the distributed-computing protein-folding application from Stanford University, runs more than 100× faster on the GPU than on the fastest CPU. Protein folding is measured in nanoseconds per day, or how many nanoseconds of the protein's life can be simulated in a day's worth of computing time. A GeForce GTX 280 can fold at 590 ns/day (up from the 511 ns/day discussed at Editor's Day in May 2008), compared to 4 ns/day on a CPU or 100 ns/day on the PlayStation 3.

These results are based on the same protein and equivalent work units. At the time of this writing, a beta release of the GPU F@H client is available for Windows Vista users exclusively from Benchmark Reviews.

[Chart: Folding@Home performance]

GPU Physics

Physics simulations are inherently parallel and run very well on the GPU. The chart below shows common problems like cloth, soft-body, and fluid simulation. The GPU is on average 11× faster than a quad-core CPU on a preliminary implementation of the PhysX engine on the GPU.

[Chart: GPU physics performance]

Questions? Comments? Benchmark Reviews really wants your feedback. We invite you to leave your remarks in our Discussion Forum.


Comments

# cuda (hamid) 2010-09-26 04:03
Hello, can I program a GTX 260 with Visual Studio?

# RE: cuda (Olin Coles) 2010-09-26 08:04
Yes. The GeForce GTX 260 is fully compatible with CUDA, and will work with Visual Studio.
