

GPGPU Technologies




contents | overview | Module 1: Getting Started: CUDA-enabled NVIDIA GPU Programs | Module 2: Getting Started: PGI OpenACC APIs on CUDA-enabled NVIDIA GPUs | Module 3: CUDA-enabled NVIDIA GPU Programs for Numerical Computations | Module 4: CUDA-enabled NVIDIA GPU Programs using BLAS Libraries for Matrix Computations | Module 5: CUDA-enabled NVIDIA GPU Programs - Application Kernels | Module 6: CUDA-enabled NVIDIA GPU Memory Optimization Programs - Tuning & Performance | Module 7: CUDA-enabled NVIDIA GPU Streams: Concurrent Asynchronous Execution

The computational power of GPUs has attracted wide attention in the scientific community: GPUs provide unprecedented computational power for data-intensive applications, and the use of the Graphics Processing Unit (GPU) to accelerate non-graphics computations has drawn much interest. This is because the computational power of GPUs has exceeded that of PC-based CPUs by more than an order of magnitude while remaining available at a comparable price. CUDA 5.0 is used for development of programs in the lab sessions, and tuning & optimisation techniques are employed to extract the performance of application kernels.

NVIDIA GPU: CUDA | OpenCL | CUDA 4.0 / CUDA 5.0 | NVIDIA Fermi | NVIDIA Kepler

CUDA Driver API | CUDA Toolkit Libraries | CUDA Multi-GPU Programming | Unified Virtual Addressing

GPUDirect 2.0 | NVIDIA PGI OpenACC | CUDA/OpenCL/OpenACC Programming

NVIDIA PGI OpenACC tutorials

References & Web Pages: GPGPU & GPU Computing | Web Sites: NVIDIA CUDA

CUDA - NVIDIA GPU Programming Overview:

NVIDIA's Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company's powerful GPUs. CUDA is a fundamentally new computing architecture that enables the GPU to solve complex computational problems.

CUDA technology gives computationally intensive applications access to the processing power of NVIDIA graphics processing units (GPUs) through a new programming interface. The game community has been using NVIDIA's GPUs and graphics cards (GeForce, Quadro, Tesla, and Fermi brand products) for a long time.

CUDA requires programmers to write special code for parallel processing, but it doesn't require them to explicitly manage threads, which simplifies the programming model. CUDA includes C/C++ software development tools, function libraries, and a hardware abstraction mechanism that hides the GPU hardware from developers. Selected scientific and engineering applications that are data-intensive as well as embarrassingly parallel, and consumer-market applications (gaming, video), may require only single-precision floating-point operations. CUDA provides a solution for such applications, and NVIDIA's newer GPUs, which support double-precision floating-point operations, can address a broader class of applications. NVIDIA Tesla cards are becoming popular in high-performance computing applications.

CUDA Programming Model:

The CUDA programming model automatically manages threads, and it differs significantly from single-threaded CPU code and, to some extent, even from other parallel code. Before CUDA became available, some users in the parallel processing community wrote codes for the GPU through graphics APIs. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks. The GPU is viewed as a compute device capable of executing a very high number of threads in parallel; it operates as a coprocessor to the main CPU, called the host. Data-parallel, compute-intensive portions of applications running on the host are offloaded to the device as a function that is executed on the device by many different threads. Both the host and the device maintain their own DRAM, referred to as host memory and device memory, respectively. One can copy data from one DRAM to the other through optimized API calls that utilize the device's high-performance Direct Memory Access (DMA) engines.

The CUDA model is highly parallel, like the general GPGPU model. The approach is to divide the data set into smaller chunks stored in on-chip memory and then let multiple thread processors share each chunk. Storing the data locally reduces the need to access off-chip memory, thereby improving performance. Designing a class of scientific-computing applications that avoids off-chip memory access may require rewriting the application or redesigning the algorithm, and the overhead involved in loading the required off-chip data into local memory may itself affect performance. CUDA handles this in an intelligent way: an off-chip memory access usually doesn't stall a thread processor, because another thread is ready to execute.

In CUDA, a group of threads works together in round-robin fashion, ensuring that each thread gets execution time without delaying other threads, thereby reducing thread overheads. How well the waits for remote access and service are hidden factors strongly into CUDA's efficiency and scaling. A thread block is a batch of threads that can cooperate by efficiently sharing data through fast shared memory and by synchronizing their execution to coordinate memory accesses at synchronization points specified in the kernel. Each thread is identified by its thread ID, which is the thread number within the block. An application can also specify a block as a three-dimensional array and identify each thread using a 3-component index.

The CUDA Toolkit is a complete software development solution for programming CUDA-enabled GPUs. The Toolkit includes standard FFT and BLAS libraries, a C compiler for the NVIDIA GPU, and a runtime driver. CUDA technology is currently supported on the Linux and Microsoft Windows XP operating systems.

CUDA Toolkit 4.1 for Applications

CUDA Multi-GPU Programming :

The CUDA programming model provides two basic approaches to executing CUDA kernels on multiple GPUs (CUDA "devices") concurrently from a single host application: using one host thread per device, or using a single host thread that switches among the devices.

For applications that require tight coupling of the various CUDA devices within a system, these approaches alone may not be sufficient, because the devices must synchronize or communicate with each other. The CUDA Runtime now provides features by which a single host thread can easily launch work onto any device it needs. To accomplish this, a host thread can call cudaSetDevice() at any time to change the currently active device; a host thread can therefore control more than one device. The CUDA Driver API (Version 4.1) provides a way to access multiple devices from within a single host thread, namely cuCtxPushCurrent() and cuCtxPopCurrent(). For convenience, CUDA application developers can also use a set/get context management paradigm: cuCtxSetCurrent() and cuCtxGetCurrent() have been added in Version 4.1 of the CUDA Driver API, in addition to the existing cuCtxPushCurrent() and cuCtxPopCurrent() functions.

Programming a multi-GPU application is no more difficult than programming an application to utilize multiple cores or sockets, because CUDA is completely orthogonal to CPU thread management and message-passing APIs. The main additional tasks are selecting the correct GPU, which in most cases is a free (without a context) GPU, and identifying the compute-intensive portions of the existing multi-threaded CPU code and porting them to the GPU, which can be done while leaving the inter-CPU-thread communication code unchanged.

In order to issue work to a GPU, a context is established between a CPU thread (or group of threads) and the GPU. Only one context can be active on a GPU at any particular instant. Similarly, a CPU thread can have one active context at a time. A context is established during the program's first call to a function that changes state (such as cudaMalloc(), etc.), so one can force the creation of a context by calling cudaFree(0). Note that a context is created on GPU 0 by default, unless another GPU is selected explicitly prior to context creation with a cudaSetDevice() call. The context is destroyed either with a cudaDeviceReset() call or when the controlling CPU process exits.

MPI, OpenMP, Pthreads on Host CPU (Multi-Core) & Multi-GPU : In order to issue work to p GPUs concurrently, a program can either use p CPU threads, each with its own context, or it can use one CPU thread that swaps among several contexts, or some combination thereof. CPU threads can be lightweight (pthreads, OpenMP, etc.) or heavyweight (MPI). Note that any CPU multi-threading or message-passing API or library can be used, as CPU thread management is completely orthogonal to CUDA. For example, one can add GPU processing to an existing MPI application by porting the compute-intensive portions of the code without changing the communication structure. For synchronization across computations on different GPUs, communication through the host CPU or via GPUDirect is required.

Even though a GPU can execute calls from one context at a time, it can belong to multiple contexts. For example, it is possible for several CPU threads to establish separate contexts with the same GPU (though multiple CPU threads within the same process accessing the same GPU would normally share the same context by default). The GPU driver manages GPU switching between the contexts, as well as partitioning memory among the contexts (GPU memory allocated in one context cannot be accessed from another context).

In many applications, the algorithm is designed so that each CPU thread (pthreads, OpenMP, MPI) controls a different GPU. Achieving this is straightforward if a program spawns as many lightweight threads as there are GPUs: one can derive the GPU index from the thread ID. For example, the OpenMP thread ID can be used directly to select a GPU. An MPI rank can be used to choose a GPU reliably as long as all MPI processes are launched on a single host node that has the GPU devices and a configured CUDA programming environment.
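The thread-ID-to-device mapping described above can be sketched with OpenMP and the CUDA runtime. This is a hedged illustration (error checking omitted) rather than a complete application; it assumes a CUDA build environment (e.g. nvcc -Xcompiler -fopenmp) and cannot run without one:

```c
#include <stdio.h>
#include <omp.h>              /* OpenMP host threading */
#include <cuda_runtime.h>     /* CUDA runtime API      */

int main(void)
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    /* Spawn one lightweight CPU thread per GPU and derive the
     * device index directly from the OpenMP thread ID. */
    #pragma omp parallel num_threads(num_gpus)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);   /* bind this CPU thread to GPU tid   */
        cudaFree(0);          /* force context creation on GPU tid */
        /* ... allocate memory, copy data, and launch kernels here;
         * subsequent runtime calls on this thread go to GPU tid */
        printf("CPU thread %d drives GPU %d\n", tid, tid);
    }
    return 0;
}
```

In an MPI setting, the same idea applies with the process's local rank on the node taking the place of omp_get_thread_num().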

Unified Virtual Addressing :

CUDA Toolkit 5.0 eases programming in multi-GPU environments for NVIDIA Tesla 20-series (Fermi) and Kepler GPUs running in 64-bit mode on Linux. Unified Virtual Addressing (UVA) allows the system memory and the one or more device memories in a system to share a single virtual address space. This allows the CUDA driver to determine, by inspection, the physical memory space to which a particular pointer refers, which simplifies the APIs of functions such as cudaMemcpy(), since the application no longer needs to keep track of which pointers refer to which memory.

GPUDirect 2.0 :

Built on top of UVA, GPUDirect v2.0 provides for direct peer-to-peer communication among the multiple devices in a system and for native MPI transfers directly from device memory.

Multi-Threaded Programming : UVA has several important ramifications for multi-threaded processes. For more detail, refer to the CUDA Toolkit 5.0 documentation for applications.

CUDA Driver API :

Version 4.1 adds a feature that allows multiple host threads to make a particular context current simultaneously, using either cuCtxSetCurrent() or cuCtxPushCurrent(). This has several important ramifications for multi-threaded processes; for more information, refer to the CUDA Toolkit 5.0 documentation for applications.

An Overview of OpenACC Directives :

C/C++ :

#pragma acc directive-name [clause[, clause]...] new-line

Fortran :
!$acc directive-name [clause[, clause]...]

c$acc directive-name [clause[, clause]...]

*$acc directive-name [clause[, clause]...]

OpenACC Parallel Directive :

#pragma acc parallel [clause[, clause]...] new-line structured block

The kernels directive defines a region of a program that is to be compiled into a sequence of kernels for execution on the accelerator; typically, each loop nest becomes a different kernel, and the kernels are launched in order on the device. When a parallel directive is executed, gangs of worker threads are created to execute on the accelerator; one worker in each gang begins executing the code in the structured block, and the number of gangs and workers remains constant for the duration of the parallel region.

OpenCL - CUDA Enabled NVIDIA GPU :

Architecture : The CUDA architecture is close to the OpenCL architecture. A CUDA device is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). A multiprocessor corresponds to an OpenCL compute unit. A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for each OpenCL work-group. A kernel is executed over an OpenCL NDRange by a grid of thread blocks. Each of the thread blocks that execute the kernel is uniquely identified by its work-group ID, and each thread by its global ID or by a combination of its local ID and work-group ID. A thread is also given a unique thread ID within its block. When an OpenCL program on the host invokes a kernel, the work-groups are enumerated and distributed as thread blocks to the multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

Memory Model : Each multiprocessor of the NVIDIA CUDA architecture has on-chip memory of the following four types: a set of registers per processor, a shared memory shared by all the processors of the multiprocessor, a read-only constant cache, and a read-only texture cache.

There is also a global memory address space that is used for OpenCL global memory and a local memory address space that is private to each thread (and should not be confused with OpenCL local memory). Both memory spaces are read-write regions of device memory and are not cached.

List of Programs - OpenCL - CUDA enabled NVIDIA GPUs :

The matrix multiplication examples illustrate the typical data-parallel approach used by OpenCL applications to achieve good performance on GPUs. They illustrate the use of OpenCL local memory, which maps to shared memory on the CUDA architecture. Shared memory is much faster than global memory, and implementations based on shared-memory accesses give a performance improvement for typical matrix computations.

Experts may discuss performance guidelines, focusing on instruction performance, memory bandwidth issues, shared memory, NDRange and kernel-launch execution time on the OpenCL implementation, data transfer between host and device, warp-level synchronization issues, and overall performance optimization strategies.

References :
  1. http://www.nvidia.com/object/nvidia-kepler.html NVIDIA Kepler Architecture

  2. http://developer.nvidia.com/cuda-toolkit/ NVIDIA CUDA Toolkit 5.0 Preview Release, April 2012

  3. http://developer.nvidia.com/category/zone/cuda-zone NVIDIA Developer Zone

  4. http://developer.nvidia.com/gpudirect RDMA for NVIDIA GPUDirect, coming in CUDA 5.0 Preview Release, April 2012

  5. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf NVIDIA CUDA C Programming Guide, Version 4.2, April 2012

  6. http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf Dynamic Parallelism in CUDA, Tesla K20 Kepler GPUs, pre-release of NVIDIA CUDA 5.0

  7. http://developer.nvidia.com/cuda-downloads NVIDIA Developer Zone - CUDA Downloads, CUDA Toolkit 4.2

  8. http://developer.nvidia.com/gpudirect NVIDIA Developer ZONE - GPUDirect

  9. http://developer.nvidia.com/openacc OpenACC - NVIDIA

  10. http://developer.nvidia.com/cuda-toolkit Nsight Eclipse Edition, pre-release of CUDA 5.0, April 2012

  11. http://developer.download.nvidia.com/compute/DevZone/docs/html/OpenCL/doc/OpenCL_Programming_Guide.pdf NVIDIA OpenCL Programming Guide for the CUDA Architecture, Version 4.0, February 2011

  12. http://developer.download.nvidia.com/compute/DevZone/docs/html/OpenCL/doc/OpenCL_Best_Practices_Guide.pdf Optimization: NVIDIA OpenCL Best Practices Guide, Version 1.0, February 2011

  13. http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf NVIDIA OpenCL JumpStart Guide - Technical Brief

  14. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ (Design Guide) V4.0, May 2011

  15. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf NVIDIA CUDA C Programming Guide, Version 4.0, May 2011

  16. http://developer.nvidia.com/gpu-computing-sdk NVIDIA GPU Computing SDK

  17. http://developer.apple.com/mac/snowleopard/opencl.html Apple : Snowleopard - OpenCL

  18. https://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf The OpenCL Specification, Version 1.0 (Document Revision 29), Khronos OpenCL Working Group, Aaftab Munshi (ed.), 2010

  19. http://www.khronos.org/opencl The OpenCL Specification, Version 1.0, Khronos OpenCL Working Group

  20. http://www.khronos.org/assets/uploads/developers/library/overview/OpenCL-Overview-Jun10.pdf OpenCL Introduction and Overview, Khronos Group, June 2010

  21. http://www.khronos.org/files/opencl-1-1-quick-reference-card.pdf The OpenCL 1.1 Quick Reference card

  22. http://www.khronos.org/registry/cl/ OpenCL 1.1 Specification (Revision 44) June 1, 2011

  23. http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf The OpenCL 1.1 Specification (Document Revision 44) Last Revision Date : 6/1/11 Editor : Aaftab Munshi Khronos OpenCL Working Group

  24. http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/ OpenCL Reference Pages

  25. http://www.mathworks.com/products/matlab/ MATLAB

  26. http://developer.nvidia.com/object/matlab_cuda.html NVIDIA - CUDA MATLAB Acceleration

  27. Jason Sanders, Edward Kandrot (foreword by Jack Dongarra), CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2011

  28. David B. Kirk, Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, NVIDIA Corporation / Morgan Kaufmann (Elsevier), 2010

  29. http://www.mathworks.com/matlabcentral/fileexchange/30109-opencl-toolbox-v0-17l OpenCL Toolbox for MATLAB

  30. http://www.nag.co.uk/ NAG
