hyPACK-2013

 

GPGPUs Technologies

Mode-4 GPGPUs | NVIDIA - CUDA/OpenCL | AMD APP OpenCL |
GPGPU - OpenCL | GPGPU : Power & Perf | Home

NVIDIA - CUDA/OpenCL

 

contents | overview | Module 1: Getting Started:CUDA enabled NVIDIA GPU Programs | Module 2:Getting Started :PGI OpenACC APIs on CUDA enabled NVIDIA GPU | Module 3: CUDA enabled NVIDIA GPU Programs on Num. Computations | Module 4:CUDA enabled NVIDIA GPU Programs using BLAS libraries for Matrix Computations | Module 5:CUDA enabled NVIDIA GPU Programs - Application Kernels | Module 6:CUDA enabled NVIDIA GPU Memory Optimization Programs - Tuning & Performance | Module 7:CUDA enabled NVIDIA GPU Streams : Concurrent Ashynchronous Execution


NVIDIA\92s Compute Unified Device Architecture (CUDA) is a soft- ware platform for massively parallel high-performance computing on the company's powerful GPUs. NVIDIA\92s software CUDA programming model effectively use GPUs which could be harnessed for tasks other than graphics, achieving teraflops of computing power. CUDA Programming model automatically manages the threads and it is significantly differs from single threaded CPU code and to some extent even the parallel code. Efficient CUDA programs exploit both thread parallelism within a thread block and coarser block parallelism across thread blocks. Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks.

 
List of Programs
Example 5.1:

Write a CUDA Program for implementation of solution of Partial Differential Equations (PDEs) based on Finite Difference Method.

Example 5.2:

Write a CUDA Program for implementation of solution of Partial Differential Equations (PDEs) based on Finite Difference Method using CUBLAS routines.

Example 5.3:

Write a Program for implementation of Laplacian Edge Detection algorithm - Image Processing on multiple device-GPUs ( Assignment).

Example 5.4:

Write a CUDA Program for implementation of solution of Partial Differential Equations (PDEs) based on Finite Element Method ( Assignment).

Example 5.5:

Write a CUDA Program for implementation of solution of Partial Differential Equations (PDEs) based on Finite Difference Method using multiple-GPUs ( Assignment).

Example 5.6:

Write a CUDA Program for implementation of String Search (Boyer-Moore) algorithm ( Assignment).


 
CUDA enabled NVIDIA GPU Programs for Application Kernels

To Compile the program (Download source code :Makefile-cublas )
Makefile
Example 5.1: Write CUDA program to obtain Solution of Poisson Equation (Partial differential Equations) by Finite Difference Method using simple Decomposition of Mesh.

(Download source code :CudaPoissonEquation.cu)
  • Objective

    Write a CUBLAS CUDA program to solve the Poisson equation with Dirichlet boundary conditions in two space dimensions by finite difference method on structured rectangular type of grid. Use Jacobi iteration method to solve the discretized equations.

  • Description:

    In the Poisson problem, the FD method is imposed a regular grid on the physical domain. The resuting disretized equations are solved using Jacobi Itertaive method. The description of the problem is discussed in Example 5.2 .


Example 5.2: Write CUBLAS CUDA program to obtain Solution of Poisson Equation (Partial differential Equations) by Finite Difference Method using One Dimensional Decomposition of Mesh.

(Download source code :CUBlasPoissonEquation.cu)
  • Objective

    Write a CUBLAS CUDA program to solve the Poisson equation with Dirichlet boundary conditions in two space dimensions by finite difference method on structured rectangular type of grid. Use Jacobi iteration method to solve the discretized equations.

  • Description

    In the Poisson problem, the FD method is imposed a regular grid on the physical domain. It then approximate the derivative of unknown quantity u at a grid point by the ratio of the difference in u at two adjacent grid points to the distance between the grid point. In a simple situation, consider a square domain discretized by n x n grid points, as shown in the Figure 1(a). Assume that the grid points are numbered in a row-major order fashion from left to right and from top to bottom, as shown in the Figure 1(b). This ordering is called natural ordering . Given a total of n2 points in the domain n x n grid, this numbering scheme labels the immediate neighbors of point i on the top, left, right, and bottom point as i - n, i -1, i +1 and i+n , respectively. Figure 1(b) represents partitioning of mesh using one dimensional partitioning


    Finite Difference Grid

       Figure 1(a). Finite difference grid: Five point stencil approximation 


    One dimensional decomposition

      Figure 1(b). One dimensional decomposition for 2-D problem, with n=7 

  • Formulation of Poisson Problem

    The Poisson problem is expressed by the equations Lu = f(x, y) in the interior of domain [0,1] x [0,1] Where L is Laplacian operator in two space dimensions. u(x,y) = g(x,y) on the boundary of the domain [0,1] x [0,1] We define a square mesh (also called a grid) consisting of the points (xi , yi), given by xi = i / n+1, i=0, ..., n+1, yj = j / n+1, j=0, ...,n+1, where there are n+2 points along each edge of the mesh. We will find an approximation to u(x , y) only at the points (xi, yj). Let ui, j be the approximate solution to u at (xi, yj). and let h = 1/(n+1). Now, we can approximate at each of these points with the formula (ui-1, j +ui, j+1+ui, j-1+ui+1, j-4ui, j )/ h2 = fi, j . Rewriting the above equation as ui, j = 1/4(ui-1, j+ui, j+1+ui, j-1+ui+1, j-h2fi, j), Jacobi iteration method is employed to obtain final solution starting with initial guess solution ui,jk for k=0 for all mesh points u i,j and solve the following equation iteratively until the solution is converged. ui, jk+1 = 1/4(ui-1, jk+ui, j+1k+ui, j-1k+ui+1, jk-h2fi, j). The resultant discretized banded matrix is shown in the Figure 2.

    Image

  • Implementation:

    Two implementations have been done, one using kernels written in CUDA and the other using CUBLAS library. The steps in both implementations are the same, only the API may differ.


      Step 1: Four arrays are required for computation, one for the storing old values of U i.e Uold, one for the storing new values of U i.e Unew and one each for the storing difference between these two and storing the index values of the interior points. Memory for these are allocated on the Host and they are initialized.

      Step 2:Four more similar arrays are allocated on the GPU (device). The values of the arrays in the host machine are copied on to the arrays allocated on the device.

      Step 3: We begin the computation with an initial solution of all zero's for the Uold vector in the first iteration and then apply boundary conditions on both Uold and Unew by setting the boundary values at corresponding positions.

      Step 4:We compute the Unew values by ui, j = 1/4(ui-1, j+ui, j+1+ui, j-1+ui+1, j).

      Step 5:Next we compute the difference between Unew and Uold.

      Step 6:We assign Unew to Uold.

      Step 7:The values thus obtained (Uold) is used in the next iteration and steps 4,5,6 are repeated until Maximum difference value < TOLERANCE or present iteration equals Maximum iterations. All computations are done on GPU(device).


  • CUDA API used:

      Allocates memory on device:

      cudaMalloc(void** array, int size)

      Frees memory allocated on device:

      cudaFree(void* array )

      Copies from host to device:

      cudaMemcpy((void*)device_array, (void*)host_array, size , cudaMemcpyHostToDevice )

      Copies from device to host:

      cudaMemcpy((void*)host_array, (void*)device_array, size , cudaMemcpyDeviceToHost )

      Multiplies vector X by scalar alpha and adds results to vector Y and it stores result to vector Y:

      void cublasSaxpy (int n, float alpha, const float *x, int incx, float *y, int incy)

      Finds the smallest index of the maximum magnitude element of vector x:

      int cublasIsamax (int n, const float *x, int incx)

      Copies the vector x to the vector:

      void cublasScopy (int n, const float *x, int incx, float *y, int incy)

  • Performance:

    The gettimeofday() function which is part of sys/time.h is used to measure the time taken for computation. It is computed at the beginning of the computation and also at the end of the computation. The actual time is difference of both.

  • Input

    The input to the problem is given as arguments in the command line. It should be given in the following format. Suppose that the number of points in the X direction is m, the number of points in the Y direction is n and the maximum number of iterations the program can is run is p, then the program must be run as:

    ./program_name m n p

    CPU assigns the boundary values for the Uold and Unew vectors.

  • Output

    CPU prints the final Unew vector which is the solution to the Poisson equation with the above stated boundary values.


Example 5.3: Write CUDA program for Lapalce Edge Detection (Image Processing) Algorithm .
  • Objective:

    Write a CUBLAS CUDA program to implement Laplace Edge Detection algorithm on CUDA enabled NVIDIA GPU with MPI as host-CPU Programming. (Assignment)

  • Description

    The Laplacian image kernel is a 3 x3 two-dimensional array and this kernel is applied to each pixel in the image and takes into account the pixel\92s neighbors in a 3 x 3 area around it. The organization of multidimensional thread blocks of images into multidimensional grids are required for CUDA development to process the of arrays of data in typical edge detection algorithm. This representation of data in these forms are usually best suited for migration to CUDA for processing. Such data types include images that can be represented as a two-dimensional matrix where each entry corresponds to a single pixel in the image. An image pixel consists of a discrete red, green, and blue component in the range [0 . . . 255].

    We create a two-dimensional grid that is overlaid on the image, segmenting it into several rectangular sections, as illustrated in Figure 1. For simplicity, we assume the image can be evenly divided into full sized segments. In other words, the image is divided into thread blocks of 8 x 8 and 16 x 16 sizes. Each thread within the thread block corresponds to a single pixel within the image. In implementation, each thread is responsible for loading one pixel or the block of 3 x 3 pixel area into shared memory.


    Thread blocks for single pixel

       Figure 1(a). Thread blocks for single pixel per thread method  


    Thread blocks for single pixel

      Figure 1(b). Thread blocks for multiple pixels  


    For edge detection, the each target pixel requires that a 3x3 area surrounding pixels and these are analyzed to calculate the output. Therefore, threads on the edge of a thread block must examine pixels that are outside the dimensions of the thread block. In order to ensure accuracy of the output image, these threads are responsible for loading the pixels they are adjacent to that do not have a mapping in the thread block into shared memory.

    That is, the threads on the edge of the block load the boundary pixels into shared memory. This extra step is performed after the initial shared memory load that all threads perform. To compensate for the required extra space, the two-dimensional shared memory array is allocated to have dimensions of (blockDim.x+2, blockDim.y+2). This allocates two additional rows and two additional columns of shared memory. Once the block has loaded its respective section of the target image into shared memory, a syncthreads() function is called so that the threads can re-group before proceeding.

    For each thread in the thread-block, we apply the convolution kernel for target pixel which is at the center element and the 3x3 pixel group. After applying the convolution formula, we obtain correct pixel value. Once the CUDA kernel has finished executing, the allocated memory within the GPU is freed and the program exits.

    To develop a CUDA parallel algorithm for Laplacian edge detection, two approaches are considered. The first is straight forward in its data distribution scheme and organization of CUDA thread blocks, while the second took a new approach in an attempt to increase efficiency within CUDA thread blocks.