abcd-2014

 


ABCD-2014 Hands On Session
 
C-DAC HPC Cluster with Coprocessors / GPUs
 
Type 1: HPC GPU Cluster with NVIDIA GPUs
Type 2: HPC GPU Cluster with AMD GPUs
Type 3 : HPC Cluster with Intel Xeon-Phi Coprocessors
Type 4 : HPC Cluster (PARAM YUVA-II) with Intel Xeon-Phi Coprocessors



Type 1 : HPC GPU Cluster with NVIDIA GPUs
Host-CPU : Intel Xeon Quad Core
Device GPU : NVIDIA Fermi Multi GPUs
Prog. Env : CUDA/OpenCL -"http://www.nvidia.com" NVIDIA GPUs; CUDA SDK/APIs; "http://www.pgroup.com" PGI Accelerator

Peak performance (in double precision) of HPC GPU Cluster with one node having Single CUDA enabled NVIDIA GPU is 615 Gflop/s


Host-CPU (Xeon)
  • One Intel Xeon 64bit Quad Core (X5450 processor series (Harpertown Processor) with two PCI-e 2.0 x16 Slots; RAM-16 GB; Clock Speed : 3.0 GHz; Cent OS 5.2; GCC Version 4.1.2; Dual Socket Quad Core (6 Processors or cores)

  • Intel MKL version 10.2, CUBLAS version 3.2, Intel icc11.1 Peak Performance : CPU : 96 Gflops (1 Node - 8 Cores)

Device-CPU (NVIDIA)
  • One Tesla C2050 (Fermi) with 3 GB memory; Clock Speed 1.15 GHz, CUDA 3.2 Toolkit

  • Reported theoretical peak performance of the Fermi (C2050) is 515 Gflop/s in double precision (448 cores; 1.15 GHz; one instruction per cycle) and reported maximum achievable peak performance of DGEMM in Fermi up to 58% of that peak.

  • The theoretical peak of the GTX280 is 936 Gflops/s in single precision (240 cores X 1.30 GHz X 3 instructions per cycle) and reported maximum achievable peak performance of DGEMM up to 40% of that peak.




Type 2 : HPC GPU Cluster with AMD GPUs
"http://uk.hardware.info/news/28418/hp-pavilion-dv6-with-amd-a8-4500m-trinity-processor" HP Pavailion AMD A8-4500K (Trinity) APU with AMD Radeon 7640G chip
Host-CPU : AMD Opteron X86 12 Core     "http://developer.amd.com/assets/apu101.pdf" AMD APUs (APU 101)
Device GPU : AMD Fire Stream 9350 & 9250; AMD FirePro V5900 & V7900
Prog. Env : OpenCL - "http://www.amd.com" AMD APP GPUs; OpenCL SDK/APIs on APUs

Peak performance (double precision) of HPC GPU Cluster with one node having Single AMD Fire Stream 9305 is 415 Gflop/s


Host-CPU (AMD)
  • One AMD Opteron X86 24 Core Multi-Core Processor systems with two PCI-e 2.0 x16 Slots; RAM-48 GB; Clock Speed : 3.0 GHz; Cent OS 5.2; GCC Version 4.1.2; Dual Socket 12 Core (24 cores)

  • ACML version , OpenCL and BLAS Libraries; Peak Performance : CPU : 144 Gflops (1 Node - 12 Cores) and AMD-APP with OpenCL Prog. Env.

  • AMD Fire Stream 9250 GPU Accelerator :
    Double Precision Floating Point : The FireStream 9250 supports double precision floating point operations in hardware;
    High Performance per Watt : Up to 8 GFLOPS per watt of single precision performance potential

    Optimized for computation The AMD FireStream product line provides the industry's first double-precision floating point capability on a GPU. The AMD FireStream 9250 is our second generation DP-FP product. With 1GB GDDR3 memory on board and single-precision performance of 1 TFLOPS.

  • AMD Fire Stream 9350 GPU Accelerator :
    Technology Need : AMD FireStream Computing Solution
    High DPFP performance : 528 GFLOPS double precision
    High performance per Watt : 2.4 GFLOPS / Watt
    Open standards : OpenCL, Direct Compute
    Performance optimization tools : OpenCL SDK
    PCIe 2.1 Host Interface : 8 GB/S Host-GPU bandwidth

    The FireStream 9350 offers maximum GPU performance with 4GB of DDR5 memory in a 2-slot configuration. The FireStream 9350 offers maximum performance / slot with 2GB DDR5 memory in a 1-slot configuration.

  • AMD FirePro V5900 :
    The AMD FirePro V5900 features 2GB of blazing-fast GDDR5 memory, 512 stream processors, and support for three simultaneous monitor outputs from a single AMD FirePro V5900 graphics card with AMD Eyefinity technology. The AMD FirePro V5900 supports OpenCL and it has parallel processing capabilities of 512 stream processors and PCI Express 2.1 compliant.

  • AMD FirePro V7900 :
    The AMD FirePro V7900 features : 2GB of ultra-fast GDDR5 memory and 1280 stream processors. The AMD FirePro V7900 supports OpenCL and it has parallel processing capabilities of 1280 stream processors and PCI Express 2.1 compliant.




Type 3 : HPC Cluster with Intel Xeon-Phi Coprocessors

Details of Compute node and Intel Coprocessors are given below.
ABCD-2014 Lab. : Host-CPU : Intel Xeon Quad Core;
ABCD-2014 Lab. : Coprocessor : Intel Xeon Phi Multiple Co-processors


ABCD-2014 Lab. : Host-CPU (Xeon)
  • One Intel Xeon 64bit Quad Core (X5450 processor series (Harpertown Processor) with two PCI-e 2.0 x16 Slots; RAM-16 GB; Clock Speed : 3.0 GHz; Cent OS 5.2; GCC Version 4.1.2; Dual Socket Quad Core (6 Processors or cores)

  • Intel MKL version 10.2, Intel icc11.1 Peak Performance : CPU : 96 Gflops (1 Node - 8 Cores)

ABCD-2014 Lab. : Intel Xeon-Phi Coprocessor
  • 60 Intel Architecture (MIC or Intel Xeon-Phi) cores; I/O Bus: PCIe; Memory Type: GDDR5; Memory size: 8 GB GDDR5 memory technology


Type 4 : HPC Cluster (PARAM YUVA-II) with Intel Xeon-Phi Coprocessors

Two HPC Clusters using Intel Xeon-Phi Coprocessors have been used is Hands-on Session of ABCD-2014 workshop. The first HPC cluster is two node Cluster with Intel Xeon-Phi Coprocessors and other one is C-DAC PARAM YUVA-II HPC MPI bassed cluster having o.5 petaflops peak performance with 69 rank in Top-500 supercomputers released in July 2013. The PARAM YUVA-II system has 225 Intel Xeon DP nodes and has node having two Intel Xeon-Phi Coprocessors. The TOP-500 LINPACK Peak Performance (Rpeak) is 529.38 TF and the Top-500 LINPACK Performance (Rmax) is 386.708 TF.
Source : " Source : http://s.top500.org/static/lists/2013/06/TOP500_201306.xls" http://s.top500.org/static/lists/2013/06/TOP500_201306.xls
source : " http://www.top500.org/lists/" http://www.top500.org/lists/

Peak performance (in double precision) of HPC Cluster (PARAM YUVA-II) with one node having two Intel Xeon-Phi Coprocessors and other details are given in sub-sequent sections.


PARAM YUVA-II : Host-CPU (Xeon)
  • One Intel Xeon E52670 64-bit 8 Core Processors with multiple PCI-2 2.0 x16 Slots; RAM : 64 GB; Clock Frequency : 2.35 GHz; No. of Cores per Node : 16 Standard Linux dual Socket 8 Core processors

  • Software Intel Development Tools; Intel MPI, Pubic domain MVAPICH2); MKL, NAG CDAC KSHIPRA; Varda Prog. Env - Reconfigurabale Computing - FPGA

PARAM YUVA -II : Intel Xeon-Phi Coprocessor
  • 60 Intel Architecture (MIC or Intel Xeon-Phi) cores; connected by a high performance on-die bi-directional interconnect. I/O Bus: PCIe Memory Type: GDDR5 and >2x bandwidth of KNF; Memory size: 8 GB GDDR5 memory technology; Peak performance is more than TFLOP (DP); Single Linux image per chip

  • Intel Xeon Phi Co-processor : (No. of Cores : 60) Single Precision : 2129.47 Gflops/s Peak Perf : 1.1091 GHz X 60 cores X 16 lanes X 2 Peak Perf. of Single Core = 35.49 GigaFlop/s


Compute Nodes with Multiple Intel Co-processors

In a Message Passing Cluster, MPI processes are launched on multi-core processors of cluster node or Intel Xeon-Phi Co-processors with InfiniBand as interconnect using Intel programming enviornment