High-Performance Computing Research Group

Enhancing the Efficiency of Massively Parallel Programs in Computational Science and Engineering Applications

Introduction

Graphics processing units (GPUs) are gaining ground in high-performance computing and are regarded as the leading edge of the next generation of general-purpose computers. CUDA (an extension to C) is the most widely used parallel programming framework for general-purpose GPU computation. However, writing optimized CUDA programs is complex even for experts. This is a major obstacle to the wider use of GPU technology in research and development.

We propose to develop a software tool that restructures C-like loops into optimized CUDA kernels using a three-step algorithm: loop tiling, coalesced memory access, and optimized resource utilization. To this end, we will identify the GPU constraints that govern maximum performance, such as memory usage (global and shared memory), the number of blocks, the number of threads per block, and data transfer bandwidth. We will also establish the relationships between these parameters and propose a method for finding optimized tiling solutions with coalesced memory access that best meet the identified constraints. On this basis, we will design a restructuring tool for optimizing the performance of CUDA programs. For evaluation, the tool will be used to parallelize a 2-D/3-D fluid flow simulation based on the Navier-Stokes equations with fixed boundary conditions. The performance obtained will be compared with results reported by others.
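Loop tiling, the first of the three steps named above, can be illustrated on the CPU side before any CUDA is involved. The following is a minimal sketch in plain C, not the proposed tool's algorithm: it assumes square row-major matrices, and the function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical names chosen for this illustration. It shows how a triple loop is split into loops over tiles and loops within a tile while producing the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: untiled triple loop computing C = A * B
 * for row-major n x n matrices. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled version: the outer three loops walk over tile origins, the inner
 * three loops stay inside one tile. `tile` is an illustrative parameter;
 * n does not have to be a multiple of it. */
void matmul_tiled(size_t n, size_t tile,
                  const double *A, const double *B, double *C)
{
    assert(tile > 0);

    for (size_t i = 0; i < n * n; ++i)
        C[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            for (size_t kk = 0; kk < n; kk += tile) {
                size_t i_end = min_size(ii + tile, n);
                size_t j_end = min_size(jj + tile, n);
                size_t k_end = min_size(kk + tile, n);

                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t j = jj; j < j_end; ++j) {
                        /* Accumulate this tile's partial product. */
                        double sum = C[i * n + j];
                        for (size_t k = kk; k < k_end; ++k)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
                }
            }
        }
    }
}
```

In a CUDA setting, each (ii, jj) tile would typically correspond to one thread block and each (i, j) pair inside it to one thread; that mapping is an assumption of this sketch and is not spelled out in the text above.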

The restructuring tool will greatly simplify programming GPUs for best performance, which will help spread the use of GPUs in supercomputing, scientific computing, and, more generally, information technology applications. This project will build the know-how and state-of-the-art tools needed for efficient GPU programming. It will also stimulate long-term interest in research and development on programming massively parallel computers and on their applications, especially in the oil and gas industry. Specifically, the project outcomes will serve the graduate research program and industry in the Kingdom of Saudi Arabia.

Research Plan

Massively parallel computing is gaining ground in high-performance computing. CUDA (an extension to C) is the most widely used parallel programming framework for general-purpose graphics processing units (GPUs). However, writing optimized CUDA programs is complex even for experts. We propose to develop an automatic restructuring tool that optimizes CUDA programs for computational science and engineering applications. The work comprises the following:

· Identifying the conditions for maximizing utilization of GPU resources and establishing the relationships between the influencing parameters.

· Developing algorithms that explore possible tiling solutions with coalesced memory access and resource optimizations that best meet the identified restructuring specifications. For this, we will tune the constrained GPU parameters, such as memory usage (global and shared memory), the number of blocks, and the number of threads per block, to achieve maximum performance. A restructuring tool (R-CUDA) will be developed to optimize the performance of CUDA programs based on the restructuring specifications.

· Building a 2-D fluid flow simulator based on the Navier-Stokes equations with fixed boundary conditions. The simulator code will be optimized using the above restructuring tool to expose maximum data parallelism in dense and sparse linear algebra solvers.

· Testing the tool extensively for correctness using benchmarks from the LAPACK and BLAS libraries, such as DGEMM, SGEMM, and CAXPY, and using profiling tools such as CUDA Visual Profiler, Parallel Nsight, and TotalView to verify the restructuring specifications. The simulator will be tested and validated using typical test cases.
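The search for a tiling that respects the GPU constraints listed above (shared memory, threads per block) can be sketched as a small feasibility check. This is a hedged illustration in plain C, not the R-CUDA algorithm: the `gpu_limits` struct and the function `largest_square_tile` are hypothetical, and the resource model assumes a tiled matrix multiplication in which a block of T x T threads stages two T x T tiles in shared memory. On a real device, the two limits would be read from the `maxThreadsPerBlock` and `sharedMemPerBlock` fields of `cudaDeviceProp`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for two per-block limits of a GPU. */
typedef struct {
    size_t max_threads_per_block;  /* threads */
    size_t shared_mem_per_block;   /* bytes */
} gpu_limits;

/* Return the largest power-of-two tile width T (up to 64) such that
 *   - a T x T thread block fits the thread limit, and
 *   - two T x T tiles of elem_size bytes fit in shared memory.
 * Returns 0 when no candidate is feasible.
 * Assumption: one thread per output element, two staged tiles. */
size_t largest_square_tile(gpu_limits lim, size_t elem_size)
{
    assert(elem_size > 0);

    size_t best = 0;
    for (size_t t = 1; t <= 64; t *= 2) {
        size_t threads = t * t;
        size_t shared_bytes = 2 * t * t * elem_size;
        if (threads <= lim.max_threads_per_block &&
            shared_bytes <= lim.shared_mem_per_block)
            best = t;
    }
    return best;
}
```

For example, with illustrative limits of 1024 threads and 49152 bytes of shared memory per block and 4-byte elements, the function returns 32; lowering the thread limit to 512 gives 16. A fuller search would also weigh the number of blocks and the data transfer bandwidth, which this sketch leaves out.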

The major outcomes of this project are: (1) an automatic restructuring tool for optimizing the performance of CUDA programs, focusing on dense and sparse linear algebra solvers, (2) an optimized 2-D fluid flow simulator based on the Navier-Stokes equations with fixed boundary conditions, and (3) a research lab in massively parallel computing and a graduate course in computational science and engineering.

Our proposal is to develop a restructuring tool that eases the writing of efficient CUDA programs and to use the tool to parallelize a 2-D fluid flow simulation based on the Navier-Stokes equations. We want to build the expertise and know-how needed to write efficient parallel code for scientific simulators, serving the graduate research program and the oil and gas industry in the Kingdom of Saudi Arabia.

Research Objectives

The objectives are the following:

1. OJ-1: Develop a software tool to ease the task of efficiently programming massively parallel computers. This requires (1) identifying the GPU constraints for maximum performance and establishing the relationships between the influencing parameters for some recent GPUs (Outcome O-1) and (2) designing a parametric algorithm to restructure C-like code into optimized GPU kernels (Outcome O-2).

2. OJ-2: Build an optimized simulator for 2-D/3-D fluid flow. This objective can be achieved through the following three steps: (1) translate the Navier-Stokes equations with fixed boundary conditions from the continuous domain into the discrete domain to expose maximum data parallelism (Outcome O-3), (2) use the above restructuring tool to design optimized parallel fluid flow simulator code (Outcome O-2), and (3) run the parallel code, report its performance, and compare it with results reported by others (Outcome O-3).

3. OJ-3: Publicize the experience gained and release the tool as open source to maximize its use in academia and industry. This can be achieved by publishing a conference paper and a journal paper on the above topics (Outcome O-4).
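Step (1) of OJ-2, moving from the continuous to the discrete domain, can be illustrated with one common building block of grid-based flow solvers: a Jacobi sweep for a 2-D Poisson equation, which arises for instance as the pressure equation in projection methods for incompressible flow. The sketch below is an assumption made for illustration, not the project's discretization: it uses a uniform grid with spacing `h`, a five-point stencil, and fixed (Dirichlet) boundary values, and the function name `jacobi_sweep` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One Jacobi sweep for the discrete Poisson equation
 *   (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - 4 u[i,j]) / h^2 = f[i,j]
 * on an nx x ny row-major grid. Boundary values are copied unchanged,
 * which models fixed boundary conditions. Every interior point reads only
 * u_old, so all interior updates are independent of one another. */
void jacobi_sweep(size_t nx, size_t ny, double h,
                  const double *f, const double *u_old, double *u_new)
{
    assert(nx >= 3 && ny >= 3);

    for (size_t j = 0; j < ny; ++j) {
        for (size_t i = 0; i < nx; ++i) {
            size_t idx = j * nx + i;
            int on_boundary = (i == 0 || j == 0 || i == nx - 1 || j == ny - 1);
            if (on_boundary) {
                u_new[idx] = u_old[idx];
            } else {
                u_new[idx] = 0.25 * (u_old[idx - 1] + u_old[idx + 1] +
                                     u_old[idx - nx] + u_old[idx + nx] -
                                     h * h * f[idx]);
            }
        }
    }
}
```

Because no interior update depends on another update from the same sweep, each grid point could be assigned to its own GPU thread, which is the kind of data parallelism the objective refers to.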

Cooperation

Our team consists of Dr. Mayez Al-Mouhamed (COE), Dr. Adel Ahmad (ICS), Dr. Rashad Ben-Mansour (ME), Mr. Ayaz Khan (CCSE Ph.D. student), and Mr. Anas Almousa (CCSE Ph.D. student). The team cooperates with the KAUST group on enhancing the MAGMA and CUBLAS libraries for the Kepler cluster of GPUs.

Project

An NSTIP research project proposal has been submitted and is awaiting approval.

Publications

Mayez A. Al-Mouhamed and Ayaz Khan, A Restructuring Algorithm for CUDA, Accepted in the International Journal of Parallel, Emergent and Distributed Systems, February, 2013.
Mayez A. Al-Mouhamed and Ayaz Khan, Exploration of Automatic Optimization for CUDA Programming, 2nd IEEE International Conference on Parallel, Distributed and Grid Computing, Jaypee University of Information Technology (IEEE-PDGC), Himachal Pradesh, India, 6 December 2012. This paper has been selected as the "Second Best IEEE-PDGC-2012 Conference Paper" out of 605 paper submissions.
Baqais, M. Assayony, A. Khan, and M. Al-Mouhamed, Bank Conflict-Free Access for CUDA-Based Matrix Transpose Algorithm on GPUs, Accepted in the International Conference on Computer Applications Technology (ICCAT'2013), 22 January, 2013.
A. Khan and M. A. Al-Mouhamed, CUDA-Based Strassen Matrix-Matrix Multiplication, to be submitted.

Some references

Work on Restructuring Compilers, Navier-Stokes Fluid Flow, and Sparse Matrix Solving.

Performance Evaluation of CUDA Parallel Programming on Tesla GPUs, by Mr. Ayaz Khan.

A Restructuring Algorithm for CUDA, M. Al-Mouhamed and A. Khan, May 2012.

An Algorithm for Optimizing CUDA Programs, A. Khan, 15 May, 2012.

Automatic Auto-tuning for GPUs

Sparse Linear Algebra Solver Applications

Link to GPU Library on Sparse Linear Algebra Solver

Research papers on:

Optimized Sparse Matrix Access

Thesis on Sparse Linear Algebra Solvers

Lectures on Linear Solvers

Books on Sparse Linear Algebra

GPU-applications

Applied math course on Linear Solvers: http://faculty.kfupm.edu.sa/MATH/ffairag/math590_111/notes/index.htm

Libraries:

CULA Sparse Reference Manual: http://www.culatools.com/cula_sparse_programmers_guide/ http://www.culatools.com/

GPU v0.2 Linear Solver Library for OpenFOAM: http://www.symscape.com/gpu-0-2-openfoam

General-Purpose Computation on Graphics Hardware: http://gpgpu.org/tag/linear-algebra