King Fahd University of Petroleum & Minerals
College of Computer Sciences and Engineering
Computer Engineering Department

COE 502 Parallel Processing Architectures

COE 420 Parallel Computing

ICS 446 Cluster Computing

Course Project

Course Projects Working Framework

First 1-4 weeks of semester: Students are expected to independently carry out literature Survey on their selected topics, compile information of relevant papers in a progress report PR-1, have an action plan of what they are going to contribution and more specifically on their project implementation, and prepare for a presentation to expose all the above aspects to the class and instructor. During this period, the student is expected to interact with the instructor regarding the major issues of his task while being prepared for his topic and its details. PR-1 is expected to be the outcome of first1-4 weeks. PR-1 is to describe (1) the literature survey (no less than 5 relevant papers) and (2) what the student plan for the implementation phase. The report will be rated 25% of the whole project if it is timely submitted.
From week 5 to week 10, Students are expected to carry out the implementation part of their project by taking into account: (1) his presented or revised action plan, and (2) the instructor observations formulated during or after his presentation. During this period, the student is expected to interact with the instructor regarding the major design issues, difficulties in addressing some parts, resource problems, or any kind of problem that require specific attention to avoid negative impact on the project. A report PR-2 is expected to be the outcome of weeks 5-10. PR-2 is to describe (1) the details of the implementation aspects including the problem investigation, programming, problems encountered, and early results (If any). The above report part will be rated 25% of project of the whole project if it is timely submitted.
From week 11 to week 14, Students are expected to carry out: (1) debugging and testing of their implementation and refer to instructor in case of problems, (2) carry out performance evaluation of the project with data collection, (3) compile PR-1, PR-2, and the performance evaluation with detailed analysis of collected results, produce the Final Report (FR) and prepare the project presentation. During this period, the student is expected to interact with the instructor regarding the debugging issues and performance interpretation. A presentation involving overall project will be delivered by the team (or student) before the last day of classes (week 15). The presentation addresses the literature survey, implementation, and the performance evaluation. This part will be rated 50% of project for both presentation and timely submission of FR. The student is pleased to submit a zipped folder including all of original word reports (no pdf), well documented source code, evaluation data, and reference papers. A demo to the instructor will also be required.

Some proposed projects

CUDA Programming of Some Linear Algebra Solvers on GPU

Graphic processing Units (GPUs) are gaining ground in high-performance computing especially in arena of Massively Parallel Computing. GPU uses massive multithreading, fast context switching, and high memory bandwidth, and overlapping long-latency loads in stalled threads with computation in other threads (multiple streaming multiprocessors with potentially hundreds of cores). CUDA (an extension to C) is most widely used parallel programming framework for general purpose GPU computations. CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms but enough to matter. The project is to carry out following steps: (1) get familiar with the most recent GPU architecture, execution model, and programming, (2) Write parallel programs for a few computations like Matrix Multiply, SpMV, reduction in inner-product, and kernel-exit-entry synchronization (3) study of some numerical agoritms for solving system of linear equations, (4) identify parallelism pattern and optimizatization using combined operators, (5) collect performance plots of throughput or execution time versus different scaling of the problem size and the machine size, Here is some suggested material:

Some useful reading papers.

CUDA Program Optimization Approach

Reference book: Programming Massively Parallel Processors, D. Kirk and W. Hwu, ISBN 978-0-12-381472-2.

Sparse Linear Algebra Solver Applications:

Link to GPU Library on Sparse Linear Algebra Solver

Research papers on:

Optimized Sparse matrix access,

Thesis on Sparse Linear Algebra Solvers

Lectures on Linear Sovers

Books on Sparse Linear Algebra

GPU-applications

CUDA Synchronization and Reduction

High-Level Directives OpenACC and PGI

Libraries :

Applied math course on Linear Solvers: http://faculty.kfupm.edu.sa/MATH/ffairag/math590_111/notes/index.htm

CULA Sparse Reference Manual http://www.culatools.com/cula_sparse_programmers_guide/ http://www.culatools.com/

GPU v0.2 Linear Solver Library for OpenFOAM: http://www.symscape.com/gpu-0-2-openfoam

General-Purpose Computation on Graphics Hardware: http://gpgpu.org/tag/linear-algebra

2. DNA Sequence alignment based on Burrows-Wheeler transform

Bioinformatics is a fast-growing area of modern science with many important applications in various aspects of human life, from detection of genetically-determined diseases and creating genotype-specific medicine to deeper understanding of evolution and life in general. The problem is motivated by the need for searching for homogeneous proteins across species and DNA sequencing itself which involves huge amounts of sequence data that needs to be processed and grows exponentially due to constant progress in sequencing technology. This project addresses the problem of parallelizing the sequence alignment algorithm using Multi-Core computers. First the student investigate developing (1) a fast and scalable algorithm for the exact short sequence aligner and (2) matching algorithm with small memory footprint based on Burrows-Wheeler transform.

Overview of the problem (see paper titled "Sequence Alignment on Massively Parallel Heterogeneous Systems"): We may started with a pairwise local alignment problem. Analysis of existing solutions reveals that one of the main performance issues is memory limitation. It leads to splitting the index into small pieces to fit into GPU memory and repeating search for each part. As search complexity does not depend (or depends very little) on index size, splitting index in chunks increases computation time linearly. Copying index and queries to the device also takes time. To reduce memory consumption a matching algorithm based on Burrows Wheeler Transform can be used. This algorithm is mainly used for data compression, but possibility of its application to pattern matching was recently described. Index based on BWT is more than ten times smaller than index based on suffix array. We checked how well this algorithm fits GPU characteristics and did model implementation to see if we can actually get significantly better execution time with this smaller memory footprint algorithm. A model of possible memory utilization strategies which allowed to find best proportions and succession of memory allocations and data transfers to maximize overall performance is to be developed.

References to DNA Sequence alignment

2. Parallel Programming of the Lattice Bolzman Method for the CFD Simulation

An accurate model for computational fluid dynamics (CFD) is through the solving of the Navier–Stokes equations. The Lattice Boltzmann method (LBM) is based on a discrete equations for solving and simulating the flow of a Newtonian fluid using a collision model. It consists of subdividing the simulated 2D or 3D space into volumes, each contains a number of particles set in some predefined directions, and a streaming and collision processes across a limited number of particles. Particles jump (lattice) from one volume to the neighboring ones. The intrinsic particle interactions evince a microcosm of viscous flow behavior applicable across the greater mass. LBM is a major simplification of the original Navier-Stikes CFD for any incompressible fluid such as blood in vessels LBM provides an asymptotic solution which is quite similar to the solution provided by the accurate Navier-Stokes CFD system. The LBM algorithm is based on an outer loop on simulation time, in which there is there are two fundamental phases: (1) Bounce, and (2) Propagation. The project is to write an optimized CUDA program to implement the LBM approach in 3D by considering some boundary conditions as well as some initial conditions. A practical soft book is available with the instructor as well as a sequential simulation program. A proposed plan is to (1) explore papers on implementing LBM on GPUs including Nvidia library on LBM simulation, (2) write optimized CUDA/GPU and C/openMp for LBM simulation, (3) explore algorithm data-layout and parallelization issues on GPU and MIC, (3) simulate some interesting cases in cooperation with some going on research at ARAMCO and provide some 3D visualization of the results, and (4) compare performance with others' published research. The student may search the IeeExplore for papersa on LBM applications on GPUs.

3.. Analyzing techniques for compiler optimization for explicit parallel programs using the OpenMP (OpenMp and Cluster OpenMp):

This project aims at analyzing the present algorithms for these transformations, investigate the correctness of these algorithms and implement and discuss cases where this transformation is used. The project deals (Shigehisa Satoh, Kazuhiro Kusano, Mitsuhisa Sato) with analyzing techniques for compiler optimization for explicit parallel programs using the OpenMP API. To enable optimization across threads, some researchers designed dataflow analysis techniques in which interactions between threads are effectively modeled. This makes the structured description of parallelism and relaxed memory consistency in OpenMP make the analyses effective and efficient. Some algorithms have been developed for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. The primary target is compiler-directed software distributed shared memory systems in which aggressive compiler optimizations for software-implemented coherence schemes are crucial to obtaining good performance. Also developed optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. Experimental results for the coherency optimization show that aggressive compiler optimizations are quite effective for a shared-write intensive program because the coherence-induced communication volume in such a program is much larger than that in shared-read intensive programs.

The student is recommended to see the following advanced papers (OpenMp memory consistency):

Incorporation of OpenMP Memory Consistency into Conventional Dataflow Analysis by Ayon Basumallik and Rudolf Eigenmann
Midkiff, S.P., Lee, J., Padua, D.A.: A compiler for multiple memory models. Concurrency and Computation: Practice and Experience 16, 197–220 (2004)
Lin, Y., Terboven, C., an Mey, D., Copty, N.: Automatic Scoping of Variables in Parallel Regions of an OpenMP Program. In: Chapman, B.M. (ed.) WOMPAT 2004. LNCS, vol. 3349, pp. 83–97. Springer, Heidelberg (2005)
Hoeflinger, J., de Supinski, B.: The OpenMP Memory Model. In: Proceedings of the first International Workshop on OpenMP (IWOMP 2005) (2005)
Bronevetsky, G., de Supinski, B.: Complete Formal Specification of the OpenMP Memory Model. In: Proceedings of the second International Workshop on OpenMP (IWOMP 2006) (2006)
Compiler optimization techniques for OpenMP programs, by Shigehisa Satoh, Kazuhiro Kusano, Mitsuhisa Sato.
The student is pleased to see also all recent papers form all of the above authors,
Action plan is required.
The students will prepare presentations as stated in the Course Projects Working Framework.

4. Parallel Programming of some Digital Forenscic tool on GPUs.

Massive thread parallelism in GPUs may provide tremendous computational power to accelerate some Digital Forenscic tools. File carving is an important technique in digital forensics to recover files in cases where file system metadata is missing or damaged. In this technique, a database of file headers and footers is used to identify and recover files from raw disk images, regardless of the underlying file system. A well known file carver is Scalpel (see paper below).

This project consist of Surveying existing parallel implementations of scalpel's algorithm (see the 2nd attached paper as a start) and
develop a CUDA parallel implementation of the scalpel's algorithm. The students are expected to submit a conference paper including literature survey, design of parallel programs, performance evaluation, and comparison to others. The resulting parallel programs can be potentially tested on the Kepler K20X, Kepler K80, and the mobile GPU Jetson TK1.

Some useful references:

"Scalpel: A Frugal, High Performance File Carver", Golden G. Richard III and Vassil Roussev.
"Massive threading: Using GPUs to increase the performance of digital forensics tools", Lodovico Marziale, Golden G. Richard III*, Vassil Roussev.

5. Designing of a GPU Cluster and exploring Applications

This project deals with exploring GPU cluster configuration, OS and network file system, and study CUDA programming on a cluer of GPUs:

(1) Study the GPU Cluster System (GPU-CS). The GPU Cluster System consists of two 4U servers, each is a SuperMicro SYS 7048GR-TR workstation. Each 4U server has a host CPU (actually two Intel processors each is a 6-core) and has 4 Nvidia Kepler K80 GPUs interconnected using the PCIe. Each 4U server has 8 HDDs, each has a capacity of 2 TB. The two 4U servers will be interconnected using a single port infiniband Card and a fiber-optics connector. The GPU-CS is expected to be connected using G-Ether to KFUPM network, from there any student can remotely login, run a job, and get results. In other words, the GPU-CS will act as an HPC cluster system. In total, the GPU CS has 8 GPUs. In this step, the students are expected to study the architecture of the server and that of the GPUs and their expected performance.

(2) Configuring the OS, Network file system and software system to operate as a Single System Image (SSI) for the above GPU-CS. The students will explore the RAID 5 configuration of the HDDs on each 4U server, the OS (Ubuntu ver. 15) and work on activating the infiniband connection between the two 4U servers (Now it does not work). Following the above steps, you need to cooperate with Zenith Arabia Engineer to implement the SSI system on the cluster.

(3) Study CUDA programming of GPUs and cooperate with the instructor to test each of the 8 K80 GPUs in the above cluster to make sure they are all working. Next, an application can be executed on each GPU for which the execution time will be collected to ensure the normal operation of the GPUs. The students will try to enhance the CUDA program in cooperation with the instructor to improve its speedup and scalability.

The above work (above three steps) will be documented in a conference paper in cooperation with the instructor.

5. Parallel Programming of a Linear Algebra Solver

Here are some syggested steps:

1. The project is on solving linear system equation of the form of Ax=b, where A is an NxN matrix, b is known vector, and x is the unknown vector. The objective is to compute x using some iterative linear Algebra solver like Jacobi (JS) and Conjugate Gradient (CG) methods. Student carry out literature survey of JS and CG by searching IEEExplore papers on implementing solvers on GPU, especially the CG and focus on the solver structure, whether programmed or using a library for linear algebra, method of parallelization, and performance. JS is described in our textbook. Write your literature survey on the best 4-5 paper that best describe the above issue (need to review many papers before selecting the best).

2. Implementation: Carry out dependence analysis of CG using flow-chart (ask instructor for examples), identify the parallelism pattern, and try combining operators (if any) to improve data locality and to minimize overhead due to reduction and synchronization. Study ways for carrying out intra-block synchronization and inter-block-synchronization (ask instructor). Implement the iterative CG algorithm using either your own CUDA program or by calling an optimized library (GEMM or other) for computing basic algebra operators. Termination of the iterative process is done by testing the convergence of the solution vector.

3. Performance evaluation: run the CG code using different problem sizes and report execution time, throughput, and speedup over a sequential program. You may use a run-time profiler to identify bottleneck and to improve your parallel program performance. Comment on your performance plots and provide careful analysis.

Some useful paper titles:

1. Research on the conjugate gradient algorithm with a modified incomplete Cholesky preconditioner on GPU

2. A Comparative Study of Preconditioners for GPU-Accelerated Conjugate Gradient Solver (see library used)

3. Iterative Methods for Sparse Linear Systems on Graphics Processing Unit

4. CUDA-based Linear Solvers for Stable Fluids

5. Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS format on GPUs

A folder of some relevent papers

Another folder

6. Massively Parallel Sparse Matrix Computing

Here are some syggested steps:

1. Review of sparse matrix formats such as coo, csr, csc, ell, hyb, bsr, bsrx. Explore IEEExplore papers which describe the above format, learn about their methodology, comparison, and performance if possible. Write your literature survey based on selected papers.

2. Review of papers which use the CUSPARSE and CUSP libraries with gemm and gemv for sparse matrix computing. Note that these approaches aim at optimization of following factors: (1) SIMD inefficiency due to sparse structure, (2) reducing the overhead due to irregular memory access due to sparse representation, and (3) reducing load balancing due to non-uniform sparse matrix access. Explore IEEExplore papers which use the above libraries and those which best describe their experience and performance. Write your literature survey based on selected papers.

1. Implementation: Write code to carry out SpMv operations by invoking the above optimized libraries using standard Sparse Matrices (to be provided and downloaded). Code is to use above sparse matrices and invoking optimized libraries.

2. Performance evaluation: carry out benchmarking of above libraries by running SpMv (Matrix-Vector) programs which call the libraries. Report the program execution time versus different problem sizes, throughput, and comparison to sequential execution. Produce your performance plots and provide careful analysis of the plots with your observations and comments. As a result, identify/recommend one profitable methodology for SpMv with some optimized library.

Some useful references (paper titles):

1. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data

2. A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems

3. Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU

4. Analysis of Sparse Matrix-Vector Multiplication Using Iterative Method in CUDA

5. Iterative Methods for Sparse Linear Systems on Graphics Processing Unit

6. Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS format on GPUs

7. Search on your own.

A folder of some relevent papers

Another folder

7. Analysis and Benchmarking of OpenMP Memory Consistency Model:

Analysis and Benchmarking of the Memory Consistency Model of the IBM 1350 cluster using the Lazy Release Consistency model (Intel) which is derived from TreadMarks, a homeless sDSM. For benchmarking the Cluster OpenMp, it is highly recommended to use the Gaussian03 code which is based on the Linda object based DSM system on distributed memory systems and OpenMPon shared memory systems (Use of Cluster OpenMP with the Gaussian Quantum Chemistry Code- A ... , For the parallel benchmark codes, please see Jie Cai's page at http://cs.anu.edu.au/~Jie.Cai/, and
Improving performance using compiler-knowledge of the Memory Consistency Model. Memory consistency work is the major overhead of cluster enabled OpenMP systems, which sometimes dominates the overall performance. Page fault detection and servicing are the major activities for the memory consistency work.The major overhead of the sDSM system is the cost of servicing the page faults according to the memory consistency model. Program memory partitioned into local and global address spaces. Global pages are protected by using mprotect. In cluster OpenMp, Barriers, locks and flush operations lead to inter-process communication and re-set of page protection. The cluster keep its global memory consistent by detecting and servicing different types of page-faults. For this, the use of SEGVprof profiling tool is very useful to count the number of page faults during execution.

The student is recommended to see the following papers (OpenMp memory consistency Folder):

4. Parallelization of Large Spanning Tree problems:

See Parallelization of Spanning Tree Algorithm

8. PARALLEL SIMULATION OF GRAVITATIONAL N-BODY PROBLEM

The Gravitational N-Body Problem: Computational methods that track the motions of bodies interacting with one another, and possibly subject to an external field as well, have been the extensively studied for centuries. These methods were called “N-body” methods and have been applied to problems in astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics. The problem states that: given initial states (position, mass and velocity) of N bodies, compute their states after time interval T. such kind of problems are computational extensive and will gain from the use of HPC in developing approximate solutions that helps in studying such phenomena. The simplest approach to tackle N-Body problem is to iterate over a sequence of small time steps. Within each time step, the acceleration on a body is computed by summing the contribution from each of the other N-1 bodies which is known as brute force algorithm. While this method is conceptually simple, easy to parallelize on HPC, and a choice for many applications, its O(N^2) time complexity make it impractical algorithm for large-scale simulations involving millions of bodies.To reduce the brute force algorithm time complexity, many algorithms has been proposed to get approximated solution for the problem within a reasonable time complexity and acceptable error bounds. These algorithms include Appel [APP85] and Barnes-Hut. It was claimed that Appel’s algorithm run in O(N) and Barnes-Hut (BH) run in O(Nlog(N)) for uniformly distributed bodies around the space. Greengard and Rokhlin developed the fast multi-pole method (FMM) which runs in O(N) time complexity and can be adjusted to give any fixed precision accuracy.

The above algorithms were initially proposed as sequential algorithms. However, with the evolution of vector machines and HPC clusters recently, there were many parallel implementations for these algorithms on different machine architectures. Zhao and Johnsson describe a nonadaptive 3D version of Greengard's algorithm on the Connection Machine CM-2. Salmon [SAL90] implemented the BH algorithm, with multipole approximations, on message passing architectures including the NCUBE and Intel iPSC. Nyland et al. implemented a 3D adaptive FMM with data-parallel methodology in Proteus, an architecture-independent language, and many other implementations on different machines. The objective is to implement the BH algorithm and parallelize it on IBM-e1350 eServer cluster. Our application of BH will be for a special N-Body problem which is the gravitational N-Body or Galaxy evolution problem. First we will implement the sequential algorithm; then, we will parallelize it using OpenMP set of directives with Intel C++ compiler installed on Redhat Linux server. The sequential BH algorithm and different approaches to parallelize it are presented. The implementation of the BH algorithm concentrates on the aspects that allow us to parallelize it easily. Different parallelization techniques included with openMP are investigated. Speedup of the parallel implementation will be outlined together with some future work guidelines.

Some previous student work:

Parallel Simulation of the Gravitational N-Body Problem

Dynamic load balancing based on double exponential thread time forcasting

Sample Application Programs

The Matrix Multiplication

Matrix multiplication C = A * B:

for (i = 0; i < N; i++){

     for (j = 0; j < N; j++){

           sum = 0;

           for (k = 0; k < N; k++){

                sum += A[i][k] * B[k][j];

           }

           C[i][j] = sum;

     }

     }

}

Jacobi Iterative Method

The Jacobi method is an algorithm for determining the solutions of a system of linear equations with largest absolute values in each row and column dominated by the diagonal element. Each diagonal element is solved for, and an approximate value plugged in. The process is then iterated until it converges. This algorithm is a stripped-down version of the Jacobi transformation method of matrix diagonalization. The method is named after German mathematician Carl Gustav Jakob Jacobi.

The element-based formula is as follows:

This equation can be used as an iteration formula for each of the unknowns to obtain a better approximation. Note that in the Jacobi method computing xk+1 from xk is a perfectly parallel operation: each new element of xk+1 is computed using the values of xk. This can be achieved by allocating new vector (new_x) to store updated values and then copy this vector into x to be used in the next iteration.

Sequential Implementation:

#include <stdio.h>

#include <math.h>

#include <unistd.h>

#include <time.h>

#include <sys/time.h>

#include <math.h>

#define MAX_ITER 100

int N = 100;

int main(int argc, char *argv[]){

if(argc > 1)

N = atoi(argv[1]);

double sum;

int i, j, k;

double a[N][N], x[N], b[N], new_x[N];

double dtime=0.0;

struct timeval stv, etv;

int ssec, susec, esec, eusec;

for(i = 0; i < N; i++){

for(j = 0; j < N; j++)

a[i][j] = (i+j+1) * 1.0;

x[i] = b[i] = (i+j) * 1.0;

}

for(k = 0; k < MAX_ITER; k++){

for(i=0; i<N; i++){

sum = -a[i][i] * x[i];

for(j=0; j<N; j++)

sum += a[i][j] * x[j];

new_x[i] = (b[i] - sum)/a[i][i];

}

for(i=0; i < N; i++)

x[i] = new_x[i];

}

return 0;

}

Alternating Direction Implicit (ADI)

Sample Alternating Direction Integration (ADI) C code.

Fortran Code for ADI. From “Automatic data layout for distributed-memory machines”, by Ken Kennedy and Ulrich Kremer. Note loop 2 and 3 can be combined as well as 5 and 6 which produces 4 major loops in the main body.