King Fahd University of Petroleum & Minerals
College of Computer Sciences and Engineering
Computer Engineering Department
COE 502 Parallel Processing Architectures
COE 420 Parallel Computing
ICS 446 Cluster Computing
Course Project
Course Projects Working Framework
Some proposed projects
CUDA Programming of Some Linear Algebra Solvers on GPU
Graphic processing Units (GPUs) are gaining ground in high-performance computing especially in arena of Massively Parallel Computing. GPU uses massive multithreading, fast context switching, and high memory bandwidth, and overlapping long-latency loads in stalled threads with computation in other threads (multiple streaming multiprocessors with potentially hundreds of cores). CUDA (an extension to C) is most widely used parallel programming framework for general purpose GPU computations. CUDA is an elegant solution to the problem of representing parallelism in algorithms, not all algorithms but enough to matter. The project is to carry out following steps: (1) get familiar with the most recent GPU architecture, execution model, and programming, (2) Write parallel programs for a few computations like Matrix Multiply, SpMV, reduction in inner-product, and kernel-exit-entry synchronization (3) study of some numerical agoritms for solving system of linear equations, (4) identify parallelism pattern and optimizatization using combined operators, (5) collect performance plots of throughput or execution time versus different scaling of the problem size and the machine size, Here is some suggested material:
Some useful reading papers.
Reference book: Programming Massively Parallel Processors, D. Kirk and W. Hwu, ISBN 978-0-12-381472-2.
Sparse Linear Algebra Solver Applications:
Research papers on:
Libraries :
Applied math course on Linear Solvers: http://faculty.kfupm.edu.sa/MATH/ffairag/math590_111/notes/index.htm
CULA Sparse Reference Manual http://www.culatools.com/cula_sparse_programmers_guide/ http://www.culatools.com/
GPU v0.2 Linear Solver Library for OpenFOAM: http://www.symscape.com/gpu-0-2-openfoam
General-Purpose Computation on Graphics Hardware: http://gpgpu.org/tag/linear-algebra
2. DNA Sequence alignment based on Burrows-Wheeler transform
Bioinformatics is a fast-growing area of modern science with many important applications in various aspects of human life, from detection of genetically-determined diseases and creating genotype-specific medicine to deeper understanding of evolution and life in general. The problem is motivated by the need for searching for homogeneous proteins across species and DNA sequencing itself which involves huge amounts of sequence data that needs to be processed and grows exponentially due to constant progress in sequencing technology. This project addresses the problem of parallelizing the sequence alignment algorithm using Multi-Core computers. First the student investigate developing (1) a fast and scalable algorithm for the exact short sequence aligner and (2) matching algorithm with small memory footprint based on Burrows-Wheeler transform.
Overview of the problem (see paper titled "Sequence Alignment on Massively Parallel Heterogeneous Systems"): We may started with a pairwise local alignment problem. Analysis of existing solutions reveals that one of the main performance issues is memory limitation. It leads to splitting the index into small pieces to fit into GPU memory and repeating search for each part. As search complexity does not depend (or depends very little) on index size, splitting index in chunks increases computation time linearly. Copying index and queries to the device also takes time. To reduce memory consumption a matching algorithm based on Burrows Wheeler Transform can be used. This algorithm is mainly used for data compression, but possibility of its application to pattern matching was recently described. Index based on BWT is more than ten times smaller than index based on suffix array. We checked how well this algorithm fits GPU characteristics and did model implementation to see if we can actually get significantly better execution time with this smaller memory footprint algorithm. A model of possible memory utilization strategies which allowed to find best proportions and succession of memory allocations and data transfers to maximize overall performance is to be developed.
References to DNA Sequence alignment
2. Parallel Programming of the Lattice Bolzman Method for the CFD Simulation
An accurate model for computational fluid dynamics (CFD)
is through the solving of the Navier–Stokes equations. The Lattice Boltzmann
method (LBM) is based on a discrete
equations for solving and simulating the flow of a Newtonian fluid using a
collision model. It consists of subdividing the simulated 2D or 3D space
into volumes, each contains a number of particles set in some predefined
directions, and a streaming and collision processes across a limited number of
particles. Particles jump (lattice) from one volume to the neighboring ones. The
intrinsic particle interactions evince a microcosm of viscous flow behavior
applicable across the greater mass. LBM is a major simplification of the
original Navier-Stikes CFD for any incompressible fluid such as blood in vessels
LBM provides an asymptotic solution which is quite similar to the solution
provided by the accurate Navier-Stokes CFD system. The LBM algorithm is based on
an outer loop on simulation time, in which there is there are two fundamental
phases: (1) Bounce, and (2) Propagation. The project is to write an optimized
CUDA program to implement the LBM approach in 3D by considering some boundary
conditions as well as some initial conditions. A practical soft book is
available with the instructor as well as a sequential simulation program.
A proposed plan is to (1) explore papers on implementing LBM on GPUs including
Nvidia library on LBM simulation, (2) write optimized CUDA/GPU and C/openMp for
LBM simulation, (3) explore algorithm data-layout and parallelization issues on
GPU and MIC, (3) simulate some interesting cases in cooperation with some going
on research at ARAMCO and provide some 3D visualization of the results, and (4)
compare performance with others' published research.
3.. Analyzing techniques for compiler optimization for explicit parallel programs using the OpenMP (OpenMp and Cluster OpenMp):
This project aims at analyzing the present algorithms for these transformations, investigate the correctness of these algorithms and implement and discuss cases where this transformation is used. The project deals (Shigehisa Satoh, Kazuhiro Kusano, Mitsuhisa Sato) with analyzing techniques for compiler optimization for explicit parallel programs using the OpenMP API. To enable optimization across threads, some researchers designed dataflow analysis techniques in which interactions between threads are effectively modeled. This makes the structured description of parallelism and relaxed memory consistency in OpenMP make the analyses effective and efficient. Some algorithms have been developed for reaching definitions analysis, memory synchronization analysis, and cross-loop data dependence analysis for parallel loops. The primary target is compiler-directed software distributed shared memory systems in which aggressive compiler optimizations for software-implemented coherence schemes are crucial to obtaining good performance. Also developed optimizations applicable to general OpenMP implementations, namely redundant barrier removal and privatization of dynamically allocated objects. Experimental results for the coherency optimization show that aggressive compiler optimizations are quite effective for a shared-write intensive program because the coherence-induced communication volume in such a program is much larger than that in shared-read intensive programs.
The student is recommended to see the following advanced papers (OpenMp memory consistency):
4. Parallel Programming of some Digital Forenscic tool on GPUs.
Massive thread parallelism in GPUs may provide tremendous
computational power to accelerate some Digital Forenscic tools. File
carving is an important technique in digital forensics to recover files in cases
where file system metadata is missing or damaged. In this technique, a database
of file headers and footers is used to identify and recover files from raw disk
images, regardless of the underlying file system. A well known file carver is
Scalpel (see paper below).
This project consist of
Surveying existing parallel implementations of
scalpel's algorithm (see the 2nd attached paper as a start) and
develop a CUDA parallel
implementation of the scalpel's algorithm. The students are
expected to submit a conference paper including literature survey, design of
parallel programs, performance evaluation, and comparison to others. The
resulting parallel programs can be potentially tested on the Kepler K20X, Kepler
K80, and the mobile GPU Jetson TK1.
Some useful references:
"Scalpel: A Frugal, High Performance File Carver", Golden G. Richard III and Vassil Roussev.
"Massive threading: Using GPUs to increase the performance of digital forensics tools", Lodovico Marziale, Golden G. Richard III*, Vassil Roussev.
(1)
Study the GPU Cluster System (GPU-CS). The GPU Cluster System consists of two
4U servers, each is a SuperMicro SYS 7048GR-TR workstation. Each 4U server has a
host CPU (actually two Intel processors each is a 6-core) and has 4 Nvidia
Kepler K80 GPUs interconnected using the PCIe. Each 4U server has 8 HDDs, each
has a capacity of 2 TB. The two 4U servers will be interconnected using a single
port infiniband Card and a fiber-optics connector. The GPU-CS is expected to be
connected using G-Ether to KFUPM network, from there any student can remotely
login, run a job, and get results. In other words, the GPU-CS will act as an HPC
cluster system. In total, the GPU CS has 8 GPUs. In this step, the students
are expected to study the architecture of the server and that of the GPUs and
their expected performance.
(2) Configuring the OS, Network
file system and software system to operate as a Single System Image (SSI) for
the above GPU-CS. The students will explore the RAID 5 configuration of the
HDDs on each 4U server, the OS (Ubuntu ver. 15) and work on activating the
infiniband connection between the two 4U servers (Now it does not work).
Following the above steps, you need to cooperate with Zenith Arabia Engineer to
implement the SSI system on the cluster.
(3) Study CUDA programming of GPUs and cooperate with the instructor to test
each of the 8 K80 GPUs in the above cluster to make sure they are all working.
Next, an application can be executed on each GPU for which the execution time
will be collected to ensure the normal operation of the GPUs. The students will
try to enhance the CUDA program in cooperation with the instructor to improve
its speedup and scalability.
The
above work (above three steps) will be documented in a conference paper in
cooperation with the instructor.
5. Parallel Programming of a Linear Algebra Solver
1. The project is on solving linear system equation of the
form of Ax=b, where A is an NxN matrix, b is known vector, and x is the unknown
vector. The objective is to compute x using some iterative linear Algebra solver
like Jacobi (JS) and Conjugate
Gradient (CG) methods. Student carry out literature survey of JS and CG by
searching IEEExplore papers on implementing solvers on GPU, especially the CG
and focus on the solver structure, whether programmed or using a library for
linear algebra, method of parallelization, and performance. JS is described in
our textbook. Write your literature survey on the best 4-5 paper that best
describe the above issue (need to review many papers before selecting the best).
2. Implementation: Carry out dependence analysis of CG using
flow-chart (ask instructor for examples), identify the parallelism pattern, and
try combining operators (if any) to improve data locality and to minimize
overhead due to reduction and synchronization. Study ways for carrying out
intra-block synchronization and inter-block-synchronization (ask instructor).
Implement the iterative CG algorithm using either your own CUDA program or by
calling an optimized library (GEMM or other) for computing basic algebra
operators. Termination of the iterative process is done by testing the
convergence of the solution vector.
3. Performance evaluation: run the CG code using different
problem sizes and report execution time, throughput, and speedup over a
sequential program. You may use a run-time profiler to identify bottleneck and
to improve your parallel program performance. Comment on your performance plots
and provide careful analysis.
Some useful paper titles:
1. Research on the conjugate gradient algorithm with a modified incomplete Cholesky preconditioner on GPU
2. A Comparative Study of Preconditioners for GPU-Accelerated Conjugate Gradient Solver (see library used)
3. Iterative Methods for Sparse Linear Systems on Graphics Processing Unit
4. CUDA-based Linear Solvers for Stable Fluids
5. Automatic Tuning of Sparse Matrix-Vector Multiplication
for CRS format on GPUs
A folder of
some relevent papers
6. Massively Parallel Sparse Matrix Computing
1. Review of sparse matrix formats such as coo, csr, csc,
ell, hyb, bsr, bsrx. Explore IEEExplore papers which describe the above format,
learn about their methodology, comparison, and performance if possible. Write
your literature survey based on selected papers.
2. Review of papers which use the CUSPARSE and CUSP libraries
with gemm and gemv for sparse matrix computing. Note that these approaches aim
at optimization of following factors: (1) SIMD inefficiency due to sparse
structure, (2) reducing the overhead due to irregular memory access due to
sparse representation, and (3) reducing load balancing due to non-uniform sparse
matrix access. Explore IEEExplore papers which use the above libraries and those
which best describe their experience and performance. Write your literature
survey based on selected papers.
1. Implementation: Write code to carry out SpMv operations by
invoking the above optimized libraries using standard Sparse Matrices (to be
provided and downloaded). Code is to use above sparse matrices and invoking
optimized libraries.
2. Performance evaluation: carry out benchmarking of above
libraries by running SpMv (Matrix-Vector) programs which call the libraries.
Report the program execution time versus different problem sizes, throughput,
and comparison to sequential execution. Produce your performance plots and
provide careful analysis of the plots with your observations and comments. As a
result, identify/recommend one profitable methodology for SpMv with some
optimized library.
Some useful references (paper titles):
1. An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
2. A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems
3. Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU
4. Analysis of Sparse Matrix-Vector Multiplication Using Iterative Method in CUDA
5. Iterative Methods for Sparse Linear Systems on Graphics Processing Unit
6. Automatic Tuning of Sparse Matrix-Vector Multiplication for CRS format on GPUs
7. Search on your own.
A folder of
some relevent papers
7. Analysis and Benchmarking of OpenMP Memory Consistency Model:
The student is recommended to see the following papers (OpenMp memory consistency Folder):
4. Parallelization of Large Spanning Tree problems:
8. PARALLEL SIMULATION OF GRAVITATIONAL N-BODY PROBLEM
The Gravitational N-Body Problem: Computational methods that track the motions of bodies interacting with one another, and possibly subject to an external field as well, have been the extensively studied for centuries. These methods were called “N-body” methods and have been applied to problems in astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics. The problem states that: given initial states (position, mass and velocity) of N bodies, compute their states after time interval T. such kind of problems are computational extensive and will gain from the use of HPC in developing approximate solutions that helps in studying such phenomena. The simplest approach to tackle N-Body problem is to iterate over a sequence of small time steps. Within each time step, the acceleration on a body is computed by summing the contribution from each of the other N-1 bodies which is known as brute force algorithm. While this method is conceptually simple, easy to parallelize on HPC, and a choice for many applications, its O(N^2) time complexity make it impractical algorithm for large-scale simulations involving millions of bodies.To reduce the brute force algorithm time complexity, many algorithms has been proposed to get approximated solution for the problem within a reasonable time complexity and acceptable error bounds. These algorithms include Appel [APP85] and Barnes-Hut. It was claimed that Appel’s algorithm run in O(N) and Barnes-Hut (BH) run in O(Nlog(N)) for uniformly distributed bodies around the space. Greengard and Rokhlin developed the fast multi-pole method (FMM) which runs in O(N) time complexity and can be adjusted to give any fixed precision accuracy.
The above algorithms were initially proposed as sequential algorithms. However, with the evolution of vector machines and HPC clusters recently, there were many parallel implementations for these algorithms on different machine architectures. Zhao and Johnsson describe a nonadaptive 3D version of Greengard's algorithm on the Connection Machine CM-2. Salmon [SAL90] implemented the BH algorithm, with multipole approximations, on message passing architectures including the NCUBE and Intel iPSC. Nyland et al. implemented a 3D adaptive FMM with data-parallel methodology in Proteus, an architecture-independent language, and many other implementations on different machines. The objective is to implement the BH algorithm and parallelize it on IBM-e1350 eServer cluster. Our application of BH will be for a special N-Body problem which is the gravitational N-Body or Galaxy evolution problem. First we will implement the sequential algorithm; then, we will parallelize it using OpenMP set of directives with Intel C++ compiler installed on Redhat Linux server. The sequential BH algorithm and different approaches to parallelize it are presented. The implementation of the BH algorithm concentrates on the aspects that allow us to parallelize it easily. Different parallelization techniques included with openMP are investigated. Speedup of the parallel implementation will be outlined together with some future work guidelines.
Some previous student work:
Sample Application Programs
The Matrix Multiplication
Matrix multiplication C = A * B:
for (i = 0; i < N; i++){
for (j = 0; j < N; j++){
sum = 0;
for (k = 0; k < N; k++){
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
}
Jacobi Iterative Method
The Jacobi method is an algorithm for determining the solutions of a system of linear equations with largest absolute values in each row and column dominated by the diagonal element. Each diagonal element is solved for, and an approximate value plugged in. The process is then iterated until it converges. This algorithm is a stripped-down version of the Jacobi transformation method of matrix diagonalization. The method is named after German mathematician Carl Gustav Jakob Jacobi.
The element-based formula is as follows:
This equation can be used as an iteration formula for each of the unknowns to obtain a better approximation. Note that in the Jacobi method computing xk+1 from xk is a perfectly parallel operation: each new element of xk+1 is computed using the values of xk. This can be achieved by allocating new vector (new_x) to store updated values and then copy this vector into x to be used in the next iteration.
#include <stdio.h>
#include <math.h>
#include <unistd.h>
#include <time.h>
#include <sys/time.h>
#include <math.h>
#define MAX_ITER 100
int N = 100;
int main(int argc, char *argv[]){
if(argc > 1)
N = atoi(argv[1]);
double sum;
int i, j, k;
double a[N][N], x[N], b[N], new_x[N];
double dtime=0.0;
struct timeval stv, etv;
int ssec, susec, esec, eusec;
for(i = 0; i < N; i++){
for(j = 0; j < N; j++)
a[i][j] = (i+j+1) * 1.0;
x[i] = b[i] = (i+j) * 1.0;
}
for(k = 0; k < MAX_ITER; k++){
for(i=0; i<N; i++){
sum = -a[i][i] * x[i];
for(j=0; j<N; j++)
sum += a[i][j] * x[j];
new_x[i] = (b[i] - sum)/a[i][i];
}
for(i=0; i < N; i++)
x[i] = new_x[i];
}
return 0;
}
Sample Alternating Direction Integration (ADI) C code.
Fortran Code for ADI. From
“Automatic data layout for distributed-memory machines”, by Ken
Kennedy and Ulrich Kremer. Note loop 2 and 3 can be combined as well as 5
and 6 which produces 4 major loops in the main body.