Computer Engineering Department
COE 590 SPECIAL TOPICS
PARALLEL PROCESSING ARCHITECTURES
- Instructor:
- Dr. Mayez Al-Mouhamed
- Office:
- 22/325; Tel. # 2934 (office)
- Textbook:
- The Technology of Parallel Processing, A. L. DeCegama,
Prentice-Hall International, latest edition.
Additional material will also be drawn from:
- ``Scalable Parallel Computing'',
K. Hwang and Z. Xu, McGraw-Hill, 1998.
- ``Introduction to Parallel Computing'', V. Kumar, A. Grama, A. Gupta, and
G. Karypis. Benjamin/Cummings Pub. Co., Inc., 1994.
- ``Advanced Computer Architecture'', K. Hwang, McGraw-Hill, 1993.
- ``High-Performance Computer Architecture'', H. Stone, Addison-Wesley.
- ``Highly Parallel Computing'', G. Almasi, A. Gottlieb, Benjamin/Cummings.
- Selected papers from IEEE T.C., IEEE T.P.D.S., the Journal of
Parallel and Distributed Computing (JPDC), IEEE Concurrency, and relevant
IEEE/ACM conferences such as ICPP and PACT.
- Grading Policy:
- Homework (10%),
Exam 1 (20%) on Saturday, February 19,
Exam 2 (20%) on Saturday, April 1,
Final (30%), and Project (20%).
Course Description:
Introduction to parallel processing architectures: sequential, parallel,
pipelined, and dataflow architectures.
Parallel processing in supercomputers and in large-scale,
medium-scale, and small-scale computer systems, with examples.
Cluster computer systems.
High-speed interconnection networks and examples.
Compiler vectorization and optimization, and performance enhancement.
The data parallel paradigm.
Applications. Mapping applications to
parallel architectures; machine-dependent and machine-independent optimizations.
Prerequisite: graduate standing.
Detailed Course Content:
Introduction to parallel processing architectures: sequential, parallel,
pipelined, and dataflow architectures.
Cluster computer systems, and advanced microprocessor architectures
and techniques (post-RISC, multimedia, and VLIW).
Case studies of microprocessors.
High-speed interconnection networks and interfacing, blocking and
non-blocking, self-routing, multicasting, and synchronization.
Case studies of SGI POWERpath-2, GIGAswitch/FDDI, IBM Vulcan, IBM-HPS,
Fiber channel, Gigabit Ethernet, Myrinet for SAN/LAN, SuperHiPPI,
ATM, SCI.
Compiler vectorization and optimization, loop scheduling, and
performance enhancement.
Array partitioning and mapping for the data parallel paradigm.
Parallel processing in supercomputers and in large-scale,
medium-scale, and small-scale computer systems, with examples.
Applications such as FFT, matrix computations, and sorting. Mapping applications to
parallel architectures; machine-dependent and machine-independent optimizations.
- Course Outline
- 1. Introduction to parallel architectures. (1.5 weeks)
- 2. Performance of parallel processing. (1 week)
- 3. Advanced microprocessor architectures. (2 weeks)
- 4. High-speed interconnection networks and synchronization. (2.5 weeks)
- 5. Symmetric multiprocessors, cluster computer systems,
vector architectures, and compiler techniques. (3 weeks)
- 6. Data parallel MPP, array partitioning, mapping, and examples. (2 weeks)
- 7. Parallel algorithms, vectorization, data partitioning,
and parallelizing for shared-memory and distributed-memory systems. (4 weeks)
- Projects
- 1. Design and evaluation of high-speed interconnection networks.
High-speed switches must have a parallel architecture (for throughput) and
a very simple architecture (for minimum switching delays).
One alternative is to study a switch architecture that can achieve
one-to-many communication (multicasting) within the switch itself,
which is very useful for many applications.
Study a few papers on a class of proposed switches, then design a switch
simulator together with a traffic generator for the performance evaluation
of different design approaches; a starting point is sketched below.
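The sketch below is one possible starting point, not a prescribed design: a
slotted-time model of an N x N output-queued switch with Bernoulli arrivals and
uniformly random destinations. The switch size, load, and queueing discipline
are all illustrative assumptions.

/* Minimal slotted-time simulator for an N x N output-queued switch.
 * Illustrative assumptions: Bernoulli arrivals with rate LOAD per input,
 * uniformly random output ports, infinite output queues, and one cell
 * transmitted per output per slot. */
#include <stdio.h>
#include <stdlib.h>

#define N      8        /* switch size            */
#define SLOTS  100000   /* simulated time slots   */
#define LOAD   0.8      /* arrival rate per input */

int main(void)
{
    int qlen[N] = {0};            /* output queue lengths       */
    long delivered = 0;           /* cells leaving the switch   */
    long queued_cells = 0;        /* running sum of queue sizes */

    srand(12345);
    for (long t = 0; t < SLOTS; t++) {
        /* arrivals: each input generates a cell with probability LOAD */
        for (int in = 0; in < N; in++) {
            if ((double)rand() / RAND_MAX < LOAD) {
                int out = rand() % N;   /* uniform destination */
                qlen[out]++;            /* output queueing: no contention loss */
            }
        }
        /* departures: each output serves at most one cell per slot */
        for (int out = 0; out < N; out++) {
            if (qlen[out] > 0) { qlen[out]--; delivered++; }
            queued_cells += qlen[out];
        }
    }
    printf("throughput per output    = %.3f cells/slot\n",
           (double)delivered / (double)(SLOTS * (long)N));
    printf("mean output queue length = %.3f cells\n",
           (double)queued_cells / (double)(SLOTS * (long)N));
    return 0;
}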
- 2. Design of photonic interconnection networks.
High-speed switches must have a parallel architecture (for throughput) and
a very simple architecture (for minimum switching delays). Photonic
switching has the lowest propagation delays and might allow a simpler
design due to modularity. Study a few papers on photonic switching
and identify the achievable photonic components needed for switching.
Cooperate with the instructor in proposing a photonic switch and evaluate
its performance analytically or by simple simulation, for example along
the lines sketched below.
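As one illustrative analytic baseline (not specific to photonic technology),
the classic uniform-traffic analysis of an unbuffered multistage network built
from k x k crossbars uses the stage-by-stage recurrence
p_next = 1 - (1 - p/k)^k, where p is the probability that a link carries a
packet. All parameters below are placeholders.

/* Stage-by-stage throughput estimate for a multistage (banyan-style)
 * network built from k x k crossbars under uniform traffic.
 * p = probability that a link after the current stage carries a packet.
 * Classic recurrence: p_next = 1 - (1 - p/k)^k.
 * k, the number of stages, and the offered load are illustrative. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const int    k      = 2;     /* crossbar size per stage */
    const int    stages = 6;     /* 2^6 = 64-port network   */
    const double load   = 1.0;   /* offered load per input  */

    double p = load;
    for (int s = 1; s <= stages; s++) {
        p = 1.0 - pow(1.0 - p / k, (double)k);
        printf("after stage %d: link utilization = %.4f\n", s, p);
    }
    printf("estimated throughput per input = %.4f\n", p);
    return 0;
}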
- 3. Experiments with cluster computing.
- (a) Design of new communication tools using network programming (Java).
- (b) Comparison of TCP/IP and Ethernet sockets for the design of efficient
parallel communication in LANs.
- (c) Survey of data parallel programming environments (PVM and MPI).
- (d) Parallelization of programs using MPI or PVM.
PCs or workstations interconnected through fast LANs can be logically
organized as a high-performance computer system according to the
MIMD distributed-memory paradigm. One problem is the design of an
efficient library of communication primitives having the least possible
overhead; here, a simple protocol is sufficient because of the low probability
of errors in LANs. Another aspect is to study how to parallelize a specific
computation problem, such as the dense matrix multiply, using well-established
cluster computing kernels like PVM or MPI (see the sketch below).
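A minimal sketch of item (d), assuming a row-block distribution of A, full
replication of B on every rank, and a matrix order divisible by the number of
processes (all simplifications chosen for brevity), could be:

/* Row-block parallel dense matrix multiply C = A * B with MPI.
 * Simplifying assumptions: N is divisible by the number of processes,
 * and B is broadcast in full to every rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                       /* rows owned by each rank */
    double *A = NULL, *C = NULL;
    double *B      = malloc((size_t)N * N    * sizeof(double));
    double *Ablock = malloc((size_t)rows * N * sizeof(double));
    double *Cblock = malloc((size_t)rows * N * sizeof(double));

    if (rank == 0) {                           /* root initializes A and B */
        A = malloc((size_t)N * N * sizeof(double));
        C = malloc((size_t)N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* distribute row blocks of A, replicate B */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Ablock, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* local computation: each rank multiplies its own row block */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += Ablock[i * N + k] * B[k * N + j];
            Cblock[i * N + j] = sum;
        }

    /* gather the result rows back on the root */
    MPI_Gather(Cblock, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0][0] = %f (expected %f)\n", C[0], 2.0 * N);

    free(B); free(Ablock); free(Cblock);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}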
- 4. Investigation of parallel memory organization.
Processor performance is doubling every 18 months, but memory
performance improves at a rate of only about 7% per year. Parallel processing
of multimedia tasks like compression and decompression requires
parallel access (SIMD or VLIW) to different data patterns such as
rows, columns, blocks, strides, and crosses.
Study how to map arrays to parallel memories so that some given data
patterns can be accessed in parallel; we will consider problems
taken from multimedia processing.
Evaluate the performance of such a parallel access scheme and compare it
to current techniques; one classical mapping is sketched below.
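One classical mapping that could serve as a first experiment is skewed storage,
where array element (i, j) is placed in memory module (i + j) mod M. The sketch
below, with an illustrative module count, simply checks which access patterns
fall in distinct modules.

/* Skewed storage for parallel memories: element (i, j) is stored in
 * module (i + j) mod M. With M modules and M-element accesses, a row
 * and a column each touch all M modules (conflict-free), while the
 * main diagonal maps to few modules. Sizes are illustrative. */
#include <stdio.h>

#define M 8   /* number of memory modules (and array dimension) */

static int conflict_free(const int module[M])
{
    int used[M] = {0};
    for (int k = 0; k < M; k++) {
        if (used[module[k]]) return 0;   /* two elements hit the same module */
        used[module[k]] = 1;
    }
    return 1;
}

int main(void)
{
    int row[M], col[M], diag[M];
    for (int k = 0; k < M; k++) {
        row[k]  = (0 + k) % M;   /* row 0:    elements (0, k) */
        col[k]  = (k + 0) % M;   /* column 0: elements (k, 0) */
        diag[k] = (k + k) % M;   /* diagonal: elements (k, k) */
    }
    printf("row access      conflict-free: %s\n", conflict_free(row)  ? "yes" : "no");
    printf("column access   conflict-free: %s\n", conflict_free(col)  ? "yes" : "no");
    printf("diagonal access conflict-free: %s\n", conflict_free(diag) ? "yes" : "no");
    return 0;
}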
- 5. Investigation of a data partitioning tool for the data parallel paradigm.
The data parallel paradigm is intended for massively parallel systems.
It requires partitioning and distributing the arrays used in loops
according to the way they will be accessed at run-time, so that
each data partition is local to one processor.
The problem is to study data dependence analysis of loops, based on which we
should decide how to partition and distribute the arrays to minimize
remote communication. Studying loop transformations that promote
parallelism is another aspect. Since each loop may favor a
different partitioning, we also need to decide on a global
partitioning by using a global optimization method such as
shortest path, a genetic algorithm, or another technique.
Some implementation is needed for performance evaluation; the basic
distributions are illustrated below.
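As a small illustration of the building blocks involved, the owner of each
array element under the usual BLOCK and CYCLIC distributions can be computed as
follows (array length and processor count are placeholders):

/* Owner computation for 1-D array distributions in the data parallel
 * paradigm: BLOCK gives each of P processors a contiguous chunk,
 * CYCLIC deals elements round-robin. N and P are illustrative. */
#include <stdio.h>

#define N 16   /* array length         */
#define P 4    /* number of processors */

static int owner_block(int i)  { return i / ((N + P - 1) / P); }
static int owner_cyclic(int i) { return i % P; }

int main(void)
{
    printf("index :");
    for (int i = 0; i < N; i++) printf(" %2d", i);
    printf("\nblock :");
    for (int i = 0; i < N; i++) printf(" %2d", owner_block(i));
    printf("\ncyclic:");
    for (int i = 0; i < N; i++) printf(" %2d", owner_cyclic(i));
    printf("\n");
    return 0;
}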
- 6. Investigation of static and dynamic scheduling.
Achieving significant parallelism in a high-performance computer
system using the MIMD distributed-memory paradigm requires
efficient scheduling of computations and communications.
This can be done by the compiler if the structure of the
computation is known before run-time (synchronous dataflow).
The main objective is to produce a schedule that maximizes
the amount of overlap between computation and communication.
A few schedulers were studied and proposed by previous COE and ICS MS students.
The student will review this work and become familiar with
the available scheduling packages. Some modification of a scheduler
will be made to incorporate the computation profile in an attempt
to improve performance; the kind of overlap being targeted is illustrated below.
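The kind of overlap such a schedule targets can be illustrated at the program
level with the standard non-blocking message-passing pattern, shown here in MPI
with illustrative sizes; the schedulers themselves operate on task graphs
rather than source code.

/* Overlapping computation and communication with non-blocking MPI.
 * Each rank exchanges one boundary value with its neighbours in a ring
 * while it updates its local points; all sizes are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL 1000   /* local points per rank */

int main(int argc, char **argv)
{
    int rank, size;
    double data[LOCAL], halo = 0.0, send_val;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < LOCAL; i++) data[i] = rank + i * 0.001;

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    send_val  = data[LOCAL - 1];   /* copy so the send buffer stays untouched */

    /* start the boundary exchange ... */
    MPI_Isend(&send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&halo,     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    /* ... and overlap it with computation that does not need the halo */
    for (int i = 1; i < LOCAL; i++)
        data[i] = 0.5 * (data[i] + data[i - 1]);

    /* block only when the received value is actually required */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    data[0] = 0.5 * (data[0] + halo);

    if (rank == 0) printf("data[0] = %f\n", data[0]);
    MPI_Finalize();
    return 0;
}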
- 7. Free project.