Computer Engineering Department
COE 590 SPECIAL TOPICS
PARALLEL PROCESSING ARCHITECTURES
- Instructor:
- Dr. Mayez Al-Mouhamed
- Office:
- 22/325; Tel. # 2934 (office)
- Textbook:
- The Technology of Parallel Processing, A. L. DeCegama,
Prentice-Hall International, latest edition.
Additional material will also be drawn from:
- ``Scalable Parallel Computing'',
K. Hwang and Z. Xu, McGraw-Hill, 1998.
- ``Introduction to Parallel Computing'', V. Kumar, A. Grama, A. Gupta, and
G. Karypis. Benjamin/Cummings Pub. Co., Inc., 1994.
- ``Advanced Computer Architecture'', K. Hwang, McGraw-Hill, 1993.
- ``High-Performance Computer Architecture'', H. Stone, Addison-Wesley.
- ``Highly Parallel Computing'', G. Almasi, A. Gottlieb, Benjamin/Cummings.
- Selected papers from IEEE T.C., IEEE T.P.D.S., the Journal of
Parallel and Distributed Computing (JPDC), IEEE Concurrency, and relevant
IEEE/ACM conferences such as ICPP and PACT.
- Grading Policy:
- Homework (10%),
Exam 1 (20%) on Saturday, February 19,
Exam 2 (20%) on Saturday, April 1,
Final (30%), and Project (20%).
Course Description:
Introduction to parallel processing architectures: sequential, parallel,
pipelined, and dataflow architectures.
Parallel processing in supercomputers and in large-scale,
medium-scale, and small-scale computer systems, with examples.
Cluster computer systems.
High-speed interconnection networks and examples.
Compiler vectorization and optimization, and performance enhancement.
The data parallel paradigm.
Applications. Mapping applications to
parallel architectures; machine-dependent and machine-independent optimizations.
Prerequisite: graduate standing.
Detailed Course Content:
Introduction to parallel processing architectures: sequential, parallel,
pipelined, and dataflow architectures.
Cluster computer systems, and advanced microprocessor architectures
and techniques (post-RISC, multimedia, and VLIW).
Case studies of microprocessors.
High-speed interconnection networks and interfacing, blocking and
non-blocking, self-routing, multicasting, and synchronization.
Case studies of SGI POWERpath-2, GIGAswitch/FDDI, IBM Vulcan, IBM-HPS,
Fiber channel, Gigabit Ethernet, Myrinet for SAN/LAN, SuperHiPPI,
ATM, SCI.
Compiler vectorization and optimization, loop scheduling, and
performance enhancement.
Array partitioning and mapping for the data parallel paradigm.
Parallel processing in supercomputers and in large-scale,
medium-scale, and small-scale computer systems, with examples.
Applications such as FFT, matrix computations, and sorting. Mapping applications to
parallel architectures; machine-dependent and machine-independent optimizations.
- Course Outline
- 1. Introduction to parallel architectures. (1.5 weeks)
- 2. Performance of parallel processing. (1 week)
- 3. Advanced microprocessor architectures. (2 weeks)
- 4. High-speed interconnection networks and synchronization. (2.5 weeks)
- 5. Symmetric multiprocessors, cluster computer systems,
vector architectures, and compiler techniques. (3 weeks)
- 6. Data parallel MPP, array partitioning, mapping, and examples. (2 weeks)
- 7. Parallel algorithms, vectorization, data partitioning,
and parallelizing for shared-memory and distributed-memory systems. (4 weeks)
- Projects
- 1. Design and evaluation of high-speed interconnection networks.
High-speed switches must have a parallel architecture (for throughput) and
a very simple architecture (for minimum switching delays).
One alternative is to study a switch architecture that can achieve
one-to-many communication (multicasting) within the switch itself,
which is very useful for many applications.
Study a few papers on a class of proposed switches, then design a switch
simulator together with a traffic generator for the performance evaluation
of different design approaches; a starting point is sketched below.
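The sketch below is one possible starting point, not a prescribed design: a
slotted-time model of an N x N output-queued switch with Bernoulli arrivals and
uniformly random destinations. The switch size, load, and queueing discipline
are all illustrative assumptions.

/* Minimal slotted-time simulator for an N x N output-queued switch.
 * Illustrative assumptions: Bernoulli arrivals with rate LOAD per input,
 * uniformly random output ports, infinite output queues, and one cell
 * transmitted per output per slot. */
#include <stdio.h>
#include <stdlib.h>

#define N      8        /* switch size            */
#define SLOTS  100000   /* simulated time slots   */
#define LOAD   0.8      /* arrival rate per input */

int main(void)
{
    int qlen[N] = {0};            /* output queue lengths       */
    long delivered = 0;           /* cells leaving the switch   */
    long queued_cells = 0;        /* running sum of queue sizes */

    srand(12345);
    for (long t = 0; t < SLOTS; t++) {
        /* arrivals: each input generates a cell with probability LOAD */
        for (int in = 0; in < N; in++) {
            if ((double)rand() / RAND_MAX < LOAD) {
                int out = rand() % N;   /* uniform destination */
                qlen[out]++;            /* output queueing: no contention loss */
            }
        }
        /* departures: each output serves at most one cell per slot */
        for (int out = 0; out < N; out++) {
            if (qlen[out] > 0) { qlen[out]--; delivered++; }
            queued_cells += qlen[out];
        }
    }
    printf("throughput per output    = %.3f cells/slot\n",
           (double)delivered / (double)(SLOTS * (long)N));
    printf("mean output queue length = %.3f cells\n",
           (double)queued_cells / (double)(SLOTS * (long)N));
    return 0;
}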
- 2. Design of photonic interconnection networks.
High-speed switches must have a parallel architecture (for throughput) and
a very simple architecture (for minimum switching delays). Photonic
switching has the lowest propagation delays and might allow a simpler
design due to modularity. Study a few papers on photonic switching
and identify the achievable photonic components needed for switching.
Cooperate with the instructor in proposing a photonic switch and evaluate
its performance analytically or by simple simulation, for example along
the lines sketched below.
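As one illustrative analytic baseline (not specific to photonic technology),
the classic uniform-traffic analysis of an unbuffered multistage network built
from k x k crossbars uses the stage-by-stage recurrence
p_next = 1 - (1 - p/k)^k, where p is the probability that a link carries a
packet. All parameters below are placeholders.

/* Stage-by-stage throughput estimate for a multistage (banyan-style)
 * network built from k x k crossbars under uniform traffic.
 * p = probability that a link after the current stage carries a packet.
 * Classic recurrence: p_next = 1 - (1 - p/k)^k.
 * k, the number of stages, and the offered load are illustrative. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const int    k      = 2;     /* crossbar size per stage */
    const int    stages = 6;     /* 2^6 = 64-port network   */
    const double load   = 1.0;   /* offered load per input  */

    double p = load;
    for (int s = 1; s <= stages; s++) {
        p = 1.0 - pow(1.0 - p / k, (double)k);
        printf("after stage %d: link utilization = %.4f\n", s, p);
    }
    printf("estimated throughput per input = %.4f\n", p);
    return 0;
}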
- 3. Experiments with cluster computing.
- (a) Design of new communication tools using network programming (Java).
- (b) Comparison of TCP/IP and Ethernet sockets for the design of efficient
parallel communication in LANs.
- (c) Survey of data parallel programming environments (PVM and MPI).
- (d) Parallelization of programs using MPI or PVM.
PCs or workstations interconnected through fast LANs can be logically
organized as a high-performance computer system according to the
MIMD distributed-memory paradigm. One problem is the design of an
efficient library of communication primitives having the least possible
overhead; here, a simple protocol is sufficient because of the low probability
of errors in LANs. Another aspect is to study how to parallelize a specific
computation problem, such as the dense matrix multiply, using well-established
cluster computing kernels like PVM or MPI (see the sketch below).
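A minimal sketch of item (d), assuming a row-block distribution of A, full
replication of B on every rank, and a matrix order divisible by the number of
processes (all simplifications chosen for brevity), could be:

/* Row-block parallel dense matrix multiply C = A * B with MPI.
 * Simplifying assumptions: N is divisible by the number of processes,
 * and B is broadcast in full to every rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                       /* rows owned by each rank */
    double *A = NULL, *C = NULL;
    double *B      = malloc((size_t)N * N    * sizeof(double));
    double *Ablock = malloc((size_t)rows * N * sizeof(double));
    double *Cblock = malloc((size_t)rows * N * sizeof(double));

    if (rank == 0) {                           /* root initializes A and B */
        A = malloc((size_t)N * N * sizeof(double));
        C = malloc((size_t)N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* distribute row blocks of A, replicate B */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Ablock, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* local computation: each rank multiplies its own row block */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += Ablock[i * N + k] * B[k * N + j];
            Cblock[i * N + j] = sum;
        }

    /* gather the result rows back on the root */
    MPI_Gather(Cblock, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0][0] = %f (expected %f)\n", C[0], 2.0 * N);

    free(B); free(Ablock); free(Cblock);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}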
- 4. Investigation of parallel memory organization.
Processor performance is doubling every 18 months, but memory
performance improves at a rate of only about 7% per year. Parallel processing
of multimedia tasks like compression and decompression requires
parallel access (SIMD or VLIW) to different data patterns such as
rows, columns, blocks, strides, and crosses.
Study how to map arrays to parallel memories so that some given data
patterns can be accessed in parallel; we will consider problems
taken from multimedia processing.
Evaluate the performance of such a parallel access scheme and compare it
to current techniques; one classical mapping is sketched below.
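One classical mapping that could serve as a first experiment is skewed storage,
where array element (i, j) is placed in memory module (i + j) mod M. The sketch
below, with an illustrative module count, simply checks which access patterns
fall in distinct modules.

/* Skewed storage for parallel memories: element (i, j) is stored in
 * module (i + j) mod M. With M modules and M-element accesses, a row
 * and a column each touch all M modules (conflict-free), while the
 * main diagonal maps to few modules. Sizes are illustrative. */
#include <stdio.h>

#define M 8   /* number of memory modules (and array dimension) */

static int conflict_free(const int module[M])
{
    int used[M] = {0};
    for (int k = 0; k < M; k++) {
        if (used[module[k]]) return 0;   /* two elements hit the same module */
        used[module[k]] = 1;
    }
    return 1;
}

int main(void)
{
    int row[M], col[M], diag[M];
    for (int k = 0; k < M; k++) {
        row[k]  = (0 + k) % M;   /* row 0:    elements (0, k) */
        col[k]  = (k + 0) % M;   /* column 0: elements (k, 0) */
        diag[k] = (k + k) % M;   /* diagonal: elements (k, k) */
    }
    printf("row access      conflict-free: %s\n", conflict_free(row)  ? "yes" : "no");
    printf("column access   conflict-free: %s\n", conflict_free(col)  ? "yes" : "no");
    printf("diagonal access conflict-free: %s\n", conflict_free(diag) ? "yes" : "no");
    return 0;
}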
- 5. Investigation of a data partitioning tool for the data parallel paradigm.
The data parallel paradigm is intended for massively parallel systems.
It requires partitioning and distributing the arrays used in loops
according to the way they will be accessed at run-time, so that
each data partition is local to one processor.
The problem is to study data dependence analysis of loops, based on which we
should decide how to partition and distribute the arrays to minimize
remote communication. Studying loop transformations that promote
parallelism is another aspect. Since each loop may favor a
different partitioning, we also need to decide on a global
partitioning by using a global optimization method such as
shortest path, a genetic algorithm, or another technique.
Some implementation is needed for performance evaluation; the basic
distributions are illustrated below.
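As a small illustration of the building blocks involved, the owner of each
array element under the usual BLOCK and CYCLIC distributions can be computed as
follows (array length and processor count are placeholders):

/* Owner computation for 1-D array distributions in the data parallel
 * paradigm: BLOCK gives each of P processors a contiguous chunk,
 * CYCLIC deals elements round-robin. N and P are illustrative. */
#include <stdio.h>

#define N 16   /* array length         */
#define P 4    /* number of processors */

static int owner_block(int i)  { return i / ((N + P - 1) / P); }
static int owner_cyclic(int i) { return i % P; }

int main(void)
{
    printf("index :");
    for (int i = 0; i < N; i++) printf(" %2d", i);
    printf("\nblock :");
    for (int i = 0; i < N; i++) printf(" %2d", owner_block(i));
    printf("\ncyclic:");
    for (int i = 0; i < N; i++) printf(" %2d", owner_cyclic(i));
    printf("\n");
    return 0;
}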
- 6. Investigation of static and dynamic scheduling.
Achieving significant parallelism in a high-performance computer
system using the MIMD distributed-memory paradigm requires
efficient scheduling of computations and communications.
This can be done by the compiler if the structure of the
computation is known before run-time (synchronous dataflow).
The main objective is to produce a schedule that maximizes
the amount of overlap between computation and communication.
A few schedulers were studied and proposed by previous COE and ICS MS students.
The student will review this work and become familiar with
the available scheduling packages. Some modification of a scheduler
will be made to incorporate the computation profile in an attempt
to improve performance; the kind of overlap being targeted is illustrated below.
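The kind of overlap such a schedule targets can be illustrated at the program
level with the standard non-blocking message-passing pattern, shown here in MPI
with illustrative sizes; the schedulers themselves operate on task graphs
rather than source code.

/* Overlapping computation and communication with non-blocking MPI.
 * Each rank exchanges one boundary value with its neighbours in a ring
 * while it updates its local points; all sizes are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define LOCAL 1000   /* local points per rank */

int main(int argc, char **argv)
{
    int rank, size;
    double data[LOCAL], halo = 0.0, send_val;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < LOCAL; i++) data[i] = rank + i * 0.001;

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    send_val  = data[LOCAL - 1];   /* copy so the send buffer stays untouched */

    /* start the boundary exchange ... */
    MPI_Isend(&send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&halo,     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    /* ... and overlap it with computation that does not need the halo */
    for (int i = 1; i < LOCAL; i++)
        data[i] = 0.5 * (data[i] + data[i - 1]);

    /* block only when the received value is actually required */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    data[0] = 0.5 * (data[0] + halo);

    if (rank == 0) printf("data[0] = %f\n", data[0]);
    MPI_Finalize();
    return 0;
}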
- 7. Free project.