King Fahd University of Petroleum & Minerals
College of Computer Sciences and Engineering
Computer Engineering Department
COE 420 Parallel Computing
Cluster Computing (ICS 446)
Parallel Processing Architectures (COE 502)
High-Performance Computing (ICS 573)
- Instructor:
- Dr. Mayez Al-Mouhamed
- Office:
- 22/325; Tel. # 2934 (office), and 3536 (lab).
- Text Book:
- No formal textbook. The instructor will provide
copies of teaching material taken from:
- References:
- Scalable Parallel Computing, K. Hwang and Z. Xu, McGraw-Hill, 1998.
- ``Introduction to Parallel Computing'', V. Kumar, A. Grama, A. Gupta, and
G. Karypis, Benjamin/Cummings Pub. Co., Inc., 1994. ISBN 0805331700.
- J. M. Ortega, Introduction to Parallel and Vector Solution
of Linear Systems, Plenum, 1988.
- T. L. Freeman and C. Phillips, Parallel Numerical Algorithms,
Prentice Hall, 1992.
- G. Golub and J. M. Ortega, Scientific Computing: An Introduction
with Parallel Computing, Academic Press, 1993.
- E. F. Van de Velde, Concurrent Scientific Computing, Springer, 1994.
- Advanced Computer Architecture, K. Hwang, McGraw-Hill, 1993.
- High-Performance Computer Architecture, H. Stone, Addison-Wesley.
- Highly Parallel Computing, G. Almasi, A. Gottlieb, Benjamin/Cummings.
- Grading Policy:
- Option 1 ``No Project'':
Exam 1 (30/100),
Exam 2 (30/100),
Homeworks (10/100), and Final Exam (30/100).
The final exam will be scheduled by the registrar.
- Option 2 ``Project'':
Exam 1 (20/100),
Exam 2 (20/100),
Project (20/100), Homeworks (10/100), and Final Exam (30/100).
- Homework Policy:
- Collaboration with your classmates on homework assignments is
allowed; however, consultation is permitted only with students
currently enrolled in this course. Each student must develop
and write up each homework solution individually. No copying of
solutions from others is allowed, even if the solutions were
obtained through collaboration.
- Attendance:
- Attendance is required of all students.
An official authorized excuse must be presented to the
instructor no later than one week following the absence.
Unexcused absences lead to a ``DEN'' grade.
- Project:
An optional course project can be taken and will be given
the weight of one exam. In this case, the two exams together will
be rated 40/100 and the project will be rated 20/100.
- Course Description:
- Introduction to parallel computing. Parallel architectures,
MIMD, SIMD, interconnection topologies. Performance
measures, speedup, efficiency, limitations of parallel processing.
Parallel programming paradigms, shared memory, message
passing, data parallel, data flow. Parallelizing compiler
techniques, code and data partitioning, vectorization. Parallel
programming environments and tools. Examples of parallel algorithms.
Prerequisite: senior standing.
- Course Outline
- Introduction to parallel computing.
Parallel architectures. (5 lectures)
- Performance measures, speedup, efficiency, and limitations.
(3 lectures)
- Superscalar, Pipelining, VLIW, and multimedia processors. (8 lectures)
- High-speed interconnection networks. (8 lectures)
- Parallel programming paradigms and examples, shared memory,
message passing (cluster), data parallel, and data flow. (8 lectures)
- Parallelizing compiler techniques, code and data partitioning,
vectorization. Parallel programming environments and tools.
Examples of parallel algorithms. (10 lectures)
- Three slots for exams and student presentations.
- Computer Usage:
Programming assignments may be required to determine the speedup when
a program is run on a multiprocessor system rather than on a
uniprocessor system.
- Working Groups:
The instructor encourages the students to work in groups for
reviewing the class lectures, preparing for exams, and
discussing (only) homework problems.
Participants receive bonus grades for these activities.
The organization of these groups is as follows.
Any student with a GPA above 3.0 can be considered a class leader.
Each class leader is encouraged to create a Working Group of
2, 3, or 4 students to review the course material.
A bonus will be given to all members of a Working Group
for each meeting of the group.
Students with a GPA above 3.0 wishing to participate in this
activity should give their name and ID to the instructor.
Students wishing to participate as group members may ask the instructor
about the class leaders and their groups.
The class leader has the responsibility of providing the instructor
with the list of students who attended a meeting. This list should
include the students' names, the date of the meeting, and signatures.
Any student can attend the lecture review meetings with any of
the Class Leaders.
- Optional Projects:
- 1.
- Study of the DLX processor simulator
The student will prepare a refined presentation and a
report within six weeks.
The suggested plan is:
(1) take a code such as the ADI benchmark, write it in C, generate
its DLX code, and collect its run time,
(2) restructure the DLX ADI assembly as best you can, run it,
and compare its performance to the compiled version,
(3) try loop unrolling and compare performance.
You may then propose a method for compiler loop restructuring,
such as the use of loop unrolling.
Explain the compiler approach with respect to a source code
in DLX and give an example.
- 2.
- Study of benchmark programs used for Desktop computers.
The student will prepare a refined presentation and a
report within six weeks.
The suggested plan is:
(1) determine representative benchmarks most commonly used for
Desktop computers,
(2) classify the benchmarks depending on their objectives
such as benchmarking the CPU, the display system, the hard disk,
and the NIC/Network,
(3) acquire the benchmarks stated in (2), run them during
your presentation, comment on the results, and provide a copy of
those benchmarks to the students.
- 3.
- Study of Techniques used for Improving Performance of
Instruction Pipelining
The student will prepare a refined presentation and a
report within six weeks.
The suggested plan is:
(1) survey I-pipelining techniques in recently proposed
micro-architectures (the last ten years),
(2) how structural, data, and control hazards are resolved
and at what level,
(3) main proposed features like branch-prediction, speculation, etc.,
(4) provide a comparison of these micro-architectures with
respect to major aspects especially expected performance and
limitations.
Implement a branch-prediction table with 2-bit history
and evaluate performance by using typical loop structures.
- 4.
- Investigation of Thread-Level Parallelism
The student will prepare a refined presentation and a
report within six weeks.
The suggested plan is:
(1) issues and different schools of instruction-level parallelism
(ILP) in major micro-architectures, with examples,
(2) the main research problems in each category, such as
hazard resolution, branch prediction, speculative execution,
and hardware/compiler trends,
(3) thread-level parallelism (TLP), with internal organization
and examples from known processors,
(4) identify ILP and TLP in VLIW and superscalar processors,
(5) EPIC (Explicitly Parallel Instruction Computing) architectures,
(6) case study of some high-end processors (Intel Merced), and
(7) new advances (Raw architecture, Simultaneous Multithreading
architectures).
Please avoid simple enumeration of approaches; provide
motivation and execution philosophy.
- 5.
- Cluster computing on a network of workstations
- Survey of data parallel programming environments (PVM or MPI).
- Parallelization of programs using MPI or PVM.
The Beowulf is an 8-node network of PCs (Robotics Lab)
interconnected through a fast switched LAN and can be logically
organized as a high-performance computer system according to the
MIMD distributed-memory paradigm.
PVM is a parallel computing kernel that provides a library of
message-passing communication primitives for interprocess
communication based on send/receive operations.
After getting familiar with PVM, the student will study how to
parallelize a specific computation problem like the matrix multiply
or other benchmarks.
The performance of the PVM library is another point of study, for
which the student tests the latency (software and hardware) of these
communication functions by implementing one sender and one receiver
over the CCSE network. The student may write programs in C
including calls to the PVM library for interprocess communication.
The results will address the issue of maximizing the speedup of
the program on the Beowulf parallel machine.