King Fahd University of Petroleum & Minerals
College of Computer Sciences and Engineering
Computer Engineering Department

COE 420 Parallel Computing
Cluster Computing (ICS 446)
Parallel Processing Architectures (COE 502)
High-Performance Computing (ICS 573)



Instructor:
Dr. Mayez Al-Mouhamed
Office:
22/325; Tel. # 2934 (office), and 3536 (lab).
Text Book:
No formal textbook. The instructor will provide copies of teaching material taken from:
Grading Policy:
Option 1 ``No Project'':
Exam 1 (30/100), Exam 2 (30/100), Homeworks (10/100), and Final Exam (30/100).
Option 2 ``Project'':
Exam 1 (20/100), Exam 2 (20/100), Project (20/100), Homeworks (10/100), and Final Exam (30/100).
The final exam will be scheduled by the registrar.

Homework Policy:
Collaboration with your classmates on homework assignments is allowed. However, consultation is permitted only with students who are currently enrolled in this course. Each student must develop and write up each homework solution. No copying of solutions from others is allowed, even if the solutions are obtained as a result of collaboration.

Attendance:
Attendance is required of all students. An excuse for an officially authorized absence must be presented to the instructor no later than one week following the absence. Unexcused absences lead to a ``DEN'' grade.

Project: An optional course project can also be taken and will be given the weight of one exam. In this case, the two exams will together be rated 40/100 (20/100 each) and the project will be rated 20/100.

Course Description:
Introduction to parallel computing. Parallel architectures, MIMD, SIMD, interconnection topologies. Performance measures, speedup, efficiency, limitations of parallel processing. Parallel programming paradigms, shared memory, message passing, data parallel, data flow. Parallelizing compiler techniques, code and data partitioning, vectorization. Parallel programming environments and tools. Examples of parallel algorithms.

Prerequisite: senior standing.

Course Outline
  1. Introduction to parallel computing. Parallel architectures. (5 lectures)
  2. Performance measures, speedup, efficiency, and limitations. (3 lectures)
  3. Superscalar, Pipelining, VLIW, and multimedia processors. (8 lectures)
  4. High-speed interconnection networks. (8 lectures)
  5. Parallel programming paradigms and examples, shared memory, message passing (cluster), data parallel, and data flow. (8 lectures)
  6. Parallelizing compiler techniques, code and data partitioning, vectorization. Parallel programming environments and tools. Examples of parallel algorithms. (10 lectures)
  7. Three slots for exams and student presentations.
    Computer Usage: Programming assignments may be required to determine the speed-up when a program is run on a multiprocessor system rather than a uniprocessor system.
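
    As an illustration, the following minimal C/OpenMP sketch (illustrative only; actual assignments may target other machines or libraries) times the same loop on one processor and on all available processors and reports the resulting speed-up and efficiency:

        /* speedup.c -- a minimal speed-up measurement sketch: the same loop
         * is timed serially and with OpenMP threads, then
         * speedup = T(1)/T(p) and efficiency = speedup/p are reported.
         * Compile: gcc -O2 -fopenmp speedup.c -o speedup */
        #include <stdio.h>
        #include <stdlib.h>
        #include <omp.h>

        #define N 20000000L

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            double *c = malloc(N * sizeof *c);
            for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

            double t0 = omp_get_wtime();            /* serial run */
            for (long i = 0; i < N; i++)
                c[i] = a[i] * b[i] + a[i];
            double t1 = omp_get_wtime() - t0;

            t0 = omp_get_wtime();                   /* parallel run */
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                c[i] = a[i] * b[i] + a[i];
            double tp = omp_get_wtime() - t0;

            int p = omp_get_max_threads();
            printf("T(1)=%.3fs  T(%d)=%.3fs  speedup=%.2f  efficiency=%.2f\n",
                   t1, p, tp, t1 / tp, (t1 / tp) / p);
            free(a); free(b); free(c);
            return 0;
        }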

    Working Groups:
    The instructor encourages students to work in groups for reviewing the class lectures, preparing for exams, and discussing (only) the homework problems. Participants receive bonus grades for these activities. The organization of the groups is as follows. Any student with a GPA above 3.0 can be considered a class leader. Each class leader is encouraged to create a Working Group of 2, 3, or 4 students to review the course material. A bonus will be given to all members of a Working Group for each meeting of the group. Students with a GPA above 3.0 who wish to participate in this activity should give their name and ID to the instructor. Students wishing to participate as group members may ask the instructor about the class leaders and their groups. The class leader has the responsibility of providing the instructor with the list of students who attended each meeting; the list should include the students' names, the date of the meeting, and signatures.


    Any student can attend the lecture review meetings with any of the Class Leaders.

    Optional Projects
    1. Study of the DLX processor simulator
    The students will prepare a refined presentation and a report within six weeks. Please investigate the DLX processor simulator as follows: (1) take a code such as the ADI benchmark, write it in C, generate its DLX code, and collect its run time; (2) restructure the DLX ADI assembly as best you can, run it, and compare its performance with the compiled version; (3) try loop unrolling and compare performance. You may then propose a method for compiler loop restructuring, such as loop unrolling; explain the compiler approach with respect to a DLX source code and give an example.
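
    For step (3), the following minimal C sketch (an illustrative loop, not the ADI benchmark itself) shows a loop and its four-way unrolled form; compiling both to DLX assembly makes the savings in branch and loop-overhead instructions visible:

        /* unroll.c -- loop unrolling sketch: the unrolled loop performs the
         * work of four iterations per loop test, reducing the number of
         * branches and index updates executed. */
        #include <stdio.h>

        #define N 1024                    /* assumed to be a multiple of 4 */

        static double x[N], y[N];

        int main(void)
        {
            for (int i = 0; i < N; i++) x[i] = (double)i;

            /* Original loop: one element per iteration. */
            for (int i = 0; i < N; i++)
                y[i] = 2.0 * x[i] + 1.0;

            /* Same computation unrolled by a factor of 4. */
            for (int i = 0; i < N; i += 4) {
                y[i]     = 2.0 * x[i]     + 1.0;
                y[i + 1] = 2.0 * x[i + 1] + 1.0;
                y[i + 2] = 2.0 * x[i + 2] + 1.0;
                y[i + 3] = 2.0 * x[i + 3] + 1.0;
            }

            printf("y[%d] = %f\n", N - 1, y[N - 1]);
            return 0;
        }
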
    2. Study of benchmark programs used for desktop computers
    The student will prepare a refined presentation and a report within six weeks. The suggested plan is: (1) determine the representative benchmarks most commonly used for desktop computers; (2) classify the benchmarks according to their objectives, such as benchmarking the CPU, the display system, the hard disk, or the NIC/network; (3) acquire the benchmarks identified in (2), run them during your presentation, comment on the results, and provide a copy of the benchmarks to the students.
    3. Study of techniques used for improving the performance of instruction pipelining
    The student will prepare a refined presentation and a report within six weeks. The suggested plan is: (1) survey I-pipelining techniques in recently proposed micro-architectures (last ten years); (2) examine how structural, data, and control hazards are resolved, and at what level; (3) identify the main proposed features, such as branch prediction and speculation; (4) compare these micro-architectures with respect to major aspects, especially expected performance and limitations. In addition, implement a branch-prediction table with 2-bit history and evaluate its performance using typical loop structures.
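
    As a possible starting point for the implementation part, here is a minimal C sketch (one possible design, not a required interface) of a branch-prediction table of 2-bit saturating counters, exercised on the taken/not-taken pattern of a typical loop branch:

        /* bp2bit.c -- 2-bit branch predictor sketch.  Counter states 0,1
         * predict not-taken; states 2,3 predict taken.  The table is indexed
         * by (branch address mod TABLE_SIZE). */
        #include <stdio.h>

        #define TABLE_SIZE 1024

        static unsigned char table[TABLE_SIZE];

        static int predict(unsigned pc) { return table[pc % TABLE_SIZE] >= 2; }

        static void update(unsigned pc, int taken)
        {
            unsigned char *c = &table[pc % TABLE_SIZE];
            if (taken  && *c < 3) (*c)++;
            if (!taken && *c > 0) (*c)--;
        }

        int main(void)
        {
            const unsigned pc = 0x400100; /* hypothetical loop-branch address */
            long correct = 0, total = 0;

            /* The loop body executes 10 times per entry, so the branch is
             * taken 9 times and falls through once; repeat for 1000 entries. */
            for (int run = 0; run < 1000; run++) {
                for (int iter = 0; iter < 10; iter++) {
                    int taken = (iter < 9);
                    correct += (predict(pc) == taken);
                    update(pc, taken);
                    total++;
                }
            }
            printf("prediction accuracy: %.1f%%\n", 100.0 * correct / total);
            return 0;
        }
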

    4. Investigation of thread-level parallelism
    The student will prepare a refined presentation and a report within six weeks. The suggested plan is: (1) the issues or different schools of instruction-level parallelism (ILP) in major micro-architectures, with examples; (2) the main research problems in each category, such as hazard resolution, branch prediction, speculative execution, and hardware/compiler trends; (3) thread-level parallelism (TLP), with internal organization and examples from known processors; (4) identification of ILP and TLP in VLIW and superscalar processors; (5) EPIC (Explicitly Parallel Instruction Computing) architectures; (6) a case study of some high-end processors (Intel Merced); and (7) new advances (the Raw architecture, simultaneous multithreading architectures). Please avoid a simple enumeration of approaches; provide the motivation and execution philosophy of each.

    5. Cluster computing on a network of workstations
    • Survey of data parallel programming environment (PVM or MPI).
    • Parallelization of programs using MPI or PVM.
    The Beowulf is an 8-node network of PCs (Robotics Lab) interconnected through a fast switched LAN; it can be logically organized as a high-performance computer system following the MIMD distributed-memory paradigm. PVM is a parallel computing kernel that provides a library of message-passing communication primitives for interprocess communication based on send/receive. After getting familiar with PVM, the student will study how to parallelize a specific computation, such as matrix multiplication or other benchmarks. The performance of the PVM library is a further point of study, for which the student measures the latency (software and hardware) of these communication functions by implementing one sender and one receiver over the CCSE network. The student may write programs in C that call the PVM library for interprocess communication. The results will address the issue of maximizing the speedup of the program on the Beowulf parallel machine.
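
    As an illustration of the latency test, the following minimal C sketch uses MPI (PVM's pvm_send/pvm_recv calls play the same role) to bounce a one-byte message between two processes and estimate the one-way message latency:

        /* pingpong.c -- message-latency sketch: rank 0 and rank 1 exchange a
         * one-byte message REPS times; half the average round-trip time
         * estimates the one-way latency.
         * Build/run: mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong */
        #include <stdio.h>
        #include <mpi.h>

        #define REPS 1000

        int main(int argc, char **argv)
        {
            int rank;
            char msg = 'x';
            MPI_Status st;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double t0 = MPI_Wtime();
            for (int i = 0; i < REPS; i++) {
                if (rank == 0) {            /* sender side */
                    MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
                } else if (rank == 1) {     /* receiver side */
                    MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                    MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = MPI_Wtime() - t0;

            if (rank == 0)
                printf("one-way latency estimate: %.2f us\n",
                       1e6 * t / (2.0 * REPS));

            MPI_Finalize();
            return 0;
        }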
