coe501

King Fahd University of Petroleum & Minerals
College of Computer Sciences and Engineering
Computer Engineering Department
COE 501 Computer Architecture
Course Syllabus

Instructor: Professor Mayez Al-Mouhamed (Email: mayez@ccse.kfupm.sa.edu)
Office: Room 22-400-3 (Tel. 2934) and Lab 22-339 (Tel. 3536).
Office hours: S.M.W from 9-10 am, U.T. from 10-11 am, and by appointment.
Text Book and references:
- Computer architecture: a quantitative approach, Hennessy and Patterson, Third edition, Morgan Kaufnam Publishers, Inc.
- Selected papers from IEEE T.C., IEEE T.P.D.S., etc. in addition to the following books:
  - D.E. Culler and J.P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1999, ISBN 1-55860-343-3.
  - Computer Architecture and Parallel Processing, K. Hwang and F. Briggs, Mc-Graw-Hill.
  - Readings in Computer Architecture, Mark Hill (Editor), Norman Jouppi (Editor), Gurindar Sohi (Editor), Morgan Kaufmann Publishing Co., Menlo Park, CA. 1999
  - Computer Organization and Design: The Hardware / Software Interface", Patterson and Hennessy, Morgan-Kaufmann, 1998.
  - M.J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill, New York, NY, 2004, ISBN 0-07-282256-2.
  - P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1997, ISBN 1-55860-339-5.
  - J.M. May, Parallel I/O for High Performance Computing, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2001, ISBN 1-55860-664-5.
  - K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, 1998, ISBN 0070317984.
  - A.E. Koniges (ed), Industrial Strength Parallel Computing, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000, ISBN 1-55860-540-1.
  - G.R. Andrews, Multithreaded, Parallel, and Distributed Programming,Addison-Wesley, Reading, Mass., 2000, ISBN 0-201-35752-6.
  - G.F. Pfister, In Search of Clusters, 2nd Edition, PTR Prentice-Hall, Upper Saddle River, NJ, 1998, ISBN 0-13-899709-8.
  - Kai Hwang, Advanced Computer Architecture - Parallelism, Scalability, Programmability, McGraw-Hill, New York, NY, 1993, ISBN 0-07-031622-8.

Grading Policy: Exam 1: 20/100 (April 5, 2009) , Exam 2: 20/100 (May 24, 2009), Course project: 20/100, Homework: 10/100, and Final Exam: 30/100 (scheduled by the registrar).

Attendance: attendance is required by all students. Excuse for official authorized must be presented to the instructor no later than one week following the absence. Unexcused absences lead to a ``DEN'' grade.

Course Description: Classification of computer systems, architectural developments, computer performance. Linear and nonlinear pipeline design, instruction and arithmetic pipeline, superscalar. Memory hierarchy, cache and virtual memories, cache coherence, memory system performance. Parallel architectures, performance measures, SIMD and MIMD architectures. Interconnection networks.
Pre-requisite: graduate standing.

Course Outline:

Introduction (5 lectures)
Classification of computer systems, architectural developments, cost and trends, computer performance, measuring CPU time, CPI, MIPS, FLOPS, and computer benchmark suites, Examples of ISA. (Chapter 1 and 2).

Pipelining (15 lectures)
ILP, instruction pipelining, hazards, dynamic branch prediction, Tomasulo dynamic execution, speculative execution and limitation, multiple issue superscalar processors, and examples. Software approaches to ILP, static branch prediction, VlIW versus dynamic execution, compiler support for ILP, unrolling, software pipelining, dependence analysis, and examples. (Chapters 3 and 4)

Memory system (12 lectures)
Memory hierarchy, cache memory organization, cache algorithms, performance of memory, improving memory performance, virtual memories, and examples. (Chapters 5)

Parallel architectures (10 lectures)
Classification of parallel computers, SIMD architectures, MIMD shared-memory and distributed-memory architectures, programming, performance, SPMD, and synchronization and barriers. Coherence protocols. Interconnection networks, examples, and performance. (Chapters 6 and 8)

Miscellaneous (midterm and presentations) (3 lectures)

List of projects:

Study of the DLX simulator
Please investigate the DLX simulator with respect to (1) take a code like ADI benchmark, write it in C, generate its code in DLX, and collect its run time. (2) restructure the DLX ADI assembly to the best you can, run it, and compare its performance to the compiled version, (3) Try loop unrolling and compare performance. Now you may advice a method for compiler loop restructuring such as the use of loop unrolling. Explain the compiler approach with respect to a source code in DLX and give example. The students will prepare for a presentation within four weeks.
Suggested implementation:
1. Consider a set (say 5) of small programs (computation loops) like the (1) Alternating Direction Integration (ADI), (2) the Red-Black Relaxation loop, and (3) 2D Matrix Multiply, etc.
2. Carry out recurrence analysis of each loop and classify it as LID, B-LCD, and F-LCD.
3. restructure the loop to promote its performance in the DLX instruction pipeline using the unrolling and software pipelining.
4. Run each of the above program without restructuring, with loop unrolling (possibly different configurations), and software pipelining (possibly different configurations).
5. Report performance at all levels after collecting execution time as number of clocks, compare performance for each restructuring approach, and present your discussion and justification.
6. Suggest formal restructuring rules for loop unrolling and software pipelining.

Study of Benchmarks programs used for Desktop computers.

For the first four weeks the students will survey (1) determine representative types of benchmarks used for Desktop computers, (2) classify the benchmarks depending on their objectives, (3) acquire some available benchmarks for running them and demonstration to the class, and (4) classify some known computers based on their benchmarks performance. The suggested plan is: (1) determine representative benchmarks most commonly used for Desktop computers, (2) classify the benchmarks depending on their objectives such as benchmarking the CPU, the display system, the hard disk, and the NIC/Network, (3) acquire benchmarks as stated in (2), run them during your presentation, comments on the results, and provide a copy of those benchmarks to the students. Define clearly what is benchmarked and how the benchmark is working and provide the meaning of the output, its interpretation and justification. The students will prepare for a presentation within four weeks.
Suggested implementation:
1. Find available or public domain benchmarks that allows assessing the local performance of a specific issue of Computer Architecture such as the following arranged in descending order of priority: (1) the CPU capability and its instruction pipeline, (2) the SIMD multimedia extension (MMX), (3) the hierarchical memory system, (4) the graphics capability and its AGP card, (5) the stack protocol, NIC, and network delays, (6) etc. For each considered benchmark we must have information on its load profile and clear interpretation of its output. Avoid benchmarks for which we only have a generic meaning of the outcome.
2. Run the above benchmarks on specific PCs with the idea of accessing the performance of its architectural implementation not just displaying the output performance of the benchmark. Assess the data in (1) comparative way (output from different benchmarks and for different PCs), and (2) critical thinking (matching obtained performance with known concepts about the specific PC architecture).
3. Provide copy of the studied benchmarks and the report. Write a report about the above experiments, comparative tables, discussions, and attach the benchmarks or indicate where to find them.

Study of Techniques used for Improving Performance of Instruction Pipelining. For the first four weeks the students carry out (1) Surveying of I-pipelining techniques in at least two recent processors, (2) dynamic execution models, (3) methods of resolution of hazards, and (4) compiler techniques to I-pipelining. The suggested plan is: (1) Surveying of I-pipelining techniques in recently proposed micro-architectures (last ten years), (2) how structural, data, and control hazards are resolved and at what level, (3) main proposed features like branch-prediction, speculation, etc., (4) provide a comparison of these micro-architectures with respect to major aspects especially expected performance and limitations. Implement a branch-prediction table with 2-bit history and evaluate performance by using typical loop structures. The students will prepare for a presentation within four weeks.
Suggested implementation. Select on of the following implementations:
1. Select one branch prediction algorithm from a recent research paper and implement it. Carry out performance evaluation by using a statistically generated branch behavior. For example, the prob. and instruction is CB is 0.2 and that the branch has a Set of statistical T/NT patterns (for example pattern 1110010101 prob. taken is 0.6). Evaluate performance of the used branch prediction and suggest some improvement.
2. Select one correlating branch prediction algorithm from a recent research paper and implement it. Carry out performance evaluation by using a statistically generated branch behavior. For example, set up a a set of correlation rules between 3 or more CBs that are hidden to the computer. Each correlation element indicate the T/NT of each of the CBs. You may randomly a given element at a start of first CB. The problem is to evaluate the performance of the algorithm when the set correlated branches behaves as above. Study performance (correct prediction as compared incorrect ones) and suggest some algorithm tuning.
3. Implement a software tool (compiler) that takes a loop with a few statement in the body as input and (1) carry out data dependence analysis, (2) check the profitability of loop unrolling and software pipelining under specific dependence conditions, (3) carry out the needed restructuring and write the restructured loop in an output file. Discuss the decision rules used and show input and output loops.

Design of a simulator for a reconfigurable pipelined dynamic execution processor. Dynamic execution consists of an issue unit that adopts an approach to resolve resource conflicts and data hazards (RAW and WAR) as well as providing a scheme for sending produced data out of ALUs to potential consumers (instructions). Pipelining means that the state of the machine can be tracked by a clock to determine availability of data and resources. The project is on writing a simulator for the above processor while allowing the user to change (reconfigure) some machine parameters depending on his need. For this we may adopt a simple RISC assembler Instruction Set. The simulator will be tested using small programs. Actions: (1) search the Internet for similar simulators, (2) design of the pipeline data structure, (3) implement the issue logic and reservation stations, (4) define performance metrics, and (5) evaluate performance.