King Fahd University of Petroleum & Minerals
College of Computer Sciences and Engineering
Computer Engineering Department
COE 501 Computer Architecture
Course Syllabus
- Instructor: Professor Mayez Al-Mouhamed (Email: mayez@ccse.kfupm.sa.edu)
- Office: Room 22-400-3 (Tel. 2934) and Lab 22-339 (Tel. 3536).
- Office hours: S.M.W from 9-10 am, U.T. from 10-11 am, and by appointment.
- Text Book and references:
- Computer Architecture: A Quantitative Approach, Hennessy and
Patterson, Third Edition, Morgan Kaufmann Publishers, Inc.
- Selected papers from IEEE T.C., IEEE T.P.D.S., etc., in addition to the following books:
- D.E. Culler and J.P. Singh, Parallel Computer Architecture: A
Hardware/Software Approach, Morgan Kaufmann Publishers Inc., San
Francisco, CA, 1999, ISBN 1-55860-343-3.
- Computer Architecture and Parallel Processing, K. Hwang and F.
Briggs, Mc-Graw-Hill.
- Readings in Computer Architecture, Mark Hill (Editor), Norman Jouppi
(Editor), Gurindar Sohi (Editor), Morgan Kaufmann Publishing Co.,
Menlo Park, CA. 1999
- Computer Organization and Design: The Hardware/Software
Interface, Patterson and Hennessy, Morgan Kaufmann, 1998.
- M.J. Quinn, Parallel Programming in C with MPI and OpenMP,
McGraw-Hill, New York, NY, 2004, ISBN 0-07-282256-2.
- P.S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann
Publishers Inc., San Francisco, CA, 1997, ISBN 1-55860-339-5.
- J.M. May, Parallel I/O for High Performance Computing, Morgan
Kaufmann Publishers Inc., San Francisco, CA, 2001, ISBN
1-55860-664-5.
- K. Hwang and Z. Xu, Scalable Parallel Computing: Technology,
Architecture, Programming, McGraw-Hill, 1998, ISBN 0070317984.
- A.E. Koniges (ed), Industrial Strength Parallel Computing, Morgan
Kaufmann Publishers Inc., San Francisco, CA, 2000, ISBN
1-55860-540-1.
- G.R. Andrews, Multithreaded, Parallel, and Distributed
Programming,Addison-Wesley, Reading, Mass., 2000, ISBN
0-201-35752-6.
- G.F. Pfister, In Search of Clusters, 2nd Edition, PTR Prentice-Hall,
Upper Saddle River, NJ, 1998, ISBN 0-13-899709-8.
- Kai Hwang, Advanced Computer Architecture - Parallelism,
Scalability, Programmability, McGraw-Hill, New York, NY, 1993, ISBN
0-07-031622-8.
- Grading Policy:
Exam 1: 20/100 (April 5, 2009), Exam 2: 20/100 (May 24, 2009),
Course project: 20/100, Homework: 10/100, and Final Exam: 30/100
(scheduled by the registrar).
- Attendance: attendance is required of all students.
An official authorized excuse must be presented to the instructor
no later than one week following the absence.
Unexcused absences lead to a ``DEN'' grade.
Course Description:
Classification of computer systems, architectural developments,
computer performance. Linear and nonlinear pipeline design, instruction
and arithmetic pipeline, superscalar. Memory hierarchy, cache and virtual
memories, cache coherence, memory system performance.
Parallel architectures, performance measures, SIMD and MIMD architectures.
Interconnection networks.
Pre-requisite: graduate standing.
Course Outline:
- Introduction (5 lectures)
Classification of computer systems, architectural developments,
cost and trends, computer performance, measuring CPU time, CPI,
MIPS, FLOPS, and computer benchmark suites, Examples of ISA. (Chapter 1 and 2).
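The CPU time and MIPS measures listed above follow the standard relation CPU time = instruction count x CPI / clock rate; a minimal C sketch (the workload numbers in the usage note are hypothetical, not from any real machine):

```c
#include <assert.h>

/* CPU time (seconds) = instruction count x CPI / clock rate (Hz). */
double cpu_time(double insn_count, double cpi, double clock_hz) {
    return insn_count * cpi / clock_hz;
}

/* MIPS = instruction count / (CPU time x 10^6); note that MIPS can
   mislead when comparing machines with different ISAs. */
double mips_rating(double insn_count, double cpu_time_s) {
    return insn_count / (cpu_time_s * 1e6);
}
```

For a hypothetical program of 10^9 instructions at CPI 1.5 on a 1 GHz clock, cpu_time gives 1.5 s and the MIPS rating is about 667.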
- Pipelining (15 lectures)
ILP, instruction pipelining, hazards, dynamic branch prediction,
Tomasulo dynamic execution, speculative execution and limitation,
multiple issue superscalar processors, and examples. Software
approaches to ILP, static branch prediction, VLIW versus dynamic
execution, compiler support for ILP, unrolling, software
pipelining, dependence analysis, and examples. (Chapters 3 and 4)
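Loop unrolling, one of the compiler ILP techniques listed above, can be sketched on a simple reduction loop (the unroll factor of 4 and the four partial sums are illustrative choices):

```c
#include <assert.h>

/* Baseline: one add per iteration; each element's add depends on the
   running sum, limiting instruction pipeline utilization. */
double sum_simple(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: fewer branches and counter updates per element, and
   four independent partial sums expose more ILP to the pipeline. */
double sum_unrolled(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)      /* cleanup when n is not a multiple of 4 */
        s += a[i];
    return s;
}
```

Both versions compute the same sum; the unrolled one trades code size for reduced loop overhead and more independent operations per iteration.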
- Memory system (12 lectures)
Memory hierarchy, cache memory organization, cache algorithms,
performance of memory, improving memory performance, virtual
memories, and examples. (Chapter 5)
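Memory-system performance in this part of the course is commonly summarized by the average memory access time (AMAT); a small C sketch (all cycle counts and miss rates in the usage note are hypothetical):

```c
#include <assert.h>

/* One-level cache: AMAT = hit time + miss rate x miss penalty (cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Two-level hierarchy: the L1 miss penalty is itself the L2 AMAT. */
double amat2(double hit1, double miss1,
             double hit2, double miss2, double mem_penalty) {
    return hit1 + miss1 * (hit2 + miss2 * mem_penalty);
}
```

With an illustrative 1-cycle L1 hit, 25% miss rate, and 100-cycle memory penalty, AMAT is 26 cycles; inserting a 10-cycle L2 that misses half the time gives 1 + 0.25 x (10 + 0.5 x 100) = 16 cycles.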
- Parallel architectures (10 lectures)
Classification of parallel computers, SIMD architectures, MIMD
shared-memory and distributed-memory architectures, programming,
performance, SPMD, and synchronization and barriers. Coherence
protocols. Interconnection networks, examples, and performance.
(Chapters 6 and 8)
- Miscellaneous (midterm and presentations) (3 lectures)
List of projects:
- Study of the DLX simulator
Please investigate the DLX simulator as follows: (1) take a
code such as the ADI benchmark, write it in C, generate its DLX code,
and collect its run time; (2) restructure the DLX ADI assembly as
best you can, run it, and compare its performance to the
compiled version; (3) try loop unrolling and compare performance.
You may then devise a method for compiler loop restructuring, such
as the use of loop unrolling. Explain the compiler approach with
respect to a source code in DLX and give an example. The students
will prepare a presentation within four weeks.
Suggested implementation:
- Consider a set (say 5) of small programs (computation loops), such as
(1) Alternating Direction Integration (ADI), (2) the Red-Black
Relaxation loop, and (3) 2D Matrix Multiply.
- Carry out recurrence analysis of each loop and classify it as
loop-independent (LID), backward loop-carried (B-LCD), or forward
loop-carried (F-LCD) dependence.
- Restructure the loop to improve its performance in the DLX
instruction pipeline using unrolling and software pipelining.
- Run each of the above programs without restructuring, with loop
unrolling (possibly different configurations), and with software
pipelining (possibly different configurations).
- Report performance at all levels after collecting execution time
as number of clocks, compare performance for each restructuring
approach, and present your discussion and justification.
- Suggest formal restructuring rules for loop unrolling and
software pipelining.
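The three recurrence classes above can be illustrated with tiny loops; a C sketch, assuming the usual reading of LID as loop-independent dependence and B-LCD/F-LCD as backward/forward loop-carried dependence (the loops themselves are invented examples):

```c
#include <assert.h>

#define N 8

/* Loop-independent dependence (LID): a[i] is written and read only
   within the same iteration; iterations are independent, so the loop
   unrolls and software-pipelines freely. */
void lid(double *a, const double *b) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 1.0;
}

/* Backward loop-carried dependence (B-LCD): iteration i reads a[i-1],
   written by the previous iteration; this recurrence serializes the
   loop and limits what unrolling can gain. */
void b_lcd(double *a) {
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + 1.0;
}

/* Forward loop-carried dependence (F-LCD): iteration i reads a[i+1]
   before a later iteration overwrites it; it can be removed by
   reading from a copy of the original array. */
void f_lcd(double *a) {
    for (int i = 0; i < N - 1; i++)
        a[i] = a[i + 1] + 1.0;
}
```

Classifying each project loop this way tells you up front whether unrolling alone suffices (LID) or whether the recurrence must first be broken or tolerated (B-LCD/F-LCD).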
- Study of Benchmarks programs used for Desktop computers.
For the first four weeks the students will survey: (1) determine
representative types of benchmarks used for desktop computers, (2)
classify the benchmarks by their objectives, (3) acquire
some available benchmarks to run and demonstrate to the
class, and (4) classify some known computers based on their
benchmark performance. The suggested plan is: (1) determine the
representative benchmarks most commonly used for desktop computers,
(2) classify the benchmarks by their objectives, such as
benchmarking the CPU, the display system, the hard disk, and the
NIC/network, (3) acquire the benchmarks identified in (2), run them
during your presentation, comment on the results, and provide a
copy of those benchmarks to the students. Define clearly what is
benchmarked and how the benchmark works, and provide the meaning
of the output, its interpretation, and its justification. The students
will prepare a presentation within four weeks.
Suggested implementation:
- Find available or public-domain benchmarks that allow
assessing the local performance of a specific aspect of computer
architecture, such as the following, arranged in descending order of
priority: (1) the CPU capability and its instruction pipeline, (2)
the SIMD multimedia extension (MMX), (3) the hierarchical memory
system, (4) the graphics capability and its AGP card, (5) the stack
protocol, NIC, and network delays, etc. For each considered
benchmark we must have information on its load profile and a clear
interpretation of its output. Avoid benchmarks for which we only
have a generic meaning of the outcome.
- Run the above benchmarks on specific PCs with the aim of
assessing the performance of their architectural implementation, not
just displaying the output performance of the benchmark. Assess the
data (1) in a comparative way (output from different benchmarks and
for different PCs), and (2) with critical thinking (matching obtained
performance with known concepts about the specific PC architecture).
- Provide a copy of the studied benchmarks and the report.
Write a report on the above experiments, including comparative tables
and discussion, and attach the benchmarks or indicate where to find
them.
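As one concrete illustration of the focused benchmarks meant here, a stride-access microbenchmark probes the hierarchical memory system (the array size, stride values, and clock()-based timing are simplifying assumptions; production benchmarks are far more careful):

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)            /* 4M doubles: larger than typical caches */

static volatile double sink;   /* keeps the sums live under optimization */

/* Sum every 'stride'-th element; returns the sum and reports through
   *touched how many elements were read. */
double touch(const double *a, int stride, long *touched) {
    double s = 0.0;
    long cnt = 0;
    for (long i = 0; i < N; i += stride) { s += a[i]; cnt++; }
    *touched = cnt;
    return s;
}

/* Nanoseconds per touched element: larger strides defeat spatial
   locality (fewer hits per cache line), so this number should rise
   as the stride grows past the line size. */
double ns_per_element(const double *a, int stride) {
    long cnt;
    clock_t t0 = clock();
    double s = touch(a, stride, &cnt);
    clock_t t1 = clock();
    sink = s;
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)cnt;
}
```

Sweeping the stride from 1 to beyond the cache-line size and plotting ns-per-element exposes the line size and miss penalty of the machine under test, which is exactly the "clear interpretation of the output" the project asks for.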
- Study of Techniques used for Improving Performance of
Instruction Pipelining. For the first four weeks the students carry
out: (1) a survey of I-pipelining techniques in at least two recent
processors, (2) dynamic execution models, (3) methods of resolving
hazards, and (4) compiler techniques for I-pipelining. The
suggested plan is: (1) survey I-pipelining techniques in
recently proposed micro-architectures (last ten years), (2) how
structural, data, and control hazards are resolved and at what
level, (3) the main proposed features such as branch prediction,
speculation, etc., and (4) a comparison of these
micro-architectures with respect to major aspects, especially
expected performance and limitations. Implement a branch-prediction
table with 2-bit history and evaluate its performance by using typical
loop structures. The students will prepare a presentation within
four weeks.
Suggested implementation. Select one of the following
implementations:
- Select one branch prediction algorithm from a
recent research paper and implement it. Carry out a performance
evaluation by using statistically generated branch behavior. For
example, the probability that an instruction is a conditional branch
(CB) is 0.2, and the branch follows a set of statistical T/NT patterns
(for example, pattern 1110010101 with a taken probability of 0.6).
Evaluate the performance of the chosen branch predictor and
suggest some improvements.
- Select one
correlating branch prediction algorithm from a recent research
paper and implement it. Carry out a performance evaluation by using
statistically generated branch behavior. For example, set up a
set of correlation rules between 3 or more CBs that are hidden from
the computer. Each correlation element indicates the T/NT outcome of
each of the CBs. You may randomly select a given element at the start
of the first CB. The problem is to evaluate the performance of the
algorithm when the set of correlated branches behaves as above. Study
the performance (correct predictions as compared to incorrect ones)
and suggest some algorithm tuning.
- Implement a software tool
(compiler) that takes a loop with a few statements in the body as
input and (1) carries out data dependence analysis, (2) checks the
profitability of loop unrolling and software pipelining under
specific dependence conditions, and (3) carries out the needed
restructuring and writes the restructured loop to an output file.
Discuss the decision rules used and show the input and output loops.
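The 2-bit history table required by the project statement can be sketched as a table of saturating counters (the table size and PC-modulo indexing are illustrative choices, not taken from any particular paper):

```c
#include <assert.h>

#define TABLE_SIZE 1024

/* 2-bit saturating counters per entry: values 0,1 predict not taken;
   2,3 predict taken. Zero-initialized, i.e. strongly not taken. */
static unsigned char table[TABLE_SIZE];

/* Predict the branch at 'pc': returns 1 for taken, 0 for not taken. */
int predict(unsigned pc) {
    return table[pc % TABLE_SIZE] >= 2;
}

/* Train with the actual outcome (1 = taken, 0 = not taken); the
   counter saturates at 0 and 3 instead of wrapping. */
void update(unsigned pc, int taken) {
    unsigned char *c = &table[pc % TABLE_SIZE];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

On a typical loop pattern (T,T,...,T,NT repeated), the 2-bit hysteresis means the single not-taken exit branch does not flip a warmed-up counter, so only the exit itself is mispredicted each pass.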
- Design of a simulator for a reconfigurable pipelined dynamic
execution processor. Dynamic execution consists of an issue unit
that adopts an approach to resolve resource conflicts and data
hazards (RAW and WAR) as well as providing a scheme for sending
produced data out of ALUs to potential consumers (instructions).
Pipelining means that the state of the machine can be tracked by a
clock to determine the availability of data and resources. The project
consists of writing a simulator for the above processor while allowing
the user to change (reconfigure) some machine parameters depending on
their needs. For this we may adopt a simple RISC assembler instruction set.
The simulator will be tested using small programs. Actions: (1)
search the Internet for similar simulators, (2) design the
pipeline data structure, (3) implement the issue logic and
reservation stations, (4) define performance metrics, and (5) evaluate
performance.
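As a possible starting point for actions (2) and (3), a hedged C sketch of a Tomasulo-style reservation-station entry with result broadcast (all field names and sizes are invented for illustration, not from any specific design):

```c
#include <assert.h>

#define NUM_RS 8

/* One reservation-station entry in a Tomasulo-style issue unit. */
typedef struct {
    int busy;        /* entry in use */
    int op;          /* opcode of the waiting instruction */
    double vj, vk;   /* operand values, valid only when qj/qk == 0 */
    int qj, qk;      /* tag of the producing RS (1-based); 0 = ready */
    double result;   /* value to broadcast when execution completes */
} RS;

static RS rs[NUM_RS];

/* An instruction may start executing once both operands have values. */
int ready(int i) {
    return rs[i].busy && rs[i].qj == 0 && rs[i].qk == 0;
}

/* Common-data-bus broadcast: every waiting entry matching 'tag'
   captures the value, resolving RAW hazards without a register read;
   renaming through tags is also what removes WAR hazards. */
void broadcast(int tag, double value) {
    for (int i = 0; i < NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
    }
}
```

Making NUM_RS, the number of functional units, and their latencies user-settable parameters gives exactly the reconfigurability the project calls for; a per-clock loop over issue, execute, and broadcast then tracks the machine state cycle by cycle.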