King Fahd University of Petroleum & Minerals
College of Computer Sciences & Engineering
Department of Computer Engineering
COE 523: Fault Tolerant Computer Systems (3-0-3)
Syllabus
Catalog Description
Fundamental concepts in the theory of reliable computer systems design.
Redundancy techniques including hardware, software, information,
and time techniques. Empirical, combinatorial and Markovian models
for evaluating fault-tolerant computing systems.
The practice of reliable systems design. Case studies including general-purpose,
high-availability, long-life, and critical-computation systems. Fault tolerance
and reliability of multicomputer networks (direct and indirect), including
fault-tolerant routing and sparing techniques.
Fault tolerance of ATM high-speed networks. Fault tolerance of
telecommunication networks. Yield and reliability enhancement
techniques for VLSI/WSI array processors. Fault tolerance of neural networks.
Prerequisite: COE 308 or equivalent.
Textbook:
D. Siewiorek and R. Swarz, "Reliable Computer Systems: Design and Evaluation",
Digital Press, 2nd Edition, 1992.
Course Objectives:
- (1) To introduce students to the theory of reliable and fault-tolerant computer systems.
- (2) To expose students to the fundamental concepts in designing reliable computer systems.
- (3) To introduce students to the basic design aspects of reliable and fault-tolerant multicomputer networks.
- (4) To introduce students to the basic design aspects of reliable and fault-tolerant ATM/high-speed networks.
- (5) To introduce students to the basic design aspects of reliable and fault-tolerant telecommunication networks.
Course Learning Outcomes:
- (1) Grasp the basic elements in the theory of reliable and fault-tolerant computer systems.
- (2) Grasp the fundamental concepts in designing reliable computer systems.
Topics:
- Module 1: The Theory of Reliable Systems (6 classes)
Fundamental concepts, faults and their manifestation, reliability techniques,
maintainability and diagnostic techniques, evaluation techniques,
application of modeling techniques to systems design,
and financial considerations.
- Module 2: The Practice of Reliable Systems Design (6 classes)
Fundamental concepts, case studies. Students will be asked to investigate
the fault-tolerant capability of a number of computing systems and make
presentations on these.
- Module 3: Fault-Tolerance and Reliability of Multicomputer Networks (6 classes)
Investigation into the fault-tolerance and reliability aspects of multicomputer
networks. Direct networks, e.g. rings and their variants, hypercubes, and meshes.
Indirect networks, e.g. multistage interconnection networks. Reliability
enhancement and fault-tolerant routing techniques for those networks are
also discussed and analyzed.
- Module 4: Fault-Tolerance and Reliability of ATM Networks (6 classes)
Investigation into the fault-tolerance and reliability aspects of space-division
ATM.
- Module 5: FT and Reliability Design of General Graph Networks (4 classes)
Discussion of the design aspects of fault-tolerant and reliable general graph
networks.
- Module 6: FT VLSI Circuits (4 classes)
Failure modes in VLSI/WSI circuits, self-checking circuit design, and some
coding techniques are introduced.
The use of local, global, and hierarchical hardware redundancy techniques
for reliability and yield enhancement of VLSI/WSI processor arrays is
discussed and analyzed. Fault-tolerance and reliability aspects of
emerging neural networks are introduced.
Computer Usage:
Use of some CAD tools.
Laboratory Experiments:
None.
Grading Policy (Tentative):
10% Assignments
25% Major Exam (Tentatively offered at the end of week 8)
35% Course Projects
30% Final Exam (Scheduled by the Registrar)
ABET Category content:
Engineering Science: 50%
Engineering Design: 50%
Prepared by: Prof. Mostafa Abd-El-Barr.
Date: January 2001.