Department of Computer Engineering
COE 523: Fault Tolerant Computer Systems (3-0-3)
Fundamental concepts in the theory of reliable computer systems design.
Redundancy techniques including hardware, software, information,
and time techniques. Empirical, combinatorial and Markovian models
for evaluating fault-tolerant computing systems.
The practice of reliable systems design. Case studies including General-purpose,
High-availability, Long-life and Critical-computation systems. Fault-tolerance
and reliability of multicomputer networks (direct and indirect) including
fault-tolerant routing and sparing techniques.
Fault-tolerance of ATM high-speed networks. Fault-tolerance of
telecommunication networks. Yield and reliability enhancement
techniques for VLSI/WSI array processors. Fault-tolerance of neural networks.
Prerequisite: COE 308 or equivalent.
Textbook: D. Siewiorek and R. Swarz, "Reliable Computer Systems: Design and Evaluation",
Digital Press, 2nd Edition, 1992.
Course Objectives:
- (1) To introduce students to the theory of reliable and fault-tolerant computer systems.
- (2) To expose students to the fundamental concepts in designing reliable
- (3) To introduce students to the basic design aspects of reliable and
fault-tolerant multicomputer networks.
- (4) To introduce students to the basic design aspects of reliable and
fault-tolerant ATM/high-speed networks.
- (5) To introduce students to the basic design aspects of reliable and
fault-tolerant telecommunication networks.
Course Learning Outcomes:
- (1) Grasp the basic elements in the theory of reliable and fault-tolerant computer systems.
- (2) Grasp the fundamental concepts in designing reliable
- Module 1: The Theory of Reliable Systems
Fundamental concepts, faults and their manifestation, reliability techniques,
maintainability and diagnostic techniques, evaluation techniques,
application of modeling techniques to systems design,
and financial considerations.
- Module 2: The Practice of Reliable Systems Design
Fundamental concepts, case studies. Students will be asked to investigate
the fault-tolerant capability of a number of computing systems and make
- Module 3: Fault-Tolerance and Reliability of Multicomputer Networks
Investigation into the fault-tolerance and reliability aspects of multicomputer
networks. Direct networks, e.g. rings and their variants, hypercubes, and meshes.
Indirect networks, e.g. multistage interconnection networks. Reliability
enhancement and fault-tolerant routing techniques for those networks are
also discussed and analyzed.
- Module 4: Fault-tolerance and Reliability of ATM
Investigate the fault-tolerance and reliability aspects of space-division
- Module 5: FT and Reliability Design of General Graph
Discuss the design aspects of fault-tolerant and reliable general graph
- Module 6: FT VLSI Circuits
Failure modes in VLSI/WSI circuits, self-checking circuit design, and some
coding techniques are introduced.
The use of local, global, and hierarchical hardware redundancy techniques
for reliability and yield enhancement of VLSI/WSI processor arrays is
discussed and analyzed. Fault-tolerance and reliability aspects of
emerging neural networks are introduced.
Use of some CAD tools.
Grading Policy (Tentative):
25% Major Exam (Tentatively offered at the end of week
35% Course Projects
30% Final Exam (Scheduled by the Registrar)
ABET Category content:
Engineering Science: 50%
Engineering Design: 50%
Prepared by: Prof. Mostafa Abd-El-Barr.
Date: January 2001.