King Fahd University of Petroleum & Minerals
College of Computer Sciences & Engineering

Department of Computer Engineering




COE 523: Fault Tolerant Computer Systems (3-0-3)



Syllabus




Catalog Description

Fundamental concepts in the theory of reliable computer systems design. Redundancy techniques including hardware, software, information, and time redundancy. Empirical, combinatorial, and Markovian models for evaluating fault-tolerant computing systems. The practice of reliable systems design. Case studies including general-purpose, high-availability, long-life, and critical-computation systems. Fault-tolerance and reliability of multicomputer networks (direct and indirect), including fault-tolerant routing and sparing techniques. Fault-tolerance of high-speed ATM networks. Fault-tolerance of telecommunication networks. Yield and reliability enhancement techniques for VLSI/WSI array processors. Fault-tolerance of neural networks.

Prerequisite: COE 308 or equivalent.

Textbook:

D. Siewiorek and R. Swarz, "Reliable Computer Systems: Design and Evaluation", Digital Press, 2nd Edition, 1992.

Course Objectives:

(1) To introduce students to the theory of reliable and fault-tolerant computer systems.

(2) To expose students to the fundamental concepts in designing reliable computer systems.

(3) To introduce students to the basic design aspects of reliable and fault-tolerant multicomputer networks.

(4) To introduce students to the basic design aspects of reliable and fault-tolerant ATM/high-speed networks.

(5) To introduce students to the basic design aspects of reliable and fault-tolerant telecommunication networks.



Course Learning Outcomes:

(1) Grasp the basic elements in the theory of reliable and fault-tolerant computer systems.

(2) Grasp the fundamental concepts in designing reliable computer systems.


Topics:

1.
Module 1: The Theory of Reliable Systems (6 classes)
Fundamental concepts, faults and their manifestation, reliability techniques, maintainability and diagnostic techniques, evaluation techniques, application of modeling techniques to systems design, and financial considerations.

2.
Module 2: The Practice of Reliable Systems Design (6 classes)
Fundamental concepts, case studies. Students will be asked to investigate the fault-tolerant capability of a number of computing systems and make presentations on these.

3.
Module 3: Fault-Tolerance and Reliability of Multicomputer Networks (6 classes)
Investigation into the fault-tolerance and reliability aspects of multicomputer networks. Direct networks, e.g. rings and their variants, hypercubes, and meshes. Indirect networks, e.g. multistage interconnection networks. Reliability enhancement and fault-tolerant routing techniques for these networks are also discussed and analyzed.

4.
Module 4: Fault-tolerance and Reliability of ATM Networks (6 classes)
Investigate the fault-tolerance and reliability aspects of space-division ATM.

5.
Module 5: FT and Reliability Design of General Graph Networks (4 classes)
Discuss the design aspects of fault-tolerant and reliable general graph networks.

6.
Module 6: FT VLSI Circuits (4 classes)
Failure modes in VLSI/WSI circuits, self-checking circuit design, and some coding techniques are introduced. The use of local, global, and hierarchical hardware redundancy techniques for reliability and yield enhancement of VLSI/WSI processor arrays is discussed and analyzed. Fault-tolerance and reliability aspects of emerging neural networks are introduced.


Computer Usage:

Use of some CAD tools.

Laboratory Experiments:

None.

Grading Policy (Tentative):

10% Assignments
25% Major Exam (Tentatively offered at the end of week 8)
35% Course Projects
30% Final Exam (Scheduled by the Registrar)

ABET Category Content:

Engineering Science: 50%

Engineering Design: 50%


Prepared by: Prof. Mostafa Abd-El-Barr. Date: January 2001.