Fault Tolerance

Fault tolerance is the ability for a system to continue giving a correct service following the manifestation of a fault or faults either through errors in the system design, implementation or introduced following an attack. Fault tolerance is important in situations where a system failure could cause a catastrophic accident or large economic losses. For example, the computer systems within an air traffic control system must be continuously available.

 

Fault tolerance can be split into four aspects. Failure detection, the systems ability to detect that a failure has or is about to occur. Damage assessment, identifying the parts of the system that have been affected by the failure. Fault recovery, restoring the systems state to a known ‘safe’ state. Fault repair, modifying the system so the fault cannot occur again (see also repairability).