Fault Tolerance

  • Reliability and availability
  • Metrics
    • MTTF = Mean time to failure
    • MTBF = Mean time between failures
    • MTTR = Mean time to recovery
  • Types of faults
    • Transient vs persistent
    • Malicious vs benign
    • Fail-stop, stuck-at-1 or stuck-at-0
    • Byzantine faults, inconsistent behaviors
  • Safety vs liveness
  • Fail-safe, graceful degradation
  • N-modular redundancy masks up to f failures, when N ≥ 2f + 1.
  • Hot/Cold standby.
  • Design considerations
    • Fault avoidance - focus on the design phase
    • Fault removal - debugging, iteration
    • Fault tolerance
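
The metrics and the N-modular redundancy bound above can be illustrated with a minimal Python sketch. It assumes the common convention that steady-state availability is MTTF / (MTTF + MTTR) (so MTBF = MTTF + MTTR), and that redundant modules are combined by a simple majority voter. The function names (availability, max_masked_faults, majority_vote) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def availability(mttf, mttr):
    """Steady-state availability, assuming MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


def max_masked_faults(n):
    """Largest f such that n >= 2f + 1, i.e. faults a majority voter can mask."""
    return (n - 1) // 2


def majority_vote(outputs):
    """Return the value reported by a strict majority of modules.

    Raises ValueError when no value has a strict majority, which happens
    once more than max_masked_faults(len(outputs)) modules disagree.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: too many faulty modules")
    return value


if __name__ == "__main__":
    # 999 hours up for every 1 hour of repair.
    print(availability(999, 1))
    # Triple modular redundancy (N = 3) masks one faulty module.
    print(max_masked_faults(3))
    print(majority_vote([1, 1, 0]))
```

With N = 3 the voter still returns the correct value when one module is stuck at the wrong output; with two faulty modules agreeing on a wrong value, the fault is no longer masked, which is why the bound N ≥ 2f + 1 matters.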