Fault-Tolerant Grid Architecture and Practice
-
Abstract
Grid computing has emerged as an effective technology for coupling geographically distributed resources and solving large-scale computational problems over wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems, and unreliable fault detection is one of the most effective. Globus, a grid middleware, manages resources in a wide area network. The Globus fault detection service uses well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. This paper presents a fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service. The platform offers effective strategies at three levels: key grid components, user tasks, and high-level applications.
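
To make the notion of an unreliable fault detector concrete, the following minimal Python sketch shows a timeout-based detector driven by heartbeat messages. It is an illustration under stated assumptions, not the FTGP or Globus implementation: the class name `HeartbeatDetector`, its methods, and the timeout value are all hypothetical. The detector is "unreliable" in the sense that it only suspects a component whose heartbeat is overdue, and it withdraws the suspicion if a later heartbeat arrives.

```python
import time
from typing import Callable, Dict, Set


class HeartbeatDetector:
    """Timeout-based unreliable failure detector (hypothetical sketch)."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout                 # seconds of silence tolerated
        self.clock = clock                     # injectable clock, eases testing
        self.last_seen: Dict[str, float] = {}  # component -> last heartbeat time

    def heartbeat(self, component: str) -> None:
        """Record a heartbeat; this also clears any earlier suspicion."""
        self.last_seen[component] = self.clock()

    def suspects(self) -> Set[str]:
        """Return components whose last heartbeat is older than the timeout.

        A suspected component may merely be slow or partitioned, so the
        result is a suspicion, not proof of failure.
        """
        now = self.clock()
        return {c for c, t in self.last_seen.items()
                if now - t > self.timeout}
```

A monitored component would call `heartbeat` periodically, and a monitor would poll `suspects` to decide which components to report. Choosing the timeout trades detection latency against the rate of false suspicions.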
-
-