We study the scenario where a transient fault hits f of the n nodes of a distributed system by corrupting their state. We consider the basic problem of persistent bit, where the system is required to maintain a value in the face of transient failures by means of replication. We give an algorithm to recover the value quickly: the value of the bit is recovered at all nodes in O(f) time units for any unknown value of f < n/2. Moreover, complete state quiescence occurs in O(diam) time units, where diam denotes the diameter of the network. This means that the value persists indefinitely so long as any f < n/2 faults are followed by Ω(diam) fault-free time units. We prove lower bounds which show that both time bounds are asymptotically optimal. Using the algorithm for persistent bit, we present a general transformer which takes a distributed non-reactive, non-stabilizing protocol P and produces a self-stabilizing protocol P′ which solves the same problem as P, with the additional property that if the number of faults that hit the system after stabilization is f, for any unknown f < n/2, then the output of P′ regains stability in O(f) time units, and the state stabilizes in O(diam) time units.
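The role of the f < n/2 condition can be illustrated with a minimal sketch. This is not the algorithm of the paper: it is a plain global majority vote over all replicas, which ignores network topology and does not achieve the O(f) recovery time. It only shows why the bit remains recoverable when fewer than half of the replicas are corrupted. The names `majority_bit` and `corrupt` are hypothetical helpers introduced for this illustration.

```python
import random


def majority_bit(replicas):
    """Return the bit held by a strict majority of the replicas (0 on a tie)."""
    ones = sum(replicas)
    return 1 if 2 * ones > len(replicas) else 0


def corrupt(replicas, f, rng):
    """Return a copy of replicas with the bit flipped at f distinct random nodes."""
    corrupted = list(replicas)
    for i in rng.sample(range(len(replicas)), f):
        corrupted[i] ^= 1  # a transient fault flips the stored bit
    return corrupted


# Assumed toy setting: n = 9 replicas all storing bit 1, f = 4 < n/2 faults.
n, f = 9, 4
original = [1] * n
damaged = corrupt(original, f, random.Random(0))

# Every node adopts the majority value, which restores the original bit.
recovered = [majority_bit(damaged)] * n
```

With f < n/2, the uncorrupted replicas always form a strict majority, so the vote returns the original bit regardless of which nodes were hit. If f could reach n/2 or more, the corrupted copies could outvote the correct ones, which is why the bound on f matters.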
|Number of pages||10|
|State||Published - 1997|
|Event||Proceedings of the 1997 16th Annual ACM Symposium on Principles of Distributed Computing - Santa Barbara, CA, USA|
Duration: 21 Aug 1997 → 24 Aug 1997