TY - JOUR

T1 - Stabilizing time-adaptive protocols

AU - Kutten, Shay

AU - Patt-Shamir, Boaz

N1 - Funding Information:
Correspondence address: Department of Industrial Engineering & Management, The Technion - IIT, Haifa 32000, Israel. E-mail: kutten@ie.technion.ac.il. A preliminary version of this paper appeared in Proc. ACM Symp. on Principles of Distributed Computing, August 1997. Research supported by DARPA and Rome Laboratory under agreement F30602-96-0239.

PY - 1999/6/6

Y1 - 1999/6/6

N2 - We study the scenario where a transient batch of faults hits a minority of the nodes in a distributed system by corrupting their state. We concentrate on the basic persistent bit problem, where the system is required to maintain a 0/1 value in the face of transient failures by means of replication. We give an algorithm to stabilize the value to a correct state quickly; that is, denoting the unknown number of faulty nodes by f, our algorithm recovers the value of the bit at all nodes in O(f) time units for any f < n/2, where n is the number of all nodes. Moreover, complete state quiescence occurs in O(diam) time units, where diam denotes the actual diameter of the network. This means that the value persists indefinitely so long as any f < n/2 faults are followed by Ω(diam) fault-free time units. (Strict self-stabilization requires recovery for f ≥ n/2 as well.) We prove matching lower bounds on both the output stabilization time and the state quiescence time. Using our persistent bit algorithm, we present a transformer which takes a distributed non-reactive non-stabilizing protocol ℘, and produces a protocol ℘' which solves the problem ℘ solves, with the additional property that if a batch of faults changes the state of f < n/2 of the nodes, then the output is recovered in O(f) time units, and the state stabilizes in O(diam) time units. Our upper and lower bounds are all proved in the synchronous network model.

AB - We study the scenario where a transient batch of faults hits a minority of the nodes in a distributed system by corrupting their state. We concentrate on the basic persistent bit problem, where the system is required to maintain a 0/1 value in the face of transient failures by means of replication. We give an algorithm to stabilize the value to a correct state quickly; that is, denoting the unknown number of faulty nodes by f, our algorithm recovers the value of the bit at all nodes in O(f) time units for any f < n/2, where n is the number of all nodes. Moreover, complete state quiescence occurs in O(diam) time units, where diam denotes the actual diameter of the network. This means that the value persists indefinitely so long as any f < n/2 faults are followed by Ω(diam) fault-free time units. (Strict self-stabilization requires recovery for f ≥ n/2 as well.) We prove matching lower bounds on both the output stabilization time and the state quiescence time. Using our persistent bit algorithm, we present a transformer which takes a distributed non-reactive non-stabilizing protocol ℘, and produces a protocol ℘' which solves the problem ℘ solves, with the additional property that if a batch of faults changes the state of f < n/2 of the nodes, then the output is recovered in O(f) time units, and the state stabilizes in O(diam) time units. Our upper and lower bounds are all proved in the synchronous network model.

KW - Distributed algorithms

KW - Error correction

KW - Fault locality

KW - Mending

KW - Self stabilization

UR - http://www.scopus.com/inward/record.url?scp=0003150383&partnerID=8YFLogxK

U2 - 10.1016/S0304-3975(98)00238-2

DO - 10.1016/S0304-3975(98)00238-2

M3 - Article

AN - SCOPUS:0003150383

VL - 220

SP - 93

EP - 111

JO - Theoretical Computer Science

JF - Theoretical Computer Science

SN - 0304-3975

IS - 1

ER -