Fault Tolerance and Recovery for Group Communication Services in Distributed Networks
-
Abstract
Group communication services (GCSs) are becoming increasingly important as a wide field of promising applications has emerged to serve millions of users distributed across the world. However, it is challenging to make the service fault tolerance and scalable to fulfill the voluminous demand of users in a distributed network (DN). While many reliable group communication protocols have been dedicated to addressing such a challenge so as to accommodate the changes in the network, they are often costly or require complicated strategies to handle the service interruptions caused by node departures or link failures, which hinders the service practicability. In this paper, we present two schemes to address the challenges. The first one is a location-aware replication scheme called NS, which makes replicas in a dispersed fashion that enables the services on nodes to gain immunity of failures with different patterns (e.g., network partition and single point failure) while keeping replication overhead low. The second one is a novel failure recovery scheme that exploits the independence between service recovery and structure recovery in time domain to achieve quick failure recovery. Our simulation results indicate that the two proposed schemes outperform the existing schemes and simple alternative schemes in service success rate, recovery latency, and communication cost.
-
-