Authors: A. Kvalbein
Title: Fast Network Recovery
Affiliation: Networks, Communication Systems
Publication Type: PhD Thesis
Year of Publication: 2007
Date Published: June
Publisher: University of Oslo
Thesis Type: phd

The Internet is increasingly used to transport time-critical traffic. Applications like video conferencing, television, telephony and distributed games have strict requirements on the delay and availability offered by the underlying network. At the same time, connectivity failures caused by faults in network equipment are a part of everyday operation in large communication systems. The traditional recovery mechanisms used in IP networks are not designed with real-time applications in mind. The distributed nature of popular intradomain routing protocols allows them to eventually recover from any number of failures that leaves the network connected, but this is a time-consuming process that can lead to unacceptable performance degradation for some applications.
In this work, we argue that there is a need for fast recovery mechanisms that allow packet forwarding to continue over alternate paths immediately after a failure, before the routing protocol has converged on the altered topology. To give a rapid response, such mechanisms should be proactive, in the sense that an alternate route is readily available when a failure is discovered, and local, so that the recovery action can be carried out by the node that discovers the failure. Further, care should be taken so that shifting the recovered traffic to an alternate route does not lead to congestion and packet loss in other parts of the network.
We present and investigate mechanisms that can respond quickly to failures or unexpected traffic shifts in the network. First, we evaluate the recovery strategy used in a network protocol called Resilient Packet Ring (RPR). The ring topology used in RPR allows the implementation of very fast protection mechanisms. We look at the performance of these mechanisms, and propose improvements that reduce packet loss and shorten the experienced disruption time after a link or node failure.
Then, in the main part of this work, we focus on fast recovery in general mesh networks. We present Resilient Routing Layers (RRL) and Multiple Routing Configurations (MRC), which are methods for near-instantaneous recovery from component failures in packet networks. We discuss and evaluate our mechanisms with respect to state requirements and the distribution of the recovered traffic. For MRC, we go on to present methods for reducing the risk of congestion after a recovery operation. We show that if we have knowledge about the traffic demands, we can use this information to create MRC recovery paths that avoid the most heavily used parts of the network. Finally, we show how the concepts used in RRL and MRC to recover from component failures can also be used to avoid congestion when there are sudden shifts in the traffic distribution. Our method is more flexible than traditional traffic engineering methods used in connectionless IP networks, since it does not involve changing link weights to respond to a changed traffic situation.
Fast recovery mechanisms like those proposed in this work can help improve the stability and availability of IP networks. This is an important requirement for enabling new and existing real-time applications over general-purpose Internet infrastructure.