Authors | R. Peñaranda, M. E. Gómez, P. Lopez, E. G. Gran and T. Skeie |
Title | A Fault-Tolerant Routing Strategy for KNS Topologies Based on Intermediate Nodes |
Affiliation | Communication Systems |
Project(s) | ERAC: Efficient and Robust Architecture for the Big Data Cloud |
Status | Published |
Publication Type | Journal Article |
Year of Publication | 2017 |
Journal | Concurrency and Computation: Practice and Experience |
Volume | 29 |
Issue | 13 |
Publisher | John Wiley & Sons, Ltd. |
Keywords | exascale computing, fault-tolerant routing, hybrid topology, KNS topology |
Abstract | Exascale computing systems are being built with thousands of nodes. The high number of components in these systems significantly increases the probability of failure. A key component of these systems is the interconnection network. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k-ary n-direct s-indirect (KNS) family, which provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the KNS topology that degrades performance gracefully in the presence of faults and tolerates a large number of faults without disabling any healthy computing node. To tolerate network failures, the methodology uses a simple mechanism: for any source-destination pair, if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network) in order to circumvent faults. The evaluation results show that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are ten faults in a 3-D network with 1,000 nodes using only one intermediate node, and more than 99.98% if two intermediate nodes are used. Furthermore, the methodology offers graceful performance degradation. As an example, performance degrades by only 1% for a 2-D network with 1,024 nodes and 1% faulty links. |
DOI | 10.1002/cpe.4065 |
Citation Key | 24639 |
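
The intermediate-node mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that the record does not state: nodes are coordinates in a k-ary n-dimensional grid, each line of nodes along a dimension shares a single switch, a link is identified by the hypothetical pair (node, dimension), each leg uses dimension-order routing, and at most one intermediate node is tried. Deadlock avoidance is not modeled.

```python
from itertools import product

def dor_links(src, dst):
    """Links used by dimension-order routing from src to dst.

    Assumption: a link is (node, dim), i.e. the connection between a node
    and the switch serving its line of nodes along dimension `dim`.
    """
    links = []
    cur = list(src)
    for d in range(len(src)):
        if cur[d] != dst[d]:
            links.append((tuple(cur), d))   # node -> switch of this line
            cur[d] = dst[d]
            links.append((tuple(cur), d))   # switch -> next node on the line
    return links

def path_ok(src, dst, faulty):
    """True if the dimension-order path from src to dst avoids all faulty links."""
    return all(link not in faulty for link in dor_links(src, dst))

def find_route(src, dst, k, faulty):
    """Return [src, dst], [src, mid, dst], or None if no such route exists."""
    if path_ok(src, dst, faulty):
        return [src, dst]
    # Try every other node as a single intermediate node.
    for mid in product(range(k), repeat=len(src)):
        if mid in (src, dst):
            continue
        if path_ok(src, mid, faulty) and path_ok(mid, dst, faulty):
            return [src, mid, dst]
    return None

if __name__ == "__main__":
    faulty = {((2, 0), 0)}  # one failed node-to-switch link
    print(find_route((0, 0), (2, 3), k=4, faulty=faulty))
```

In this sketch the packet falls back to a two-leg route only when the direct path crosses a faulty link, which mirrors the "if necessary" condition in the abstract. Extending the search to two intermediate nodes would follow the same pattern with a nested loop.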