|Authors||F. O. Sem-Jacobsen and O. Lysne|
|Editors||P. Balaji and R. Buyya|
|Title||Topology Agnostic Dynamic Quick Reconfiguration for Large-Scale Interconnection Networks|
|Afilliation||, Communication Systems|
|Publication Type||Proceedings, refereed|
|Year of Publication||2012|
|Conference Name||Proceedings of The 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing|
Toleration of faults in the interconnection networks is of vital importance in todays huge computer installations. Still, the existing solutions are short of being satisfactory. They require that the system defaults into a routing algorithm that is inferior to the original, either in terms of performance, or in terms of the need for virtual channels, or both. Furthermore, since support for dynamic reconfiguration is not supported in current hardware, existing methods require the system to be halted while reconfiguration takes place in order to avoid deadlocks. In this paper we present a method that efficiently generates a new routing function in the presence of faults. The new routing function only reroutes the traffic that is affected by the fault, so that the performance of the original routing function is preserved to the extent possible. No specific functionality in the switches is required, we only require exactly the same number of virtual channels in the presence of faults as the original routing algorithm did. Finally, the new routing function is compatible with the old one, so that deadlock free dynamic transition between the old and the new routing function is immediately available. This means that our solution can easily be implemented on current InfiniBand platforms, e.g. through the OFED software stack. We demonstrate that the method is workable for meshes, tori and fat-trees, and that it is able to guarantee one-fault tolerance for all of these topologies.