Authors: F. O. Sem-Jacobsen and O. Lysne
Editors: P. Balaji and R. Buyya
Title: Topology Agnostic Dynamic Quick Reconfiguration for Large-Scale Interconnection Networks
Affiliation: Communication Systems
Publication Type: Proceedings, refereed
Year of Publication: 2012
Conference Name: Proceedings of The 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Date Published: May

Tolerating faults in the interconnection network is of vital importance in today's huge computer installations. Still, existing solutions fall short of being satisfactory. They require the system to fall back to a routing algorithm that is inferior to the original, either in terms of performance, or in terms of the number of virtual channels needed, or both. Furthermore, since current hardware does not support dynamic reconfiguration, existing methods require the system to be halted while reconfiguration takes place in order to avoid deadlocks. In this paper we present a method that efficiently generates a new routing function in the presence of faults. The new routing function reroutes only the traffic that is affected by the fault, so that the performance of the original routing function is preserved to the extent possible. No specific functionality is required in the switches, and the method needs exactly the same number of virtual channels in the presence of faults as the original routing algorithm did. Finally, the new routing function is compatible with the old one, so that a deadlock-free dynamic transition between the old and the new routing function is immediately available. This means that our solution can easily be implemented on current InfiniBand platforms, e.g. through the OFED software stack. We demonstrate that the method is workable for meshes, tori and fat-trees, and that it is able to guarantee one-fault tolerance for all of these topologies.
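The idea of rerouting only the traffic affected by a fault can be illustrated with a small sketch. This is not the algorithm from the paper: it assumes a 2D mesh with dimension-order (XY) routing as the original routing function, stores one explicit path per source/destination pair instead of per-switch forwarding tables, and replaces affected paths with a plain breadth-first-search detour. It makes no attempt to ensure deadlock freedom or to account for virtual channels, which are the central concerns of the paper. All function names are hypothetical.

```python
from collections import deque


def xy_path(src, dst):
    """Dimension-order (XY) path on a 2D mesh: travel along x first, then y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def uses_link(path, link):
    """True if the path crosses the (bidirectional) link in either direction."""
    a, b = link
    return any({u, v} == {a, b} for u, v in zip(path, path[1:]))


def bfs_path(src, dst, width, height, failed):
    """Shortest path from src to dst that avoids the failed link.

    Illustrative detour only; it ignores deadlock freedom entirely.
    """
    bad = frozenset(failed)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_mesh = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if in_mesh and nxt not in prev and frozenset((node, nxt)) != bad:
                prev[nxt] = node
                queue.append(nxt)
    if dst not in prev:
        return None  # destination unreachable without the failed link
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]


def reroute_affected(width, height, failed):
    """Keep every XY path that avoids the failed link; replace only the rest."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    routes, changed = {}, 0
    for s in nodes:
        for d in nodes:
            if s == d:
                continue
            path = xy_path(s, d)
            if uses_link(path, failed):
                path = bfs_path(s, d, width, height, failed)
                changed += 1
            routes[(s, d)] = path
    return routes, changed


if __name__ == "__main__":
    failed_link = ((0, 0), (1, 0))
    routes, changed = reroute_affected(3, 3, failed_link)
    print(f"{changed} of {len(routes)} paths were rerouted")
```

In this toy setting, paths that never cross the failed link are left exactly as the original routing function defined them, which is the property the abstract refers to as preserving the original performance to the extent possible.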

Citation Key: Simula.simula.1134