Authors: F. O. Sem-Jacobsen
Title: Towards a Unified Interconnect Architecture: Combining Dynamic Fault Tolerance With Quality of Service, Community Separation, and Power Saving
Affiliation: Communication Systems
Publication Type: PhD Thesis
Year of Publication: 2008
Date Published: August
Publisher: University of Oslo
Thesis Type: phd
ISBN Number: 1501-7710

High-performance computing has, for a decade, been synonymous with parallel computer systems. Whereas parallel systems were initially based on shared-memory processing, all current high-performance systems are based on massively parallel processing, utilising a large number of loosely coupled processing units. Any parallel processing relies on a degree of communication, so the network interconnecting the processing units in the computer system has a significant impact on the efficiency of the system. Interconnection networks consist of a large number of switches and links to support the possible communication demands, and so the probability of some part of the system failing has to be considered. Steps must be taken to guarantee that all sources and destinations can continue to communicate even when some elements have failed; that is, the system must be fault tolerant. As elements fail, communication patterns in the network will change and affect the quality of service experienced by all traffic in the system. Furthermore, in many cases, a single application need not occupy the entire computer system to perform its calculations. To increase system efficiency, several such applications may be run in parallel on the same system, or parts of the system may be shut down when it is underutilised to save power and cut expenses. When several applications share the same system, there should be a minimal degree of interaction between them. This requires separation of communities and routing containment to guarantee that separate applications do not share network resources, and in the case where this is unavoidable, quality of service must be enforced to provide reliable guarantees to the applications. All these fields have received attention from the academic world; however, most proposed solutions for one problem are not easily combined with solutions for the other problems.
In this thesis, we develop a number of solutions for the different problems and attempt to combine these into a unified architecture. For fault tolerance, we develop a number of algorithms based on re-routing locally around failed elements. We consider both specific fat-tree topologies and a topology-agnostic solution. For the fat-tree topology, we are able to tolerate $\frac{\mathit{switch\_ports}}{2} - 1$ link or switch faults dynamically with very low response times. We then proceed to evaluate how these mechanisms affect the quality of service experienced by traffic flows in the network, and propose and evaluate a number of methods to re-prioritise traffic and maintain quality-of-service guarantees. We also develop a multipath routing scheme for fat-trees. By carefully selecting which path is utilised, we can achieve fault tolerance, separation of communities, and power saving. Finally, we describe how a number of the proposed methods can be combined into a unified network architecture that addresses all the challenges we have stated.
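To make the fat-tree fault-tolerance bound quoted above concrete, the following minimal sketch evaluates switch_ports / 2 - 1 for a few port counts. The function name `max_tolerated_faults` and the sample port counts are illustrative assumptions, not taken from the thesis, and the sketch assumes an even number of ports per switch.

```python
def max_tolerated_faults(switch_ports: int) -> int:
    """Evaluate the bound stated in the abstract: switch_ports / 2 - 1.

    Hypothetical helper for illustration only; it assumes an even
    port count, which the abstract does not state explicitly.
    """
    if switch_ports < 2 or switch_ports % 2 != 0:
        raise ValueError("expected an even port count of at least 2")
    return switch_ports // 2 - 1


if __name__ == "__main__":
    # Sample port counts chosen for illustration.
    for ports in (8, 16, 36):
        print(f"{ports}-port switches: up to {max_tolerated_faults(ports)} faults")
```

Under these assumptions, a fat-tree built from 16-port switches would tolerate up to 7 link or switch faults according to the stated bound.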