Cloud computing has grown tremendously in popularity over the last several years. A scalable and efficient data center network is essential for a high-performance cloud computing infrastructure. This thesis provides practical solutions to enable an efficient, flexible, multi-tenant network architecture suitable for high-performance cloud computing, using InfiniBand (IB) as a demonstration technology. The work is motivated by the need for future data centers to provide efficient cloud solutions as uptake of cloud technology increases for both big data and traditional High-Performance Computing (HPC) applications.
The research contributions of this thesis fall into three main categories. First, we propose a set of improvements to the fat-tree routing algorithm to make it suitable for HPC workloads in the cloud. The fat-tree is a popular network topology in HPC systems. Our proposed improvements make fat-tree routing more efficient, provide performance isolation among tenants in multi-tenant systems, and enable routing of both physical and virtualized end nodes according to the policies set by the provider. Second, we design new network reconfiguration methods that significantly reduce the time it takes to reroute the IB network. Reduced network reconfiguration time means that the interconnection network in an HPC cloud can optimize itself quickly to adapt to changing tenant configurations, faults, running workloads, and current network conditions. Finally, we demonstrate a self-adaptive network prototype for IB-based HPC clouds, fully equipped with autonomous monitoring and adaptation, and configurable by service providers through a high-level condition-action language.
The research conducted in this thesis has potential impact on both private cloud infrastructures, such as medium-sized clusters used for enterprise HPC, and public clouds offering innovative HPC solutions to customers at scale. The industrial relevance of the thesis is reflected in the eight patent applications that resulted from this work.
The thesis is written within the field of Communication Systems. The work has been conducted at Simula Research Laboratory.
Prior to the defense, at 10:15, Feroz Zahid presented his trial lecture "Distributed Deep Learning".
The adjudication committee
- Associate Professor Torsten Hoefler, Scalable Parallel Computing Lab, Computer Science Department, ETH Zürich
- Associate Professor Francisco J. Alfaro-Cortés, University of Castilla-La Mancha, Castilla-La Mancha
- Professor Xing Cai, Department of Informatics, University of Oslo
Chair of the disputation
- Associate Professor Ragnhild Kobro Runde, Department of Informatics, University of Oslo
- Associate Professor Ernst Gunnar Gran, Department of Information Security and Technology, Norwegian University of Science and Technology
- Professor Tor Skeie, Department of Informatics, University of Oslo
- Professor Olav Lysne, Department of Informatics, University of Oslo