Authors: M. Sourouri
Title: Scalable Heterogeneous Supercomputing: Programming Methodologies and Automated Code Generation
Affiliation: High Performance Computing, Center for Biomedical Computing (SFF), Scientific Computing
Project(s): Center for Biomedical Computing (SFF)
Publication Type: PhD Thesis
Year of Publication: 2015
Place Published: Oslo

Manycore processors such as Graphics Processing Units (GPUs) and Xeon Phis have remarkable computational capabilities and energy efficiency, making them an attractive alternative to conventional CPUs for general-purpose computations. These advantages have led to their rapid adoption in modern heterogeneous supercomputers, where each node is equipped with manycore processors in addition to CPUs.

This thesis aims to develop methodologies for efficient programming of GPU clusters, from a single compute node equipped with multiple GPUs that share the same PCIe bus, to large supercomputers involving thousands of GPUs connected by a high-speed network. The former configuration offers a preview of the future node architecture of GPU clusters, where each compute node will be densely populated with GPUs. For this type of configuration, intra-node communication will play a more dominant role. We present programming techniques specifically designed to handle intra-node communication between multiple GPUs more effectively. For supercomputers involving multiple nodes, we have developed an automated code generator that delivers good weak scalability on thousands of GPUs.

While GPUs are improving rapidly, they are still not general-purpose processors and depend on CPUs to act as their hosts. Consequently, GPU clusters often feature powerful multi-core CPUs in addition to GPUs. Despite the presence of CPUs, many GPU applications have so far focused on performing computations exclusively on the GPUs, keeping the CPUs sidelined. However, as CPUs continue to advance, they have become too powerful to ignore. This gives rise to heterogeneous computing, where CPUs and GPUs jointly take part in the computations.

Heterogeneous computing codes can potentially achieve very high performance, but doing so requires careful attention to many programming details. We explore resource-efficient programming methodologies for heterogeneous computing where the CPU is an integral part of the computations. The experiments conducted demonstrate that, through careful workload partitioning and communication orchestration, our heterogeneous computing strategy outperforms a similar GPU-only approach on both structured and unstructured grids.

Although our work demonstrates the benefit of heterogeneous computing, the painstaking programming effort required is holding back its wider adoption. We address this issue through the development and implementation of a programming model and source-to-source compiler called Panda, which automatically parallelizes serial 3D stencil codes originally written in C into heterogeneous CPU+GPU code for execution on GPU clusters. We have used two applications to assess the performance of our framework. Experimental results show that the Panda-generated code is able to realize up to 90% of the performance of corresponding handwritten heterogeneous CPU+GPU implementations, while always outperforming the handwritten GPU-only implementations.

Compared to the more established GPU-only approach, the methodologies presented in this thesis harness the computational power of GPU clusters in a more resource-efficient way that can substantially accelerate simulations. Moreover, the user-friendly code generation tool alleviates the tedious and error-prone process of programming GPU clusters, so that computational scientists can concentrate on the science instead of code development.