|Authors||M. Sourouri, S. Baden and X. Cai|
|Title||Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers|
|Project(s)||User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF)|
|Publication Type||Journal Article|
|Year of Publication||2016|
|Journal||International Journal of Parallel Programming|
|Keywords||code generation, code optimisation, CPU+GPU computing, CUDA, heterogeneous computing, MPI, OpenMP, source-to-source translation, stencil computation|
We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential C stencil codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of the various forms of data movement by overlapping it with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow users to harness the full power of GPU-accelerated supercomputing without painstaking coding effort.
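For readers unfamiliar with the term, the following is a minimal sketch of the kind of sequential 3D stencil C code the abstract refers to as compiler input. It is not taken from the paper: the function names `idx` and `stencil7_sweep`, the 7-point shape, and the weights `c0` and `c1` are all assumptions chosen for illustration. The abstract does not show the framework's directive syntax, so a comment only marks where an annotation would presumably go.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flatten (i, j, k) into a 1D offset for an
   nx * ny * nz grid stored with i varying fastest. */
static size_t idx(size_t i, size_t j, size_t k, size_t nx, size_t ny)
{
    return (k * ny + j) * nx + i;
}

/* Hypothetical example: one sweep of a 7-point stencil over the interior
   points of a 3D grid. c0 weights the centre point and c1 weights each of
   its six face neighbours. Boundary points of `out` are left untouched. */
void stencil7_sweep(const double *in, double *out,
                    size_t nx, size_t ny, size_t nz,
                    double c0, double c1)
{
    /* The grid needs at least one interior point in each dimension. */
    assert(nx >= 3 && ny >= 3 && nz >= 3);

    /* A directive-based framework would annotate this loop nest here.
       The actual directive syntax is not given in the abstract, so none
       is shown. */
    for (size_t k = 1; k < nz - 1; ++k) {
        for (size_t j = 1; j < ny - 1; ++j) {
            for (size_t i = 1; i < nx - 1; ++i) {
                out[idx(i, j, k, nx, ny)] =
                    c0 * in[idx(i, j, k, nx, ny)] +
                    c1 * (in[idx(i - 1, j, k, nx, ny)] +
                          in[idx(i + 1, j, k, nx, ny)] +
                          in[idx(i, j - 1, k, nx, ny)] +
                          in[idx(i, j + 1, k, nx, ny)] +
                          in[idx(i, j, k - 1, nx, ny)] +
                          in[idx(i, j, k + 1, nx, ny)]);
            }
        }
    }
}
```

Each output point depends only on a fixed neighbourhood of input points. That locality is what makes it possible to split such a loop nest across MPI ranks, GPUs, and CPU threads, with only the subdomain boundary layers exchanged between them.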