|Authors|L. Burchard, X. Cai and J. Langguth|
|Title|iPUG for multiple Graphcore IPUs: Optimizing performance and scalability of parallel breadth-first search|
|Project(s)|Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation|
|Publication Type|Proceedings, refereed|
|Year of Publication|2021|
|Conference Name|28th IEEE International Conference on High Performance Computing, Data, & Analytics (HiPC)|
|Place Published|Bangalore, India|
Parallel graph algorithms have become one of the principal applications of high-performance computing, besides numerical simulations and machine learning workloads. However, due to their highly unstructured nature, graph algorithms remain extremely challenging for most parallel systems, with large gaps between observed performance and theoretical limits. Furthermore, most mainstream architectures rely heavily on single instruction multiple data (SIMD) processing for high floating-point rates, which is of little benefit for graph processing, which instead requires high memory bandwidth, low memory latency, and efficient processing of unstructured data. On the other hand, we are currently observing an explosion of new hardware architectures, many of which are adapted to specific purposes and diverge from traditional designs. A notable example is the Graphcore Intelligence Processing Unit (IPU), which is developed to meet the needs of upcoming machine intelligence applications. Its design eschews the traditional cache hierarchy, relying on SRAM as its main memory instead. The result is an extremely high-bandwidth, low-latency memory at the cost of capacity. In addition, the IPU consists of a large number of independent cores, allowing for true multiple instruction multiple data (MIMD) processing. Together, these features suggest that such a processor is well suited for graph processing. We test the limits of graph processing on multiple IPUs by implementing a low-level, high-performance code for breadth-first search (BFS), following the specifications of Graph500, the most widely used benchmark for parallel graph processing. Despite the simplicity of the BFS algorithm, implementing efficient parallel codes for it has proven to be a challenging task in the past. We show that our implementation scales well on a system with 8 IPUs and attains roughly twice the performance of an equal number of NVIDIA V100 GPUs using state-of-the-art CUDA code.
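For context, the kernel benchmarked by Graph500 is a level-synchronous BFS: starting from a source vertex, each step expands the current frontier into the next, recording the level (distance) at which each vertex is first reached. The following is a minimal sequential sketch of that idea in Python; it is purely illustrative and is not the paper's low-level IPU implementation, and the function and variable names are our own.

```python
from collections import defaultdict

def bfs_levels(edges, source):
    """Level-synchronous BFS sketch: expand one whole frontier per step.

    edges  -- iterable of (u, v) pairs, treated as undirected
              (Graph500 generates undirected graphs)
    source -- starting vertex
    Returns a dict mapping each reachable vertex to its BFS level.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    level = {source: 0}   # also serves as the "visited" set
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:      # first visit fixes the level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier        # synchronize: swap frontiers
    return level
```

The parallelization challenge the paper addresses lies precisely in this loop structure: the frontier is irregular and data-dependent, so distributing the neighbor expansion across many cores (or, here, many IPU tiles) with balanced load and low communication cost is nontrivial.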