|Authors||J. D. Trotter, X. Cai and S. W. Funke|
|Title||On memory traffic and optimisations for low-order finite element assembly algorithms on multi-core CPUs|
|Project(s)||Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing|
|Publication Type||Journal Article|
|Year of Publication||2022|
|Journal||ACM Transactions on Mathematical Software|
|Publisher||Association for Computing Machinery (ACM)|
Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.
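The lookup-table optimisation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes linear (P1) triangles, a CSR sparse matrix with sorted column indices, and a placeholder element matrix; all function names (`build_csr_pattern`, `build_lookup_table`, `assemble_with_search`, `assemble_with_lookup`) are hypothetical. The idea shown is that the position of each element-matrix entry in the global CSR value array is computed once and stored, so repeated assembly avoids a per-entry search through the column indices.

```python
from bisect import bisect_left

def build_csr_pattern(cells, num_vertices):
    """Build CSR row pointers and sorted column indices from cell connectivity."""
    rows = [set() for _ in range(num_vertices)]
    for cell in cells:
        for i in cell:
            rows[i].update(cell)
    rowptr, colidx = [0], []
    for cols in rows:
        colidx.extend(sorted(cols))
        rowptr.append(len(colidx))
    return rowptr, colidx

def build_lookup_table(cells, rowptr, colidx):
    """For each cell, store the CSR offset of every local (i, j) entry, row-major."""
    table = []
    for cell in cells:
        offsets = []
        for i in cell:
            for j in cell:
                # Binary search within row i; done once, then reused.
                offsets.append(bisect_left(colidx, j, rowptr[i], rowptr[i + 1]))
        table.append(offsets)
    return table

def assemble_with_search(cells, element_matrices, rowptr, colidx):
    """Baseline: search the column indices for every added entry."""
    values = [0.0] * len(colidx)
    for cell, Ae in zip(cells, element_matrices):
        for li, i in enumerate(cell):
            for lj, j in enumerate(cell):
                pos = bisect_left(colidx, j, rowptr[i], rowptr[i + 1])
                values[pos] += Ae[li][lj]
    return values

def assemble_with_lookup(element_matrices, table, nnz):
    """Lookup-table variant: add entries at precomputed offsets, no search."""
    values = [0.0] * nnz
    for Ae, offsets in zip(element_matrices, table):
        k = 0
        for row in Ae:
            for entry in row:
                values[offsets[k]] += entry
                k += 1
    return values

# Two triangles sharing the edge between vertices 1 and 2 (assumed toy mesh).
cells = [(0, 1, 2), (1, 3, 2)]
# Placeholder element matrix with zero row sums, not a real stiffness matrix.
Ae = [[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]]
element_matrices = [Ae, Ae]

rowptr, colidx = build_csr_pattern(cells, num_vertices=4)
table = build_lookup_table(cells, rowptr, colidx)
values_search = assemble_with_search(cells, element_matrices, rowptr, colidx)
values_lookup = assemble_with_lookup(element_matrices, table, len(colidx))
```

Both variants produce the same CSR values; the lookup table trades extra storage (one offset per element-matrix entry) for fewer irregular reads of the column index array during assembly.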