On memory traffic and optimisations for low-order finite element assembly algorithms on multi-core CPUs