Fast Multi-GPU communication over PCIe

NCCL (pronounced "Nickel") is a stand-alone library of standard collective communication routines for GPUs. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, and NVSwitch, as well as over networks using TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or spread across multiple nodes, and can be used in both single-process and multi-process (e.g., MPI) applications.
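As a concrete illustration of the collective API, a minimal single-process all-reduce across all visible GPUs might look as follows. This is only a sketch: error handling is omitted, the buffers are left uninitialized, and it assumes a machine with CUDA, NCCL, and at least two GPUs installed.

```c
/* Sketch: single-process sum-allreduce across all visible GPUs with NCCL.
   Build (assumption): gcc demo.c -lnccl -lcudart  -- requires CUDA and NCCL. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define MAX_GPUS 8

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n > MAX_GPUS) n = MAX_GPUS;       /* keep the fixed-size arrays safe */

    ncclComm_t  comms[MAX_GPUS];
    cudaStream_t streams[MAX_GPUS];
    float *buf[MAX_GPUS];
    const size_t count = 1 << 20;         /* 1M floats per GPU */

    /* One communicator per device, all inside this single process. */
    ncclCommInitAll(comms, n, NULL);

    for (int i = 0; i < n; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* In-place sum-allreduce; the calls are grouped so that issuing one
       collective per communicator from one thread does not deadlock. */
    ncclGroupStart();
    for (int i = 0; i < n; i++)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

`ncclCommInitAll` is the single-process convenience path; a multi-process (MPI) setup would instead broadcast a `ncclUniqueId` and call `ncclCommInitRank` in each rank.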

NCCL is used to communicate between multiple GPUs, and between multiple GPU-equipped machines, during distributed deep-learning training. When a training job spans multiple computers, NCCL communicates over TCP/IP.
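For multi-machine runs, which transport NCCL picks can be observed and steered through its documented environment variables. A possible setup for forcing and inspecting the TCP/IP socket path (the launcher command and benchmark binary name are placeholders):

```shell
export NCCL_DEBUG=INFO           # log which transport NCCL selects at init
export NCCL_SOCKET_IFNAME=eth0   # restrict socket traffic to one interface
export NCCL_IB_DISABLE=1         # disable InfiniBand so sockets are used

# Hypothetical two-node launch of a benchmark program:
mpirun -np 2 -H host1,host2 ./my_nccl_benchmark
```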

The tasks for the master's project are to:

  • Benchmark and analyze the existing TCP/IP implementation in NCCL
  • Run TCP/IP over PCIe to establish a performance baseline
  • Write an optimized PCIe transport for NCCL
  • Contribute the code back to the open-source NCCL project


Implement a PCIe transport in the NVIDIA Collective Communications Library (NCCL) and benchmark the implementation using deep-learning training workloads.

Learning outcome

In-depth knowledge of how to distribute workloads across multiple machines connected by a PCIe network. The student will also gain detailed insight into working with, modifying, and contributing code to an existing open-source library.


Prerequisites

A good understanding of C and/or C++ programming is required. INF3151 or an equivalent course is recommended.


Supervisors

  • Håkon Kvale Stensland
  • Pål Halvorsen
  • Jonas Markussen, Dolphin Interconnect Solutions
  • Hugo Kohmann, Dolphin Interconnect Solutions

Collaboration partners

Dolphin Interconnect Solutions


Contact person