Experimental Performance Analysis of High-Performance Interconnects for AI and HPC
In this thesis, you will step into the world of supercomputing networks and experimentally evaluate leading high-performance interconnects (Ultra Ethernet, InfiniBand, or Omni-Path) using a dedicated, cutting-edge hardware testbed. You'll explore how a chosen technology performs under realistic AI and HPC workloads, or compare two or more technologies head to head. This work combines hands-on system building, data analysis, and creative problem-solving: the perfect bridge between research and engineering.
Modern supercomputers and AI datacenters move enormous amounts of data every second, and their performance depends not just on powerful GPUs and CPUs but on the high-speed network that connects them. These interconnects promise low-latency, high-throughput communication, but they achieve it through different, complex mechanisms, from InfiniBand's reliable connection-oriented transport to Ultra Ethernet's novel connection-free approach. Before next-generation AI clusters adopt any single solution, we need rigorous, real-world evaluation to understand how these technologies perform and scale under realistic, demanding workloads.
This project will directly address this need by providing an unbiased, hardware-based study of one or more of these critical interconnects.
What you'll do (flexible, incremental tasks)
- Testbed Configuration & Setup: Design, configure, and benchmark a physical high-performance computing testbed featuring one or more interconnect technologies (Ultra Ethernet, InfiniBand, or Omni-Path).
- Performance Measurement: Develop and run benchmark suites to measure the raw performance of the chosen interconnect(s), focusing on micro-benchmarks (latency, throughput) and large-scale collective operations (e.g., all-to-all); see the ping-pong sketch after this list.
- Stress the Network with Workloads: Drive the chosen interconnect(s) with realistic application-level workloads (e.g., AI training traffic, HPC kernels) and compare their performance against the alternatives.
- Explore Protocol Dynamics: Investigate the real-world impact of core features such as transport protocols (e.g., RoCE vs. Ultra Ethernet transport), multi-path load balancing, or different congestion control mechanisms.
- Visualize and Analyze: Collect large amounts of performance data, build statistically sound plots, and uncover the bottlenecks and patterns that explain the performance differences between the network solutions.
You can tailor the project's scope — from focusing deeply on one feature to comparing several architectural aspects across multiple technologies.
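To make the benchmarking tasks concrete, below is a minimal MPI ping-pong latency sketch in the spirit of the OSU latency test. The iteration counts, message size, and output format are illustrative assumptions, not the project's prescribed tooling.

```cpp
// Minimal ping-pong latency micro-benchmark (illustrative sketch).
// Rank 0 and rank 1 bounce a small message; half the average round-trip
// time approximates the one-way latency of the interconnect.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int warmup = 1000;    // iterations excluded from timing
    const int iters  = 10000;   // timed iterations
    const int msg_size = 8;     // message size in bytes (assumption)
    std::vector<char> buf(msg_size, 0);

    double t_start = 0.0;
    for (int i = 0; i < warmup + iters; ++i) {
        if (i == warmup) t_start = MPI_Wtime();  // start timing after warm-up
        if (rank == 0) {
            MPI_Send(buf.data(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0) {
        double elapsed = MPI_Wtime() - t_start;
        // One-way latency = half the average round-trip time, in microseconds.
        std::printf("avg one-way latency: %.2f us\n",
                    elapsed / iters / 2.0 * 1e6);
    }
    MPI_Finalize();
    return 0;
}
```

Built with mpicxx and launched across two testbed nodes (exact launcher flags depend on your MPI stack, e.g. `mpirun -np 2 ./pingpong`), sweeping the message size then yields the familiar latency and bandwidth curves that the evaluation builds on.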
Goals
- Set up and utilize a dedicated, multi-server hardware testbed in the lab, capable of running real-world experiments.
- Run repeatable experiments using standard HPC benchmarks (e.g., OSU Micro-benchmarks), application-level HPC kernels, and representative AI workloads to measure throughput, latency, fairness, and congestion effects.
- (Optional) Develop analytical models to explain and generalize your hardware-based findings; a sketch of one classic latency-bandwidth model follows below.
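For the optional modeling goal, one natural starting point (a suggestion, not a requirement of the project) is the classic Hockney alpha-beta model, which fits measured point-to-point transfer times with a startup-latency term plus a bandwidth term:

```latex
% Hockney (alpha-beta) model: time to transfer an m-byte message,
% with startup latency \alpha and asymptotic bandwidth B.
T(m) = \alpha + \frac{m}{B}
```

Fitting alpha and B to the micro-benchmark data for each interconnect gives a compact, comparable summary, and deviations from the fitted line often point to protocol effects (e.g., eager-to-rendezvous switchover or congestion) worth investigating.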
Learning outcomes
- Direct experience with the hardware and technologies powering exascale supercomputers and the next generation of AI-centric data centers (AI factories).
- Practical skills in performance evaluation, testbed configuration, data analysis, and empirical research.
- The chance to contribute new, validated insights to the international discussion shaping the future of computing infrastructure.
Qualifications
- Students passionate about networks, systems, or AI infrastructure.
- You should enjoy programming (C++/Python etc.), experimenting, and making sense of data.
- A background in networking basics is expected; the rest you'll learn along the way.
Supervisors
- Sven-Arne Reinemo
Collaboration partners
- NTNU, UiO