Pushing the Limits of Supercomputing Networks - Exploring Ultra Ethernet through Simulation

Modern supercomputers and AI datacenters move staggering amounts of data every second. Training large language models or simulating climate systems depends not only on powerful GPUs and CPUs, but also on the network that connects them. Today's high-performance networks such as InfiniBand and RoCE have served us well, but they are reaching their limits: they struggle with congestion, require lossless delivery, and do not scale easily to the millions of endpoints expected in next-generation AI factories. Enter Ultra Ethernet (UE), a new open standard developed by industry leaders such as AMD, Intel, Microsoft, and HPE. Ultra Ethernet promises to bring the performance of supercomputer interconnects to ordinary Ethernet, using novel algorithms for packet spraying, congestion control, and connection-free communication. It could redefine how tomorrow's supercomputers and AI clusters are built. But before that happens, we need to understand how well it actually works.

Your mission

In this thesis, you will step into the world of supercomputing networks and experimentally evaluate UE using cutting-edge network simulators. You'll explore how its new algorithms perform under realistic AI and HPC workloads and compare them with today's InfiniBand/RDMA systems. This work combines hands-on system building, data analysis, and creative problem-solving - the perfect bridge between research and engineering.

What you'll do (flexible, incremental tasks)

  • Build and Extend Simulations: Configure or extend an open-source simulator (e.g. ATLAHS, ns-3, OMNeT++) to model Ultra Ethernet features such as packet spraying, congestion control, or the new deferrable send.
  • Stress the Network: Compare UE's performance against traditional RDMA systems under realistic workloads, with and without congestion control - from AI training traffic to large-scale all-to-all communication.
  • Explore Multi-Path Networking: Evaluate how per-packet load balancing and out-of-order delivery affect speed and fairness (a minimal sketch of both ideas follows this list).
  • Test Topologies: Simulate how different network structures (fat-tree, dragonfly, torus) or job placements influence UE performance; see the fat-tree sketch further below.
  • Visualize and Analyze: Collect data, build plots, and uncover the patterns that make one network outperform another.
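
To make packet spraying and out-of-order delivery concrete, here is a minimal, self-contained Python sketch. It is not tied to any particular simulator; the four paths, their delays, and the jitter model are all illustrative assumptions. Packets are sprayed round-robin across paths with different latencies, and a simple reorder buffer at the receiver restores sequence order:

    import heapq
    import random

    # Hypothetical one-way delays (microseconds) for four equal-cost paths;
    # the values are illustrative, standing in for different queue depths.
    PATH_DELAY_US = [5.0, 7.5, 6.0, 12.0]
    NUM_PACKETS = 16

    random.seed(42)

    # Sender side: spray packets round-robin over all paths
    # (per-packet load balancing).
    arrivals = []  # min-heap of (arrival_time, sequence_number, path_id)
    for seq in range(NUM_PACKETS):
        path = seq % len(PATH_DELAY_US)
        jitter = random.uniform(0.0, 2.0)  # assumed queueing noise
        heapq.heappush(arrivals, (PATH_DELAY_US[path] + jitter, seq, path))

    # Receiver side: packets arrive out of order; a reorder buffer
    # holds them until an in-order prefix can be delivered.
    buffered = {}
    next_expected = 0
    max_buffered = 0
    while arrivals:
        t, seq, path = heapq.heappop(arrivals)
        buffered[seq] = path
        max_buffered = max(max_buffered, len(buffered))
        while next_expected in buffered:   # deliver any in-order prefix
            del buffered[next_expected]
            next_expected += 1
        print(f"t={t:6.2f}us  seq={seq:2d} via path {path}, "
              f"delivered up to seq {next_expected - 1}")

    print(f"peak reorder-buffer occupancy: {max_buffered} packet(s)")

Varying the per-path delays or the spraying policy in such a toy model already shows the core trade-off: better load balance across paths versus larger reorder buffers at the receiver.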

You can tailor the project's scope - from focusing deeply on one feature to comparing several aspects across architectures.
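
As one concrete starting point for the topology experiments, here is a small Python sketch that builds the adjacency list of a classic k-ary fat-tree (the Al-Fares construction). The naming scheme and the k=4 example are illustrative assumptions; the same structure could be translated into whatever topology format your simulator expects:

    from collections import defaultdict

    def fat_tree(k):
        """Adjacency list of a k-ary fat-tree (k even): k pods, k/2 edge and
        k/2 aggregation switches per pod, (k/2)^2 cores, k^3/4 hosts."""
        assert k % 2 == 0, "k must be even"
        half = k // 2
        adj = defaultdict(set)

        def link(a, b):
            adj[a].add(b)
            adj[b].add(a)

        for pod in range(k):
            for e in range(half):
                edge = f"edge[{pod},{e}]"
                for h in range(half):          # hosts under this edge switch
                    link(edge, f"host[{pod},{e},{h}]")
                for a in range(half):          # full edge<->agg mesh in pod
                    link(edge, f"agg[{pod},{a}]")
            for a in range(half):
                for c in range(half):          # each agg reaches k/2 cores
                    link(f"agg[{pod},{a}]", f"core[{a},{c}]")
        return adj

    adj = fat_tree(4)
    hosts = [n for n in adj if n.startswith("host")]
    print(f"k=4: {len(adj) - len(hosts)} switches, {len(hosts)} hosts")
    # Sanity check: every core switch links to exactly one agg per pod.
    assert all(len(adj[c]) == 4 for c in adj if c.startswith("core"))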

Goals

  • Use realistic traces from AI and HPC applications in a simulator such as ATLAHS, built for large-scale system studies.
  • Run repeatable, trace-driven experiments to measure throughput, latency, fairness, and congestion effects.
  • (Optional) Develop analytical models to explain and generalize your findings.
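
As a taste of the metric analysis involved, the snippet below computes Jain's fairness index over per-flow throughputs, J = (sum of x_i)^2 / (n * sum of x_i^2): it equals 1.0 when all flows receive equal throughput and falls toward 1/n as the allocation becomes maximally skewed. The sample numbers are made up:

    def jain_fairness(throughputs):
        """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
        n = len(throughputs)
        total = sum(throughputs)
        return total * total / (n * sum(x * x for x in throughputs))

    # Made-up per-flow throughputs in Gb/s, e.g. from a trace-driven run.
    fair_run = [9.8, 10.1, 9.9, 10.2]
    skewed_run = [19.0, 1.0, 18.5, 1.5]
    print(f"fair run:   {jain_fairness(fair_run):.3f}")    # close to 1.0
    print(f"skewed run: {jain_fairness(skewed_run):.3f}")  # about 0.57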

Learning outcomes

  • Experience with the technologies powering exascale supercomputers and the next generation of AI-centric data centers (AI factories).
  • Practical skills in network simulation, performance evaluation, and data-driven research.
  • The chance to contribute new insights to an emerging international standard shaping the future of computing.

Qualifications

  • Students passionate about networks, systems, or AI infrastructure.
  • You should enjoy programming (C++, Python, etc.), experimenting, and making sense of data.
  • A background in networking basics is expected - the rest you'll learn along the way.

Supervisors

  • Sven-Arne Reinemo

Collaboration partners

  • NTNU, UiO

Associated contact