Authors: J. Markussen
Title: SmartIO: Device sharing and memory disaggregation in PCIe clusters using non-transparent bridging
Affiliation: Communication Systems
Project(s): Unified PCIe IO: Unified PCI Express for Distributed Component Virtualization, Department of Holistic Systems, Department of High Performance Computing
Status: Published
Publication Type: PhD Thesis
Year of Publication: 2022
Degree awarding institution: University of Oslo
Degree: PhD
Number of Pages: 236
Date Published: 10/2022
Publisher: University of Oslo (UiO)
Thesis Type: Paper Collection
Abstract

Distributed and parallel computing applications are becoming increasingly compute-heavy and data-driven, accelerating the need for disaggregation solutions that enable sharing of I/O resources between networked machines. For example, in a heterogeneous computing cluster, different machines may have different devices available to them, but distributing I/O resources in a way that maximizes both resource utilization and overall cluster performance is a challenge. To facilitate device sharing and memory disaggregation among machines connected using PCIe non-transparent bridges, we present SmartIO. SmartIO makes all machines in the cluster, including their internal devices and memory, part of a common PCIe domain. By leveraging the memory mapping capabilities of non-transparent bridges, remote resources may be used directly, as if these resources were local to the machines using them. Whether devices are local or remote is made transparent by SmartIO. NVMes, GPUs, FPGAs, NICs, and any other PCIe device can be dynamically shared with and distributed to remote machines, and it is even possible to disaggregate devices and memory, in order to share component parts with multiple machines at the same time. Software is entirely removed from the performance-critical path, allowing remote resources to be used with native PCIe performance. To demonstrate that SmartIO is an efficient solution, we have performed a comprehensive evaluation consisting of a wide range of performance experiments, including both synthetic benchmarks and realistic, large-scale workloads. Our experimental results show that remote resources can be used without any performance overhead compared to using local resources, in terms of throughput and latency. Thus, compared to existing disaggregation solutions, SmartIO provides more efficient, low-cost resource sharing, increasing the overall system performance and resource utilization.

URL: https://www.duo.uio.no/handle/10852/97351
