Investigating multi-node GPU workloads on heterogeneous GPU constellations
Investigate multi-node GPU workloads on heterogeneous GPU constellations, with the aim of finding optimal ways to utilize a diverse set of GPUs for AI workloads.
The relentless demand for artificial intelligence at the edge and in high-performance computing (HPC) environments is driving innovation in specialized hardware accelerators. While a variety of accelerators exists, most environments have a relatively homogeneous structure, e.g., favoring one of many GPU options. Meanwhile, shortages of specific devices may lead operators to shift to alternatives, resulting in heterogeneous GPU constellations.
Goal
The primary goal of this thesis is to investigate the (optimal) use of a heterogeneous set of GPUs for AI workloads. This will be achieved through the following objectives:
- Conduct a comprehensive literature review on the techniques available for enabling multi-GPU (and multi-node) computations (one such technique is sketched after this list)
- Analyze existing techniques for using multiple GPUs
- Select or implement a suitable benchmark to measure the performance of a multi-GPU multi-node setup
- Suggest improvements and optimizations when using heterogeneous GPUs
- Analyze the impact of the suggested improvements and optimizations
- Discuss the findings, identify the strengths and limitations of the approach, and identify future developments needed
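To illustrate the kind of technique the objectives above refer to, the following is a minimal, hypothetical sketch of multi-node, multi-GPU data-parallel training using PyTorch DistributedDataParallel, launched with one process per GPU via torchrun. It is not part of the thesis tasks; the toy model, the 16 GiB reference memory size, and the memory-proportional batch-size heuristic for heterogeneous GPUs are illustrative assumptions only.

```python
# Hypothetical sketch: PyTorch DistributedDataParallel (DDP) across multiple nodes,
# one process per GPU. Heterogeneity is handled naively by scaling the per-rank
# batch size with the GPU's memory (the 16 GiB baseline and factor 32 are assumptions).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # On a heterogeneous constellation each rank may see a different GPU model.
    props = torch.cuda.get_device_properties(local_rank)
    batch_size = max(8, int(32 * props.total_memory / (16 * 1024**3)))
    print(f"rank {dist.get_rank()}: {props.name}, batch_size={batch_size}")

    # Toy model and synthetic data, purely for illustration.
    model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(10):
        x = torch.randn(batch_size, 1024, device=local_rank)
        y = torch.randint(0, 10, (batch_size,), device=local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradients are all-reduced across all ranks and nodes
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Such a script would typically be started on every node with torchrun, e.g. `torchrun --nnodes=<N> --nproc_per_node=<GPUs per node> --rdzv_backend=c10d --rdzv_endpoint=<host>:<port> train.py`. Note that DDP averages gradients uniformly over ranks, so unequal per-rank batch sizes implicitly re-weight the data; quantifying such trade-offs on heterogeneous GPUs is exactly the kind of question this thesis could address.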
Learning outcomes
Upon successful completion of this thesis, the student will have gained:
- Advanced Knowledge in GPU usage: a deep theoretical and practical understanding of how GPUs are used and applied for performance
- Expertise in a machine learning framework: hands-on experience in designing, implementing, and optimizing accelerator-based AI model training
- Strong Research and Analytical Skills: ability to conduct independent research, critically evaluate scientific literature, design and execute complex experiments, analyze data, and present findings in a clear and concise manner
- Problem-Solving: experience in tackling open research problems at the intersection of AI, computer architecture, and high-performance computing, preparing for future roles in academia or industry
- Software Development: practical experience in taking an idea from conceptualization to realization
Qualifications
This thesis is highly challenging and requires a strong foundation in several technical areas. Ideal candidates should possess:
Required:
- BSc or equivalent in Computer Science, Electrical Engineering, or a related field
- Experience with Linux command-line environments
- Basic understanding of scheduling approaches
- Proficiency in C++ or Python programming
- Strong analytical and problem-solving skills
- High motivation for hands-on experimental work with hardware
Highly desired (but can be learned during the thesis):
- Prior exposure to GPU-based programming
- Knowledge of a machine learning framework such as PyTorch
- Proficiency in Bash scripting or Python
Supervisors
- Thomas Roehr
- Håkon Kvale Stensland
References
- eX3 - Experimental Infrastructure for Exploration of Exascale Computing, https://www.ex3.simula.no/
- PyTorch - an open-source machine learning framework, https://pytorch.org