An AI-Assistant for writing cluster-specific batch jobs

Develop an AI-based assistant to help write execution scripts for the so-called SLURM scheduler.

In this thesis, an AI-based assistant will be developed that helps users write execution scripts for the SLURM scheduler. Typically, when users want to run an experiment on Simula's eX3 cluster, they write a batch script that defines which nodes and resources are to be used for the experiment, and how. Due to the layout of the cluster, a continuously updated software environment, and SLURM's large feature set, designing an optimal batch script can be challenging. To address this, a solution based on a domain-specific Large Language Model (LLM) will be developed that helps users write SLURM scripts and optimize scheduling parameters according to the cluster configuration and scheduling constraints.
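To illustrate what such a batch script contains, the following is a minimal sketch in Python that renders a SLURM batch script from a small resource description. It is not part of the thesis specification. The `JobRequest` class, the `render_batch_script` function, the partition name `example-partition`, and the command `python train.py` are hypothetical and chosen only for illustration; they do not reflect the actual eX3 configuration. The `#SBATCH` options shown are standard `sbatch` options.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Resources requested for one experiment (all values are illustrative)."""
    job_name: str
    partition: str      # hypothetical partition name, not a real eX3 queue
    nodes: int
    ntasks: int
    cpus_per_task: int
    time_limit: str     # SLURM time format, e.g. "01:30:00"
    command: str        # program to launch inside the allocation


def render_batch_script(req: JobRequest) -> str:
    """Render a minimal SLURM batch script from a JobRequest."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={req.job_name}",
        f"#SBATCH --partition={req.partition}",
        f"#SBATCH --nodes={req.nodes}",
        f"#SBATCH --ntasks={req.ntasks}",
        f"#SBATCH --cpus-per-task={req.cpus_per_task}",
        f"#SBATCH --time={req.time_limit}",
        "#SBATCH --output=%x-%j.out",  # %x expands to job name, %j to job ID
        "",
        f"srun {req.command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical request: one node, one task, four CPU cores, 90 minutes.
example = JobRequest(
    job_name="demo",
    partition="example-partition",
    nodes=1,
    ntasks=1,
    cpus_per_task=4,
    time_limit="01:30:00",
    command="python train.py",
)
script = render_batch_script(example)
print(script)
```

An assistant of the kind described here would have to choose such values, in particular the partition and resource counts, based on the actual cluster layout rather than on fixed defaults as in this sketch.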

Goal

The primary goal of this thesis is to develop an AI-based assistant to help write execution scripts for the so-called SLURM scheduler. This will be achieved through the following objectives:

Conduct a comprehensive literature review of domain-specific AI-based assistants in general, and identify and evaluate existing user support tools for writing SLURM scripts.
Select or adapt a suitable approach for developing an AI-based assistant, based on a thorough evaluation of the state of the art and taking into account the specifics of the eX3 cluster.
Successfully deploy the AI-assistant as an end-to-end solution, i.e., serve the developed LLM assistant and integrate it into a web-based frontend.
Thoroughly characterize the performance of the developed solution to verify functional aspects as well as its usability.
Analyze the impact of the solution by conducting a user evaluation and defining meaningful metrics for monitoring its online usage.
Discuss the findings, identify strengths and limitations, and draw on the lessons learned during development to propose future activities for adapting or improving the solution.

Learning outcome

Upon successful completion of this thesis, the student will have gained:

Advanced Knowledge in LLM-based Solutions: Theoretical and practical understanding of how LLMs work and can be tailored for domain-specific usage.
Strong Research and Analytical Skills: Ability to conduct independent research, critically evaluate scientific literature, design and execute experiments, analyze data, and present findings in a clear and concise manner.
Problem-Solving in Cutting-Edge Technologies: Experience in tackling an engineering challenge at the intersection of AI, computer architecture, and high-performance computing, preparing for future roles in academia or industry.
Practical Software Development: Hands-on experience in designing, implementing, and optimizing a software system end to end.

Qualifications

This thesis is challenging and requires a strong foundation in several technical areas. Ideal candidates should possess:

Required:

BSc or equivalent in Computer Science, Electrical Engineering, Applied Physics, or a related field
Solid understanding of Artificial Neural Networks and Deep Learning concepts
Proficiency in Python programming
Basic understanding of computer architecture and embedded systems
Strong analytical and problem-solving skills
High motivation for hands-on, practical, as well as experimental work

Highly Desired (but can be learned during the thesis):

Prior exposure to Large Language Models and related technologies
Experience with Linux command-line environments and system administration
Experience with deep learning frameworks (TensorFlow/Keras or PyTorch)

Supervisors

Thomas Roehr
Johannes Langguth
