Automatic Program Repair of Security Vulnerabilities

Develop and evaluate data-driven techniques and prototypes for automatically repairing security vulnerabilities in source code.
Master

The exploitation of security vulnerabilities in software can affect large groups of people and lead to massive financial damage. Up to now, finding and countering bugs, hacks, and other cyber-threats has primarily been a matter of craftsmanship: professional bug hunters and security experts spend long hours inspecting vast amounts of source code to find and repair vulnerabilities that adversaries could otherwise exploit. This slow and tedious process risks falling behind the pace at which cyber-threats materialize.

Data-driven software engineering aims to use the wealth of data produced during software development and operation to support the development, maintenance, and evolution of software. Concretely, we apply machine learning and data mining techniques to software engineering data (such as source code, versioning histories, issue trackers, build and test logs, and operational data) to derive actionable insights, which in this case aim to make a system more secure by reducing the number of security vulnerabilities.

A recent area of research aims to build on the statistical patterns that can be found in large corpora of source code, such as the code hosted on GitHub, to drive new software development tools and program analyses. The underlying assumptions are that these vast amounts of code implicitly embed knowledge of how good code should be written, and that this knowledge can be uncovered through machine learning and data mining.

The goal of this project is to investigate how, and to what extent, it is possible to automatically repair security vulnerabilities in source code based on frequent patterns learned from large corpora of source code. The most interesting starting points for this investigation come from modern machine-learning-based natural language processing, and are very similar to the techniques that help today's email programs suggest how to continue or finish a sentence. Recently, these techniques have also been used in IDEs for advanced code completion, which we hypothesize makes them ideal candidates for generating repair suggestions.
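To make the idea of repair suggestions from learned patterns more concrete, the following is a minimal sketch and not the approach prescribed for the project. It assumes a deliberately simple bigram model over code tokens, whereas the project targets far more capable neural language models. The toy corpus, the function names (`train_bigram_model`, `token_probability`, `suggest_repair`) and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of tokenized bounds checks (illustration only).
CORPUS = [
    ["if", "(", "len", "<", "size", ")"],
    ["if", "(", "len", "<", "size", ")"],
    ["if", "(", "len", "<", "size", ")"],
    ["if", "(", "len", "<=", "size", ")"],
]


def train_bigram_model(corpus):
    """Count how often each token follows each preceding token."""
    follows = defaultdict(Counter)
    for tokens in corpus:
        for prev, curr in zip(tokens, tokens[1:]):
            follows[prev][curr] += 1
    return follows


def token_probability(model, prev, curr):
    """Estimate P(curr | prev) from the raw counts (no smoothing)."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    return counts[curr] / total if total else 0.0


def suggest_repair(model, tokens, index, threshold=0.5):
    """Suggest a replacement for tokens[index] if it is unlikely in context.

    Returns the most frequent token seen after tokens[index - 1], or None
    when the current token is likely enough or no context is known.
    The threshold is an arbitrary assumption for this sketch.
    """
    if not 1 <= index < len(tokens):
        raise IndexError("index must point at a token with a predecessor")
    prev, curr = tokens[index - 1], tokens[index]
    candidates = model.get(prev)
    if not candidates:
        return None
    if token_probability(model, prev, curr) >= threshold:
        return None
    best, _ = candidates.most_common(1)[0]
    return best if best != curr else None


model = train_bigram_model(CORPUS)
suspicious = ["if", "(", "len", "<=", "size", ")"]
print(suggest_repair(model, suspicious, 3))  # prints "<"
```

In this toy setting, the off-by-one comparison `<=` is flagged because it is rare after `len` in the corpus, and the more common `<` is proposed instead. Frequency alone does not establish correctness, so part of the project is to evaluate to what extent such learned suggestions actually remove vulnerabilities.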

Learning outcome

  • application of data science in a software engineering context
  • proficiency with implementing and evaluating data-driven software engineering techniques and prototypes
  • appreciation for the state of the art in machine learning on source code
  • experience with working in an exciting and active research environment
  • excellent opportunities to publish your research results in the form of a scientific publication

Qualifications

  • interested in software security/application security
  • interested in machine learning, in particular machine learning on source code
  • preferably knowledge of Python, R, and LaTeX

Supervisors

  • Leon Moonen

Contact person