Data-Driven Software Engineering
The goal of the Data-Driven Software Engineering department (DataSED) is to help software engineers be even better at what they do using data-driven solutions. This is done by building on the wealth of data produced during software development and operation to support software engineers with analysing, evolving, and operating large and complex software-intensive systems.
DataSED focuses on the development and empirical evaluation of custom machine learning and data mining techniques that solve software engineering problems using evidence-based, actionable insights. These techniques operate on various types of data, including source code, change histories from versioning systems, data from issue-tracking databases, logs from building, deploying and testing the system, and run-time information collected through logging and instrumentation.
The DataSED department works closely with industry to ensure that our research addresses real-world problems, and to test potential solutions in real-life circumstances. The research is firmly rooted in well-established disciplines of software engineering, such as software repository mining, program analysis, automated program repair, software reverse engineering, generic language technology, and empirical software engineering.
We investigate data-driven solutions to help software engineers be even better at what they do.
Leon Moonen, head of the DataSED department
Focus areas
Cybersecurity
Until recently, finding and countering bugs, hacks, and other cyber threats has been a painstaking process, with security professionals spending many hours inspecting vast amounts of source code to find and repair vulnerabilities. It is an endless battle, and defenders are in danger of falling behind the pace at which cyber threats materialize.
One approach DataSED takes is to apply the machine learning techniques used for natural language processing to the statistical patterns found in large corpora of source code, such as the repositories hosted on GitHub, and to use these patterns to create intelligent agents for the automated identification and repair of software security vulnerabilities.
A second approach aims to automatically collect and interconnect cyber threat intelligence in knowledge graphs that help to connect the dots between (versions of) software systems, their vulnerabilities, security threats and exploits, and concrete incidents to enable proactive risk assessment and mitigation.
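As a rough illustration of how such a knowledge graph can connect the dots, the following minimal sketch stores typed relations and walks them breadth-first from a software version to everything it is linked to. It is not DataSED's implementation: the `ThreatGraph` class, the relation names, and all node identifiers (`webapp 2.3`, `VULN-A`, and so on) are hypothetical placeholders chosen for this example.

```python
from collections import defaultdict, deque


class ThreatGraph:
    """Toy knowledge graph: nodes are strings, edges carry a relation label."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, target), ...]

    def add(self, source, relation, target):
        self.edges[source].append((relation, target))

    def reachable(self, start):
        """Breadth-first walk from `start`.

        Returns a dict mapping every linked node to the list of relations
        on the first path found from `start` to that node.
        """
        seen = {start: []}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for relation, target in self.edges.get(node, []):
                if target not in seen:
                    seen[target] = seen[node] + [relation]
                    queue.append(target)
        del seen[start]
        return seen


# Hypothetical data: every identifier below is made up for illustration.
graph = ThreatGraph()
graph.add("webapp 2.3", "depends_on", "libfoo 1.4")
graph.add("libfoo 1.4", "has_vulnerability", "VULN-A")
graph.add("VULN-A", "exploited_by", "EXPLOIT-X")
graph.add("EXPLOIT-X", "used_in", "INCIDENT-7")

risks = graph.reachable("webapp 2.3")
for node, path in risks.items():
    print(node, "via", " -> ".join(path))
```

A risk assessment built on such a graph could, for instance, flag every system version from which a known incident is reachable.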
Software Resilience
Substantial work has been put into software testing in recent years, but much of our software is still plagued by failures. One reason is that the existing techniques for software testing rely on checking that the conditions corresponding to known or anticipated problems do not occur. However, the complexity of modern highly interconnected software systems makes it impossible to anticipate all issues that could be encountered, so to ensure software resilience, additional technology that can take care of unanticipated issues is needed.
The DataSED team explores the use of adaptive bio-inspired approaches to create autonomously self-healing systems. These are self-monitoring systems that can understand when they are not operating correctly and, without human intervention, make the necessary adjustments to restore themselves to normal operation. Specifically, we investigate the use of machine learning techniques such as artificial immune systems and reinforcement learning on operational data derived from a software system to learn how to automatically improve a system’s resilience.
Intelligent Analytics
Iterative development processes, such as continuous engineering, are an emerging trend in software engineering, streamlining the way software is built, tested and shipped, and enabling developers to make changes to products quickly and frequently in response to changing requirements. However, a frequently reported barrier to the successful adoption of continuous engineering is the need to make sense of the vast amounts of logging data produced by the numerous test and build runs.
DataSED department members devise intelligent analytical techniques to process those logs and help developers optimise their workflow. Examples include grouping the logs of all runs that failed for similar underlying reasons, and automated log diagnosis techniques that highlight the parts of a log that are most strongly associated with the failure.
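To make the idea of grouping failed runs concrete, here is a minimal sketch under simplifying assumptions. It is not the department's technique: it swaps in a deliberately simple stand-in, namely masking volatile values such as numbers and then greedily grouping logs whose token sets have a high Jaccard similarity. The function names, the threshold of 0.6, and the sample log lines are all hypothetical.

```python
import re


def normalize(log_text):
    """Mask numbers and hex values so similar failures share the same tokens."""
    masked = re.sub(r"0x[0-9a-f]+|\d+", "<num>", log_text.lower())
    return set(re.findall(r"[a-z_<>]+", masked))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_logs(logs, threshold=0.6):
    """Greedy grouping: a log joins the first group whose representative
    (the first log placed in that group) is similar enough, else starts a new one."""
    groups = []  # list of (representative_tokens, [run_ids])
    for run_id, text in logs.items():
        tokens = normalize(text)
        for representative, members in groups:
            if jaccard(representative, tokens) >= threshold:
                members.append(run_id)
                break
        else:
            groups.append((tokens, [run_id]))
    return [members for _, members in groups]


# Hypothetical failed-run logs, invented for this example.
failed_runs = {
    "run-101": "ERROR: connection to db-7 timed out after 30 s",
    "run-102": "ERROR: connection to db-3 timed out after 45 s",
    "run-103": "FAIL: test_parse expected 4 tokens but got 5",
}

clusters = group_logs(failed_runs)
print(clusters)
```

With this sample input, the two timeout logs end up in one group and the test failure in another, so a developer would need to inspect only one representative per group.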
Recommendation Systems
As software systems evolve, the dependencies in the source code grow in number and complexity. As a result, it becomes increasingly challenging for developers to predict the overall effect of making a change to the system. Change impact analysis helps to overcome this challenge by determining the source-code artifacts (files, methods, classes) that are related to a developer’s current changes.
Traditionally, change impact analysis used static or dynamic program analysis to identify dependencies. However, these techniques are not well suited to modern heterogeneous software systems, and they tend to over-approximate the impact of a change.
DataSED investigates alternative techniques for change impact analysis that identify dependencies through evolutionary coupling, i.e., the way in which a system was changed over time. In essence, this exploits the developers’ inherent knowledge of dependencies in the system, which is manifested in the way that they changed artifacts together to add functionality or fix bugs. We develop custom association rule mining algorithms that can be used to determine change impact by uncovering relevant patterns in a software system’s change history, and to produce recommendations that help a developer with effective evolution and testing of the system.
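The principle behind evolutionary coupling can be sketched with a basic pairwise association rule miner. This is a textbook simplification, not one of DataSED's custom algorithms: it counts how often two files were changed in the same commit, keeps rules "if A changed, B changed too" that meet a minimum support and confidence, and uses them to recommend files likely to be affected. The function names, thresholds, and the sample change history are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def mine_cochange_rules(commits, min_support=2, min_confidence=0.5):
    """Mine rules (antecedent, consequent, support, confidence).

    support    = number of commits that changed both files
    confidence = support / number of commits that changed the antecedent
    """
    file_counts = Counter()
    pair_counts = Counter()
    for changed in commits:
        files = set(changed)
        file_counts.update(files)
        # Ordered pairs, so both A -> B and B -> A are counted.
        pair_counts.update(permutations(sorted(files), 2))

    rules = []
    for (a, b), support in pair_counts.items():
        confidence = support / file_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


def recommend(rules, changed_files):
    """Rank files not yet changed by the best confidence of any matching rule."""
    changed = set(changed_files)
    scores = {}
    for a, b, _support, confidence in rules:
        if a in changed and b not in changed:
            scores[b] = max(scores.get(b, 0.0), confidence)
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical change history: each inner list is the set of files in one commit.
history = [
    ["parser.py", "lexer.py"],
    ["parser.py", "lexer.py", "README.md"],
    ["parser.py", "ast.py"],
    ["lexer.py", "parser.py"],
    ["ast.py"],
]

rules = mine_cochange_rules(history)
impact = recommend(rules, ["lexer.py"])
print(impact)
```

In this invented history, every commit that touched `lexer.py` also touched `parser.py`, so a developer editing `lexer.py` would be pointed to `parser.py`. Because the rules come from the change history alone, the same approach applies regardless of the programming languages involved.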