Data-Driven API Misuse Detection

Develop and evaluate data-driven techniques and prototypes for detecting API misuse based on deviations from frequent usage patterns in large corpora of source code.
Master

Software systems interact with third-party libraries through Application Programming Interfaces (APIs). Correct use of such APIs often requires the developer to follow specific rules or usage patterns, also known as API specifications. Unfortunately, these specifications are frequently documented poorly by the API developers (inaccurately or incompletely), or not documented at all. Incorrect usage of APIs can lead to both software security and software resilience issues, threatening the reliable functioning of a system.

Data-driven software engineering aims to use the wealth of data produced during software development and operation to support the development, maintenance, and evolution of software. Concretely, we apply machine learning and data mining techniques to software engineering data (such as source code, versioning histories, issue tracking, build & test logs, operational data) to derive actionable insights.

A recent area of research aims to build on the statistical patterns that can be found in large corpora of source code, such as GitHub, to drive new software development tools and program analyses. The underlying assumptions are that these vast amounts of code contain implicitly embedded knowledge on how good code should be written, and that this knowledge can be uncovered through machine learning and data mining.

The goal of this project is to investigate how, and to what extent, deviations or anomalies with respect to frequent API usage patterns learned from large corpora of source code can be used to detect API misuse.
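To make the idea of "deviation from a frequent pattern" concrete, here is a minimal sketch in Python. It is not the method this project will use, only a deliberately simplified illustration: it assumes each API usage has already been reduced to a flat list of call names, mines rules of the form "usages that call a usually also call b", and flags usages that break such a rule. The function names (mine_rules, find_violations), the thresholds, and the toy corpus are all hypothetical. Actual detectors typically work on richer representations that also capture call order and data flow.

```python
from collections import Counter
from itertools import permutations


def mine_rules(usages, min_support=3, min_confidence=0.8):
    """Mine rules (a, b): usages that call `a` usually also call `b`.

    Hypothetical helper for illustration; `usages` is a list of
    lists of API call names, one inner list per usage site.
    """
    single = Counter()  # number of usages containing each call
    pair = Counter()    # number of usages containing each ordered pair
    for calls in usages:
        unique = set(calls)
        single.update(unique)
        pair.update(permutations(sorted(unique), 2))

    rules = {}
    for (a, b), support in pair.items():
        confidence = support / single[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def find_violations(calls, rules):
    """Return the rules (a, b) where `a` is called but `b` is missing."""
    present = set(calls)
    return sorted(
        (a, b) for (a, b) in rules if a in present and b not in present
    )


# Toy corpus (made up): most usages close what they open.
corpus = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "read", "write", "close"],
    ["open", "read"],  # deviates from the frequent pattern
]

rules = mine_rules(corpus)
for usage in corpus:
    for a, b in find_violations(usage, rules):
        print(f"{usage}: calls {a!r} without {b!r}")
```

On this toy corpus only the last usage is reported, because it calls open without close. The sketch also hints at the research questions: the thresholds trade missed misuses against false alarms, and an infrequent but correct usage would be flagged just like a real misuse.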

Note that this project is related to the one on Data-Driven API Specification Mining. I could see two students collaborating on the initial part of the research and then finishing in two different directions, e.g., API documentation generation and API misuse detection. If this is something that interests you, do not wait too long to come and talk to me, as we will also have to check what the university's rules are w.r.t. such collaborations.

Learning outcome

  • application of data science in a software engineering context
  • proficiency with implementing and evaluating data-driven software engineering techniques and prototypes
  • appreciation of the state of the art in machine learning on source code
  • experience with working in an exciting and active research environment
  • excellent opportunities to publish your research results in the form of a scientific publication

Qualifications

  • interested in software engineering
  • interested in machine learning, in particular machine learning on source code and anomaly detection
  • preferably knowledge of Python, R, and LaTeX

Supervisors

Leon Moonen

Contact person