Data-Driven API Specification Mining

Develop and evaluate data-driven techniques and prototypes for mining API usage specifications from large corpora of source code.

Software systems interact with third-party libraries through Application Programming Interfaces (APIs). Correct use of such APIs often requires the developer to follow specific rules or usage patterns, also known as API specifications. Unfortunately, these specifications are frequently poorly documented by the API developers (inaccurate or incomplete), or not documented at all. Incorrect usage of APIs can lead to both software security and software resilience issues, threatening the reliable functioning of a system.

Data-driven software engineering aims to use the wealth of data produced during software development and operation to support the development, maintenance, and evolution of software systems. Concretely, we apply machine learning and data mining techniques to software engineering data (such as source code, versioning histories, issue tracking, build & test logs, operational data) to derive actionable insights.

A recent area of research aims to build on the statistical patterns that can be found in large corpora of source code, such as GitHub, to drive new software development tools and program analyses. The underlying assumptions are that these vast amounts of code contain implicitly embedded knowledge on how good code should be written, and that this knowledge can be uncovered through machine learning and data mining.

The goal of this project is to investigate to what extent frequent patterns in API usage learned from large corpora of source code can be used for mining API specifications. Such specifications could then be used as a form of automatically generated documentation, or as input for an analysis tool that checks if a piece of software adheres to the mined specification.
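To make the idea concrete, the following is a minimal sketch of one possible approach, not the method the project prescribes. It assumes each program in the corpus has already been reduced to a sequence of API call names, and it mines ordering rules of the form "a call to a is later followed by a call to b" using simple support and confidence thresholds. The function names (mine_ordered_pairs, find_violations), the thresholds, and the toy corpus are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def mine_ordered_pairs(sequences, min_support=3, min_confidence=0.75):
    """Mine candidate rules 'a is later followed by b' from call sequences.

    Returns a dict mapping (a, b) to the confidence of the rule, i.e. the
    fraction of sequences containing a in which some a precedes some b.
    """
    contains = Counter()   # number of sequences that contain call a
    followed = Counter()   # number of sequences where a occurs before b
    for seq in sequences:
        contains.update(set(seq))
        pairs = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b:
                    pairs.add((a, b))
        followed.update(pairs)

    rules = {}
    for (a, b), support in followed.items():
        confidence = support / contains[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def find_violations(seq, rules):
    """Return the mined rules (a, b) that the given call sequence breaks."""
    violations = []
    for (a, b) in rules:
        if a in seq and b not in seq[seq.index(a) + 1:]:
            violations.append((a, b))
    return violations


# Toy corpus (an assumption for illustration): one call sequence per usage site.
corpus = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "flush", "close"],
    ["open", "read"],  # deviates from the common pattern: never closed
]

rules = mine_ordered_pairs(corpus)
print(rules)                                     # {('open', 'close'): 0.8}
print(find_violations(["open", "read"], rules))  # [('open', 'close')]
```

The mined rule can serve both purposes named above: printed as-is it is a rudimentary piece of generated documentation, and passed to find_violations it acts as a simple checker. A realistic miner would need a much richer program representation (for example, data flow between calls) and more robust statistics than these two thresholds.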

Learning outcome

  • application of data science in a software engineering context
  • proficiency in implementing and evaluating data-driven software engineering techniques and prototypes
  • appreciation of the state of the art in machine learning on source code
  • experience with working in an exciting and active research environment
  • excellent opportunities to publish your research results in the form of a scientific publication

Qualifications


  • interested in software engineering
  • interested in machine learning, in particular machine learning on source code
  • preferably knowledge of Python, R, and LaTeX


  • Leon Moonen

Contact person