Exploring the Use of Synthetic Data for Deep Learning in Sparse Data Domains

This topic explores the potential of using synthetic data to train machine learning algorithms in fields with few open annotated datasets, such as medicine.

Deep neural networks require large amounts of labeled data to accurately learn the distribution of a given domain. Not all fields have large open datasets available for public use. This is especially true in medicine, where such datasets are few and far between. This project explores whether generating synthetic datasets can address the problem of having little or no data available for a given domain.


The goal of this project is to build deep neural networks trained on synthetic data and see how well they generalize to real-world data. The trained model should be able to perform simple classification or regression tasks and will be evaluated on a real-world dataset. We have several datasets from the medical domain, including human semen samples and colonoscopies, which can be used as a starting point.
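The evaluation setup described above (train on synthetic data, test on real data) can be sketched in a few lines of Python. This is only an illustrative sketch under loud assumptions: a small logistic regression stands in for a deep neural network, `make_dataset` is a hypothetical toy generator, and the "real" test set is itself simulated with a slightly different distribution to mimic the gap between synthetic and real data. None of these names or numbers come from the project's actual datasets.

```python
import math
import random


def make_dataset(n, shift, rng):
    """Hypothetical toy generator: n labelled 2-D points.

    Class 1 is centred at (shift, shift), class 0 at (-shift, -shift).
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = shift if label == 1 else -shift
        data.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return data


def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=50, lr=0.1):
    """Fit the model with plain stochastic gradient descent on log loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            err = predict_proba(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of examples whose thresholded prediction matches the label."""
    correct = sum(
        (predict_proba(weights, bias, x) >= 0.5) == (y == 1) for x, y in data
    )
    return correct / len(data)


rng = random.Random(0)
# Assumption: the synthetic generator is slightly "cleaner" than reality.
synthetic_train = make_dataset(500, shift=1.5, rng=rng)
# Assumption: simulated stand-in for a held-out real-world dataset.
real_test = make_dataset(200, shift=1.2, rng=rng)

weights, bias = train(synthetic_train)
synthetic_acc = accuracy(weights, bias, synthetic_train)
real_acc = accuracy(weights, bias, real_test)
print(f"accuracy on synthetic training data: {synthetic_acc:.2f}")
print(f"accuracy on real stand-in test data: {real_acc:.2f}")
```

The quantity of interest is the difference between the two accuracies: a model that scores well on synthetic data but poorly on the real test set has not generalized, which is the question the project sets out to study.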

Learning outcome

The student will learn how to train and evaluate deep neural networks using state-of-the-art techniques. Furthermore, the student will learn how to make better use of small datasets and how to maximize the utility of any given dataset.


Qualifications

  • Basic Python programming skills
  • Some knowledge of machine learning and statistics is an advantage


Supervisors

  • Pål Halvorsen
  • Michael Riegler
  • Steven Hicks