Generation of synthetic healthcare data using Deep Neural Networks

Applying Deep Neural Networks for the generation of synthetic health data that protects privacy and promotes healthcare research
Master

Machine learning (ML) methods such as deep learning (DL) have the potential to improve healthcare and enhance medical knowledge by giving clinicians and practitioners new and useful insights that they would otherwise not be able to discover. A major obstacle to harnessing this potential is the unavailability of high-quality data. National and EU regulations, as well as legitimate privacy concerns, hamper the direct use of electronic health records and clinical trial results for research and exploratory analysis. Data anonymization is usually applied to overcome this problem and preserve data privacy. However, anonymization strategies often rely on techniques that remove a substantial share of the records and/or features, resulting in an overall loss of potentially valuable information. Furthermore, clinical data suffer from flaws such as small dataset sizes, class imbalance, and noise, all of which complicate and slow down the implementation and testing of ML models.

Synthetic data generation (SDG) is a promising solution to this conundrum. Broadly speaking, the term synthetic data refers to artificial data that exhibit the same statistical properties and underlying structure as an original dataset. Reliable synthetic data could help ML practitioners in the healthcare sector to quickly test data analytics and ML strategies while preserving data privacy, without relying on methods that discard useful data. Moreover, synthetic data, stripped of sensitive and private information, can be openly published and shared, thus facilitating a virtuous circle of open research and dissemination.
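As a rough illustration of what "the same statistical properties" means in practice, the sketch below fits a two-dimensional Gaussian to a toy dataset and samples new records from it. This is a deliberately simple stand-in and not the deep generative approach proposed in this project. The feature values, the helper names (fit_gaussian_2d, sample_synthetic) and the toy "real" data are all hypothetical, and a model of this kind gives no formal privacy guarantee.

```python
import math
import random
import statistics


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n artificial records from the fitted bivariate Gaussian."""
    mx, my, sx, sy, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # induces correlation rho
        records.append((x, y))
    return records


rng = random.Random(0)

# Hypothetical "real" dataset: two correlated numeric features (toy data only).
real = []
for _ in range(2000):
    a = rng.gauss(5.0, 1.0)
    b = 0.5 * a + rng.gauss(0.0, 0.5)
    real.append((a, b))

params_real = fit_gaussian_2d(real)
synthetic = sample_synthetic(params_real, 2000, rng)
params_synth = fit_gaussian_2d(synthetic)

# Compare summary statistics of the real and the synthetic records.
for name, r, s in zip(("mean_a", "mean_b", "std_a", "std_b", "corr"),
                      params_real, params_synth):
    print(f"{name}: real={r:.3f} synthetic={s:.3f}")
```

No synthetic record is a copy of a real one, yet the means, spreads and correlation are close to those of the toy dataset. Real patient records contain categorical, skewed and strongly interdependent variables that a single Gaussian cannot capture, which is one reason to consider more expressive generative models.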

Goal

The focus of this master's thesis project is the selection and implementation of ML algorithms for the generation of high-quality synthetic patient records based on real patient records collected and stored by Fürst Medisinsk Laboratorium.
Creating such data is a challenging and fascinating scientific problem, which can be tackled with generative DL methods such as variational autoencoders (VAEs) or generative adversarial networks (GANs). We propose to explore a framework based on GANs, which have been successfully used in other contexts, including the synthetic generation of images (notably DeepSynthBody).

Learning outcome

The student will:

  • Gain insight into advanced techniques of deep learning
  • Work on a real-world application with real health data
  • Contribute to research that has the potential to move forward the field of healthcare
  • Have the opportunity to publish scientific papers

Qualifications

Basic programming knowledge. Some knowledge of statistics and experience with machine learning are an advantage.

Supervisors

  • Michael Riegler
  • Inga Strümke
  • Celestino Creatore (Fürst)
  • Hanne-Torill Mevik (Fürst)

Collaboration partners

Fürst Medisinsk Laboratorium

Contact person