Synthetic Medical Tabular Data Generation using Deep Generative Models
In this project, students will focus on the generation of synthetic tabular medical data using deep learning methods. They will design and evaluate models such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Diffusion Models to create data resembling structured patient records (e.g., demographics, lab test values, and diagnoses).
This project is derived from the larger European Union project SEARCH (https://ihi-search.eu/), which focuses on developing synthetic data and AI tools for healthcare. Within SEARCH, one major goal is to generate realistic and privacy-preserving synthetic biomedical data to support research and innovation without compromising patient confidentiality.
Goals / learning outcomes
Students will:
- Select or simulate a representative medical tabular dataset.
- Implement at least two generative approaches for tabular data generation.
- Apply and compare metrics for realism, utility, and privacy (e.g., Maximum Mean Discrepancy, JS Divergence, Distance to Closest Record).
- Explore basic interpretability methods (e.g., latent space analysis, feature importance).
- Reflect on the ethical implications of synthetic data generation based on provided frameworks and discussions with supervisors.
Students will gain practical experience with cutting-edge generative models and evaluation techniques in a real-world context tied to an EU-funded project. The project bridges mathematical modeling, data science, and healthcare innovation, providing a strong foundation for future work in AI, privacy, and biomedical research.
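To make the evaluation metrics named in the goals more concrete, here is a minimal sketch of Maximum Mean Discrepancy, JS divergence, and Distance to Closest Record in plain NumPy. It is an illustration under stated assumptions, not the evaluation protocol used in SEARCH: the function names, the RBF kernel with bandwidth `gamma = 1 / n_features`, the 20-bin histogram, and the Euclidean distance are all choices made for this example. It also assumes that all features are numeric and already scaled; categorical columns would need to be encoded first.

```python
import numpy as np


def rbf_mmd2(x, y, gamma=None):
    """Biased estimate of the squared MMD between samples x and y (RBF kernel).

    x, y: arrays of shape (n_samples, n_features).
    gamma: kernel bandwidth; defaults to 1 / n_features (an assumption).
    """
    if gamma is None:
        gamma = 1.0 / x.shape[1]

    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at 0 for stability.
        sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * np.maximum(sq, 0.0))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


def js_divergence(real_col, synth_col, bins=20):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) for one numeric column.

    Both samples are binned on a shared grid before comparison.
    """
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    edges = np.linspace(lo, hi, bins + 1)
    p = np.histogram(real_col, bins=edges)[0].astype(float)
    q = np.histogram(synth_col, bins=edges)[0].astype(float)
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distance_to_closest_record(synth, real):
    """Euclidean distance from each synthetic row to its nearest real row.

    Uses O(n_synth * n_real * n_features) memory, so it only suits small tables.
    """
    diffs = synth[:, None, :] - real[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)


if __name__ == "__main__":
    # Toy stand-ins for real and synthetic tables (purely simulated numbers).
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 3))
    synthetic = rng.normal(loc=0.5, size=(200, 3))

    print("MMD^2:", rbf_mmd2(real, synthetic))
    print("JS divergence (column 0):", js_divergence(real[:, 0], synthetic[:, 0]))
    print("Median DCR:", np.median(distance_to_closest_record(synthetic, real)))
```

Lower MMD and JS divergence suggest that the synthetic data is closer to the real distribution, while very small DCR values can indicate that synthetic rows are near-copies of real records, which is a privacy concern. In practice the three numbers are read together, since a model that simply memorizes the training data scores well on realism and poorly on privacy.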
Qualifications
- Fundamentals of machine learning and Python programming
- Basic understanding of deep learning models and neural networks
- Familiarity with tabular data analysis and statistics
Public or simulated datasets will be used for this project. Students are expected to follow good reproducibility and documentation practices, in line with the open science principles promoted by the SEARCH project.
Supervisors
- Vajira Thambawita
- Molly Maleckar
- Pål Halvorsen
Collaboration partners
This project is part of SEARCH (Synthetic hEalthcare dAta goveRnanCe Hub), a multi-disciplinary initiative focused on creating synthetic healthcare data and facilitating secure data sharing across the biomedical ecosystem. Read more about SEARCH at https://ihi-search.eu/.