Medical Data in the Wild

Medical data is essential for developing AI systems that can improve future health care. However, such data is often hard to obtain because of legal and hospital restrictions. The idea behind this thesis is to tap into the waste data shared on the internet and investigate whether there are sources from which medical data can be obtained.

The task is to extract medical images and videos from social media, including but not limited to Facebook and Twitter. The underlying research questions are: (i) Can we use extracted images and figure references to generate a labeled library of medical images? (ii) Can we use this labeled library to train an image classification or captioning model? And, if licensing allows, (iii) Can we extract images from other internet sources and textbooks?
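As an illustration of research question (i), the sketch below shows one simple way labels might be derived from the caption that accompanies an image. It is only a keyword-matching baseline under stated assumptions: the `LABEL_KEYWORDS` vocabulary and the `extract_labels` function are hypothetical names invented for this example, and a real project would need a curated medical vocabulary and most likely a stronger text model.

```python
import re

# Hypothetical keyword vocabulary (an assumption for this sketch, not a
# validated medical ontology). Each category maps a label to the caption
# keywords that should trigger it.
LABEL_KEYWORDS = {
    "modality": {
        "colonoscopy": ["colonoscopy"],
        "endoscopy": ["endoscopy", "endoscopic"],
        "ct": ["ct scan", "computed tomography"],
    },
    "anatomy": {
        "colon": ["colon", "colonic"],
        "stomach": ["stomach", "gastric"],
        "esophagus": ["esophagus", "esophageal"],
    },
    "phenotype": {
        "polyp": ["polyp", "polyps"],
        "ulcer": ["ulcer", "ulcers"],
    },
}


def extract_labels(caption):
    """Return a dict mapping each category to the sorted labels found in the caption."""
    text = caption.lower()
    labels = {}
    for category, terms in LABEL_KEYWORDS.items():
        found = []
        for label, keywords in terms.items():
            # Word boundaries keep "colon" from matching inside "colonoscopy".
            if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
                found.append(label)
        labels[category] = sorted(found)
    return labels


example = extract_labels("Colonoscopy showing a large polyp in the sigmoid colon")
```

Labels produced this way would still have to be checked against expert annotations before they are used to train a model.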


Goal

Collect a dataset of medical images or videos, starting with the specific case of gastroenterology, together with their figures, captions, and references, with extracted labels for imaging modality, anatomy, and phenotype, and evaluate the labels for accuracy

Build a classification model and test it against verified medical datasets from hospitals

Publish the obtained dataset and a related research article
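Testing a classifier against verified hospital data, as described above, could be scored along the following lines. This is a minimal sketch, assuming the model's predictions and the verified labels are available as two equally long lists of class names. The `evaluate` function and the example class names are hypothetical and do not come from any existing dataset.

```python
from collections import Counter


def evaluate(predicted, verified):
    """Return overall accuracy and per-class recall against verified labels."""
    if len(predicted) != len(verified):
        raise ValueError("predicted and verified must have the same length")
    if not verified:
        raise ValueError("at least one sample is required")

    correct = Counter()  # correctly predicted samples per verified class
    total = Counter()    # all samples per verified class
    for pred, true in zip(predicted, verified):
        total[true] += 1
        if pred == true:
            correct[true] += 1

    accuracy = sum(correct.values()) / len(verified)
    recall = {label: correct[label] / total[label] for label in total}
    return accuracy, recall


# Hypothetical predictions and verified labels for four images.
accuracy, recall = evaluate(
    ["polyp", "polyp", "ulcer", "normal"],
    ["polyp", "ulcer", "ulcer", "normal"],
)
```

Reporting per-class recall alongside accuracy helps when some findings are much rarer than others, which is common in medical datasets.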

Learning outcome

  • Deep understanding of medical data and deep learning
  • Working on a real-world application
  • Possibility of collaboration with researchers
  • Possibility to implement and research a novel approach
  • Opportunity to participate in challenges and conferences


Qualifications

  • Experience with Python programming
  • Understanding of machine learning
  • Experience with TensorFlow 2.0


Supervisors

  • Pål Halvorsen
  • Michael Riegler
  • Steven Hicks
  • Sravanthi Parasa, MD

Collaboration partners

  • Simula Metropolitan Center For Digital Engineering AS
  • Swedish Medical Group