Imputing Missing Data from Methylation Data for Cell Deconvolution Methods
DNA methylation is a major epigenetic modification of the human genome that affects fundamental biological functions, such as gene expression and cell development. In simple terms, this modification can be likened to a set of traffic lights. Whole Genome Bisulfite Sequencing (WGBS) is a technology used to investigate DNA methylation patterns at the finest granularity. The human genome contains approximately 28 million sites that can potentially be measured. The methylation level at each site ranges from 0 to 1, and these values follow a bimodal distribution.
However, real data from this technology are often sparse, which leads to numerous missing values when data from multiple samples are aggregated. Moreover, each measurement carries uncertainty that depends on the number of sequencing reads covering the site. This makes it challenging to apply established algorithms such as deconvolution analysis, which predicts cell populations. In this work, you will bridge state-of-the-art technologies with algorithms previously developed for array assays.
For your thesis, you will use publicly available real data and work on a high-performance computer.
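To make the sparsity and coverage issues described above concrete, here is a minimal Python sketch. It assumes each sample provides per-site counts of methylated and total reads. The function names, the `min_coverage` threshold of 5 reads, and the toy counts are all illustrative assumptions, not part of any specific pipeline.

```python
import numpy as np

def beta_matrix(meth_counts, total_counts, min_coverage=5):
    """Return a sites x samples matrix of methylation levels in [0, 1].

    Entries covered by fewer than `min_coverage` reads are set to NaN,
    i.e. treated as missing. The threshold is an illustrative assumption.
    """
    meth = np.asarray(meth_counts, dtype=float)
    total = np.asarray(total_counts, dtype=float)
    beta = np.full(total.shape, np.nan)
    covered = total >= min_coverage
    beta[covered] = meth[covered] / total[covered]
    return beta

def binomial_se(beta, total_counts):
    """Approximate per-entry standard error, sqrt(b * (1 - b) / n).

    A simple binomial approximation: fewer reads give a larger error.
    """
    total = np.asarray(total_counts, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.sqrt(beta * (1.0 - beta) / total)

# Toy counts for 4 sites (rows) x 3 samples (columns); not real data.
meth = np.array([[9, 0, 1], [0, 12, 0], [20, 18, 2], [3, 0, 7]])
total = np.array([[10, 0, 2], [8, 12, 1], [20, 20, 4], [6, 0, 7]])

beta = beta_matrix(meth, total, min_coverage=5)
se = binomial_se(beta, total)

# Fraction of missing entries after aggregating the samples.
missing_fraction = np.isnan(beta).mean()
# Number of sites observed in every sample.
complete_sites = int((~np.isnan(beta)).all(axis=1).sum())
print(beta)
print(missing_fraction, complete_sites)
```

In this toy matrix no site is observed in all three samples, which illustrates why methods that require complete data struggle with aggregated sequencing data.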
Goal
Compare state-of-the-art algorithms for handling missing data in genetic data.
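One common way to set up such a comparison is to hide a fraction of the observed values, impute them, and measure the error on the hidden entries. The sketch below assumes this mask-and-evaluate design and uses two deliberately simple baselines (site-mean and sample-mean imputation) as stand-ins for the state-of-the-art methods. The synthetic bimodal data, the function names, and the 20% masking rate are all illustrative assumptions.

```python
import numpy as np

def mask_entries(beta, frac, rng):
    """Hide a random fraction of the observed entries.

    Returns the masked copy and a boolean array marking the hidden entries.
    """
    observed = ~np.isnan(beta)
    hidden = observed & (rng.random(beta.shape) < frac)
    masked = beta.copy()
    masked[hidden] = np.nan
    return masked, hidden

def impute_site_mean(beta):
    """Baseline: fill each missing entry with the mean of its site (row)."""
    filled = beta.copy()
    counts = (~np.isnan(beta)).sum(axis=1)
    sums = np.nansum(beta, axis=1)
    # Sites with no observed value fall back to the global mean.
    global_mean = np.nanmean(beta)
    row_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = row_means[rows]
    return filled

def impute_sample_mean(beta):
    """Baseline: fill each missing entry with the mean of its sample (column)."""
    return impute_site_mean(beta.T).T

def rmse(truth, imputed, hidden):
    """Root mean squared error, computed on the hidden entries only."""
    diff = truth[hidden] - imputed[hidden]
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic bimodal data (an assumption for illustration, not real data):
# each site is mostly unmethylated or mostly methylated across all samples.
rng = np.random.default_rng(0)
n_sites, n_samples = 200, 10
site_level = rng.choice([0.05, 0.95], size=n_sites)
noise = rng.normal(0.0, 0.03, size=(n_sites, n_samples))
truth = np.clip(site_level[:, None] + noise, 0.0, 1.0)

masked, hidden = mask_entries(truth, frac=0.2, rng=rng)
methods = {"site_mean": impute_site_mean, "sample_mean": impute_sample_mean}
scores = {name: rmse(truth, fn(masked), hidden) for name, fn in methods.items()}
print(scores)
```

Any other imputation method can be added to the `methods` dictionary as long as it takes a sites x samples matrix with NaN entries and returns a filled matrix of the same shape. On real data the hidden entries would be drawn from observed, well-covered sites.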
Learning outcomes
- Handle missing data in genetic data using Python/R
- Scientific writing/research skills
Qualifications
No prior knowledge of biology or medicine is necessary. However, a strong background in mathematics and programming (R or Python) is essential for progressing with the tasks of this thesis. Additionally, you will use GitHub to track your progress and goals. We are looking for a creative and goal-oriented individual who finds missing data fascinating.
Supervisors
- Thu Thi Nguyen
Collaboration partners
- Marcin Wojewodzic - Norwegian Cancer Registry
References
- Cell-type deconvolution from DNA methylation: a review of recent applications
- Analyze Illumina Infinium DNA methylation arrays
- Whole-Genome Bisulfite Sequencing Data Standards and Processing Pipeline
- https://github.com/ben-laufer/DMRichR
- https://github.com/stephaniehicks/methylCC
- https://github.com/Sun-lab/dMeth