Identification of shared and unshared patterns across data sets in data fusion

The project will focus on developing coupled factorization-based approaches that can identify shared and unshared patterns across data sets from multiple sources, and on applying the developed methods in a real metabolomics application.
Master

An effective way of jointly analyzing data from multiple sources is to formulate data fusion as a joint factorization problem. However, an unresolved challenge in data fusion is to capture physically meaningful patterns uniquely (that is, up to trivial indeterminacies such as scaling and permutation) and to identify those patterns as shared or unshared across multiple data sets [1]. For instance, when jointly analyzing metabolomics and genomics data, a biomarker with manifestations in both data sets will be a shared factor/pattern, whereas a biomarker that appears in only one of the data sets will be revealed by an unshared pattern. Several coupled matrix factorization-based approaches already exist. However, their uniqueness relies on strong constraints on the factors, such as orthogonality or statistical independence, which are often not physically meaningful. When those constraints do not match the underlying factors, such methods fail to capture the true patterns [2].
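As an illustration of the joint factorization idea and of the uniqueness problem described above, here is a minimal sketch of an unconstrained coupled matrix factorization fitted by alternating least squares. It is not the method proposed in this project and not the CMTF Toolbox implementation. The function name `coupled_mf`, the matrix names `X`, `Y`, `A`, `B`, `C`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def coupled_mf(X, Y, rank, n_iter=50, seed=0):
    """Fit X ~ A @ B.T and Y ~ A @ C.T with a shared factor A
    using plain alternating least squares (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[1], rank))
    C = rng.standard_normal((Y.shape[1], rank))
    Z = np.hstack([X, Y])  # both data sets share the row mode
    for _ in range(n_iter):
        # Update the shared factor A from both data sets at once.
        BC = np.vstack([B, C])
        A = np.linalg.lstsq(BC, Z.T, rcond=None)[0].T
        # Update the data-set-specific factors B and C separately.
        B = np.linalg.lstsq(A, X, rcond=None)[0].T
        C = np.linalg.lstsq(A, Y, rcond=None)[0].T
    return A, B, C

# Synthetic data (assumed for illustration): component 0 is shared,
# component 1 appears only in X, component 2 appears only in Y.
rng = np.random.default_rng(1)
A_true = rng.standard_normal((30, 3))
B_true = rng.standard_normal((20, 3))
C_true = rng.standard_normal((15, 3))
X = A_true @ np.diag([1.0, 1.0, 0.0]) @ B_true.T
Y = A_true @ np.diag([1.0, 0.0, 1.0]) @ C_true.T

A, B, C = coupled_mf(X, Y, rank=3)
err_x = np.linalg.norm(X - A @ B.T) / np.linalg.norm(X)
err_y = np.linalg.norm(Y - A @ C.T) / np.linalg.norm(Y)

# Any invertible Q gives another factorization with the same fit,
# so the columns of A, B, C are not uniquely determined.
Q = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)
A_rot = A @ Q
B_rot = B @ np.linalg.inv(Q).T
C_rot = C @ np.linalg.inv(Q).T
err_x_rot = np.linalg.norm(X - A_rot @ B_rot.T) / np.linalg.norm(X)
err_y_rot = np.linalg.norm(Y - A_rot @ C_rot.T) / np.linalg.norm(Y)
```

Both sets of factors fit the data equally well, which is why the unconstrained model alone cannot tell which recovered columns correspond to the shared and unshared components built into the synthetic data. Additional constraints or penalties are needed before the factors can be interpreted.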

Goal

The goal of the project is to develop a constrained coupled matrix factorization approach that can reveal interpretable patterns and identify shared/unshared factors. We will focus on constrained versions of the structure-revealing data fusion models currently available for coupled matrix and tensor factorizations in the CMTF Toolbox [3], and compare the developed methods with state-of-the-art methods, targeting a metabolomics application.
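For orientation, one way a structure-revealing coupled factorization of two matrices sharing their first mode can be written is sketched below. The notation is an assumption made for this illustration and is not taken from the project text: $X_1$ and $X_2$ are the data sets, $A$ is the factor matrix of the shared mode, $B$ and $C$ are the data-set-specific factor matrices, $\lambda$ and $\sigma$ hold the component weights in each data set, and $\beta$ is a sparsity penalty parameter.

```latex
\min_{A,B,C,\lambda,\sigma}\;
  \left\| X_1 - A\,\mathrm{diag}(\lambda)\,B^{\mathsf T} \right\|_F^2
  + \left\| X_2 - A\,\mathrm{diag}(\sigma)\,C^{\mathsf T} \right\|_F^2
  + \beta \left\| \lambda \right\|_1
  + \beta \left\| \sigma \right\|_1
\quad \text{s.t.} \quad
  \| a_r \|_2 = \| b_r \|_2 = \| c_r \|_2 = 1, \qquad r = 1, \dots, R
```

Under this sketch, component $r$ would be read as shared when both $\lambda_r$ and $\sigma_r$ are nonzero, and as unshared when one of the two weights is driven to zero by the penalty. Further constraints on $A$, $B$ and $C$ would be added on top of such a formulation.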

Learning outcome

The thesis will develop both algorithmic and data analysis skills. Students will also be introduced to interdisciplinary research.

Qualifications

Linear algebra and programming skills; a BS in Computer Science, Applied Mathematics, or Statistics; excellent oral and written English skills.

Supervisors

Evrim Acar Ataman

Collaboration partners

Memorial Sloan Kettering Cancer Center

References

  1. A. K. Smilde, I. Mage, T. Næs, T. Hankemeier, M. A. Lips, H. A. L. Kiers, E. Acar and R. Bro. Common and Distinct Components in Data Fusion. Journal of Chemometrics, 31: e2900, 2017.
  2. E. Acar, R. Bro and A. K. Smilde. Data Fusion in Metabolomics using Coupled Matrix and Tensor Factorizations. Proceedings of the IEEE, 103: 1602-1620, 2015.
  3. MATLAB CMTF Toolbox, www.models.life.ku.dk/joda/CMTF_Toolbox

Contact person