Identification of shared and unshared patterns across data sets in data fusion
An effective way of jointly analyzing data from multiple sources is to formulate data fusion as a joint factorization problem. However, an unresolved challenge in data fusion is to capture physically meaningful patterns and to identify those patterns as shared or unshared across multiple data sets. Capturing meaningful patterns requires the factorization to be unique, that is, determined up to trivial indeterminacies such as scaling and permutation. For instance, when jointly analyzing metabolomics and genomics data, a biomarker with manifestations in both data sets will correspond to a shared factor/pattern, whereas a biomarker that appears in only one of the data sets will be revealed by an unshared pattern. Several approaches based on coupled matrix factorization already exist. However, their uniqueness relies on strong constraints on the factors, such as orthogonality or statistical independence, which are often not physically meaningful. When those constraints do not match the underlying factors, such methods fail to capture the true patterns.
The goal of the project is to develop a constrained coupled matrix factorization approach that can reveal interpretable patterns and identify shared/unshared factors. We will focus on constrained versions of the structure-revealing data fusion models currently available for coupled matrix and tensor factorizations in the CMTF Toolbox, and compare the developed methods with state-of-the-art methods in a metabolomics application.
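To make the starting point concrete, the following is a minimal sketch of an unconstrained coupled matrix factorization in Python, in which two data matrices X and Y share the factor matrix A of their common mode (X ≈ A Bᵀ, Y ≈ A Cᵀ). It is not the algorithm of the CMTF Toolbox, and all names in it (coupled_mf_als, relative_error, the synthetic data) are hypothetical and chosen for illustration only. It assumes a plain alternating least squares scheme with no constraints or penalties on the factors.

```python
import numpy as np


def coupled_mf_als(X, Y, rank, n_iter=50, seed=0):
    """Fit X ~ A @ B.T and Y ~ A @ C.T with a shared factor matrix A.

    Illustrative alternating least squares; no constraints are imposed.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((X.shape[1], rank))
    C = rng.standard_normal((Y.shape[1], rank))
    XY = np.hstack([X, Y])  # both data sets share the row mode
    for _ in range(n_iter):
        # Update A: least squares fit of [X Y] ~ A @ [B; C].T
        Z = np.vstack([B, C])
        A = XY @ np.linalg.pinv(Z).T
        # Update B and C: least squares fits given the shared A
        A_pinv_T = np.linalg.pinv(A).T
        B = X.T @ A_pinv_T
        C = Y.T @ A_pinv_T
    return A, B, C


def relative_error(M, M_hat):
    """Relative Frobenius-norm error between M and its approximation."""
    return np.linalg.norm(M - M_hat) / np.linalg.norm(M)


# Synthetic example (assumed sizes): three components in X,
# of which the third is absent from Y, i.e. unshared.
rng = np.random.default_rng(1)
A_true = rng.standard_normal((30, 3))
B_true = rng.standard_normal((20, 3))
C_true = rng.standard_normal((15, 3))
C_true[:, 2] = 0.0
X = A_true @ B_true.T
Y = A_true @ C_true.T

A, B, C = coupled_mf_als(X, Y, rank=3)
err_X = relative_error(X, A @ B.T)
err_Y = relative_error(Y, A @ C.T)
print("relative error on X:", err_X)
print("relative error on Y:", err_Y)
```

Such a model can fit both data sets well, but its factors are only determined up to an invertible transformation, since A Bᵀ = (A T)(B T⁻ᵀ)ᵀ for any invertible T. The estimated columns therefore need not match the true patterns, and shared and unshared components cannot be read off directly. This is the gap that the constrained, structure-revealing formulations targeted by the project are meant to close.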
The thesis will develop both algorithmic and data analysis skills. Students will also be introduced to interdisciplinary research.
Linear algebra and programming skills; a BS in Computer Science, Applied Mathematics, or Statistics; excellent oral and written English skills.
Evrim Acar Ataman
Memorial Sloan Kettering Cancer Center
- A. K. Smilde, I. Mage, T. Næs, T. Hankemeier, M. A. Lips, H. A. L. Kiers, E. Acar and R. Bro. Common and Distinct Components in Data Fusion. Journal of Chemometrics, 31: e2900, 2017
- E. Acar, R. Bro and A. K. Smilde. Data Fusion in Metabolomics using Coupled Matrix and Tensor Factorizations. Proceedings of the IEEE, 103:1602-1620, 2015
- MATLAB CMTF Toolbox, www.models.life.ku.dk/joda/CMTF_Toolbox